From 58b1ab1b0aa04431c49b25e1751024ae2444218c Mon Sep 17 00:00:00 2001
From: Jeremy Singer-Vine
Date: Fri, 13 May 2022 16:04:03 -0400
Subject: [PATCH] Add (experimental) page.search(...) feature

First proposed here: https://github.com/jsvine/pdfplumber/issues/201

Adding this feature involved refactoring and re-engineering a good chunk
of the text-layout-extraction code. As part of that, this commit
introduces two new classes in utils.py: LayoutEngine and TextLayout.
They should be considered provisional, and may change name/approach in
the future.
---
 CHANGELOG.md        |   3 +-
 README.md           |   1 +
 pdfplumber/page.py  |  39 ++++++-
 pdfplumber/table.py |   6 +-
 pdfplumber/utils.py | 271 +++++++++++++++++++++++++++++++-------------
 tests/test_utils.py |  58 +++++++++-
 6 files changed, 285 insertions(+), 93 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 3deac509..7af829e2 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -8,6 +8,7 @@ All notable changes to this project will be documented in this file. The format
 
 - Add `"matrix"` property to `char` objects, representing the current transformation matrix.
 - Add `pdfplumber.ctm` submodule with class `CTM`, to calculate scale, skew, and translation of the current transformation matrix.
+- Add `page.search(...)`, an *experimental feature* that allows you to search a page's text via regular expressions and non-regex strings, returning the text, any regex matches, the bounding box coordinates, and the char objects themselves. ([#201](https://github.com/jsvine/pdfplumber/issues/201))
 
 ## [0.6.2] - 2022-05-06
 
@@ -28,8 +29,6 @@ All notable changes to this project will be documented in this file. The format
 - Remove `utils.filter_objects(...)` and move the functionality to within the `FilteredPage.objects` property calculation, the only part of the library that used it. ([9587cc7](https://github.com/jsvine/pdfplumber/commit/9587cc7d2292a1eae7a0150ab406f9365944266f))
 - Remove code that sets `pdfminer.pdftypes.STRICT = True` and `pdfminer.pdfinterp.STRICT = True`, since that [has now been the default for a while](https://github.com/pdfminer/pdfminer.six/commit/9439a3a31a347836aad1c1226168156125d9505f). ([9587cc7](https://github.com/jsvine/pdfplumber/commit/9587cc7d2292a1eae7a0150ab406f9365944266f))
 
-### Fixed
-
 ## [0.6.1] - 2022-04-23
 
 ### Changed
diff --git a/README.md b/README.md
index 1ecf45fd..7b810188 100644
--- a/README.md
+++ b/README.md
@@ -108,6 +108,7 @@ The `pdfplumber.Page` class is at the core of `pdfplumber`. Most things you'll d
 |`.dedupe_chars(tolerance=1)`| Returns a version of the page with duplicate chars — those sharing the same text, fontname, size, and positioning (within `tolerance` x/y) as other characters — removed. (See [Issue #71](https://github.com/jsvine/pdfplumber/issues/71) to understand the motivation.)|
 |`.extract_text(x_tolerance=3, y_tolerance=3, layout=False, x_density=7.25, y_density=13, **kwargs)`| Collates all of the page's character objects into a single string.|
 |`.extract_words(x_tolerance=3, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, horizontal_ltr=True, vertical_ttb=True, extra_attrs=[])`| Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where (for "upright" characters) the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` *and* where the difference between the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. A similar approach is taken for non-upright characters, but instead measuring the vertical, rather than horizontal, distances between them. The parameters `horizontal_ltr` and `vertical_ttb` indicate whether the words should be read from left-to-right (for horizontal words) / top-to-bottom (for vertical words). Changing `keep_blank_chars` to `True` will mean that blank characters are treated as part of a word, not as a space between words. Changing `use_text_flow` to `True` will use the PDF's underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) Passing a list of `extra_attrs` (e.g., `["fontname", "size"]`) will restrict each word to characters that share exactly the same value for each of those [attributes](https://github.com/jsvine/pdfplumber/blob/develop/README.md#char-properties), and the resulting word dicts will indicate those attributes.|
+|`.search(pattern, regex=True, case=True, **kwargs)`|*Experimental feature* that allows you to search a page's text, returning a list of all instances that match the query. For each instance, the result dictionary contains the matching text, any regex group matches, the bounding box coordinates, and the char objects themselves. `pattern` can be a compiled regular expression, an uncompiled regular expression, or a non-regex string. If `regex` is `False`, the pattern is treated as a non-regex string. If `case` is `False`, the search is performed in a case-insensitive manner. (See the usage sketch below this table.)|
 |`.extract_tables(table_settings)`| Extracts tabular data from the page. For more details see "[Extracting tables](#extracting-tables)" below.|
 |`.to_image(**conversion_kwargs)`| Returns an instance of the `PageImage` class. For more details, see "[Visual debugging](#visual-debugging)" below. For conversion_kwargs, see [here](http://docs.wand-py.org/en/latest/wand/image.html#wand.image.Image).|
 |`.close()`| By default, `Page` objects cache their layout and object information to avoid having to reprocess it. When parsing large PDFs, however, these cached properties can require a lot of memory. You can use this method to flush the cache and release the memory. (In version `<= 0.5.25`, use `.flush_cache()`.)|
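As a quick illustration of the row above, here is a minimal usage sketch; the file path is hypothetical, but the method signature and result keys come from this patch:

```python
import re

import pdfplumber

with pdfplumber.open("example.pdf") as pdf:  # hypothetical path
    page = pdf.pages[0]

    # A compiled pattern carries its own flags, so leave regex/case at their defaults
    results = page.search(re.compile(r"supreme\s+(\w+)", re.I))

    # A plain (non-regex) string, matched case-insensitively
    results = page.search("supreme court", regex=False, case=False)

    for r in results:
        # Each result maps the matched text back to its bounding box and chars
        print(r["text"], r["groups"], (r["x0"], r["top"], r["x1"], r["bottom"]))
```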
diff --git a/pdfplumber/page.py b/pdfplumber/page.py
index 9b932a61..4ef6e32a 100644
--- a/pdfplumber/page.py
+++ b/pdfplumber/page.py
@@ -1,5 +1,17 @@
 import re
-from typing import TYPE_CHECKING, Any, Callable, Dict, Generator, List, Optional, Tuple
+from functools import lru_cache
+from typing import (
+    TYPE_CHECKING,
+    Any,
+    Callable,
+    Dict,
+    Generator,
+    List,
+    Optional,
+    Pattern,
+    Tuple,
+    Union,
+)
 
 from pdfminer.converter import PDFPageAggregator
 from pdfminer.layout import (
@@ -287,10 +299,29 @@ def sorter(x: Table) -> Tuple[int, T_num, T_num]:
 
         return largest.extract(**extract_kwargs)
 
+    @lru_cache
+    def get_text_layout(self, **kwargs: Any) -> utils.TextLayout:
+        defaults = dict(x_shift=self.bbox[0], y_shift=self.bbox[1])
+        full_kwargs: Dict[str, Any] = {**defaults, **kwargs}
+        return utils.chars_to_layout(self.chars, **full_kwargs)
+
+    def search(
+        self,
+        pattern: Union[str, Pattern[str]],
+        regex: bool = True,
+        case: bool = True,
+        **kwargs: Any,
+    ) -> List[Dict[str, Any]]:
+        text_layout = self.get_text_layout(**kwargs)
+        return text_layout.search(pattern, regex=regex, case=case)
+
     def extract_text(self, **kwargs: Any) -> str:
-        return utils.extract_text(
-            self.chars, x_shift=self.bbox[0], y_shift=self.bbox[1], **kwargs
-        )
+        if kwargs.get("layout") is True:
+            del kwargs["layout"]
+            text_layout = self.get_text_layout(**kwargs)
+            return text_layout.to_string()
+        else:
+            return utils.extract_text(self.chars, **kwargs)
 
     def extract_words(self, **kwargs: Any) -> T_obj_list:
         return utils.extract_words(self.chars, **kwargs)
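With the refactor above, `extract_text(layout=True)` and `search(...)` now share one cached layout. A rough sketch of the equivalence, assuming an already-opened `page` (names are from this patch):

```python
layout = page.get_text_layout()  # memoized via @lru_cache

# extract_text(layout=True) is now equivalent to rendering that layout:
assert page.extract_text(layout=True) == layout.to_string()

# ...and search(...) queries the same cached layout, so repeated
# searches avoid re-running word extraction:
results = layout.search(r"\d+")
```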
""" # Find words that share the same left, right, or centerpoints - by_x0 = utils.cluster_objects(words, "x0", 1) - by_x1 = utils.cluster_objects(words, "x1", 1) + by_x0 = utils.cluster_objects(words, itemgetter("x0"), 1) + by_x1 = utils.cluster_objects(words, itemgetter("x1"), 1) def get_center(word: T_obj) -> T_num: return float(word["x0"] + word["x1"]) / 2 diff --git a/pdfplumber/utils.py b/pdfplumber/utils.py index 3e02fa49..121ce786 100644 --- a/pdfplumber/utils.py +++ b/pdfplumber/utils.py @@ -1,4 +1,5 @@ import itertools +import re from collections.abc import Sequence from operator import itemgetter from typing import ( @@ -9,7 +10,11 @@ Generator, Iterable, List, + Match, Optional, + Pattern, + Tuple, + TypeVar, Union, ) @@ -58,26 +63,19 @@ def make_cluster_dict(values: Iterable[T_num], tolerance: T_num) -> Dict[T_num, return dict(itertools.chain(*nested_tuples)) -def _itemgetter(attr: Union[str, Callable[[T_obj], T_num]]) -> Callable[[T_obj], T_num]: - if isinstance(attr, (str, tuple)): - return itemgetter(attr) - else: - return attr +R = TypeVar("R") def cluster_objects( - objs: T_obj_iter, attr: Union[str, Callable[[T_obj], T_num]], tolerance: T_num -) -> List[T_obj_list]: - getter = _itemgetter(attr) - objs_list = list(objs) - values = map(getter, objs_list) + xs: List[R], key_fn: Callable[[R], T_num], tolerance: T_num +) -> List[List[R]]: + + values = map(key_fn, xs) cluster_dict = make_cluster_dict(values, tolerance) get_0, get_1 = itemgetter(0), itemgetter(1) - cluster_tuples = sorted( - ((obj, cluster_dict.get(getter(obj))) for obj in objs_list), key=get_1 - ) + cluster_tuples = sorted(((x, cluster_dict.get(key_fn(x))) for x in xs), key=get_1) grouped = itertools.groupby(cluster_tuples, key=get_1) @@ -183,8 +181,12 @@ def dedupe_chars(chars: T_obj_list, tolerance: T_num = 1) -> T_obj_list: def yield_unique_chars(chars: T_obj_list) -> Generator[T_obj, None, None]: sorted_chars = sorted(chars, key=key) for grp, grp_chars in itertools.groupby(sorted_chars, key=key): - for y_cluster in cluster_objects(grp_chars, "doctop", tolerance): - for x_cluster in cluster_objects(y_cluster, "x0", tolerance): + for y_cluster in cluster_objects( + list(grp_chars), itemgetter("doctop"), tolerance + ): + for x_cluster in cluster_objects( + y_cluster, itemgetter("x0"), tolerance + ): yield sorted(x_cluster, key=pos_key)[0] deduped = yield_unique_chars(chars) @@ -331,13 +333,13 @@ def iter_sort_chars(self, chars: T_obj_iter) -> Generator[T_obj, None, None]: def upright_key(x: T_obj) -> int: return -int(x["upright"]) - for upright_cluster in cluster_objects(chars, upright_key, 0): + for upright_cluster in cluster_objects(list(chars), upright_key, 0): upright = upright_cluster[0]["upright"] cluster_key = "doctop" if upright else "x0" # Cluster by line subclusters = cluster_objects( - upright_cluster, cluster_key, self.y_tolerance + upright_cluster, itemgetter(cluster_key), self.y_tolerance ) for sc in subclusters: @@ -351,7 +353,9 @@ def upright_key(x: T_obj) -> int: else: yield from to_yield - def iter_extract(self, chars: T_obj_iter) -> Generator[T_obj, None, None]: + def iter_extract_tuples( + self, chars: T_obj_iter + ) -> Generator[Tuple[T_obj, T_obj_list], None, None]: if not self.use_text_flow: chars = self.iter_sort_chars(chars) @@ -360,10 +364,91 @@ def iter_extract(self, chars: T_obj_iter) -> Generator[T_obj, None, None]: for keyvals, char_group in grouped: for word_chars in self.iter_chars_to_words(char_group): - yield self.merge_chars(word_chars) + yield (self.merge_chars(word_chars), 
+                yield (self.merge_chars(word_chars), word_chars)
 
     def extract(self, chars: T_obj_list) -> T_obj_list:
-        return list(self.iter_extract(chars))
+        return list(word for word, word_chars in self.iter_extract_tuples(chars))
+
+
+class LayoutEngine:
+    def __init__(
+        self,
+        x_density: T_num = DEFAULT_X_DENSITY,
+        y_density: T_num = DEFAULT_Y_DENSITY,
+        x_shift: T_num = 0,
+        y_shift: T_num = 0,
+        y_tolerance: T_num = DEFAULT_Y_TOLERANCE,
+        presorted: bool = False,
+    ):
+        self.x_density = x_density
+        self.y_density = y_density
+        self.x_shift = x_shift
+        self.y_shift = y_shift
+        self.y_tolerance = y_tolerance
+        self.presorted = presorted
+
+    def calculate(
+        self, word_tuples: List[Tuple[T_obj, T_obj_list]]
+    ) -> List[Tuple[str, Optional[T_obj]]]:
+        """
+        Given a list of (word, chars) tuples, return a list of (char-text,
+        char) tuples that can be used to mimic the structural layout of the
+        text on the page(s), using the following approach:
+
+        - Sort the words by (doctop, x0) if not already sorted.
+
+        - Calculate the initial doctop for the starting page.
+
+        - Cluster the words by doctop (taking `y_tolerance` into account),
+          and iterate through them.
+
+        - For each cluster, calculate the distance between that doctop and
+          the initial doctop, in points, minus `y_shift`. Divide that
+          distance by `y_density` to calculate the minimum number of newlines
+          that should come before this cluster. Append that number of
+          newlines *minus* the number of newlines already appended, with a
+          minimum of one (zero before the first cluster).
+
+        - Then for each cluster, iterate through each word in it. Divide
+          each word's x0, minus `x_shift`, by `x_density` to calculate the
+          minimum number of characters that should come before this word.
+          Append that number of spaces *minus* the number of characters and
+          spaces already appended, with a minimum of one (zero at the start
+          of a line). Then append the word's text.
+
+        Note: This approach currently works best for horizontal,
+        left-to-right text, but will display all words regardless of
+        orientation. There is room for improvement in better supporting
+        right-to-left text, as well as vertical text.
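+
+        For example, with the default `y_density` of 13 and a `y_shift` of 0,
+        a cluster whose doctop sits 26 points below the initial doctop should
+        be preceded by round(26 / 13) = 2 newlines in total; if one newline
+        has already been appended, exactly one more is prepended here.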
+ """ + rendered: List[Tuple[str, Optional[T_obj]]] = [] + num_newlines = 0 + words_sorted = ( + word_tuples + if self.presorted + else sorted(word_tuples, key=lambda x: (x[0]["doctop"], x[0]["x0"])) + ) + first_word = words_sorted[0][0] + doctop_start = first_word["doctop"] - first_word["top"] + for ws in cluster_objects( + words_sorted, lambda x: float(x[0]["doctop"]), self.y_tolerance + ): + y_dist = ( + ws[0][0]["doctop"] - (doctop_start + self.y_shift) + ) / self.y_density + num_newlines_prepend = max( + min(1, num_newlines), round(y_dist) - num_newlines + ) + rendered += [("\n", None)] * num_newlines_prepend + num_newlines += num_newlines_prepend + + line_len = 0 + for word, chars in sorted(ws, key=lambda x: float(x[0]["x0"])): + x_dist = (word["x0"] - self.x_shift) / self.x_density + num_spaces_prepend = max(min(1, line_len), round(x_dist) - line_len) + rendered += [(" ", None)] * num_spaces_prepend + for c in chars: + rendered.append((c["text"], c)) + line_len += num_spaces_prepend + len(word["text"]) + return rendered def extract_words( @@ -387,58 +472,56 @@ def extract_words( ).extract(chars) -def words_to_layout( - words: T_obj_list, - x_density: T_num = DEFAULT_X_DENSITY, - y_density: T_num = DEFAULT_Y_DENSITY, - x_shift: T_num = 0, - y_shift: T_num = 0, - y_tolerance: T_num = DEFAULT_Y_TOLERANCE, - presorted: bool = False, -) -> str: - """ - Given a set of word objects generated by `extract_words(...)`, return a - string that mimics the structural layout of the text on the page(s), using - the following approach: - - - Sort the words by (doctop, x0) if not already sorted. - - - Calculate the initial doctop for the starting page. - - - Cluster the words by doctop (taking `y_tolerance` into account), and - iterate through them. - - - For each cluster, calculate the distance between that doctop and the - initial doctop, in points, minus `y_shift`. Divide that distance by - `y_density` to calculate the minimum number of newlines that should come - before this cluster. Append that number of newlines *minus* the number of - newlines already appended, with a minimum of one. +class TextLayout: + def __init__( + self, chars: T_obj_list, extractor: WordExtractor, engine: LayoutEngine + ): + self.chars = chars + self.extractor = extractor + self.engine = engine + self.word_tuples = list(extractor.iter_extract_tuples(chars)) + self.layout_tuples = engine.calculate(self.word_tuples) + self.as_string = "".join(map(itemgetter(0), self.layout_tuples)) + + def to_string(self) -> str: + return self.as_string + + def search( + self, pattern: Union[str, Pattern[str]], regex: bool = True, case: bool = True + ) -> List[Dict[str, Any]]: + def match_to_dict(m: Match[str]) -> Dict[str, Any]: + subset = self.layout_tuples[m.start() : m.end()] + chars = [c for (text, c) in subset if c is not None] + x0, top, x1, bottom = objects_to_bbox(chars) + return { + "text": m.group(0), + "groups": m.groups(), + "x0": x0, + "top": top, + "x1": x1, + "bottom": bottom, + "chars": chars, + } + + if isinstance(pattern, Pattern): + if regex is False: + raise ValueError( + "Cannot pass a compiled search pattern *and* regex=False together." + ) + if case is False: + raise ValueError( + "Cannot pass a compiled search pattern *and* case=False together." + ) + compiled = pattern + else: + if regex is False: + pattern = re.escape(pattern) - - Then for each cluster, iterate through each word in it. 
+class TextLayout:
+    def __init__(
+        self, chars: T_obj_list, extractor: WordExtractor, engine: LayoutEngine
+    ):
+        self.chars = chars
+        self.extractor = extractor
+        self.engine = engine
+        self.word_tuples = list(extractor.iter_extract_tuples(chars))
+        self.layout_tuples = engine.calculate(self.word_tuples)
+        self.as_string = "".join(map(itemgetter(0), self.layout_tuples))
+
+    def to_string(self) -> str:
+        return self.as_string
+
+    def search(
+        self, pattern: Union[str, Pattern[str]], regex: bool = True, case: bool = True
+    ) -> List[Dict[str, Any]]:
+        def match_to_dict(m: Match[str]) -> Dict[str, Any]:
+            subset = self.layout_tuples[m.start() : m.end()]
+            chars = [c for (text, c) in subset if c is not None]
+            x0, top, x1, bottom = objects_to_bbox(chars)
+            return {
+                "text": m.group(0),
+                "groups": m.groups(),
+                "x0": x0,
+                "top": top,
+                "x1": x1,
+                "bottom": bottom,
+                "chars": chars,
+            }
+
+        if isinstance(pattern, Pattern):
+            if regex is False:
+                raise ValueError(
+                    "Cannot pass a compiled search pattern *and* regex=False together."
+                )
+            if case is False:
+                raise ValueError(
+                    "Cannot pass a compiled search pattern *and* case=False together."
+                )
+            compiled = pattern
+        else:
+            if regex is False:
+                pattern = re.escape(pattern)
-
-    - Then for each cluster, iterate through each word in it. Divide each
-      word's x0, minus `x_shift`, by `x_density` to calculate the minimum
-      number of characters that should come before this cluster. Append that
-      number of spaces *minus* the number of characters and spaces already
-      appended, with a minimum of one. Then append the word's text.
+            flags = re.I if case is False else 0
+            compiled = re.compile(pattern, flags)
 
-    Note: This approach currently works best for horizontal, left-to-right
-    text, but will display all words regardless of orientation. There is room
-    for improvement in better supporting right-to-left text, as well as
-    vertical text.
-    """
-    rendered = ""
-    words_sorted = words if presorted else sorted(words, key=itemgetter("doctop", "x0"))
-    doctop_start = words_sorted[0]["doctop"] - words_sorted[0]["top"]
-    for ws in cluster_objects(words_sorted, "doctop", y_tolerance):
-        y_dist = (ws[0]["doctop"] - (doctop_start + y_shift)) / y_density
-        newlines = rendered.count("\n")
-        rendered += "\n" * max(min(1, newlines), round(y_dist) - newlines)
-        line = ""
-        for word in sorted(ws, key=itemgetter("x0")):
-            x_dist = (word["x0"] - x_shift) / x_density
-            line += " " * max(min(1, len(line)), round(x_dist) - len(line))
-            line += word["text"]
-        rendered += line
-    return rendered
+        gen = re.finditer(compiled, self.as_string)
+        return list(map(match_to_dict, gen))
 
 
 def collate_line(
@@ -454,6 +537,42 @@
     return coll
 
 
+def chars_to_layout(
+    chars: T_obj_list,
+    x_density: T_num = DEFAULT_X_DENSITY,
+    y_density: T_num = DEFAULT_Y_DENSITY,
+    x_shift: T_num = 0,
+    y_shift: T_num = 0,
+    x_tolerance: T_num = DEFAULT_X_TOLERANCE,
+    y_tolerance: T_num = DEFAULT_Y_TOLERANCE,
+    keep_blank_chars: bool = False,
+    use_text_flow: bool = False,
+    horizontal_ltr: bool = True,  # Should words be read left-to-right?
+    vertical_ttb: bool = True,  # Should vertical words be read top-to-bottom?
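+    # Restrict words to chars that share the same value for each of these
+    # attributes, e.g. extra_attrs=["fontname", "size"]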
+    extra_attrs: Optional[List[str]] = None,
+) -> TextLayout:
+    extractor = WordExtractor(
+        x_tolerance=x_tolerance,
+        y_tolerance=y_tolerance,
+        keep_blank_chars=keep_blank_chars,
+        use_text_flow=use_text_flow,
+        horizontal_ltr=horizontal_ltr,
+        vertical_ttb=vertical_ttb,
+        extra_attrs=extra_attrs,
+    )
+
+    engine = LayoutEngine(
+        x_density=x_density,
+        y_density=y_density,
+        x_shift=x_shift,
+        y_shift=y_shift,
+        y_tolerance=y_tolerance,
+        presorted=True,
+    )
+
+    return TextLayout(chars, extractor, engine)
+
+
 def extract_text(
     chars: T_obj_list,
     layout: bool = False,
@@ -474,7 +593,7 @@
         return ""
 
     if layout:
-        words = extract_words(
+        calculated_layout = chars_to_layout(
             chars,
             x_tolerance=x_tolerance,
             y_tolerance=y_tolerance,
@@ -483,19 +602,15 @@
             horizontal_ltr=horizontal_ltr,
             vertical_ttb=vertical_ttb,
             extra_attrs=extra_attrs,
-        )
-        return words_to_layout(
-            words,
             x_density=x_density,
             y_density=y_density,
             x_shift=x_shift,
             y_shift=y_shift,
-            y_tolerance=y_tolerance,
-            presorted=True,
         )
+        return calculated_layout.to_string()
     else:
-        doctop_clusters = cluster_objects(chars, "doctop", y_tolerance)
+        doctop_clusters = cluster_objects(chars, itemgetter("doctop"), y_tolerance)
 
         lines = (
             collate_line(line_chars, x_tolerance)
             for line_chars in doctop_clusters
@@ -604,7 +719,7 @@ def move_object(obj: T_obj, axis: str, value: T_num) -> T_obj:
 
 def snap_objects(objs: T_obj_list, attr: str, tolerance: T_num) -> T_obj_list:
     axis = {"x0": "h", "x1": "h", "top": "v", "bottom": "v"}[attr]
-    clusters = cluster_objects(objs, attr, tolerance)
+    clusters = cluster_objects(objs, itemgetter(attr), tolerance)
     avgs = [sum(map(itemgetter(attr), objs)) / len(objs) for objs in clusters]
     snapped_clusters = [
         [move_object(obj, axis, avg - obj[attr]) for obj in cluster]
diff --git a/tests/test_utils.py b/tests/test_utils.py
index 727f6fac..3058173e 100644
--- a/tests/test_utils.py
+++ b/tests/test_utils.py
@@ -1,6 +1,7 @@
 #!/usr/bin/env python
 import logging
 import os
+import re
 import unittest
 from itertools import groupby
 from operator import itemgetter
@@ -21,8 +22,10 @@ class Test(unittest.TestCase):
     @classmethod
     def setup_class(self):
-        path = os.path.join(HERE, "pdfs/pdffill-demo.pdf")
-        self.pdf = pdfplumber.open(path)
+        self.pdf = pdfplumber.open(os.path.join(HERE, "pdfs/pdffill-demo.pdf"))
+        self.pdf_scotus = pdfplumber.open(
+            os.path.join(HERE, "pdfs/scotus-transcript-p1.pdf")
+        )
 
     @classmethod
     def teardown_class(self):
@@ -129,21 +132,64 @@ def test_extract_text(self):
         assert self.pdf.pages[0].crop((0, 0, 1, 1)).extract_text() == ""
 
     def test_extract_text_layout(self):
-        pdf = pdfplumber.open(os.path.join(HERE, "pdfs/scotus-transcript-p1.pdf"))
         target = open(os.path.join(HERE, "comparisons/scotus-transcript-p1.txt")).read()
-        text = pdf.pages[0].extract_text(layout=True)
+        page = self.pdf_scotus.pages[0]
+        text = page.extract_text(layout=True)
+        utils_text = utils.extract_text(page.chars, layout=True)
+        assert text == utils_text
         assert text == target
 
     def test_extract_text_layout_cropped(self):
-        pdf = pdfplumber.open(os.path.join(HERE, "pdfs/scotus-transcript-p1.pdf"))
         target = open(
             os.path.join(HERE, "comparisons/scotus-transcript-p1-cropped.txt")
         ).read()
-        p = pdf.pages[0]
+        p = self.pdf_scotus.pages[0]
         cropped = p.crop((90, 70, p.width, 300))
         text = cropped.extract_text(layout=True)
         assert text == target
 
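+    # Each search(...) result below is a dict with the keys "text", "groups",
+    # "x0", "top", "x1", "bottom", and "chars"; the fixture is the SCOTUS
+    # transcript opened in setup_class above.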
results[0]["text"] == "SUPREME COURT" + assert results[0]["groups"] == ("COURT",) + assert results[1]["text"] == "Supreme Court" + assert results[1]["groups"] == ("Court",) + + with pytest.raises(ValueError): + page.search(re.compile(r"x"), regex=False) + + with pytest.raises(ValueError): + page.search(re.compile(r"x"), case=False) + + def test_search_regex_uncompiled(self): + page = self.pdf_scotus.pages[0] + pat = r"supreme\s+(\w+)" + results = page.search(pat, case=False) + assert results[0]["text"] == "SUPREME COURT" + assert results[0]["groups"] == ("COURT",) + assert results[1]["text"] == "Supreme Court" + assert results[1]["groups"] == ("Court",) + + def test_search_string(self): + page = self.pdf_scotus.pages[0] + results = page.search("SUPREME COURT", regex=False) + assert results[0]["text"] == "SUPREME COURT" + assert results[0]["groups"] == tuple() + + results = page.search("supreme court", regex=False) + assert len(results) == 0 + + results = page.search("supreme court", regex=False, case=False) + assert len(results) == 2 + + results = page.search("supreme court", regex=True, case=False) + assert len(results) == 2 + + results = page.search(r"supreme\s+(\w+)", regex=False) + assert len(results) == 0 + def test_intersects_bbox(self): objs = [ # Is same as bbox