Refactoring PDF loaders: 02 PyMuPDF #29063

Open

pprados wants to merge 7 commits into langchain-ai:master from pprados:pprados/02-pymupdf

+2,058 −173

Contributor

pprados commented Jan 7, 2025 •

Refactoring PDF loaders step 2: "community: Refactoring PDF loaders to standardize approaches"
Description: Update PyMuPDFParser/Loader
Twitter handle: pprados

This is one part of a larger Pull Request (PR) that is too large to be submitted all at once.
This specific part focuses on preparing the update of all parsers.

For more details, see PR 28970.

@eyurtsev it's the continuation of PDFLoader modifications.

vercel bot commented Jan 7, 2025 •

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
langchain	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Jan 7, 2025 4:18pm

vercel bot deployed to Preview

January 7, 2025 08:55


vercel bot deployed to Preview

January 7, 2025 09:15


pprados marked this pull request as ready for review

January 7, 2025 09:16

dosubot bot added size:XXL community Ɑ: doc loader labels

ccurme assigned eyurtsev

pprados added 7 commits

January 7, 2025 17:08


          Prepare the integration of new versions of PDFLoader.

21759e2

Add file_path with PurePath
Add CloudBlobLoader in __init__
Replace Dict/List with dict/list


          Fix Line too long


          Fix Line too long

668dc9c


          Fix Line too long

7a5b5c5


          Fix Line too long

6340ded


          Update PyMuPDF


          Fix tu

3beda82

pprados force-pushed the pprados/02-pymupdf branch from 039819c to 3beda82 Compare

January 7, 2025 16:09

vercel bot deployed to Preview

January 7, 2025 16:18


Contributor Author

pprados commented Jan 7, 2025

@eyurtsev I rebased the code onto master ;-)

eyurtsev reviewed


Collaborator

eyurtsev left a comment

Great, will take a look in the AM.

pprados mentioned this pull request

Refactoring PDF loaders: all #28970

Draft

2 tasks

eyurtsev reviewed


Collaborator

eyurtsev left a comment

Left two major comments, a few stylistic comments, and some nits.

Let's tackle the two major comments:

Define the standardized structure of metadata
Create a dedicated ImageParser which is a blob parser

libs/community/langchain_community/document_loaders/parsers/pdf.py

    
            @@ -46,6 +58,119 @@
          
                  "JBIG2Decode",

              ]

              logger = logging.getLogger(__name__)

              _format_image_str = "\n\n{image_text}\n\n"

Collaborator

eyurtsev Jan 8, 2025

nit: could we capitalize global constants?

https://google.github.io/styleguide/pyguide.html#316-naming

libs/community/langchain_community/document_loaders/parsers/pdf.py

    
              def purge_metadata(metadata: dict[str, Any]) -> dict[str, Any]:

                  """

Collaborator

eyurtsev Jan 8, 2025

nit: https://google.github.io/styleguide/pyguide.html#383-functions-and-methods

We don't enforce this right now, but we try to start the description on the first line of the docstring (i.e., no newline after the opening quotes).
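A minimal sketch of the docstring layout being asked for, using the function signature and summary sentence visible in the diff. The `Args`/`Returns` wording and the placeholder body are assumptions made for this example.

```python
from typing import Any


def purge_metadata(metadata: dict[str, Any]) -> dict[str, Any]:
    """Purge metadata from unwanted keys and normalize key names.

    Args:
        metadata: The raw metadata extracted from the PDF.

    Returns:
        A new dictionary. The body below is only a placeholder for this sketch.
    """
    return dict(metadata)
```

The summary sentence sits on the same line as the opening quotes, so it is the first line that `help()` and documentation tools display.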

libs/community/langchain_community/document_loaders/parsers/pdf.py

    
                  for k, v in metadata.items():

                      if type(v) not in [str, int]:

                          v = str(v)

                      if k.startswith("/"):

Collaborator

eyurtsev Jan 8, 2025

bug? The file path could be an absolute path on the local machine -- this looks like an error right now?

libs/community/langchain_community/document_loaders/parsers/pdf.py

    
              _delim = ["\n\n\n", "\n\n"]  # To insert images or table in the middle of the page.

              def __merge_text_and_extras(

Collaborator

eyurtsev Jan 8, 2025

nit: Maybe improve the name so it's a bit more distinct from the _ name? We typically don't use __ in the code
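One concrete reason to avoid a leading double underscore, sketched under the assumption that the helper is a module-level function called from inside a class: Python's private name mangling rewrites any `__name` identifier that appears textually inside a class body, so the call no longer resolves to the module-level function. All names below are hypothetical.

```python
def __merge_text(text: str, extra: str) -> str:
    """Module-level helper with a double-underscore name (discouraged)."""
    return text + "\n\n" + extra


def _merge_text(text: str, extra: str) -> str:
    """Same helper with a single leading underscore."""
    return text + "\n\n" + extra


class Parser:
    def call_double(self) -> str:
        # Inside a class body this identifier is compiled as
        # "_Parser__merge_text", which does not exist at module level,
        # so the call raises NameError at run time.
        return __merge_text("text", "extra")

    def call_single(self) -> str:
        # A single underscore is not mangled, so the lookup succeeds.
        return _merge_text("text", "extra")
```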

libs/community/langchain_community/document_loaders/parsers/pdf.py

    
                  """

                  Purge metadata from unwanted keys and normalize key names.

                  Args:

Collaborator

eyurtsev Jan 8, 2025

MAJOR:

Could you describe what the wanted keys are and how they will be standardized / normalized? (And why?)

This feels like a big decision if the metadata is to be standardized across all PDF parsers?

MINOR

The function documentation makes it sound like it's mutating the original metadata (which I think it's not doing). A better name like "create_standardized_metadata" or "standardize metadata", and a docstring that indicates that it's creating a standardized metadata dict from the given one, will help here.
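A minimal sketch of what the suggested rename could look like: a function that builds a new dict and leaves its input untouched. The set of wanted keys and the normalization rules (strip leading "/", lower-case the key, stringify values that are not `str` or `int`) are assumptions for illustration; the thread does not define them.

```python
from typing import Any

# Hypothetical set of wanted keys; the PR thread does not specify one.
_WANTED_KEYS = {"author", "title", "creationdate", "source", "page"}


def create_standardized_metadata(metadata: dict[str, Any]) -> dict[str, Any]:
    """Create a standardized metadata dict from the given one.

    The input dict is not modified; a new dict is returned.
    """
    standardized: dict[str, Any] = {}
    for key, value in metadata.items():
        # Assumed normalization: drop leading "/" and lower-case the key.
        normalized_key = key.lstrip("/").lower()
        if normalized_key not in _WANTED_KEYS:
            continue  # unwanted keys are skipped, not deleted from the input
        if not isinstance(value, (str, int)):
            value = str(value)
        standardized[normalized_key] = value
    return standardized


raw = {"/Author": "Ada", "Title": "Report", "Producer": "x", "page": 3}
clean = create_standardized_metadata(raw)
```

Here `clean` holds only the normalized wanted keys, while `raw` still contains its original four entries.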

libs/community/langchain_community/document_loaders/parsers/pdf.py

    
                  return _convert_images_to_text

              _prompt_images_to_description = PromptTemplate.from_template(

Collaborator

eyurtsev Jan 9, 2025

Better to use a string here since .format() isn't used in a useful way

libs/community/langchain_community/document_loaders/parsers/pdf.py

    
                          else:

                              yield ""

                  _convert_images_to_text.creator = (  # type: ignore[attr-defined]

Collaborator

eyurtsev Jan 9, 2025

let's avoid assigning attributes to functions
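One possible alternative, sketched with hypothetical names: wrap the converter in a small callable dataclass so that `creator` is a declared, typed field instead of an attribute attached to a function after the fact. The toy converter and the `bytes` image type are assumptions for this example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class ImageTextConverter:
    """Callable wrapper that carries its metadata as a regular field."""

    creator: str  # the value the diff stored as a function attribute
    convert: Callable[[Iterable[bytes]], Iterator[str]]

    def __call__(self, images: Iterable[bytes]) -> Iterator[str]:
        return self.convert(images)


def _describe_sizes(images: Iterable[bytes]) -> Iterator[str]:
    """Toy converter for the sketch: reports each image's size in bytes."""
    for image in images:
        yield f"{len(image)} bytes"


converter = ImageTextConverter(creator="describe_sizes", convert=_describe_sizes)
```

The wrapper is still callable like the original function, and type checkers see `creator` without needing a `# type: ignore[attr-defined]` comment.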

libs/community/langchain_community/document_loaders/parsers/pdf.py

    
                  def _get_page_content(self, doc: fitz.Document, page: fitz.Page, blob: Blob) -> str:

                              self.extract_tables_settings = {

Collaborator

eyurtsev Jan 9, 2025

How were these chosen? Is it possible to add a comment?

libs/community/langchain_community/document_loaders/parsers/pdf.py

    
              def convert_images_to_description(

                  model: BaseChatModel,

                  *,

                  prompt: BasePromptTemplate = _prompt_images_to_description,

Collaborator

eyurtsev Jan 9, 2025

Suggested change

    - prompt: BasePromptTemplate = _prompt_images_to_description,
    + prompt: str = _prompt_images_to_description,

libs/community/langchain_community/document_loaders/parsers/pdf.py

    
            @@ -78,6 +203,192 @@ def extract_from_images_with_rapidocr(
          
                  return text

              # Type to change the function to convert images to text.

              CONVERT_IMAGE_TO_TEXT = Optional[Callable[[Iterable[np.ndarray]], Iterator[str]]]

Collaborator

eyurtsev Jan 9, 2025

MAJOR:

Why not use an ImageBlobParser with the regular Blob-to-Document interface? It would allow reusing the image logic for images that do not originate from PDFs (e.g., for a web crawler).

A PDF parser would then accept a parser as part of the initializer:

class PDFParser(...):
    def __init__(self, ..., *, ..., image_blob_parser: Optional[BlobParser] = None):
        pass

If the image_blob_parser is provided, then it'll be used for OCR purposes.
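The proposed design can be sketched end to end with stand-in classes. Everything below is an assumption made for illustration: `Blob`, `Document`, and `BlobParser` are simplified stand-ins rather than LangChain's classes, `FakeOcrImageParser` is a toy, and the "PDF" is a fake byte string with text before `|` and one "image" after it.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator, Optional


@dataclass
class Blob:
    """Stand-in for a raw-data container; not LangChain's Blob class."""
    data: bytes
    mimetype: str = "application/octet-stream"


@dataclass
class Document:
    """Stand-in for a parsed document; not LangChain's Document class."""
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


class BlobParser:
    """Minimal Blob -> Document interface assumed for this sketch."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        raise NotImplementedError


class FakeOcrImageParser(BlobParser):
    """Toy image parser: treats the image bytes as the recognized text."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        yield Document(blob.data.decode("utf-8"), {"mimetype": blob.mimetype})


class PDFParser(BlobParser):
    def __init__(self, *, image_blob_parser: Optional[BlobParser] = None) -> None:
        self.image_blob_parser = image_blob_parser

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        # A real parser would extract text and embedded images from the PDF.
        # Toy format here: page text, then b"|", then one "image".
        text, _, image = blob.data.partition(b"|")
        content = text.decode("utf-8")
        if self.image_blob_parser is not None and image:
            image_blob = Blob(data=image, mimetype="image/png")
            for doc in self.image_blob_parser.lazy_parse(image_blob):
                content += "\n\n" + doc.page_content
        yield Document(content, {"mimetype": blob.mimetype})
```

Because the image parser only sees a `Blob`, the same object can be used on its own for images that never came from a PDF, which is the reuse the comment asks for.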

