-
Notifications
You must be signed in to change notification settings - Fork 15.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactoring PDF loaders: 02 PyMuPDF #29063
base: master
Are you sure you want to change the base?
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
Add file_path with PurePath Add CloudBlobLoader in __init__ Replace Dict/List to dict/list
039819c
to
3beda82
Compare
@eyurtsev I rebase the code with master ;-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great will take a look in the AM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left two major comment, a few stylistic comments and some nits.
Let's tackle the two major comments:
- Define the standardized structure of metadata
- Create a dedicated ImageParser which is a blob parser
@@ -46,6 +58,119 @@ | |||
"JBIG2Decode", | |||
] | |||
|
|||
logger = logging.getLogger(__name__) | |||
|
|||
_format_image_str = "\n\n{image_text}\n\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could we capitalize global constants
|
||
|
||
def purge_metadata(metadata: dict[str, Any]) -> dict[str, Any]: | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: https://google.github.io/styleguide/pyguide.html#383-functions-and-methods
We don't enforce this right now, but we try to have the first description on the first line (i.e., no new line)
for k, v in metadata.items(): | ||
if type(v) not in [str, int]: | ||
v = str(v) | ||
if k.startswith("/"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bug? The file path could be an absolute path on the local machine -- this looks like an error right now?
_delim = ["\n\n\n", "\n\n"] # To insert images or table in the middle of the page. | ||
|
||
|
||
def __merge_text_and_extras( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Maybe improve the name so it's a bit more distinct from the _
name? We typically don't use __
in the code
""" | ||
Purge metadata from unwanted keys and normalize key names. | ||
Args: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MAJOR:
- Could you describe what the wanted keys are and how they will be standardized / normalized? (And why?)
This feels like a big decision if the metadata is to be standardized across all PDF parsers?
MINOR
- The function documentation makes it sound like it's mutating the original metadata (which I think it's not doing). A better name like "create_standardized_metadata" or "standardize metadata" and a doc-string that indicates that it's creating a standardized metadata dict from the given will help here
return _convert_images_to_text | ||
|
||
|
||
_prompt_images_to_description = PromptTemplate.from_template( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to use a string here since .format() isn't used in a useful way
else: | ||
yield "" | ||
|
||
_convert_images_to_text.creator = ( # type: ignore[attr-defined] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's avoid assigning attributes to functions
|
||
def _get_page_content(self, doc: fitz.Document, page: fitz.Page, blob: Blob) -> str: | ||
self.extract_tables_settings = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how were these chosen is it possible to add a comment?
def convert_images_to_description( | ||
model: BaseChatModel, | ||
*, | ||
prompt: BasePromptTemplate = _prompt_images_to_description, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prompt: BasePromptTemplate = _prompt_images_to_description, | |
prompt: str = _prompt_images_to_description, |
@@ -78,6 +203,192 @@ def extract_from_images_with_rapidocr( | |||
return text | |||
|
|||
|
|||
# Type to change the function to convert images to text. | |||
CONVERT_IMAGE_TO_TEXT = Optional[Callable[[Iterable[np.ndarray]], Iterator[str]]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MAJOR:
Why not use an ImageBlobParser w/ the regular Blob to Document interface. it'll allow reusing the image logic for images that do not originate from pdfs (e.g., to re-use for a web crawler)
A PDF parser doesn't would accept a parser as part of the initializer
class PDFParser(...):
def __Init__(self, ... *, ..., image_blob_parser: Optional[BlobParser] = None):
pass
If the image_pdf_parser is provided, then it'll be used for OCR purposes.
Refactoring PDF loaders step 2: "community: Refactoring PDF loaders to standardize approaches"
Description: Update PyMuPDFParser/Loader
Twitter handle: pprados
This is one part of a larger Pull Request (PR) that is too large to be submitted all at once.
This specific part focuses to prepare the update of all parsers.
For more details, see PR 28970.
@eyurtsev it's the continuation of PDFLoader modifications.