Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix classify: there is no more pdf_bytes in UNIPipe #1379

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

MatthewZMD
Copy link

@MatthewZMD MatthewZMD commented Dec 30, 2024

Motivation

The UNIPipe class had an outdated approach of handling raw PDF bytes directly. This needed to be updated to use the Dataset abstraction layer consistently throughout the codebase. This change improves code consistency and better follows object-oriented design principles by working with Dataset objects instead of raw bytes.

Modification

  1. Modified AbsPipe.classify() to work with self.dataset instead of taking pdf_bytes parameter
  2. Updated UNIPipe's constructor and pipe_classify() to properly use the Dataset abstraction
  3. Fixed the initialization order in UNIPipe to set pdf_type after super().init()
  4. Updated the test code to use PymuDocDataset instead of raw bytes

BC-breaking

Yes, this change breaks backward compatibility in two ways:

  1. AbsPipe.classify() no longer accepts pdf_bytes as a parameter
  2. UNIPipe constructor no longer accepts raw bytes

Downstream projects need to:

  1. Create a Dataset instance (preferably PymuDocDataset) for their PDF data
  2. Pass the Dataset instance to UNIPipe instead of raw bytes
  3. Update any direct calls to classify() to work with Dataset objects

Use cases

Basic usage with the new API:

from magic_pdf.data.dataset import PymuDocDataset

# Create dataset from PDF bytes
dataset = PymuDocDataset(pdf_bytes)

# Initialize pipe with dataset
pipe = UNIPipe(dataset, jso_useful_key, img_writer)
pipe.pipe_classify()

Checklist

Before PR:

  • Pre-commit or other linting tools are used to fix the potential lint issues.
  • Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests.
  • The modification is covered by complete unit tests. If not, please add more unit test to ensure the correctness.
  • The documentation has been modified accordingly, like docstring or example tutorials.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
  • CLA has been signed and all committers have signed the CLA in this PR.

Copy link
Contributor

github-actions bot commented Dec 30, 2024

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@MatthewZMD
Copy link
Author

I have read the CLA Document and I hereby sign the CLA

@MatthewZMD
Copy link
Author

recheck

MatthewZMD pushed a commit to MatthewZMD/MinerU-GUI that referenced this pull request Dec 30, 2024
@MatthewZMD
Copy link
Author

I'm curious about the design choice of having the classify() method in the abstract base class AbsPipe. Since AbsPipe is meant to define the interface that concrete pipe implementations must follow, having a concrete classification implementation there seems to mix abstraction with implementation.

Why not make classify() abstract like the other key methods (pipe_classify, pipe_analyze, pipe_parse), allowing each pipe implementation to define its own classification strategy?
If the current classification logic is meant to be shared, would it make more sense as a utility function in a separate module?

What was the reasoning behind putting this implementation in the abstract class? Understanding the motivation would help evaluate if there might be a cleaner design approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant