Fix classify: there is no more pdf_bytes in UNIPipe #1379

MatthewZMD · 2024-12-30T04:45:52Z

Motivation

The UNIPipe class still handled raw PDF bytes directly, an approach that predates the Dataset abstraction layer. This change updates it to use the Dataset abstraction consistently with the rest of the codebase. Working with Dataset objects instead of raw bytes keeps the code consistent and better follows object-oriented design principles.

Modification

Modified AbsPipe.classify() to work with self.dataset instead of taking a pdf_bytes parameter
Updated UNIPipe's constructor and pipe_classify() to properly use the Dataset abstraction
Fixed the initialization order in UNIPipe to set pdf_type after super().__init__()
Updated the test code to use PymuDocDataset instead of raw bytes
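
The modifications listed above can be sketched in a self-contained form. The classes below are hypothetical stand-ins (StubDataset, SketchAbsPipe, SketchUNIPipe), not the MinerU implementation; the raw() accessor, the toy classification rule, and the dictionary keys are assumptions made only for illustration. The sketch also shows one way initialization order can matter, under the assumption that the base constructor assigns a default to pdf_type.

```python
from abc import ABC, abstractmethod


class StubDataset:
    """Hypothetical stand-in for a Dataset; it only wraps raw bytes."""

    def __init__(self, data: bytes):
        self._data = data

    def raw(self) -> bytes:
        # Assumed accessor name, used only in this sketch.
        return self._data


class SketchAbsPipe(ABC):
    """Hypothetical base class mirroring the described change."""

    def __init__(self, dataset: StubDataset, model_list: list):
        self.dataset = dataset
        self.model_list = model_list
        self.pdf_type = None  # assumed default assigned by the base class

    def classify(self) -> str:
        # No pdf_bytes parameter: the method reads from self.dataset.
        # The rule below is a toy placeholder, not real classification.
        return "txt" if b"text" in self.dataset.raw() else "ocr"

    @abstractmethod
    def pipe_classify(self) -> None:
        ...


class SketchUNIPipe(SketchAbsPipe):
    def __init__(self, dataset: StubDataset, jso_useful_key: dict):
        super().__init__(dataset, jso_useful_key["model_list"])
        # Set after super().__init__() so the base default does not
        # overwrite the caller-supplied value.
        self.pdf_type = jso_useful_key["_pdf_type"]

    def pipe_classify(self) -> None:
        self.pdf_type = self.classify()


pipe = SketchUNIPipe(
    StubDataset(b"some text"),
    {"_pdf_type": "ocr", "model_list": []},
)
initial_type = pipe.pdf_type
pipe.pipe_classify()
final_type = pipe.pdf_type
```

Had pdf_type been assigned before calling super().__init__() in this sketch, the base constructor would have reset it to None.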

BC-breaking

Yes, this change breaks backward compatibility in two ways:

AbsPipe.classify() no longer accepts pdf_bytes as a parameter
UNIPipe constructor no longer accepts raw bytes

Downstream projects need to:

Create a Dataset instance (preferably PymuDocDataset) for their PDF data
Pass the Dataset instance to UNIPipe instead of raw bytes
Update any direct calls to classify() to work with Dataset objects
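
For downstream code that still receives raw bytes in some places, one possible migration aid is a small adapter that wraps bytes in a Dataset and passes existing Dataset objects through unchanged. This is only a sketch: ensure_dataset and StubDataset are hypothetical names, and in real code the factory argument would presumably be PymuDocDataset.

```python
from typing import Any, Callable, Union


class StubDataset:
    """Hypothetical stand-in for a Dataset class such as PymuDocDataset."""

    def __init__(self, data: bytes):
        self.data = data


def ensure_dataset(
    source: Union[bytes, bytearray, Any],
    factory: Callable[[bytes], Any],
) -> Any:
    """Wrap raw bytes with `factory`; return anything else unchanged."""
    if isinstance(source, (bytes, bytearray)):
        return factory(bytes(source))
    return source


# Raw bytes are wrapped; an existing dataset is returned as-is.
wrapped = ensure_dataset(b"%PDF-1.7", StubDataset)
existing = StubDataset(b"%PDF-1.7")
passed_through = ensure_dataset(existing, StubDataset)
```

Keeping the conversion in one helper means call sites can be migrated gradually, after which the helper can be removed.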

Use cases

Basic usage with the new API:

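from magic_pdf.data.dataset import PymuDocDataset
from magic_pdf.pipe.UNIPipe import UNIPipe  # import path assumed; not shown in the original snippet

# pdf_bytes, jso_useful_key and img_writer are assumed to be defined by the caller

# Create a dataset from the PDF bytes
dataset = PymuDocDataset(pdf_bytes)

# Initialize the pipe with the dataset instead of raw bytes
pipe = UNIPipe(dataset, jso_useful_key, img_writer)
pipe.pipe_classify()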

Checklist

Before PR:

Pre-commit or other linting tools are used to fix the potential lint issues.
Bug fixes are fully covered by unit tests; the case that causes the bug should be added to the unit tests.
The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
The documentation has been modified accordingly, like docstring or example tutorials.

After PR:

If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
CLA has been signed and all committers have signed the CLA in this PR.

github-actions · 2024-12-30T04:46:05Z

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

MatthewZMD · 2024-12-30T04:46:33Z

I have read the CLA Document and I hereby sign the CLA

MatthewZMD · 2024-12-30T04:47:26Z

recheck

Signed-off-by: Mingde (Matthew) Zeng <[email protected]>

MatthewZMD · 2024-12-30T04:56:45Z

I'm curious about the design choice of having the classify() method in the abstract base class AbsPipe. Since AbsPipe is meant to define the interface that concrete pipe implementations must follow, having a concrete classification implementation there seems to mix abstraction with implementation.

Why not make classify() abstract like the other key methods (pipe_classify, pipe_analyze, pipe_parse), allowing each pipe implementation to define its own classification strategy?
If the current classification logic is meant to be shared, would it make more sense as a utility function in a separate module?

What was the reasoning behind putting this implementation in the abstract class? Understanding the motivation would help evaluate if there might be a cleaner design approach.
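
The two alternatives raised in this comment can be compared in a short sketch. Everything here is hypothetical (StubDataset, classify_by_size, the pipe class names, and the toy rule); it only illustrates the shape of each option and is not taken from the MinerU codebase.

```python
from abc import ABC, abstractmethod


class StubDataset:
    """Hypothetical stand-in for a Dataset; it only wraps raw bytes."""

    def __init__(self, data: bytes):
        self.data = data


def classify_by_size(dataset: StubDataset) -> str:
    """Alternative 2: shared logic as a module-level utility (toy rule)."""
    return "txt" if len(dataset.data) > 0 else "ocr"


class AbstractClassifyPipe(ABC):
    """Alternative 1: the base class only declares classify()."""

    def __init__(self, dataset: StubDataset):
        self.dataset = dataset

    @abstractmethod
    def classify(self) -> str:
        ...


class SharedLogicPipe(AbstractClassifyPipe):
    """Concrete pipe that opts in to the shared utility."""

    def classify(self) -> str:
        return classify_by_size(self.dataset)


class AlwaysOcrPipe(AbstractClassifyPipe):
    """Concrete pipe with its own classification strategy."""

    def classify(self) -> str:
        return "ocr"


shared_result = SharedLogicPipe(StubDataset(b"abc")).classify()
custom_result = AlwaysOcrPipe(StubDataset(b"abc")).classify()
```

With this layout the base class stays purely an interface, while pipes that want the common behaviour delegate to the utility explicitly.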

MatthewZMD pushed a commit to MatthewZMD/MinerU-GUI that referenced this pull request Dec 30, 2024

Compat with upstream after opendatalab/MinerU#1379 is merged

24b42c5

Signed-off-by: Mingde (Matthew) Zeng <[email protected]>

Fix classify: there is no more pdf_bytes in UNIPipe

994b974

Signed-off-by: Mingde (Matthew) Zeng <[email protected]>

MatthewZMD force-pushed the fix_unipipe branch from 9ae7635 to 994b974 on December 30, 2024 04:52

github-actions bot added a commit that referenced this pull request Dec 30, 2024

@MatthewZMD has signed the CLA in #1379

81fcef8
