Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add HOCRDocProprocessor and HocrVisualParser #519

Merged
merged 14 commits into from
Oct 6, 2020

Conversation

HiromuHota
Copy link
Contributor

@HiromuHota HiromuHota commented Sep 30, 2020

Description of the problems or issues

Is your pull request related to a problem? Please describe.

This is the second patch that follows #518 .

Does your pull request fix any issue.

N/A.

Description of the proposed changes

Add HOCRDocProprocessor and HocrVisualParser

Test plan

I added a few real hOCR example files.

Checklist

  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • I have updated the CHANGELOG.rst accordingly.

@HiromuHota HiromuHota mentioned this pull request Sep 30, 2020
7 tasks
@codecov-commenter
Copy link

codecov-commenter commented Sep 30, 2020

Codecov Report

Merging #519 into master will increase coverage by 0.22%.
The diff coverage is 90.96%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #519      +/-   ##
==========================================
+ Coverage   85.81%   86.03%   +0.22%     
==========================================
  Files          90       92       +2     
  Lines        4582     4769     +187     
  Branches      852      896      +44     
==========================================
+ Hits         3932     4103     +171     
- Misses        467      475       +8     
- Partials      183      191       +8     
Flag Coverage Δ
#unittests 86.03% <90.96%> (+0.22%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
setup.py 0.00% <ø> (ø)
.../fonduer/parser/visual_parser/pdf_visual_parser.py 84.76% <ø> (ø)
...duer/parser/preprocessors/hocr_doc_preprocessor.py 86.66% <86.66%> (ø)
...fonduer/parser/visual_parser/hocr_visual_parser.py 95.23% <95.23%> (ø)
src/fonduer/parser/preprocessors/__init__.py 100.00% <100.00%> (ø)
src/fonduer/parser/visual_parser/__init__.py 100.00% <100.00%> (ø)
src/fonduer/parser/parser.py 93.39% <0.00%> (+0.23%) ⬆️

@HiromuHota HiromuHota marked this pull request as ready for review September 30, 2020 18:06
@HiromuHota HiromuHota added the enhancement New feature or request label Sep 30, 2020
@senwu senwu requested review from lukehsiao and senwu September 30, 2020 23:27
@HiromuHota
Copy link
Contributor Author

Unfortunately, this PR would not fix the #12. Moreover, #12 won't be fixed by its nature.
I'd recommend the hOCR format if visual information is to be parsed because bboxes and words are perfectly linked in the hOCR.
If hOCR cannot be used for any reason, use PdfVisualParser as a last resort.

@senwu
Copy link
Collaborator

senwu commented Oct 3, 2020

Can we get all conflicts resolved first? Thanks!

@HiromuHota
Copy link
Contributor Author

I rebased on the master branch and resolved the conflicts. Thanks.

Copy link
Contributor

@lukehsiao lukehsiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few nits.

docs/user/parser.rst Show resolved Hide resolved
docs/user/parser.rst Outdated Show resolved Hide resolved
docs/user/parser.rst Outdated Show resolved Hide resolved
tests/data/hocr_simple/md.hocr Show resolved Hide resolved
tests/parser/test_preprocessor.py Show resolved Hide resolved
@HiromuHota
Copy link
Contributor Author

@lukehsiao thank you so much for reviewing such a big PR and your comments.
Please review my edits and replies. Feel free to ask any further clarification.

Copy link
Contributor

@lukehsiao lukehsiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

I think @senwu wanted to take a look to, so I'll wait for him. But LGTM.

Copy link
Collaborator

@senwu senwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@senwu senwu merged commit b44cdcd into HazyResearch:master Oct 6, 2020
@HiromuHota HiromuHota deleted the fix/476_2 branch October 6, 2020 23:21
@HiromuHota HiromuHota mentioned this pull request Oct 6, 2020
@senwu
Copy link
Collaborator

senwu commented Oct 6, 2020

Let's plan to have a tutorial about this PR. This is a really awesome improvement!!! Thoughts? @HiromuHota @lukehsiao

@HiromuHota
Copy link
Contributor Author

I'll definitely update the existing tutorials.
Rather than adding a new tutorial, I'd like to

  • pick one of the existing ones and replace HTMLDocPreprocessor and PdfVisualParser with HOCRDocProprocessor and HocrVisualParser if it is just drop-in replacement,
  • Or enrich the "Parsing Documents into the Data Model" section of intro/Intro_Data_Model.ipynb.

@senwu
Copy link
Collaborator

senwu commented Oct 7, 2020

That works as well. We need to show two things: 1) some basics about what's in it and how to use it, and 2) end to end run with high quality.

@HiromuHota
Copy link
Contributor Author

An update: I've started replacing html with hocr (html -> pdfy -> pdf -> pdftotree -> hocr) in the wikipedia tutorial.
With a minor change at parser, all cells run with no error but #527 up until the LF part, which needs a major change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants