Improve field finding in PDFs: #55
Merged
* Collapses multiple text lines with nothing in the middle (and nothing to the left, since multiple fields often have their labels off to the left) into a text area
* Detects the size of a text field and roughly uses it to set the font size, so the text field is readable on PDFs with smaller font sizes
* Adds a checkbox detector that works for brackets "[ ]"
* Doesn't add text boxes if:
  * there are things above the line that aren't blank
* Adds a script to run strip fields and auto fields over all PDFs in a folder easily
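As a rough illustration of the font-size bullet above, here is a minimal sketch of deriving a font size from a detected field height. The function name `font_size_for_field`, the fill ratio, and the clamping bounds are all assumptions made for this example, not values taken from the PR.

```python
def font_size_for_field(
    field_height_pt: float,
    fill_ratio: float = 0.7,   # assumed: text fills ~70% of the field height
    min_size: float = 6.0,     # assumed lower bound so text stays legible
    max_size: float = 12.0,    # assumed upper bound for ordinary forms
) -> float:
    """Return a font size (in points) that roughly fits the field height."""
    raw = field_height_pt * fill_ratio
    # Clamp so very small or very tall fields still get a sensible size.
    return max(min_size, min(max_size, raw))


if __name__ == "__main__":
    for height in (4, 10, 100):
        print(height, font_size_for_field(height))
```

Clamping keeps the result usable even when the detected height is noisy, which seems likely given that the size is only detected "roughly".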
Works surprisingly well, but it does take a long time. We have to hackily override some pdfminer classes so that we completely skip any character that's not a `[` or `]` (splitting everything into individual characters adds about 20 seconds, so we avoid that), and then directly bbox-intersect (with a horizontal dilation) every left bracket with a right bracket. It's not smart enough to know that "[i]" likely isn't a checkbox, but we can and should fix that with more image searching to see whether the box is empty (something we could also borrow from boxdetect: everything in `get_checkboxes` that isn't in `get_boxes`).
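The bracket-pairing step described above can be sketched as follows. This is an illustration under assumptions, not the code from this PR: the helper names, the `(x0, y0, x1, y1)` box layout, the default dilation, and the rule that a right bracket must start at or after its left bracket are all hypothetical.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def dilate_horizontally(box: BBox, amount: float) -> BBox:
    """Grow a box left and right by `amount`, leaving its height unchanged."""
    x0, y0, x1, y1 = box
    return (x0 - amount, y0, x1 + amount, y1)


def intersects(a: BBox, b: BBox) -> bool:
    """True if two axis-aligned boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def pair_brackets(
    lefts: List[BBox], rights: List[BBox], dilation: float = 6.0
) -> List[BBox]:
    """Pair each '[' box with the first ']' box its dilated bbox touches."""
    pairs = []
    rights_sorted = sorted(rights, key=lambda b: b[0])
    for left in lefts:
        grown = dilate_horizontally(left, dilation)
        for right in rights_sorted:
            # Assumption: the closing bracket starts at or after the opening one.
            if right[0] >= left[0] and intersects(grown, right):
                pairs.append((
                    min(left[0], right[0]), min(left[1], right[1]),
                    max(left[2], right[2]), max(left[3], right[3]),
                ))
                break
    return pairs


if __name__ == "__main__":
    lefts = [(10, 100, 13, 110)]
    rights = [(18, 100, 21, 110), (200, 100, 203, 110)]
    print(pair_brackets(lefts, rights))
```

Because only the horizontal extent is dilated, brackets on different text lines do not pair up, while a bracket pair separated by a few spaces still does.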
This can still get kind of complicated. It still needs testing, and we may need to change how we iterate over each image, but I'm not sure yet.
The text finder will sometimes add multiple spaces between the brackets of '[ ]' when getting text; we now handle 1 to 3 spaces in the initial search.
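One way to tolerate 1 to 3 spaces between the brackets is a bounded regular expression. The sketch below is an assumption about how such a search could look; the pattern and the helper name `find_bracket_checkboxes` are illustrative and not taken from the PR.

```python
import re
from typing import List, Tuple

# Matches "[", then one to three spaces, then "]".
EMPTY_BRACKETS = re.compile(r"\[ {1,3}\]")


def find_bracket_checkboxes(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character spans of empty bracket checkboxes."""
    return [m.span() for m in EMPTY_BRACKETS.finditer(text)]


if __name__ == "__main__":
    sample = "[ ] Yes [  ] No [i] done"
    print(find_bracket_checkboxes(sample))
```

A bracket pair with a letter inside, such as "[i]", does not match this pattern, and neither does a pair with more than three spaces.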
Now that we've done more advanced handling of `[` in PDF text, it's easy to just not look for '_'. This specifically solves an issue where the label box completely overlaps the field box, which would cause some labels to be added as labels for more than one field.
Content inside a detected checkbox means that it's likely a false positive; we're only looking for empty ones.
Several locations had hardcoded exact pixel widths. However, a DPI of 200 wasn't good enough for some field lines: it would split the line in two, or mess with the kerning on letters, creating connected stretches. Bumped the DPI to 250 and fixed the hardcoded pixel widths.
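One way to keep pixel widths from going stale when the DPI changes is to express sizes in PDF points (1/72 inch) and convert at render time. This is only a sketch of that idea; the PR does not say how the hardcoded widths were fixed, and `pt_to_px` and `RENDER_DPI` are hypothetical names.

```python
RENDER_DPI = 250  # the DPI this change moved to (previously 200)


def pt_to_px(length_pt: float, dpi: int = RENDER_DPI) -> int:
    """Convert a length in PDF points (1/72 inch) to whole pixels at `dpi`."""
    return round(length_pt * dpi / 72)


if __name__ == "__main__":
    # The same physical length maps to different pixel counts per DPI.
    print(pt_to_px(72, dpi=200), pt_to_px(72, dpi=250))
```

With sizes stored in points, bumping the DPI again would only require changing one constant.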
* Prints to random file names, so we don't overwrite things and can look at multiple pages at once
* Slightly better field labels (more stripped, don't find single-character labels, etc.)
* Don't duplicate labels! This can still be improved by removing used labels from contention, but for now, just go back to the simple labels
* Actually change the font size of the text fields; it wasn't happening before
This was referenced Sep 1, 2022
nonprofittechy
approved these changes
Sep 6, 2022
Compare
Admittedly cherry-picked, but it's because we were pretty good before and not much changed in many forms except for font size.
We were finding checkboxes, but I tweaked some params there to find more, because many were too small.
Before
After