Improve field finding in PDFs: #55

BryceStevenWilley · 2022-08-19T21:49:09Z

* Collapse multiple text lines with nothing in between them (and nothing to the left, since multiple fields often have their labels off to the left) into a single text area.
* Detect the size of a text field and roughly use it to set the font size, so the text field is readable on PDFs with smaller font sizes.
* Add a checkbox detector that works for brackets ("[ ]") by searching the PDF text for instances of `[ ]`, including variants with extra spaces between the brackets.
* Don't add duplicate field labels!
* Fix some issues where label bounding boxes were really big because they included the field itself, which caused some text to become the label for multiple fields, even if it wasn't that good of a choice.
* Skip over checkboxes that look like they're filled in (when using boxdetect), as they're likely false positives (we're only looking for empty checkboxes).
* Don't add text boxes if:
  * there are things above the line that aren't blank
* Add a script to easily run strip fields and auto fields over all PDFs in a folder.
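
The first item in the list (collapsing stacked lines into one text area) can be sketched roughly as below. This is only an illustration under stated assumptions, not FormFyxer's actual code: the `Box` layout, the names `overlaps` and `collapse_lines`, and the `max_gap` / `align_tol` thresholds are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical box layout: (x0, y0, x1, y1) with y growing downward.
Box = Tuple[float, float, float, float]


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes share any interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def collapse_lines(lines: List[Box], text_boxes: List[Box],
                   max_gap: float = 30.0, align_tol: float = 5.0) -> List[Box]:
    """Merge vertically stacked field lines into one text-area box."""
    merged: List[Box] = []
    for line in sorted(lines, key=lambda b: b[1]):  # top to bottom
        if merged:
            prev = merged[-1]
            # Both ends of the lines must roughly line up.
            aligned = (abs(prev[0] - line[0]) <= align_tol
                       and abs(prev[2] - line[2]) <= align_tol)
            # The next line must sit just below the previous one.
            close = 0 <= line[1] - prev[3] <= max_gap
            # Region between the two lines, extended to the left page edge,
            # so a label off to the left also blocks the merge.
            between: Box = (0.0, prev[3], max(prev[2], line[2]), line[1])
            clear = not any(overlaps(between, t) for t in text_boxes)
            if aligned and close and clear:
                merged[-1] = (min(prev[0], line[0]), prev[1],
                              max(prev[2], line[2]), line[3])
                continue
        merged.append(line)
    return merged
```

With three aligned lines and no text nearby, the sketch returns a single box; a label sitting to the left between two of the lines keeps those two separate.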

Compare

Admittedly cherry-picked, but that's because we were already pretty good before, and not much changed in many forms except for font size.
We were already finding checkboxes, but I tweaked some parameters there to find more, because many were too small to be detected.

[Screenshots: Before / After]

BryceStevenWilley · 2022-08-19T21:49:55Z

~~Draft until I patch the bracket checkbox finder (in https://github.com/SuffolkLITLab/FormFyxer/tree/better_brackets, will likely remove all of the special boxdetect code).~~ Ready for review now! I'm currently running it on some PDFs to make sure it isn't worse on any of them, but I'm trying not to get too held back by the results.

Works surprisingly well, but it does take a long time. We have to hackily override some pdfminer classes so that we completely skip any character that isn't a `[` or `]` (splitting everything into individual characters adds about 20 seconds, so we avoid that), and then directly bbox-intersect (with a horizontal dilation) every left bracket with a right bracket. It's not smart enough to know that "[i]" likely isn't a checkbox, but we can / should fix that with more image searching to see if the box is empty (something we could also borrow from boxdetect: everything in `get_checkboxes` that isn't in `get_boxes`).
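
As a rough illustration of the bracket pairing described above, here is a minimal sketch. It assumes the bracket characters have already been extracted as bounding boxes; the names `dilate_horizontally`, `intersects`, and `pair_brackets`, the `Box` layout, and the default dilation amount are hypothetical and not taken from the PR.

```python
from typing import List, Tuple

# Hypothetical box layout: (x0, y0, x1, y1) in page units.
Box = Tuple[float, float, float, float]


def dilate_horizontally(box: Box, amount: float) -> Box:
    """Grow a box to the left and right by `amount`, leaving its height alone."""
    x0, y0, x1, y1 = box
    return (x0 - amount, y0, x1 + amount, y1)


def intersects(a: Box, b: Box) -> bool:
    """True if the two boxes touch or overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def pair_brackets(lefts: List[Box], rights: List[Box],
                  dilation: float = 6.0) -> List[Box]:
    """Pair each '[' box with the first ']' box its dilated bbox reaches."""
    checkboxes: List[Box] = []
    for left in lefts:
        grown = dilate_horizontally(left, dilation)
        for right in rights:
            # The closing bracket must start at or after the opening one ends.
            if right[0] >= left[2] and intersects(grown, right):
                checkboxes.append((left[0], min(left[1], right[1]),
                                   right[2], max(left[3], right[3])))
                break
    return checkboxes
```

Like the approach described in the comment, this sketch cannot tell "[i]" from an empty checkbox; it only looks at bracket geometry.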

Now that we handle `[` in PDF text in a more advanced way, it's easy to simply not look for '_'. This specifically solves an issue where the label box completely overlapped the field box, which caused some labels to be attached to more than one field.

Several places had exact pixel widths hardcoded. However, a DPI of 200 wasn't high enough for some field lines: it would split a line in two, or mess with the kerning on letters, creating connected stretches. Bumped the DPI to 250 and fixed the hardcoded pixel widths.
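
One way to avoid hardcoded pixel widths is to express sizes in PDF points (1/72 inch) and convert them at whatever DPI the page is rendered at. The sketch below is an assumption about how that could look, not necessarily how the PR fixed it; `DPI` and `pt_to_px` are hypothetical names.

```python
# Render resolution; 250 is the value mentioned above (previously 200).
DPI = 250


def pt_to_px(points: float, dpi: int = DPI) -> int:
    """Convert a length in PDF points (1/72 inch) to whole pixels at `dpi`."""
    return round(points * dpi / 72)
```

Thresholds written as `pt_to_px(36)` then scale automatically if the DPI is changed again.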

* Printing random file names to not overwrite things and look at multiple pages at once
* slightly better field labels (more stripped, don't find single character labels, etc.)
* don't duplicate labels! Can still improve by removing them from contention, but for now, just go back to the simple labels
* actually change the font size of the text fields, wasn't actually happening
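
For the font-size item in the list above, a minimal sketch of deriving a font size from a field's height might look like this. The function name `guess_font_size`, the 0.8 fill ratio, and the clamping bounds are assumptions for illustration, not values from the PR.

```python
def guess_font_size(field_height_pt: float,
                    fill_ratio: float = 0.8,
                    min_size: float = 6.0,
                    max_size: float = 12.0) -> float:
    """Pick a font size (in points) that fits a field of the given height."""
    # Leave some vertical padding so text doesn't touch the field border.
    raw = field_height_pt * fill_ratio
    # Clamp so tiny fields stay legible and tall fields don't get huge text.
    return max(min_size, min(max_size, raw))
```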

BryceStevenWilley requested a review from nonprofittechy August 19, 2022 21:49

BryceStevenWilley added 7 commits August 31, 2022 16:13

Tried to simplify checkbox finding

6f7ca2a

Can still get kinda complicated, need to test still and maybe switch how we iterate over each image, but not sure.

DEBUG for prints, find 1-3 spaces between '[ ]'

0ad3a32

The text finder will sometimes add multiple spaces between '[ ]' when getting text, we now handle 1-3 for the initial search.
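
The 1-3 space search can be expressed as a small regular expression. This is a sketch that assumes the extracted text uses plain space characters between the brackets; `BRACKET_BOX` and `find_bracket_boxes` are hypothetical names.

```python
import re
from typing import List, Tuple

# '[' followed by one to three spaces, then ']'.
BRACKET_BOX = re.compile(r"\[ {1,3}\]")


def find_bracket_boxes(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of bracket checkboxes in `text`."""
    return [m.span() for m in BRACKET_BOX.finditer(text)]
```

Strings such as "[i]" or brackets separated by four or more spaces are not matched by this pattern.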

Don't use checkboxes that look like they're filled

9800774

Means that it's likely a false positive, we're only looking for empty ones.
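
A filled-in check along these lines could be sketched as follows. It assumes each detected checkbox is available as a grayscale crop (rows of 0-255 values); the name `looks_filled` and all thresholds are hypothetical, and the actual code in the PR may work differently.

```python
from typing import Sequence


def looks_filled(gray: Sequence[Sequence[int]], border: int = 2,
                 dark_below: int = 128,
                 max_dark_fraction: float = 0.1) -> bool:
    """Guess whether a checkbox crop already contains a mark."""
    # Ignore the outer `border` pixels so the box outline isn't counted.
    interior = [row[border:len(row) - border]
                for row in gray[border:len(gray) - border]]
    pixels = [p for row in interior for p in row]
    if not pixels:
        return False  # crop too small to judge
    dark = sum(1 for p in pixels if p < dark_below)
    # Many dark interior pixels suggest a tick, cross, or fill.
    return dark / len(pixels) > max_dark_fraction
```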

BryceStevenWilley marked this pull request as ready for review August 31, 2022 20:14

nonprofittechy approved these changes Sep 6, 2022

nonprofittechy merged commit 3169b2a into main Sep 6, 2022

BryceStevenWilley deleted the better_pdf_fields branch September 6, 2022 17:57

BryceStevenWilley mentioned this pull request May 2, 2023

Decrease checkbox detection sensitivity #110

Open
