Improve field finding in PDFs: #55

Merged
merged 8 commits into from
Sep 6, 2022


BryceStevenWilley
Contributor

@BryceStevenWilley BryceStevenWilley commented Aug 19, 2022

  • Collapses multiple text lines with nothing in the middle (and nothing to the left, since often multiple fields can have their labels off to the left) into a text area
  • detect the size of a text field, and roughly use it to set the font size so
    the text field is readable on PDFs with smaller font sizes
  • add a checkbox detector that works for brackets "[ ]" by searching in PDF text for instances of [ ] with varying numbers of spaces between the brackets
  • don't add duplicate field labels!
  • fixed some issues with label bounding boxes being really big because they included the field itself, which was causing some text to become the label for multiple fields, even if it wasn't that good of a choice
  • skip over checkboxes that look like they're filled in (when using boxdetect), as they're likely false positives (we're only looking for empty checkboxes)
  • don't add text boxes if:
    • there are things above the line that aren't blank
  • Adds a script to run strip fields and run auto fields over all PDFs in a folder easily
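
The first bullet (collapsing stacked lines into one text area) can be sketched roughly as below. This is a minimal illustration, not the code in this PR: the `Box` type, `collapse_lines`, the 5-unit alignment tolerance, and `max_gap` are all hypothetical names and values, and it assumes a single column of field lines with y increasing downward.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in page coordinates, y increasing downward (assumption)."""
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps(a: Box, b: Box) -> bool:
    """True when the two boxes share some area."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def collapse_lines(lines: List[Box], text: List[Box], max_gap: float = 30.0) -> List[Box]:
    """Merge stacked field lines when nothing sits between them or to their left."""
    merged: List[Box] = []
    for line in sorted(lines, key=lambda b: b.y0):
        if merged:
            prev = merged[-1]
            aligned = abs(prev.x0 - line.x0) < 5 and abs(prev.x1 - line.x1) < 5
            close = line.y0 - prev.y1 <= max_gap
            # Region between the two lines, stretched to the left page edge,
            # so a label sitting off to the left also blocks the merge.
            between = Box(0.0, prev.y1, line.x1, line.y0)
            blocked = any(overlaps(between, t) for t in text)
            if aligned and close and not blocked:
                merged[-1] = Box(prev.x0, prev.y0, prev.x1, line.y1)
                continue
        merged.append(line)
    return merged
```

With three aligned lines and no text, the result is one tall box; a label to the left of the gap between the first two lines keeps the first line separate.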

Compare

Admittedly cherry-picked, but it's because we were pretty good before and not much changed in many forms except for font size.
We were finding checkboxes, but I tweaked some params there to find more, because many were too small.

Before

Screenshot from 2022-08-19 17-45-34

After

Screenshot from 2022-08-19 17-41-41

@BryceStevenWilley
Contributor Author

BryceStevenWilley commented Aug 19, 2022

Draft until I patch the bracket checkbox finder (in https://github.com/SuffolkLITLab/FormFyxer/tree/better_brackets; will likely remove all of the special boxdetect code). Ready for review now! Currently running it on some PDFs to make sure the results aren't worse on any of them, but trying not to get too held back by the results.

Works surprisingly well, but it does take a long time. Have to hackily override
some pdfminer classes so that we completely skip any character that's not a `[` or `]`
(splitting everything into individual characters adds about 20 seconds,
so we avoid that), and then directly bbox intersect (with a horizontal
dilation) every left bracket with a right bracket. It's not smart enough to know
that "[i]" likely isn't a checkbox, but we can / should fix that with more image
searching to see if it's empty (something we could also borrow from boxdetect:
everything in `get_checkboxes` that isn't in `get_boxes`).

Can still get kinda complicated, need to test still and maybe switch how
we iterate over each image, but not sure.
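
The bracket pairing described above (intersecting each left bracket with a right bracket after a horizontal dilation) might look something like this. It is a sketch under assumptions: the function names, the `(x0, y0, x1, y1)` tuple layout, and the default `dilation` are made up for illustration, and choosing the nearest right bracket is my simplification rather than something the commit states.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), hypothetical layout


def dilate_horizontally(box: BBox, amount: float) -> BBox:
    """Grow a box left and right by `amount`, leaving its height alone."""
    x0, y0, x1, y1 = box
    return (x0 - amount, y0, x1 + amount, y1)


def intersects(a: BBox, b: BBox) -> bool:
    """True when the boxes touch or overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def pair_brackets(lefts: List[BBox], rights: List[BBox], dilation: float = 8.0) -> List[BBox]:
    """Return one box per '[' that has a ']' within reach to its right."""
    checkboxes: List[BBox] = []
    for left in lefts:
        grown = dilate_horizontally(left, dilation)
        candidates = [r for r in rights if r[0] >= left[0] and intersects(grown, r)]
        if candidates:
            right = min(candidates, key=lambda r: r[0])  # nearest ']' (assumption)
            checkboxes.append(
                (left[0], min(left[1], right[1]), right[2], max(left[3], right[3]))
            )
    return checkboxes
```

As the commit message notes, a purely geometric check like this cannot tell "[ ]" from "[i]"; that still needs a look at what is between the brackets.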

The text finder will sometimes add multiple spaces between the brackets of
'[ ]' when getting text; we now handle 1-3 spaces in the initial search.
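
A search that tolerates one to three spaces between the brackets could be written with a regular expression like the one below. The pattern and the helper name are illustrative assumptions, not the code from this PR.

```python
import re
from typing import List, Tuple

# An opening bracket, one to three literal spaces, then a closing bracket.
BRACKET_CHECKBOX = re.compile(r"\[ {1,3}\]")


def find_bracket_checkboxes(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of every empty bracket pair."""
    return [match.span() for match in BRACKET_CHECKBOX.finditer(text)]


print(find_bracket_checkboxes("[ ] Yes  [   ] No  [i] see note  []"))
# → [(0, 3), (9, 14)]
```

Neither "[i]" nor "[]" matches, since the pattern requires only spaces, and at least one of them, between the brackets.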

Now that we've done more advanced handling of `[` in PDF text, it's easy
to just not look for '_'. This specifically solves an issue where the label
box completely overlaps the field box, which would cause some labels to be
added as labels for more than one field.

A checkbox that looks filled in is likely a false positive; we're only
looking for empty ones.
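
One simple way to decide that a detected box "looks filled in" is to measure how much of its interior is dark. The sketch below is not how boxdetect does it; the function name, the margin, and both thresholds are assumptions chosen for illustration, and the input is a plain grayscale crop given as rows of 0-255 integers.

```python
from typing import List


def looks_filled(
    pixels: List[List[int]],
    margin: int = 2,
    dark_threshold: int = 128,
    max_dark_fraction: float = 0.1,
) -> bool:
    """Return True when too much of the box interior is dark."""
    # Drop `margin` pixels on every side so the box border is not counted.
    interior = [
        row[margin:len(row) - margin]
        for row in pixels[margin:len(pixels) - margin]
    ]
    total = sum(len(row) for row in interior)
    if total == 0:
        return False  # box too small to judge; treat as empty
    dark = sum(1 for row in interior for value in row if value < dark_threshold)
    return dark / total > max_dark_fraction
```

Candidates for which this returns True would be skipped, on the reasoning above that only empty checkboxes should become fields.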

Had hardcoded exact pixel widths in several locations. However, the
DPI of 200 wasn't good enough for some field lines: it would split a
line in two, or mess with the kerning on letters, creating connected
stretches. Bumped the DPI to 250 and fixed the hard-coded pixel widths.
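
One way to avoid hard-coded pixel widths is to express sizes in PDF points and convert using the rendering DPI. The helpers below are a hypothetical sketch of that idea, not the change made in this commit; the only fixed fact is that default PDF user space has 72 units per inch.

```python
POINTS_PER_INCH = 72  # default PDF user-space units per inch


def points_to_pixels(points: float, dpi: int) -> int:
    """Convert a length in PDF points to whole pixels at the given DPI."""
    return round(points * dpi / POINTS_PER_INCH)


def pixels_to_points(pixels: float, dpi: int) -> float:
    """Convert a pixel length measured on a rendered page back to PDF points."""
    return pixels * POINTS_PER_INCH / dpi
```

A minimum line length defined as 36 points then becomes 100 pixels at 200 DPI and 125 pixels at 250 DPI, so changing the DPI no longer means touching every threshold.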

* Print to randomly named files so outputs aren't overwritten and
  multiple pages can be looked at at once
* slightly better field labels (more stripped, don't find single
  character labels, etc.)
* don't duplicate labels! Can still improve by removing them from
  contention, but for now, just go back to the simple labels
* actually change the font size of the text fields; this wasn't
  happening before
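
Setting the font size "roughly" from the size of a text field could be as simple as taking a fraction of the field height and clamping it. The function name, the 0.7 ratio, and the 6-12 point limits below are assumptions for illustration; the PR does not state the values it uses.

```python
def font_size_for_field(
    field_height: float,
    fill_ratio: float = 0.7,
    min_size: float = 6.0,
    max_size: float = 12.0,
) -> float:
    """Pick a font size (in points) that fits inside a field of the given height."""
    size = field_height * fill_ratio
    # Clamp so tiny fields stay legible and tall fields don't get oversized text.
    return max(min_size, min(max_size, size))
```

A 10-point-tall field would get roughly 7-point text, while a tall multi-line text area is capped at 12 points.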
@nonprofittechy nonprofittechy merged commit 3169b2a into main Sep 6, 2022
@BryceStevenWilley BryceStevenWilley deleted the better_pdf_fields branch September 6, 2022 17:57