Form field recognizer should break up long lines if there is text in the middle of the line #30
Comments
This might be a regression in #55: we'll look above the horizontal line, see that something is in the middle of it, and stop. IMO this isn't a good case for auto field detection. The lines are primarily presentational, not semantic. And for a form like:
we wouldn't be able to place any fields. We would have to really hand-tune things to work for just this PDF. Marking wontfix, but open to debate.
Fair enough--I'm willing to revisit if we run into a lot of similar PDFs in the wild, but this sample wasn't originally intended to be used on a computer.
Just noting a regression. The attached PDF now has no fields recognized at all. I'm not sure that's desirable. I think we'll see other forms in the wild that have a prompt, a colon, and a large empty space that should be turned into a field extending until it runs into the next word or the end of the page. It's currently easier to delete a field than to manually add one using any of our tools.
For more context, I thought I would check Washington State, which has a lot of forms without form fields: https://www.courts.wa.gov/forms/?fa=forms.contribute&formID=6
Going to leave this open to encourage us to track any real forms where a blank space should be recognized as a form field.
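As a rough illustration of the "prompt, colon, large empty space" rule described above, here is a minimal Python sketch. It assumes the words on one text line have already been extracted with their x-coordinates. The `Word` type, the `fields_after_colons` function, and the `min_width` threshold are hypothetical names made up for this example, not part of the project's recognizer.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    text: str
    x0: float  # left edge of the word
    x1: float  # right edge of the word


def fields_after_colons(
    words: List[Word], page_right: float, min_width: float = 50.0
) -> List[Tuple[float, float]]:
    """Return (x0, x1) spans of blank space that follow a word ending in ':'.

    Each span runs from the end of the prompt to the next word on the line,
    or to the right edge of the page if no word follows.
    """
    ordered = sorted(words, key=lambda w: w.x0)
    spans = []
    for i, word in enumerate(ordered):
        if not word.text.endswith(":"):
            continue
        start = word.x1
        end = ordered[i + 1].x0 if i + 1 < len(ordered) else page_right
        # Skip gaps too narrow to be a plausible field (hypothetical threshold).
        if end - start >= min_width:
            spans.append((start, end))
    return spans
```

The `min_width` cutoff is the knob that trades over-estimating against missing fields, so it would need tuning against real forms.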
That was noted in #30 (comment), before this issue was closed.
Looking at those, they all have normal PDF lines; this issue was specifically about lines that extend underneath labels for other fields and for colon-blank space extensions, which I don't see either of in the forms at that link.
Have we actually found any in the wild like that, though? I'm realizing that we haven't seen any forms like this, likely because it would be unusable as a printed document: you can't tell where information is supposed to be written. I'm still of the opinion that this will be pretty hard to get working reliably, and that the effort would be better spent on making it easier to manually add fields in our tools than on overestimating the number of fields by a long shot.
Also note that the idea that we should overestimate because it's easier to delete is in tension with #110, where we're trying not to overestimate.
It's probably a good idea at this point to get 2 or 3 forms from each jurisdiction and evaluate how our tools do with each way of marking fields, so we have a more comprehensive view. I made #117 for that.
I think in our current workflow, recognizing more fields is better. So maybe a few heuristic tweaks can be added from this test file?
Fax_cover_sheet_no_fields.pdf
turned into this:
fields_file.pdf
The long lines tricked the field recognizer into adding one long field that spanned the whole page. The `:` should be a clue to start a new field.
Similarly, the `Comments:` text followed by a big empty space should be a signal to add a form field.
BTW, Adobe Acrobat doesn't recognize any fields in this PDF, so our rules might already be smarter.
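To make the "break up the long line where text sits in the middle of it" idea concrete, here is a minimal Python sketch. It works only on horizontal extents and assumes the caller has already found the labels that sit on or just above the line. `split_line_at_labels` and `min_width` are hypothetical names for illustration and do not reflect the recognizer's actual code.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (x0, x1) horizontal extent


def split_line_at_labels(
    line: Span, labels: List[Span], min_width: float = 20.0
) -> List[Span]:
    """Cut the horizontal extent `line` wherever a label span overlaps it.

    Returns the remaining pieces that are at least `min_width` wide; each
    piece is a candidate field instead of one field spanning the whole page.
    """
    line_x0, line_x1 = line
    pieces = []
    cursor = line_x0
    for lab_x0, lab_x1 in sorted(labels):
        if lab_x1 <= cursor or lab_x0 >= line_x1:
            continue  # label lies outside the remaining part of the line
        if lab_x0 - cursor >= min_width:
            pieces.append((cursor, lab_x0))
        cursor = max(cursor, lab_x1)
    if line_x1 - cursor >= min_width:
        pieces.append((cursor, line_x1))
    return pieces
```

For example, a line from 0 to 600 with labels at (0, 40) and (300, 350) would yield two candidate fields, (40, 300) and (350, 600), rather than one page-wide field.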