Use a heuristic to keep out PDFs that don't seem to have fields #61

Open
nonprofittechy opened this issue May 2, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@nonprofittechy
Member

nonprofittechy commented May 2, 2023

  1. If it has no fields [after processing by formfyxer], it's not a form
  2. If it doesn't have at least one page with more than 2 fields detected, it's not a form
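
A minimal sketch of how these two rules might be checked, assuming field detection has already run and produced a count of detected fields for each page. The function name `looks_like_form`, the `fields_per_page` input, and the `max_false_positives_per_page` parameter are hypothetical names for illustration and are not part of formfyxer.

```python
from typing import Sequence


def looks_like_form(
    fields_per_page: Sequence[int],
    max_false_positives_per_page: int = 2,  # assumed threshold, taken from rule 2
) -> bool:
    """Return True if the per-page field counts pass both heuristics.

    fields_per_page holds the number of fields detected on each page,
    counted after field detection has been run on the PDF.
    """
    # Rule 1: a document with no fields at all is not a form.
    if sum(fields_per_page) == 0:
        return False
    # Rule 2: at least one page must have more than 2 detected fields.
    return any(count > max_false_positives_per_page for count in fields_per_page)


if __name__ == "__main__":
    print(looks_like_form([0, 0]))     # no fields anywhere
    print(looks_like_form([2, 1, 2]))  # fields exist, but never more than 2 on a page
    print(looks_like_form([0, 5, 1]))  # one page has more than 2 fields
```

Rule 1 is technically implied by rule 2, but keeping it as a separate check makes it easier to report why a given PDF was excluded.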
@nonprofittechy nonprofittechy added the bug Something isn't working label May 2, 2023
@BryceStevenWilley
Contributor

BryceStevenWilley commented May 2, 2023

Counterpoint: I'm not sure we can trust jurisdictions to actually use AcroForm PDFs with fields. I'm pretty sure I've seen a few jurisdictions publish forms without PDF fields.

If you are thinking post field detection, this could work, but it would be a bit less reliable, since there are usually a fair number of false positives.

(EDIT: apologies, didn't see the original conversation until now)

@nonprofittechy
Member Author

Yes, this would be post field detection. Take a look at https://suffolklitlab.org/form-explorer/list/PA/ for inspiration for this issue. There are definitely lots of false positives, but not usually more than 2 per page.

@nonprofittechy
Member Author

nonprofittechy commented May 2, 2023

> (EDIT: apologies, didn't see the original conversation until now)

No worries, good reminder to add full context to the GitHub issue. No guarantee we'd remember it in a week or two when we can come back to this task.
