Integrate new non-spaCy PDF parsing into main Harmony #39

Closed
woodthom2 opened this issue May 31, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@woodthom2
Contributor

Description

We have a draft improvement to the PDF parsing logic. This will enable us to eliminate spaCy as a dependency.

The training code is here:
https://github.com/harmonydata/pdf-text-models-amol

The API modification is here:
https://github.com/harmonydata/harmonyapi (branch nospacy)

The modification to the main Python library is on this branch:

git clone -b updated_files_for_forntend https://github.com/Notysoty/harmony.git

Please quality-check this branch, then merge it into main in all repositories and remove spacy from every requirements.txt and TOML file.
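To help check that spacy is gone from the dependency files after the merge, something like the following could be run in each repository checkout. This is only a sketch: `find_spacy_references` is a hypothetical helper, not part of Harmony, and the file patterns and the regex are assumptions about how the dependency is declared. It only catches lines that begin with the package name, so related packages such as language models would need a separate check.

```python
import re
from pathlib import Path

# Assumed file name patterns for dependency declarations (not taken from the repos).
DEPENDENCY_FILE_PATTERNS = ("requirements*.txt", "*.toml")

# Matches lines that start with the package name, optionally quoted,
# e.g. `spacy==3.7.2`, `"spacy>=3.0",` or `spacy = "^3.0"`.
SPACY_LINE = re.compile(r"^\s*[\"']?spacy\b", re.IGNORECASE)


def find_spacy_references(repo_root):
    """Return (relative_path, line_number, line) for each dependency line naming spacy."""
    root = Path(repo_root)
    hits = []
    for pattern in DEPENDENCY_FILE_PATTERNS:
        for path in sorted(root.rglob(pattern)):
            lines = path.read_text(encoding="utf-8").splitlines()
            for number, line in enumerate(lines, start=1):
                if SPACY_LINE.search(line):
                    hits.append((str(path.relative_to(root)), number, line.strip()))
    return hits


if __name__ == "__main__":
    # Scan the current directory and report anything still to be removed.
    for rel_path, number, line in find_spacy_references("."):
        print(f"{rel_path}:{number}: {line}")
```

An empty result means no matching lines were found in those files. It does not prove the code no longer imports spacy, so the test suite still needs to run without spacy installed.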

Rationale

PDF extraction needs improvement.

@woodthom2
Contributor Author

Switched to sklearn-crfsuite.
