Upgrade to Docling 2.9.0 or newer for HybridChunker #505

Open
bbrowning opened this issue Jan 27, 2025 · 4 comments

@bbrowning
Contributor

We need to update our Docling dependency to 2.9.0 or newer as a prerequisite for using the HybridChunker. For the exact version, we should prefer the newest release we can pick up at the time that still aligns with what the instructlab/instructlab repo is using for Docling.

bbrowning added this to the 0.8.0 milestone Jan 27, 2025
@bbrowning
Contributor Author

Right now instructlab/instructlab is using docling-core[chunking]>=2.10.0, so we should ensure we align on versions with that repository. If we need to bump the minimum version (for example, to get a fix for #514), then we should coordinate that with the core repo.
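For anyone who wants to sanity-check an environment against that floor, here is a minimal sketch using only the standard library. The helper names (`parse_release`, `meets_minimum`, `check_installed`) are made up for illustration and are not part of docling or this repo. It only understands plain dotted release numbers and ignores pre-release suffixes, so a real check would probably be better served by the `packaging` library.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def parse_release(ver: str) -> Tuple[int, ...]:
    """Turn a plain release string such as '2.10.0' into (2, 10, 0).

    Hypothetical helper: anything after the leading digits of a
    component (e.g. 'rc1' in '0rc1') is ignored.
    """
    parts = []
    for piece in ver.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare numerically, so '2.10.0' counts as newer than '2.9.0'."""
    a, b = parse_release(installed), parse_release(minimum)
    # Pad the shorter tuple with zeros so '2.9' equals '2.9.0'.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_installed(package: str, minimum: str) -> Optional[bool]:
    """Return None if the package is not installed in this environment."""
    try:
        return meets_minimum(version(package), minimum)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # The floor here is the one quoted above for instructlab/instructlab.
    print(check_installed("docling-core", "2.10.0"))
```

The numeric comparison matters because a plain string comparison would rank "2.9.0" above "2.10.0".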

@bbrowning
Contributor Author

Note that docling 2.12.0 (which adds GPU acceleration) pulls in different PDF models than previous versions used. This may have implications for downstream users if we bump to that version or newer, so it is something we'll have to coordinate if we need to move to 2.12.0 or newer.

@aakankshaduggal
Member

The docling team has addressed two of our issues: DS4SD/docling#822 with https://github.com/DS4SD/docling/releases/tag/v2.17.0 and DS4SD/docling#734 with https://github.com/DS4SD/docling/releases/tag/v2.18.0.
We should aim to use v2.18.0 or above, because the issues with HTML in Markdown were a pain point for some users.

@bbrowning
Contributor Author

That seems reasonable, with the caveat that we should let our downstream users know, as some of them will need to pull in those newer docling models in their distributions after we update.
