Async Support #63
Comments
Thanks for your interest! By async support, I assume you mean something like an async iterator version of The library is definitely not I/O bound; in fact, it goes out of its way to not do any explicit I/O operations at all - the input file is read with For this reason I'm not really certain what the usefulness of async support would be outside the case of I am interested to try Kreuzberg... will take a look at it soon!
Great 👍 As to async - file reading itself is a blocking operation. Reading a large file in an async program can degrade the entire program. The simplest solution is to expose an API that accepts a byte string or an IO object (BytesIO) instead of a path to a PDF file. This allows the caller to read the file asynchronously without you needing to worry about such details. Regarding worker threads - these don't play nice with Python async, for now at least. I'd imagine that somewhere around Python 3.15 this will change, but until then async stuff should be single-threaded only. The alternative to this is to use something like I should emphasize - async is going to be slower for single-file PDF analysis; the only advantages are effective concurrency across multiple files and non-blocking processing. But this is also important.
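A minimal sketch of the bytes-or-IO-object idea described above. `count_pdf_objects` is a hypothetical stand-in for a parser entry point, not PLAYA's actual API, and offloading the read with `asyncio.to_thread` (Python 3.9+) is just one assumed way for the caller to keep the event loop free; an async file library would serve the same purpose.

```python
import asyncio
import io
from pathlib import Path
from typing import BinaryIO, Union


def count_pdf_objects(source: Union[bytes, BinaryIO]) -> int:
    """Hypothetical parser entry point: takes bytes or a binary file object
    and performs no file I/O of its own."""
    data = source if isinstance(source, bytes) else source.read()
    # Placeholder "analysis": count occurrences of the "endobj" keyword.
    return data.count(b"endobj")


async def analyze(path: Path) -> int:
    # The caller owns the I/O: read the file without blocking the event loop.
    data = await asyncio.to_thread(path.read_bytes)
    # Hand the in-memory data to the synchronous, CPU-bound parser.
    return count_pdf_objects(io.BytesIO(data))


async def analyze_many(paths: list[Path]) -> list[int]:
    # File reads can overlap; parsing still runs one file at a time.
    return await asyncio.gather(*(analyze(p) for p in paths))
```

With this split the library never touches the filesystem, so the caller can choose whatever read strategy fits their program. As noted above, this does not make a single file faster; it only keeps the read from stalling other tasks.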
Oh, I understand - yes, as you say, trying to split up a CPU-bound workload with In fact, the only situation in which PLAYA would call I made a conscious choice not to support But, it would be nice if
Can you try #65? I think it should address this issue!
Thanks, sure, I can begin integrating PLAYA. I'll make a basic PR on my end that extracts basic metadata with PLAYA. Would you like to help me?
Sure, it would help me as well to make sure the documentation/API makes sense (it needs more documentation for sure).
(I've just merged #65 in any case)
OK, this took me a bit longer than anticipated, but I got something in place: Goldziher/kreuzberg#17. I'd be happy for your input and suggestions on improving the implementation.
Hi there!
I'm the author of a text extraction library called Kreuzberg (see my profile for link). I'm interested in doing layout extraction to extract metadata, and I found this library.
First off, I loved the readme! Was amused and intrigued. I see the API and library are still in flux, so I wouldn't go building against it just yet - but I'm keen on trying when the API stabilizes some more.
Anyhow, my question is regarding async support in this library and whether you had plans to implement anything there. Since most of what the library does, I imagine, is not I/O bound, this shouldn't be complicated or extensive.
I'd be happy to submit a PR if you're interested.