Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Async Support #63

Closed
Goldziher opened this issue Feb 20, 2025 · 8 comments · Fixed by #65
Closed

Async Support #63

Goldziher opened this issue Feb 20, 2025 · 8 comments · Fixed by #65

Comments

@Goldziher
Copy link

Hi there!

I'm the author of a text extraction library called Kreuzberg (see my profile for link). I'm interested in doing layout extraction to extract metadata, and I found this library.

First off, I loved the readme! Was amused and intrigued. I see the API and library are still on flux, so I wouldn't go building against it just yet - but Im keen on trying when the API stabilizes some more.

Anyhow, my question is regarding async support in this library and whether you had plans to implement anything in there. Since most of what the library does, I imagine, is not I/O bound this shouldn't be complicated or extensive.

I'd be happy to submit a PR if you're interested

@dhdaines
Copy link
Owner

Thanks for your interest! By async support, I assume you mean something like an async iterator version of PageList.map?

The library is definitely not I/O bound, in fact it goes out of its way to not do any explicit I/O operations at all - the input file is read with mmap or simply read into memory whole.

For this reason I'm not really certain what the usefulness of async support would be outside the case of await on results from worker processes.

I am interested to try Kreuzberg... will take a look at it soon!

@Goldziher
Copy link
Author

Goldziher commented Feb 21, 2025

Thanks for your interest! By async support, I assume you mean something like an async iterator version of PageList.map?

The library is definitely not I/O bound, in fact it goes out of its way to not do any explicit I/O operations at all - the input file is read with mmap or simply read into memory whole.

For this reason I'm not really certain what the usefulness of async support would be outside the case of await on results from worker processes.

I am interested to try Kreuzberg... will take a look at it soon!

Great 👍

As to async - so file reading itself is a blocking operation. Reading a large file in an async program can degrade the entire program.

The simplest solution is to expose an API that allows passing a byte string or IO object (ByteIO) instead of a PDF file.

This will allow the caller to async read the file without you needing to worry about such details.

Regarding worker threads - these don't play nice with python async, for now at least. I'd imagine that somewhere around python 3.15 this will change, but until then async stuff should be single threaded only.

The alternative to this is to use something like anyio.to_procsss.run_sync (see the library anyio for this). But unless you can actually split the workload into separate chunks, this wouldn't work for PDF analysis.

I should emphasize - async is going to be slower in single file PDF analysis, the only advantage is effective concurrency of multiple files, and nonblocking processing. But this is also important.

@dhdaines
Copy link
Owner

Oh, I understand - yes, as you say, trying to split up a CPU-bound workload with async just makes it slower, though it has its uses - I have actually implemented this in Javascript for speech recognition: https://github.com/ReadAlongs/SoundSwallower

In fact, the only situation in which PLAYA would call read on its input file is if it's on a platform where mmap is not supported, or if you pass it a BytesIO instead of an actual file :-)

I made a conscious choice not to support BytesIO as read operations on it are really quite slow - see pdfminer/pdfminer.six#1041

But, it would be nice if playa.open and Document__init__ supported just passing a plain old bytes buffer (which could of course be the underlying buffer of a BytesIO) as input. For the latter this is simple, and I will just make the change right away! For playa.open, because it needs to support multiprocessing in spawn mode, a buffer will get copied to worker processes, which probably isn't all that slow, but...

@dhdaines
Copy link
Owner

Can you try #65 ? I think it should address this issue!

@Goldziher
Copy link
Author

thanks, sure i can begin integrating playa. I'll make a basic PR on my end that extracts basic metadata with Playa. Would you like to help me?

@dhdaines
Copy link
Owner

thanks, sure i can begin integrating playa. I'll make a basic PR on my end that extracts basic metadata with Playa. Would you like to help me?

Sure, it would help me as well to make sure the documentation/API makes sense (it needs more documentation for sure)

@dhdaines
Copy link
Owner

(I've just merged #65 in any case)

@Goldziher
Copy link
Author

ok this took me a bit longer than anticipated, but i got something in place: Goldziher/kreuzberg#17

I'd be happy for your input and suggestions on improving the implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants