Async Support #63

Goldziher · 2025-02-20T16:48:01Z

Hi there!

I'm the author of a text extraction library called Kreuzberg (see my profile for link). I'm interested in doing layout extraction to extract metadata, and I found this library.

First off, I loved the readme! Was amused and intrigued. I see the API and library are still on flux, so I wouldn't go building against it just yet - but Im keen on trying when the API stabilizes some more.

Anyhow, my question is regarding async support in this library and whether you had plans to implement anything in there. Since most of what the library does, I imagine, is not I/O bound this shouldn't be complicated or extensive.

I'd be happy to submit a PR if you're interested

dhdaines · 2025-02-21T03:39:07Z

Thanks for your interest! By async support, I assume you mean something like an async iterator version of PageList.map?

The library is definitely not I/O bound, in fact it goes out of its way to not do any explicit I/O operations at all - the input file is read with mmap or simply read into memory whole.

For this reason I'm not really certain what the usefulness of async support would be outside the case of await on results from worker processes.

I am interested to try Kreuzberg... will take a look at it soon!

Goldziher · 2025-02-21T05:57:42Z

Thanks for your interest! By async support, I assume you mean something like an async iterator version of PageList.map?

The library is definitely not I/O bound, in fact it goes out of its way to not do any explicit I/O operations at all - the input file is read with mmap or simply read into memory whole.

For this reason I'm not really certain what the usefulness of async support would be outside the case of await on results from worker processes.

I am interested to try Kreuzberg... will take a look at it soon!

Great 👍

As to async - so file reading itself is a blocking operation. Reading a large file in an async program can degrade the entire program.

The simplest solution is to expose an API that allows passing a byte string or IO object (ByteIO) instead of a PDF file.

This will allow the caller to async read the file without you needing to worry about such details.

Regarding worker threads - these don't play nice with python async, for now at least. I'd imagine that somewhere around python 3.15 this will change, but until then async stuff should be single threaded only.

The alternative to this is to use something like anyio.to_procsss.run_sync (see the library anyio for this). But unless you can actually split the workload into separate chunks, this wouldn't work for PDF analysis.

I should emphasize - async is going to be slower in single file PDF analysis, the only advantage is effective concurrency of multiple files, and nonblocking processing. But this is also important.

dhdaines · 2025-02-21T12:30:51Z

Oh, I understand - yes, as you say, trying to split up a CPU-bound workload with async just makes it slower, though it has its uses - I have actually implemented this in Javascript for speech recognition: https://github.com/ReadAlongs/SoundSwallower

In fact, the only situation in which PLAYA would call read on its input file is if it's on a platform where mmap is not supported, or if you pass it a BytesIO instead of an actual file :-)

I made a conscious choice not to support BytesIO as read operations on it are really quite slow - see pdfminer/pdfminer.six#1041

But, it would be nice if playa.open and Document__init__ supported just passing a plain old bytes buffer (which could of course be the underlying buffer of a BytesIO) as input. For the latter this is simple, and I will just make the change right away! For playa.open, because it needs to support multiprocessing in spawn mode, a buffer will get copied to worker processes, which probably isn't all that slow, but...

dhdaines · 2025-02-21T13:20:01Z

Can you try #65 ? I think it should address this issue!

Goldziher · 2025-02-21T14:20:04Z

thanks, sure i can begin integrating playa. I'll make a basic PR on my end that extracts basic metadata with Playa. Would you like to help me?

dhdaines · 2025-02-21T15:16:38Z

thanks, sure i can begin integrating playa. I'll make a basic PR on my end that extracts basic metadata with Playa. Would you like to help me?

Sure, it would help me as well to make sure the documentation/API makes sense (it needs more documentation for sure)

dhdaines · 2025-02-21T15:17:20Z

(I've just merged #65 in any case)

Goldziher · 2025-02-21T18:25:40Z

ok this took me a bit longer than anticipated, but i got something in place: Goldziher/kreuzberg#17

I'd be happy for your input and suggestions on improving the implementation.

dhdaines mentioned this issue Feb 21, 2025

feat: accept bytes as input (for async applications) #65

Merged

dhdaines closed this as completed in #65 Feb 21, 2025

Goldziher mentioned this issue Feb 21, 2025

feat: add PDF metadata extraction Goldziher/kreuzberg#17

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Async Support #63

Async Support #63

Goldziher commented Feb 20, 2025

dhdaines commented Feb 21, 2025

Goldziher commented Feb 21, 2025 •

edited

Loading

dhdaines commented Feb 21, 2025

dhdaines commented Feb 21, 2025

Goldziher commented Feb 21, 2025

dhdaines commented Feb 21, 2025

dhdaines commented Feb 21, 2025

Goldziher commented Feb 21, 2025

Async Support #63

Async Support #63

Comments

Goldziher commented Feb 20, 2025

dhdaines commented Feb 21, 2025

Goldziher commented Feb 21, 2025 • edited Loading

dhdaines commented Feb 21, 2025

dhdaines commented Feb 21, 2025

Goldziher commented Feb 21, 2025

dhdaines commented Feb 21, 2025

dhdaines commented Feb 21, 2025

Goldziher commented Feb 21, 2025

Goldziher commented Feb 21, 2025 •

edited

Loading