feat: add PDF metadata extraction #17

Goldziher · 2025-02-21T18:24:52Z

This PR integrates PDF metadata extraction using playa. Relates to: dhdaines/playa#63

dhdaines · 2025-02-21T18:40:30Z

Oh, interesting! I may not have time to look at this in detail today.

One thing to consider is that the various PLAYA objects don't produce anything particularly useful with dataclasses.asdict or NamedTuple._asdict at the moment (since there are internal/incomprehensible fields there), but my intent is to make them do so, since the playa CLI should be able to dump them as JSON. So, some of this functionality should just go in PLAYA itself.

I would just use Pydantic but I don't want any dependencies :)
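
To illustrate the point above about dataclasses.asdict, here is a minimal sketch. The `Path` class, its fields, and the `asdict()` method are hypothetical stand-ins invented for this example; they are not PLAYA's actual classes or API. It only shows why a naive dump exposes internal fields and how a filtered method could avoid that without any dependencies.

```python
import dataclasses
import json
from dataclasses import dataclass, field


@dataclass
class Path:
    """Hypothetical stand-in for a parsed object; not a real PLAYA class."""

    linewidth: float
    stroke: bool
    # Internal bookkeeping that a JSON consumer should never see.
    _raw_stream: bytes = field(default=b"", repr=False)

    def asdict(self) -> dict:
        """Return only the public fields, skipping underscore-prefixed ones."""
        return {
            f.name: getattr(self, f.name)
            for f in dataclasses.fields(self)
            if not f.name.startswith("_")
        }


path = Path(linewidth=1.5, stroke=True, _raw_stream=b"\x00\x01")

# dataclasses.asdict includes the internal field, and bytes values
# cannot be passed to json.dumps without a custom encoder.
naive = dataclasses.asdict(path)

# The filtered version contains only JSON-friendly public fields.
clean_json = json.dumps(path.asdict())
```

The underscore-prefix rule is just one possible convention for marking internal fields; an explicit allow-list of field names would work equally well.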

Goldziher · 2025-02-21T18:43:34Z

> Oh, interesting! I may not have time to look at this in detail today.

> One thing to consider is that the various PLAYA objects don't produce anything particularly useful with dataclasses.asdict or NamedTuple._asdict at the moment (since there are internal/incomprehensible fields there), but my intent is to make them do so, since the playa CLI should be able to dump them as JSON. So, some of this functionality should just go in PLAYA itself.

> I would just use Pydantic but I don't want any dependencies :)

I'd be happy to help.

Also, I don't think Pydantic is required.

dhdaines · 2025-02-21T18:58:08Z

> I'd be happy to help.

> Also, I don't think Pydantic is required.

Indeed - particularly since no validation is implied.

The idea would be that instead of calling PathMetadata( ... ) on your side you could simply do path.asdict(). It would probably be good to supply TypedDict annotations for the output of that anyway though.
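
A rough sketch of what TypedDict annotations on an `asdict()` method might look like. The `PathDict` schema, the `Path` class, and its fields are all assumptions made up for illustration; the real schemas may well differ.

```python
from dataclasses import dataclass
from typing import List, Tuple, TypedDict


class PathDict(TypedDict):
    """Hypothetical schema describing the output of Path.asdict()."""

    linewidth: float
    points: List[Tuple[float, float]]


@dataclass
class Path:
    """Hypothetical object; not the real PLAYA class."""

    linewidth: float
    points: List[Tuple[float, float]]

    def asdict(self) -> PathDict:
        # A TypedDict is a plain dict at runtime, so no validation happens
        # here; the annotation only helps type checkers and editors.
        return PathDict(linewidth=self.linewidth, points=list(self.points))


path = Path(linewidth=2.0, points=[(0.0, 0.0), (10.0, 5.0)])
metadata = path.asdict()
```

With this shape, a caller gets key completion and type checking on `metadata` without building a separate wrapper class on their side.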

Goldziher · 2025-02-21T19:36:27Z

> > I'd be happy to help.

> > Also, I don't think Pydantic is required.

> Indeed - particularly since no validation is implied.

> The idea would be that instead of calling PathMetadata( ... ) on your side you could simply do path.asdict(). It would probably be good to supply TypedDict annotations for the output of that anyway though.

This would be awesome 😎.

Definitely save me a lot of work.

When do you see this on your roadmap?

And should I continue implementation on my end, or hold off for now?

Also, consider using NotRequired fields from typing-extensions (it's a standard backport dependency).

I'd drop Python 3.8 since it's deprecated and has issues with some typing.

dhdaines · 2025-02-21T19:54:20Z

> > It would probably be good to supply TypedDict annotations for the output of that anyway though.

> This would be awesome 😎.

> Definitely save me a lot of work.

> When do you see this on your roadmap?

Probably one of the next things that I'll do, for the next release (next week).

> And should I continue implementation on my end, or hold off for now?

You could - though I think the schemas for the metadata might end up being a bit different than what you've got now.

> Also, consider using NotRequired fields from typing-extensions (it's a standard backport dependency).

Oh, this is handy ... much better than just adding total=False.
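
For comparison, a small sketch of the difference between total=False and NotRequired. The `LooseMetadata` and `DocumentMetadata` classes and their keys are hypothetical examples, not schemas from this PR. The import falls back to typing_extensions on interpreters older than Python 3.11, where NotRequired is not in the standard library.

```python
try:  # Python 3.11+ ships NotRequired in the standard library
    from typing import NotRequired, TypedDict
except ImportError:  # older interpreters need the typing-extensions backport
    from typing_extensions import NotRequired, TypedDict


class LooseMetadata(TypedDict, total=False):
    # total=False makes every key optional, including keys
    # that should really always be present.
    page_count: int
    title: str


class DocumentMetadata(TypedDict):
    # Only the keys explicitly marked NotRequired may be omitted.
    page_count: int
    title: NotRequired[str]
    author: NotRequired[str]


# Both are valid DocumentMetadata values for a type checker.
minimal: DocumentMetadata = {"page_count": 3}
full: DocumentMetadata = {"page_count": 3, "title": "Report", "author": "A. Writer"}
```

NotRequired keeps required and optional keys in one class, whereas total=False would need a second base class to express the same split.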

> I'd drop Python 3.8 since it's deprecated and has issues with some typing.

Even though it's end of life I think it's important to keep supporting it, as it's what came with Ubuntu 20.04 and not everybody has upgraded that yet...

Goldziher · 2025-03-01T12:23:57Z

@dhdaines any updates on your end?

dhdaines · 2025-03-01T12:51:32Z

> @dhdaines any updates on your end?

Thanks for the reminder! I'll have some time to look at it today.

Goldziher · 2025-03-01T17:19:03Z

> > @dhdaines any updates on your end?

> Thanks for the reminder! I'll have some time to look at it today.

Thanks, please update me when you release the new API. If you can document it, that would be awesome.

dhdaines · 2025-03-02T17:19:38Z

Working on it here: dhdaines/playa#68

And a new version of this PR to go with it (not complete): #25

Goldziher added 2 commits February 21, 2025 17:33

feat: initial playa integration

7cb6ac9

feat: add playa metadata extraction

bd4d2ef

Goldziher mentioned this pull request Feb 21, 2025

Async Support dhdaines/playa#63

Closed

chore: made functions private

ca1c266

chore: fix AI shenanigens

0d70200
