Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add PDF metadata extraction #17

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open

feat: add PDF metadata extraction #17

wants to merge 4 commits into from

Conversation

Goldziher
Copy link
Owner

This PR integrates PDF metadata extraction using playa, relates to: dhdaines/playa#63

@dhdaines
Copy link

Oh, interesting! I may not have time to look at this in detail today.

One thing to consider is that the various PLAYA objects don't produce anything particularky useful with dataclasses.asdict or NamedTuple._asdict at the moment (since there are internal/incomprehensible fields there), but my intent is to make them do so, since the playa CLI should be able to dump them as JSON. So, some of this functionality should just go in the PLAYA itself.

I would just use Pydantic but I don't want any dependencies :)

@Goldziher
Copy link
Owner Author

Oh, interesting! I may not have time to look at this in detail today.

One thing to consider is that the various PLAYA objects don't produce anything particularky useful with dataclasses.asdict or NamedTuple._asdict at the moment (since there are internal/incomprehensible fields there), but my intent is to make them do so, since the playa CLI should be able to dump them as JSON. So, some of this functionality should just go in the PLAYA itself.

I would just use Pydantic but I don't want any dependencies :)

I'd be happy to help.

Also I dont think pydantic is required.

@dhdaines
Copy link

I'd be happy to help.

Also I dont think pydantic is required.

Indeed - particularly since no validation is implied.

The idea would be that instead of calling PathMetadata( ... ) on your side you could simply do path.asdict(). It would probably be good to supply TypedDict annotations for the output of that anyway though.

@Goldziher
Copy link
Owner Author

Goldziher commented Feb 21, 2025

I'd be happy to help.

Also I dont think pydantic is required.

Indeed - particularly since no validation is implied.

The idea would be that instead of calling PathMetadata( ... ) on your side you could simply do path.asdict(). It would probably be good to supply TypedDict annotations for the output of that anyway though.

This would be awesome 😎.

Defintely save me a lot of work.

When do you see this on your roadmap?

And should I continue implementation on my end, or hold off for now?

Also, consider using NotRequired fields from typing-extensions (it's a standard back port dependency).

I'd drop python 3.8 since it's deprecated and has issues with some typing.

@dhdaines
Copy link

dhdaines commented Feb 21, 2025

It would probably be good to supply TypedDict annotations for the output of that anyway though.

This would be awesome 😎.

Defintely save me a lot of work.

When do you see this on your roadmap?

Probably one of the next things that I'll do, for the next release (next week).

And should I continue implementation on my end, or hold off for now?

You could - though I think the schemas for the metadata might end up being a bit different that what you've got now.

Also, consider using NotRequired fields from typing-extensions (it's a standard back port dependency).

Oh, this is handy ... much better than just adding total=False.

I'd drop python 3.8 since it's deprecated and has issues with some typing.

Even though it's end of life I think it's important to keep supporting it, as it's what came with Ubuntu 20.04 and not everybody has upgraded that yet...

@Goldziher
Copy link
Owner Author

@dhdaines any updates on your end?

@dhdaines
Copy link

dhdaines commented Mar 1, 2025

@dhdaines any updates on your end?

Thanks for the reminder! I'll have some time to look at it today.

@Goldziher
Copy link
Owner Author

@dhdaines any updates on your end?

Thanks for the reminder! I'll have some time to look at it today.

thanks, please update me when you release the new API. If you can document it - would be awesome.

@dhdaines
Copy link

dhdaines commented Mar 2, 2025

Working on it here: dhdaines/playa#68

And a new version of this PR to go with it (not complete): #25

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants