.pdf files can be detected as .ai based on content #582

Open
eric-yuan-vanta opened this issue Feb 15, 2023 · 4 comments

Comments

@eric-yuan-vanta
Contributor

When PDF files contain images created in Photoshop or Adobe Illustrator, file-type detects them as .ai because of the byte-checking heuristic we have in place.

I'm proposing that even if the magic string is found, if the original file's extension is .pdf, file-type should consider it a PDF and not change its type based on some content inside of it.

An even stricter approach that I would also support is to return the ai file type only if the file extension is already .ai. It seems more natural/compatible to default to .pdf if .ai isn't explicitly specified, since the ai detection is just a loose heuristic anyway.

@eric-yuan-vanta
Contributor Author

eric-yuan-vanta commented Feb 15, 2023

@sindresorhus Curious if you have thoughts on this.

I plan to put up a fix with the second approach, but I would like to get https://github.com/sindresorhus/file-type/pulls in first

@eric-yuan-vanta
Contributor Author

eric-yuan-vanta commented Feb 15, 2023

But it seems like we don't have access to the original file extension, since we only use the stream (which makes sense), so maybe this approach is no good.

In my own usage, I'll work around it by managing this case in the caller.
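A minimal sketch of what that caller-side workaround could look like, assuming the caller still knows the original file name. `preferPdfOverAi` and `DetectedType` are hypothetical names used for illustration and are not part of file-type; the only thing borrowed from the library is the `{ext, mime}` shape of its detection result.

```typescript
// Shape of the result a content-based detector reports (ext + MIME type).
interface DetectedType {
  ext: string;
  mime: string;
}

// Hypothetical caller-side helper, not part of file-type.
// Accept an "ai" detection only when the file name already ends in ".ai";
// otherwise treat the file as a PDF.
function preferPdfOverAi(
  detected: DetectedType | undefined,
  fileName: string,
): DetectedType | undefined {
  const dot = fileName.lastIndexOf('.');
  const nameExt = dot === -1 ? '' : fileName.slice(dot + 1).toLowerCase();

  if (detected?.ext === 'ai' && nameExt !== 'ai') {
    return { ext: 'pdf', mime: 'application/pdf' };
  }

  // Every other result, including "nothing detected", passes through unchanged.
  return detected;
}
```

With this, a file named `report.pdf` that was detected as `ai` is reported as a PDF, while a file named `logo.ai` keeps its detected type. The trade-off is that an Illustrator file saved under a different extension would also be reported as a PDF.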

Still, I wonder if there's a better way to do this than what we have today.

@Borewit
Collaborator

Borewit commented Feb 17, 2023

None of the implemented file recognition heuristics is perfect (guaranteed to be correct). By writing 4 characters at the beginning of a text file you can probably mimic half of the file recognition heuristics. The reliability of the heuristics varies strongly.
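To illustrate the point, here is a rough sketch of a prefix-only check, assuming a detector that looks at nothing but the first bytes. `asciiBytes` and `startsWithBytes` are hypothetical helpers written for this example, not file-type's implementation; the `%PDF` prefix is the 4-character start of the PDF header.

```typescript
// Hypothetical helper: encode an ASCII-only string as bytes.
function asciiBytes(text: string): Uint8Array {
  const bytes = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) {
    bytes[i] = text.charCodeAt(i); // assumes every character is ASCII
  }
  return bytes;
}

// Hypothetical prefix check: true when `data` begins with the ASCII string `magic`.
function startsWithBytes(data: Uint8Array, magic: string): boolean {
  if (data.length < magic.length) {
    return false;
  }
  for (let i = 0; i < magic.length; i++) {
    if (data[i] !== magic.charCodeAt(i)) {
      return false;
    }
  }
  return true;
}

// A plain text note that merely happens to begin with the PDF prefix.
const plainTextNote = asciiBytes('%PDF is how this note happens to begin.');
const looksLikePdf = startsWithBytes(plainTextNote, '%PDF');
```

Here `looksLikePdf` is `true` even though the data is ordinary text, which is the kind of false positive a 4-character prefix check cannot rule out.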

If the recognition is likely to introduce false positives (for which there is no clear definition), it may indeed be better to improve the algorithm, preferably, or, as you suggest, to fall back on its parent file type.

@Borewit
Collaborator

Borewit commented Dec 10, 2024

Indeed, we can only determine the file type based on the file content.
It helps if you make example files available, preferably small ones, so we can analyze the difference.
