Hi,
When I try to extract information from some PDFs in the DocLayNet dataset with md.convert, I get this error:
---------------------------------------------------------------------------
FileConversionException                   Traceback (most recent call last)
Cell In[3], line 45
     41 page_hash = doc[:doc.find('.pdf')]
     44 start_time = time.time()
---> 45 conv_result = md.convert(str(doc_path))
     46 diff_time = time.time() - start_time
     47 print(f"Time computing : {diff_time} s")

File c:\Users\AppData\Local\miniconda3\envs\myenv\Lib\site-packages\markitdown\_markitdown.py:1094, in MarkItDown.convert(self, source, **kwargs)
   1092 return self.convert_url(source, **kwargs)
   1093 else:
-> 1094 return self.convert_local(source, **kwargs)
   1095 # Request response
   1096 elif isinstance(source, requests.Response):

File c:\Users\AppData\Local\miniconda3\envs\myenv\Lib\site-packages\markitdown\_markitdown.py:1114, in MarkItDown.convert_local(self, path, **kwargs)
   1111 self._append_ext(extensions, g)
   1113 # Convert
-> 1114 return self._convert(path, extensions, **kwargs)

File c:\Users\AppData\Local\miniconda3\envs\myenv\Lib\site-packages\markitdown\_markitdown.py:1255, in MarkItDown._convert(self, local_path, extensions, **kwargs)
   1253 # If we got this far without success, report any exceptions
   1254 if len(error_trace) > 0:
-> 1255 raise FileConversionException(
   1256 f"Could not convert '{local_path}' to Markdown. File type was recognized as {extensions}. While converting the file, the following error was encountered:\n\n{error_trace}"
   1257 )
   1259 # Nothing can handle it!
   1260 raise UnsupportedFormatException(
   1261 f"Could not convert '{local_path}' to Markdown. The formats {extensions} are not supported."
   1262 )

FileConversionException: Could not convert 'E:\users\.cache\huggingface\hub\datasets--pierreguillou--DocLayNet-large\snapshots\38ff443244c1b496c33ed237d3d4468daf24265c\data\part_dataset_3\part_dataset_3\test\pdfs\ccbe08f3390d47046dbb9d4c839788ba05a0f5e139ab6931a06e8304247c54f0.pdf' to Markdown. File type was recognized as ['.pdf', '.pdf', '.fdf']. While converting the file, the following error was encountered:

Traceback (most recent call last):
  File "c:\Users\AppData\Local\miniconda3\envs\myenv\Lib\site-packages\markitdown\_markitdown.py", line 1239, in _convert
    res = converter.convert(local_path, **_kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\AppData\Local\miniconda3\envs\myenv\Lib\site-packages\markitdown\_markitdown.py", line 490, in convert
    text_content=pdfminer.high_level.extract_text(local_path),
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\AppData\Local\miniconda3\envs\myenv\Lib\site-packages\pdfminer\high_level.py", line 169, in extract_text
    for page in PDFPage.get_pages(
                ^^^^^^^^^^^^^^^^^^
  File "c:\Users\AppData\Local\miniconda3\envs\myenv\Lib\site-packages\pdfminer\pdfpage.py", line 171, in get_pages
    for (pageno, page) in enumerate(cls.create_pages(doc)):
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\AppData\Local\miniconda3\envs\myenv\Lib\site-packages\pdfminer\pdfpage.py", line 127, in create_pages
    yield cls(document, objid, tree, next(page_labels))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\AppData\Local\miniconda3\envs\myenv\Lib\site-packages\pdfminer\pdfpage.py", line 64, in __init__
    resolve1(mediabox_param) for mediabox_param in self.attrs["MediaBox"]
                                                   ~~~~~~~~~~^^^^^^^^^^^^
TypeError: 'PDFObjRef' object is not iterable
Has anyone already encountered this issue? It's strange because the document is a .pdf file, not an .fdf one.
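For anyone reading the traceback: the final frame fails while iterating `self.attrs["MediaBox"]`, which suggests the page's MediaBox is stored as an indirect reference (`PDFObjRef`) instead of a direct array, and that the code iterates it before resolving it. The `.fdf` entry in the recognized types looks like one of several guessed extensions and is probably not the cause. Below is a minimal sketch of that failure mode. `FakeObjRef`, `fake_resolve1`, and both `read_mediabox_*` helpers are hypothetical stand-ins written for illustration; they are not pdfminer's real classes or its actual fix.

```python
# Stand-ins for illustration only; these are NOT pdfminer's real classes.
class FakeObjRef:
    """Mimics an indirect reference: wraps a target and is not iterable."""

    def __init__(self, target):
        self.target = target


def fake_resolve1(obj):
    """Follow indirect references until a direct object is reached."""
    while isinstance(obj, FakeObjRef):
        obj = obj.target
    return obj


def read_mediabox_naive(attrs):
    # Same shape as the failing line: iterates the value before resolving it.
    return [fake_resolve1(p) for p in attrs["MediaBox"]]


def read_mediabox_resolved(attrs):
    # Resolves the outer reference first, then each element.
    return [fake_resolve1(p) for p in fake_resolve1(attrs["MediaBox"])]


# A page whose MediaBox is itself an indirect reference (assumed scenario).
attrs = {"MediaBox": FakeObjRef([0, 0, FakeObjRef(612), 792])}

naive_error = None
try:
    read_mediabox_naive(attrs)
except TypeError as exc:
    naive_error = str(exc)

mediabox = read_mediabox_resolved(attrs)
print(naive_error)
print(mediabox)
```

If this reading is right, the problem sits in the PDF parsing layer, not in the file-type detection.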
Please share some sample data. We will test it on our end and update you.
Thanks, here are sample files that produce this error on my side:
1a20fe0c5546fc2bdd6d39238368a45ac744bb8ca5704866fea0b66603b7cc7d.pdf
00cbcdcc89d8a14fa7411d1d6e845947568dd3da8c7956c1c225a29d75e6185a.pdf
0cc658bd9a55e4bc191c9252d39f505fd071240f60cfd53870c39af57d989c2c.pdf
0ce8892fa840c18e459935729a05a5e8518149dc2dee2a797615a4e1a176e27c.pdf
0e4fb27cb786635b1132ed074aff632afccd50589d058ca419d4ac1e8d4489ed.pdf
Source: https://huggingface.co/datasets/ds4sd/DocLayNet, part_dataset_1 > test > pdfs
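Since only some files in the dataset fail, a batch run can catch the exception per file and keep going. The sketch below is one way to do that. `convert_all` and `stand_in_convert` are hypothetical names introduced here; the converter is passed in as a plain callable so the example runs without any PDF library installed.

```python
def convert_all(paths, convert):
    """Run `convert` on each path; collect results and failures separately."""
    results, failures = {}, {}
    for path in paths:
        key = str(path)
        try:
            results[key] = convert(key)
        except Exception as exc:  # keep going; record what went wrong
            failures[key] = f"{type(exc).__name__}: {exc}"
    return results, failures


def stand_in_convert(path):
    """Hypothetical converter used only to demonstrate the wrapper."""
    if path.startswith("bad"):
        raise TypeError("'PDFObjRef' object is not iterable")
    return f"converted {path}"


results, failures = convert_all(["good.pdf", "bad.pdf"], stand_in_convert)
print(results)
print(failures)
```

In a real run, the `md.convert` callable from the original snippet would take the place of `stand_in_convert`, and `failures` would give the list of files to attach to a report like this one.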