boranyang-ML changed the title on Nov 22, 2024
from: Error after modifying magic-pdf.json when running the magic_pdf_parse_main.py demo, help wanted
to: 0.10.0: error after modifying magic-pdf.json when running the magic_pdf_parse_main.py demo, help wanted
Description of the bug | 错误描述
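Running in a Databricks notebook with Python 3.10 and magic-pdf 0.10.0.
Starting from the demo file, I changed the magic-pdf.json settings: formula recognition is disabled and table recognition is enabled. An error is then raised while the results are being written. The cause appears to be os.path.join() trying to combine a str path with a bytes value. Any help is appreciated.
Edit: 0.9.3 does not raise this error.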
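For reference, the error message can be reproduced with the standard library alone. The sketch below is not taken from magic-pdf; the directory and file name are hypothetical, and the bytes value only imitates the JPEG header visible in the traceback.

```python
import os


def join_error(base, name):
    """Return the TypeError message raised by os.path.join, or None on success."""
    try:
        os.path.join(base, name)
    except TypeError as exc:
        return str(exc)
    return None


images_dir = "/tmp/example/images"            # hypothetical output directory (str)
jpeg_bytes = b"\xff\xd8\xff\xe0\x00\x10JFIF"  # image data (bytes), not a file name

# Joining a str directory with bytes fails with the message from the log.
mixed = join_error(images_dir, jpeg_bytes)

# Joining two str components works.
ok = join_error(images_dir, "example.jpg")

print(mixed)
print(ok)
```

So the message by itself only says that one of the components passed to os.path.join() was bytes while the other was str; it does not say which call site supplied the wrong value.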
How to reproduce the bug | 如何复现
Based on the demo file magic_pdf_parse_main.py.
I changed the model JSON settings to:
json_mods = {
    'models-dir': model_dir,
    'layoutreader-model-dir': layoutreader_model_dir,
    "layout-config": {
        "model": "layoutlmv3"
    },
    "formula-config": {
        "mfd_model": "yolo_v8_mfd",
        "mfr_model": "unimernet_small",
        "enable": False
    },
    "table-config": {
        "model": "tablemaster",  # both tablemaster and rapid raise the error
        "enable": True,
        "max_time": 400
    },
}
Error log:
2024-11-22 11:26:54.840 | INFO | magic_pdf.libs.pdf_check:detect_invalid_chars:57 - cid_count: 0, text_len: 21252, cid_chars_radio: 0.0
2024-11-22 11:27:38.056 | INFO | magic_pdf.model.pdf_extract_kit:call:184 - layout detection time: 41.1
2024-11-22 11:27:41.379 | INFO | magic_pdf.model.pdf_extract_kit:call:232 - det time: 3.31
2024-11-22 11:27:41.380 | INFO | magic_pdf.model.pdf_extract_kit:call:272 - table time: 0.0
2024-11-22 11:27:41.381 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:168 - -----page_id : 0, page total time: 44.42-----
2024-11-22 11:28:11.389 | INFO | magic_pdf.model.pdf_extract_kit:call:184 - layout detection time: 30.01
2024-11-22 11:28:13.730 | INFO | magic_pdf.model.pdf_extract_kit:call:232 - det time: 2.33
2024-11-22 11:28:13.731 | INFO | magic_pdf.model.pdf_extract_kit:call:272 - table time: 0.0
2024-11-22 11:28:13.732 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:168 - -----page_id : 1, page total time: 32.35-----
2024-11-22 11:28:43.934 | INFO | magic_pdf.model.pdf_extract_kit:call:184 - layout detection time: 30.2
2024-11-22 11:28:45.223 | INFO | magic_pdf.model.pdf_extract_kit:call:232 - det time: 1.28
2024-11-22 11:31:57.389 | INFO | magic_pdf.model.pdf_extract_kit:call:272 - table time: 192.16
2024-11-22 11:31:57.390 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:168 - -----page_id : 2, page total time: 223.66-----
2024-11-22 11:32:31.371 | INFO | magic_pdf.model.pdf_extract_kit:call:184 - layout detection time: 33.98
2024-11-22 11:32:33.672 | INFO | magic_pdf.model.pdf_extract_kit:call:232 - det time: 2.29
2024-11-22 11:35:45.704 | INFO | magic_pdf.model.pdf_extract_kit:call:272 - table time: 192.03
2024-11-22 11:35:45.705 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:168 - -----page_id : 3, page total time: 228.31-----
2024-11-22 11:36:17.324 | INFO | magic_pdf.model.pdf_extract_kit:call:184 - layout detection time: 31.62
2024-11-22 11:36:19.286 | INFO | magic_pdf.model.pdf_extract_kit:call:232 - det time: 1.95
2024-11-22 11:36:19.287 | INFO | magic_pdf.model.pdf_extract_kit:call:272 - table time: 0.0
2024-11-22 11:36:19.288 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:168 - -----page_id : 4, page total time: 33.58-----
2024-11-22 11:36:49.040 | INFO | magic_pdf.model.pdf_extract_kit:call:184 - layout detection time: 29.75
2024-11-22 11:36:51.518 | INFO | magic_pdf.model.pdf_extract_kit:call:232 - det time: 2.47
2024-11-22 11:36:51.521 | INFO | magic_pdf.model.pdf_extract_kit:call:272 - table time: 0.0
2024-11-22 11:36:51.522 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:168 - -----page_id : 5, page total time: 32.23-----
2024-11-22 11:37:21.369 | INFO | magic_pdf.model.pdf_extract_kit:call:184 - layout detection time: 29.85
2024-11-22 11:37:22.244 | INFO | magic_pdf.model.pdf_extract_kit:call:232 - det time: 0.87
2024-11-22 11:37:22.245 | INFO | magic_pdf.model.pdf_extract_kit:call:272 - table time: 0.0
2024-11-22 11:37:22.246 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:168 - -----page_id : 6, page total time: 30.72-----
2024-11-22 11:37:23.689 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:178 - gc time: 1.44
2024-11-22 11:37:23.690 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:182 - doc analyze time: 626.73, speed: 0.01 pages/second
2024-11-22 11:37:32.284 | ERROR | magic_pdf.user_api:parse_pdf:88 - Can't mix strings and bytes in path components
Traceback (most recent call last):
File "/databricks/python_shell/scripts/db_ipykernel_launcher.py", line 157, in
main()
└ <function main at 0x7fa036b564d0>
File "/databricks/python_shell/scripts/db_ipykernel_launcher.py", line 153, in main
app.start()
│ └ <function IPKernelApp.start at 0x7fa033e62e60>
└ <ipykernel.kernelapp.IPKernelApp object at 0x7fa0338ee770>
File "/databricks/python/lib/python3.10/site-packages/ipykernel/kernelapp.py", line 736, in start
self.io_loop.start()
│ │ └ <function BaseAsyncIOLoop.start at 0x7fa0358db010>
│ └ <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7fa0338ef790>
└ <ipykernel.kernelapp.IPKernelApp object at 0x7fa0338ee770>
File "/databricks/python/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 199, in start
self.asyncio_loop.run_forever()
│ │ └ <function BaseEventLoop.run_forever at 0x7fa0361aff40>
│ └ <_UnixSelectorEventLoop running=True closed=False debug=False>
└ <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7fa0338ef790>
File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
│ └ <function BaseEventLoop._run_once at 0x7fa0361b9ab0>
└ <_UnixSelectorEventLoop running=True closed=False debug=False>
File "/usr/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
│ └ <function Handle._run at 0x7fa03615d480>
└ <Handle Task.task_wakeup(, ...],))>)>
File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
│ │ │ │ │ └ <member '_args' of 'Handle' objects>
│ │ │ │ └ <Handle Task.task_wakeup(, ...],))>)>
│ │ │ └ <member '_callback' of 'Handle' objects>
│ │ └ <Handle Task.task_wakeup(, ...],))>)>
│ └ <member '_context' of 'Handle' objects>
└ <Handle Task.task_wakeup(, ...],))>)>
File "/databricks/python/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 516, in dispatch_queue
await self.process_one()
│ └ <function Kernel.process_one at 0x7fa03447c3a0>
└ <dbruntime.DatabricksShell.DatabricksKernel object at 0x7fa033728a90>
File "/databricks/python/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 505, in process_one
await dispatch(*args)
│ └ ([<zmq.sugar.frame.Frame object at 0x7fa02f155220>, <zmq.sugar.frame.Frame object at 0x7f9e93e80720>, <zmq.sugar.frame.Frame ...
└ <bound method Kernel.dispatch_shell of <dbruntime.DatabricksShell.DatabricksKernel object at 0x7fa033728a90>>
File "/databricks/python/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 412, in dispatch_shell
await result
└ <coroutine object Kernel.execute_request at 0x7f9ea80d4c10>
File "/databricks/python/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 740, in execute_request
reply_content = await reply_content
└ <coroutine object DatabricksKernel.do_execute at 0x7f9d8353b760>
File "/databricks/python_shell/dbruntime/DatabricksShell.py", line 131, in do_execute
reply_content = await super().do_execute(*args, **kwargs)
│ └ {'cell_id': None}
└ ('content_list, md_content = pdf_parse_main(RFP_PATH, is_json_md_dump=False, is_draw_visualization_bbox=False)', False, True,...
File "/databricks/python/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 422, in do_execute
res = shell.run_cell(
│ └ <function ZMQInteractiveShell.run_cell at 0x7fa033e60790>
└ <dbruntime.DatabricksShell.DatabricksShell object at 0x7fa033728f40>
File "/databricks/python/lib/python3.10/site-packages/ipykernel/zmqshell.py", line 546, in run_cell
return super().run_cell(*args, **kwargs)
│ └ {'store_history': True, 'silent': False, 'cell_id': None}
└ ('content_list, md_content = pdf_parse_main(RFP_PATH, is_json_md_dump=False, is_draw_visualization_bbox=False)',)
File "/databricks/python/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3009, in run_cell
result = self._run_cell(
│ └ <function InteractiveShell._run_cell at 0x7fa0351cf010>
└ <dbruntime.DatabricksShell.DatabricksShell object at 0x7fa033728f40>
File "/databricks/python/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3064, in _run_cell
result = runner(coro)
│ └ <coroutine object InteractiveShell.run_cell_async at 0x7f9ea803bca0>
└ <function _pseudo_sync_runner at 0x7fa0351fa950>
File "/databricks/python/lib/python3.10/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner
coro.send(None)
│ └ <method 'send' of 'coroutine' objects>
└ <coroutine object InteractiveShell.run_cell_async at 0x7f9ea803bca0>
File "/databricks/python/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3269, in run_cell_async
has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
│ │ │ │ └ '/root/.ipykernel/1880/command-3938208215605947-2839413430'
│ │ │ └ [<ast.Assign object at 0x7f9d829dc6a0>]
│ │ └ <ast.Module object at 0x7f9d829dc850>
│ └ <function InteractiveShell.run_ast_nodes at 0x7fa0351cf2e0>
└ <dbruntime.DatabricksShell.DatabricksShell object at 0x7fa033728f40>
File "/databricks/python/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3448, in run_ast_nodes
if await self.run_code(code, result, async_=asy):
│ │ │ │ └ False
│ │ │ └ <ExecutionResult object at 7f9d829de230, execution_count=11 error_before_exec=None error_in_exec=None info=<ExecutionInfo obj...
│ │ └ <code object at 0x7f9e5bca6ce0, file "/root/.ipykernel/1880/command-3938208215605947-2839413430", line 1>
│ └ <function InteractiveShell.run_code at 0x7fa0351cf370>
└ <dbruntime.DatabricksShell.DatabricksShell object at 0x7fa033728f40>
File "/databricks/python/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3508, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
│ │ │ │ └ {'display': <bound method Display.display of <dbruntime.display.Display object at 0x7fa0338ee200>>, 'displayHTML': <function ...
│ │ │ └ <dbruntime.DatabricksShell.DatabricksShell object at 0x7fa033728f40>
│ │ └ <property object at 0x7fa0351d0360>
│ └ <dbruntime.DatabricksShell.DatabricksShell object at 0x7fa033728f40>
└ <code object at 0x7f9e5bca6ce0, file "/root/.ipykernel/1880/command-3938208215605947-2839413430", line 1>
File "/root/.ipykernel/1880/command-3938208215605947-2839413430", line 1, in
content_list, md_content = pdf_parse_main(RFP_PATH, is_json_md_dump=False, is_draw_visualization_bbox=False)
│ └ '/Workspace/Users/[email protected]/RFP_samples/1100016526 - SOW.pdf'
└ <function pdf_parse_main at 0x7f9ea80e2d40>
File "/root/.ipykernel/1880/command-3938208215605946-709558184", line 88, in pdf_parse_main
pipe.pipe_parse()
│ └ <function UNIPipe.pipe_parse at 0x7f9ea8082560>
└ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7f9d829de4a0>
File "/databricks/python/lib/python3.10/site-packages/magic_pdf/pipe/UNIPipe.py", line 43, in pipe_parse
self.pdf_mid_data = parse_union_pdf(self.pdf_bytes, self.model_list, self.image_writer,
│ │ │ │ │ │ │ │ └ <magic_pdf.rw.DiskReaderWriter.DiskReaderWriter object at 0x7f9d82d02d10>
│ │ │ │ │ │ │ └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7f9d829de4a0>
│ │ │ │ │ │ └ [{'layout_dets': [{'category_id': 0, 'poly': [96.58213806152344, 612.4583740234375, 440.2578430175781, 612.4583740234375, 440...
│ │ │ │ │ └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7f9d829de4a0>
│ │ │ │ └ b'%PDF-1.3\n%\xc4\xe5\xf2\xe5\xeb\xa7\xf3\xa0\xd0\xc4\xc6\n3 0 obj\n<< /Filter /FlateDecode /Length 14961 >>\nstream\nx\x01\x...
│ │ │ └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7f9d829de4a0>
│ │ └ <function parse_union_pdf at 0x7f9ea8082320>
│ └ None
└ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7f9d829de4a0>
File "/databricks/python/lib/python3.10/site-packages/magic_pdf/user_api.py", line 91, in parse_union_pdf
pdf_info_dict = parse_pdf(parse_pdf_by_txt)
│ └ <function parse_pdf_by_txt at 0x7f9ea8082200>
└ <function parse_union_pdf..parse_pdf at 0x7f9d83333eb0>
File "/databricks/python/lib/python3.10/site-packages/magic_pdf/pdf_parse_by_txt.py", line 16, in parse_pdf_by_txt
return pdf_parse_union(dataset,
│ └ <magic_pdf.data.dataset.PymuDocDataset object at 0x7f9df7872920>
└ <function pdf_parse_union at 0x7f9ea8082170>
File "/databricks/python/lib/python3.10/site-packages/magic_pdf/pdf_parse_union_core_v2.py", line 820, in pdf_parse_union
page_info = parse_page_core(
└ <function parse_page_core at 0x7f9ea80820e0>
File "/databricks/python/lib/python3.10/site-packages/magic_pdf/pdf_parse_union_core_v2.py", line 730, in parse_page_core
spans = ocr_cut_image_and_table(
└ <function ocr_cut_image_and_table at 0x7f9ea805f6d0>
File "/databricks/python/lib/python3.10/site-packages/magic_pdf/pre_proc/cut_image.py", line 22, in ocr_cut_image_and_table
span['image_path'] = cut_image(span['bbox'], page_id, page, return_path=return_path('tables'),
│ │ │ │ │ └ <function ocr_cut_image_and_table..return_path at 0x7f9d836b6290>
│ │ │ │ └ <magic_pdf.data.dataset.Doc object at 0x7f9df77ef430>
│ │ │ └ 2
│ │ └ {'bbox': [34, 237, 562, 510], 'score': 0.9999329447746277, 'html': '
│ └ <function cut_image at 0x7f9ece0a7880>
└ {'bbox': [34, 237, 562, 510], 'score': 0.9999329447746277, 'html': '
File "/databricks/python/lib/python3.10/site-packages/magic_pdf/libs/pdf_image_tools.py", line 32, in cut_image
imageWriter.write(img_hash256_path, byte_data)
│ │ │ └ b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x00\x00\x00\x00\xff\xdb\x00C\x00\x02\x01\x01\x01\x01\x01\x02\x01\x01\x01\x02...
│ │ └ '30284ece347623ce24fb3dc3ab2e1a00925a41da6f29804f4f17bae92de2b1db.jpg'
│ └ <function DiskReaderWriter.write at 0x7f9ea80832e0>
└ <magic_pdf.rw.DiskReaderWriter.DiskReaderWriter object at 0x7f9d82d02d10>
File "/databricks/python/lib/python3.10/site-packages/magic_pdf/rw/DiskReaderWriter.py", line 32, in write
abspath = os.path.join(self.path, path)
│ │ │ │ │ └ b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x00\x00\x00\x00\xff\xdb\x00C\x00\x02\x01\x01\x01\x01\x01\x02\x01\x01\x01\x02...
│ │ │ │ └ '/Workspace/Users/[email protected]/RFP_samples/1100016526 - SOW/images'
│ │ │ └ <magic_pdf.rw.DiskReaderWriter.DiskReaderWriter object at 0x7f9d82d02d10>
│ │ └ <function join at 0x7fa036b42b00>
│ └ <module 'posixpath' from '/usr/lib/python3.10/posixpath.py'>
└ <module 'os' from '/usr/lib/python3.10/os.py'>
File "/usr/lib/python3.10/posixpath.py", line 90, in join
genericpath._check_arg_types('join', a, *p)
│ │ │ └ (b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x00\x00\x00\x00\xff\xdb\x00C\x00\x02\x01\x01\x01\x01\x01\x02\x01\x01\x01\x0...
│ │ └ '/Workspace/Users/[email protected]/RFP_samples/1100016526 - SOW/images'
│ └ <function _check_arg_types at 0x7fa036b42950>
└ <module 'genericpath' from '/usr/lib/python3.10/genericpath.py'>
File "/usr/lib/python3.10/genericpath.py", line 155, in _check_arg_types
raise TypeError("Can't mix strings and bytes in path components") from None
TypeError: Can't mix strings and bytes in path components
2024-11-22 11:37:32.311 | WARNING | magic_pdf.user_api:parse_union_pdf:93 - parse_pdf_by_txt drop or error, switch to parse_pdf_by_ocr
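In the traceback, cut_image calls imageWriter.write(img_hash256_path, byte_data), yet inside DiskReaderWriter.write the variable named path holds the JPEG bytes. One plausible reading, which is an assumption and not confirmed by anything in this report, is that caller and writer disagree on positional argument order. The sketch below illustrates that pattern with hypothetical classes; LegacyWriter and cut_image_like_caller are made up for illustration and are not the magic-pdf implementation.

```python
import os


class LegacyWriter:
    """Hypothetical writer whose write() expects the content first, then the path."""

    def __init__(self, root):
        self.root = root

    def write(self, content, path):
        # If a caller passes (path, content) instead, `path` is bytes here
        # and os.path.join raises the TypeError seen in the log.
        abspath = os.path.join(self.root, path)
        return abspath, content  # a real writer would write `content` to `abspath`


def cut_image_like_caller(writer, filename, data):
    # Hypothetical caller that passes the file name first, mirroring the
    # argument order of imageWriter.write(img_hash256_path, byte_data).
    return writer.write(filename, data)


writer = LegacyWriter("/tmp/example/images")  # hypothetical directory
data = b"\xff\xd8\xff\xe0"

try:
    cut_image_like_caller(writer, "example.jpg", data)
    error = None
except TypeError as exc:
    error = str(exc)

# Passing the arguments by keyword does not depend on positional order.
abspath, content = writer.write(content=data, path="example.jpg")

print(error)
print(abspath)
```

If that reading is right, checking whether the writer object built in the demo script matches the write() signature expected by the installed magic-pdf version would be a reasonable first step, given that 0.9.3 does not raise the error.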
Operating system | 操作系统
Linux
Python version | Python 版本
3.10
Software version | 软件版本 (magic-pdf --version)
0.9.x
Device mode | 设备模式
cpu