Could a switch parameter be added to control whether header and footer content is removed? #626

Closed
guoguo0646 opened this issue Sep 18, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@guoguo0646

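The current version [0.8.1] detects headers and footers in PDF documents and removes them automatically. In some scenarios the headers and footers contain important content that needs to be kept in the final parsing result. Could a global switch parameter be added to control whether header and footer content is removed?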

@guoguo0646 guoguo0646 added the enhancement New feature or request label Sep 18, 2024
@myhloli
Collaborator

myhloli commented Sep 18, 2024

The discarded_blocks field in middle.json stores the text that was removed from each page. You can write your own logic to export it.
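A minimal sketch of such export logic, for illustration only. It assumes middle.json has a top-level `pdf_info` list with one entry per page, that each page carries `page_idx` and `discarded_blocks`, and that the text sits in string values under `content` keys somewhere inside each block. Those layout details are assumptions not confirmed in this thread, so check them against your own file. The helper names are hypothetical.

```python
import json
from pathlib import Path


def collect_text(node):
    """Recursively gather every string stored under a "content" key."""
    texts = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "content" and isinstance(value, str):
                texts.append(value)
            else:
                texts.extend(collect_text(value))
    elif isinstance(node, list):
        for item in node:
            texts.extend(collect_text(item))
    return texts


def discarded_text_by_page(middle):
    """Map each page index to the text found in its discarded blocks.

    Assumes (unverified) a top-level "pdf_info" list of page dicts.
    """
    result = {}
    for position, page in enumerate(middle.get("pdf_info", [])):
        # Fall back to the list position if "page_idx" is missing.
        page_idx = page.get("page_idx", position)
        result[page_idx] = collect_text(page.get("discarded_blocks", []))
    return result


def load_middle_json(path):
    """Read a middle.json file from disk (the path is supplied by the caller)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Calling `discarded_text_by_page(load_middle_json(path))` would then give a dict of page index to removed strings, which can be merged back into the parsed output as needed.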

@myhloli myhloli closed this as completed Sep 19, 2024
@skyantao

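The book's page numbers are not recognized. I need the page numbers to locate content in my application. How can I get them in the output?

They are not in discarded_blocks either.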

@myhloli
Collaborator

myhloli commented Dec 31, 2024

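> The book's page numbers are not recognized. I need the page numbers to locate content in my application. How can I get them in the output?
>
> They are not in discarded_blocks either.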

If you only need the page number, content_list has a page_idx field that represents the page number.
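As a rough illustration of using that field, the sketch below groups content list entries by `page_idx`. It assumes the content list is a JSON array of dicts that each carry `page_idx`. The function name and the sample entries are hypothetical.

```python
from collections import defaultdict


def group_by_page(content_list):
    """Group content list entries by their "page_idx" value."""
    pages = defaultdict(list)
    for entry in content_list:
        # Entries without "page_idx" end up under the key None.
        pages[entry.get("page_idx")].append(entry)
    return dict(pages)


# Hypothetical entries, shaped only for this example.
entries = [
    {"type": "text", "text": "Chapter 1", "page_idx": 0},
    {"type": "text", "text": "First paragraph", "page_idx": 0},
    {"type": "text", "text": "Second page text", "page_idx": 1},
]
grouped = group_by_page(entries)
```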

@skyantao

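I need to go from the page numbers in the table of contents to the corresponding locations. page_index starts at 0, and the cover, copyright page, table of contents, preface, and so on come first, so I cannot get the correct page number.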
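One possible workaround, sketched under strong assumptions: if the body pages are numbered consecutively and you can determine by hand which zero-based index holds the first numbered page, a printed page number can be converted with a fixed offset. This is not a feature of the tool. The function and parameter names are hypothetical.

```python
def printed_to_page_idx(printed_page, first_numbered_page_idx, first_printed_number=1):
    """Convert a printed page number to a zero-based page index.

    first_numbered_page_idx: zero-based index of the page that carries
    first_printed_number, determined manually for each book.
    Assumes printed numbering is consecutive from that page onward.
    """
    return first_numbered_page_idx + (printed_page - first_printed_number)


# Example: 8 front-matter pages (indices 0..7), so printed page 1 is index 8.
idx = printed_to_page_idx(25, first_numbered_page_idx=8)
```

This breaks down for books with unnumbered inserts or restarted numbering, where a per-page lookup would be needed instead.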

3 participants