-
Notifications
You must be signed in to change notification settings - Fork 705
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sort tables extracted on a page by their top
position
#338
Conversation
5a8eb77
to
d3c6034
Compare
Codecov Report
@@ Coverage Diff @@
## develop #338 +/- ##
========================================
Coverage 98.17% 98.17%
========================================
Files 10 10
Lines 1205 1205
========================================
Hits 1183 1183
Misses 22 22
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for this, @samkit-jain. I think you've identified a helpful improvement. It may cause problems for some (hopefully very few) people — if they depend on the current leftmost-and-then-topmost ordering and have a PDF where the left-alignment is either meaningful or predictably uneven — but I think it will save people more trouble in the long run.
Many thanks for adding the test, too. I've made a suggestion about moving the logic to a slightly different place in the code, but this otherwise looks good.
Thanks for the review @jsvine How about we add a new argument |
My initial instinct is not to add this argument. My hunch is that this change will affect very few integrations, and that there is (and will remain) a fairly easy way to customize the table order, by using tables = sorted(my_page.find_tables(), key=lambda x: x.bbox)
extracted = [ t.extract() for t in tables ] That said, I'm open to being persuaded otherwise. Thoughts? |
6517abf
to
56c47e2
Compare
@jsvine I agree. The user has control over the sorting using the |
56c47e2
to
a17c5a7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thanks! Will merge.
Fixes #336
h/t @gqh1995 for reporting
When multiple tables are found on a page, sorts them top to bottom. If there is a collision, sorts left to right. Not sure, but I think instead of sorting as I have done here, but perhaps, can update the sorting at
pdfplumber/pdfplumber/table.py
Line 315 in de767c3
PS: The changelog would need to be updated with the commit hash after this PR gets merged in.