Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sort tables extracted on a page by their top position #338

Merged
merged 1 commit into from
Jan 22, 2021
Merged

Conversation

samkit-jain
Copy link
Collaborator

@samkit-jain samkit-jain commented Jan 19, 2021

Fixes #336
h/t @gqh1995 for reporting

When multiple tables are found on a page, sorts them top to bottom. If there is a collision, sorts left to right. Not sure, but I think instead of sorting as I have done here, but perhaps, can update the sorting at

_sorted = sorted(tables, key=lambda t: min(t["corners"]))
instead.

PS: The changelog would need to be updated with the commit hash after this PR gets merged in.

@codecov
Copy link

codecov bot commented Jan 19, 2021

Codecov Report

Merging #338 (70829c0) into develop (de767c3) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff            @@
##           develop     #338   +/-   ##
========================================
  Coverage    98.17%   98.17%           
========================================
  Files           10       10           
  Lines         1205     1205           
========================================
  Hits          1183     1183           
  Misses          22       22           
Impacted Files Coverage Δ
pdfplumber/table.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update de767c3...a17c5a7. Read the comment docs.

@samkit-jain samkit-jain requested a review from jsvine January 19, 2021 10:11
@samkit-jain samkit-jain marked this pull request as ready for review January 19, 2021 10:11
Copy link
Owner

@jsvine jsvine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for this, @samkit-jain. I think you've identified a helpful improvement. It may cause problems for some (hopefully very few) people — if they depend on the current leftmost-and-then-topmost ordering and have a PDF where the left-alignment is either meaningful or predictably uneven — but I think it will save people more trouble in the long run.

Many thanks for adding the test, too. I've made a suggestion about moving the logic to a slightly different place in the code, but this otherwise looks good.

@samkit-jain
Copy link
Collaborator Author

Thanks for the review @jsvine How about we add a new argument table_direction that controls whether the tables are read leftmost (default) or topmost? This would ensure that the change does not break any integrations.

@jsvine
Copy link
Owner

jsvine commented Jan 20, 2021

My initial instinct is not to add this argument. My hunch is that this change will affect very few integrations, and that there is (and will remain) a fairly easy way to customize the table order, by using page.find_tables(...) instead of page.extract_tables(...). E.g.,:

tables = sorted(my_page.find_tables(), key=lambda x: x.bbox)
extracted = [ t.extract() for t in tables ]

That said, I'm open to being persuaded otherwise. Thoughts?

@samkit-jain samkit-jain force-pushed the issue-336 branch 2 times, most recently from 6517abf to 56c47e2 Compare January 20, 2021 06:19
@samkit-jain
Copy link
Collaborator Author

@jsvine I agree. The user has control over the sorting using the .find_tables(...) method which makes adding a new argument redundant.

@samkit-jain samkit-jain requested a review from jsvine January 20, 2021 06:20
Copy link
Owner

@jsvine jsvine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks! Will merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants