Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add snap_x/y_tolerance and join_x/y_tolerance table-extraction settings #553

Merged
merged 3 commits into from
Dec 17, 2021

Conversation

jsvine
Copy link
Owner

@jsvine jsvine commented Dec 2, 2021

Based largely on @dustindall's work in PR #51, adapted to current code. Also resolves issue #475.

This PR also changes table.merge_edges(...) behavior when join_tolerance (and x/y variants) <= 0, so that joining is attempted regardless, to handle cases of overlapping lines (which would still fulfill tolerance == 0). This is an uncommon situation but is, I believe, represents the user-expected behavior.

Based largely on @dustindall's work in PR #51, adapted to current code.

Also resolves issue #475.
Now all tolerance settings have x/y versions as well.

This commit also changes `table.merge_edges(...)` behavior when
`join_tolerance` (and `x`/`y` variants) `<= 0`, so that joining is
attempted regardless, to handle cases of overlapping lines.
@jsvine jsvine requested a review from samkit-jain December 2, 2021 04:37
@codecov
Copy link

codecov bot commented Dec 2, 2021

Codecov Report

Merging #553 (1f8ed1e) into develop (156bb4f) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head 1f8ed1e differs from pull request most recent head aa2d594. Consider uploading reports for the commit aa2d594 to get more accurate results
Impacted file tree graph

@@           Coverage Diff            @@
##           develop     #553   +/-   ##
========================================
  Coverage    98.79%   98.80%           
========================================
  Files           10       10           
  Lines         1249     1253    +4     
========================================
+ Hits          1234     1238    +4     
  Misses          15       15           
Impacted Files Coverage Δ
pdfplumber/table.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 156bb4f...aa2d594. Read the comment docs.

@@ -59,16 +65,18 @@ def get_group(edge):
else:
return ("v", edge["x0"])

if snap_tolerance > 0:
edges = snap_edges(edges, snap_tolerance)
if snap_x_tolerance > 0 or snap_y_tolerance > 0:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it be an and comparator? And should <=0 values be treated differently?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so? I think you want to run this block as long as any positive snap tolerance is specified. Or perhaps I misunderstand the concern?

Re. <=0 values:

  • 0 should be treated, I think, as indicating "do not snap" and thus no additional code should need to be run.

  • I suppose we should convert negative values to 0. I'll push some code that does that. Unless you see a strong purpose for a negative tolerance here?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On negative values, decided instead just to raise an error if they're negative — which gives us more flexibility to change the behavior in the future (rather than silently modifying the user's initial parameters). See aa2d594 below.

edge_groups = itertools.groupby(_sorted, key=get_group)
edge_gen = (
join_edge_group(items, k[0], join_tolerance) for k, items in edge_groups
_sorted = sorted(edges, key=get_group)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the join_x_tolerance > 0 and join_y_tolerance > 0 check be added here as well?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My apologies, I could/should have been clearer about this. This is what I meant by this part of the PR message:

This PR also changes table.merge_edges(...) behavior when join_tolerance (and x/y variants) <= 0, so that joining is attempted regardless, to handle cases of overlapping lines (which would still fulfill tolerance == 0). This is an uncommon situation but is, I believe, represents the user-expected behavior.

Essentially, join_edge_group(...) is still helpful, even when join_tolerance==0, because it combines ("joins") overlapping line segments that lay along the same infinite line.

@jsvine jsvine merged commit 3275e19 into develop Dec 17, 2021
@jsvine
Copy link
Owner Author

jsvine commented Dec 17, 2021

Merged, but I'm open to reverting/changing anything that seems off before releasing the next version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants