Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit number of urls and emails reported in a single file #384 #894

Closed
wants to merge 1 commit into from
Closed

Conversation

inishchith
Copy link
Contributor

@inishchith inishchith commented Jan 26, 2018

PR for issue : #384
Signed-off-by: Nishchith Shetty [email protected]

Note : This fix can be extended later to a CLI option for the limit . As of now the threshold is set to 50 .

@codecov
Copy link

codecov bot commented Jan 26, 2018

Codecov Report

Merging #894 into develop will increase coverage by 0.15%.
The diff coverage is 75%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #894      +/-   ##
===========================================
+ Coverage    79.07%   79.23%   +0.15%     
===========================================
  Files           93       93              
  Lines        11456    11462       +6     
===========================================
+ Hits          9059     9082      +23     
+ Misses        2397     2380      -17
Impacted Files Coverage Δ
src/scancode/api.py 83.81% <75%> (+7.76%) ⬆️
src/scancode/cli.py 93.72% <0%> (+0.22%) ⬆️
src/commoncode/fileutils.py 84.75% <0%> (+1.21%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1009508...f436ed2. Read the comment docs.

@inishchith
Copy link
Contributor Author

@pombredanne need your review .

@pombredanne
Copy link
Member

Thanks! your commit is missing a sign off.

Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks gain!
Please see my comments here and in #384 (comment)
We also need some unit tests with this at two levels:

  1. the api proper
  2. end to end tests at the CLI lvel

@@ -105,30 +105,42 @@ def get_emails(location):
"""
Yield mappings of emails detected in the file at `location`.
"""
email_threshold = 100
email_count = 0
from cluecode.finder import find_emails
for email, line_num in find_emails(location):
Copy link
Member

@pombredanne pombredanne Jan 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could more simply enumerate and break when you reach the limit

if email_count < 100:
yield misc
if email_count >= 100:
print("\nWarning : %d emails detected at %s but only %d were reported"%(email_count,location,email_threshold))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As said in the ticket no warning neeed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for now i'll just remove the warning and break while the hardcoded threshold is reached ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, and use enumerate with the threshold as an arg with a default to 50

@inishchith
Copy link
Contributor Author

@pombredanne for now i've enumerated and used break on a default set 50 . Also once i get more acquainted with the codebase i think i can work on #787 .

Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, but this may be best done directly in the plugins IMHO

@@ -105,30 +105,36 @@ def get_emails(location):
"""
Yield mappings of emails detected in the file at `location`.
"""
email_threshold = 50
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use a threshold=50 function arg here and below

if not urls:
continue
misc = OrderedDict()
misc['url'] = urls
misc['start_line'] = line_num
misc['end_line'] = line_num
yield misc
if url_count >= url_threshold:
break
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using itertools.slice would likely be best, in which case this would take place not here but in the corresponding url and email plugins in the #885 branch?
We could accept leave this API unchanged and in get_scanner https://github.com/nexB/scancode-toolkit/blob/877c03e5fd54e099102253e73d0100c1f7407dd7/src/scancode/plugin_url.py#L53 return a closure with functools.partial that would wrap its results in an itertools.islice based on the received argument of a new option --max-urls` option added in https://github.com/nexB/scancode-toolkit/blob/877c03e5fd54e099102253e73d0100c1f7407dd7/src/scancode/plugin_url.py

What do you think?

Copy link
Contributor Author

@inishchith inishchith Jan 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pombredanne Yes , sounds good . it'd take some time for me to look at how to add another option to the plugin at #787 . I tried skimming the branch , couldn't figure out from where is -u or --url setting url variable ( of kwargs ) to true I mean from where is URLScanner getting called .

@inishchith
Copy link
Contributor Author

@pombredanne working on a different branch for #787 so i'll close this .

@inishchith inishchith closed this Jan 27, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants