-
-
Notifications
You must be signed in to change notification settings - Fork 580
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Limit number of urls and emails reported in a single file #384 #894
Conversation
Codecov Report
@@ Coverage Diff @@
## develop #894 +/- ##
===========================================
+ Coverage 79.07% 79.23% +0.15%
===========================================
Files 93 93
Lines 11456 11462 +6
===========================================
+ Hits 9059 9082 +23
+ Misses 2397 2380 -17
Continue to review full report at Codecov.
|
@pombredanne need your review . |
Thanks! your commit is missing a sign off. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks gain!
Please see my comments here and in #384 (comment)
We also need some unit tests with this at two levels:
- the api proper
- end to end tests at the CLI lvel
src/scancode/api.py
Outdated
@@ -105,30 +105,42 @@ def get_emails(location): | |||
""" | |||
Yield mappings of emails detected in the file at `location`. | |||
""" | |||
email_threshold = 100 | |||
email_count = 0 | |||
from cluecode.finder import find_emails | |||
for email, line_num in find_emails(location): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could more simply enumerate and break when you reach the limit
src/scancode/api.py
Outdated
if email_count < 100: | ||
yield misc | ||
if email_count >= 100: | ||
print("\nWarning : %d emails detected at %s but only %d were reported"%(email_count,location,email_threshold)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As said in the ticket no warning neeed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So for now i'll just remove the warning and break while the hardcoded threshold is reached ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, and use enumerate with the threshold as an arg with a default to 50
Signed-off-by: inishchith <[email protected]>
@pombredanne for now i've enumerated and used break on a default set 50 . Also once i get more acquainted with the codebase i think i can work on #787 . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, but this may be best done directly in the plugins IMHO
@@ -105,30 +105,36 @@ def get_emails(location): | |||
""" | |||
Yield mappings of emails detected in the file at `location`. | |||
""" | |||
email_threshold = 50 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use a threshold=50
function arg here and below
if not urls: | ||
continue | ||
misc = OrderedDict() | ||
misc['url'] = urls | ||
misc['start_line'] = line_num | ||
misc['end_line'] = line_num | ||
yield misc | ||
if url_count >= url_threshold: | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using itertools.slice
would likely be best, in which case this would take place not here but in the corresponding url and email plugins in the #885 branch?
We could accept leave this API unchanged and in get_scanner
https://github.com/nexB/scancode-toolkit/blob/877c03e5fd54e099102253e73d0100c1f7407dd7/src/scancode/plugin_url.py#L53 return a closure with functools.partial
that would wrap its results in an itertools.islice based on the received argument of a new option
--max-urls` option added in https://github.com/nexB/scancode-toolkit/blob/877c03e5fd54e099102253e73d0100c1f7407dd7/src/scancode/plugin_url.py
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pombredanne Yes , sounds good . it'd take some time for me to look at how to add another option to the plugin
at #787 . I tried skimming the branch , couldn't figure out from where is -u
or --url
setting url
variable ( of kwargs ) to true
I mean from where is URLScanner getting called .
@pombredanne working on a different branch for #787 so i'll close this . |
PR for issue : #384
Signed-off-by: Nishchith Shetty [email protected]
Note : This fix can be extended later to a CLI option for the limit . As of now the threshold is set to 50 .