Proposal for avoiding false positives #1838
Comments
@furuholm thank you for this detailed ticket. You wrote:
We could do that; in fact we have a list of important "legalese" words: https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/legalese.py. But practically, how would you do this? Annotate rules with essential words? Or could there be some automated process that says: if this rule is matched but these words are not matched, then this is not a match to this rule?
The legalese words are used to prioritize the matching and diff, but are not used in the score computation. We could have an alternate score that would take them into account.
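To make the idea of an alternate score concrete, here is a minimal sketch. It is not ScanCode code: the `LEGALESE` set, the function names and the `weight` parameter are all assumptions made up for illustration, and the two texts are invented one-liners. It only shows that weighting important words more heavily lowers the score when exactly those words are missing.

```python
import re

# Hypothetical stand-in for a list of important "legalese" words.
# This is NOT the content of ScanCode's legalese.py.
LEGALESE = {"cpal", "bsd", "apache", "gpl", "redistribution", "warranty"}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def plain_coverage(rule_text, query_text):
    """Fraction of the rule's unique tokens that also occur in the query."""
    rule = tokenize(rule_text)
    query = tokenize(query_text)
    return len(rule & query) / len(rule) if rule else 0.0


def legalese_weighted_coverage(rule_text, query_text, weight=5.0):
    """Same as plain_coverage, but legalese tokens count `weight` times."""
    rule = tokenize(rule_text)
    query = tokenize(query_text)

    def token_weight(token):
        return weight if token in LEGALESE else 1.0

    total = sum(token_weight(t) for t in rule)
    matched = sum(token_weight(t) for t in rule & query)
    return matched / total if total else 0.0


rule_text = "licensed under the cpal license"
query_text = "licensed under the bsd license"
print(plain_coverage(rule_text, query_text))
print(legalese_weighted_coverage(rule_text, query_text))
```

With these toy inputs the weighted score drops well below the plain one, because the only missing rule word is the one marked as important.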
Here are a few other ways to resolve this:
a. whole text exact match using a checksum
Because of c., the more license texts and examples there are, the more are detected exactly and quickly. Because the automaton is trie-based, adding more rules does not grow the size of the index as much. So within reason, the more rules there are, the faster and the more accurate detection will be.
My first reaction here would be to add new rules (e.g. my suggestion 1.) but ideally we should follow your approach for scoring and some of the other suggestions I made.
@furuholm I am tempted to qualify this as a bug rather than an enhancement request. What do you think?
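As an illustration of item a. only, whole-text exact matching with a checksum could look roughly like the sketch below. The normalization, the choice of SHA-1, the helper names and the `example-1.RULE` entry are assumptions for this example and do not describe how ScanCode implements it.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse everything but alphanumeric tokens."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def checksum(text):
    """Hex digest of the normalized text."""
    return hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()


def build_index(rules):
    """Map the checksum of each rule text to its rule name.

    `rules` is a hypothetical dict of {rule_name: rule_text}.
    """
    return {checksum(text): name for name, text in rules.items()}


def exact_match(index, query_text):
    """Return the rule name whose whole text equals the query, else None."""
    return index.get(checksum(query_text))


index = build_index({"example-1.RULE": "Licensed under the Example License."})
print(exact_match(index, "licensed   under the EXAMPLE license"))
print(exact_match(index, "some unrelated text"))
```

A lookup like this is a single dictionary access per query, which is why adding more whole-text examples makes exact detection cheaper rather than slower.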
Thanks for the quick response @pombredanne! There is a lot to unpack in your answer, and I am not familiar enough with the ScanCode codebase to understand all aspects, but from what I understand, your suggestions focus on how to get more accurate results. I don't see how these would target false positives. The CPAL example is the best one since there is so much matching text. If any word other than CPAL were replaced, this match would be accurate. So if ScanCode does not know that CPAL is important, how would it be able to exclude this match?
My initial thought was to annotate rules with key words that should be included (probably in the yaml)
Sounds like a good idea!
I have to admit that I don't follow you here. Could you elaborate?
Sure!
Quite simply if we have a detected rule such as:
... the rule "knows" that there is a
The plan with #1379 is to find that
Seems that there are issues that address the problems I listed in my original report, except for maybe
If this is the case then maybe we should close this issue? If we find more examples of false positives with similar properties we can always reopen it.
I think that the contribution of @petergardfjall in #2637 addresses all the points raised here. I am closing now as we have also made major improvements with #2878.
I have come across several instances of matches that a human can easily determine are wrong, though I can totally see why the matching algorithm considers them hits. I have included a couple of examples from https://github.com/x-stream/xstream
Example 1:
The text
Matches cpal-1.0_11.RULE (score 90)
The texts are very similar, but since CPAL is not mentioned in the text (whereas BSD is) it is easy for a human to conclude that this is not a CPAL license.
By the way, the text above also matches the correct license, bsd-new_292.RULE, but the score is only 16.
Example 2:
The text
Matches apache-2.0_or_gpl-2.0-plus_with_classpath-exception-2.0_2.RULE (score 11.43)
Again, there are textual similarities, but neither Apache, GPL nor exception is mentioned in the text.
Have you considered any approaches to minimizing hits like these? One suggestion is to make it possible to specify words that should (or shouldn't) be present to get a match. Alternatively, the presence or absence of such words could affect the score dramatically. These suggestions might be naive though. In that case, maybe there are other ways to approach this?
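To illustrate the first suggestion, here is a rough sketch of filtering matches by required words. Everything in it is hypothetical: the `required_words` annotation, the function names and the short query string are invented for this example (the real matched text is not reproduced here), and only the two rule names and their scores are taken from Example 1.

```python
import re


def tokens(text):
    """Return the set of lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def filter_matches(matches, query_text, required_words):
    """Drop matches whose rule requires words that the query lacks.

    matches: list of (rule_name, score) tuples.
    required_words: hypothetical annotation, {rule_name: set of words}.
    Rules without an annotation are always kept.
    """
    present = tokens(query_text)
    kept = []
    for rule_name, score in matches:
        required = required_words.get(rule_name, set())
        if required <= present:
            kept.append((rule_name, score))
    return kept


# Made-up stand-in for the matched text; it mentions BSD but not CPAL.
query_text = "Released under a BSD style license."
matches = [("cpal-1.0_11.RULE", 90), ("bsd-new_292.RULE", 16)]
required_words = {"cpal-1.0_11.RULE": {"cpal"}}

print(filter_matches(matches, query_text, required_words))
```

With this annotation the CPAL match is discarded despite its high score, and the lower-scoring BSD match is the only one left.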