Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal for avoiding false positives #1838

Closed
furuholm opened this issue Nov 20, 2019 · 6 comments
Closed

Proposal for avoiding false positives #1838

furuholm opened this issue Nov 20, 2019 · 6 comments

Comments

@furuholm
Copy link
Contributor

I have come across several instances of matches that a human easily can determine is wrong, but I can totally get why the matching algorithm considers a hit. I have included a couple of examples from https://github.com/x-stream/xstream

Example 1:

The text

The software in this package is published under the terms of the BSD
style license a copy of which has been included with this distribution in
the LICENSE.txt file.

Matches cpal-1.0_11.RULE (score 90)

* The software in this package is published under the terms of the CPAL v1.0
* license, a copy of which has been included with this distribution in the
* LICENSE.txt file.

The texts are very similar, but since CPAL is not mentioned in the text (wheras BSD is) it is easy for a human to conclude that this is not a CPAL license.

By the way. The text above also matches the correct license bsd-new_292.RULE. But the score is only 16.

a BSD-style license

Example 2:

The text

         <url>http://x-stream.github.io/license.html</url>
      <distribution>repo</distribution>
    </license>
  </licenses>

Matches apache-2.0_or_gpl-2.0-plus_with_classpath-exception-2.0_2.RULE (score 11.43)

  <licenses>
    <license>
      <name>Apache License, Version 2.0</name>
      <url>http://www.apache.org/licenses/LICENSE-2.0</url>
      <distribution>repo</distribution>
    </license>
    <license>
      <name>GNU General Public License (GPL) version 2, or any later version</name>
      <url>http://www.gnu.org/licenses/</url>
      <distribution>repo</distribution>
    </license>
    <license>
      <name>GPLv2 with Classpath exception</name>
      <url>http://www.gnu.org/software/classpath/license.html</url>
      <distribution>repo</distribution>
    </license>
</licenses>

Again, there are textual similarites, but neither Apache, GPL or exception is mentioned in the text.

Have you considered any approaches to minimizing hits like these? One suggestion is to make it possible to specify words that should (or shouldn't) be present to get a match? Alternatively affect the score dramatically? These suggestions might be naive though. In that case, maybe there are other ways to approach this?

@pombredanne
Copy link
Member

@furuholm thank you for this detailed ticket.
Eventually the scancode license detection does a pair-wise diff so this can effectively catch things that are similar texts that differ only by a few words, and that's both the strength and weakness of the approach. (I like of it only as a strength since it can catch things it does not know about).

You wrote:

One suggestion is to make it possible to specify words that should (or shouldn't) be present to get a match?

We could do that, in fact we have a list of important "legalese" words https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/legalese.py

But practically how would you do this? annotate rules with essential words? Or could there be some automated process to say if this rule is matched but these words are not matched, then this is not a match to this rule?

Alternatively affect the score dramatically?

The legalese words are used to prioritize the matching and diff, but are not used in the score computation. We could have an alternate score that would take then into account

These suggestions might be naive though. In that case, maybe there are other ways to approach this?

Here are a few other ways to resolve this:

  1. add more detection rules.
    The license detection is a multi layered process. After some preparation and using a "legalese language model" to tokenize the input, these detections are done in sequence:

a. whole text exact match using a checksum
b. contained text exact exact using an automaton
c. approximate ranked match using bit vectors and token vectors, used as an input to a pair wise diff

Because of c., the more license text and examples the more are detected faster exactly. Because the automaton is trie-based, adding more rules does not grow the size of the index as much.

So within reason the more rules there are the faster and the more accurate detection will be.

  1. for the specific case of the Maven POM (and for other structured data license bits) the idea would be to detect based on the structure and not in the textual tags. There is a ticket for that in Add general support to detect "structured" and tagged license notices  #707
    That could mean eventually collecting all the POMs and building an mapping to actual licenses (and repeat this for all packages managers).
    We need to do this in any case.

  2. for rules that are short (e.g. less than ~ 20 words) and still really relevant ( such as
    bsd-new_292.RULE) we should tag their .yml with with a high relevance
    See for instance this pair:

  1. for the first case where a BSD is mistaken for a CPAL, there is also another route which is to correct an incomplete notice match based on the license detected in a referenced filename (here a LICENSE.txt) See Deal with "This file is distributed under the same license as the..." #1379

My first reaction here would be to add new rules (e.g. my suggestion 1.) but ideally we should follow your approach for scoring and some of the other suggestions I maed

@pombredanne
Copy link
Member

@furuholm I am tempted to qualify this is as a bug rather than an enhancement request. What do you think?

@furuholm
Copy link
Contributor Author

Thanks for the quick response @pombredanne!

There is a lot to unpack in your answer, and I am not familiar enough with the ScanCode codebase to understand all aspects, but from what I understand from your answer your suggestions are focused on how to get more accurate results. I don't see how these would target false positives. The CPAL example is the best I since there is so much matching text. If any other word than CPAL was replaced, this match would be accurate. So if ScanCode does not know that CPAL is important, how would it be able to exclude this match?

But practically how would you do this? annotate rules with essential words? Or could there be some automated process to say if this rule is matched but these words are not matched, then this is not a match to this rule?

My initial thought was to annotate rules with key words that should be included (probably in the yaml)

for rules that are short (e.g. less than ~ 20 words) and still really relevant (such as
bsd-new_292.RULE) we should tag their .yml with with a high relevance

Sounds like a good idea!

for the first case where a BSD is mistaken for a CPAL, there is also another route which is to correct an incomplete notice match based on the license detected in a referenced filename (here a LICENSE.txt) See #1379

I have to admit that I don't follow you here. Could you elaborate?

I am tempted to qualify this is as a bug rather than an enhancement request. What do you think?

Sure!

@pombredanne
Copy link
Member

for the first case where a BSD is mistaken for a CPAL, there is also another route which is to correct an incomplete notice match based on the license detected in a referenced filename (here a LICENSE.txt) See #1379

I have to admit that I don't follow you here. Could you elaborate?

Quite simply if we have a detected rule such as:

* The software in this package is published under the terms of the CPAL v1.0
* license, a copy of which has been included with this distribution in the
* LICENSE.txt file.

... the rule "knows" that there is a LICENSE.txt filename that is referenced in its data file: https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/rules/cpal-1.0_11.yml#L4

The plan with #1379 is to find that LICENSE.txt file in the scan and use its detected license to update the match/do something. In this case, https://raw.githubusercontent.com/x-stream/xstream/master/LICENSE.txt is a bsd-new that would be clearly detected and could be used to refine the imperfect match to the CPAL rule.

@furuholm
Copy link
Contributor Author

furuholm commented Dec 2, 2019

Seems that there are issues that addresses the the problems I listed in my original report, except for maybe

for rules that are short (e.g. less than ~ 20 words) and still really relevant (such as
bsd-new_292.RULE) we should tag their .yml with with a high relevance

If this is the case then maybe we should close this issue? If we find more examples of false positives with similar properties we can always reopen it.

@pombredanne
Copy link
Member

I think that the contribution of @petergardfjall in #2637 addresses all the point raised here. I am closing now as we have made major improvements also with #2878

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants