[RFC] Scans of License Texts, "is_license_text" plugin related #2164

Open
AyanSinhaMahapatra opened this issue Aug 17, 2020 · 4 comments

@AyanSinhaMahapatra
Member

While going through this issue, I scraped and collected license texts and ran the ScanCode license scan on them. I found that some of these files, even though they consist entirely of license text, do not have the “is_license_text” value (set by the plugin) as True.

The plugin works as follows, quoting from the docstring -

    Set the "is_license_text" flag to true for at the file level for text files
    that contain mostly (as 90% of their size) license texts or notices.
    Has no effect unless --license, --license-text and --info scan data
    are available.
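
To make the 90% rule in the docstring concrete, here is a minimal sketch of such a check. It is not the plugin's actual code: the function name, its parameters, and the choice to compare the combined length of the matched texts against the file size are all assumptions made for illustration.

```python
def is_mostly_license_text(file_size, matched_texts, threshold=0.9):
    """Return True if the matched license text covers at least `threshold`
    of the file size. Hypothetical helper, not the ScanCode implementation."""
    if file_size <= 0 or not matched_texts:
        return False
    matched_length = sum(len(text) for text in matched_texts)
    return matched_length / file_size >= threshold


# A 1000-character file where 950 characters were matched as license text.
print(is_mostly_license_text(1000, ["x" * 950]))  # True

# The same file size, but only 850 characters matched, for example because
# scraping left extra text around the license.
print(is_mostly_license_text(1000, ["x" * 850]))  # False
```

Under this reading, a fairly small amount of unmatched text (a little over 10% of the file) is enough to leave the flag unset.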

These are the files:

Free Art License 1.3.txt
GNU Lesser General Public License 3.0.txt
Lawrence Berkeley National Labs BSD Variant License (BSD-3-Clause-LBNL).txt
Open Government Licence 1.0 (United Kingdom).txt
Open Government Licence 2.0 (United Kingdom).txt
Open Government Licence 3.0 (United Kingdom).txt
Open License 2.0 France.txt
Quebec Free License - Permissive (LiLiQ-P) version 1.1.txt
University of Illinois - NCSA Open Source License.txt
X.Net License.txt

Scan results in this file -

false_is_lic_text.json.txt

So, assuming this is a valid case, we would have to handle these files differently, as they are not detected easily.

Questions:-

  1. Maybe this is because there’s some extra text with the license texts?
  2. Still, they should at least be detected as a license file I presume, as more than 90% of their content is license words?
  3. Does this have anything to do with legalese words? Also, how often and in which cases do you update the legalese words, and what is that process?
@AyanSinhaMahapatra
Member Author

I might be missing something obvious (there's a GPL 3.0 text, so it seems weird that it won't be detected as a license text file), apologies if that's the case.
Also, this is an RFC for these questions.

@AyanSinhaMahapatra
Member Author

AyanSinhaMahapatra commented Aug 17, 2020

Also, another thought about how "is_legal" is determined: files are flagged True if the file name is something like "LICENSE"/"COPYING" etc., but in these cases, as the file names are the full license names, they don't have "is_legal" True.
I was wondering if these cases could be included? Is there any reason not to?

Edit: This was addressed.
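
The filename heuristic and the proposed extension can be sketched as follows. This is an illustration only, not ScanCode's classification code: `LEGAL_STEMS`, `is_legal_name`, and the `known_license_names` parameter are hypothetical names, and the exact-match rule on the file stem is a simplifying assumption.

```python
from pathlib import PurePath

# Hypothetical list of conventional legal file names, for illustration only.
LEGAL_STEMS = {"license", "licence", "copying", "copyright", "notice"}


def is_legal_name(filename, known_license_names=frozenset()):
    """Return True if the file name (without extension) is a conventional
    legal file name or, as proposed above, a known full license name."""
    stem = PurePath(filename).stem.lower()
    return stem in LEGAL_STEMS or stem in known_license_names


# Assumed set of lowercased full license names.
full_names = {"free art license 1.3", "x.net license"}

print(is_legal_name("COPYING.txt"))                           # True
print(is_legal_name("Free Art License 1.3.txt"))              # False
print(is_legal_name("Free Art License 1.3.txt", full_names))  # True
```

Keeping the full license names in a separate set keeps the conventional-name check unchanged and makes the extension easy to switch on or off.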

@pombredanne
Member

Sorry for the late reply:

  • Maybe this is because there’s some extra text with the license texts?

It seems that each file may contain less than 90% license text. Inaccurate detection is the likely cause, which may be fixed with new rules. We would need to check with --license-text-diagnostics what the actual detailed issues are, though.

  • Still, they should at least be detected as a license file I presume, as more than 90% of their content is license words?

If 90% of the content is license text, then yes, they should.

  • Does this have anything to do with legalese words? Also, how often and in which cases do you update the legalese words, and what is that process?

The main impact of legalese is on ranking matches in the "token set match" step and the subsequent "sequence match" step. The legalese words are infrequently updated, and updates are based on observation of test failures (though we could also redefine them as a whole using a purely data-driven approach and by collecting stats).
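
As a rough illustration of how legalese words could influence ranking in a token set step, here is a toy sketch. It is not ScanCode's matching algorithm: the `LEGALESE` word list, the rule names, and the scoring tuple are all assumptions chosen to show the idea that shared legalese tokens count for more than shared common words.

```python
# Hypothetical legalese vocabulary, for illustration only.
LEGALESE = {"warranty", "liability", "redistribute", "licensor", "sublicense"}


def rank_candidates(query_tokens, rules):
    """Rank candidate rules by shared legalese tokens first, then by the
    total number of shared tokens."""
    query = set(query_tokens)

    def score(item):
        _name, tokens = item
        common = query & set(tokens)
        return (len(common & LEGALESE), len(common))

    ranked = sorted(rules.items(), key=score, reverse=True)
    return [name for name, _tokens in ranked]


query = "you may redistribute this software without warranty or liability".split()
rules = {
    "readme-boilerplate": "you may use this software or".split(),
    "permissive-notice": "redistribute without warranty liability".split(),
}

# "readme-boilerplate" shares five common words with the query, while
# "permissive-notice" shares only four, but three of those are legalese.
print(rank_candidates(query, rules))  # ['permissive-notice', 'readme-boilerplate']
```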

Also, another thought about how "is_legal" is determined: files are flagged True if the file name is something like "LICENSE"/"COPYING" etc., but in these cases, as the file names are the full license names, they don't have "is_legal" True.
I was wondering if these cases could be included? Is there any reason not to?

That's a good idea and point. Please create a ticket so we can implement that improvement!

@AyanSinhaMahapatra
Member Author

It seems that each file may contain less than 90% license text. Inaccurate detection is the likely cause, which may be fixed with new rules. We would need to check with --license-text-diagnostics what the actual detailed issues are, though.

Yes, there is indeed some extra text because of scraping. I got rid of some of the obvious cases manually, but the rest seem best handled with the diagnostics flag on, on a case-by-case basis. So I'll do that.
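
To decide which files need a case-by-case look, the scan results can be filtered for files where the flag is not set. The sketch below assumes the JSON output has a top-level "files" list whose entries carry "path", "type", and "is_license_text" keys; the sample data and the `files_missing_flag` helper are made up for illustration.

```python
import json


def files_missing_flag(scan):
    """Return paths of scanned files whose is_license_text flag is not true."""
    return [
        entry["path"]
        for entry in scan.get("files", [])
        if entry.get("type") == "file" and not entry.get("is_license_text")
    ]


# Made-up sample shaped like the assumed scan output.
sample = json.loads("""
{"files": [
  {"path": "licenses", "type": "directory"},
  {"path": "licenses/X.Net License.txt", "type": "file", "is_license_text": false},
  {"path": "licenses/mit.txt", "type": "file", "is_license_text": true}
]}
""")

print(files_missing_flag(sample))  # ['licenses/X.Net License.txt']
```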

The main impact of legalese is on ranking matches in the "token set match" step and the subsequent "sequence match" step. The legalese words are infrequently updated, and updates are based on observation of test failures (though we could also redefine them as a whole using a purely data-driven approach and by collecting stats).

Understood, thanks.

Also, another thought about how "is_legal" is determined: files are flagged True if the file name is something like "LICENSE"/"COPYING" etc., but in these cases, as the file names are the full license names, they don't have "is_legal" True.
I was wondering if these cases could be included? Is there any reason not to?

That's a good idea and point. Please create a ticket so we can implement that improvement!

Okay, I'll create a ticket.

2 participants