[RFC] Scans of License Texts, "is_license_text" plugin related #2164

AyanSinhaMahapatra · 2020-08-17T09:26:36Z

While working on this issue, I scraped and collected license texts and ran the scancode license scan on them. I found that some of these files, even though they are entirely license files, do not have the "is_license_text" value (set by the plugin) as True.

The plugin works as follows, quoting from its docstring:

    Set the "is_license_text" flag to true for at the file level for text files
    that contain mostly (as 90% of their size) license texts or notices.
    Has no effect unless --license, --license-text and --info scan data
    are available.
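
To make the 90% rule in the docstring concrete, here is a minimal sketch of such a coverage check. It is not the plugin's actual code: the function names, the use of matched text lengths, and the exact comparison are assumptions made for illustration only.

```python
def license_text_coverage(file_size, matched_text_lengths):
    """Return the fraction of the file covered by matched license text."""
    if file_size <= 0:
        # An empty file cannot contain license text.
        return 0.0
    # Cap at 1.0 in case overlapping matches add up to more than the file size.
    return min(sum(matched_text_lengths) / file_size, 1.0)


def looks_like_license_text(file_size, matched_text_lengths, threshold=0.9):
    """Return True when matched text covers at least `threshold` of the file.

    The 0.9 default mirrors the "90% of their size" wording in the docstring;
    how the real plugin measures size is an assumption here.
    """
    return license_text_coverage(file_size, matched_text_lengths) >= threshold
```

Under these assumptions, a 1000-character file with 600 and 350 characters of matched text would be flagged, while extra scraped text that pushes the matched share below 90% would leave the flag unset.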

These are the files:

Free Art License 1.3.txt
GNU Lesser General Public License 3.0.txt
Lawrence Berkeley National Labs BSD Variant License (BSD-3-Clause-LBNL).txt
Open Government Licence 1.0 (United Kingdom).txt
Open Government Licence 2.0 (United Kingdom).txt
Open Government Licence 3.0 (United Kingdom).txt
Open License 2.0 France.txt
Quebec Free License - Permissive (LiLiQ-P) version 1.1.txt
University of Illinois - NCSA Open Source License.txt
X.Net License.txt

The scan results are in this file:

false_is_lic_text.json.txt
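
For anyone who wants to reproduce the list above from a scan result, a short sketch follows. It assumes the JSON output has a top-level "files" list whose entries carry "path", "type" and "is_license_text" keys; the helper names are hypothetical.

```python
import json


def load_scan(path):
    """Load a JSON scan result from disk (the path is supplied by the caller)."""
    with open(path, encoding="utf-8") as scan_file:
        return json.load(scan_file)


def files_missing_license_text_flag(scan):
    """Return paths of scanned files whose "is_license_text" is absent or false.

    Assumes a top-level "files" list with "path", "type" and
    "is_license_text" keys on each entry; directories are skipped.
    """
    return [
        entry.get("path")
        for entry in scan.get("files", [])
        if entry.get("type") == "file" and not entry.get("is_license_text")
    ]
```

Calling `files_missing_license_text_flag(load_scan("false_is_lic_text.json.txt"))` should then list the files in question, provided the attachment follows that layout.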

Assuming this is a valid case, we would have to handle these files differently, as they are not detected easily.

Questions:

Maybe this is because there’s some extra text with the license texts?
Still, I presume they should at least be detected as license files, since more than 90% of their content is license words?
Does this have anything to do with legalese words? Also, how often and in which cases do you update the legalese words, and what is that process?

AyanSinhaMahapatra · 2020-08-17T09:28:32Z

I might be missing something obvious (there's a GPL 3.0 text, so it seems odd that it isn't detected as a license text file); apologies if that's the case.
This is also an RFC for these questions.

AyanSinhaMahapatra · 2020-08-17T09:30:30Z

Another thought, about how "is_legal" is determined: files are flagged True if the file name is something like "LICENSE" or "COPYING", but in these cases the file names are the full license names, so they don't have "is_legal" set to True.
I was wondering if these cases could be included. Is there any reason not to?

Edit: This was addressed.
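
As an illustration of the idea above, here is a rough sketch contrasting a name check that only accepts names beginning with a legal keyword against a relaxed one that accepts the keyword anywhere in the name. The hint list and both function names are hypothetical and are not taken from ScanCode.

```python
import os
import re

# Hypothetical hint words; not the list ScanCode actually uses.
LEGAL_NAME_HINTS = ("license", "licence", "copying", "notice", "copyright")


def base_name(file_name):
    """Return the lowercased file name without its final extension."""
    return os.path.splitext(file_name)[0].lower()


def is_legal_strict(file_name, hints=LEGAL_NAME_HINTS):
    """Flag names that begin with a hint word, such as LICENSE or COPYING.txt."""
    return base_name(file_name).startswith(hints)


def is_legal_relaxed(file_name, hints=LEGAL_NAME_HINTS):
    """Flag names in which any whole word is a hint word."""
    words = re.findall(r"[a-z]+", base_name(file_name))
    return any(word in hints for word in words)
```

With these assumptions, "GNU Lesser General Public License 3.0.txt" fails the strict check but passes the relaxed one. A broader match like this could also flag unrelated files, so it would need testing against real codebases.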

pombredanne · 2020-10-15T07:53:09Z

Sorry for the late reply:

> Maybe this is because there’s some extra text with the license texts?

It seems each file may contain less than 90% license text. Inaccurate detection is the likely cause, and it may be fixed with new rules. We would need to check with --license-text-diagnostics what the actual detailed issues are, though.

> Still, I presume they should at least be detected as license files, since more than 90% of their content is license words?

If there is 90%, yes, they should.

> Does this have anything to do with legalese words? Also, how often and in which cases do you update the legalese words, and what is that process?

The main impact of legalese is on ranking matches in the "token set match" step and the subsequent "sequence match" step. The legalese words are infrequently updated, and updates would be based on observed test failures (though we could also redefine them as a whole using a purely data-driven approach and collected stats).
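
To illustrate how legalese words might influence ranking in a token set match, here is a toy sketch. It is not ScanCode's matching code: the word list, the rule names, and the scoring (shared legalese tokens first, then all shared tokens) are assumptions chosen only to show the effect.

```python
# Hypothetical legalese vocabulary, for illustration only.
LEGALESE = frozenset(
    {"license", "licensed", "warranty", "copyright", "redistribute", "liability"}
)


def rank_rules_by_token_sets(query_tokens, rules, legalese=LEGALESE):
    """Rank candidate rules by token set overlap with the query.

    `rules` maps a rule name to its list of tokens. Candidates are ordered
    by the number of shared legalese tokens, then by the number of shared
    tokens overall, then by name to keep the order deterministic.
    """
    query = set(query_tokens)
    scored = []
    for name, rule_tokens in rules.items():
        common = query & set(rule_tokens)
        scored.append((len(common & legalese), len(common), name))
    scored.sort(key=lambda item: (-item[0], -item[1], item[2]))
    return [name for _, _, name in scored]
```

In this sketch a rule sharing only "warranty" and "liability" with the query outranks a rule sharing three common words such as "is", "or" and "without", which is the kind of effect a legalese list is meant to have on candidate ranking.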

> Another thought, about how "is_legal" is determined: files are flagged True if the file name is something like "LICENSE" or "COPYING", but in these cases the file names are the full license names, so they don't have "is_legal" set to True.
> I was wondering if these cases could be included. Is there any reason not to?

That's a good idea and a good point. Please create a ticket so we can implement that improvement!

AyanSinhaMahapatra · 2020-10-15T09:32:41Z

> It seems each file may contain less than 90% license text. Inaccurate detection is the likely cause, and it may be fixed with new rules. We would need to check with --license-text-diagnostics what the actual detailed issues are, though.

Yes, there is indeed some extra text because of scraping. I got rid of some of the obvious parts manually, but the rest seems best handled with the diagnostics flag on, case by case. So I'll do that.

> The main impact of legalese is on ranking matches in the "token set match" step and the subsequent "sequence match" step. The legalese words are infrequently updated, and updates would be based on observed test failures (though we could also redefine them as a whole using a purely data-driven approach and collected stats).

Understood, thanks.

> > Another thought, about how "is_legal" is determined: files are flagged True if the file name is something like "LICENSE" or "COPYING", but in these cases the file names are the full license names, so they don't have "is_legal" set to True.
> > I was wondering if these cases could be included. Is there any reason not to?
>
> That's a good idea and a good point. Please create a ticket so we can implement that improvement!

Okay, I'll create a ticket.

AyanSinhaMahapatra assigned pombredanne Aug 17, 2020

AyanSinhaMahapatra added license scan question labels Aug 17, 2020
