[RFC] Scans of License Texts, "is_license_text" plugin related #2164
Comments
I might be missing something obvious (there's a GPL 3.0 text, so it seems weird that it isn't detected as a license text file); apologies if that's the case.
Another thought about how "is_legal" is determined: files are flagged True if the file name is something like "LICENSE" or "COPYING". In these cases the file names are the full license names, so they don't have "is_legal" set to True. Edit: This was addressed.
Sorry for the late reply:
It seems that each file may contain less than 90% license text. Inaccurate detection is the likely cause, and it may be fixed with new rules. We would need to check with --license-text-diagnostics to see what the actual detailed issues are, though.
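To make the 90% point concrete, here is a minimal sketch of a coverage check. It is not ScanCode's actual implementation: the function names, the character-based measure, and the `threshold` parameter are assumptions made for illustration only.

```python
def license_text_coverage(matched_texts, file_text):
    """Return the fraction of file_text characters covered by matched license text.

    Hypothetical helper: coverage is measured in characters here, which may
    differ from how the real plugin measures it.
    """
    total = len(file_text)
    if total == 0:
        return 0.0
    matched = sum(len(text) for text in matched_texts)
    # Cap at 1.0 in case matches overlap and are counted twice.
    return min(matched / total, 1.0)


def is_mostly_license_text(matched_texts, file_text, threshold=0.9):
    """Return True when at least `threshold` of the file is matched license text."""
    return license_text_coverage(matched_texts, file_text) >= threshold
```

Under these assumptions, a scraped file with extra page text around the license, or a file where detection misses part of the text, drops below the threshold even though a reader would call it a license file.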
If there is 90% license text, yes, they should.
Legalese's main impact is on ranking matches in the "token set match" step and the subsequent "sequence match" step. The legalese words are infrequently updated, and updates would be based on observation of test failures (though we could also redefine them as a whole using a purely data-driven approach and collected stats).
That's a good idea and point. Please create a ticket so we can implement that improvement!
Yes, there is indeed some extra text because of scraping. I got rid of some of the obvious ones manually, but the rest seems best handled with the diagnostics flag on, case by case. So I'll do that.
Understood, thanks.
Okay, I'll create a ticket.
While going through this issue, I scraped and collected the licenses and ran the scancode license scan on them. I found that some of these license files, even though they consist entirely of license text, do not have the "is_license_text" (the plugin) value set to True.
The plugin works as follows, quoting from the docstring -
These files
Free Art License 1.3.txt
GNU Lesser General Public License 3.0.txt
Lawrence Berkeley National Labs BSD Variant License (BSD-3-Clause-LBNL).txt
Open Government Licence 1.0 (United Kingdom).txt
Open Government Licence 2.0 (United Kingdom).txt
Open Government Licence 3.0 (United Kingdom).txt
Open License 2.0 France.txt
Quebec Free License - Permissive (LiLiQ-P) version 1.1.txt
University of Illinois - NCSA Open Source License.txt
X.Net License.txt
Scan results in this file -
false_is_lic_text.json.txt
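A list like the one above could be pulled from a scan results file with a short script. This is only a sketch under assumptions: it assumes the JSON has a top-level "files" list whose entries carry "path", "type", and "is_license_text" keys, and the helper names are hypothetical.

```python
import json


def load_scan(path):
    """Load a scan results JSON file into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def files_not_flagged(scan, key="is_license_text"):
    """Return paths of scanned files whose `key` flag is missing or false.

    Assumes each entry in scan["files"] has "path" and "type" keys;
    directories are skipped.
    """
    return [
        entry["path"]
        for entry in scan.get("files", [])
        if entry.get("type") == "file" and not entry.get(key)
    ]
```

For example, `files_not_flagged(load_scan("false_is_lic_text.json.txt"))` would return the paths to inspect further with the diagnostics option, provided the file follows the assumed layout.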
Assuming this is a valid case, we would have to handle these files differently, as they are not detected easily.
Questions:-