for license references can we somehow include 2-3 words after and before the detected keywords? #1122
Comments
Hello, I'd be interested in something like this: (\w+\W+\w+\W+(?is)(L|l)icen(s|c)e(?is)\W+\w+\W+\w+). Thanks again for your help!
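For illustration, the idea behind the pattern above can be sketched in plain Python, outside of ScanCode. This is only a sketch under assumptions: the helper names `license_context_pattern` and `find_license_context` are made up here and are not part of ScanCode, and the number of context words is a parameter instead of being fixed at two. The inline `(?is)` flags are replaced with `re.IGNORECASE`, since recent Python versions (3.11 and later, as far as I know) reject global inline flags that are not at the start of the pattern.

```python
import re


def license_context_pattern(n=2):
    """Build a pattern matching up to `n` words before and after the keyword.

    Hypothetical helper, not a ScanCode API. Matches license/licence and
    derived forms such as "licensed" or "licenses", ignoring case.
    """
    word_before = r"(?:\w+\W+)"  # one word followed by its separator
    word_after = r"(?:\W+\w+)"   # one separator followed by a word
    return re.compile(
        rf"{word_before}{{0,{n}}}licen[sc]e\w*{word_after}{{0,{n}}}",
        re.IGNORECASE,
    )


def find_license_context(text, n=2):
    """Return each keyword hit together with its surrounding words."""
    return [m.group(0) for m in license_context_pattern(n).finditer(text)]


if __name__ == "__main__":
    for sample in ("See the license in the root folder", "License: MIT"):
        print(find_license_context(sample))
```

Making the context words optional (`{0,n}`) is a deliberate choice: with a fixed count, a keyword at the very start or end of a file, as in `License: MIT`, would not match at all.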
@muzsielod thanks for this ticket. Getting some words reported before and after the word "license" would therefore help you figure out what the license statement is and where to look for this. Is this correct? A few things for your consideration: in #377 there are some discussions of something similar:
... and also here #377 (comment) Also, a generic "see-license" license key was added with several common rules with this commit 7018e94#diff-8e8b79632a0dacaa9fc2321ab1deaead ... see also this search https://github.com/nexB/scancode-toolkit/search?q=%22see-license%22&type=Code You wrote also:
We could add a regex-based matcher to the ScanCode license detection, but I am not sure this is the best way, since generally speaking the detection is word-based and not character-based. Instead, for a start, I would be interested in getting the text of the examples you want to detect so we can find the best solution.
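To illustrate the word-based alternative, here is a minimal sketch that splits each line into word tokens and reports a window of tokens around each keyword hit, together with its line number. It is not how ScanCode's detection is implemented; `KEYWORDS` and `keyword_windows` are hypothetical names chosen for this example, and the keyword set is an assumption that can be extended.

```python
import re

# Assumed keyword set for this sketch: both spellings plus common derived forms.
KEYWORDS = {"license", "licence", "licenses", "licences", "licensed", "licenced"}


def keyword_windows(text, keywords=KEYWORDS, n=3):
    """Return (line_number, context) pairs for every keyword hit.

    The context is the keyword plus up to `n` word tokens on each side,
    taken from the same line. Matching is case-insensitive.
    """
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        tokens = re.findall(r"\w+", line)  # word tokens only, punctuation dropped
        for i, token in enumerate(tokens):
            if token.lower() in keywords:
                window = tokens[max(0, i - n): i + n + 1]
                hits.append((line_no, " ".join(window)))
    return hits


if __name__ == "__main__":
    sample = (
        "Copyright (c) 2018 Google LLC.\n"
        "License that will be placed inside of all created bundles."
    )
    for line_no, context in keyword_windows(sample):
        print(line_no, context)
```

Working on tokens rather than characters keeps the window size exact in words and makes it easy to add further keywords without touching a pattern.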
Hi @pombredanne.
I will explain what I did below:
line 24 "License that will be placed inside of all created bundles." - which is not relevant information
The problem is that in every case the matched_text returned is just "license", so in this one file alone I have to check 4-5 times (and some files are huge, with hundreds of lines) whether a hit is relevant information or just a random keyword.
My idea with the regex was to specify a license keyword in a way that tells ScanCode to list more than just "license" in the matched_text when it detects the keyword, as it does when it detects an MIT License with a different copyright holder: "[Copyright] ([c]) [2018] [Google] [LLC]." In my case it should look like this:
I ask because ScanCode detects a lot of license references, but in some cases, such as two-word references like (License:MIT) or new short licenses/license references that have not been added to the tool, it will miss the license hit.
I also read the Scan deduction and summarization. I find it a good idea; however, there are cases where a single file contains multiple references which (by additional text, component name or project name) refer to different packages, and I think it's safer to verify those references manually, because you can miss important things. First it would be good to detect the references; from there you can review them manually.
I think this gives you a better overview of my idea, but tell me if there is something else I should explain. Thank you.
@muzsielod ok, so this is more or less something such as https://github.com/angular/universal/blob/89856090b4cc92612124584513ff8c717a25618c/build-config.js
You could instead create only one license key (maybe without a text) and multiple rules for the English and US spellings (and possibly rules for singular and plural). Case is always ignored. Note that in that file the MIT license is detected correctly using the
Are you really looking to detect all the mentions of How often do you find issues with licenses that are not correctly detected? (to me this would always be a bug) It may be simpler to fix these bugs with rules one at a time? Now, adding this "license" word detection could be done alright, but it would have to be an extra CLI option, as it will generate a lot of noise that not many folks would care for... In any case this would not be a regex but rather something that catches the
@muzsielod any feedback?
Hi @pombredanne, yes, I want to detect all the license keywords and even add other ones if this is somehow manageable; there are several keywords that detect short permission notices or licenses, like the keyword "permission". I'd like to create a few one-keyword licenses of this type, with 2-3 detected words before and after. Afterwards I'll go through the report manually (review) and verify just the files which have potential license or reference hints. Unfortunately I didn't keep a list of the incorrectly detected licenses, but if it helps you, in the future I'll try to do that and send them to you. Anyhow, only a few were incorrectly detected; the bigger problem was with the above-mentioned short references or new licenses. I know that for the basic user these extra keyword licenses will be seen as "noise", but I'd like to catch them all. Can you help me or explain to me how to do that SPDX-license-identifier-like detection for the "license" key, or does this have to be a code change in ScanCode? I'd like to test it out and experiment with it if this is somehow possible, and come back with the results. Thank you for your feedback.
@muzsielod sorry for not replying earlier... somehow this ticket slipped through unnoticed.
This is now the default with the
Hello,
I usually scan components for myself with the Scancode to verify the licensing of the files.
And in most cases there are references to some license, stated as "See the license in the root" or "licensed under the content of the LICENSE file". So I created a license which detects all the "license" keywords in a component. As you can imagine, there are a lot of license hits even in a small component. Is there a way to create a rule or modify the license entry so that ScanCode detects 2-3 words before and after the "license" keyword and prints them in the HTML report? It would help a lot in filtering out the false "license" hits in the files' content without having to check the files manually.
Thank you for your help.