Add or test all licenses in https://github.com/okfn/licenses #863

Open · Tracked by #1825

pombredanne opened this issue Dec 2, 2017 · 18 comments

Comments

@pombredanne (Member)

We likely have most of these already, but we need to: 1. add tests, 2. add rules, and 3. eventually add new licenses.

@SaravananOffl (Contributor) commented Dec 3, 2017

@pombredanne This site (http://licenses.opendefinition.org/licenses/groups/all.json) lists more than 90 licenses. How do you want to go about it? I know most of them are already available in scancode, but running scans manually is still a huge task, right?

@starlord1311

Can I take up this issue? @pombredanne

@pombredanne (Member, Author)

@starlord1311 I think @SaravananOffl is working on it, though you two could split up the work.

@pombredanne (Member, Author)

@SaravananOffl

How do you want to go about it? I know most of them are already available in scancode, but running scans manually is still a huge task, right?

IMHO:

  1. fetch the licenses so that we have a text for each
  2. add these as tests
  3. for the ones that do not pass the tests, add a new rule or a new license

@SaravananOffl (Contributor)

@pombredanne I'm thinking of writing a Python script to get (i.e. to scrape) the name of each license from the JSON file (http://licenses.opendefinition.org/licenses/groups/all.json). This should certainly reduce the manual work for us.
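Something along these lines could be a starting point (a rough sketch only; it assumes the `requests` library is available and that all.json is a JSON object keyed by license id, with `title` and `url` fields per entry — worth double-checking against the actual file):

```python
# Rough sketch: list the licenses published by opendefinition.org so we know
# what scancode needs to cover. The field names ("title", "url") are
# assumptions about the all.json format and should be verified.
import requests

ALL_LICENSES_URL = "http://licenses.opendefinition.org/licenses/groups/all.json"

def list_licenses():
    response = requests.get(ALL_LICENSES_URL)
    response.raise_for_status()
    licenses = response.json()
    for license_id, info in sorted(licenses.items()):
        print(license_id, "-", info.get("title", ""), "-", info.get("url", ""))
    return licenses

if __name__ == "__main__":
    list_licenses()
```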

@yash-nisar (Contributor)

@SaravananOffl Are you done with the script?

@avirlrma (Contributor) commented Feb 5, 2018

@SaravananOffl Any word?
@pombredanne I'd like to take this if the assignee is inactive.

@pombredanne (Member, Author)

@aviral1701 go for it :)

@dakshaladia

Hi, I wish to contribute to this issue. Is it available?

@pombredanne (Member, Author)

@dakshaladia Sorry for the late reply... actually @AyanSinhaMahapatra is already working on that one

@AyanSinhaMahapatra (Member) commented Mar 16, 2020

@pombredanne I've already written a script to download all these license texts (and some notices) and have run a scan on them. I'm in the process of analyzing the results, as the scan result JSON is pretty long (about 10k lines). I had a few questions, btw:

  1. The script I wrote is similar to the script added before here, but that one just extracted the links to the licenses and output a file with all the links; mine extracts all links from the JSON files in this directory https://github.com/okfn/licenses/tree/master/licenses, then scrapes those pages for the license texts and downloads them into named files (a rough sketch of that download step follows this list). Should this script also go in a PR like the previous one I linked? (Note that some manual work was still needed afterwards, though not much, as this script basically only downloads license texts from https://opensource.org/; the other 10-20 licenses from different sources had to be downloaded separately.)

  2. Here in this comment where you elaborate on how to go about solving this type of issue, you mentioned that each new license creation should go in its own PR, and that "my suggestion is to start small, one gentoo file at a time, not all at once: otherwise this would be too much work to review at once for us".
    Is this valid for this one too? Would you suggest something like in that comment?

  3. Each of the files here (scraped license texts) should also be added to tests/licensedcode/data/licenses. Should all the test files be pushed in one PR?

  4. You say "for each file that is not detected 100%, create a license rule". Is this valid for very old/deprecated licenses that exist in okfn/licenses too? Should I give examples, as this is a case-specific question, or is adding everything preferred?
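For reference, here is a rough sketch of the download step mentioned in point 1 (not the actual script; it assumes each JSON file under licenses/ in okfn/licenses has a `url` field pointing at the license page, and it uses the GitHub contents API to list that directory; the raw page is saved as-is, so the manual cleanup mentioned above would still be needed):

```python
# Sketch only, not the script discussed above. Assumes each licenses/*.json
# file in okfn/licenses carries a "url" field pointing at the license page.
import os
import requests

CONTENTS_API = "https://api.github.com/repos/okfn/licenses/contents/licenses"
OUTPUT_DIR = "downloaded-license-texts"

def download_all():
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    for entry in requests.get(CONTENTS_API).json():
        if not entry["name"].endswith(".json"):
            continue
        meta = requests.get(entry["download_url"]).json()
        page_url = meta.get("url")
        if not page_url:
            continue
        page = requests.get(page_url)
        # Save the raw page; extracting just the license text from the HTML is
        # where the manual cleanup mentioned above comes in.
        out_name = entry["name"].replace(".json", ".html")
        with open(os.path.join(OUTPUT_DIR, out_name), "w", encoding="utf-8") as out:
            out.write(page.text)

if __name__ == "__main__":
    download_all()
```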

Btw, here are the scan results:

"summary": {
    "license_expressions": [
      {
        "value": "unknown",
        "count": 16
      },
      {
        "value": "free-unknown",
        "count": 3
      },
      {
        "value": "other-permissive",
        "count": 3
      },
      {
        "value": null,
        "count": 2
      },

This is a rough summary of the license detection, out of 114 license texts.

@AyanSinhaMahapatra (Member)

@pombredanne Could you take a look at these questions above?

@pombredanne (Member, Author)

I have run a scan on them. I'm in the process of analyzing the results, as the scan result JSON is pretty long (about 10k lines).

You may want to run one scan per file for ease of handling too.
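For example, something like this (a sketch only; it assumes a scancode executable on the PATH and uses the standard --license and --json-pp options, with the texts in a local "downloaded-license-texts" directory):

```python
# Sketch: run one scancode scan per downloaded license text so each result
# file stays small and easy to review. Assumes scancode is on the PATH.
import os
import subprocess

INPUT_DIR = "downloaded-license-texts"
RESULTS_DIR = "scan-results"

os.makedirs(RESULTS_DIR, exist_ok=True)
for name in sorted(os.listdir(INPUT_DIR)):
    in_path = os.path.join(INPUT_DIR, name)
    out_path = os.path.join(RESULTS_DIR, name + ".json")
    subprocess.run(
        ["scancode", "--license", "--json-pp", out_path, in_path],
        check=True,
    )
```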

@pombredanne (Member, Author)

@AyanSinhaMahapatra you wrote:

  1. The script I wrote is similar to the script added before here, but that one just extracted the links to the licenses and output a file with all the links; mine extracts all links from the JSON files in this directory https://github.com/okfn/licenses/tree/master/licenses, then scrapes those pages for the license texts and downloads them into named files. Should this script also go in a PR like the previous one I linked? (Note that some manual work was still needed afterwards, though not much, as this script basically only downloads license texts from https://opensource.org/; the other 10-20 licenses from different sources had to be downloaded separately.)

This seems like a one-off, so you can instead paste the script in the ticket or related PR comment.
The thing, though, is to ensure that we are not introducing too much bias with web-scraped texts that are unlikely to be the ones used as real license texts.

  2. Here in this comment where you elaborate on how to go about solving this type of issue, you mentioned that each new license creation should go in its own PR, and that "my suggestion is to start small, one gentoo file at a time, not all at once: otherwise this would be too much work to review at once for us".
    Is this valid for this one too? Would you suggest something like in that comment?

In general yes, smaller PRs are easier for new licenses. And for rules, it's OK to have a batch of many new rules at once.

  3. Each of the files here (scraped license texts) should also be added to tests/licensedcode/data/licenses. Should all the test files be pushed in one PR?

We likely already have the licenses for these, so there are probably very few new licenses to add, but we should consider adding rules if they are not detected correctly (and within reason, as the web scraping may be introducing quirks that we may not care for, and therefore a new rule may not always be warranted).

  4. You say "for each file that is not detected 100%, create a license rule". Is this valid for very old/deprecated licenses that exist in okfn/licenses too? Should I give examples, as this is a case-specific question, or is adding everything preferred?

For very old/deprecated licenses we want to have at least a license entry and possibly a few rules for their typical notices, but that should be rather limited rule-wise. Also, the caveat about the bias of screen-scraped texts still applies.
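For texts that are only partially detected, adding a rule is a small change: a rule text plus a metadata file. Roughly something like this (an illustrative sketch only; the rules directory path and the metadata field names are assumptions about the data layout at the time and need to be checked against the repository):

```python
# Illustrative only: write a new rule as a .RULE text file plus a .yml
# metadata file. The directory and the field names below are assumptions.
import os

RULES_DIR = "src/licensedcode/data/rules"

def add_notice_rule(name, license_expression, notice_text):
    with open(os.path.join(RULES_DIR, name + ".RULE"), "w", encoding="utf-8") as rule:
        rule.write(notice_text)
    with open(os.path.join(RULES_DIR, name + ".yml"), "w", encoding="utf-8") as meta:
        meta.write("license_expression: %s\n" % license_expression)
        meta.write("is_license_notice: yes\n")

# Hypothetical example for an old/deprecated license notice:
# add_notice_rule(
#     "example-old-license_notice_1",
#     "example-old-license",
#     "This work is provided under the Example Old License, version 1.",
# )
```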

@AyanSinhaMahapatra (Member)

You may want to run one scan per file for ease of handling too.

This was before we had scancode-results-analyzer; now I don't have to go through all the JSON results, only the important ones for review, so it's okay.

This seems like a one-off, so you can instead paste the script in the ticket or related PR comment.

Okay, sure, I'll paste that instead.

The thing, though, is to ensure that we are not introducing too much bias with web-scraped texts that are unlikely to be the ones used as real license texts.

So yes, as these are scraped there is extra text, and although I cleaned it up a bit manually, some extra text still remains.

We likely already have the licenses for these, so there are probably very few new licenses to add, but we should consider adding rules if they are not detected correctly (and within reason, as the web scraping may be introducing quirks that we may not care for, and therefore a new rule may not always be warranted).

Yes, there aren't a lot of new ones, so we should be good to add just those, get rid of the extra text, and then see if there are any detection problems.

For very old/deprecated licenses we want to have at least a license entry and possibly a few rules for their typical notices, but that should be rather limited rule-wise. Also, the caveat about the bias of screen-scraped texts still applies.

Understood.

@AyanSinhaMahapatra (Member) commented Oct 15, 2020

So my plan was to use these, i.e. all of the licenses in this issue, as tests for scancode-results-analyzer, with the problems being detected and the rules generated as much as possible.

You've tagged me in these two issues in the same sense as well - #2275 (comment) and #2274 (comment). These are also valid tests, as the texts and rules will be generated automatically from the JSON scan results (not just that particular file, but the whole scan of that package).

While looking at the scan results of the whole package to see if these cases are successfully detected, we find more issues that we don't have tickets for; you'd remember this from our conversations, and you wanted me to add them to scancode asap. So I'll open one PR for these where all the rules are automatically generated, and we can discuss the rules, and modifications to scancode-results-analyzer, if there are inconsistencies in them.

Does that sound okay?

@pombredanne (Member, Author)

So I'll open one PR for these where all the rules are automatically generated, and we can discuss the rules, and modifications to scancode-results-analyzer, if there are inconsistencies in them.

Perfect!

@SidharajYadav

How can I contribute?
