Add or test all licenses in https://github.com/okfn/licenses #863

Open · Tracked by #1825

pombredanne opened this issue Dec 2, 2017 · 18 comments

Comments

@pombredanne (Member)

We likely have most of these already, but we need to: 1. add tests, 2. add rules, and 3. eventually add new licenses.

@SaravananOffl (Contributor) commented Dec 3, 2017

@pombredanne This site (http://licenses.opendefinition.org/licenses/groups/all.json) lists more than 90 licenses. How do you want to go about it? I know most of them are already available in scancode, but running scans manually is still a huge task, right?

@starlord1311

Can I take up this issue? @pombredanne

@pombredanne (Member, Author)

@starlord1311 I think @SaravananOffl is working on it, though you two could split up the work.

@pombredanne (Member, Author)

@SaravananOffl

How do you want to go about it? I know most of them are already available in scancode, but running scans manually is still a huge task, right?

IMHO:

  1. fetch the licenses so that we have a text for each
  2. add these as tests
  3. for the ones that do not pass the tests, add a new rule or a new license

@SaravananOffl (Contributor)

@pombredanne I'm thinking of writing a Python script to get (i.e. to scrape) the name of each license from the JSON file (http://licenses.opendefinition.org/licenses/groups/all.json). This should certainly reduce the manual work for us.
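Something along these lines could be a starting point (a rough sketch only; it assumes the `requests` library is available and that all.json is a JSON object keyed by license id, with `title` and `url` fields per entry — worth double-checking against the actual file):

```python
# Rough sketch: list the licenses published by opendefinition.org so we know
# what scancode needs to cover. The field names ("title", "url") are
# assumptions about the all.json format and should be verified.
import requests

ALL_LICENSES_URL = "http://licenses.opendefinition.org/licenses/groups/all.json"

def list_licenses():
    response = requests.get(ALL_LICENSES_URL)
    response.raise_for_status()
    licenses = response.json()
    for license_id, info in sorted(licenses.items()):
        print(license_id, "-", info.get("title", ""), "-", info.get("url", ""))
    return licenses

if __name__ == "__main__":
    list_licenses()
```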

@yash-nisar (Contributor)

@SaravananOffl Are you done with the script?

@avirlrma (Contributor) commented Feb 5, 2018

@SaravananOffl Any word?
@pombredanne I'd like to take this if the assignee is inactive.

@pombredanne (Member, Author)

@aviral1701 go for it :)

@dakshaladia

Hi, I wish to contribute to this issue. Is it available?

@pombredanne (Member, Author)

@dakshaladia Sorry for the late reply... actually @AyanSinhaMahapatra is already working on that one

@AyanSinhaMahapatra (Member) commented Mar 16, 2020

@pombredanne I've already written a script to download all these license texts (and some notices) and have run a scan on them. I'm in the process of analyzing the results, as the scan result JSON is pretty long (about 10k lines). I had a few questions, btw:

  1. The script I wrote is similar to the script added before here, but that one just extracted the links to the licenses and output a file with all the links; mine extracts all links from the JSON files in this directory https://github.com/okfn/licenses/tree/master/licenses, then scrapes those pages for the license texts and downloads them into named files (a rough sketch of that download step follows this list). Should this script also go in a PR like the previous one I linked? (Note that some manual work was still needed afterwards, though not much, as this script basically only downloads license texts from https://opensource.org/; the other 10-20 licenses from different sources had to be downloaded separately.)

  2. Here in this comment where you elaborate on how to go about solving this type of issue, you mentioned that each new license creation should go in its own PR, and that "my suggestion is to start small, one gentoo file at a time, not all at once: otherwise this would be too much work to review at once for us".
    Is this valid for this one too? Would you suggest something like in that comment?

  3. Each of the files here (scraped license texts) should also be added to tests/licensedcode/data/licenses. Should all the test files be pushed in one PR?

  4. You say "for each file that is not detected 100%, create a license rule". Is this valid for very old/deprecated licenses that exist in okfn/licenses too? Should I give examples, as this is a case-specific question, or is adding everything preferred?
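For reference, here is a rough sketch of the download step mentioned in point 1 (not the actual script; it assumes each JSON file under licenses/ in okfn/licenses has a `url` field pointing at the license page, and it uses the GitHub contents API to list that directory; the raw page is saved as-is, so the manual cleanup mentioned above would still be needed):

```python
# Sketch only, not the script discussed above. Assumes each licenses/*.json
# file in okfn/licenses carries a "url" field pointing at the license page.
import os
import requests

CONTENTS_API = "https://api.github.com/repos/okfn/licenses/contents/licenses"
OUTPUT_DIR = "downloaded-license-texts"

def download_all():
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    for entry in requests.get(CONTENTS_API).json():
        if not entry["name"].endswith(".json"):
            continue
        meta = requests.get(entry["download_url"]).json()
        page_url = meta.get("url")
        if not page_url:
            continue
        page = requests.get(page_url)
        # Save the raw page; extracting just the license text from the HTML is
        # where the manual cleanup mentioned above comes in.
        out_name = entry["name"].replace(".json", ".html")
        with open(os.path.join(OUTPUT_DIR, out_name), "w", encoding="utf-8") as out:
            out.write(page.text)

if __name__ == "__main__":
    download_all()
```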

Btw, here are the scan results:

"summary": {
    "license_expressions": [
      {
        "value": "unknown",
        "count": 16
      },
      {
        "value": "free-unknown",
        "count": 3
      },
      {
        "value": "other-permissive",
        "count": 3
      },
      {
        "value": null,
        "count": 2
      },

This is a rough summary of the license detection, out of 114 license texts.

@AyanSinhaMahapatra (Member)

@pombredanne Could you take a look at these questions above?

@pombredanne (Member, Author)

I have run a scan on them. I'm in the process of analyzing the results, as the scan result JSON is pretty long (about 10k lines).

You may want to run one scan per file for ease of handling too.
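For example, something like this (a sketch only; it assumes a scancode executable on the PATH and uses the standard --license and --json-pp options, with the texts in a local "downloaded-license-texts" directory):

```python
# Sketch: run one scancode scan per downloaded license text so each result
# file stays small and easy to review. Assumes scancode is on the PATH.
import os
import subprocess

INPUT_DIR = "downloaded-license-texts"
RESULTS_DIR = "scan-results"

os.makedirs(RESULTS_DIR, exist_ok=True)
for name in sorted(os.listdir(INPUT_DIR)):
    in_path = os.path.join(INPUT_DIR, name)
    out_path = os.path.join(RESULTS_DIR, name + ".json")
    subprocess.run(
        ["scancode", "--license", "--json-pp", out_path, in_path],
        check=True,
    )
```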

@pombredanne (Member, Author)

@AyanSinhaMahapatra you wrote:

  1. The script I wrote is similar to the script added before here, but that one just extracted the links to the licenses and output a file with all the links; mine extracts all links from the JSON files in this directory https://github.com/okfn/licenses/tree/master/licenses, then scrapes those pages for the license texts and downloads them into named files. Should this script also go in a PR like the previous one I linked? (Note that some manual work was still needed afterwards, though not much, as this script basically only downloads license texts from https://opensource.org/; the other 10-20 licenses from different sources had to be downloaded separately.)

This seems like a one-off, so you can instead paste the script in the ticket or related PR comment.
The thing, though, is to ensure that we are not introducing too much bias with web-scraped texts that are unlikely to be the ones used as real license texts.

  2. Here in this comment where you elaborate on how to go about solving this type of issue, you mentioned that each new license creation should go in its own PR, and that "my suggestion is to start small, one gentoo file at a time, not all at once: otherwise this would be too much work to review at once for us".
    Is this valid for this one too? Would you suggest something like in that comment?

In general yes, smaller PRs are easier for new licenses. And for rules, it's OK to have a batch of many new rules at once.

  3. Each of the files here (scraped license texts) should also be added to tests/licensedcode/data/licenses. Should all the test files be pushed in one PR?

We likely already have the licenses for these, so there are probably very few new licenses to add, but we should consider adding rules if they are not detected correctly (and within reason, as the web scraping may be introducing quirks that we may not care for, and therefore a new rule may not always be warranted).

  4. You say "for each file that is not detected 100%, create a license rule". Is this valid for very old/deprecated licenses that exist in okfn/licenses too? Should I give examples, as this is a case-specific question, or is adding everything preferred?

For very old/deprecated licenses we want to have at least a license entry and possibly a few rules for their typical notices, but that should be rather limited rule-wise. Also, the caveat about the bias of screen-scraped texts still applies.
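For texts that are only partially detected, adding a rule is a small change: a rule text plus a metadata file. Roughly something like this (an illustrative sketch only; the rules directory path and the metadata field names are assumptions about the data layout at the time and need to be checked against the repository):

```python
# Illustrative only: write a new rule as a .RULE text file plus a .yml
# metadata file. The directory and the field names below are assumptions.
import os

RULES_DIR = "src/licensedcode/data/rules"

def add_notice_rule(name, license_expression, notice_text):
    with open(os.path.join(RULES_DIR, name + ".RULE"), "w", encoding="utf-8") as rule:
        rule.write(notice_text)
    with open(os.path.join(RULES_DIR, name + ".yml"), "w", encoding="utf-8") as meta:
        meta.write("license_expression: %s\n" % license_expression)
        meta.write("is_license_notice: yes\n")

# Hypothetical example for an old/deprecated license notice:
# add_notice_rule(
#     "example-old-license_notice_1",
#     "example-old-license",
#     "This work is provided under the Example Old License, version 1.",
# )
```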

@AyanSinhaMahapatra (Member)

You may want to run one scan per file for ease of handling too.

This was before we had scancode-results-analyzer; now I don't have to go through all the JSON results, only the important ones for review, so it's okay.

This seems like a one-off, so you can instead paste the script in the ticket or related PR comment.

Okay, sure, I'll paste that instead.

The thing, though, is to ensure that we are not introducing too much bias with web-scraped texts that are unlikely to be the ones used as real license texts.

So yes, as these are scraped there is extra text, and although I cleaned it up a bit manually, some extra text still remains.

We likely already have the licenses for these, so there are probably very few new licenses to add, but we should consider adding rules if they are not detected correctly (and within reason, as the web scraping may be introducing quirks that we may not care for, and therefore a new rule may not always be warranted).

Yes, there aren't a lot of new ones, so we should be good to add just those, get rid of the extra text, and then see if there are any detection problems.

For very old/deprecated licenses we want to have at least a license entry and possibly a few rules for their typical notices, but that should be rather limited rule-wise. Also, the caveat about the bias of screen-scraped texts still applies.

Understood.

@AyanSinhaMahapatra (Member) commented Oct 15, 2020

So my plan was to use these, i.e. all of the licenses in this issue, as tests for scancode-results-analyzer, with the problems being detected and the rules generated as much as possible.

You've tagged me in these two issues in the same sense as well - #2275 (comment) and #2274 (comment). These are also valid tests, as the texts and rules will be generated automatically from the JSON scan results (not just that particular file, but the whole scan of that package).

While looking at the scan results of the whole package to see if these cases are successfully detected, we find more issues that we don't have tickets for; you'd remember this from our conversations, and you wanted me to add them to scancode asap. So I'll open one PR for these where all the rules are automatically generated, and we can discuss the rules, and modifications to scancode-results-analyzer, if there are inconsistencies in them.

Does that sound okay?

@pombredanne (Member, Author)

So I'll open one PR for these where all the rules are automatically generated, and we can discuss the rules, and modifications to scancode-results-analyzer, if there are inconsistencies in them.

Perfect!

@SidharajYadav

How can I contribute?
