Many duplicates in SPDX files #2905

Closed
vargenau opened this issue Mar 31, 2022 · 12 comments
@vargenau
Contributor

Description

In the SPDX code, we have multiple times the same code, for example:

LicenseID: LicenseRef-scancode-unknown-license-reference
LicenseName: Unknown License file reference
LicenseComment: <text>See details at https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/licenses/unknown-license-reference.yml
</text>
ExtractedText: <text>See details at https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/licenses/unknown-license-reference.yml
</text>

or

# File

FileName: ./tern.original/LICENSE.txt
SPDXID: SPDXRef-7
FileChecksum: SHA1: 5ec0910f78578a5df32b56cae953249d45d0dd5b
LicenseConcluded: NOASSERTION
LicenseInfoInFile: BSD-2-Clause
LicenseInfoInFile: BSD-2-Clause
LicenseInfoInFile: BSD-2-Clause
LicenseInfoInFile: OFL-1.1
FileCopyrightText: <text>Copyright (c) 2017 VMware, Inc.
</text>

I do not know if it is really a bug, but it is at least confusing.
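To check how widespread the repetition is, the tag-value output can be scanned for licenses listed more than once per file. This is a minimal sketch, not part of ScanCode: the function name `repeated_license_info` is hypothetical, and it assumes every File section starts with a `FileName:` line and that no multi-line `<text>` block contains a line starting with one of these tags.

```python
from collections import Counter


def repeated_license_info(tag_value_text):
    """Map each FileName to the LicenseInfoInFile values listed more than once."""
    per_file = {}
    current = None
    for line in tag_value_text.splitlines():
        tag, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Tag: value" line (e.g. "</text>" or "# File")
        tag, value = tag.strip(), value.strip()
        if tag == "FileName":
            # A new File section starts here (assumption stated above).
            current = per_file.setdefault(value, Counter())
        elif tag == "LicenseInfoInFile" and current is not None:
            current[value] += 1
    return {
        name: {lic: n for lic, n in counts.items() if n > 1}
        for name, counts in per_file.items()
        if any(n > 1 for n in counts.values())
    }


sample = """\
FileName: ./tern.original/LICENSE.txt
SPDXID: SPDXRef-7
LicenseInfoInFile: BSD-2-Clause
LicenseInfoInFile: BSD-2-Clause
LicenseInfoInFile: BSD-2-Clause
LicenseInfoInFile: OFL-1.1
"""
print(repeated_license_info(sample))
```

For the extract above, this reports BSD-2-Clause as listed three times for ./tern.original/LICENSE.txt.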

reuse.spdx.txt
tern.spdx.txt

How To Reproduce

scancode -c -l -i --spdx-tv tern.spdx /home/vargenau/git/tern.original/
scancode -c -l -i --spdx-tv reuse.spdx /home/vargenau/git/reuse/

where the code comes from GitHub:

https://github.com/tern-tools/tern
https://github.com/fsfe/reuse-tool

System configuration

  • What OS are you running on? Ubuntu 21.10
  • What version of scancode-toolkit was used to generate the scan file? ScanCode version 30.1.0
  • What installation method was used to install/run scancode? pip
@vargenau vargenau added the bug label Mar 31, 2022
@pombredanne
Member

@vargenau Thank you for the report!

You wrote:

In the SPDX code, we have multiple times the same code, for example: ...

I assume here (maybe incorrectly) that when you say "we have multiple times the same code" you mean that the "See details ..." comment is showing up twice?

It is always best to run with --license-text with an SPDX output to ensure that the extracted text is populated.
Otherwise, you get a boilerplate ExtractedText: <text>See details at https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/licenses/unknown-license-reference.yml </text>

@pombredanne
Member

I ran a single scan on https://raw.githubusercontent.com/tern-tools/tern/cdc6732eda7de1e5e1f9e1298a6db2e073ec48fc/LICENSE.txt

Of note:

  1. the text is damaged with mojibake. This is eventually making matching a bit less accurate
$ chardet3 LICENSE.txt
LICENSE.txt: windows-1252 with confidence 0.73
$ file LICENSE.txt
LICENSE.txt: Non-ISO extended-ASCII text, with very long lines, with CRLF line terminators
  2. the license detection could be better and return only two matches rather than three
  3. we are working on refinements to eventually merge multiple matches in a single detection
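As an illustration of the mojibake point: the `\x93`, `\x94` and `\x95` bytes that show up in the matched text are not valid UTF-8, but they decode to curly quotes and a bullet under windows-1252. A minimal sketch follows; it assumes the chardet guess (windows-1252, confidence 0.73) is right, and the helper name `to_utf8_text` is hypothetical.

```python
def to_utf8_text(raw, source_encoding="cp1252"):
    """Decode bytes as UTF-8, falling back to a legacy encoding when that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: the file really is windows-1252, as chardet suggests.
        # A few byte values are undefined in cp1252 and would still raise here.
        return raw.decode(source_encoding)


raw = b"The BSD-2 license (the \x93License\x94) set forth below applies"
text = to_utf8_text(raw)
print(text)                        # curly quotes around the word License
print(text.encode("utf-8") != raw)  # True: the UTF-8 bytes differ from the original
```

Writing the decoded text back out as UTF-8 would give a file without the stray bytes.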

NB: If you are interested in container scans, also check out the companion server project http://scancode.io/

headers:
    -   tool_name: scancode-toolkit
        tool_version: 31.0.0b1
        options:
            input:
                - LICENSE.txt
            --license: yes
            --license-text: yes
            --license-text-diagnostics: yes
            --yaml: '-'
        notice: |
            Generated with ScanCode and provided on an "AS IS" BASIS, WITHOUT WARRANTIES
            OR CONDITIONS OF ANY KIND, either express or implied. No content created from
            ScanCode should be considered or used as legal advice. Consult an Attorney
            for any legal advice.
            ScanCode is a free software code scanning tool from nexB Inc. and others.
            Visit https://github.com/nexB/scancode-toolkit/ for support and download.
        start_timestamp: '2022-03-31T185132.281914'
        end_timestamp: '2022-03-31T185135.025805'
        output_format_version: 2.0.0
        duration: '2.743912696838379'
        message:
        errors: []
        extra_data:
            spdx_license_list_version: '3.16'
            files_count: 1
files:
    -   path: LICENSE.txt
        type: file
        licenses:
            -   key: bsd-simplified
                score: '100.0'
                name: BSD-2-Clause
                short_name: BSD-2-Clause
                category: Permissive
                is_exception: no
                is_unknown: no
                owner: Regents of the University of California
                homepage_url: http://www.opensource.org/licenses/BSD-2-Clause
                text_url: http://opensource.org/licenses/bsd-license.php
                reference_url: https://scancode-licensedb.aboutcode.org/bsd-simplified
                scancode_text_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/bsd-simplified.LICENSE
                scancode_data_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/bsd-simplified.yml
                spdx_license_key: BSD-2-Clause
                spdx_url: https://spdx.org/licenses/BSD-2-Clause
                start_line: 5
                end_line: 5
                matched_rule:
                    identifier: bsd-simplified_226.RULE
                    license_expression: bsd-simplified
                    licenses:
                        - bsd-simplified
                    referenced_filenames: []
                    is_license_text: no
                    is_license_notice: yes
                    is_license_reference: no
                    is_license_tag: no
                    is_license_intro: no
                    has_unknown: no
                    matcher: 2-aho
                    rule_length: 13
                    matched_length: 13
                    match_coverage: '100.0'
                    rule_relevance: 100
                matched_text: "The BSD-2 license (the \x93License\x94) set forth below applies\
                    \ to all parts"
            -   key: bsd-simplified
                score: '50.0'
                name: BSD-2-Clause
                short_name: BSD-2-Clause
                category: Permissive
                is_exception: no
                is_unknown: no
                owner: Regents of the University of California
                homepage_url: http://www.opensource.org/licenses/BSD-2-Clause
                text_url: http://opensource.org/licenses/bsd-license.php
                reference_url: https://scancode-licensedb.aboutcode.org/bsd-simplified
                scancode_text_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/bsd-simplified.LICENSE
                scancode_data_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/bsd-simplified.yml
                spdx_license_key: BSD-2-Clause
                spdx_url: https://spdx.org/licenses/BSD-2-Clause
                start_line: 5
                end_line: 5
                matched_rule:
                    identifier: bsd-simplified_275.RULE
                    license_expression: bsd-simplified
                    licenses:
                        - bsd-simplified
                    referenced_filenames: []
                    is_license_text: no
                    is_license_notice: yes
                    is_license_reference: no
                    is_license_tag: no
                    is_license_intro: no
                    has_unknown: no
                    matcher: 3-seq
                    rule_length: 28
                    matched_length: 14
                    match_coverage: '50.0'
                    rule_relevance: 100
                matched_text: "License\x94) [set] [forth] [below] [applies] [to] [all] [parts]\
                    \ [of] [the] [Tern] project.  You may not use this file except in compliance\
                    \ with the License."
            -   key: bsd-simplified
                score: '100.0'
                name: BSD-2-Clause
                short_name: BSD-2-Clause
                category: Permissive
                is_exception: no
                is_unknown: no
                owner: Regents of the University of California
                homepage_url: http://www.opensource.org/licenses/BSD-2-Clause
                text_url: http://opensource.org/licenses/bsd-license.php
                reference_url: https://scancode-licensedb.aboutcode.org/bsd-simplified
                scancode_text_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/bsd-simplified.LICENSE
                scancode_data_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/bsd-simplified.yml
                spdx_license_key: BSD-2-Clause
                spdx_url: https://spdx.org/licenses/BSD-2-Clause
                start_line: 5
                end_line: 7
                matched_rule:
                    identifier: bsd-simplified_53.RULE
                    license_expression: bsd-simplified
                    licenses:
                        - bsd-simplified
                    referenced_filenames: []
                    is_license_text: no
                    is_license_notice: no
                    is_license_reference: yes
                    is_license_tag: no
                    is_license_intro: no
                    has_unknown: no
                    matcher: 2-aho
                    rule_length: 3
                    matched_length: 3
                    match_coverage: '100.0'
                    rule_relevance: 100
                matched_text: "License. \r\n\r\nBSD-2"
            -   key: bsd-simplified
                score: '100.0'
                name: BSD-2-Clause
                short_name: BSD-2-Clause
                category: Permissive
                is_exception: no
                is_unknown: no
                owner: Regents of the University of California
                homepage_url: http://www.opensource.org/licenses/BSD-2-Clause
                text_url: http://opensource.org/licenses/bsd-license.php
                reference_url: https://scancode-licensedb.aboutcode.org/bsd-simplified
                scancode_text_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/bsd-simplified.LICENSE
                scancode_data_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/bsd-simplified.yml
                spdx_license_key: BSD-2-Clause
                spdx_url: https://spdx.org/licenses/BSD-2-Clause
                start_line: 7
                end_line: 12
                matched_rule:
                    identifier: bsd-simplified_169.RULE
                    license_expression: bsd-simplified
                    licenses:
                        - bsd-simplified
                    referenced_filenames: []
                    is_license_text: yes
                    is_license_notice: no
                    is_license_reference: no
                    is_license_tag: no
                    is_license_intro: no
                    has_unknown: no
                    matcher: 2-aho
                    rule_length: 184
                    matched_length: 184
                    match_coverage: '100.0'
                    rule_relevance: 100
                matched_text: "License \r\n\r\nRedistribution and use in source and binary forms,\
                    \ with or without modification, are permitted provided that the following\
                    \ conditions are met:\r\n\x95\tRedistributions of source code must retain\
                    \ the above copyright notice, this list of conditions and the following\
                    \ disclaimer.\r\n\x95\tRedistributions in binary form must reproduce the\
                    \ above copyright notice, this list of conditions and the following disclaimer\
                    \ in the documentation and/or other materials provided with the distribution.\r\
                    \nTHIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"\
                    AS IS\" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED\
                    \ TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR\
                    \ PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS\
                    \ BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR\
                    \ CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE\
                    \ GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)\
                    \ HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,\n WHETHER IN CONTRACT,\
                    \ STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING\
                    \ IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY\
                    \ OF SUCH DAMAGE."
        license_expressions:
            - bsd-simplified
            - bsd-simplified
            - bsd-simplified
            - bsd-simplified
        percentage_of_license_text: '94.64'
        scan_errors: []
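The four identical entries under license_expressions are what ends up as repeated LicenseInfoInFile lines. As a rough sketch of collapsing them on the consumer side while keeping first-seen order (this is a post-processing idea, not how ScanCode builds its SPDX output; `unique_in_order` is a hypothetical helper):

```python
def unique_in_order(values):
    """Drop repeated values while keeping the order of first appearance."""
    # dict preserves insertion order, so this keeps the first occurrence of each value.
    return list(dict.fromkeys(values))


license_expressions = [
    "bsd-simplified",
    "bsd-simplified",
    "bsd-simplified",
    "bsd-simplified",
]
print(unique_in_order(license_expressions))
print(unique_in_order(["BSD-2-Clause", "BSD-2-Clause", "BSD-2-Clause", "OFL-1.1"]))
```

The individual matches (rule identifier, line range, score) are still available under licenses if the per-match detail is needed.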

@pombredanne
Member

Regarding reuse: https://github.com/fsfe/reuse-tool/tree/master/src/reuse/resources contains long lists of SPDX licenses. These are real license mentions, but they are false positives since reuse is itself a license-related tool.
The latest develop branch has several fixes in this area and many more are planned in #2878, but the problem still shows up in this case.

In general, note that ScanCode is not optimized to scan tools that are themselves license detection tools, so you can expect a lot of matches in these cases.

@vargenau
Contributor Author

vargenau commented Apr 1, 2022

@vargenau Thank you for the report!

You wrote:

In the SPDX code, we have multiple times the same code, for example: ...

I assume here (maybe incorrectly) that when you say "we have multiple times the same code" you mean that the "See details ..." comment is showing up twice?

It is always best to run with --license-text with an SPDX output to ensure that the extracted text is populated. Otherwise, you get a boilerplate ExtractedText: <text>See details at https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/licenses/unknown-license-reference.yml </text>

Hi Philippe,

Sorry if I was not clear.

In the tern.spdx SPDX file, you have the following:

LicenseInfoInFile: BSD-2-Clause
LicenseInfoInFile: BSD-2-Clause
LicenseInfoInFile: BSD-2-Clause

(see bigger extract in the initial report)

Why do we have the same information 3 times for the same file?

@vargenau
Copy link
Contributor Author

vargenau commented Apr 1, 2022

@vargenau Thank you for the report!

You wrote:

In the SPDX code, we have multiple times the same code, for example: ...

I assume here (maybe incorrectly) that when you say "we have multiple times the same code" you mean that the "See details ..." comment is showing up twice?

It is always best to run with --license-text with an SPDX output to ensure that the extracted text is populated. Otherwise, you get a boilerplate ExtractedText: <text>See details at https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/licenses/unknown-license-reference.yml </text>

Hi Philippe,

As you recommended, I have used the --license-text option.

I now get:

LicenseID: LicenseRef-scancode-unknown-license-reference
LicenseName: Unknown License file reference
LicenseComment: <text>See details at https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/licenses/unknown-license-reference.yml
</text>
ExtractedText: <text>licensed under the</text>

LicenseID: LicenseRef-scancode-unknown-license-reference
LicenseName: Unknown License file reference
LicenseComment: <text>See details at https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/licenses/unknown-license-reference.yml
</text>
ExtractedText: <text>see [LICENSE.txt](</text>

So we have the same LicenseID with a different ExtractedText. This seems invalid to me.

The SPDX spec says:
Provide a locally unique identifier to refer to licenses that are not found on the SPDX License List.

What do you think?
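To list every LicenseID affected, the extracted-license blocks can be grouped by ID and checked for more than one distinct ExtractedText. This is a minimal sketch with a hypothetical function name, `conflicting_license_ids`; it assumes each block starts with `LicenseID:` and contains an `ExtractedText: <text>...</text>` entry (a block without one would be merged with the next by this regex).

```python
import re
from collections import defaultdict

# One extracted-license block: a LicenseID line followed, later, by ExtractedText.
BLOCK = re.compile(
    r"^LicenseID:\s*(?P<id>\S+).*?^ExtractedText:\s*<text>(?P<text>.*?)</text>",
    re.DOTALL | re.MULTILINE,
)


def conflicting_license_ids(tag_value_text):
    """Return {LicenseID: [texts]} for IDs that appear with differing ExtractedText."""
    texts = defaultdict(set)
    for match in BLOCK.finditer(tag_value_text):
        texts[match.group("id")].add(match.group("text").strip())
    return {lid: sorted(found) for lid, found in texts.items() if len(found) > 1}


sample = """\
LicenseID: LicenseRef-scancode-unknown-license-reference
LicenseName: Unknown License file reference
LicenseComment: <text>See details at https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/licenses/unknown-license-reference.yml
</text>
ExtractedText: <text>licensed under the</text>

LicenseID: LicenseRef-scancode-unknown-license-reference
LicenseName: Unknown License file reference
LicenseComment: <text>See details at https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/licenses/unknown-license-reference.yml
</text>
ExtractedText: <text>see [LICENSE.txt](</text>
"""
print(conflicting_license_ids(sample))
```

On the two blocks above, the single ID is reported with both extracted texts.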

@pombredanne
Member

So we have the same LicenseID with a different ExtractedText. This seems invalid to me.

The SPDX spec says: Provide a locally unique identifier to refer to licenses that are not found on the SPDX License List.

This "scancode" as used in the LicenseRef is called a license namespace and this is registered here https://tools.spdx.org/app/archive_namespace_requests/2/
So this is global and not local.

@mjherzog
Member

mjherzog commented Apr 1, 2022

unknown-license-reference is a special case where ScanCode detects elements of what may be a license. The LicenseID values for unknown-license detections are generated for consistency in the output data - not for use in an SPDX document. There is major rework pending on the handling of unknown-license-reference - see also #2878

@pombredanne
Member

@vargenau I am revisiting this as we start some major work on false positives:

  • You are scanning tern and reuse, which are license tools and contain quite a few licenses, so they are not the best examples. That said, we should still scan them correctly
  • There is work on a new feature called "License Detection" that will eventually group multiple license matches into a single detection, which should generally help with some of the issues you brought up here

@rnjudge see https://raw.githubusercontent.com/tern-tools/tern/cdc6732eda7de1e5e1f9e1298a6db2e073ec48fc/LICENSE.txt which is your damaged and not-really-standard license text and notice. The main issue is mojibake

@rnjudge

rnjudge commented May 12, 2022

@pombredanne that file comes directly from GitHub when you choose a license for the project. Do you have a suggestion for a more parse-able/standard license text we can use to communicate BSD-2?

Thanks for bringing this to my attention, I wasn't aware. Happy to update!

@pombredanne
Member

@rnjudge you wrote:

that file comes directly from GitHub when you choose a license for the project. Do you have a suggestion for a more parse-able/standard license text we can use to communicate BSD-2?

It may have been this way, but this seems to be no longer the case:
https://raw.githubusercontent.com/pombredanne/test-bsd2/main/LICENSE

Any BSD text that scancode detects works! (I will be adding yours as a new rule, FWIW.) Using https://scancode-licensedb.aboutcode.org/bsd-simplified.html will surely work perfectly. https://opensource.org/licenses/bsd-license.php and https://spdx.org/licenses/BSD-2-Clause will be fine too.

rnjudge added a commit to rnjudge/tern that referenced this issue May 13, 2022
It was brought to our attention[1] that the Tern license file was not using
a standard BSD-2 license text and notice which was making it difficult
for compliance tooling to parse. This commit updates the license file to
use the standard text for the BSD 2-Clause license[2]

[1] aboutcode-org/scancode-toolkit#2905 (comment)
[2] https://spdx.org/licenses/BSD-2-Clause

Signed-off-by: Rose Judge <[email protected]>
@rnjudge

rnjudge commented May 13, 2022

Thanks @pombredanne. The license file in Tern was created 5 years ago so it's good you're bringing this up. I opened a PR to fix this in Tern. Could you have a look?

rnjudge added a commit to tern-tools/tern that referenced this issue May 19, 2022
@vargenau
Contributor Author

In this issue, we have in fact two different issues reported in:
