Combine license matches in new LicenseDetection #2961

Merged on Nov 11, 2022 (86 commits)
Changes from 67 commits
Commits
a6e2941
Import functions from `scancode-analyzer`
AyanSinhaMahapatra May 17, 2022
ce84de8
Modify LicenseDetection attributes
AyanSinhaMahapatra May 17, 2022
466e0e0
Create LicenseDetection serialization function
AyanSinhaMahapatra May 17, 2022
8de3111
Enable LicenseDetection in API
AyanSinhaMahapatra May 17, 2022
77ccbc9
Remove the --unknown-licenses option
AyanSinhaMahapatra May 17, 2022
3bd00bc
Update local license dereferencing code
AyanSinhaMahapatra May 17, 2022
56facc3
Update unknown license dereferencing for key-files
AyanSinhaMahapatra May 17, 2022
391383c
Update intro rules with is_license_intro as True
AyanSinhaMahapatra May 17, 2022
5b0dc11
Add test files for LicenseDetection
AyanSinhaMahapatra May 17, 2022
ce19d7a
Add LicenseDetection full scan tests
AyanSinhaMahapatra May 17, 2022
578bd8d
Add function to create index from test rules folders
AyanSinhaMahapatra May 17, 2022
dc25cd4
Remove unwanted code, update docstrings
AyanSinhaMahapatra May 24, 2022
d70acde
Move LicenseDetection tests to new file
AyanSinhaMahapatra May 24, 2022
b4ddc77
Modify LicenseMatch data in results #2416
AyanSinhaMahapatra May 24, 2022
6a441da
Add --licenses-reference option
AyanSinhaMahapatra May 25, 2022
bca7f4b
Add docs for LicenseDetection and detail referencing
AyanSinhaMahapatra May 25, 2022
d57f390
Do not summurize license detections
AyanSinhaMahapatra May 31, 2022
cc945a0
Apply LicenseDetection everywhere
AyanSinhaMahapatra May 31, 2022
6a91773
Regenerate test expectations for LicenseDetection
AyanSinhaMahapatra May 31, 2022
5826429
Merge branch 'develop' into add-license-detection
AyanSinhaMahapatra May 31, 2022
3e68f76
Fix SPDX and other output plugins
AyanSinhaMahapatra Jun 1, 2022
4fc7d24
Rename license fields for resource
AyanSinhaMahapatra Jun 8, 2022
5e20984
Update test expectations for license field renaming
AyanSinhaMahapatra Jun 8, 2022
2e3dd3e
Rename and Add package license attributes
AyanSinhaMahapatra Jul 12, 2022
855cdef
Add functions for package LicenseDetection
AyanSinhaMahapatra Jul 12, 2022
802c085
Modify package parsers to adopt new LicenseDetection
AyanSinhaMahapatra Jul 12, 2022
ab677c6
Align to package LicenseDetection
AyanSinhaMahapatra Jul 12, 2022
552f173
Modify system packages to use LicenseDetection
AyanSinhaMahapatra Jul 12, 2022
e94a19e
Add new license key `undetected-license`
AyanSinhaMahapatra Jul 12, 2022
0462ef4
Regenerate test expectations for package LicenseDetection
AyanSinhaMahapatra Jul 12, 2022
fd9cd57
Merge branch 'develop' into add-license-detection
AyanSinhaMahapatra Jul 12, 2022
1147381
Fix test failures
AyanSinhaMahapatra Jul 12, 2022
8929571
Fix pypi setup.py email bug
AyanSinhaMahapatra Jul 17, 2022
9df811e
Add manifest license references detection
AyanSinhaMahapatra Jul 18, 2022
c47ce89
Add license from file if empty manifest license
AyanSinhaMahapatra Jul 18, 2022
fce8e7c
Add feature to get package license from sibling file
AyanSinhaMahapatra Jul 18, 2022
eb88b56
Allow package LicenseDetection without --licenses
AyanSinhaMahapatra Jul 21, 2022
67872a4
Regen datadriven LicenseDetections
AyanSinhaMahapatra Jul 21, 2022
9a93552
Revert to `rule_identifier`
AyanSinhaMahapatra Jul 21, 2022
ce11ae4
Remove `undetected-license` in favour of `unknown`
AyanSinhaMahapatra Jul 21, 2022
7fc32ee
Regen test expectations and fix tests
AyanSinhaMahapatra Jul 22, 2022
5aae457
Reorder license expressions and detections in result
AyanSinhaMahapatra Jul 22, 2022
4c1c129
Add fucntions to test license detection with subset of rules
AyanSinhaMahapatra Jul 31, 2022
86ec441
Fix datadriven license test errors
AyanSinhaMahapatra Jul 31, 2022
ca823b7
Fix debian_copyright test failure
AyanSinhaMahapatra Jul 31, 2022
81793dd
Address review feedback
AyanSinhaMahapatra Aug 1, 2022
fb8f492
Add tests from eclipse foundation issues
AyanSinhaMahapatra Aug 3, 2022
08cb42d
Merge branch 'develop' into add-license-detection
AyanSinhaMahapatra Aug 4, 2022
dca0371
Regenerate test expectations after merging develop
AyanSinhaMahapatra Aug 4, 2022
174a097
Add `other_license*` attributes for packages #2065
AyanSinhaMahapatra Aug 8, 2022
c8ca7a3
Remove `compute_normalized_license` functions
AyanSinhaMahapatra Aug 9, 2022
539ebed
Add `default_license_relation` attribute to handlers
AyanSinhaMahapatra Aug 9, 2022
6f42f6f
Support NuGet license URLs #3037
AyanSinhaMahapatra Aug 11, 2022
4f29860
Fix test failures
AyanSinhaMahapatra Aug 11, 2022
0450138
Fix rpm tests
AyanSinhaMahapatra Aug 18, 2022
21648a5
Merge branch 'develop' into add-license-detection
AyanSinhaMahapatra Aug 18, 2022
57a3c2a
Fix test failures and expectations after merging develop
AyanSinhaMahapatra Aug 18, 2022
f1ee5f6
Do not return empty dict as exctracted license
AyanSinhaMahapatra Aug 19, 2022
bae1a30
Fix csv output after adding LicenseDetection`
AyanSinhaMahapatra Aug 19, 2022
b1e422c
Tag intro rule properly
AyanSinhaMahapatra Sep 5, 2022
1b9a8f7
Fix package license expression None bug
AyanSinhaMahapatra Sep 5, 2022
eadb7bd
Also classify license intro in false positives list
AyanSinhaMahapatra Sep 6, 2022
64bd3d8
Merge branch 'develop' into add-license-detection
AyanSinhaMahapatra Sep 6, 2022
8edfb92
Make LicenseMatch grouping affected by presence of license intro
AyanSinhaMahapatra Sep 6, 2022
0d8fba9
Modify license references to also effect clues
AyanSinhaMahapatra Sep 19, 2022
08afdfd
Restore returning whole_lines by default
AyanSinhaMahapatra Sep 19, 2022
6e2cad3
Add libxml files as unknown license reference test
AyanSinhaMahapatra Sep 20, 2022
a5c52fa
Tag license intro rules correctly
AyanSinhaMahapatra Sep 29, 2022
02ab56c
Update false positives and unknown intro heuristics
AyanSinhaMahapatra Oct 3, 2022
7d5c647
Add unknown license reference to package dereferencing #2965 #1379
AyanSinhaMahapatra Oct 13, 2022
8721bdf
Improve unknown reference to package dereferencing #2965 #1379
AyanSinhaMahapatra Oct 18, 2022
5b0efe4
Merge branch 'develop' into add-license-detection
AyanSinhaMahapatra Oct 18, 2022
afd4025
Merge branch 'develop' into add-license-detection
AyanSinhaMahapatra Nov 3, 2022
b1c999f
Rename `detection_rules` to `detection_log`
AyanSinhaMahapatra Nov 6, 2022
f0213ec
Update license clues heuristics
AyanSinhaMahapatra Nov 7, 2022
5de2744
Add `rule_url` and update scancode URLs
AyanSinhaMahapatra Nov 7, 2022
fd3f1d7
Restore the --unknown-licenses experimental CLI option
AyanSinhaMahapatra Nov 8, 2022
637c4dd
Adjust unknown licenses heuristics
AyanSinhaMahapatra Nov 8, 2022
01b1de4
Replace `package` in referenced_filename
AyanSinhaMahapatra Nov 8, 2022
5768c8c
Update docstrings and improve readability
AyanSinhaMahapatra Nov 9, 2022
0e00d28
Update changelog and docs
AyanSinhaMahapatra Nov 10, 2022
5ebca43
Improve CHANGELOG
pombredanne Nov 11, 2022
18aca01
Update CHANGELOG to remove duplications
AyanSinhaMahapatra Nov 11, 2022
05d163a
Bump output format to v3
pombredanne Nov 11, 2022
d264aec
Update CHANGELOG
pombredanne Nov 11, 2022
8f07fdf
Bump version
pombredanne Nov 11, 2022
Empty file modified docs/scripts/sphinx_build_link_check.sh
100644 → 100755
Empty file.
1 change: 1 addition & 0 deletions docs/source/explanations/index.rst
Expand Up @@ -7,6 +7,7 @@
:maxdepth: 2

overview
license-detection-reference

..
[ToAdd]
Expand Down
194 changes: 194 additions & 0 deletions docs/source/explanations/license-detection-reference.rst
@@ -0,0 +1,194 @@
License Detection and Reference Additions
=========================================

`Main Issue <https://github.com/nexB/scancode-toolkit/issues/2878>`_

`Main Pull Request <https://github.com/nexB/scancode-toolkit/pull/2961>`_

`A presentation on this <https://github.com/nexB/scancode-toolkit/issues/2878#issuecomment-1079639973>`_


Previous Work
-------------

- Akansha's GSoC work on unknown local references and unknown detection
  based on ngrams from LicenseDB texts.

- work from ``scancode-analyzer`` and ``debian copyright detection``
  which had the concept of a LicenseDetection, flat LicenseMatches and
  getting unique detections across a scan that reference the details.

- work on primary-license and license scoring.

LicenseDetection
----------------

This aims to solve a few types of false positives commonly observed in
ScanCode license detection. These are:

The ``unknown`` cases
^^^^^^^^^^^^^^^^^^^^^

- Unknown Intros with Proper Detections after them
- Unknown references to local files

License Clues
^^^^^^^^^^^^^

This would also introduce a ``license_clues`` list of LicenseMatches
holding improper detections and other clues, such as URLs, which
cannot be reported as detections.

License Versions
^^^^^^^^^^^^^^^^

This would also simplify license-expressions for gpl/lgpl cases
with versioned/unversioned matches detected together.

Package License Detections
^^^^^^^^^^^^^^^^^^^^^^^^^^

License detections in package manifests currently report only the
license-expression from the detection, unlike licenses detected directly
in files, which carry match details. With this change, package license
detections would also carry these details.

Other Solution Elements
^^^^^^^^^^^^^^^^^^^^^^^

Merged:

- Key {{phrases}} in license text rules
- New license clarity scoring
- Report the primary license

Upcoming:

- Make it easier to report, review and curate license detections
(GSoC Project in scancode.io)

- Fixing bugs and updating the heuristics.
(This will be ongoing like the LicenseDB updates)

Examples
^^^^^^^^

An example from the eclipse foundation::

/*********************************************************************
* Copyright (c) 2019 Red Hat, Inc.
*
* This program and the accompanying materials are made
* available under the terms of the Eclipse Public License 2.0
* which is available at https://www.eclipse.org/legal/epl-2.0/
*
* SPDX-License-Identifier: EPL-2.0
**********************************************************************/


The text ``"This program and the accompanying materials are made\n* available under the terms
of the"`` is detected as ``unknown-license-reference`` with ``is_license_intro`` set to True,
and is followed by several ``epl-2.0`` detections.
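
The handling of such an intro can be sketched as follows. This is a minimal illustration and not the actual ScanCode implementation: the helper names ``drop_license_intros`` and ``combine_expressions`` and the dictionary shape of a match are assumptions made for this example. Only ``is_license_intro`` and the license expressions come from the text above.

```python
def drop_license_intros(matches):
    """Return the matches without license intros when proper detections exist."""
    proper = [m for m in matches if not m.get("is_license_intro")]
    # Keep the intro only if nothing else was detected next to it.
    return proper or list(matches)


def combine_expressions(matches):
    """Join the unique license expressions of the matches, keeping their order."""
    seen = []
    for match in matches:
        expression = match["license_expression"]
        if expression not in seen:
            seen.append(expression)
    return " AND ".join(seen)


# Hypothetical matches for the Eclipse header shown above.
matches = [
    {"license_expression": "unknown-license-reference", "is_license_intro": True},
    {"license_expression": "epl-2.0", "is_license_intro": False},
    {"license_expression": "epl-2.0", "is_license_intro": False},
]

detection_expression = combine_expressions(drop_license_intros(matches))
print(detection_expression)
```

With the intro dropped, the detection reports only the license that was properly detected, instead of an expression polluted by an ``unknown`` key.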

What is a LicenseDetection?
---------------------------

A LicenseDetection groups one or more LicenseMatches and
creates the License Expression that we finally report.

Properties:

- A file can have multiple LicenseDetections (separated by non-legalese lines)
- This can be from a file directly or a package.
- We should be mostly certain of a proper detection to create a LicenseDetection.
- One LicenseDetection can have matches from different files, in case of local license
references.


LicenseMatch Result Data
------------------------

LicenseMatch data currently is based on a ``license key`` instead of being based
on a ``license-expression``.

So if a ``mit and apache-2.0`` license expression is detected from a single
LicenseMatch, we currently add two entries in the ``licenses`` list for that
resource, one for each license key (here ``mit`` and ``apache-2.0`` respectively).
This repeats the match details, as these two entries are identical except for the
license key. This is wrong.

We should only add one entry per match (and therefore per ``rule``) and here the
primary attribute should be the ``license-expression``, rather than the ``license-key``.

We also create a mapping inside a mapping in these license details to refer to the
license rule (and there are other inconsistencies in how we report here). We should
just report a flat mapping here (with a list of the license keys at the end).
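
The change of shape described above can be sketched as follows. This is only an illustration under assumptions: the function ``flatten_license_entries``, the rule name ``example_1.RULE`` and the exact field names are hypothetical and do not reproduce the real ScanCode output schema.

```python
def flatten_license_entries(old_entries):
    """Merge old per-license-key entries into one flat mapping per match."""
    grouped = {}
    for entry in old_entries:
        rule = entry["matched_rule"]
        # Entries from the same rule at the same lines come from one match.
        match_id = (rule["identifier"], entry["start_line"], entry["end_line"])
        if match_id not in grouped:
            grouped[match_id] = {
                "license_expression": rule["license_expression"],
                "rule_identifier": rule["identifier"],
                "score": entry["score"],
                "start_line": entry["start_line"],
                "end_line": entry["end_line"],
                "license_keys": [],
            }
        grouped[match_id]["license_keys"].append(entry["key"])
    return list(grouped.values())


# Hypothetical old-style data: one match reported twice, once per license key,
# with the rule data nested in a "matched_rule" mapping.
nested_rule = {
    "identifier": "example_1.RULE",
    "license_expression": "mit AND apache-2.0",
}
old_entries = [
    {"key": "mit", "score": 100.0, "start_line": 1, "end_line": 3,
     "matched_rule": nested_rule},
    {"key": "apache-2.0", "score": 100.0, "start_line": 1, "end_line": 3,
     "matched_rule": nested_rule},
]

flat_matches = flatten_license_entries(old_entries)
print(flat_matches)
```

The two old entries collapse into a single flat mapping keyed on the license expression, with the license keys kept as a plain list at the end.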


Only reference License related Data
-----------------------------------

Currently all license related data is inlined in each match, and this repeats
a lot of information. This repetition exists at three levels:

- License Data
- LicenseDB Data
- LicenseDetection Data

If we introduce a new command line option ``--licenses-reference``, which of these
should we reference, just License/LicenseDB data, just LicenseDetection level data
or all of them?

License Data
^^^^^^^^^^^^

This is referencing data related to whole licenses, referenced by their license key.

Example: ``apache-2.0``

Other attributes are its full text, links to origin, licenseDB, spdx, osi etc.


LicenseDB Data
^^^^^^^^^^^^^^

This is referencing data related to a LicenseDB entry.
I.e. the identifier is a `RULE` or a `LICENSE` file.

Example: ``apache-2.0_2.RULE``

Other attributes are its license-expression, the boolean fields, length, relevance etc.


LicenseDetection Data
^^^^^^^^^^^^^^^^^^^^^

This is referencing by LicenseDetection. Each LicenseDetection has one or multiple LicenseMatches.

Identifier is a hash/uuid field computed from a nested tuple of select attributes.

This identifier will represent a LicenseDetection uniquely, even if the same detection is present across multiple files.

Attributes will be:

- File Regions where these are found (File Path + Start and End line)
- Score, matched length, matcher (like ``1-hash``, ``2-aho``), and matched text.
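
An identifier computed from a nested tuple of selected attributes could look like the sketch below. The choice of attributes, the use of SHA-1 and the function name ``detection_identifier`` are assumptions for illustration. The text above only states that the identifier is a hash/uuid computed from a nested tuple.

```python
import hashlib


def detection_identifier(license_expression, matches):
    """Return a stable hex identifier for a detection, independent of file path."""
    content = (
        license_expression,
        tuple(
            (m["rule_identifier"], m["score"], m["matched_length"], m["matcher"])
            for m in matches
        ),
    )
    # repr() of a tuple of strings and numbers is deterministic across runs.
    return hashlib.sha1(repr(content).encode("utf-8")).hexdigest()


# Hypothetical matches: the same detection found in two different files.
match = {
    "rule_identifier": "example_1.RULE",
    "score": 100.0,
    "matched_length": 12,
    "matcher": "2-aho",
}
id_in_file_a = detection_identifier("epl-2.0", [match])
id_in_file_b = detection_identifier("epl-2.0", [dict(match)])
id_other = detection_identifier("mit", [match])
print(id_in_file_a == id_in_file_b, id_in_file_a == id_other)
```

Because file paths and line numbers are left out of the hashed tuple, the same detection in different files maps to one identifier, and the file regions can then be listed under it.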


What should be the default option?
----------------------------------

Two changes were long-planned and should be default:

- LicenseDetections in the results
- LicenseMatch being for a ``license-expression``

This is already a lot of change, so also having the referencing details as default doesn't
make sense IMHO.

- We surely need to have the details inlined as an option, because otherwise it will be the
  downstream tools' responsibility to get these details and inline them.

We can always make the details referenced as the default option in a later release after more
testing and feedback. So for now the details are inlined by default, and the
``--licenses-reference`` command line option removes the details from the matches and puts
them in a top-level list.
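
What such an option does to the results can be sketched as follows. The data layout, the ``rule_details`` key and the function ``reference_rule_details`` are hypothetical and only illustrate the idea of moving repeated details into a top-level list. They are not the actual ``--licenses-reference`` implementation.

```python
def reference_rule_details(scanned_files):
    """Return (files without inlined details, top-level list of rule details)."""
    rule_references = {}
    slim_files = []
    for scanned_file in scanned_files:
        slim_matches = []
        for match in scanned_file["matches"]:
            slim = dict(match)  # do not modify the input data
            details = slim.pop("rule_details", {})
            # Store the details once per rule, however often the rule matched.
            rule_references.setdefault(slim["rule_identifier"], details)
            slim_matches.append(slim)
        slim_files.append({"path": scanned_file["path"], "matches": slim_matches})
    top_level = [
        {"rule_identifier": identifier, **details}
        for identifier, details in sorted(rule_references.items())
    ]
    return slim_files, top_level


# Hypothetical scan results: the same rule matched in two files.
details = {"license_expression": "mit", "relevance": 100}
scanned_files = [
    {"path": "a.py", "matches": [
        {"rule_identifier": "example_1.RULE", "rule_details": details}]},
    {"path": "b.py", "matches": [
        {"rule_identifier": "example_1.RULE", "rule_details": details}]},
]

slim_files, rule_references = reference_rule_details(scanned_files)
print(slim_files)
print(rule_references)
```

Each match keeps only the identifier needed to look up its details, so a downstream tool that wants the inlined form has to join the two structures itself. This is the cost discussed above.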
8 changes: 4 additions & 4 deletions etc/scripts/utils_thirdparty.py
Expand Up @@ -910,7 +910,7 @@ def load_pkginfo_data(self, dest_dir=THIRDPARTY_DIR):
declared_license = [raw_data["License"]] + [
c for c in classifiers if c.startswith("License")
]
license_expression = compute_normalized_license_expression(declared_license)
license_expression = get_license_expression(declared_license)
other_classifiers = [c for c in classifiers if not c.startswith("License")]

holder = raw_data["Author"]
Expand Down Expand Up @@ -2272,16 +2272,16 @@ def find_problems(
check_about(dest_dir=dest_dir)


def compute_normalized_license_expression(declared_licenses):
def get_license_expression(declared_licenses):
"""
Return a normalized license expression or None.
"""
if not declared_licenses:
return
try:
from packagedcode import pypi
from packagedcode.licensing import get_only_expression_from_extracted_license

return pypi.compute_normalized_license(declared_licenses)
return get_only_expression_from_extracted_license(declared_licenses)
except ImportError:
# Scancode is not installed, clean and join all the licenses
lics = [python_safe_name(l).lower() for l in declared_licenses]
Expand Down
2 changes: 1 addition & 1 deletion setup.cfg
Expand Up @@ -157,7 +157,7 @@ console_scripts =
scancode_pre_scan =
ignore = scancode.plugin_ignore:ProcessIgnore
facet = summarycode.facet:AddFacet
classify = summarycode.classify:FileClassifier
classify = summarycode.classify_plugin:FileClassifier


# scancode_scan is the entry point for scan plugins that run a scan after the
Expand Down
95 changes: 51 additions & 44 deletions src/formattedcode/output_csv.py
Expand Up @@ -6,8 +6,11 @@
# See https://github.com/nexB/scancode-toolkit for support or download.
# See https://aboutcode.org for more information about nexB OSS projects.
#

import attr
import csv
import logging
import os
import warnings

import saneyaml
Expand All @@ -20,24 +23,23 @@
from formattedcode import FileOptionType

# Tracing flags
TRACE = False
TRACE = os.environ.get('SCANCODE_DEBUG_OUTPUT_CSV', False)


def logger_debug(*args):
pass


logger = logging.getLogger(__name__)

if TRACE:
import sys
import logging

logger = logging.getLogger(__name__)
logging.basicConfig(stream=sys.stdout)
logger.setLevel(logging.DEBUG)

def logger_debug(*args):
return logger.debug(' '.join(isinstance(a, str)
and a or repr(a) for a in args))
return logger.debug(' '.join(isinstance(a, str) and a or repr(a) for a in args))


DEPRECATED_MSG = (
'The --csv option is deprecated and will be replaced by new CSV and '
Expand Down Expand Up @@ -75,12 +77,11 @@ def process_codebase(self, codebase, csv, **kwargs):


def write_csv(results, output_file):
# FIXMe: this is reading all in memory
# FIXME: this is reading all in memory
results = list(results)

headers = dict([
('info', []),
('license_expression', []),
('license', []),
('copyright', []),
('email', []),
Expand Down Expand Up @@ -129,49 +130,55 @@ def collect_keys(mapping, key_group):

errors = scanned_file.pop('scan_errors', [])

file_info = dict(path=path)
file_info.update(((k, v) for k, v in scanned_file.items()
# FIXME: info are NOT lists: lists are the actual scans
if not isinstance(v, (list, dict))))
file_info = dict(path=path)
file_info.update(
(
(k, v) for k, v in scanned_file.items()
if not isinstance(v, (list, dict))
)
)
# Scan errors are joined in a single multi-line value
file_info['scan_errors'] = '\n'.join(errors)

collect_keys(file_info, 'info')
yield file_info

for lic_exp in scanned_file.get('license_expressions', []):
inf = dict(path=path, license_expression=lic_exp)
collect_keys(inf, 'license_expression')
yield inf
for detection in scanned_file.get('license_detections', []):
license_expression = detection["license_expression"]
detection_rules = detection["detection_rules"]
detection_rules = '\n'.join(detection_rules)
license_matches = detection["matches"]
for match in license_matches:
lic = dict(path=path)
lic["license_expression"] = license_expression
lic["detection_rules"] = detection_rules

for k, val in match.items():
# do not include matched text for now.
if k == 'matched_text':
continue

if k == 'licenses':
license_keys = []
for license_item in val:
license_keys.append(license_item["key"])
k = 'license_match__' + k
lic[k] = '\n'.join(license_keys)
continue

if k in ('score', 'match_coverage', 'rule_relevance'):
val = with_two_decimals(val)

# lines are present in multiple scans: keep their column name as
# not scan-specific. Prefix othe columns with license__
if k not in ('start_line', 'end_line',):
k = 'license_match__' + k

lic[k] = val

for licensing in scanned_file.get('licenses', []):
lic = dict(path=path)
for k, val in licensing.items():
# do not include matched text for now.
if k == 'matched_text':
continue

if k == 'matched_rule':
for mrk, mrv in val.items():
if mrk in ('match_coverage', 'rule_relevance'):
# normalize the string representation of this number
mrv = with_two_decimals(mrv)
else:
mrv = pretty(mrv)
mrk = 'matched_rule__' + mrk
lic[mrk] = mrv
continue

if k == 'score':
val = with_two_decimals(val)

# lines are present in multiple scans: keep their column name as
# not scan-specific. Prefix othe columns with license__
if k not in ('start_line', 'end_line',):
k = 'license__' + k
lic[k] = val
collect_keys(lic, 'license')
yield lic
collect_keys(lic, 'license')
yield lic

for copyr in scanned_file.get('copyrights', []):
inf = dict(path=path)
Expand Down Expand Up @@ -348,6 +355,6 @@ def flatten_package(_package, path, prefix='package__'):
else:
# Use repr if not a string
if val:
pack[nk] = pretty(val)
pack[nk] = repr(val)

return pack