aboutcode-org · pombredanne · Nov 11, 2022 · May 17, 2022
diff --git a/docs/scripts/sphinx_build_link_check.sh b/docs/scripts/sphinx_build_link_check.sh
diff --git a/docs/source/explanations/index.rst b/docs/source/explanations/index.rst
@@ -7,6 +7,7 @@
    :maxdepth: 2
 
    overview
+   license-detection-reference
 
 ..
    [ToAdd]

diff --git a/docs/source/explanations/license-detection-reference.rst b/docs/source/explanations/license-detection-reference.rst
@@ -0,0 +1,194 @@
+License Detection and Reference Additions
+=========================================
+
+`Main Issue <https://github.com/nexB/scancode-toolkit/issues/2878>`_
+
+`Main Pull Request <https://github.com/nexB/scancode-toolkit/pull/2961>`_
+
+`A presentation on this <https://github.com/nexB/scancode-toolkit/issues/2878#issuecomment-1079639973>`_
+
+
+Previous Work
+-------------
+
+- Akansha's GSoC work on unknown local references and unknown detection
+  based on ngrams from LicenseDB texts.
+
+- work from ``scancode-analyzer`` and ``debian copyright detection``
+  which had the concept of a LicenseDetection, flat LicenseMatches and
+  getting unique detections across a scan, referencing the details.
+
+- work on primary-license and license scoring.
+
+LicenseDetection
+----------------
+
+This aims to solve a few types of false positives commonly observed in
+ScanCode license detection. These are:
+
+The ``unknown`` cases
+^^^^^^^^^^^^^^^^^^^^^
+
+- Unknown Intros with Proper Detections after them
+- Unknown references to local files
+
+License Clues
+^^^^^^^^^^^^^
+
+This would also introduce a ``license_clues`` list of LicenseMatches
+holding improper detections or other clues, such as URLs, which
+cannot be marked as detections.
+
+License Versions
+^^^^^^^^^^^^^^^^
+
+This would also simplify license-expressions for gpl/lgpl cases
+with versioned/unversioned matches detected together.
+
+Package License Detections
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+License detections in package manifests currently have only the license-expression
+from the detection, unlike licenses detected directly in files, which
+have details. With this change, packages would also have details.
+
+Other Soulution Elements
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Merged:
+
+- Key {{phrases}} in license text rules
+- New license clarity scoring
+- Report the primary license
+
+Upcoming:
+
+- Make it easier to report, review and curate license detections
+  (GSoC Project in scancode.io)
+
+- Fixing bugs and updating the heuristics.
+  (This will be ongoing like the LicenseDB updates)
+
+Examples
+^^^^^^^^
+
+An example from the eclipse foundation::
+
+ /*********************************************************************
+ * Copyright (c) 2019 Red Hat, Inc.
+ *
+ * This program and the accompanying materials are made
+ * available under the terms of the Eclipse Public License 2.0
+ * which is available at https://www.eclipse.org/legal/epl-2.0/
+ *
+ * SPDX-License-Identifier: EPL-2.0
+ **********************************************************************/
+
+
+The text ``"This program and the accompanying materials are made\n* available under the terms of the"``
+is detected as ``unknown-license-reference`` with ``is_license_intro`` set to True,
+and is followed by several ``"epl-2.0"`` detections.
+
+What is a LicenseDetection?
+---------------------------
+
+A LicenseDetection groups one or more LicenseMatch objects
+and produces the license expression that we finally report.
+
+Properties:
+
+- A file can have multiple LicenseDetections (separated by non-legalese lines)
+- This can be from a file directly or a package.
+- We should be mostly certain of a proper detection to create a LicenseDetection.
+- One LicenseDetection can have matches from different files, in case of local license
+  references.
+
+
+LicenseMatch Result Data
+------------------------
+
+LicenseMatch data currently is based on a ``license key`` instead of being based
+on a ``license-expression``.
+
+So if there is a ``mit and apache-2.0`` license expression detected from a single
+LicenseMatch, we currently add two entries in the ``licenses`` list for that
+resource, one for each license key, (here ``mit`` and ``apache-2.0`` respectively).
+This repeats the match details as these two entries have the same details except the
+license key. And this is wrong.
+
+We should only add one entry per match (and therefore per ``rule``) and here the
+primary attribute should be the ``license-expression``, rather than the ``license-key``.
+
+We also create a mapping inside a mapping in these license details to refer to the
+license rule (and there are other inconsistencies in how we report here). We should
+just report a flat mapping here, with a list of the license keys at the end.
+
+
+Only reference License related Data
+-----------------------------------
+
+Currently all license related data is inlined in each match, and this repeats
+a lot of information. This repetition exists at three levels:
+
+- License Data
+- LicenseDB Data
+- LicenseDetection Data
+
+If we introduce a new command line option ``--licenses-reference``, which of these
+should we reference, just License/LicenseDB data, just LicenseDetection level data
+or all of them?
+
+License Data
+^^^^^^^^^^^^
+
+This is referencing data related to whole licenses, referenced by their license key.
+
+Example: ``apache-2.0``
+
+Other attributes are its full text, links to origin, licenseDB, spdx, osi etc.
+
+
+LicenseDB Data
+^^^^^^^^^^^^^^
+
+This is referencing data related to a LicenseDB entry.
+I.e. the identifier is a ``RULE`` or a ``LICENSE`` file.
+
+Example: ``apache-2.0_2.RULE``
+
+Other attributes are its license-expression, the boolean fields, length, relevance etc.
+
+
+LicenseDetection Data
+^^^^^^^^^^^^^^^^^^^^^
+
+This is referencing by LicenseDetection. Each one has one or more LicenseMatches.
+
+The identifier is a hash/uuid field computed from a nested tuple of select attributes.
+
+It represents a single LicenseDetection even if the same detection is present across multiple files.
+
+Attributes will be:
+
+- File Regions where these are found (File Path + Start and End line)
+- Score, matched length, matcher (like ``1-hash``, ``2-aho``), and matched text.
+
+
+What should be the default option?
+----------------------------------
+
+Two changes were long-planned and should be default:
+
+- LicenseDetections in the results
+- LicenseMatch being for a ``license-expression``
+
+This is already a lot of change, so also having the referencing details as default doesn't
+make sense IMHO.
+
+- We surely need to keep the inlined details as an option, because otherwise it becomes the
+  downstream tools' responsibility to fetch these details and inline them.
+
+We can always make the details referenced as the default option in a later release after more
+testing and feedback. So for now we can have the ``--licenses-reference`` command line option,
+which removes the details and puts them in a top-level list, and keep the details inlined
+by default.
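
The "LicenseDetection Data" section of the new document describes an identifier computed from a nested tuple of select attributes, shared by the same detection across files. A minimal sketch of that idea follows. It is not the implementation in this pull request: the function names, the choice of SHA-1, and the ``rule_identifier`` and ``matched_length`` keys are assumptions made for illustration only.

```python
import hashlib


def detection_identifier(license_expression, matches):
    """
    Return a stable identifier for a detection.
    Hypothetical helper: the attributes selected here are an assumption.
    """
    key = (
        license_expression,
        tuple(
            (m["rule_identifier"], m["score"], m["matched_length"])
            for m in matches
        ),
    )
    # repr() of a tuple of str/float/int values is deterministic
    return hashlib.sha1(repr(key).encode("utf-8")).hexdigest()


def collect_unique_detections(files):
    """
    Group identical detections found in several files under one identifier
    and record the file regions where each one occurs.
    """
    unique = {}
    for scanned_file in files:
        for detection in scanned_file["license_detections"]:
            expression = detection["license_expression"]
            matches = detection["matches"]
            identifier = detection_identifier(expression, matches)
            entry = unique.setdefault(
                identifier,
                {
                    "identifier": identifier,
                    "license_expression": expression,
                    "file_regions": [],
                },
            )
            entry["file_regions"].append(
                {
                    "path": scanned_file["path"],
                    "start_line": min(m["start_line"] for m in matches),
                    "end_line": max(m["end_line"] for m in matches),
                }
            )
    return unique
```

Line numbers are deliberately left out of the identifier in this sketch, so the same notice found at different positions in two files maps to one entry with two file regions.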
diff --git a/etc/scripts/utils_thirdparty.py b/etc/scripts/utils_thirdparty.py
@@ -910,7 +910,7 @@ def load_pkginfo_data(self, dest_dir=THIRDPARTY_DIR):
         declared_license = [raw_data["License"]] + [
             c for c in classifiers if c.startswith("License")
         ]
-        license_expression = compute_normalized_license_expression(declared_license)
+        license_expression = get_license_expression(declared_license)
         other_classifiers = [c for c in classifiers if not c.startswith("License")]
 
         holder = raw_data["Author"]
@@ -2272,16 +2272,16 @@ def find_problems(
     check_about(dest_dir=dest_dir)
 
 
-def compute_normalized_license_expression(declared_licenses):
+def get_license_expression(declared_licenses):
     """
     Return a normalized license expression or None.
     """
     if not declared_licenses:
         return
     try:
-        from packagedcode import pypi
+        from packagedcode.licensing import get_only_expression_from_extracted_license
 
-        return pypi.compute_normalized_license(declared_licenses)
+        return get_only_expression_from_extracted_license(declared_licenses)
     except ImportError:
         # Scancode is not installed, clean and join all the licenses
         lics = [python_safe_name(l).lower() for l in declared_licenses]

diff --git a/setup.cfg b/setup.cfg
@@ -157,7 +157,7 @@ console_scripts =
 scancode_pre_scan =
     ignore = scancode.plugin_ignore:ProcessIgnore
     facet = summarycode.facet:AddFacet
-    classify = summarycode.classify:FileClassifier
+    classify = summarycode.classify_plugin:FileClassifier
 
 
 # scancode_scan is the entry point for scan plugins that run a scan after the

diff --git a/src/formattedcode/output_csv.py b/src/formattedcode/output_csv.py
@@ -6,8 +6,11 @@
 # See https://github.com/nexB/scancode-toolkit for support or download.
 # See https://aboutcode.org for more information about nexB OSS projects.
 #
+
 import attr
 import csv
+import logging
+import os
 import warnings
 
 import saneyaml
@@ -20,24 +23,23 @@
 from formattedcode import FileOptionType
 
 # Tracing flags
-TRACE = False
+TRACE = os.environ.get('SCANCODE_DEBUG_OUTPUT_CSV', False)
 
 
 def logger_debug(*args):
     pass
 
 
+logger = logging.getLogger(__name__)
+
 if TRACE:
     import sys
-    import logging
-
-    logger = logging.getLogger(__name__)
     logging.basicConfig(stream=sys.stdout)
     logger.setLevel(logging.DEBUG)
 
     def logger_debug(*args):
-        return logger.debug(' '.join(isinstance(a, str)
-                                     and a or repr(a) for a in args))
+        return logger.debug(' '.join(isinstance(a, str) and a or repr(a) for a in args))
+
 
 DEPRECATED_MSG = (
     'The --csv option is deprecated and will be replaced by new CSV and '
@@ -75,12 +77,11 @@ def process_codebase(self, codebase, csv, **kwargs):
 
 
 def write_csv(results, output_file):
-    # FIXMe: this is reading all in memory
+    # FIXME: this is reading all in memory
     results = list(results)
 
     headers = dict([
         ('info', []),
-        ('license_expression', []),
         ('license', []),
         ('copyright', []),
         ('email', []),
@@ -129,49 +130,55 @@ def collect_keys(mapping, key_group):
 
         errors = scanned_file.pop('scan_errors', [])
 
-        file_info = dict(path=path)
-        file_info.update(((k, v) for k, v in scanned_file.items()
         # FIXME: info are NOT lists: lists are the actual scans
-                          if not isinstance(v, (list, dict))))
+        file_info = dict(path=path)
+        file_info.update(
+            (
+                (k, v) for k, v in scanned_file.items()
+                if not isinstance(v, (list, dict))
+            )
+        )
         # Scan errors are joined in a single multi-line value
         file_info['scan_errors'] = '\n'.join(errors)
 
         collect_keys(file_info, 'info')
         yield file_info
 
-        for lic_exp in scanned_file.get('license_expressions', []):
-            inf = dict(path=path, license_expression=lic_exp)
-            collect_keys(inf, 'license_expression')
-            yield inf
+        for detection in scanned_file.get('license_detections', []):
+            license_expression = detection["license_expression"]
+            detection_rules = detection["detection_rules"]
+            detection_rules = '\n'.join(detection_rules)
+            license_matches = detection["matches"]
+            for match in license_matches:
+                lic = dict(path=path)
+                lic["license_expression"] = license_expression
+                lic["detection_rules"] = detection_rules
+
+                for k, val in match.items():
+                    # do not include matched text for now.
+                    if k == 'matched_text':
+                        continue
+
+                    if k == 'licenses':
+                        license_keys = []
+                        for license_item in val:
+                            license_keys.append(license_item["key"])
+                        k = 'license_match__' + k
+                        lic[k] = '\n'.join(license_keys)
+                        continue
+
+                    if k in ('score', 'match_coverage', 'rule_relevance'):
+                        val = with_two_decimals(val)
+
+                    # lines are present in multiple scans: keep their column name as
+                    # not scan-specific. Prefix other columns with license_match__
+                    if k not in ('start_line', 'end_line',):
+                        k = 'license_match__' + k
+
+                    lic[k] = val
 
-        for licensing in scanned_file.get('licenses', []):
-            lic = dict(path=path)
-            for k, val in licensing.items():
-                # do not include matched text for now.
-                if k == 'matched_text':
-                    continue
-
-                if k == 'matched_rule':
-                    for mrk, mrv in val.items():
-                        if mrk in ('match_coverage', 'rule_relevance'):
-                            # normalize the string representation of this number
-                            mrv = with_two_decimals(mrv)
-                        else:
-                            mrv = pretty(mrv)
-                        mrk = 'matched_rule__' + mrk
-                        lic[mrk] = mrv
-                    continue
-
-                if k == 'score':
-                    val = with_two_decimals(val)
-
-                # lines are present in multiple scans: keep their column name as
-                # not scan-specific. Prefix othe columns with license__
-                if k not in ('start_line', 'end_line',):
-                    k = 'license__' + k
-                lic[k] = val
-            collect_keys(lic, 'license')
-            yield lic
+                collect_keys(lic, 'license')
+                yield lic
 
         for copyr in scanned_file.get('copyrights', []):
             inf = dict(path=path)
@@ -348,6 +355,6 @@ def flatten_package(_package, path, prefix='package__'):
         else:
             # Use repr if not a string
             if val:
-                pack[nk] = pretty(val)
+                pack[nk] = repr(val)
 
     return pack
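
The new ``license_detections`` loop in ``write_csv`` above emits one flat row per match. The standalone sketch below mirrors that loop so the resulting column names can be seen on sample data. ``flatten_detection`` is a hypothetical name, and the two-decimal formatting is a stand-in for the module's ``with_two_decimals`` helper, whose exact behavior is assumed here.

```python
def flatten_detection(path, detection):
    """
    Yield one flat mapping per match of a detection, prefixing
    match-specific columns with ``license_match__``.
    """
    rules = "\n".join(detection["detection_rules"])
    for match in detection["matches"]:
        row = {
            "path": path,
            "license_expression": detection["license_expression"],
            "detection_rules": rules,
        }
        for key, value in match.items():
            # matched text is skipped, as in the diff
            if key == "matched_text":
                continue
            if key == "licenses":
                # keep only the license keys, one per line
                row["license_match__licenses"] = "\n".join(
                    item["key"] for item in value
                )
                continue
            if key in ("score", "match_coverage", "rule_relevance"):
                # assumed stand-in for with_two_decimals()
                value = "{:.2f}".format(value)
            # line numbers are shared by several scans: keep them unprefixed
            if key not in ("start_line", "end_line"):
                key = "license_match__" + key
            row[key] = value
        yield row


sample = {
    "license_expression": "mit AND apache-2.0",
    "detection_rules": ["not-combined"],
    "matches": [
        {
            "score": 100,
            "start_line": 1,
            "end_line": 4,
            "matched_text": "ignored",
            "licenses": [{"key": "mit"}, {"key": "apache-2.0"}],
        }
    ],
}
rows = list(flatten_detection("src/a.c", sample))
```

The sample values, including the ``not-combined`` rule name, are made up for the example.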