Merge branch 'develop' into add-license-detection
Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
AyanSinhaMahapatra committed May 31, 2022
2 parents 6a91773 + aba3112 commit 3516046
Showing 78 changed files with 6,515 additions and 2,790 deletions.
215 changes: 126 additions & 89 deletions CHANGELOG.rst
Expand Up @@ -5,16 +5,12 @@ Changelog
31.0.0 (next, roadmap)
-----------------------

Important API changes:
~~~~~~~~~~~~~~~~~~~~~~~~
This is a major release with important bug and security fixes, new and improved
features and API changes.

- Adopted the new skeleton from https://github.com/nexB/skeleton
The key change is the location of the virtual environment. It used to be
created at the root of the scancode-toolkit directory. It is now created
under the ``venv`` subdirectory.

- The main package API function `get_package_infos` is deprecated, and
replaced by `get_package_data`.
Important API changes:
~~~~~~~~~~~~~~~~~~~~~~~~

- The data structure of the JSON output has changed for copyrights, authors
and holders. We now use a proper name for attributes and not a generic "value".
Expand All @@ -31,14 +27,14 @@ Important API changes:
rather than "packages". This has all the data attributes of a "package_data"
field plus others: "package_uuid", "package_data_files" and "files".

- There is a new top-level "packages" attribute that contains package
instances that can be aggregating data from multiple manifests.
- There is a new top-level "packages" attribute that contains package
instances that can be aggregating data from multiple manifests.

- There is a new top-level "dependencies" attribute that contains each dependency
instance; these can be standalone or related to a package.
- There is a new top-level "dependencies" attribute that contains each
dependency instance; these can be standalone or related to a package.

- There is a new resource-level attribute "for_packages" which refers to packages
through package_uuids (pURL + uuid string).
- There is a new resource-level attribute "for_packages" which refers to
packages through package_uuids (pURL + uuid string).

- The data structure for HTML output has been changed to include emails and
urls under the "infos" object. The HTML template displays output for holders,
Expand All @@ -48,12 +44,18 @@ Important API changes:
column to "path". "copyright_holder" has been renamed to "holder"

- The license clarity scoring plugin has been overhauled to show new license
clarity criteria. More details of the new criteria are provided below.
clarity criteria. More details of the new scoring criteria are provided below.

- The functionality of the summary plugin has been improved to provide declared
origin and license information for the codebase being scanned. The previous
summary plugin functionality has been preserved in the new ``tallies`` plugin.
More details are provided below.

- The functionality of the summary plugin has been changed to provide declared
origin information for the codebase being scanned. The previous summary plugin
functionality has been preserved in the new ``tallies`` plugin. More details
are provided below.
- ScanCode has adopted the new code skeleton from https://github.com/nexB/skeleton
The key change is the location of the virtual environment. It used to be
created at the root of the scancode-toolkit directory. It is now created
under the ``venv`` subdirectory. You must be aware of this if you use ScanCode
from a git clone.


Copyright detection:
Expand All @@ -76,7 +78,7 @@ License detection:
- XXXX new license detection rules have been added, and
- XXXX existing license rules have been updated.
- XXXX existing false positive license rules have been removed (see below).
- The SPDX license list has been updated to the latest v3.15
- The SPDX license list has been updated to the latest v3.16

- The rule attribute "only_known_words" has been renamed to "is_continuous" and its
meaning has been updated and expanded. A rule tagged as "is_continuous" can only
Expand All @@ -85,10 +87,10 @@ License detection:
The processing for "is_continuous" has been merged in "key phrases" processing
below.

- Key phrases can now be defined in RULEs by surrounding one or more words with
`{{` and `}}`. When defined, a RULE will only match when the key phrases match
exactly. When all the text of a rule is a "key phrase", this is the same as being
"is_continuous".
- Key phrases can now be defined in a RULE text by surrounding one or more words
with double curly braces `{{` and `}}`. When defined, a RULE will only match
when the key phrases match exactly. When all the text of a rule is a "key phrase",
this is the same as being "is_continuous".

- The "--unknown-licenses" option now also detects unknown licenses using a
simple and effective ngrams-based matching in areas that are not matched or
Expand Down Expand Up @@ -135,6 +137,7 @@ License detection:
tagged and they may not be detected unless you activate this new indexing
feature.
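
The key phrase markup described in the license detection notes above can be illustrated with a small sketch. This is not ScanCode's own parser: the helper names ``key_phrases`` and ``covers_whole_rule`` are hypothetical, and the only thing taken from the changelog is that key phrases are wrapped in ``{{`` and ``}}``.

```python
import re

# Hypothetical helpers for illustration only; ScanCode's real rule parsing
# is not shown in this changelog and may differ.
KEY_PHRASE = re.compile(r"\{\{(.*?)\}\}", re.DOTALL)


def key_phrases(rule_text):
    """Return the key phrases marked with {{ }} in a rule text, whitespace-normalized."""
    return [" ".join(phrase.split()) for phrase in KEY_PHRASE.findall(rule_text)]


def covers_whole_rule(rule_text):
    """Return True when nothing but key phrases remains in the rule text."""
    has_phrase = bool(KEY_PHRASE.search(rule_text))
    remainder = KEY_PHRASE.sub("", rule_text).strip()
    return has_phrase and not remainder


rule = "Licensed under the {{Apache License}}, version 2.0"
print(key_phrases(rule))
print(covers_whole_rule("{{MIT License}}"))
```

A rule where ``covers_whole_rule`` is true corresponds to the case the changelog describes as equivalent to "is_continuous".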


Package detection:
~~~~~~~~~~~~~~~~~~

Expand Down Expand Up @@ -172,77 +175,84 @@ Package detection:
License Clarity Scoring Update
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- We are moving away from the license clarity scoring defined by ClearlyDefined
in the license clarity score plugin. The previous license clarity scoring
logic produced a score that was misleading when it would return a low score
due to the stringent scoring criteria. We are now
using more general criteria to get a sense of what provenance information has
been provided and whether or not there is a conflict in licensing between
what licenses were declared at the top-level key files and what licenses have
been detected in the files under the top-level.
- We are moving away from the original license clarity scoring designed for
ClearlyDefined in the license clarity score plugin. The previous license
clarity scoring logic produced a score that was misleading when it would
return a low score due to the stringent scoring criteria. We are now using
more general criteria to get a sense of what provenance information has been
provided and whether or not there is a conflict in licensing between what
licenses were declared at the top-level key files and what licenses have been
detected in the files under the top-level.

- The license clarity score is a value from 0-100 calculated by combining the
weighted values determined for each of the scoring elements:
- The license clarity score is a value from 0-100 calculated by combining the
weighted values determined for each of the scoring elements:

- Declared license:
- Declared license:

- When true, indicates that the software package licensing is documented at
top-level or well-known locations in the software project, typically in a
package manifest, NOTICE, LICENSE, COPYING or README file.
- Scoring Weight = 40
- When true, indicates that the software package licensing is documented at
top-level or well-known locations in the software project, typically in a
package manifest, NOTICE, LICENSE, COPYING or README file.
- Scoring Weight = 40

- Identification precision:
- Identification precision:

- Indicates how well the license statement(s) of the software identify known
licenses that can be designated by precise keys (identifiers) as provided in
a publicly available license list, such as the ScanCode LicenseDB, the SPDX
license list, the OSI license list, or a URL pointing to a specific license
text in a project or organization website.
- Scoring Weight = 40
- Indicates how well the license statement(s) of the software identify known
licenses that can be designated by precise keys (identifiers) as provided in
a publicly available license list, such as the ScanCode LicenseDB, the SPDX
license list, the OSI license list, or a URL pointing to a specific license
text in a project or organization website.
- Scoring Weight = 40

- License texts:
- License texts:

- License texts are provided to support the declared license expression in
files such as a package manifest, NOTICE, LICENSE, COPYING or README.
- Scoring Weight = 10
- License texts are provided to support the declared license expression in
files such as a package manifest, NOTICE, LICENSE, COPYING or README.
- Scoring Weight = 10

- Declared copyright:
- Declared copyright:

- When true, indicates that the software package copyright is documented at
top-level or well-known locations in the software project, typically in a
package manifest, NOTICE, LICENSE, COPYING or README file.
- Scoring Weight = 10
- When true, indicates that the software package copyright is documented at
top-level or well-known locations in the software project, typically in a
package manifest, NOTICE, LICENSE, COPYING or README file.
- Scoring Weight = 10

- Ambiguous compound licensing:
- Ambiguous compound licensing:

- When true, indicates that the software has a license declaration that
makes it difficult to construct a reliable license expression, such as in
the case of multiple licenses where the conjunctive versus disjunctive
relationship is not well defined.
- Scoring Weight = -10
- When true, indicates that the software has a license declaration that
makes it difficult to construct a reliable license expression, such as in
the case of multiple licenses where the conjunctive versus disjunctive
relationship is not well defined.
- Scoring Weight = -10

- Conflicting license categories:
- Conflicting license categories:

- When true, indicates that the declared license expression of the software is in
the permissive category, but that other potentially conflicting categories,
such as copyleft and proprietary, have been detected in lower level code.
- Scoring Weight = -20
- When true, indicates that the declared license expression of the software
is in the permissive category, but that other potentially conflicting
categories, such as copyleft and proprietary, have been detected in lower
level code.
- Scoring Weight = -20
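
As a rough illustration of how the weights listed above might combine, here is a minimal sketch. The weights come from the list; everything else is an assumption: each element is treated as a simple true/false flag, the flags are summed, and the result is clamped to the 0-100 range. The dictionary keys are illustrative labels, and the actual plugin may compute the score differently.

```python
# Weights are taken from the changelog list above. The combination rule
# (sum of the weights of the true elements, clamped to 0-100) is an
# assumption for illustration, not ScanCode's actual implementation.
WEIGHTS = {
    "declared_license": 40,
    "identification_precision": 40,
    "license_texts": 10,
    "declared_copyright": 10,
    "ambiguous_compound_licensing": -10,
    "conflicting_license_categories": -20,
}


def clarity_score(elements):
    """Combine true/false scoring elements into a 0-100 score (hypothetical helper)."""
    total = sum(weight for name, weight in WEIGHTS.items() if elements.get(name))
    return max(0, min(100, total))


example = {
    "declared_license": True,
    "identification_precision": True,
    "license_texts": True,
    "declared_copyright": False,
    "conflicting_license_categories": True,
}
print(clarity_score(example))  # 40 + 40 + 10 - 20 = 70
```

Under these assumptions a codebase with every positive element present and no negative element scores 100, and the two negative elements can only lower the score.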


Summary Plugin Update
~~~~~~~~~~~~~~~~~~~~~
The summary plugin's behavior has been changed. Previously, it provided a count
of the detected license expressions, copyrights, holders, authors, and
programming languages from a scan. We have preserved this functionality by
creating a new plugin called ``tallies``. All functionality of the previous
summary plugin has been preserved in the tallies plugin.

The plugin now attempts to determine a declared license expression, holder, and
primary programming language from a scan. The license clarity score provides
context on what origin information is provided from key files. It also returns
lists of tallies of the other detected license expressions, holders, and
programming languages. All information is provided in the codebase level
attribute named ``summary``.
- The summary plugin's behavior has been changed. Previously, it provided a
count of the detected license expressions, copyrights, holders, authors, and
programming languages from a scan.

We have preserved this functionality by creating a new plugin called ``tallies``.
All functionality of the previous summary plugin has been preserved in the
tallies plugin.

- The new summary plugin now attempts to determine a declared license expression,
declared holder, and the primary programming language from a scan. And the
updated license clarity score provides context on the quality of the license
information provided in the codebase key files.

- The new summary plugin also returns lists of tallies for the other "secondary"
detected license expressions, copyright holders, and programming languages.

All summary information is provided at the codebase-level attribute named ``summary``.


Outputs:
Expand All @@ -258,15 +268,36 @@ Outputs:
Output version
--------------

Scancode Data Output Version is now 3.0.0.
Scancode Data Output Version is now 2.0.0.


Changes:

- rename resource level attribute `packages` to `package_data`.
- add top-level attribute `packages`.
- add top-level attribute `dependencies`.
- add resource-level attribute `for_packages`.
- remove `package-data` attribute `root_path`.
- Rename resource level attribute `packages` to `package_data`.
- Add top-level attribute `packages`.
- Add top-level attribute `dependencies`.
- Add resource-level attribute `for_packages`.
- Remove `package-data` attribute `root_path`.
- The fields of the license clarity scoring plugin have been replaced with the
following fields. An overview of the new fields can be found in the "License
Clarity Scoring Update" section above.
- `score`
- `declared_license`
- `identification_precision`
- `has_license_text`
- `declared_copyrights`
- `conflicting_license_categories`
- `ambigious_compound_licensing`
- The fields of the summary plugin have been replaced with the following fields.
An overview of the new fields can be found in the "Summary Plugin Update"
section above.
- `declared_license_expression`
- `license_clarity_score`
- `declared_holder`
- `primary_language`
- `other_license_expressions`
- `other_holders`
- `other_languages`
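
To make the new package-related attributes concrete, the sketch below walks a scan loaded as a Python dictionary and groups resource paths by the package identifiers found in ``for_packages``. The sample data is invented, and the top-level ``files`` list and the ``path`` key are assumptions about the layout; only the attribute names ``packages``, ``dependencies``, ``package_data`` and ``for_packages`` come from the changes listed above.

```python
from collections import defaultdict


def paths_by_package(scan):
    """Map each package identifier to the resource paths that refer to it.

    Hypothetical helper: assumes resources live under a top-level "files"
    list and that each resource has a "path" and a "for_packages" list.
    """
    index = defaultdict(list)
    for resource in scan.get("files", []):
        for package_uuid in resource.get("for_packages", []):
            index[package_uuid].append(resource["path"])
    return dict(index)


# Invented sample data shaped after the attributes named in the changelog.
sample_scan = {
    "packages": [{"package_uuid": "pkg:pypi/example@1.0?uuid=1234"}],
    "dependencies": [],
    "files": [
        {"path": "example/setup.py", "package_data": [],
         "for_packages": ["pkg:pypi/example@1.0?uuid=1234"]},
        {"path": "example/README", "package_data": [],
         "for_packages": ["pkg:pypi/example@1.0?uuid=1234"]},
        {"path": "unrelated.txt", "package_data": [], "for_packages": []},
    ],
}

print(paths_by_package(sample_scan))
```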


Documentation Update
Expand All @@ -276,16 +307,22 @@ Documentation Update
correct minor documentation issues.


Development environment changes:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Development environment and Code API changes:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- The main package API function `get_package_infos` is deprecated, and
replaced by `get_package_data`.

- Resource paths are always the same regardless of the strip-root or
full-root arguments.

- The license cache consistency is not checked anymore when you are using a Git
- The license cache consistency is not checked anymore when you are using a git
checkout. The SCANCODE_DEV_MODE tag file has been removed entirely. Use
the --reindex-licenses option instead to rebuild the license index.

- We can now regenerate updated test fixtures using the new SCANCODE_REGEN_TEST_FIXTURES
environment variable. There is no need to replace the regen=False with regen=True
in the code.
- We can now regenerate test fixtures using the new SCANCODE_REGEN_TEST_FIXTURES
environment variable. There is no need to replace the regen=False with
regen=True in the code.
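
A test helper could read the environment variable mentioned above along these lines. This is a sketch only: the function name ``should_regen`` is hypothetical, and treating any non-empty value as "on" is an assumption rather than documented behavior.

```python
import os


def should_regen(env=None):
    """Return True when fixture regeneration is requested via the environment.

    Hypothetical helper: assumes any non-empty value of
    SCANCODE_REGEN_TEST_FIXTURES turns regeneration on.
    """
    env = os.environ if env is None else env
    return bool(env.get("SCANCODE_REGEN_TEST_FIXTURES", ""))


# Passing a plain dict keeps the example independent of the real environment.
print(should_regen({"SCANCODE_REGEN_TEST_FIXTURES": "yes"}))
print(should_regen({}))
```

With a helper like this, a test can pass ``regen=should_regen()`` instead of editing the ``regen`` argument by hand.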


30.1.0 - 2021-09-25
Expand Down
17 changes: 13 additions & 4 deletions README.rst
Expand Up @@ -10,6 +10,12 @@ Read more about ScanCode here: https://scancode-toolkit.readthedocs.io/.

Check out the code at https://github.com/nexB/scancode-toolkit

Discover also:

- The ScanCode.io server project here: https://scancodeio.readthedocs.io
- Other companion SCA projects for code origin, license and security analysis
here: https://aboutcode.org


Build and tests status
======================
Expand Down Expand Up @@ -92,12 +98,15 @@ for upcoming features.
Documentation
=============

The ScanCode documentation is hosted at `scancode-toolkit.readthedocs.io <https://scancode-toolkit.readthedocs.io/en/latest/>`_.
The ScanCode documentation is hosted at
`scancode-toolkit.readthedocs.io <https://scancode-toolkit.readthedocs.io/en/latest/>`_.

If you are new to Scancode, start `here <https://scancode-toolkit.readthedocs.io/en/latest/getting-started/newcomer.html>`_.
If you are new to Scancode, start with our
`newcomer <https://scancode-toolkit.readthedocs.io/en/latest/getting-started/newcomer.html>`_ page.

If you want to compare output changes between different versions of Scancode, or want to look at reference scans
generated by Scancode, start `here <https://github.com/nexB/scancode-toolkit-reference-scans>`_.
If you want to compare output changes between different versions of Scancode,
or want to look at scans generated by Scancode, review our
`reference scans <https://github.com/nexB/scancode-toolkit-reference-scans>`_.

Other Important Documentation Pages:

Expand Down
2 changes: 1 addition & 1 deletion docs/source/tutorials/how_to_run_a_scan.rst
Expand Up @@ -36,7 +36,7 @@ This extracts the zlib.tar.gz package:

.. note::

``--shallow`` option can be used to recursively extract packages.
Use the ``--shallow`` option to prevent recursive extraction of nested archives.


Deciding Scan Options
Expand Down
4 changes: 2 additions & 2 deletions requirements.txt
Expand Up @@ -9,7 +9,7 @@ chardet==4.0.0
charset-normalizer==2.0.12
click==8.0.4
colorama==0.4.4
commoncode==30.2.0
commoncode==31.0.0b4
construct==2.10.68
container-inspector==31.0.0
cryptography==36.0.2
Expand Down Expand Up @@ -49,7 +49,7 @@ pefile==2021.9.3
pip-requirements-parser==31.2.0
pkginfo2==30.0.0
pluggy==1.0.0
plugincode==30.0.0
plugincode==31.0.0b1
ply==3.11
publicsuffix2==2.20191221
pyahocorasick==2.0.0b1
Expand Down