Package scancode-analyzer #58

Merged

10 changes: 6 additions & 4 deletions CHANGELOG.rst
@@ -1,5 +1,7 @@
Release notes
-------------
### Version 0.0.0
Changelog
=========

*xxxx-xx-xx* -- Initial release.
v21.4.2
-------

Initial release.
29 changes: 23 additions & 6 deletions INSTALL.rst
@@ -1,9 +1,15 @@
Quickstart - Scancode Plugin
----------------------------
Installation
============

``scancode-results-analyzer`` can be installed as a scancode post-scan plugin.
The installation methods below install the `scancode-analyzer` post-scan plugin alongside
`scancode`, extending it with the `--analyze-license-results` option.

1. Clone the Repository and navigate to the ``scancode-results-analyzer`` directory.
Install Plugin from Source
--------------------------

``scancode-analyzer`` can be installed as a scancode post-scan plugin.

1. Clone the Repository and navigate to the ``scancode-analyzer`` directory.

2. Configure (Installs the requirements, and scancode-toolkit with the plugin)::

@@ -23,13 +29,24 @@ Quickstart - Scancode Plugin

6. OR, import a JSON scan result and run the plugin on that scan::

scancode --json-pp results.json --from-json tests/data/results-test/selective-before-rules-added/only_errors.json --analyze-license-results
scancode --json-pp results.json --from-json path/to/scan_result.json --analyze-license-results

.. note::

`scancode-results-analyzer` has required CLI options, as these produce attributes
`scancode-analyzer` has required CLI options, as these produce attributes
essential to the analysis process. These are:
`--license --info --license-text --is-license-text --classify`
Even when loading from JSON, the scan that generated the JSON file must have
been run with these options for the analysis plugin to work.
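
As a rough illustration of the note above, the required options can be checked before the plugin is run on an imported scan. This is a minimal sketch, not part of the plugin: the helper names ``missing_required_options`` and ``build_from_json_command`` are hypothetical, and only the flags themselves come from the instructions above.

```python
# Options the note lists as required for the analysis to work.
REQUIRED_OPTIONS = (
    "--license",
    "--info",
    "--license-text",
    "--is-license-text",
    "--classify",
)


def missing_required_options(enabled_options):
    """Return the required options that are absent from the enabled ones.

    `enabled_options` is any iterable of option names; how they are obtained
    (for example from a saved scan) is left to the caller.
    """
    enabled = set(enabled_options)
    return [option for option in REQUIRED_OPTIONS if option not in enabled]


def build_from_json_command(scan_json, output_json):
    """Build the argument list for running the plugin on an existing scan."""
    return [
        "scancode",
        "--json-pp", output_json,
        "--from-json", scan_json,
        "--analyze-license-results",
    ]


if __name__ == "__main__":
    print(missing_required_options(["--license", "--info"]))
    print(" ".join(build_from_json_command("path/to/scan_result.json", "results.json")))
```

The argument list can be passed to ``subprocess.run`` once the missing list is empty.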


Install plugin via `pip`
------------------------

1. Install all `scancode` `prerequisites`_ and create a `virtualenvironment`_.

2. Run `pip install scancode-analyzer` to install the latest version of Scancode Analyzer.


.. _virtualenvironment: https://scancode-toolkit.readthedocs.io/en/latest/getting-started/install.html#installation-as-a-library-via-pip
.. _prerequisites: https://scancode-toolkit.readthedocs.io/en/latest/getting-started/install.html#prerequisites
37 changes: 20 additions & 17 deletions README.rst
@@ -1,19 +1,22 @@
scancode-results-analyzer
=========================
scancode-analyzer
=================

.. what-is-scancode-results-analyzer
.. what-is-scancode-analyzer

What is Scancode-Results-Analyzer
---------------------------------
What is Scancode-Analyzer
-------------------------

`ScanCode`_ detects licenses, copyrights, package manifests and direct dependencies and more both in source code and
binary files.
`ScanCode`_ detects licenses, copyrights, package manifests and direct dependencies and more both in
source code and binary files.

ScanCode license detection is using multiple techniques to accurately detect licenses based on automatons, inverted
indexes and multiple sequence alignments. The detection is not always accurate enough. The goal of this project is to
improve the accuracy of license detection leveraging the ClearlyDefined and other datasets, where ScanCode is used
to massively scan millions of packages. It would also be available as a `ScanCode`_ ``post-scan`` plugin to use it
in scans directly, or in `scancode.io`_ pipelines.
ScanCode license detection uses multiple techniques to accurately detect licenses based on
automatons, inverted indexes and multiple sequence alignments. Because the detection supports
approximate matching, there are many `unknown` detections and cases with multiple approximate matches.

The goal of this project is to improve the accuracy of license detection by leveraging ScanCode scans.

It is a `ScanCode`_ ``post-scan`` plugin that can be used in scans directly and, in the future, in
`scancode.io`_ pipelines, with better issue review and reporting features.

This project aims to:

@@ -22,7 +25,7 @@ This project aims to:
- Add this as a `scancode`_ post-scan plugin
- Add to pipelines in `scancode.io`_
- Write reusable tools and models to assist in the semi-automated reviews of scan results.
- It will also create new license detection rules semi-automatically to fix the detected anomalies
- It will also suggest new license detection rules semi-automatically to fix the detected anomalies

.. _ScanCode: https://github.com/nexB/scancode-toolkit
.. _scancode.io: https://github.com/nexB/scancode.io
@@ -37,12 +40,12 @@ Refer to the installation instructions on `INSTALL.rst`_
Documentation
-------------

Documentation: https://scancode-results-analyzer.readthedocs.io/en/latest/ [WIP]
Documentation: https://scancode-analyzer.readthedocs.io/en/latest/

Project Board
-------------

`Project Board`_ for ``scancode-results-analyzer`` : Analysing Scancode License Detection Results.
`Project Board`_ for ``scancode-analyzer`` : Analysing Scancode License Detection Results.

.. _INSTALL.rst: https://github.com/nexB/scancode-results-analyzer/tree/master/INSTALL.rst
.. _Project Board: https://github.com/nexB/scancode-results-analyzer/projects/1
.. _INSTALL.rst: https://github.com/nexB/scancode-analyzer/tree/master/INSTALL.rst
.. _Project Board: https://github.com/nexB/scancode-analyzer/projects/1
4 changes: 2 additions & 2 deletions docs/source/analysis-use-case/suggesting-licenses.rst
@@ -56,7 +56,7 @@ The steps are as follows:
1. First from the list of `license expressions`, all the `license expressions` are sorted according
to their occurrences.

2. Generic `license_expressions` like `unknown`, `warranty-disclaimer` are removed fro, this sorted
2. Generic `license_expressions` like `unknown`, `warranty-disclaimer` are removed from this sorted
list.

3. If there's only one `license_expression` with the most number of occurrences, then that is the
@@ -73,7 +73,7 @@ The steps are as follows:
1. The boolean value denoting the license type, i.e. license text/notice/tag/reference is determined
from their respective class of problem, which they are already divided into.

2. The ``ignorable`` attributes are added later by using scripts.
2. The ``ignorable`` attributes could be added later by using scripts.

3. The possible license id (like ``mit``) is predicted as the license ID of the match with the
longest ``match_coverage``. This has to be manually verified in most cases.
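
The visible steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, each match is assumed to be a flat dictionary with ``key`` and ``match_coverage`` entries, and because the tie-handling rule is not shown in this excerpt the sketch simply returns ``None`` when several expressions share the top count.

```python
from collections import Counter

# Generic expressions named in the steps above; the real list may be longer.
GENERIC_EXPRESSIONS = {"unknown", "warranty-disclaimer"}


def suggest_license_expression(license_expressions):
    """Pick the most frequent non-generic license expression, if it is unique."""
    # Step 1: sort the expressions by number of occurrences.
    ranked = Counter(license_expressions).most_common()
    # Step 2: drop generic expressions from the sorted list.
    ranked = [(expr, count) for expr, count in ranked if expr not in GENERIC_EXPRESSIONS]
    if not ranked:
        return None
    # Step 3: a single expression with the highest count is the suggestion.
    if len(ranked) > 1 and ranked[1][1] == ranked[0][1]:
        return None  # tie: the rule for this case is not shown in this excerpt
    return ranked[0][0]


def predict_license_id(matches):
    """Return the license key of the match with the highest match_coverage."""
    best = max(matches, key=lambda match: match["match_coverage"])
    return best["key"]


if __name__ == "__main__":
    print(suggest_license_expression(["mit", "unknown", "unknown", "mit", "gpl-2.0"]))
    print(predict_license_id([
        {"key": "mit", "match_coverage": 90.0},
        {"key": "bsd-new", "match_coverage": 40.0},
    ]))
```

As the text notes, the predicted license ID still has to be verified manually in most cases.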
79 changes: 48 additions & 31 deletions docs/source/api-and-outputs/json-output.rst
@@ -1,13 +1,13 @@
JSON Output Format
==================

`scancode-results-analyzer` is meant to be used as a post-scan Plugin for Scancode, where after
`scancode-analyzer` is meant to be used as a post-scan Plugin for Scancode, where after
running a scan, the scan results are then analyzed for scan errors, and that information is
added to the scancode JSON results.

Command Line Argument to use ``scancode-results-analyzer``: ``--analyze-license-results``
Command Line Argument to use ``scancode-analyzer``: ``--analyze-license-results``

Here's how example result-JSONs from `scancode-results-analyzer` could look like, post-analysis.
Here's what example result JSONs from `scancode-analyzer` could look like, post-analysis.

.. _license_detection_issues_result_json:

@@ -23,13 +23,6 @@ for each resource in the codebase this list of dictionary will be added, where e
is for each corresponding file-region :ref:`file_region`, having the results of the analysis for all
the match(es) in that file-region.

.. note::

[WIP]
There would also be a codebase-level dictionary added,
1. With statistics on the license_detection issues.
2. All the unique license detection issues and their occurrences.
3. Header information.

.. code-block:: json

@@ -110,6 +103,7 @@ a file-region, and containing analysis results for all the license matches in a
"is_license_notice": true,
"is_license_tag": false,
"is_license_reference": false,
"is_license_intro": false,
"analysis_confidence": "high",
"is_suggested_matched_text_complete": true
},
@@ -159,6 +153,9 @@ location.
"licenses": [
{
"key": "lgpl-2.0"
},
{
"key": "gpl-3.0-plus"
}
],
"licence_detection_issues": [
@@ -174,13 +171,19 @@ location.
"is_license_notice": true,
"is_license_tag": false,
"is_license_reference": false,
"is_license_intro": false,
"analysis_confidence": "medium",
"is_suggested_matched_text_complete": true
},
"suggested_license": {
"license_expression": "lgpl-2.0-plus",
"matched_text": " * licensed under the terms of the LGPL.... "
}
},
"original_licenses": [
{
"key": "lgpl-2.0"
}
]
},
{
"start_line": 54,
@@ -194,14 +197,19 @@ location.
"is_license_notice": true,
"is_license_tag": false,
"is_license_reference": false,
"is_license_intro": false,
"analysis_confidence": "high",
"is_suggested_matched_text_complete": true
},
"suggested_license": {
"license_expression": "gpl-3.0-plus",
"matched_text": "\"genshellopt is free software: you can redistribute it and/or modify it under \\\nthe terms of the GNU General Public License as published by the Free Software \\\nFoundation, either version 3 of the License, or (at your option) any later \\\nversion."
},
"original_licenses": []
"original_licenses": [
{
"key": "gpl-3.0-plus"
}
]
}
]
}
@@ -260,6 +268,7 @@ it is an empty list.
"is_license_notice": true,
"is_license_tag": false,
"is_license_reference": false,
"is_license_intro": false,
"analysis_confidence": "medium",
"is_suggested_matched_text_complete": true
},
@@ -304,13 +313,19 @@ it is an empty list.
"is_license_notice": true,
"is_license_tag": false,
"is_license_reference": false,
"is_license_intro": false,
"analysis_confidence": "medium",
"is_suggested_matched_text_complete": true
},
"suggested_license": {
"license_expression": "lgpl-2.0-plus",
"matched_text": " * licensed under the terms of the LGPL. "
}
},
"original_licenses": [
{
"key": "unknown"
}
]
}
]
}
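
A consumer of this output might want to list only the file-regions where the suggestion differs from what was originally detected. The sketch below assumes the top-level ``files`` list and ``path`` attribute of a ScanCode JSON result, and copies the ``licence_detection_issues``, ``suggested_license`` and ``original_licenses`` keys from the samples above; the function name ``changed_suggestions`` is hypothetical.

```python
def changed_suggestions(scan):
    """List regions whose suggested expression differs from the original keys.

    Returns (path, start_line, original_keys, suggested_expression) tuples.
    """
    rows = []
    for resource in scan.get("files", []):
        # Key spelling copied from the sample output above.
        for issue in resource.get("licence_detection_issues", []):
            suggested = issue.get("suggested_license", {}).get("license_expression")
            original = [lic.get("key") for lic in issue.get("original_licenses", [])]
            if suggested is not None and original != [suggested]:
                rows.append((resource.get("path"), issue.get("start_line"), original, suggested))
    return rows


if __name__ == "__main__":
    sample = {
        "files": [
            {
                "path": "example.c",  # hypothetical path for illustration
                "licence_detection_issues": [
                    {
                        "start_line": 3,
                        "suggested_license": {"license_expression": "lgpl-2.0-plus"},
                        "original_licenses": [{"key": "lgpl-2.0"}],
                    },
                    {
                        "start_line": 54,
                        "suggested_license": {"license_expression": "gpl-3.0-plus"},
                        "original_licenses": [{"key": "gpl-3.0-plus"}],
                    },
                ],
            }
        ]
    }
    for row in changed_suggestions(sample):
        print(row)
```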
@@ -336,22 +351,24 @@ All Unique License Detection Issues

.. code-block:: json

"unique_license_detection_issues": [
{
"unique_identifier": 1,
"files": [
{
"path": "1921-socat-2.0.0-error.h",
"start_line": 3,
"end_line": 3
{
"unique_license_detection_issues": [
{
"unique_identifier": 1,
"files": [
{
"path": "1921-socat-2.0.0-error.h",
"start_line": 3,
"end_line": 3
}
],
"license_detection_issue": {
"issue_category": "imperfect-match-coverage",
"issue_description": "The license detection is inconclusive with high confidence, because only a small part of the rule text is matched."
}
],
"license_detection_issue": {
"issue_category": "imperfect-match-coverage",
"issue_description": "The license detection is inconclusive with high confidence, because only a small part of the rule text is matched."
}
}
]
]
}
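
For review purposes, the unique issues can be grouped by category. This is a minimal sketch that relies only on the keys shown in the sample above; the function name ``files_by_issue_category`` is hypothetical.

```python
from collections import defaultdict


def files_by_issue_category(summary):
    """Map each issue category to the (path, start_line, end_line) regions it affects."""
    grouped = defaultdict(list)
    for issue in summary.get("unique_license_detection_issues", []):
        category = issue["license_detection_issue"]["issue_category"]
        for location in issue.get("files", []):
            grouped[category].append(
                (location["path"], location["start_line"], location["end_line"])
            )
    return dict(grouped)


if __name__ == "__main__":
    summary = {
        "unique_license_detection_issues": [
            {
                "unique_identifier": 1,
                "files": [
                    {"path": "1921-socat-2.0.0-error.h", "start_line": 3, "end_line": 3}
                ],
                "license_detection_issue": {
                    "issue_category": "imperfect-match-coverage",
                },
            }
        ]
    }
    print(files_by_issue_category(summary))
```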


Basic Statistics
@@ -395,7 +412,7 @@ BERT model versions used.

{
"header": {
"tool_name": "scancode-results-analyzer",
"tool_name": "scancode-analyzer",
"version": 0.1,
"cases_version": 0.1,
"ml_models": [
Expand Down Expand Up @@ -434,7 +451,7 @@ BERT model versions used.
Related Issues
--------------

- `nexB/scancode-results-analyzer#22 <https://github.com/nexB/scancode-results-analyzer/issues/22>`_
- `nexB/scancode-results-analyzer#20 <https://github.com/nexB/scancode-results-analyzer/issues/20>`_
- `nexB/scancode-results-analyzer#21 <https://github.com/nexB/scancode-results-analyzer/issues/21>`_
- `nexB/scancode-analyzer#22 <https://github.com/nexB/scancode-analyzer/issues/22>`_
- `nexB/scancode-analyzer#20 <https://github.com/nexB/scancode-analyzer/issues/20>`_
- `nexB/scancode-analyzer#21 <https://github.com/nexB/scancode-analyzer/issues/21>`_

4 changes: 2 additions & 2 deletions docs/source/conf.py
@@ -17,8 +17,8 @@

# -- Project information -----------------------------------------------------

project = 'scancode-results-analyzer'
copyright = '2020, nexb'
project = 'scancode-analyzer'
copyright = '2021, nexb'
author = 'nexb'

# -- General configuration ---------------------------------------------------
@@ -134,6 +134,10 @@ All Issue Types
- ``reference-false-positive``
- A piece of code/text is incorrectly detected as a license.

* - ``intro``
- ``intro-unknown-match``
- A piece of common introduction to a license text/notice/reference is detected.

.. _case_lic_text:

License Texts
@@ -44,7 +44,7 @@ Why we need to divide matches in a file into file-regions:

2. If there are multiple matches in a region, they need to be analyzed as a whole, as even if most
matches have perfect ``score`` and ``match_coverage``, only one of them with an imperfect
`match_coverage`` would mean there is a issue with that whole file-region. For example one
``match_coverage`` would mean there is an issue with that whole file-region. For example, one
license notice can be matched to a notice rule with imperfect scores, and several small
license reference rules.
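
The reasoning in point 2 can be sketched as a single check over all matches in a region. This is an illustration only: the function name ``region_has_issue`` is hypothetical, each match is assumed to be a dictionary with a ``match_coverage`` entry, and a perfect coverage is assumed to be 100 (a percentage).

```python
PERFECT_COVERAGE = 100  # assumption: coverage is a percentage from 0 to 100


def region_has_issue(matches):
    """Return True if any match in the file-region has imperfect match_coverage.

    One imperfect match is enough to flag the whole region, even when every
    other match in it is perfect.
    """
    return any(match["match_coverage"] < PERFECT_COVERAGE for match in matches)


if __name__ == "__main__":
    region = [
        {"match_coverage": 62.5},   # a notice rule matched imperfectly
        {"match_coverage": 100.0},  # small reference rules matched perfectly
        {"match_coverage": 100.0},
    ]
    print(region_has_issue(region))
```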

12 changes: 6 additions & 6 deletions docs/source/index.rst
@@ -1,18 +1,18 @@
.. scancode-results-analyzer documentation master file, created by
.. scancode-analyzer documentation master file, created by
sphinx-quickstart on Fri Oct 30 21:27:08 2020.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.

Welcome to `scancode-results-analyzer` Documentation!
=====================================================
Welcome to `scancode-analyzer` Documentation!
=============================================


.. include:: ../../README.rst
:start-after: what-is-scancode-results-analyzer
:start-after: what-is-scancode-analyzer
:end-before: from-github-links

Getting Started with `scancode-results-analyzer`
------------------------------------------------
Getting Started with `scancode-analyzer`
----------------------------------------

.. toctree::
:maxdepth: 3
7 changes: 7 additions & 0 deletions scancode-analyzer.ABOUT
@@ -0,0 +1,7 @@
about_resource: .
name: scancode-analyzer
license_expression: apache-2.0
copyright: Copyright (c) nexB Inc. and others.
homepage_url: https://github.com/nexB/scancode-analyzer
vcs_url: git+https://github.com/nexB/scancode-analyzer
bug_tracking_url: https://github.com/nexB/scancode-analyzer/issues