evaluating scancode #1400

Open
valeriocos opened this issue Feb 27, 2019 · 21 comments
@valeriocos

Hi @pombredanne,

I've embedded scancode (a really nice tool) in Graal and now I'm evaluating scancode against nomos (another popular tool for license analysis) wrt precision and performance.

In a nutshell, the evaluation iterates over the commits of a set of git repositories. For each commit, graal performs a checkout and launches scancode/nomos on each file present in the commit; the results are then persisted on disk.
While nomos is pretty fast (it processed 5 repos of around 3000 commits each in 2 hours), scancode is still processing the first repo. I'm wondering whether I'm missing some parameters (or whether you have some suggestions) to make the analysis faster. Currently I'm using release 3.0.0 and I launch it with the following params:
https://github.com/chaoss/grimoirelab-graal/blob/master/graal/backends/core/analyzers/scancode.py#L58

Thank you

@pombredanne
Member

@valeriocos Hi! it was nice to meet at FOSDEM and thank you for the report.
You may want to use 3.0.2, but that is a minor point. You would likely want to run on multiple processes with --processes X, where X is the number of parallel processes. There could be some other useful flags too.
Can you tell me some examples of what you scan too? A whole checkout for each commit? or something else? And is this something you run on one file at a time?
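
As a rough illustration of the suggestion above, a caller could scan a whole checkout in one invocation with several worker processes instead of launching the tool once per file. The sketch below only builds the argument list. The flag spellings (`--license`, `--processes`, `--json-pp`) are the ones that appear in this thread; the function name, the paths, and the one-worker-per-CPU default are assumptions made for the example.

```python
import os


def build_scan_command(scancode_path, target, output_json, processes=None):
    """Return an argv list for one directory-level license scan."""
    if processes is None:
        # Assumption: one worker per CPU is a reasonable starting point.
        processes = os.cpu_count() or 1
    if processes < 1:
        raise ValueError("processes must be >= 1")
    return [
        scancode_path,
        "--license",
        "--processes", str(processes),
        "--json-pp", output_json,
        target,
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(" ".join(build_scan_command("./scancode", "vorbis", "sc.json", 8)))
```

The resulting list can be handed to `subprocess.run` as-is, which avoids shell quoting issues with unusual file names.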

@valeriocos
Author

Thank you for answering @pombredanne, and indeed it was nice to meet you too at FOSDEM.
I'm going to try with the version 3.0.2 and the --processes param and let you know.

Scancode is currently being executed on https://github.com/xiph/vorbis, graal performs a whole checkout of the commit, and then scancode is launched for each file in the commit (one file at a time)

If you want to execute it on your machine, you can install graal and then execute from the command line:

graal colic 
https://github.com/xiph/vorbis (the URL of the repo)
--git-path /home/test-scancode-vorbis (where the repo is going to be downloaded)
--exec-path /home/scancode-toolkit-3.0.0/scancode (the exec path of scancode)
--category code_license_scancode (the category to activate scancode)

@pombredanne
Member

@valeriocos there are two issues:

  1. there are some pathological files in vorbis (that's a bug tracked in #1404, "It takes too long to scan Vorbis sources")
  2. running a scan file-by-file is the worst case scenario for ScanCode.

Now if I run the quick comparison where I run not file-by-file but a directory at a time, and both using 8 processes:

$ time ~/w421/scancode-toolkit-master2/scancode --license vorbis-9eadeccdc4247127d91ac70555074239f5ce3529 -n8 --json-pp sc.json
Setup plugins...
Collect file inventory...
Scan files for: licenses with 8 process(es)...
[####################] 413                                          
Scanning done.
Summary:        licenses with 8 process(es)
Errors count:   0
Scan Speed:     13.29 files/sec. 
Initial counts: 453 resource(s): 413 file(s) and 40 directorie(s) 
Final counts:   453 resource(s): 413 file(s) and 40 directorie(s) 
Timings:
  scan_start: 2019-02-28T140649.087502
  scan_end:   2019-02-28T140722.627997
  setup_scan:licenses: 2.33s
  setup: 2.33s
  scan: 31.07s
  total: 33.62s
Removing temporary files...done.

real	0m34.809s
user	2m41.974s
sys	0m2.723s

and :

$ time ./nomossa -d ~/tmp/bnch/vorbis-9eadeccdc4247127d91ac70555074239f5ce3529 -n8  > out.txt

real	0m3.184s
user	0m20.022s
sys	0m0.076s

So we have 34.809s vs. 3.184s which is about 11 times slower in this case (which is still too slow and #1400 would help). In practice because of bugs nomossa should only be running on a single thread to avoid munging the output. In this case the elapsed time is 6.1s which means this is still about 5 times faster.
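
For reference, the ratios quoted above can be rechecked from the `real` timings. This is plain arithmetic on the numbers in this comment; the variable and function names are made up.

```python
def slowdown(slow_seconds, fast_seconds):
    """How many times longer the slow run took."""
    return slow_seconds / fast_seconds


scancode_real = 34.809   # ScanCode, 8 processes
nomos_real_8 = 3.184     # nomossa, 8 threads
nomos_real_1 = 6.1       # nomossa, single thread (figure quoted above)

print(round(slowdown(scancode_real, nomos_real_8), 1))  # 10.9
print(round(slowdown(scancode_real, nomos_real_1), 1))  # 5.7
```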

If you were to run file by file, it would take about 18s vs. well ... ~39 minutes?
There is a fixed startup time of about 2s to load the license index, plus some other overhead: everything in ScanCode is designed for batch processing rather than one-at-a-time scans.
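
A back-of-the-envelope sketch of why the fixed startup cost dominates in file-by-file mode: launching once per file pays the startup on every file instead of once. The inputs are the 413 files and the 2.33s `setup` time from the log above. This is only a lower bound on the extra cost and does not try to reproduce the ~39 minutes estimate.

```python
def repeated_startup_overhead(n_files, startup_seconds):
    """Extra seconds spent on startup when launching once per file
    instead of once for the whole batch."""
    return (n_files - 1) * startup_seconds


overhead = repeated_startup_overhead(413, 2.33)
print(round(overhead / 60, 1))  # 16.0 (minutes of pure startup overhead)
```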

Now if you run on Python 2, you could directly invoke the license detection as a function call on one file at a time. There, the overhead of the index load would happen only once. You would also bypass any JSON serialization/deserialization.

Of course, the bug I mentioned in #1404 is also making this more problematic, but I happen to have a work-in-progress fix in a local branch.

The other thing to consider is the accuracy of detection (it does not help to be fast if this is to be incorrect).
A quick comparison shows a few false positive GPL and several imprecise detections in nomos. I will post a comparison table tomorrow

@pombredanne
Member

pombredanne commented Feb 28, 2019

then scancode is launched for each file in the commit (one file at a time)

the short answer is that this is the worst case for ScanCode (one file at a time)

@pombredanne
Member

@valeriocos are you dealing with only the files changed in a commit, or are all the files rescanned at every commit? I ask because there is a new feature being worked on in #1399 to provide multiple paths as args to ScanCode, based on a report by @nicobucher, which seems quite related.

@valeriocos
Author

valeriocos commented Mar 1, 2019

thank you @pombredanne for your detailed explanation. I'm dealing with only the files changed in a commit. I'll have a look at #1399 and try to use it.

Now if you run on Python 2, you could directly invoke the license detection as a function call on one file at a time. There, the overhead of the index load would happen only once. You would also bypass any JSON serialization/deserialization.

Unfortunately I'm using Python 3, thus scancode is executed by command line.

The other thing to consider is the accuracy of detection (it does not help to be fast if this is to be incorrect). A quick comparison shows a few false positive GPL and several imprecise detections in nomos. I will post a comparison table tomorrow

I attach the results for nomos and scancode obtained from the analysis of the vorbis repo. In a nutshell, each line in the files is a JSON representing the analysis for a given commit. In the attribute data.analysis you will find the output of scancode/nomos.
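
For anyone reading those files, a minimal sketch of a reader follows. It assumes one JSON document per line with the tool output under `data.analysis`, as described above. Placing `commit` under `data` as well is an assumption, and the function name is made up.

```python
import io
import json


def iter_analyses(lines):
    """Yield (commit, analysis) for each non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        data = json.loads(line).get("data", {})
        # Assumption: the commit hash sits next to "analysis" under "data".
        yield data.get("commit"), data.get("analysis")


if __name__ == "__main__":
    # Tiny in-memory sample standing in for one of the attached files.
    sample = io.StringIO('{"data": {"commit": "abc123", "analysis": []}}\n')
    for commit, analysis in iter_analyses(sample):
        print(commit, analysis)
```

The same function works on a real file object, e.g. `iter_analyses(open(path))`, since it only needs an iterable of lines.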

vorbis-analysis-nomos-vs-scancode.zip

(other repos are under analysis, once they are done I can share them with you if you want)

@pombredanne
Member

ok... so there is something that @armijnhemel was asking me about a while back which would be a way to have a pre-fork daemon of sorts such that there is always a pre-loaded process ready to scan.

@armijnhemel how would you do this?

Alternatively, scanning a list of paths may amortize the startup costs too.
And I have some fixes for the #1404 issues that cut the scan time in half in some special cases (code files with mostly numeric data).

@armijnhemel
Contributor

I would agree with what @pombredanne says: running scancode file by file is the worst case scenario. What I could imagine is something similar to clamd where you can send a path and a persistent process scans that path.
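
A very small sketch of that idea, using only the Python 3 standard library: a long-lived server loads its (toy) index once, and clients send it one path per connection. The line protocol, the function names, and the substring "detection" are all invented for illustration; this is neither ScanCode's nor clamd's actual interface.

```python
import json
import socket
import socketserver
import threading


def load_index():
    """Toy index (phrase -> license key); stands in for the slow startup."""
    return {"gnu public license": "gpl-2.0", "technische universit": "tu-berlin"}


def scan_file(index, path):
    """Toy scan: report the keys whose phrase occurs in the file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        text = handle.read().lower()
    return sorted(key for phrase, key in index.items() if phrase in text)


class ScanHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Made-up protocol: one newline-terminated path per connection,
        # answered with a single JSON line.
        path = self.rfile.readline().decode("utf-8").strip()
        try:
            reply = {"path": path, "licenses": scan_file(self.server.index, path)}
        except OSError as exc:
            reply = {"path": path, "error": str(exc)}
        self.wfile.write((json.dumps(reply) + "\n").encode("utf-8"))


def start_server(host="127.0.0.1", port=0):
    """Start the daemon in a background thread; port 0 picks a free port."""
    server = socketserver.ThreadingTCPServer((host, port), ScanHandler)
    server.daemon_threads = True
    server.index = load_index()  # paid once, shared by all requests
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def request_scan(address, path):
    """Client side: send one path, read one JSON reply."""
    with socket.create_connection(address) as conn:
        conn.sendall((path + "\n").encode("utf-8"))
        with conn.makefile("r", encoding="utf-8") as reader:
            return json.loads(reader.readline())
```

A caller would keep the value returned by `start_server()` around, pass `server.server_address` to `request_scan`, and call `server.shutdown()` when done.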

@jgbarah

jgbarah commented Mar 3, 2019

Can scancode analyze lists of files (in a single invocation, I mean). Maybe that could speed up the thing, since each commit may touch several files, and in that case, running would be sort of linear with the number of commits instead of number of files touched for all commits. (I think this is what you mention above as "list of paths").

On a related note, do you have plans to support Python3 in the near future? That could make things easier too...

@pombredanne
Member

pombredanne commented Mar 4, 2019

@jgbarah Hey!

Can scancode analyze lists of files (in a single invocation, I mean). Maybe that could speed up the thing, since each commit may touch several files, and in that case, running would be sort of linear with the number of commits instead of number of files touched for all commits. (I think this is what you mention above as "list of paths").

Not yet but I started a branch that can do that now at #1399 and this supports doing things such as git diff --name-only master | xargs scancode -i --json-pp -
and passing multiple paths in one call... it should soon land in develop
See also #875 and #1397 ... this last one by @nicobucher is kinda timely with your evaluation.
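
For callers driving this from Python rather than a shell, the pipeline above could look roughly like the sketch below. It mirrors the `-i --json-pp -` invocation quoted in this comment and assumes the multiple-path support from the #1399 branch is available; the function names are made up.

```python
import subprocess


def changed_paths(repo_dir, base="master"):
    """List paths that differ from `base`, via `git diff --name-only`."""
    out = subprocess.check_output(
        ["git", "diff", "--name-only", base],
        cwd=repo_dir,
        universal_newlines=True,
    )
    return [line for line in out.splitlines() if line]


def build_batch_command(scancode_path, paths):
    """One argv covering all paths: info scan, pretty JSON on stdout."""
    if not paths:
        raise ValueError("nothing to scan")
    return [scancode_path, "-i", "--json-pp", "-"] + list(paths)


if __name__ == "__main__":
    # Hypothetical file names; in practice they would come from changed_paths().
    print(build_batch_command("scancode", ["lib/lpc.c", "lib/mdct.c"]))
```

One invocation per commit then pays the startup cost once for all the files that commit touched.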

On a related note, do you have plans to support Python3 in the near future? That could make things easier too...

Yes, see #295 ... this is going to be a GSoC project

I also have a simple remoting solution using execnet until then and this works nicely: I will push it later today for your enjoyment.
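
To give a feel for what such a remoting setup buys, here is a sketch that deliberately does not use execnet: it keeps one child interpreter alive and talks to it over pipes with JSON lines. The child pays its startup once and could in principle be a different Python version than the caller. The worker source, the toy "index", and the class name are all invented for this example.

```python
import json
import subprocess
import sys

# Source run inside the long-lived child. The "index" is a stand-in for the
# expensive state a real scanner would load once at startup.
WORKER_SOURCE = r'''
import json, sys
index = {"gnu public license": "gpl-2.0"}
for line in iter(sys.stdin.readline, ""):
    text = json.loads(line)["text"].lower()
    found = sorted(v for k, v in index.items() if k in text)
    sys.stdout.write(json.dumps({"licenses": found}) + "\n")
    sys.stdout.flush()
'''


class Worker:
    """Keeps one child interpreter alive and sends it one request per call."""

    def __init__(self, interpreter=sys.executable):
        # `interpreter` could point at another Python installation.
        self.proc = subprocess.Popen(
            [interpreter, "-c", WORKER_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            universal_newlines=True,
            bufsize=1,
        )

    def detect(self, text):
        # json.dumps escapes newlines, so each request stays on one line.
        self.proc.stdin.write(json.dumps({"text": text}) + "\n")
        self.proc.stdin.flush()
        return json.loads(self.proc.stdout.readline())["licenses"]

    def close(self):
        # Closing stdin ends the child's read loop, so it exits cleanly.
        self.proc.stdin.close()
        self.proc.wait()
        self.proc.stdout.close()
```

execnet wraps the same idea (a remote interpreter plus a channel for messages) behind a more robust implementation.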

@pombredanne
Member

@valeriocos re

I attach the results for nomos and scancode obtained from the analysis of the vorbis repo. In a nutshell, each line in the files is a JSON representing the analysis for a given commit. In the attribute data.analysis you will find the output of scancode/nomos.

Thank you! I will post my eval later.

pombredanne added a commit that referenced this issue Mar 5, 2019
See ticket #1400 for more details

This is an example of how to call Scancode as a function from Python2
or Python3. The benefits are that when the server process has loaded the
license index, and imported its modules there is no per-call
import/loading penalty anymore.

This is using execnet which is the multiprocessing library used by
py.test and therefore a rather stable and high quality engine.

Signed-off-by: Philippe Ombredanne <[email protected]>
@pombredanne
Member

@valeriocos do you mind trying this 8afa686 in the 1397-multiple-inputs branch? There is a README

@valeriocos
Author

thank you @pombredanne for this, I'll use it and report.

@pombredanne
Member

@valeriocos sure, note that the code has now been merged (in the develop branch)

@valeriocos
Author

Thank you @pombredanne and sorry for the late reply. I have prepared a branch to test scancli.py (https://github.com/valeriocos/grimoirelab-graal/blob/test-scancli/graal/backends/core/analyzers/scancode.py#L48). I'll execute some tests and report on the performance.

@pombredanne
Member

@valeriocos ping. How are things working out for you?

@valeriocos
Author

Hi @pombredanne , sorry for the late reply.
I tried the new version of scancode and it's far faster than the previous one. For instance, for the following repo: https://github.com/xiph/vorbis

  • scancode 3.0.0 took 20 hours, 52 minutes and 39 seconds
  • scancli took 3 hours, 36 minutes and 38 seconds
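
Put as a ratio, using only the two durations above (the variable names are made up):

```python
from datetime import timedelta

before = timedelta(hours=20, minutes=52, seconds=39)  # scancode 3.0.0
after = timedelta(hours=3, minutes=36, seconds=38)    # scancli

speedup = before / after  # dividing two timedeltas gives a float
print(round(speedup, 2))  # 5.78
```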

I inspected some of the results and in some cases I see differences (I guess this is due to improvements in scancode itself). For instance, for the repo https://github.com/xiph/vorbis at commit 0695c7cbf5d766b7db3c664fa1bb82531c71fa38, I see that:

  • scancode 3.0.0 identified only one license, TU Berlin License 1.0 (see the excerpt of the data below)
    "commit": "0695c7cbf5d766b7db3c664fa1bb82531c71fa38",
    "Author": "Monty <[email protected]>",
    "AuthorDate": "Wed Mar 29 03:49:29 2000 +0000",
    "Commit": "Monty <[email protected]>",
    "CommitDate": "Wed Mar 29 03:49:29 2000 +0000",
    "message": "Don't want to lose anything while I'm integrating (also don;t want to\ndisturb mainline till I'm done)\n\nMonty\n\nsvn path=\/branches\/unlabeled-1.18.2\/vorbis\/; revision=286",
    "analysis": [
      {
        "licenses": [
          {
            "key": "tu-berlin",
            "score": 98.9,
            "name": "Technische Universitaet Berlin Attribution License 1.0",
            "short_name": "TU Berlin License 1.0",
            "category": "Permissive",
            "is_exception": false,
            "owner": "Technische Universitaet Berlin",
            "homepage_url": "https:\/\/github.com\/swh\/ladspa\/blob\/7bf6f3799fdba70fda297c2d8fd9f526803d9680\/gsm\/COPYRIGHT",
            "text_url": "",
            "reference_url": "https:\/\/enterprise.dejacode.com\/urn\/urn:dje:license:tu-berlin",
            "spdx_license_key": "TU-Berlin-1.0",
            "spdx_url": "https:\/\/spdx.org\/licenses\/TU-Berlin-1.0",
            "start_line": 30,
            "end_line": 39,
            "matched_rule": {
              "identifier": "tu-berlin.LICENSE",
              "license_expression": "tu-berlin",
              "licenses": [
                "tu-berlin"
              ],
              "is_license_text": true,
              "is_license_notice": false,
              "is_license_reference": false,
              "is_license_tag": false
            }
          }
        ],
        "file_path": "lib\/lpc.c"
      }
    ],
    "analyzer": "scancode"
  • scancli identified two licenses, GPL-2.0-only and TU Berlin License 1.0 (see the excerpt of the data below). A manual inspection indicates that this result is more precise than the one above.
    "commit": "0695c7cbf5d766b7db3c664fa1bb82531c71fa38",
    "Author": "Monty <[email protected]>",
    "AuthorDate": "Wed Mar 29 03:49:29 2000 +0000",
    "Commit": "Monty <[email protected]>",
    "CommitDate": "Wed Mar 29 03:49:29 2000 +0000",
    "message": "Don't want to lose anything while I'm integrating (also don;t want to\ndisturb mainline till I'm done)\n\nMonty\n\nsvn path=\/branches\/unlabeled-1.18.2\/vorbis\/; revision=286",
    "analysis": {
      "licenses": [
        {
          "path": "lpc.c",
          "type": "file",
          "name": "lpc.c",
          "base_name": "lpc",
          "extension": ".c",
          "size": 11008,
          "date": "2019-03-15",
          "sha1": "71398429be51a79438400d0317dd6c4ab03e97d3",
          "md5": "e206cfa46afe1ff773767b934378b14d",
          "mime_type": "text\/x-c",
          "file_type": "C source, ASCII text",
          "programming_language": "C++",
          "is_binary": false,
          "is_text": true,
          "is_archive": false,
          "is_media": false,
          "is_source": true,
          "is_script": false,
          "licenses": [
            {
              "key": "gpl-2.0",
              "score": 94.74,
              "name": "GNU General Public License 2.0",
              "short_name": "GPL 2.0",
              "category": "Copyleft",
              "is_exception": false,
              "owner": "Free Software Foundation (FSF)",
              "homepage_url": "http:\/\/www.gnu.org\/licenses\/gpl-2.0.html",
              "text_url": "http:\/\/www.gnu.org\/licenses\/gpl-2.0.txt",
              "reference_url": "https:\/\/enterprise.dejacode.com\/urn\/urn:dje:license:gpl-2.0",
              "spdx_license_key": "GPL-2.0-only",
              "spdx_url": "https:\/\/spdx.org\/licenses\/GPL-2.0-only",
              "start_line": 3,
              "end_line": 6,
              "matched_rule": {
                "identifier": "gpl-2.0_613.RULE",
                "license_expression": "gpl-2.0",
                "licenses": [
                  "gpl-2.0"
                ],
                "is_license_text": false,
                "is_license_notice": true,
                "is_license_reference": false,
                "is_license_tag": false,
                "matcher": "2-aho",
                "rule_length": 36,
                "matched_length": 36,
                "match_coverage": 100,
                "rule_relevance": 100
              },
              "matched_text": "THIS FILE IS PART OF THE [Ogg] [Vorbis] SOFTWARE CODEC SOURCE CODE.  *\n * USE, DISTRIBUTION AND REPRODUCTION OF THIS SOURCE IS GOVERNED BY *\n * THE GNU PUBLIC LICENSE 2, WHICH IS INCLUDED WITH THIS SOURCE.    *\n * PLEASE READ THESE TERMS DISTRIBUTING.                            *"
            },
            {
              "key": "tu-berlin",
              "score": 98.9,
              "name": "Technische Universitaet Berlin Attribution License 1.0",
              "short_name": "TU Berlin License 1.0",
              "category": "Permissive",
              "is_exception": false,
              "owner": "Technische Universitaet Berlin",
              "homepage_url": "https:\/\/github.com\/swh\/ladspa\/blob\/7bf6f3799fdba70fda297c2d8fd9f526803d9680\/gsm\/COPYRIGHT",
              "text_url": "",
              "reference_url": "https:\/\/enterprise.dejacode.com\/urn\/urn:dje:license:tu-berlin",
              "spdx_license_key": "TU-Berlin-1.0",
              "spdx_url": "https:\/\/spdx.org\/licenses\/TU-Berlin-1.0",
              "start_line": 30,
              "end_line": 39,
              "matched_rule": {
                "identifier": "tu-berlin.LICENSE",
                "license_expression": "tu-berlin",
                "licenses": [
                  "tu-berlin"
                ],
                "is_license_text": true,
                "is_license_notice": false,
                "is_license_reference": false,
                "is_license_tag": false,
                "matcher": "3-seq",
                "rule_length": 91,
                "matched_length": 90,
                "match_coverage": 98.9,
                "rule_relevance": 100
              },
              "matched_text": "Any use of this software is permitted provided that this notice is not\nremoved and that neither the authors nor the Technische [Universita]\"[t]\nBerlin are deemed to have made any representations as to the\nsuitability of this software for any purpose nor are held responsible\nfor any defects of this software. THERE IS ABSOLUTELY NO WARRANTY FOR\nTHIS SOFTWARE.\n\nAs a matter of courtesy, the authors request to be informed about uses\nthis software has found, about bugs in this software, and about any\nimprovements that may be of general interest."
            }
          ],
          "license_expressions": [
            "gpl-2.0",
            "tu-berlin"
          ],
          "holders": [
            {
              "value": "Monty and The XIPHOPHORUS Company",
              "start_line": 8,
              "end_line": 10
            },
            {
              "value": "Preserved by Jutta Degener and Carsten Bormann, Technische",
              "start_line": 25,
              "end_line": 28
            }
          ],
          "copyrights": [
            {
              "value": "(c) COPYRIGHT 1994-2000 by Monty <[email protected]> and The XIPHOPHORUS Company http:\/\/www.xiph.org",
              "start_line": 8,
              "end_line": 10
            },
            {
              "value": "Preserved Copyright 1992, 1993, 1994 by Jutta Degener and Carsten Bormann, Technische",
              "start_line": 25,
              "end_line": 28
            }
          ],
          "authors": [
            {
              "value": "Jutta Degener and Carsten Bormann",
              "start_line": 20,
              "end_line": 23
            },
            {
              "value": "J. Durbin",
              "start_line": 57,
              "end_line": 57
            }
          ],
          "files_count": 0,
          "dirs_count": 0,
          "size_count": 0,
          "scan_errors": [
            
          ],
          "file_path": "lib\/lpc.c"
        }
      ]
    },
    "analyzer": "scancode"

If you are interested, I can share with you the complete analysis obtained via Graal and scancli for https://github.com/xiph/vorbis.
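
Since the two excerpts above use different layouts for `analysis` (a list of per-file records in the first, a dict with a `licenses` list of per-file records in the second), comparing them needs a small normalization step. A minimal sketch, with a made-up function name and trimmed-down sample data:

```python
def license_keys(analysis):
    """Collect license keys from either analysis layout shown above."""
    # First layout: a list of per-file records.
    # Second layout: a dict whose "licenses" entry lists per-file records.
    if isinstance(analysis, list):
        records = analysis
    else:
        records = analysis.get("licenses", [])
    keys = set()
    for record in records:
        for match in record.get("licenses", []):
            keys.add(match["key"])
    return keys


# Trimmed-down versions of the two excerpts.
old = [{"licenses": [{"key": "tu-berlin"}], "file_path": "lib/lpc.c"}]
new = {"licenses": [{"path": "lpc.c",
                     "licenses": [{"key": "gpl-2.0"}, {"key": "tu-berlin"}]}]}

print(sorted(license_keys(new) - license_keys(old)))  # ['gpl-2.0']
```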

@valeriocos
Author

@pombredanne should we close this issue, or is there something left to discuss?
If scancli is ready/tested/stable, I'll include it in Graal in the next days.

Thanks!

@pombredanne
Member

@valeriocos I would like to make sure everything is OK before closing this. It also needs some doc for sure!

@valeriocos
Author

Thank you for answering @pombredanne , I'll keep an eye on the issue.

@pombredanne
Member

@armijnhemel that scanserv is something you wanted too BTW ;)

4 participants