evaluating scancode #1400
@valeriocos Hi! It was nice to meet you at FOSDEM, and thank you for the report.
Thank you for answering @pombredanne, and indeed it was nice to meet you too at FOSDEM. Scancode is currently being executed on https://github.com/xiph/vorbis: graal performs a full checkout of the commit, and then scancode is launched for each file in the commit (one file at a time). If you want to execute it on your machine, you can install graal and then run it from the command line:
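The exact Graal command line is not captured in this thread, but the per-file pattern described above amounts to spawning one scancode process per file, roughly like the following minimal sketch (the file paths are hypothetical, and it assumes a scancode 3.x executable on the PATH):

```python
import json
import subprocess

def scan_file(path):
    """Run one scancode process for a single file and return its parsed JSON output."""
    # --license enables license detection; --json-pp - writes pretty-printed JSON to stdout.
    result = subprocess.run(
        ["scancode", "--license", "--json-pp", "-", path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)

# One scancode process (and one license-index load) per file: this is where the
# per-call startup cost discussed later in the thread comes from.
for path in ["COPYING", "lib/vorbisenc.c"]:  # hypothetical paths from a checkout
    print(path, scan_file(path)["files"][0].get("licenses", []))
```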
@valeriocos there are two issues:
Now, if I run a quick comparison scanning a directory at a time rather than file by file, with both tools using 8 processes:
and:
So we have 34.809s vs. 3.184s, which is about 11 times slower in this case (still too slow, and #1400 would help). In practice, because of bugs, nomossa should only be run on a single thread to avoid munging the output; in that case the elapsed time is 6.1s, which means nomossa is still about 5 times faster. If you were to run file by file, it would take about 18s vs. well ... ~39 minutes? Now, if you run on Python 2, you could directly invoke the license detection as a function call on one file at a time. There, the overhead of the index load would happen only once, and you would also bypass any JSON serialization/deserialization. Of course, the bug I mentioned in #1404 is also making this more problematic, but I have a work-in-progress fix in a local branch. The other thing to consider is the accuracy of detection (it does not help to be fast if the results are incorrect).
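A minimal sketch of that direct-call approach, assuming an interpreter where scancode-toolkit is importable (Python 2 for the release discussed here) and that `scancode.api.get_licenses` is the entry point; the file paths are hypothetical:

```python
from scancode import api  # assumption: scancode-toolkit is installed in this interpreter

# The first call loads the license index (the expensive part); subsequent calls
# reuse the cached index, so the per-file cost stays low.
for path in ["COPYING", "lib/vorbisenc.c"]:  # hypothetical paths
    info = api.get_licenses(path)
    print(path, info.get("licenses", []))
```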
The short answer is that this is the worst case for ScanCode (one file at a time).
@valeriocos are you dealing with only the files changed in a commit, or are all the files rescanned at every commit? There is a new feature being worked on to provide multiple paths as args to ScanCode in #1399, based on a report by @nicobucher, which seems quite related.
Thank you @pombredanne for your detailed explanation. I'm dealing with only the files changed in a commit. I'll have a look at #1399 and try to use it.
Unfortunately I'm using Python 3, so scancode is executed from the command line.
I attach the results for nomos and scancode obtained from the analysis of the vorbis repo. In a nutshell, each line in the files is a JSON document representing the analysis for a given commit; in the attribute data.analysis you will find the output of scancode/nomos. vorbis-analysis-nomos-vs-scancode.zip (Other repos are under analysis; once they are done I can share them with you if you want.)
Ok... so there is something that @armijnhemel was asking me about a while back, which would be a way to have a pre-fork daemon of sorts such that there is always a pre-loaded process ready to scan. @armijnhemel, how would you do this? Alternatively, scanning a list of paths may amortize the startup costs too.
I would agree with what @pombredanne says: running scancode file by file is the worst-case scenario. What I could imagine is something similar to clamd, where you can send a path and a persistent process scans that path.
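A minimal sketch of such a clamd-style setup, using Python 3's socketserver purely for illustration; the port, the one-path-per-line protocol, and the assumption that scancode-toolkit is importable in the server interpreter are all made up for the example:

```python
import json
import socketserver

from scancode import api  # assumption: importable here; the import pays the startup cost once


class ScanHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Read one path per line from the client and reply with one JSON result per line.
        for raw in self.rfile:
            path = raw.decode("utf-8").strip()
            if not path:
                break
            result = api.get_licenses(path)
            self.wfile.write((json.dumps(result) + "\n").encode("utf-8"))


if __name__ == "__main__":
    # Long-lived process: the license index stays loaded between requests.
    with socketserver.TCPServer(("127.0.0.1", 7777), ScanHandler) as server:
        server.serve_forever()
```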
Can scancode analyze lists of files in a single invocation, I mean? Maybe that could speed things up: since each commit may touch several files, the running time would then scale roughly with the number of commits instead of the number of files touched across all commits. (I think this is what you mention above as "list of paths".) On a related note, do you have plans to support Python 3 in the near future? That could make things easier too...
@jgbarah Hey!
Not yet, but I started a branch that can do that now at #1399, and this supports doing things such as
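The concrete example from this comment is not captured here; purely as an illustration, a multiple-inputs call from Graal's side might look like the sketch below (whether #1399 accepts exactly this form is an assumption, as are the flags, output file, and paths):

```python
import subprocess

# Hypothetical: if ScanCode accepts several input paths in one run (the feature
# discussed in #1399), a per-commit call could pass all changed files at once
# instead of spawning one process per file.
changed_files = ["lib/analysis.c", "lib/bitrate.c", "COPYING"]  # hypothetical
subprocess.run(
    ["scancode", "--license", "--json-pp", "commit-scan.json", *changed_files],
    check=True,
)
```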
Yes, see #295 ... this is going to be a GSoC project. Until then, I also have a simple remoting solution using execnet that works nicely; I will push it later today for your enjoyment.
@valeriocos re
Thank you! I will post my eval later.
See ticket #1400 for more details. This is an example of how to call ScanCode as a function from Python 2 or Python 3. The benefit is that once the server process has loaded the license index and imported its modules, there is no per-call import/loading penalty anymore. This uses execnet, which is the multiprocessing library used by py.test and therefore a rather stable and high-quality engine. Signed-off-by: Philippe Ombredanne <[email protected]>
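The scancli/scanserv code from that commit is not reproduced here; the following is only a minimal sketch of the same idea with execnet: a Python 3 caller keeps a warm Python 2 worker that has ScanCode imported, so the license index is loaded once per worker instead of once per file (the interpreter spec and file paths are assumptions):

```python
import json
import execnet

# Source executed in the remote (worker) interpreter. Importing scancode there
# means the license index is loaded once for the lifetime of the gateway.
REMOTE_SOURCE = """
import json
from scancode import api
while True:
    path = channel.receive()
    if path is None:
        break
    channel.send(json.dumps(api.get_licenses(path)))
"""

# Assumption: a 'python2' interpreter with scancode-toolkit installed is on the PATH.
gateway = execnet.makegateway("popen//python=python2")
channel = gateway.remote_exec(REMOTE_SOURCE)

for path in ["COPYING", "lib/vorbisenc.c"]:  # hypothetical paths
    channel.send(path)
    print(path, json.loads(channel.receive()))

channel.send(None)  # tell the worker loop to stop
gateway.exit()
```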
@valeriocos do you mind trying commit 8afa686 in the 1397-multiple-inputs branch? There is a README.
Thank you @pombredanne for this, I'll use it and report.
@valeriocos sure, note that the code has now been merged (in the develop branch).
Thank you @pombredanne and sorry for the late reply. I have prepared a branch to test scancli.py (https://github.com/valeriocos/grimoirelab-graal/blob/test-scancli/graal/backends/core/analyzers/scancode.py#L48). I'll execute some tests and report on the performance.
@valeriocos ping. How are things working out for you?
Hi @pombredanne, sorry for the late reply.
I inspected some of the results, and in some cases I see some differences (I guess this is due to improvements in scancode itself). For instance, for the repo https://github.com/xiph/vorbis and the commit
If you are interested, I can share with you the complete analysis obtained via Graal and
@pombredanne should we close this issue, or is there something left to discuss? Thanks!
@valeriocos I would like to make sure everything is OK before closing this. It also needs some docs for sure!
Thank you for answering @pombredanne, I'll keep an eye on the issue.
@armijnhemel that scanserv is something you wanted too BTW ;)
Hi @pombredanne,
I've embedded scancode (a really nice tool) in Graal, and now I'm evaluating scancode against nomos (another popular tool for license analysis) with respect to precision and performance.
In a nutshell, the evaluation consists of iterating over the commits of a set of git repositories; for each commit, graal performs a checkout and launches scancode/nomos on each file present in the commit; finally, the results are persisted on disk.
While nomos is pretty fast (it processed 5 repos of around 3000 commits each in 2 hours), scancode is still processing the first repo. I'm wondering if I'm missing some parameters (or if you have some suggestions) to make the analysis faster. Currently I'm using release 3.0.0 and I launch it with the following params:
https://github.com/chaoss/grimoirelab-graal/blob/master/graal/backends/core/analyzers/scancode.py#L58
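For context, the evaluation loop described above boils down to something like the following sketch, using plain git commands instead of Graal's internals; the repository path, output file name, and the per-file scancode call are assumptions, and only the files touched by each commit are scanned:

```python
import json
import os
import subprocess


def git(repo, *args):
    """Run a git command inside the repository and return its stdout."""
    return subprocess.run(["git", "-C", repo, *args],
                          capture_output=True, text=True, check=True).stdout


def scan_file(repo, path):
    """Per-file license scan; stand-in for the scancode/nomos analyzer linked above."""
    out = subprocess.run(["scancode", "--license", "--json-pp", "-", os.path.join(repo, path)],
                         capture_output=True, text=True, check=True).stdout
    return json.loads(out)


repo = "vorbis"  # hypothetical local clone of https://github.com/xiph/vorbis
with open("vorbis-analysis.jsonl", "w") as results:
    for sha in git(repo, "rev-list", "--reverse", "HEAD").splitlines():
        git(repo, "checkout", "--quiet", sha)
        touched = git(repo, "diff-tree", "--no-commit-id", "--name-only", "-r", sha).splitlines()
        analysis = {p: scan_file(repo, p)
                    for p in touched if os.path.exists(os.path.join(repo, p))}
        # One JSON document per line, one line per commit (the data.analysis
        # attribute mentioned earlier in the thread).
        results.write(json.dumps({"commit": sha, "data": {"analysis": analysis}}) + "\n")
```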
Thank you