Tracing "Start Line" of ScanCode report back to the Binary file. #2874
Comments
@kiranravindran90 Thank you ++ for this. |
In particular 72faf0c did bring up some improvements |
Dear Philippe, would it be possible to include the "Binary address" in the original file in the report, along with the "Start-line"? |
I created a quick-and-dirty Python script which generates a binary file with an embedded MIT license text:
#!/usr/bin/env python
# Read the license text that will be embedded in the binary.
with open("MIT.txt") as text_file:
    content = text_file.read()
# Surround the license text with 200 filler lines on each side.
space = "abc\n" * 200
content = space + content + space
# Prepend 10 zero bytes, then append the UTF-8 encoded text.
binary_content = bytes(10) + bytes(content, "utf-8")
with open("mit.bin", "w+b") as binary_file:
    binary_file.write(binary_content)
After scanning it with ScanCode, one will get an output similar to:
NOTE: The output is shortened for readability purposes. ScanCode recognizes the license and copyrights well. However, how are the
Therefore, my expectation would have been the Furthermore, it would be better to calculate the position of text snippets in a binary file from the beginning of the file in bytes (hexadecimal) and as range. The SPDX format does exactly that: https://spdx.github.io/spdx-spec/snippet-information/#93-snippet-byte-range-field. |
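To make the byte range idea concrete, here is a minimal sketch of how the position of an already known text could be located in a binary, counted in bytes from the beginning of the file. This is not how ScanCode works; the function name `find_byte_ranges` and the placeholder license text are assumptions made for illustration, and the blob is built in memory the same way as mit.bin above. Offsets here are 0-based with an exclusive end, so they would need converting to whatever convention the target report format uses.

```python
def find_byte_ranges(data: bytes, text: str, encoding: str = "utf-8"):
    """Return (start, end) byte offsets of every occurrence of text in data.

    Offsets are 0-based and the end offset is exclusive.
    """
    needle = text.encode(encoding)
    ranges = []
    pos = data.find(needle)
    while pos != -1:
        ranges.append((pos, pos + len(needle)))
        pos = data.find(needle, pos + 1)
    return ranges


# Placeholder text standing in for the content of MIT.txt (assumption).
license_text = "Permission is hereby granted, free of charge"
space = "abc\n" * 200
blob = bytes(10) + (space + license_text + space).encode("utf-8")

ranges = find_byte_ranges(blob, license_text)
for start, end in ranges:
    # Print the range in hexadecimal, counted from the start of the file.
    print(f"0x{start:x}:0x{end:x}")
```

The search only works when the matched text is stored in the binary with the same encoding and without interruptions, which is one reason a general solution is harder than this sketch suggests.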
@Jeeppler Thanks for this report!
The start and end lines for binaries are kind of special. Now, practically:
I am not sure if this would be better. It surely would be much more complicated in practice, as explained above. The gains in precision may not be worth the effort.
Agreed (disclosure: I am a co-founder of SPDX), but in earnest I am neither sure that the byte range in the spec is a good idea, nor do I know of any tool that uses it in practice. I would also question the value of such data: what would I do with this afterwards? Especially if I have the exact matched copyrights and texts. With all this said, I would be fine with having option 2 and/or option 3. Is this something you would be interested in contributing? (I can guide you alright) |
@pombredanne the problem I want to solve is the following: one receives large binary files (several MBs to several GBs) which have to be scanned for license compliance. The files are the result of compiling software from source code. The source code is not available, but one still needs to know which licenses were detected. If a detected license is likely a false positive, it is necessary to cross-check the detected license text. Without the location of the license text in the file, it is difficult or impossible to figure out whether the detection is a false positive or not. Furthermore, I would like to rely only on the SPDX format for the final scan report. As a result, option 1 is not a viable solution. Option 3 seems like a lot of work. Option 2 seems to be the best option from my perspective. While scanning for the text in the binary (binary-to-text conversion), ScanCode-Toolkit should remember the location where the text was found. I am truly surprised that there is no tool using the byte range from the SPDX spec. My hope was that ScanCode-Toolkit would be one tool using it for binary files. The byte range or offset (bytes from the start) makes a lot of sense to me for binary files. However, maybe I am missing something.
I am interested. However, I am still trying to figure out the details. I will come back to you with an answer. |
IMHO the location is possibly useful but not essential, and this is why ScanCode can extract and return the exact text that was matched with
You should reconsider this IMHO... That is going to be a problem otherwise, as SPDX is a reporting format for exchange and not a license detection diagnostic format. It is always going to be lossy when compared with what you get in the ScanCode JSON/YAML formats with respect to license detection details. That is not a fault of SPDX nor a fault of ScanCode: each format serves a different purpose. An analogy would be accounting, where you have a general ledger (SPDX) and specific sub-ledgers like accounts payable (ScanCode). You can never get the detail of the sub-ledgers in your general ledger, nor should you want it there.
This is something that is theoretically possible and seems appealing, but it is hard to implement and can require so many resources at run time that it is practically not done. This is what I would call a great bad idea! ScanCode does not operate directly on the binaries, but on sequences of words, themselves abstracted to numbers, and on the relative positions of these numbers. We only track the relative positions of these numbers, and this is very far from the original binary offsets... which could be single or multi bytes, among other considerations. Note that an alternative to the proposed solutions 1, 2 or 3 above would be to reparse the binaries in a post-match step to search for the offsets again. This could be done at least in a way that does not impact the rest of the license detection performance-wise, and could be activated only by a command line option. I can see how this could possibly be done by keeping track of offsets in binaries in a similar piece of code and a specialized In contrast, just dumping a binary's strings with offsets would be much simpler |
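As an illustration of the simpler route mentioned above, dumping the strings of a binary together with their offsets, here is a minimal sketch in plain Python. It is not ScanCode code; the function name `strings_with_offsets`, the `min_len` parameter and the sample bytes are assumptions made for the example. It only looks for runs of printable ASCII, similar in spirit to what `strings -t x` from binutils prints, so multi-byte encodings are not handled.

```python
import re


def strings_with_offsets(data: bytes, min_len: int = 4):
    """Yield (offset, text) for each run of printable ASCII bytes.

    offset is the 0-based byte position of the run in data; runs shorter
    than min_len bytes are skipped.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    for match in pattern.finditer(data):
        yield match.start(), match.group().decode("ascii")


# Hypothetical sample: two readable strings separated by non-printable bytes.
sample = b"\x00\x01MIT License\x00\xffab\x00Copyright (c) 2022\x00"

found = list(strings_with_offsets(sample))
for offset, text in found:
    print(f"0x{offset:x}\t{text}")
```

With such a dump, a matched license or copyright text reported by a scan could be looked up by hand to get its approximate position in the original file.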
I have scanned a few binaries using ScanCode Toolkit. I would like to know how to trace the "Start Line" in the report back to the binary. For example, the ScanCode Toolkit report says "Start Line 64343" (with the license text option), but I am unable to locate the corresponding position in the original binary file.
The reason I am posting is that the ScanCode report detected a GPL v1 license. The rule used to identify it was "GPL-Bare-word". But I was not able to find the word GPL in the actual raw binary (converted to ASCII). If I could trace the exact location, then identifying false positives would be easier.
Possible Labels
Start-Line, License text, License location in Binary
Select Category
License text "Start line"
Describe the Update
Able to trace the Start-line to its location in the Raw Binary.
How This Feature will help you/your organization
This will help us identify false positives.
Kiran Ravindran ([email protected]) on behalf of
Mercedes Benz Research & Development India Pvt. Ltd. (subsidiary of Mercedes-Benz AG)
Embassy Crest Phase I
Plot No. 5-P, EPIP 1st Phase
Whitefield, Krishnarajapuram
Bangalore - 560066, India
Office: +91-80-61492139
Mob.: +91-9986871380