Tracing "Start Line" of ScanCode report back to the Binary file. #2874

Open
1 task done
kiranravindran90 opened this issue Feb 24, 2022 · 7 comments

Comments

@kiranravindran90

I have scanned a few binaries using ScanCode Toolkit. How can I trace the "Start Line" from the report back to the binary? For example, the ScanCode Toolkit report says "Start Line 64343" (with the license text option), but I am unable to locate the corresponding bytes in the original binary file.
I am asking because the ScanCode report flagged a GPL v1 license. The rule it used to identify it was "GPL-Bare-word", but I was not able to find the word GPL in the actual raw binary (converted to ASCII). If I could trace the exact location, then assessing false positives would be easier.

Possible Labels

Start-Line, License text, License location in Binary

  • new feature

Select Category

License text "Start line"

  • Enhancement

Describe the Update

Able to trace the Start-line to its location in the Raw Binary.

How This Feature will help you/your organization

This will help us identify false positives.

Kiran Ravindran ([email protected]) on behalf of
Mercedes Benz Research & Development India Pvt. Ltd. (subsidiary of Mercedes-Benz AG)
Embassy Crest Phase I
Plot No. 5-P, EPIP 1st Phase
Whitefield, Krishnarajapuram
Bangalore - 560066, India

Office: +91-80-61492139
Mob.: +91-9986871380

@pombredanne
Member

@kiranravindran90 Thank you ++ for this.
Would you mind trying this out on the latest code from the HEAD of the develop branch?
There could be fewer issues in this code... Also see the related issue #2403, which has elements of a design to solve this.

@pombredanne
Member

In particular, 72faf0c brought in some improvements.

@kiranravindran90
Author

Dear Philippe,
Thanks for the initiatives on the false positives. As you already know, binaries can be very challenging, with so much gibberish around the keywords.

Would it be possible for the report to include, along with the "Start-line", the "Binary address" in the original file?
It would help me identify the location and manually verify some false positives. I am not sure whether it would be useful to the wider ScanCode user community. Please give it a thought :)


@Jeeppler

I created a small quick and dirty Python script which generates a binary file with an embedded MIT license text:

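#!/usr/bin/env python

# Padding lines placed before and after the license text.
space = "abc\n" * 200

# Read the license text to embed.
with open("MIT.txt", encoding="utf-8") as text_file:
    content = text_file.read()

content = space + content + space

# Prefix 10 NUL bytes ahead of the UTF-8 encoded text.
binary_content = bytes(10) + content.encode("utf-8")

# Write the result; the context manager closes the file.
with open("mit.bin", "wb") as binary_file:
    binary_file.write(binary_content)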

MIT.txt

After scanning it with ScanCode, one will get an output similar to:

{
    "path": "sourcecode/mit.bin",
    "type": "file",
    "name": "mit.bin",
    "base_name": "mit",
    "extension": ".bin",
    "size": 2689,
    …
    "sha256": "9de323951d20e3b242ae729f342ca89b7fa2d9a7fca922591e45f3c3b4910ba4",
    "mime_type": "application/octet-stream",
    "file_type": "data",
    "programming_language": null,
    "is_binary": true,
    …
    "licenses": [
        {
            "key": "mit",
            "score": 100.0,
            "name": "MIT License",
            "short_name": "MIT License",
            …
            "start_line": 1,
            "end_line": 1,
            "matched_rule": {
                "identifier": "mit_14.RULE",
                "license_expression": "mit",
                "licenses": [
                    "mit"
                ],
                …
            }
        },
        {
            "key": "mit",
            "score": 100.0,
            "name": "MIT License",
            "short_name": "MIT License",
            …
            "start_line": 3,
            "end_line": 17,
            "matched_rule": {
                "identifier": "mit.LICENSE",
                "license_expression": "mit",
                "licenses": [
                    "mit"
                ],
                …
            }
        }
    ],
    "license_expressions": [
        "mit",
        "mit"
    ],
    "percentage_of_license_text": 95.88,
    "copyrights": [
        {
            "value": "Copyright (c) 2019 Test Test Example Inc.",
            "start_line": 2,
            "end_line": 2
        }
    ],
    "holders": [
        {
            "value": "Test Test Example Inc.",
            "start_line": 2,
            "end_line": 2
        }
    ],
    …
}

NOTE: The output is shortened for readability purposes.

ScanCode recognizes the license and copyrights well. However, how are the start_line and end_line numbers calculated? They don't seem to make any sense. The binary file looks roughly like this:

       abc
… 198 times abc \newline
abc
MIT-License
abc
… 198 times abc \newline
abc

Therefore, I would have expected the start_line to be 200 or 201. The same applies to the copyright holder, which should start at approximately line 203. If the file were much larger, there would be no way to find the position of the license using the information from start_line and end_line.

Furthermore, it would be better to calculate the position of text snippets in a binary file from the beginning of the file in bytes (hexadecimal) and as a range. The SPDX format does exactly that: https://spdx.github.io/spdx-spec/snippet-information/#93-snippet-byte-range-field.

@pombredanne
Member

@Jeeppler Thanks for this report!

However, how are the start_line and end_line numbers calculated? They don't seem to make any sense. The binary file looks roughly like this:

The start and end lines for binaries are kind of special.
The text is first processed through a binary-to-strings converter, and the line numbers are the ones in this conversion. We do not track binary offsets in practice, as ScanCode rarely operates on bytes (or even strings) but rather on lists of text lines and/or lists of words (themselves commonly abstracted as an integer index in a word list).

Now, practically:
option 1. you can collect the actual license text with --license-text, which makes the line numbers much less important
option 2. we should have an option to dump the binary-to-text conversion results of a file to a text file for diagnostics
option 3. we could have a way to track the binary offsets (see https://github.com/nexB/scancode-toolkit/blob/7bc0782fdfda9da5dba0500446ff3e8d58623e99/src/textcode/strings.py#L93 ), but in practice this would be a fairly involved effort, as the offsets may need to be tracked everywhere, which may not be a reasonable cost/benefit tradeoff. Another option would be a separate post-processing step to reconstruct and recover the binary offsets.
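
To make options 2 and 3 more concrete, here is a minimal sketch of a strings dump that keeps the byte offset of each extracted string. It is not ScanCode's code: the function names are hypothetical, and it assumes that a "string" is a run of at least 4 printable ASCII characters, similar in spirit to the default of the Unix strings tool. Under that assumption, the three-character "abc" padding lines are dropped, which would be consistent with the low line numbers in the scan output above, though nothing in this thread confirms that this is the actual reason.

```python
import re

# Assumption: a "string" is a run of at least 4 printable ASCII characters.
# This is an illustration only, not ScanCode's binary-to-strings converter.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def strings_with_offsets(data):
    """Yield (line_number, byte_offset, text) for each printable run in data."""
    for line_number, match in enumerate(PRINTABLE_RUN.finditer(data), start=1):
        yield line_number, match.start(), match.group().decode("ascii")


def dump(data):
    """Render the extracted strings as 'line<TAB>0xoffset<TAB>text' rows."""
    return "\n".join(
        f"{line}\t0x{offset:08x}\t{text}"
        for line, offset, text in strings_with_offsets(data)
    )


# Hypothetical sample shaped like the test file in this thread:
# 10 NUL bytes, short "abc" padding lines, then two longer text lines.
sample = (
    bytes(10)
    + b"abc\n" * 200
    + b"MIT License\nCopyright (c) 2019 Test\n"
    + b"abc\n" * 200
)
rows = list(strings_with_offsets(sample))
print(dump(sample))
```

With a dump like this, a start_line reported against the converted text can be looked up in the first column and mapped back to a byte offset in the original file.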

Furthermore, it would be better to calculate the position of text snippets in a binary file from the beginning of the file in bytes (hexadecimal) and as a range.

I am not sure this would be better. It would surely be much more complicated in practice, as explained above. The gains in precision may not be worth the effort.

The SPDX format does exactly that: https://spdx.github.io/spdx-spec/snippet-information/#93-snippet-byte-range-field.

Agreed (disclosure: I am a co-founder of SPDX), but in earnest I am neither sure that the byte range in the spec is a good idea, nor do I know of any tool that uses it in practice. I would also question the value of such data: what would I do with this afterwards? Especially if I have the exact matched copyrights and texts.

With all this said, I would be fine with having option 2 and/or option 3. Is this something you would be interested in contributing? (I can guide you alright)

@Jeeppler

@pombredanne the problem I want to solve is the following: one receives large binary files (several MBs to several GBs) that have to be scanned for license compliance. The files are the result of compiling software from source code. The source code is not available, but one still needs to know which licenses were detected. If the detected license is likely a false positive, it is necessary to cross-check the detected license text. Without the location of the license text in the file, it is difficult or impossible to figure out whether the detection is a false positive or not.

Furthermore, I would like to rely only on the SPDX format for the final scan report. As a result, option 1 is not a viable solution. Option 3 seems like a lot of work. Option 2 seems to be the best option from my perspective. While scanning for the text in the binary (binary-to-text conversion), ScanCode-Toolkit should remember the location where the text was found.

I am truly surprised that there is no tool using the byte range from the SPDX spec. My hope was that ScanCode-Toolkit would be one tool using it for binary files. The byte range or offset (bytes from the start) makes a lot of sense to me for binary files. However, maybe I am missing something.

With all this said, I would be fine with having option 2 and/or option 3. Is this something you would be interested in contributing? (I can guide you alright)

I am interested. However, I am still trying to figure out the details. I will come back to you with an answer.

@pombredanne
Member

If the detected license is likely a false positive, it is necessary to cross-check the detected license text. Without the location of the license text in the file, it is difficult or impossible to figure out whether the detection is a false positive or not.

IMHO the location is possibly useful but not essential, and this is why ScanCode can extract and return the exact text that was matched with --license-text ... this is what is IMHO best suited for diagnostics.

Furthermore, I would like to rely only on the SPDX format for the final scan report. As a result, option 1 is not a viable solution.

You should reconsider this IMHO... Otherwise that is going to be a problem, as SPDX is a reporting format for exchange and not a license detection diagnostic format; it will always be lossy compared with what you get in the ScanCode JSON/YAML formats with respect to license detection details. That is neither a fault of SPDX nor a fault of ScanCode: each format serves a different purpose. An analogy would be accounting, where you have a general ledger (SPDX) and specific sub-ledgers like accounts payable (ScanCode). You can never get the detail of the sub-ledgers in your general ledger, nor should you want it there.

I am truly surprised that there is no tool using the byte range from the SPDX spec. My hope was that ScanCode-Toolkit would be one tool using it for binary files. The byte range or offset (bytes from the start) makes a lot of sense to me for binary files. However, maybe I am missing something.

This is something that is theoretically possible and seems appealing, but it is hard to implement and can require so many resources at run time that it is practically not done. This is what I would call a great bad idea!

ScanCode does not operate directly on the binaries, but on sequences of words, themselves abstracted to numbers, and on the relative positions of these numbers. We only track the relative positions of these numbers, and this is very far from the original binary offsets... which could involve single-byte or multi-byte characters, among other considerations.
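
As a rough illustration of why offsets get lost, here is a toy version of that abstraction. The tokenizer and the vocabulary handling are hypothetical simplifications and not ScanCode's actual implementation. Two inputs with very different bytes and spacing end up as the same list of integers, so nothing in that list can point back to a byte offset.

```python
import re

# Hypothetical tokenizer: lowercase runs of ASCII letters and digits.
WORD = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Split text into lowercase words, discarding everything else."""
    return WORD.findall(text.lower())


def to_ids(words, vocabulary):
    """Map each word to an integer id, growing the vocabulary as needed."""
    return [vocabulary.setdefault(word, len(vocabulary)) for word in words]


vocabulary = {}
ids_a = to_ids(tokenize("Permission is hereby granted"), vocabulary)
ids_b = to_ids(tokenize("\x00\x00PERMISSION   is\nhereby -- granted!"), vocabulary)

# Both inputs collapse to the same id sequence; only relative word
# positions (list indexes) remain, not byte positions.
print(ids_a, ids_b)
```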

Note that an alternative to the proposed solutions 1, 2 or 3 above would be to reparse the binaries in a post-match step to recover the offsets. This could be done in a way that does not impact the rest of the license detection performance-wise, and it could be activated only by a command line option.
See how this is done for regular text in https://github.com/nexB/scancode-toolkit/blob/7bc0782fdfda9da5dba0500446ff3e8d58623e99/src/licensedcode/match.py#L2657-L3087
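
A very rough sketch of such a post-match step, done outside ScanCode, could look like the following. It assumes that the matched text is already known (for example from --license-text) and that the whole binary fits in memory. The function name find_byte_range is hypothetical, the offsets are 0-based with an exclusive end, and no claim is made that they follow the SPDX byte range convention.

```python
import re


def find_byte_range(data, matched_text):
    """Return (start, end) byte offsets of matched_text in data, or None.

    The words must appear in order; any run of non-word bytes (NULs,
    newlines, punctuation) is accepted between two consecutive words.
    """
    words = re.findall(r"\w+", matched_text)
    if not words:
        return None
    pattern = rb"[^A-Za-z0-9_]+".join(
        re.escape(word.encode("utf-8")) for word in words
    )
    found = re.search(pattern, data, flags=re.IGNORECASE)
    return (found.start(), found.end()) if found else None


# Hypothetical binary: NUL bytes, short filler lines, then a text whose
# words are separated by unusual bytes, followed by more NUL bytes.
blob = (
    bytes(10)
    + b"abc\n" * 200
    + b"Permission is\x00hereby\n\ngranted"
    + b"\x00" * 5
)
print(find_byte_range(blob, "permission is hereby granted"))
```

Because this runs only after a match and only on request, it would not slow down regular detection, but it is a plain search and would need more care for very large files or for texts that occur several times.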

I can see how this could possibly be done by keeping track of offsets in binaries, with a similar piece of code and a specialized Token extracted from binaries. This would not be for the faint of heart though!

In contrast, just dumping the binary strings with their offsets would be much simpler.
