Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

License scanner #12

Merged
merged 24 commits into from
Jan 11, 2023
Merged

License scanner #12

merged 24 commits into from
Jan 11, 2023

Conversation

rmfranken
Copy link
Member

Code is supposed to:

  • scan a given repository path for files that look like license files
  • extract the license text and match it against a library of licenses (scancode.api)
  • store the license spdx URL's in a list which the function returns
  • Optionally, add triples to a graph that is to be defined relating the repository subject to the license objects using predicate schema.org/license

@rmfranken rmfranken requested a review from marftn December 20, 2022 14:20
@cmdoret
Copy link
Member

cmdoret commented Dec 20, 2022

Moving the comment from @martinfontanet here:

Actually, I think it would be better to retrieve the licenses from a method in the FilesMetadata class (or somewhere else), and then feed the licenses path/content to the LicenseMetadata class to analyze what's in them. Walking through the files again feels a bit redundant to me, and doing it from inside the LicenseMetadata is a bit unclear.

Question: The Repo class currently instantiate each source (web, license, git, files). Each source receives all the info only from Repo. If we get the files from FilesMetadata, it would introduce a dependency between sources (they have to be instantiated in the right and communicate.

I can see two options:

  1. FilesMetadata was poorly defined and should not actually be a source but some kind of helper class (?)
  2. Repo should be responsible for file management, clone stuff and provide files/content

What do you think would be the best solution for this ? (basically so that LicenseMetadata can get license content in the simplest way)

@marftn
Copy link
Member

marftn commented Jan 4, 2023

I can see two options:

1. `FilesMetadata` was poorly defined and should not actually be a source but some kind of helper class (?)

2. `Repo` should be responsible for file management, clone stuff and provide files/content

What do you think would be the best solution for this ? (basically so that LicenseMetadata can get license content in the simplest way)

It's difficult to choose what would be the best option. IMO, we should redefine FilesMetadata and what we would like to extract from it.

It would make sense to make Repo responsible for the files management and cloning, and use FilesMetadata as a helper class or set of functions to extract the metadata from the files (programming language, size, etc.). In that sense, the answer to the question would be "both options" 😁

Here is an example of what Repo would do in one of its methods:

if path is a url:
    self.path = git.clone(url) // path is now the folder where the repo was cloned

self.files = FilesMetadata(path)
self.lang = self.files.get_prog_lang()
license_path = self.files.get_license_path()
self.license = LicenseMetadata(license_path)

What do you think?

@cmdoret
Copy link
Member

cmdoret commented Jan 4, 2023

make Repo responsible for the files management and cloning, and use FilesMetadata as a helper

Makes sense to me. It means that we don't need to worry about finding the license file in license.py and can assume the input to be the correct license path(s).

@rmfranken
Copy link
Member Author

Ok. I will rewrite the function a bit so that it takes a valid license path as a parameter, and move the rest of the function to the FilesMetaData/Repo class (?). Then the search-for-a-licensefile part of the function can be re-used.

Also I will make a new .py file for "turning stuff into triples" so I can move the add_license_to_graph function there.

@cmdoret
Copy link
Member

cmdoret commented Jan 5, 2023

FilesMetaData/Repo class (?)

Yes, I think this could go into FilesMetadata

I will make a new .py file for "turning stuff into triples"

Great! Thanks, this could also be part of a separate PR

@rmfranken
Copy link
Member Author

Ok, I think this should work. I did ask chatGPT to help me with the class part, so please pay extra attention there. It looks like it works and makes sense to me, but I'm still struggling with wrapping my head around classes: I think that will still take some time.

I also have a suspicion that my auto-black reformat is slightly different than yours... I think in the future I don't want to commit the auto-reformat of the other files it also found in the repo (like the tests, git, and init .py's). Should I find a different way to reformat my code into Black so that it matches yours (do you have a link to whatever tool you use?)

@cmdoret
Copy link
Member

cmdoret commented Jan 6, 2023

auto-black reformat is slightly different than yours

No it works properly :) I think the files in question had never been reformatted 😄

* recursive directory search with `os.walk`
* more elaborate regex to avoid picking up source files
* rename method to an action (locate_licenses)
* add docstrings and doctest
* LicenseMetadata restored with docstring
* single get_licenses method adapted from find_licenses for multiple paths
@cmdoret
Copy link
Member

cmdoret commented Jan 6, 2023

For some reason, when running

from scancode.api import get_licenses
get_licenses('LICENSE')

I keep running into this error:

    349 def get_index(force=False, index_all_languages=False):
    350     """
    351     Return and eventually build and cache a LicenseIndex.
    352     """
--> 353     return get_cache(force=force, index_all_languages=index_all_languages).index

AttributeError: 'LicenseCache' object has no attribute 'index'

I suspect this is a version issue with scancode-toolkit, as they seem to have recently refactored the license part.
It does not seem to work with the version in pyproject.toml on my side.

@cmdoret
Copy link
Member

cmdoret commented Jan 6, 2023

OK this is actually a known bug in scancode-toolkit aboutcode-org/scancode-toolkit#3179. Whenever this is fixed, we will just have to bump the scancode-toolkit version in our dependencies, but in the meantime, running scancode --reindex-licenses once fixes the issue.

@cmdoret
Copy link
Member

cmdoret commented Jan 6, 2023

I introduced a few bugs when refactoring the code 👀 now the unit test from the docstring passes (i.e. license correctly parsed). If this is OK for @martinfontanet I think this could be merged.

Copy link
Member

@marftn marftn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay!
It looks good to me :)

@cmdoret cmdoret merged commit 9cda1ec into main Jan 11, 2023
@cmdoret cmdoret deleted the License_scanner branch January 16, 2023 10:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants