License scanner #12

rmfranken · 2022-12-20T14:20:17Z

Code is supposed to:

* scan a given repository path for files that look like license files
* extract the license text and match it against a library of licenses (scancode.api)
* store the license SPDX URLs in a list which the function returns
* optionally, add triples to a graph (still to be defined) relating the repository subject to the license objects using the predicate schema.org/license
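For illustration, the optional graph step could look roughly like the sketch below. It is not the code in this PR: a plain Python `set` stands in for the graph object (which is still to be defined), the `https://` form of the schema.org predicate is an assumption, and the repository URL is a made-up example.

```python
# Assumed full IRI for the schema.org/license predicate mentioned above.
SCHEMA_LICENSE = "https://schema.org/license"


def add_license_to_graph(graph, repo_url, license_urls):
    """Add one (repository, schema:license, license) triple per license URL.

    `graph` only needs an `add((s, p, o))` method; a plain set is used
    below as a stand-in for a real RDF graph object.
    """
    for url in license_urls:
        graph.add((repo_url, SCHEMA_LICENSE, url))
    return graph


graph = set()
add_license_to_graph(
    graph,
    "https://example.org/my/repo",  # hypothetical repository URL
    ["https://spdx.org/licenses/MIT.html"],
)
```

With a real RDF library the strings would presumably be wrapped in that library's IRI type before being added.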

cmdoret · 2022-12-20T15:36:33Z

Moving the comment from @martinfontanet here:

Actually, I think it would be better to retrieve the licenses from a method in the FilesMetadata class (or somewhere else), and then feed the licenses path/content to the LicenseMetadata class to analyze what's in them. Walking through the files again feels a bit redundant to me, and doing it from inside the LicenseMetadata is a bit unclear.

Question: The Repo class currently instantiates each source (web, license, git, files), and each source receives all of its information from Repo alone. If we get the files from FilesMetadata, it would introduce a dependency between sources (they would have to be instantiated in the right order and communicate with each other).

I can see two options:

1. `FilesMetadata` was poorly defined and should not actually be a source but some kind of helper class (?)
2. `Repo` should be responsible for file management, clone stuff and provide files/content

What do you think would be the best solution for this? (basically so that LicenseMetadata can get license content in the simplest way)

marftn · 2023-01-04T13:56:31Z

> I can see two options:
>
> 1. `FilesMetadata` was poorly defined and should not actually be a source but some kind of helper class (?)
> 2. `Repo` should be responsible for file management, clone stuff and provide files/content
>
> What do you think would be the best solution for this? (basically so that LicenseMetadata can get license content in the simplest way)

It's difficult to choose what would be the best option. IMO, we should redefine FilesMetadata and what we would like to extract from it.

It would make sense to make Repo responsible for the files management and cloning, and use FilesMetadata as a helper class or set of functions to extract the metadata from the files (programming language, size, etc.). In that sense, the answer to the question would be "both options" 😁

Here is an example of what Repo would do in one of its methods:

if path is a url:
    self.path = git.clone(path)  # self.path is now the folder where the repo was cloned
else:
    self.path = path

self.files = FilesMetadata(self.path)
self.lang = self.files.get_prog_lang()
license_path = self.files.get_license_path()
self.license = LicenseMetadata(license_path)

What do you think?
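To make the proposal concrete, here is a minimal runnable sketch of the same idea. It is only an assumption of how the pieces could fit together, not the actual gimie classes: the method name `get_license_paths`, the filename pattern, and cloning through the `git` command line are all placeholders chosen for the example.

```python
import re
import subprocess
import tempfile
from pathlib import Path
from urllib.parse import urlparse


class FilesMetadata:
    """Helper that inspects the files of a local folder (not a source itself)."""

    def __init__(self, path):
        self.path = Path(path)

    def get_license_paths(self):
        # Hypothetical, deliberately simple rule: top-level files whose name
        # starts with "license", "licence" or "copying" (case-insensitive).
        pattern = re.compile(r"(licen[sc]e|copying)", re.IGNORECASE)
        return sorted(
            str(p)
            for p in self.path.iterdir()
            if p.is_file() and pattern.match(p.name)
        )


class LicenseMetadata:
    """Receives ready-made license paths; it never searches the repository."""

    def __init__(self, *paths):
        self.paths = list(paths)


class Repo:
    """Owns cloning and file management, then hands results to the sources."""

    def __init__(self, path):
        if urlparse(str(path)).scheme in ("http", "https"):
            # Clone into a temporary folder; requires the git CLI on PATH.
            target = tempfile.mkdtemp()
            subprocess.run(
                ["git", "clone", "--depth", "1", str(path), target], check=True
            )
            path = target
        self.path = Path(path)
        self.files = FilesMetadata(self.path)
        self.license = LicenseMetadata(*self.files.get_license_paths())
```

In this layout only `Repo` knows whether the input was a URL or a folder, and `LicenseMetadata` can assume it is handed valid license paths.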

cmdoret · 2023-01-04T15:21:15Z

> make Repo responsible for the files management and cloning, and use FilesMetadata as a helper

Makes sense to me. It means that we don't need to worry about finding the license file in license.py and can assume the input to be the correct license path(s).

rmfranken · 2023-01-05T15:11:22Z

Ok. I will rewrite the function a bit so that it takes a valid license path as a parameter, and move the rest of the function to the FilesMetaData/Repo class (?). Then the search-for-a-licensefile part of the function can be re-used.

Also I will make a new .py file for "turning stuff into triples" so I can move the add_license_to_graph function there.

cmdoret · 2023-01-05T17:17:04Z

> FilesMetaData/Repo class (?)

Yes, I think this could go into FilesMetadata

> I will make a new .py file for "turning stuff into triples"

Great! Thanks, this could also be part of a separate PR

rmfranken · 2023-01-06T09:58:07Z

Ok, I think this should work. I did ask ChatGPT to help me with the class part, so please pay extra attention there. It looks like it works and makes sense to me, but I'm still struggling to wrap my head around classes; I think that will still take some time.

I also have a suspicion that my auto-black reformat is slightly different than yours... I think in the future I don't want to commit the auto-reformat of the other files it also found in the repo (like the tests, git, and init .py's). Should I find a different way to reformat my code into Black so that it matches yours (do you have a link to whatever tool you use?)

cmdoret · 2023-01-06T11:20:26Z

> auto-black reformat is slightly different than yours

No it works properly :) I think the files in question had never been reformatted 😄

cmdoret · 2023-01-06T15:29:51Z

For some reason, when running

from scancode.api import get_licenses
get_licenses('LICENSE')

I keep running into this error:

    349 def get_index(force=False, index_all_languages=False):
    350     """
    351     Return and eventually build and cache a LicenseIndex.
    352     """
--> 353     return get_cache(force=force, index_all_languages=index_all_languages).index

AttributeError: 'LicenseCache' object has no attribute 'index'

I suspect this is a version issue with scancode-toolkit, as they seem to have recently refactored the license part.
It does not seem to work with the version in pyproject.toml on my side.

cmdoret · 2023-01-06T16:53:09Z

OK this is actually a known bug in scancode-toolkit (aboutcode-org/scancode-toolkit#3179). Whenever this is fixed, we will just have to bump the scancode-toolkit version in our dependencies, but in the meantime, running `scancode --reindex-licenses` once fixes the issue.
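If the workaround ever needs to be automated (in CI, for example), one possible approach is to retry the scan once after rebuilding the index. This is only a sketch: the helper names are made up, and it assumes the broken cache always surfaces as an `AttributeError`, which is all the traceback above shows.

```python
import subprocess


def reindex_licenses():
    # Runs the workaround mentioned above; requires the scancode CLI on PATH.
    subprocess.run(["scancode", "--reindex-licenses"], check=True)


def scan_with_reindex_fallback(scan, path, reindex=reindex_licenses):
    """Call scan(path); on AttributeError, rebuild the index once and retry."""
    try:
        return scan(path)
    except AttributeError:
        reindex()
        return scan(path)
```

It would be called as `scan_with_reindex_fallback(get_licenses, "LICENSE")` with `get_licenses` imported from `scancode.api`. Catching `AttributeError` is fairly broad, so this should go away again once the upstream fix is released.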

cmdoret · 2023-01-06T17:42:24Z

I introduced a few bugs when refactoring the code 👀 now the unit test from the docstring passes (i.e. license correctly parsed). If this is OK for @martinfontanet I think this could be merged.

marftn

Sorry for the delay!
It looks good to me :)

rmfranken and others added 6 commits December 20, 2022 06:13

feat: Created a license finder using scancode toolkit

04c9acd

chore: reformatted using black

e9d973e

refactor: gimie.html -> gimie.web to fix weird name collision bug

d681946

feat: Added triple serialization of license result (spdx url)

70f4f29

refactor: With the help of ChatGPT and Cyril: Changed the path+file finding. Wrapped the graph part in its own function which we may move to another file in the future?

089a17b

chore: black reformat

ea234e3

rmfranken requested a review from marftn December 20, 2022 14:20

Merge branch 'main' into License_scanner

15942fb

Merge branch 'main' into License_scanner

fe13675

rmfranken added 3 commits January 6, 2023 10:47

refactor: removed the searching for the license file part of the code. This is now in the files.py class.

c286a7b

refactor: Added a function that finds the license file based on a regular expression.

b564307

chore: black reformat

170b011

cmdoret added 3 commits January 6, 2023 13:57

refactor: improve license file search

66a0b43

* recursive directory search with `os.walk`
* more elaborate regex to avoid picking up source files
* rename method to an action (locate_licenses)
* add docstrings and doctest
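A rough sketch of what such a search can look like is shown below. The regular expression is an illustrative guess rather than the one committed here, and pruning hidden directories is an extra assumption of this example.

```python
import os
import re

# Hypothetical pattern: LICENSE / LICENCE / COPYING, optionally followed by
# a "-suffix" and a plain-text extension. A source file such as
# "license.py" does not match, and neither does any name starting with ".".
LICENSE_RE = re.compile(
    r"(licen[sc]e|copying)(-\w+)?(\.(txt|md|rst))?", re.IGNORECASE
)


def locate_licenses(root):
    """Return sorted paths of license-like files found anywhere under root."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune hidden directories in place so os.walk skips them (e.g. .git).
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in filenames:
            # fullmatch requires the whole file name to fit the pattern.
            if LICENSE_RE.fullmatch(name):
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```

Matching on the full file name rather than a prefix is what keeps files like `license.py` or `licenses_test.py` out of the result.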

refactor: Restore placeholder LicenseMetadata class

7658ed1

* LicenseMetadata restored with docstring
* single get_licenses method adapted from find_licenses for multiple paths

fix: exclude hidden files from license search

3778852

cmdoret added 2 commits January 6, 2023 18:40

fix: correctly handle one or multiple license paths

6a2cbce

chore: drop support for python<=3.10 due to extruct dep

3749072

cmdoret added 3 commits January 6, 2023 18:49

refactor: use *args in LicenseMetadata to simplify logic

162de74

docs: specify type hints and rm unused imports in LicenseMetadata

2769745

test: bump scancode version and fix paths to fix unit tests

96245f1

marftn approved these changes Jan 11, 2023

cmdoret and others added 5 commits January 11, 2023 15:17

Merge branch 'main' into License_scanner

8771206

chore: fix black formatting

b31c81f

test: fix git test

53fda07

ci/cd: fetch full git history to fix test_git_creator

335245e

ci/cd: temporarily drop py 3.11 due to poetry bug

61bdfb8

cmdoret merged commit 9cda1ec into main Jan 11, 2023

cmdoret deleted the License_scanner branch January 16, 2023 10:08
