License scanner #12

Merged
merged 24 commits into from
Jan 11, 2023
Changes from 19 commits
Commits
04c9acd
feat: Created a license finder using scancode toolkit
rmfranken Dec 20, 2022
e9d973e
chore: reformatted using black
rmfranken Dec 20, 2022
d681946
refactor: gimie.html -> gimie.web to fix weird name collision bug
cmdoret Dec 17, 2022
70f4f29
feat: Added triple serialization of license result (spdx url)
rmfranken Dec 20, 2022
089a17b
refactor: With the help of ChatGPT and Cyril: Changed the path+file f…
rmfranken Dec 20, 2022
ea234e3
chore: black reformat
rmfranken Dec 20, 2022
15942fb
Merge branch 'main' into License_scanner
cmdoret Dec 20, 2022
fe13675
Merge branch 'main' into License_scanner
cmdoret Jan 4, 2023
c286a7b
refactor: removed the searching for the license file part of the code…
rmfranken Jan 6, 2023
b564307
refactor: Added a function that finds the license file based on a reg…
rmfranken Jan 6, 2023
170b011
chore: black reformat
rmfranken Jan 6, 2023
66a0b43
refactor: improve license file search
cmdoret Jan 6, 2023
7658ed1
refactor: Restore placeholder LicenseMetadata class
cmdoret Jan 6, 2023
3778852
fix: exclude hidden files from license search
cmdoret Jan 6, 2023
6a2cbce
fix: correctly handle one or multiple license paths
cmdoret Jan 6, 2023
3749072
chore: drop support for python<=3.10 due to extruct dep
cmdoret Jan 6, 2023
162de74
refactor: use *args in LicenseMetadata to simplify logic
cmdoret Jan 6, 2023
2769745
docs: specify type hints and rm unused imports in LicenseMetadata
cmdoret Jan 6, 2023
96245f1
test: bump scancode version and fix paths to fix unit tests
cmdoret Jan 11, 2023
8771206
Merge branch 'main' into License_scanner
cmdoret Jan 11, 2023
b31c81f
chore: fix black formatting
cmdoret Jan 11, 2023
53fda07
test: fix git test
cmdoret Jan 11, 2023
335245e
ci/cd: fetch full git history to fix test_git_creator
cmdoret Jan 11, 2023
61bdfb8
ci/cd: temporarily drop py 3.11 due to poetry bug
cmdoret Jan 11, 2023
6 changes: 3 additions & 3 deletions gimie/__init__.py
@@ -2,13 +2,13 @@
 # Copyright 2022 - Swiss Data Science Center (SDSC)
 # A partnership between École Polytechnique Fédérale de Lausanne (EPFL) and
 # Eidgenössische Technische Hochschule Zürich (ETHZ).
-#
+#
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
-#
+#
 # http://www.apache.org/licenses/LICENSE-2.0
-#
+#
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
Empty file added gimie/sources/__init__.py
Empty file.
47 changes: 42 additions & 5 deletions gimie/sources/files.py
@@ -2,18 +2,55 @@
 # Copyright 2022 - Swiss Data Science Center (SDSC)
 # A partnership between École Polytechnique Fédérale de Lausanne (EPFL) and
 # Eidgenössische Technische Hochschule Zürich (ETHZ).
-#
+#
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
-#
+#
 # http://www.apache.org/licenses/LICENSE-2.0
-#
+#
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+import os
+import re
+
+from typing import List
+
+
 class FilesMetadata:
-    def __init__(self, path: str):
-        raise NotImplementedError
+    """This class provides helpers to navigate and read metadata files
+    from a project directory.
+
+    Examples
+    --------
+    >>> FilesMetadata('.').locate_licenses()
+    ['./LICENSE']
+    """
+
+    def __init__(self, project_path: str):
+        self.project_path = project_path
+
+    def locate_licenses(self) -> List[str]:
+        """Returns valid potential paths to license files in the project.
+        This uses pattern-matching on file names.
+        """
+        license_files = []
+        pattern = r".*(license(s)?|reus(e|ing)|copy(ing)?)(\.(txt|md|rst))?$"
+        for root, _, files in os.walk(self.project_path):
+            # skip toplevel hidden dirs (e.g. .git/)
+            subdir = os.path.relpath(root, self.project_path)
+            if subdir.startswith(".") and subdir != ".":
+                continue
+            for file in files:
+                # skip hidden files
+                if file.startswith("."):
+                    continue
+
+                if re.match(pattern, file, flags=re.IGNORECASE):
+                    license_path = os.path.join(root, file)
+                    license_files.append(license_path)
+
+        return license_files
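
The file search in `locate_licenses` can be tried out on a throwaway directory tree. The sketch below copies the pattern and the walk from the diff into a standalone function; the function name `locate_licenses` at module level, the `demo` helper, and the sample file names are assumptions made for illustration and are not part of the PR.

```python
import os
import re
import tempfile
from typing import List

# Same pattern as in the diff: a name ending in license(s), reuse/reusing or
# copy/copying, optionally followed by a .txt, .md or .rst extension.
PATTERN = r".*(license(s)?|reus(e|ing)|copy(ing)?)(\.(txt|md|rst))?$"


def locate_licenses(project_path: str) -> List[str]:
    """Standalone copy of the search logic shown in the diff."""
    found = []
    for root, _, files in os.walk(project_path):
        # Skip top-level hidden directories (and everything below them).
        subdir = os.path.relpath(root, project_path)
        if subdir.startswith(".") and subdir != ".":
            continue
        for name in files:
            # Skip hidden files.
            if name.startswith("."):
                continue
            if re.match(PATTERN, name, flags=re.IGNORECASE):
                found.append(os.path.join(root, name))
    return found


def demo() -> List[str]:
    """Build a small sample tree (hypothetical names) and search it."""
    samples = (
        "LICENSE",
        "README.md",
        "docs/COPYING.txt",
        ".git/LICENSE",
        "src/license.py",
        ".license",
    )
    with tempfile.TemporaryDirectory() as tmp:
        for rel in samples:
            path = os.path.join(tmp, *rel.split("/"))
            os.makedirs(os.path.dirname(path), exist_ok=True)
            open(path, "w").close()
        # Report paths relative to the project root, with "/" separators.
        return sorted(
            os.path.relpath(p, tmp).replace(os.sep, "/")
            for p in locate_licenses(tmp)
        )


found = demo()
print(found)
```

Note that the pattern is permissive: any file name that ends in `copy` (for example `photocopy`) also matches, so callers may want to treat the result as a list of candidates rather than confirmed license files.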
17 changes: 14 additions & 3 deletions gimie/sources/git.py
@@ -66,7 +66,12 @@ def __init__(self, path: str):
     @cached_property
     def authors(self) -> Tuple[str]:
         """Get the authors of the repository."""
-        return tuple(set(commit.author.name for commit in self.repository.traverse_commits()))
+        return tuple(
+            set(
+                commit.author.name
+                for commit in self.repository.traverse_commits()
+            )
+        )

     @cached_property
     def creation_date(self) -> Optional[datetime.datetime]:
@@ -90,8 +95,14 @@ def releases(self) -> Tuple[Release]:
         try:
             # This is necessary to initialize the repository
             next(self.repository.traverse_commits())
-            releases = tuple(Release(tag=tag.name, date=tag.commit.authored_datetime,
-                                     commit_hash=tag.commit.hexsha) for tag in self.repository.git.repo.tags)
+            releases = tuple(
+                Release(
+                    tag=tag.name,
+                    date=tag.commit.authored_datetime,
+                    commit_hash=tag.commit.hexsha,
+                )
+                for tag in self.repository.git.repo.tags
+            )
             return sorted(releases)
         except StopIteration:
             return None
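
One detail of the `releases` hunk is that `sorted()` always returns a list, while the method is annotated as returning a tuple. The sketch below illustrates this with a stand-in `Release` dataclass. The real `Release` class is not shown in this diff, so the dataclass, its ordering by `date`, the `sort_releases` helper, and the sample tags and hashes are all assumptions.

```python
import datetime
from dataclasses import dataclass, field
from typing import Iterable, Tuple


@dataclass(order=True)
class Release:
    # Hypothetical stand-in: only `date` takes part in comparisons.
    date: datetime.datetime
    tag: str = field(compare=False)
    commit_hash: str = field(compare=False)


def sort_releases(releases: Iterable[Release]) -> Tuple[Release, ...]:
    """Sort releases and convert back to a tuple to match the annotation."""
    return tuple(sorted(releases))


# Sample data (made up for the example), deliberately out of order.
unsorted = (
    Release(tag="v0.2.0", date=datetime.datetime(2023, 1, 11), commit_hash="bbb"),
    Release(tag="v0.1.0", date=datetime.datetime(2022, 12, 20), commit_hash="aaa"),
)

as_list = sorted(unsorted)           # a list, regardless of the input type
ordered = sort_releases(unsorted)    # a tuple, oldest release first
print(type(as_list).__name__, [r.tag for r in ordered])
```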
53 changes: 48 additions & 5 deletions gimie/sources/license.py
@@ -2,18 +2,61 @@
 # Copyright 2022 - Swiss Data Science Center (SDSC)
 # A partnership between École Polytechnique Fédérale de Lausanne (EPFL) and
 # Eidgenössische Technische Hochschule Zürich (ETHZ).
-#
+#
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 # http://www.apache.org/licenses/LICENSE-2.0
-#
+#
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+
+from typing import List, Tuple
+from scancode.api import get_licenses
+
+
 class LicenseMetadata:
-    def __init__(self, path: str):
-        raise NotImplementedError
+    """
+    This class provides metadata about software licenses.
+    It requires paths to files containing the license text.
+
+    Attributes
+    ----------
+    paths:
+        The collection of paths containing license information.
+
+    Examples
+    --------
+    >>> LicenseMetadata('./LICENSE').get_licenses()
+    ['https://spdx.org/licenses/Apache-2.0']
+    """
+
+    def __init__(self, *paths: str):
+        self.paths: Tuple[str] = paths
+
+    def get_licenses(self, min_score: int = 50) -> List[str]:
+        """Returns the SPDX URLs of detected licenses.
+        Performs a diff comparison between file contents and a
+        database of licenses via the scancode API.
+
+        Parameters
+        ----------
+        min_score:
+            The minimum matching score used by scancode (from 0 to 100)
+            to return a license match.
+
+        Returns
+        -------
+        licenses:
+            A list of SPDX URLs matching provided licenses,
+            e.g. https://spdx.org/licenses/Apache-2.0.
+        """
+        mappings = get_licenses(self.paths[0], min_score=min_score)
+        licenses = [
+            mapping["spdx_url"] for mapping in mappings.get('licenses')
+        ]
+
+        return licenses
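
As shown in this hunk, `get_licenses` scans only `self.paths[0]`, even though the class accepts several paths. A minimal sketch of how every path could be covered follows. It is not the PR's implementation: `collect_spdx_urls` and `fake_scan` are hypothetical names, and `scan` is a stand-in for `scancode.api.get_licenses` that is assumed to return the shape used in the diff (a mapping with a `"licenses"` list whose entries carry an `"spdx_url"` key). The stand-in keeps the example runnable without scancode installed.

```python
from typing import Any, Callable, Dict, Iterable, List

# Hypothetical scanner type: takes a file path plus a min_score keyword
# and returns a mapping with a "licenses" list, as assumed above.
Scanner = Callable[..., Dict[str, Any]]


def collect_spdx_urls(
    paths: Iterable[str], scan: Scanner, min_score: int = 50
) -> List[str]:
    """Scan every path and return the distinct SPDX URLs, in first-seen order."""
    urls: List[str] = []
    for path in paths:
        result = scan(path, min_score=min_score)
        # Tolerate a missing or empty "licenses" entry.
        for mapping in result.get("licenses") or []:
            url = mapping.get("spdx_url")
            if url and url not in urls:
                urls.append(url)
    return urls


def fake_scan(location: str, min_score: int = 0) -> Dict[str, Any]:
    """Canned results standing in for a real scanner (illustration only)."""
    canned = {
        "LICENSE": [{"spdx_url": "https://spdx.org/licenses/Apache-2.0"}],
        "COPYING": [
            {"spdx_url": "https://spdx.org/licenses/Apache-2.0"},
            {"spdx_url": "https://spdx.org/licenses/MIT"},
        ],
    }
    return {"licenses": canned.get(location, [])}


urls = collect_spdx_urls(["LICENSE", "COPYING", "NOTICE"], fake_scan)
print(urls)
```

Passing the scanner in as an argument is a choice made for this example so that it can be exercised with canned data; in the real class the call to scancode could simply be made inside the loop.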