Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Package scanning update #482

Closed
wants to merge 25 commits into from
Closed

Package scanning update #482

wants to merge 25 commits into from

Conversation

JonoYang
Copy link
Member

@JonoYang JonoYang commented Aug 2, 2022

This PR updates application package scanning in scancode.io to reflect the updates made to package scanning in scancode-toolkit. Most notably, the DiscoveredDependency model has been added to scancode.io (as the analogue to the Dependency model from scancode-toolkit), and the scan_for_application_packages scancode pipe function has been updated to run the assemble() methods from scancode-toolkit Package handlers. This allows us to create DiscoveredDependencies for DiscoveredPackages and to properly associate application packages to CodebaseResources.

The Project API has been updated with the following:

  • dependency_count
    • The number of DiscoveredDependencies associated with the project.
  • discovered_dependency_summary
    • A mapping that contains following fields:
      • total
        • The number of DiscoveredDependencies associated with the project.
      • is_runtime
        • The number of runtime dependencies.
      • is_optional
        • The number of optional dependencies.
      • is_resolved
        • The number of resolved dependencies.

The Project detail view has been updated to show the DiscoveredDependencies stats and a DiscoveredDependency view, but more work on the UI will be needed. In particular, the field labels (I'm not sure what the best labels are) and colors used for boolean values should be updated.
image

The DiscoveredDependency view field labels, column widths, and search box (doesn't filter) need to be updated as well.
image

This feature requires aboutcode-org/scancode-toolkit#3035, which will be available in scancode-toolkit 31.0.0rc4

JonoYang added 17 commits July 20, 2022 17:03
    * Update scan_for_application_packages to save detected Package data to the CodebaseResource it is from, then iterate through the CodebaseResources with Package data and use the proper Package handler to process the Package data
    * Create DiscoveredDependency model
    * Add package_data JSON field to CodebaseResource

Signed-off-by: Jono Yang <[email protected]>
    * Increase field sizes in DiscoveredDependency

Signed-off-by: Jono Yang <[email protected]>
    * Create DiscoveredDependencySerializer

Signed-off-by: Jono Yang <[email protected]>
    * We never run into the situation where we are updating a DiscoveredDependency when scanning packages
    * Clean up migrations

Signed-off-by: Jono Yang <[email protected]>
    * Update scancode-toolkit to 31.0.0rc3 #447
    * Update expected test results
    * Update code formatting

Signed-off-by: Jono Yang <[email protected]>
    * This is to reflect the changes made to the Package model in scancode-toolkit
    * Update test expectations

Signed-off-by: Jono Yang <[email protected]>
    * purl and dependency_uid are the only required fields on DiscoveredDependency
    * Update migrations
    * Check for created DiscoveredDependency in tests

Signed-off-by: Jono Yang <[email protected]>
    * Create DiscoveredDependencies in load_codebase pipeline
    * Update tests to check for DiscoveredDependencies
    * Update test expectations

Signed-off-by: Jono Yang <[email protected]>
    * Update test expectations

Signed-off-by: Jono Yang <[email protected]>
    * This will fail until scancode-toolkit 31.0.0rc4 is released
    * Update test expectations

Signed-off-by: Jono Yang <[email protected]>
Signed-off-by: Jono Yang <[email protected]>
@JonoYang JonoYang requested a review from tdruez August 2, 2022 01:51
@JonoYang
Copy link
Member Author

JonoYang commented Aug 2, 2022

Some more to do's:

  • ManyToMany field on DiscoveredPackage to DiscoveredDependencies
  • Implement dependency detail view
    • Have hyperlink to the Package a dependency is from
  • Resources tab in package detail view does not work I can click on the Resources tab in firefox, but not in chrome

    * Remove related dependencies when resetting a project

Signed-off-by: Jono Yang <[email protected]>
@tdruez
Copy link
Contributor

tdruez commented Aug 3, 2022

ManyToMany field on DiscoveredPackage to DiscoveredDependencies

Let's try to make it part of the same migration to limit the number of migration files.

I can click on the Resources tab in firefox, but not in chrome

Can you try to refresh/reset your cache? The tab system is working fine on my side in chrome.


On a more general note, I think there's a bit too much within this one PR and it makes it harder to review and test.
It would have been better to split the resource name, the package assemble, and the dependencies (model and UI) in separate PRs. Not a big deal, but let's try to keep 1 PR per issue in the future. Code quality is great though ;)

Copy link
Contributor

@tdruez tdruez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code quality is great! I think we are also missing a few tests for the new functions and methods.

Comment on lines 1611 to 1660
def parent_path(self):
"""
Return the parent path for this CodebaseResource or None.
"""
return parent_directory(self.path, with_trail=False)

def has_parent(self):
"""
Return True if this CodebaseResource has a parent CodebaseResource or
False otherwise.
"""
parent_path = self.parent_path()
if not parent_path:
return False
if self.project.codebaseresources.filter(path=parent_path).exists():
return True
return False

def parent(self, codebase=None):
"""
Return the parent CodebaseResource object for this CodebaseResource or
None.

`codebase` is not used in this context but required for compatibility
with the commoncode.resource.Codebase class API.
"""
parent_path = self.parent_path()
return parent_path and self.project.codebaseresources.get(path=parent_path)

def has_siblings(self, codebase=None):
"""
Return True is this CodebaseResource has siblings.

`codebase` is not used in this context but required for compatibility
with the commoncode.resource.Codebase class API.
"""
return self.has_parent() and self.parent(codebase).has_children()

def siblings(self, codebase=None):
"""
Return a sequence of sibling Resource objects for this Resource
or an empty sequence.

`codebase` is not used in this context but required for compatibility
with the commoncode.resource.Codebase class API.
"""
if self.has_parent():
return self.parent(codebase).children(codebase)
return []

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are those new method related to this PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These methods are called in the .assembly() methods of different Package handlers in packagedcode. Some of these methods has logic that determine the Resources of a Package based on the files around a Package manifest. e.g. https://github.com/nexB/scancode-toolkit/blob/develop/src/packagedcode/pypi.py#L185

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've also updated the CodebaseResource walk test to also test for these methods

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok let's make sure we have full test coverage for those.

scanpipe/pipes/scancode.py Outdated Show resolved Hide resolved
_scan_and_save(
resource_qs=resource_qs,
scan_func=scan_for_package_data,
save_func=save_scan_package_results,
)

# Iterate through CodebaseResources with Package data and handle them using
# the proper Package handler from packagedcode
assemble_packages(project=project)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are the performances of the new assemble? Do we have some stats about the time spent to assemble compared to the scan_and_save processing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any metrics on the performance impact that assemble_packages has on application package scanning. What is the process you do to collect performance metrics on different scancode.io parts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We run the pipeline using various inputs pre-changes then again post-changes and we can compare the execution time of the relevant steps.

Also, a profiling utility is available to go deeper into optimizations: https://github.com/nexB/scancode.io/blob/main/scanpipe/pipelines/__init__.py#L155

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are the performances of the new assemble? Do we have some stats about the time spent to assemble compared to the scan_and_save processing?

The new assemble is a must have functional change that we should get working and tested before we can optimize if we ever need to optimize? ... @tdruez did you experience an issue?

scanpipe/pipes/scancode.py Outdated Show resolved Hide resolved
scanpipe/pipes/scancode.py Outdated Show resolved Hide resolved
scanpipe/templates/scanpipe/package_list.html Outdated Show resolved Hide resolved
scanpipe/templates/scanpipe/project_detail.html Outdated Show resolved Hide resolved
@JonoYang JonoYang force-pushed the package-scanning-update branch from 5671373 to 2640c2f Compare August 3, 2022 17:42
    * Show and link relations in the views

Signed-off-by: Jono Yang <[email protected]>
    * Remove dependency field from CodebaseResourceSerializer output
    * Update dependency donut chart labels
    * Revert Package URL hover to package_uid value in package_list view

Signed-off-by: Jono Yang <[email protected]>
@JonoYang JonoYang force-pushed the package-scanning-update branch from 83f545d to 4da9b62 Compare August 4, 2022 00:59
    * Update walk test to also test for new parent and sibling related methods on CodebaseResource
    * Remove has_sibling method from CodebaseResource

Signed-off-by: Jono Yang <[email protected]>
@JonoYang JonoYang force-pushed the package-scanning-update branch from 4da9b62 to 205dddf Compare August 4, 2022 01:03
@JonoYang
Copy link
Member Author

JonoYang commented Aug 4, 2022

@tdruez Thanks! I've responded to your comments. Now, I am wondering about the best approach for #444, where we want to represent individual instances of Packages and Dependencies. Right now, in the case where we have detected the same package multiple times, we create unique packages for each instance even though it is essentially the same package.

@JonoYang
Copy link
Member Author

JonoYang commented Aug 4, 2022

I am closing this PR and splitting it into two:

  • One where we implement the package assembly step
  • One where we create the DiscoveredDependency model

@JonoYang JonoYang closed this Aug 4, 2022
@tdruez tdruez deleted the package-scanning-update branch September 12, 2022 09:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants