Feature: check to see if incoming SIP has already been ingested #106

Open
sallain opened this issue Jan 13, 2025 · 1 comment
sallain commented Jan 13, 2025

Is your feature request related to a problem? Please describe.

When processing a large volume of packages, it can be easy to upload the same package twice. This could also happen if package deposit is automated or in a variety of other situations. In any case, processing the same package twice is an unnecessary use of compute resources and storage space and should be avoided.

Describe the solution you'd like

I would like Enduro to check to see if the package has already been ingested. This could be done by computing a checksum for the package and recording this in a database, to be checked against in the future. All incoming packages will be compressed, so computing a checksum should be fast. Any checksum that is a repeat of one already recorded in the database will trigger a failure.

This should be optional 😉

Describe alternatives you've considered

This is a pretty high-level check that will not catch very similar packages. It would be possible to be much more specific - checking individual file checksums, for example, or checking certain metadata elements - but I think this will suffice for the migration (and therefore MVP).

Additional context

@sallain sallain added this to Enduro Jan 13, 2025
@sallain sallain moved this to 👍 Ready in Enduro Jan 13, 2025
@sallain sallain moved this from 👍 Ready to ⏳ In Progress in Enduro Jan 14, 2025
@jraddaoui

An initial implementation for this issue has been included in the v0.7.0 release. It required three main changes:

Receive compressed SIPs from Enduro:

To make it possible to calculate a checksum for the entire SIP, SIPs are now passed in compressed form from Enduro to the preprocessing workflow, and the extraction happens in preprocessing.

Add persistence layer:

A persistence package was added for database interactions, with a small schema defined using Ent. For now, it supports only MySQL and creates a single table that stores each SIP's file name and calculated checksum.

Check for duplicate SIP:

The actual check is optional and can be enabled or disabled using the checkDuplicates config variable. When it's disabled, the persistence layer and its config are not needed, and the DB is not consulted.

When enabled, the check is the first thing that happens in the preprocessing workflow. It calculates the SIP checksum and tries to create an entry in the DB. If the checksum already exists, it fails the workflow immediately. This means that, once a SIP has been ingested, you won't be able to ingest it again unless its contents change.


Most of the expected errors should be content validation errors, for which it is acceptable to require different contents before the SIP can be ingested again. However, this doesn't account for unexpected or system errors (like the one we saw during testing, where the PIPs failed in AM due to a system error): in those cases the checksum has already been recorded, so the unchanged SIP cannot simply be resubmitted. Because this initial implementation is done in a custom preprocessing child workflow, this issue doesn't have an "easy fix".

If needed, in the next iteration we could try to find a general and configurable solution in the parent Enduro workflow, or investigate options using the poststorage child workflows.

Projects
Status: ⏳ In Progress