Is your feature request related to a problem? Please describe.
When processing a large volume of packages, it can be easy to upload the same package twice. This could also happen if package deposit is automated or in a variety of other situations. In any case, processing the same package twice is an unnecessary use of compute resources and storage space and should be avoided.
Describe the solution you'd like
I would like Enduro to check to see if the package has already been ingested. This could be done by computing a checksum for the package and recording this in a database, to be checked against in the future. All incoming packages will be compressed, so computing a checksum should be fast. Any checksum that is a repeat of one already recorded in the database will trigger a failure.
This should be optional 😉
Describe alternatives you've considered
This is a pretty high-level check that will not catch very similar packages. It would be possible to be much more specific - checking individual file checksums, for example, or checking certain metadata elements - but I think this will suffice for the migration (and therefore MVP).
Additional context
An initial implementation for this issue is included in the v0.7.0 release. It required three main changes:
Receive compressed SIPs from Enduro:
To be able to calculate a checksum for the entire SIP, SIPs are now passed compressed from Enduro to the preprocessing workflow, and extraction happens in preprocessing.
Add persistence layer:
Include a persistence package for database interactions and define a small schema using Ent. For now, it only supports MySQL and creates a single table for the SIPs with their file name and calculated checksum.
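The table described above could look roughly like the following DDL. This is an illustrative sketch of the shape, not the schema Ent actually generates; the table and column names are assumptions.

```sql
-- Hypothetical shape of the single SIP table; names are illustrative,
-- not the actual Ent-generated schema.
CREATE TABLE sips (
    id       BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    name     VARCHAR(255)    NOT NULL,  -- SIP file name
    checksum CHAR(64)        NOT NULL,  -- hex digest of the compressed SIP
    PRIMARY KEY (id),
    UNIQUE KEY sips_checksum (checksum)
);
```

With a unique index on the checksum column, the insert itself doubles as the duplicate test: a second SIP with the same digest fails the insert atomically, with no separate lookup needed.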
Check for duplicate SIP:
The actual check is optional and can be enabled or disabled using the checkDuplicates config variable. When it's disabled, the persistence layer/config is not needed and the DB is not considered.
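In config terms, enabling the check might look something like this. The fragment below is a hypothetical TOML-style sketch: only the checkDuplicates name comes from the description above; the section and database keys are invented for illustration.

```toml
# Hypothetical preprocessing config fragment; only checkDuplicates is
# a real option named here, the other keys are illustrative.
[preprocessing]
checkDuplicates = true

[preprocessing.persistence]
# Only needed when checkDuplicates is enabled; MySQL is the only
# supported backend for now.
driver = "mysql"
```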
When enabled, the check is the first thing that happens in the preprocessing workflow. It calculates the SIP checksum and tries to create an entry in the DB; if the checksum already exists, it fails the workflow directly. This means that if a SIP has already been ingested, you won't be able to ingest it again unless the contents change.
Most of the expected failures should be content validation errors, for which requiring changed contents before re-ingest is reasonable. However, this doesn't account for unexpected or system errors (like we saw during testing, where the PIPs failed in AM due to a system error): the SIP is recorded as ingested even though nothing was wrong with its contents. Because this initial implementation is done in a custom preprocessing child workflow, the issue mentioned above doesn't have an "easy fix".
If it's needed, in the next iteration we could try to find a general and configurable solution in the parent Enduro workflow, or investigate options using the poststorage child workflows.