Images that are in the wrong folder when uploading via ingestion client #109

Open
bhsi-snm opened this issue Jul 12, 2024 · 15 comments
Labels: 2 Priority; data issue (issues with the data); ingestion issue (issue ingesting images); metadata (associated metadata)

Comments

@bhsi-snm

When uploading images via the ingestion client from the pinned insect workstation in room 411, some of the images ended up in the wrong workstation folder on the N drive, namely WORKHERB0001.
There are 2019 images in the wrong folder (they ended up in WORKHERB0001 but should be in WORKPIOF0001).
The folders that contain the wrong images (pinned insects in the herbarium folder): 2024-1-19, 2024-1-23, 2024-1-26

This is probably a case of not having the right folder selected when uploading the files, but it could be something else. This happened back in January.
As far as I know, there are two error reports of the ingestion client misbehaving, but we haven't been able to replicate or pinpoint the error.
The dates of this error do not coincide with the errors reported on Slack, which were on 29th May and 28th June.

It might be a good idea to investigate it further to see if it is just a human error or something else.

@PipBrewer
Contributor

@ThomasAlscher1991 @bhsi-snm I am unable to open the tifs to look at the images, and the jpegs don't seem to be there in the folder. Could we generate the jpegs for these folders and put them on the N drive?

@ThomasAlscher1991

The automatic script stopped somehow, but I restarted it, so most folders have now been converted to JPEGs again. If some are still missing, wait until tomorrow; by then the whole backlog should be cleared.

@chelseagraham
Collaborator

I've seen the folders in
WORKHERB0001:
19 Jan with 54 files
23 Jan with 177 files
26 Jan with 358 files

WORKPIOF0001:
23 Jan with 54 files

@PipBrewer
Contributor

PipBrewer commented Jul 23, 2024

I noticed that different people were uploading on those dates. I asked the digitisers on Slack on 12th July whether anyone remembered any issues from this time, but got no response. With this in mind, it is strange that the issues are clustered in time. Could this have been a bug? However, it was so long ago that no one will be able to remember, so I think we need to close this issue. Is there a way to schedule checks so that we are informed more quickly if this happens again and can investigate while things are fresh in people's minds? @bhsi-snm

In the meantime:

WORKHERB0001 (all should be in folder WORKPIOF0001):
19 Jan with 54 files.
Pipeline needs changing from PIPEHERB0001 to PIPEPIOF0001
Workstation needs changing from WORKHERB0001 to WORKPIOF0001

23 Jan with 177 files
Preparation_type needs changing from sheet to pinned
Pipeline needs changing from PIPEHERB0001 to PIPEPIOF0001
Workstation needs changing from WORKHERB0001 to WORKPIOF0001

26 Jan with 358 files
Preparation_type needs changing from sheet to pinned
Pipeline needs changing from PIPEHERB0001 to PIPEPIOF0001
Workstation needs changing from WORKHERB0001 to WORKPIOF0001

@k-zamzam @ThomasAlscher1991 Are you able to "fix" this metadata and move the files to correct folder?
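
For illustration only, the metadata changes listed above could be sketched roughly as follows. This is a minimal sketch, not the project's actual tooling: the key names `pipeline`, `workstation` and `preparation_type` and the one-JSON-file-per-asset layout are assumptions, and the real metadata template may differ. It defaults to a dry run that only reports what would change.

```python
import json
from pathlib import Path

# Hypothetical key names; the real metadata template may use different ones.
# Each entry maps a key to (known wrong value, corrected value) from this issue.
CORRECTIONS = {
    "pipeline": ("PIPEHERB0001", "PIPEPIOF0001"),
    "workstation": ("WORKHERB0001", "WORKPIOF0001"),
    "preparation_type": ("sheet", "pinned"),
}


def plan_corrections(metadata: dict, corrections: dict = CORRECTIONS) -> dict:
    """Return {key: new_value} for every key that still holds the known wrong value."""
    return {
        key: new
        for key, (old, new) in corrections.items()
        if metadata.get(key) == old
    }


def fix_folder(folder: Path, dry_run: bool = True) -> list[tuple[Path, dict]]:
    """Report (and, if dry_run is False, apply) corrections for each *.json file."""
    report = []
    for path in sorted(folder.glob("*.json")):
        metadata = json.loads(path.read_text(encoding="utf-8"))
        changes = plan_corrections(metadata)
        if changes:
            report.append((path, changes))
            if not dry_run:
                metadata.update(changes)
                path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return report
```

A value is only changed when it still holds the known wrong value, so a folder whose preparation type is already correct (as with 19 Jan) is left alone on that field. Moving the files and updating the databases are not covered here.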

PipBrewer assigned bhsi-snm and unassigned PipBrewer and chelseagraham on Jul 23, 2024
@ThomasAlscher1991

I looked into some JPEG folders to gather data for some tests and found pinned insects in WORKHERB0001 as late as 2024-7-24.
The steps to correct faulty uploads have been documented in this repo.

@bhsi-snm
Author

@PipBrewer This is just one example of how things can go wrong. To address this issue completely, we need to have quality control in place. This means identifying the possible places things can go wrong and documenting them. Once we have that in place, we can then explore various ways to correct them.
@ThomasAlscher1991 has made a small effort to suggest a potential solution for this particular issue. It is also in the README of the DaSSCo-Image-IngestionAPI repo mentioned above by Thomas. I am posting the relevant part here for ease of reading.

Correct faulty uploads

This section contains a guideline for correcting errors that have been encountered so far.

General advice: Run these corrections when no other uploads are being processed.

NOTE: Check if the pipeline has been assigned incorrectly, too.

In case images have been uploaded to the wrong folder in your MEDIA_URL_base, go through this guide.

• Check if it was a server-side error (unlikely)
• Check if it was a client-side error (likely)
  • Has the workstation value been assigned incorrectly? If yes, the images have been uploaded to the wrong workstation folder in your MEDIA_URL_base and their GUID is wrong.
    • Move the following things to the right workstation folder:
      • Assets (each consists of image.tif and metadata.json)
      • If you are using JPEG conversion, also move the respective JPEGs to the right folder
  • Check if the institution and collection values have also been assigned incorrectly. If so, the GUID is wrong for that reason as well.
  • Identify the wrong value(s) and work out the corrected GUID on paper by replacing them with the correct value(s) (don't rename anything yet!)
  • Check in the refinery_db database whether this GUID already exists
    • If it doesn't exist, proceed
    • If it does, don't proceed and escalate this issue (the GUID is not unique anymore)
  • Correct file names by renaming them to the new GUID
    • metadata.json
    • image.tif
    • If you are using JPEG conversion, also rename the respective JPEGs
  • Rename the database entry in the following databases
    • refinery_db
      • in the assets table, the GUID column
    • If you are using JPEG conversion, also rename the entries in the JPEG database
      • Check whether the GUID is in the errorImages, queuedImages or convertedImages tables and change it to the new GUID. NOTE: also remember to update the jpegPath/tifPath column to point to the new workstation folder
  • Update metadata.json file contents
    • Depending on the metadata template version, update the following tags in the metadata.json file to the new GUID:
      • media_guid
      • asset_guid
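
As a rough illustration of the file-level part of the guide above (uniqueness check, renaming, updating the GUID tags), here is a minimal sketch. It makes several assumptions that are not confirmed by the README: that asset files are named `<GUID>.tif` and `<GUID>.json`, that the corrected GUID has already been worked out by hand, and that `existing_guids` is a set of GUIDs fetched beforehand from refinery_db by some other means. It does not touch any database or JPEG folder.

```python
import json
from pathlib import Path


class GuidCollision(Exception):
    """The corrected GUID is already taken; the guide says to stop and escalate."""


def rename_asset(folder: Path, old_guid: str, new_guid: str,
                 existing_guids: set[str]) -> list[Path]:
    """Rename <old_guid>.tif/.json to the new GUID and update the GUID tags.

    The file naming scheme and the existing_guids set are assumptions for
    this sketch, not the documented behaviour of the ingestion API.
    """
    if new_guid in existing_guids:
        raise GuidCollision(f"{new_guid} already exists; do not proceed")

    renamed = []
    for suffix in (".tif", ".json"):
        source = folder / f"{old_guid}{suffix}"
        if not source.exists():
            continue
        target = folder / f"{new_guid}{suffix}"
        if target.exists():
            raise FileExistsError(f"{target} already exists")
        source.rename(target)
        renamed.append(target)

    # Update the GUID tags named in the guide inside the renamed metadata file.
    metadata_path = folder / f"{new_guid}.json"
    if metadata_path.exists():
        metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
        for tag in ("media_guid", "asset_guid"):
            if tag in metadata:
                metadata[tag] = new_guid
        metadata_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return renamed
```

The collision check comes first so that nothing is renamed when the new GUID is not unique, matching the "don't proceed and escalate" step.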

@chelseagraham
Collaborator

To sum up, issues encountered while running QA of image content for images taken all-time (up to and including week 32) at all SNM workstations that are related to processing fall in the following categories:
• JPEGs with discoloration that is not present in the TIFF (7 instances)
• TIFFs exhibiting artifacts (2 instances)
• Blank TIFF (1 instance)
• White balance and/or rotation not applied (28 instances)
• Folder contains images from another pipeline (6 instances)
• Folder contains images from mixed pipelines (6 instances)

You can see visuals of these issues here:
WORKPIOF0001: https://github.com/NHMDenmark/DaSSCo-Image-Refinery/issues/290
WORKPIOF0002: https://github.com/NHMDenmark/DaSSCo-Image-Refinery/issues/291
WORKHERB0003: https://github.com/NHMDenmark/DaSSCo-Image-Refinery/issues/300
WORKHERB0001: https://github.com/NHMDenmark/DaSSCo-Image-Refinery/issues/292

@chelseagraham
Collaborator

I continue to find folders containing specimens that do not correspond, even in week 35. I have communicated with the Digitizers at NHMD and NHMA about this to ask them to be vigilant when selecting fields in the Ingestion Client.

chelseagraham moved this from For Review to Done in Herbarium workstation and workflow on Sep 5, 2024
PipBrewer added the labels data issue (issues with the data), metadata (associated metadata) and 2 Priority, and removed the label WORKHERB0001 (@NHMD) (First workstation at Herbarium C, NHMD) on Sep 12, 2024
@PipBrewer
Contributor

As discussed in the IT team meeting on 26/08/2024, renaming GUIDs is low priority. I wonder whether there is a way to do all of this smartly and in bulk (at least at folder level). Possibly discuss this with Allison to see if this is something she can take on in the future (once she is up and running with other things, which have priority). Added this to the data board.

@chelseagraham @bhsi-snm Have either of you listed in a document somewhere all of the incorrect ones found and what they should be?

@chelseagraham
Collaborator

I've included these folders in each of my QA reports on GitHub / N:/ and I've tagged Bhupjit to make him aware of each instance.

@PipBrewer
Contributor

@chelseagraham @bhsi-snm @beckerah That will be hard to work with (having things in multiple reports/tickets). We need a way to consolidate the info and document what has been done on it (as it is a multistep process). We also need to monitor the reasons for these issues and how frequently they occur. A single place is therefore needed. We need to:

  • Find out what happened
  • Document what is where
  • Document what is incorrect
  • Document what needs changing (this includes the physical location of folders, data in the JSON files, JPEGs and refinery_db)

We also need to identify and isolate them as soon as possible BEFORE they go through image processing pipeline.
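
One possible shape for such an early check, purely as a sketch: scan the metadata files in a date folder before processing and flag folders with mixed pipelines or with a preparation type that does not match the pipeline. The key names `pipeline` and `preparation_type` are assumptions, and the pipeline-to-preparation mapping is taken only from the corrections listed earlier in this thread. Note that this cannot catch a folder where every field was selected wrongly but consistently; those would still need QA of the image content.

```python
import json
from collections import Counter
from pathlib import Path

# Mapping inferred from the corrections in this issue; extend as needed.
EXPECTED_PREPARATION = {
    "PIPEHERB0001": "sheet",
    "PIPEPIOF0001": "pinned",
}


def audit_folder(folder: Path) -> list[str]:
    """Return human-readable problems found in a folder's *.json metadata files.

    Key names are hypothetical; adjust them to the real metadata template.
    """
    pipelines = Counter()
    problems = []
    for path in sorted(folder.glob("*.json")):
        metadata = json.loads(path.read_text(encoding="utf-8"))
        pipeline = metadata.get("pipeline")
        preparation = metadata.get("preparation_type")
        pipelines[pipeline] += 1
        expected = EXPECTED_PREPARATION.get(pipeline)
        if expected is not None and preparation != expected:
            problems.append(
                f"{path.name}: {pipeline} with preparation_type={preparation!r}"
            )
    if len(pipelines) > 1:
        problems.append(f"mixed pipelines: {dict(pipelines)}")
    return problems
```

Run on a schedule over new folders, an empty result would mean nothing suspicious was found by these two rules, not that the folder is guaranteed correct.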

@beckerah

@chelseagraham @PipBrewer @bhsi-snm
Does it make sense to have a spreadsheet that gets updated as I'm checking exports and includes:

  • Export filename
  • Number of rows in export (corresponding to number of barcodes)
  • Folder name where the images are stored (.tifs and/or .jpegs)
  • Total number of images in folder

Maybe having a log like this could kill two birds with one stone: help us see when something's gone awry with ingestion, and also locate the images that match the barcodes. Thoughts?
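
If filling this in by hand gets tedious, the four columns could in principle be gathered by a small script. The following is only a sketch under assumptions: that exports are CSV files with a header row and one barcode per data row, and that the log is kept as a CSV file rather than in a spreadsheet. The column names are made up for the example.

```python
import csv
from pathlib import Path

IMAGE_SUFFIXES = {".tif", ".tiff", ".jpg", ".jpeg"}


def log_row(export_file: Path, image_folder: Path) -> dict:
    """Collect the four log columns for one export and one image folder."""
    # Assumes a CSV export with a header row; DictReader skips the header.
    with export_file.open(newline="", encoding="utf-8") as handle:
        rows = sum(1 for _ in csv.DictReader(handle))
    images = sum(
        1
        for path in image_folder.iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )
    return {
        "export_filename": export_file.name,
        "rows_in_export": rows,
        "image_folder": str(image_folder),
        "images_in_folder": images,
    }


def append_to_log(log_file: Path, row: dict) -> None:
    """Append one row to the CSV log, writing the header if the file is new."""
    is_new = not log_file.exists()
    with log_file.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(row))
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

The sketch records both counts without comparing them, since a specimen may have more than one image and the two numbers need not be equal.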

@PipBrewer
Contributor

For now, it seems like a good solution to keep track of problems. Might be annoying to keep up long term, but definitely needed now.

@beckerah

I've started a spreadsheet on the N drive, in DaSSCo\Data. It's currently named trackingBarcodesAndImages.ods. Feel free to take a look and offer feedback. Hopefully this will help us keep better track of things in the interim, at least while we're waiting on getting the barcodes added to the image metadata.


5 participants