☄️ Derivative Rodeo Integration Epic #56

jeremyf · 2023-06-12T16:39:11Z

The goal of this punchlist is to outline the steps necessary to verify that IIIF Print picks up the changes.

Standardize fields display order in the "Descriptions" tab adventist_knapsack#659
Set/ensure SpaceStone's logger level is :info or more granular
Mint new patch version of Derivative Rodeo
Update SpaceStone's Derivative Rodeo version to the new patch version
Deploy SpaceStone serverless with the change
Remove from the bucket the 20121816 entry (https://s3.console.aws.amazon.com/s3/buckets/space-stone-dev-preprocessedbucketf21466dd-bxjjlz4251re?region=us-west-2&tab=objects)
Switch to the "working-on-posting-to-serverless" branch
Submit the below CSV to SpaceStone and review logs of the various lambdas; you're looking for the logged information
Re-submit the below CSV to SpaceStone and see that the logs are mentioning "we already have the file"

With the above SpaceStone and Derivative Rodeo adjustments

Set IIIF Print's application's logger level to :info
Update IIIF Print to use above Derivative Rodeo gem version
Update the IIIF Print configuration to leverage SpaceStone; this will require AWS credentials for the Pre Processed Buckets.
Run import of 20121816 entry (likely want to get a single CSV of this file)
Review logs; we should not see generating derivatives but instead should see log entries regarding found location
- See Split PDF and constituent pages as works with expected derivative files (e.g. thumbnail and JSON)
Run import for a single image entry that has been ingested

Derivative Rodeo Integration Tests for PDF Splitting

The following are the scenarios I’m working through for integration testing:

Scenario: PDF Split does not exist

Given a work with a PDF
And SpaceStone has not split the PDF
When we import the PDF
Then the application should split the PDF
And attach the resulting split pages as child works

Scenario: Thumbnail of PDF does not exist

Given a work with a PDF
And SpaceStone has not pre-processed the thumbnail
When we import the PDF
Then the application generates a thumbnail
And attaches the thumbnail to the work

Scenario: PDF Split exists

Given a work with a PDF
And SpaceStone has split the PDF
When we import the PDF
Then the application retrieves the pre-split pages
And attaches the retrieved pages as child works

Scenario: Thumbnail of PDF exists

Given a work with a PDF
And SpaceStone has pre-processed the thumbnail
When we import the PDF
Then the application retrieves the pre-processed thumbnail
And attaches the thumbnail to the work

After the integration test we will need to:

Related to: - #56

It is useful to see the inner workings of decision making regarding the derivative rodeo. That is to say: - "Does the file already exist at the target location?" - "Does the file exist at the pre-generated location?" - "Do we need to generate the file at the target location?" Related to: - notch8/derivative_rodeo#56
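The three questions above amount to a short-circuiting lookup. What follows is a minimal sketch of that ordering, not the rodeo's implementation: `resolve_derivative`, its keyword arguments, and the returned symbols are hypothetical names chosen for illustration, and the existence check is passed in as a callable so the example stays self-contained.

```ruby
require "logger"

# Hypothetical helper (not part of the Derivative Rodeo API).
# exists: any callable that answers "is there a file at this location?"
# Returns a symbol naming which branch was taken.
def resolve_derivative(target:, pregenerated:, exists:, logger: Logger.new($stdout))
  # 1. Does the file already exist at the target location?
  if exists.call(target)
    logger.info("Found #{target}; nothing to generate")
    return :found_at_target
  end

  # 2. Does the file exist at the pre-generated location?
  if exists.call(pregenerated)
    logger.info("Found pre-generated #{pregenerated}; would copy to #{target}")
    return :copied_from_pregenerated
  end

  # 3. Otherwise we need to generate the file at the target location.
  logger.info("Generating #{target}")
  :generated
end

# Usage with an in-memory stand-in for the bucket listing.
bucket = ["s3://example-bucket/20121816/page-1.tiff"]
exists = ->(location) { bucket.include?(location) }
resolve_derivative(
  target: "file:///tmp/derivatives/page-1.tiff",
  pregenerated: "s3://example-bucket/20121816/page-1.tiff",
  exists: exists
)
```

Logging at each branch is what makes the later "we already have the file" review possible: every decision leaves one line naming the location that was checked.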

This commit contains two things: 1. Updated documentation 2. Updated submodule Related to: - notch8/derivative_rodeo#56 The derivative rodeo commit changes are as follows: - notch8/derivative_rodeo@6fd304f :: 🎁 Adding logging to generators (2023-06-12) - notch8/derivative_rodeo@795a7d2 :: 🐛 Hacking away the .mono suffix for 2nd order derivatives (2023-06-09) - notch8/derivative_rodeo@2502c4c :: 🐛 Ensuring we submit any stray batch messages (2023-06-09)

Related to: - #56

This commit incorporates the logic for handling the reader versions of the files. It addresses a bug in our queue management; namely needing to submit the final buffered entries. Related to: - notch8/derivative_rodeo#56

jeremyf · 2023-07-11T13:35:12Z

I removed the 20121816 folder from the bucket, then ran the following to re-process:

<2023-07-11 Tue 09:19>: I submitted a single row for processing. This row was for a single PDF with 3 pages.
<2023-07-11 Tue 09:24>: Reviewed the bucket folder 20121816 and saw each page had been processed.
<2023-07-11 Tue 09:25>: Reviewed logs to see latest Derivative Rodeo changes in place.
<2023-07-11 Tue 09:29>: Resubmitted the row for processing; the expected behavior is that none of the files will change. Reviewed the split-ocr-thumbnailWorker logs and saw them mention:

DerivativeRodeo::Generators::PdfSplitGenerator#destination :: input_location file_uri s3://space-stone-dev-preprocessedbucketf21466dd-bxjjlz4251re.s3.us-west-1.amazonaws.com/20121816/20121816.ARCHIVAL–page-1.tiff: Found output_location file_uri s3://space-stone-dev-preprocessedbucketf21466dd-bxjjlz4251re.s3.us-west-1.amazonaws.com/20121816/20121816.ARCHIVAL–page-1.tiff.

In other words, the conditional generation is working.

For testing the rodeo, I want to use the version that has more verbose logging. Related to: - notch8/derivative_rodeo#56

Related to: - notch8/derivative_rodeo#56

This commit contains four primary changes: 1. Fixing a misnamed constant. 2. Moving setting the optional filename to a point before we use the filename. 3. Leveraging if include instead of case statements 4. Adding exception decorating to provide additional context. Of these, the quality of debugging change to exceptions pays the most dividends. It's helped provide insight into the specific URI that's failing. Related to: - notch8/derivative_rodeo#56

This commit includes 3 changes: 1. Replacing a misspelled name with the correct name 2. Improving logging 3. Adding a parameter to an exception. At one point we had a method `globbed_tail_locations` however with some refactoring I renamed that to `matching_locations_in_file_dir`; however I missed the `globbed_tail_loations`. In addition to fixing the misnamed method name, I'm adding some improved logging that helps with a problem I've encountered. Namely that I got a `#<ArgumentError: invalid byte sequence in UTF-8>` on a string. That string was the contents of the file found at the `path_to_hocr`; however the exception didn't provide insight into the filename. With this change, I fix a method that was broken and improve logging. Last, the exception for buckets used an unprovided variable (e.g. `file_uri`). This now provides that expected file_uri. Related to: - #56
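The "exception decorating" idea from these commits can be sketched as follows. This is an illustration under assumptions, not the code from the commits: `ContextualError` and `with_context` are hypothetical names, and the error is raised by hand to stand in for the failed read of the file at `path_to_hocr`.

```ruby
# Hypothetical error class used only for this illustration.
class ContextualError < StandardError; end

# Runs the block and, on failure, re-raises with the offending path in the
# message. Ruby records the original exception as #cause automatically when
# an exception is raised inside a rescue clause.
def with_context(path_to_hocr:)
  yield
rescue StandardError => e
  raise ContextualError,
        "#{e.class}: #{e.message} for path_to_hocr: #{path_to_hocr.inspect}"
end

begin
  with_context(path_to_hocr: "/tmp/example/20121816.TN.jpg") do
    # Stand-in for the failing operation on the file contents.
    raise ArgumentError, "invalid byte sequence in UTF-8"
  end
rescue ContextualError => e
  puts e.message
  puts "caused by: #{e.cause.class}"
end
```

The original exception stays reachable through `cause`, so the added context does not hide the underlying backtrace.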

jeremyf · 2023-07-13T20:44:37Z

Working Notes from 2023-07-13

First, I am working a local copy that has the following Gemfile adjustment:

gem 'iiif_print', path: 'vendor/gems/iiif_print'
gem 'derivative-rodeo', path: 'vendor/gems/derivative_rodeo'

And then I have cloned the iiif_print and derivative_rodeo repositories into vendor/gems.

I’m working with AARK ID 20121816; which can be found at http://oai.adventistdigitallibrary.org/OAI-script?verb=GetRecord&identifier=20121816&metadataPrefix=oai_adl

The above AARK ID can be found in http://oai.adventistdigitallibrary.org/OAI-script?verb=ListRecords&metadataPrefix=oai_adl&set=adl:other as the 25th entry. I created an ingester for the OAI feed and have been working through problems locally.

I have set up my docker compose to not start the web application nor the workers. Instead I bring it up and then have two terminals: one for rails console and one for good jobs. I'm often restarting the good_job workers/service.

Following on https://playbook-staging.notch8.com/en/samvera/bulkrax/re-running-a-single-entry-from-an-import; I bash into the worker and run rails console and enter the following:

switch!('adventist'); entry = Bulkrax::Entry.find(136)
GoodJob::Job.where(finished_at: nil).where(GoodJob::Job.arel_table['created_at'].lteq(DateTime.current)).joins_advisory_locks.where(pg_locks: { locktype: nil }).destroy_all
entry.factory.find&.destroy(eradicate: true) 
entry.build

In the worker:

export AWS_ACCESS_KEY_ID=xxx ; export AWS_SECRET_ACCESS_KEY=xxx ; export AWS_REGION=us-west-2 ; export AWS_S3_BUCKET=space-stone-dev-preprocessedbucketf21466dd-bxjjlz4251re ;  bundle exec good_job start
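Because the worker silently depends on those four variables, a quick pre-flight check can save a restart. A minimal sketch, assuming only the variable names from the export line above; `missing_aws_env` is a hypothetical helper, not something Hyku or GoodJob provides.

```ruby
# The four variables exported before starting good_job.
REQUIRED_AWS_ENV = %w[
  AWS_ACCESS_KEY_ID
  AWS_SECRET_ACCESS_KEY
  AWS_REGION
  AWS_S3_BUCKET
].freeze

# Returns the names of required variables that are unset or blank.
# Accepts any Hash-like object so it can be exercised without touching ENV.
def missing_aws_env(env = ENV)
  REQUIRED_AWS_ENV.select { |key| env[key].to_s.strip.empty? }
end

missing = missing_aws_env
warn "Missing AWS settings: #{missing.join(', ')}" unless missing.empty?
```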

Observed Problem

I am encountering an issue where the path_to_hocr is using an image file. As this is a thumbnail, I don’t really need to run derivatives on it. In fact, as part of the OAI import, we should skip attaching this file.

However, the underlying problem remains.


🤠🐮 DerivativeRodeo::Generators::WordCoordinatesGenerator#generated_files encountered `RuntimeError':
   “🤠🐮 DerivativeRodeo::Generators::WordCoordinatesGenerator#convert_to_coordinates encountered `ArgumentError' error “invalid byte sequence in UTF-8” for path_to_hocr: "/tmp/d20230713-248-1vwuyjg/adl-ebstore-repo.s3.amazonaws.com/20/1218/20121816/20121816.TN.jpg" and path_to_coordinate: "/tmp/d20230713-248-1jpg9k1/app/samvera/hyrax-webapp/tmp/derivatives/a6/2f/f0/75/-f/57/d-/43/f7/-b/7d/5-/29/dd/74/74/1d/f1-json.json"”
 for input_uri: "https://adl-ebstore-repo.s3.amazonaws.com/20/1218/20121816/20121816.TN.jpg",
 output_location_template: "file:///app/samvera/hyrax-webapp/tmp/derivatives/a6/2f/f0/75/-f/57/d-/43/f7/-b/7d/5-/29/dd/74/74/1d/f1-json.json",
 and preprocessed_location_template: "s3://space-stone-dev-preprocessedbucketf21466dd-bxjjlz4251re.s3.us-east-1.amazonaws.com/20121816/20121816.TN.jpg.coordinates.json".

With this commit, I'm providing insight into what we're requesting be generated. Later logging will report more granular information. I have found this helpful to understand what is happening in a rather expanse and sometimes opaque process. Related to: - #56

**Context:** We're incorporating the derivative rodeo into the ingest process. This is first intended to be used by the OAI importer. The situation is as follows: In the OAI feed there are URLs for both the digital objects and a thumbnail. Due to prior constraints of ingest, we had to add the thumbnail as a FileSet to the work. However, with the derivative rodeo, we can look in S3 for an existing thumbnail (that was generated via SpaceStone) and assign that thumbnail to the digital object(s)'s file set. In other words, we can avoid adding the thumbnail file set (and running all the derivatives on that file set as well). However, this may conflict with some work done in GitLab I62 (as detailed in commit @433b66a8d95f49bd40335b5483621bb4e4a41227).

**Discussion:** Prior to adding this change, when the derivative service was working on the TN.jpg file (the thumbnail_url that comes with the OAI feed), it was raising an `ArgumentError` ("invalid byte sequence in UTF-8") exception. What was happening is that the derivative rodeo was wrongly assuming that the TN.jpg was a HOCR file. It was reading the contents of the file and asking if it was XML. And that raised the exception. As mentioned in [this comment][1], "the underlying problem remains." Namely, the Derivative Rodeo service in IIIF Print needs to better handle second order derivatives (e.g. generate a HOCR file).

**Question:**
- Is this the right approach?
- What is the problem we were solving in @433b66a8d95f49bd40335b5483621bb4e4a41227?
- What is the context of I62?

I believe this is best resolved in a pairing/mobbing session. However, I put this forward for conversation.

**Related to:**
- notch8/derivative_rodeo#56

[1]:notch8/derivative_rodeo#56 (comment)
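The `invalid byte sequence in UTF-8` failure described above comes from treating binary image data as text. One way to avoid it is to check the encoding before asking whether the content looks like HOCR. This is a sketch under assumptions, not how the rodeo decides: `looks_like_hocr?` is a hypothetical name, and the heuristic (an XML/HTML opening plus an `ocr_page` marker) is only a guess at what a sufficient check might be.

```ruby
# Hypothetical guard: returns true only for valid UTF-8 text that resembles
# HOCR. Binary content such as a JPEG fails the encoding check and is
# rejected before any regular expression touches it.
def looks_like_hocr?(content)
  text = content.dup.force_encoding(Encoding::UTF_8)
  return false unless text.valid_encoding?

  # Assumed heuristic: an XML/HTML document that declares an OCR page.
  text.match?(/<\?xml|<html/i) && text.include?("ocr_page")
end

hocr = "<?xml version='1.0'?><html><div class='ocr_page'></div></html>"
jpeg_header = "\xFF\xD8\xFF\xE0JFIF".b # leading bytes of a JPEG, as binary

puts looks_like_hocr?(hocr)
puts looks_like_hocr?(jpeg_header)
```

Checking `valid_encoding?` first means the thumbnail is skipped quietly instead of raising from deep inside the generator.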

Prior to this commit, requesting the thumbnail for the file basename "1234.ARCHIVAL.pdf" would result in a basename of "1234.ARCHIVAL.pdf.thumbnail.jpeg". With this commit, we now return the basename of "1234.ARCHIVAL.thumbnail.jpeg". Related to: - notch8/derivative_rodeo#56
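The naming change can be illustrated with Ruby's standard `File` helpers. A minimal sketch: `derivative_basename` is a hypothetical function written for this example and is not the rodeo's method.

```ruby
# Builds "<stem>.<derivative>.<extension>", dropping only the final
# extension of the source filename (".pdf" here) and keeping the rest.
def derivative_basename(filename, derivative:, extension:)
  stem = File.basename(filename, File.extname(filename))
  "#{stem}.#{derivative}.#{extension}"
end

puts derivative_basename("1234.ARCHIVAL.pdf", derivative: "thumbnail", extension: "jpeg")
# prints "1234.ARCHIVAL.thumbnail.jpeg"
```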

Prior to this commit, during run time we didn't have much insight as to whether or not we were finding the split pages (and if so how many). With this commit we log information regarding the number of files we find, the path/index for each one OR if we're generating our own. Related to: - #56

As I'm working through how files are moving through the DerivativeRodeo, I am needing lower level information to get insight into what's happening. This commit, introduces debug level logging regarding finding files that were generated as part of the pre-processing of splitting the PDF. Related to: - #56

jeremyf · 2023-07-26T14:07:25Z

Starting from a fresh Hyku tenant. When I import a work, files are associated with that work. When I eradicate the work and children then reimport the work, then sometimes (but not often) no files show in the UI.

One observation in the code is that the child works and file sets have an is_child true but their is_child_of is empty. Which leverages ordered_by_ids.

Order of Operations

Ensure I have a clean Fedora and SOLR
- docker compose down -v works
Check out the main branch
- git checkout main
Build and bring up the image
- docker compose up --force-recreate --build
Create the new tenant
- Login to hyku.test and create "adventist" tenant
Stop the docker environment
- docker compose stop
Set the correct branch and conditions
- This is a mix of stash and different branches.
- Branch updating-to-test-the-rodeo incorporates skipping the thumbnails
- Branch without-slug-bug is setup to ignore the slug features of Adventist; it is based on updating-to-test-the-rodeo
- Git Stash I'm referencing iiif_print and derivative_rodeo in vendor/gems; my stashed Gemfile and lock reflect this.
- Note: I have bumped the relationship job interval from 10 minutes to 1 minute
- In the config/initializers/iiif_print.rb I have added the following two lines at the bottom:
  - # DerivativeRodeo::Generators::PdfSplitGenerator.output_extension = 'jpg'
  - DerivativeRodeo.config.logger = Logger.new(Rails.root.join("dr.log").to_s, level: Logger::DEBUG)
- The above two lines use the TIFF for splitting (the format we used in the initial SpaceStone test) and setting the DerivativeRodeo logger means we send the very chatty information from the rodeo to a durable and separate location. This is helpful for reviewing to see how and what is being used.
Bring up the images without starting services
- I don't want to auto-start the Rails server nor Good Job because we need to clean up two rotten jobs
Bash into Worker and Delete All Jobs
- There should be two jobs: the Embargo job and the Lease job.
- In Rails console run GoodJob::Job.destroy_all
Start the Rails server and good jobs
- Bash into web and run bundle exec puma -v -b tcp://0.0.0.0:3000
- Bash into worker
  - export the following ENV variables
    - AWS_ACCESS_KEY_ID
    - AWS_SECRET_ACCESS_KEY
    - AWS_REGION
    - AWS_S3_BUCKET
- Run bundle exec good_job start
Create an importer for the adventist tenant
- Using the OAI Parser
- With URL of http://oai.adventistdigitallibrary.org/OAI-script
- metadataPrefix of "oai_adl"
- set of "adl:other"
This import should take somewhere between 5 and 10 minutes
Find one of the following AARK_ID import entries: 20121816 (3 page PDF) or 20121862 (1 page PDF)
- Make note of the Bulkrax::Entry ID of the work, hereafter referred to as entry_id
VERIFY WORK STEP: Verify in the UI that the work has at least 3 items: the ARCHIVAL PDF, RAW.txt, and one image per page of the PDF.
Assuming that is the case, shell into the web console and start the rails console and run the following:
- switch!('adventist')
- entry = Bulkrax::Entry.find(entry_id) # NOTE entry_id
- GoodJob::Job.where(finished_at: nil).where(GoodJob::Job.arel_table['created_at'].lteq(DateTime.current)).joins_advisory_locks.where(pg_locks: { locktype: nil }).destroy_all
- entry.factory.find&.child_works&.map {|cw| cw.destroy(eradicate: true) }
- This may output an ActiveTriples::ParentStrategy::UnmutableParentError message; if so run again
- entry.factory.find&.destroy(eradicate: true)
- This may output an ActiveTriples::ParentStrategy::UnmutableParentError message; if so I don't believe running it again will matter.
- entry.build
Then repeat the VERIFY WORK STEP; you'll likely want to use the importer entries page for the entry_id

If we do not have all of the same items as in the VERIFY WORK STEP then we continue to have a deep-seated bug, and may need to change something in the code and repeat some aspect of the checklist, perhaps going all the way back to the first step.

What I have observed is that the parent work (e.g. the object associated with entry.factory.find) has the correct associations assigned.

IiifPrint::LineageService.descendent_member_ids_for(work) returns file_set and work ids that represent the items in the UI. However, when any of those ids is reified to an ActiveFedora object as child and we call IiifPrint::LineageService.ancestor_ids_for(child), we get an empty array. We should get an array that is [work.id]

Digging deeper, each of the child works has is_child of true but their ordered_by_ids is an empty array.
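The inconsistency can be stated as an invariant: every object flagged as a child should point back at its parent. The sketch below models that check in plain Ruby. `Child` is a stand-in Struct, not an ActiveFedora model, and `orphaned_child_ids` is a hypothetical helper; in the application the attribute values would come from the reified works and file sets.

```ruby
# Stand-in for a child work or file set, carrying only the attributes
# discussed above.
Child = Struct.new(:id, :is_child, :ordered_by_ids, keyword_init: true)

# Returns ids of objects flagged as children that do not reference the
# given parent in ordered_by_ids.
def orphaned_child_ids(parent_id, children)
  children
    .select { |child| child.is_child && !child.ordered_by_ids.include?(parent_id) }
    .map(&:id)
end

children = [
  Child.new(id: "page-1", is_child: true, ordered_by_ids: ["work-1"]),
  Child.new(id: "page-2", is_child: true, ordered_by_ids: []) # the observed bug
]

p orphaned_child_ids("work-1", children)
```

An empty result after a re-build would indicate the relationships were restored; any ids returned are the ones to inspect.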

Conjectures

My current conjecture is one of the following:

The SlugBug overrides are problematic either alone or in relation to IIIF Print
- We can look at https://github.com/scientist-softserv/adventist-dl/blob/main/config/initializers/slug_override.rb#L58-L69 to see how we amend the retrieval of ordered_by_ids
We are missing the often copied config/initializers/active_fedora_override.rb in which we apply some changes to ActiveFedora::Aggregation::ListSource.attribute_will_change!
- essi
- louisville-hyku
- palni-palci
- nnp
IIIF Print has some other underlying issue with relationships

This is a bit of a stretch, based on other testing. But the included code comment may be relevant.

Without this change I've consistently encountered the following observed behavior:

```gherkin
Given a Bulkrax imported work with child works (e.g. pages split from a PDF)
And I eradicate the imported work and child works
When I re-build (e.g. Bulkrax::Entry#build) the import entry
Then I do not see the attached files nor child works
```

With this change I've consistently encountered the following observed behavior:

```gherkin
Given a Bulkrax imported work with child works (e.g. pages split from a PDF)
And I eradicate the imported work and child works
When I re-build (e.g. Bulkrax::Entry#build) the import entry
Then I will see the work, child works, and derivatives, in the UI
```

**Context**

This is copied from notch8/derivative_rodeo#56 (comment) to highlight the process.

> Starting from a fresh Hyku tenant. When I import a work, files are associated with that work. When I eradicate the work and children then reimport the work, then no files show in the UI.
>
> One observation in the code is that the child works and file sets have an `is_child` true but their `is_child_of` is empty. Which leverages `ordered_by_ids`.
>
> Order of Operations
>
> - Ensure I have a clean Fedora and SOLR
>   - `docker compose down -v` works
> - Check out the main branch
>   - `git checkout main`
> - Build and bring up the image
>   - `docker compose up --force-recreate --build`
> - Create the new tenant
>   - Login to hyku.test and create "adventist" tenant
> - Stop the docker environment
>   - `docker compose stop`
> - Set the correct branch and conditions
>   - This is a mix of stash and different branches.
>   - Branch `updating-to-test-the-rodeo` incorporates skipping the thumbnails
>   - Branch `without-slug-bug` is setup to ignore the slug features of Adventist; it is based on `updating-to-test-the-rodeo`
>   - Git Stash I'm referencing iiif_print and derivative_rodeo in `vendor/gems`; my stashed Gemfile and lock reflect this.
>   - Note: I have bumped the relationship job interval from 10 minutes to 1 minute
>   - In the `config/initializers/iiif_print.rb` I have added the following two lines at the bottom:
>     - `# DerivativeRodeo::Generators::PdfSplitGenerator.output_extension = 'jpg'`
>     - `DerivativeRodeo.config.logger = Logger.new(Rails.root.join("dr.log").to_s, level: Logger::DEBUG)`
>   - The above two lines use the TIFF for splitting (the format we used in the initial SpaceStone test) and setting the DerivativeRodeo logger means we send the very chatty information from the rodeo to a durable and separate location. This is helpful for reviewing to see how and what is being used.
> - Bring up the images without starting services
>   - I don't want to auto-start the Rails server nor Good Job because we need to clean up two rotten jobs
> - Bash into Worker and Delete All Jobs
>   - There should be two jobs: the Embargo job and the Lease job.
>   - In Rails console run `GoodJob::Job.destroy_all`
> - Start the Rails server and good jobs
>   - Bash into web and run `bundle exec puma -v -b tcp://0.0.0.0:3000`
>   - Bash into worker
>     - export the following ENV variables
>       - `AWS_ACCESS_KEY_ID`
>       - `AWS_SECRET_ACCESS_KEY`
>       - `AWS_REGION`
>       - `AWS_S3_BUCKET`
>   - Run `bundle exec good_job start`
> - Create an importer for the adventist tenant
>   - Using the OAI Parser
>   - With URL of <http://oai.adventistdigitallibrary.org/OAI-script>
>   - metadataPrefix of "oai_adl"
>   - set of "adl:other"
> - This import should take somewhere between 5 and 10 minutes
> - Find one of the following AARK_ID import entries: 20121816 (3 page PDF) or 20121862 (1 page PDF)
>   - Make note of the Bulkrax::Entry ID of the work, hereafter referred to as `entry_id`
> - VERIFY WORK STEP: Verify in the UI that the work has at least 3 items: the ARCHIVAL PDF, RAW.txt, and one image per page of the PDF.
> - Assuming that is the case, shell into the web console and start the rails console and run the following:
>   - `switch!('adventist')`
>   - `entry = Bulkrax::Entry.find(entry_id)` # NOTE `entry_id`
>   - `GoodJob::Job.where(finished_at: nil).where(GoodJob::Job.arel_table['created_at'].lteq(DateTime.current)).joins_advisory_locks.where(pg_locks: { locktype: nil }).destroy_all`
>   - `entry.factory.find&.child_works&.map {|cw| cw.destroy(eradicate: true) }`
>   - This may output an `ActiveTriples::ParentStrategy::UnmutableParentError` message; if so run again
>   - `entry.factory.find&.destroy(eradicate: true)`
>   - This may output an `ActiveTriples::ParentStrategy::UnmutableParentError` message; if so I don't believe running it again will matter.
>   - `entry.build`
> - Then repeat the VERIFY WORK STEP; you'll likely want to use the importer entries page for the `entry_id`
>
> If we do not have all of the same items as in the VERIFY WORK STEP then we continue to have a deep-seated bug, and may need to change something in the code and repeat some aspect of the checklist, perhaps going all the way back to the first step.
>
> What I have observed is that the parent work (e.g. the object associated with `entry.factory.find`) has the correct associations assigned.
>
> `IiifPrint::LineageService.descendent_member_ids_for(work)` returns file_set and work ids that represent the items in the UI. However, any of those ids when reified to an ActiveFedora object as `child` and then we call `IiifPrint::LineageService.ancestor_ids_for(child)`, we get an empty array. We should get an array that is `[work.id]`
>
> Digging deeper, each of the child works has `is_child` of `true` but their `ordered_by_ids` is an empty array.
>
> My current conjecture is one of the following:
>
> - The [SlugBug overrides](https://github.com/scientist-softserv/adventist-dl/blob/main/app/models/concerns/slug_bug.rb) are problematic either alone or in relation to IIIF Print
>   - We can look at <https://github.com/scientist-softserv/adventist-dl/blob/main/config/initializers/slug_override.rb#L58-L69> to see how we amend the retrieval of `ordered_by_ids`
> - We are missing the often copied `config/initializers/active_fedora_override.rb` in which we apply some changes to `ActiveFedora::Aggregation::ListSource.attribute_will_change!`
>   - essi
>   - louisville-hyku
>   - palni-palci
>   - nnp
> - IIIF Print has some other underlying issue with relationships

Related to:
- notch8/derivative_rodeo#56
- https://github.com/scientist-softserv/nnp/commit/dc970b910bd29918f8d5dc420ffd33f940053e0c
- notch8/palni-palci@687f361
- notch8/louisville-hyku@8bdead0

**NOTE:** We may want to add this to UTK.

This commit contains 3 separate concepts but they are all in service of improved legibility: 1. Changing the hash to assume multi-line 2. Changing the return value to be a string instead of Array of Rodeo Locations 3. Adding a bit of documentation All of this is in service of helping triage what's going on. Related to: - notch8/derivative_rodeo#56

Apologies in advance, this commit conflates two things, but I'll explain. This commit is in service of completing the DerivativeService interface; namely the `#cleanup_derivatives` method. Originally, I was thinking I would only delete the derivatives generated by this service. So I began refactoring to reduce knowledge. That refactor meant extracting `#named_derivatives_and_generators`, and as a matter of hygiene and legibility, I moved the method closer to the configuration. The hope being that if one thing changes the other might. This then involved rethinking the `#create_derivatives` and `#cleanup_derivatives` to use this new method. I was looking for symmetry in method implementation (e.g. loop over the named derivatives and either create them or delete them). However, as I looked at the other reference implementations I noticed that I could get all of the derivatives by calling `Hyrax::DerivativePath.derivatives_for_reference` ([see code][1]). I spent a bit of time thinking, as the comments indicate, as to which approach to take: delete all derivatives OR only those that would be created by the present configuration. It makes sense to me to delete all of them, in part due to the implementation details of finding the correct `valid?` derivative service but also the fact that any `valid?` service is subject to configuration, which might change over time, and thus leave orphaned derivatives dangling in the file system. Closes #270 Related to: - notch8/derivative_rodeo#56 [1]:https://github.com/samvera/hyrax/blob/b28d8ff35d2fb708483d2ce0c4e687450b7f5aef/app/services/hyrax/derivative_path.rb#L14-L18

Related to: - notch8/derivative_rodeo#56

**Context:** We're incorporating the derivative rodeo into the ingest process. This is first intended to be used by the OAI importer. The situation is as follows: In the OAI feed there are URLs for both the digital objects and a thumbnail. Due to prior constraints of ingest, we had to add the thumbnail as a FileSet to the work. However, with the derivative rodeo, we can look in S3 for an existing thumbnail (that was generated via SpaceStone) and assign that thumbnail to the digital object(s)'s file set. In other words, we can avoid adding the thumbnail file set (and running all the derivatives on that file set as well). However, this may conflict with some work done in GitLab I62 (as detailed in commit @433b66a8d95f49bd40335b5483621bb4e4a41227). **Discussion:** Prior to adding this change when the derivative service was working on the TN.jpg file (the thumbnail_url that comes with the OAI feed), it was raising a `ArgumentError' error “invalid byte sequence in UTF-8` exception. What was happening is that the derivative rodeo was wrongly assuming that the TN.jpg was a HOCR file. It was reading the contents of the file and asking if it was XML. And that raised the exception. As mentioned in [this comment][1], "the underlying problem remains." Namely the Derivate Rodeo service in IIIF print needs to better handle second order derivatives (e.g. generate a HOCR file). **Question:** - Is this the right approach? - What is the problem we were solving in @433b66a8d95f49bd40335b5483621bb4e4a41227 - What is the context of I62? I believe this is best resolved in a pairing/mobbing session. However, I put this forward for conversation. **Related to:** - notch8/derivative_rodeo#56 [1]:notch8/derivative_rodeo#56 (comment)

This is a bit of a stretch, based on other testing. But the included code comment may be relevant.

Without this change I've consistently encountered the following observed behavior:

```gherkin
Given a Bulkrax imported work with child works (e.g. pages split from a PDF)
And I eradicate the imported work and child works
When I re-build (e.g. Bulkrax::Entry#build) the import entry
Then I do not see the attached files nor child works
```

With this change I've consistently encountered the following observed behavior:

```gherkin
Given a Bulkrax imported work with child works (e.g. pages split from a PDF)
And I eradicate the imported work and child works
When I re-build (e.g. Bulkrax::Entry#build) the import entry
Then I will see the work, child works, and derivatives, in the UI
```

**Context**

This is copied from notch8/derivative_rodeo#56 (comment) to highlight the process.

> Starting from a fresh Hyku tenant. When I import a work, files are associated with that work. When I eradicate the work and children and then reimport the work, no files show in the UI.
>
> One observation in the code is that the child works and file sets have an `is_child` of true but their `is_child_of` (which leverages `ordered_by_ids`) is empty.
>
> Order of Operations
>
> - Ensure I have a clean Fedora and SOLR
>   - `docker compose down -v` works
> - Check out the main branch
>   - `git checkout main`
> - Build and bring up the image
>   - `docker compose up --force-recreate --build`
> - Create the new tenant
>   - Login to hyku.test and create "adventist" tenant
> - Stop the docker environment
>   - `docker compose stop`
> - Set the correct branch and conditions
>   - This is a mix of stash and different branches.
>   - Branch `updating-to-test-the-rodeo` incorporates skipping the thumbnails
>   - Branch `without-slug-bug` is set up to ignore the slug features of Adventist; it is based on `updating-to-test-the-rodeo`
>   - Git Stash: I'm referencing iiif\_print and derivative\_rodeo in `vendor/gems`; my stashed Gemfile and lock reflect this.
>   - Note: I have bumped the relationship job interval from 10 minutes to 1 minute
>   - In the `config/initializers/iiif_print.rb` I have added the following two lines at the bottom:
>     - `# DerivativeRodeo::Generators::PdfSplitGenerator.output_extension = 'jpg'`
>     - `DerivativeRodeo.config.logger = Logger.new(Rails.root.join("dr.log").to_s, level: Logger::DEBUG)`
>   - The above two lines mean we use TIFF for splitting (the format we used in the initial SpaceStone test) and, by setting the DerivativeRodeo logger, we send the very chatty information from the rodeo to a durable and separate location. This is helpful for reviewing how and what is being used.
> - Bring up the images without starting services
>   - I don't want to auto-start the Rails server nor Good Job because we need to clean up two rotten jobs
> - Bash into Worker and Delete All Jobs
>   - There should be two jobs: the Embargo and Lease jobs.
>   - In Rails console run `GoodJob::Job.destroy_all`
> - Start the Rails server and good jobs
>   - Bash into web and run `bundle exec puma -v -b tcp://0.0.0.0:3000`
>   - Bash into worker
>     - export the following ENV variables
>       - `AWS_ACCESS_KEY_ID`
>       - `AWS_SECRET_ACCESS_KEY`
>       - `AWS_REGION`
>       - `AWS_S3_BUCKET`
>     - Run `bundle exec good_job start`
> - Create an importer for the adventist tenant
>   - Using the OAI Parser
>   - With URL of <http://oai.adventistdigitallibrary.org/OAI-script>
>   - metadataPrefix of "oai\_adl"
>   - set of "adl:other"
>   - This import should take somewhere between 5 and 10 minutes
> - Find one of the following AARK\_ID import entries: 20121816 (3 page PDF) or 20121862 (1 page PDF)
>   - Make note of the Bulkrax::Entry ID of the work, hereafter referred to as `entry_id`
> - VERIFY WORK STEP: Verify in the UI that the work has at least 3 items: the ARCHIVAL PDF, RAW.txt, and one image per page of the PDF.
> - Assuming that is the case, shell into the web console, start the rails console, and run the following:
>   - `switch!('adventist')`
>   - `entry = Bulkrax::Entry.find(entry_id)` # NOTE `entry_id`
>   - `GoodJob::Job.where(finished_at: nil).where(GoodJob::Job.arel_table['created_at'].lteq(DateTime.current)).joins_advisory_locks.where(pg_locks: { locktype: nil }).destroy_all`
>   - `entry.factory.find&.child_works&.map {|cw| cw.destroy(eradicate: true) }`
>     - This may output an `ActiveTriples::ParentStrategy::UnmutableParentError` message; if so run again
>   - `entry.factory.find&.destroy(eradicate: true)`
>     - This may output an `ActiveTriples::ParentStrategy::UnmutableParentError` message; if so I don't believe running it again will matter.
>   - `entry.build`
> - Then repeat the VERIFY WORK STEP; you'll likely want to use the importer entries page for the `entry_id`
>
> If we do not have all of the same items as in the VERIFY WORK STEP then we continue to have a deep-seated bug, and may need to change something in the code and repeat some aspect of the checklist, perhaps going all the way back to the first step.
>
> What I have observed is that the parent work (e.g. the object associated with `entry.factory.find`) has the correct associations assigned.
>
> `IiifPrint::LineageService.descendent_member_ids_for(work)` returns file\_set and work ids that represent the items in the UI. However, when any of those ids is reified to an ActiveFedora object as `child` and we then call `IiifPrint::LineageService.ancestor_ids_for(child)`, we get an empty array. We should get an array that is `[work.id]`.
>
> Digging deeper, each of the child works has `is_child` of `true` but their `ordered_by_ids` is an empty array.
>
> My current conjecture is that one of the following is true:
>
> - The [SlugBug overrides](https://github.com/scientist-softserv/adventist-dl/blob/main/app/models/concerns/slug_bug.rb) are problematic either alone or in relation to IIIF Print
>   - We can look at <https://github.com/scientist-softserv/adventist-dl/blob/main/config/initializers/slug_override.rb#L58-L69> to see how we amend the retrieval of `ordered_by_ids`
> - We are missing the often copied `config/initializers/active_fedora_override.rb` in which we apply some changes to `ActiveFedora::Aggregation::ListSource.attribute_will_change!`
>   - essi
>   - louisville-hyku
>   - palni-palci
>   - nnp
> - IIIF Print has some other underlying issue with relationships

Related to:

- notch8/derivative_rodeo#56
- https://github.com/scientist-softserv/nnp/commit/dc970b910bd29918f8d5dc420ffd33f940053e0c
- notch8/palni-palci@687f361
- notch8/louisville-hyku@8bdead0

**NOTE:** We may want to add this to UTK.
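The lineage observation above (the parent lists its children, but the children do not point back to the parent) can be expressed as a small consistency check. What follows is a minimal sketch in plain Ruby, not code from IIIF Print: `children_missing_parent`, the ids, and the lookup hash are all hypothetical stand-ins for what the Rails console would return.

```ruby
# Hypothetical helper: given a parent id, the ids the parent claims as
# descendants, and a callable that returns ancestor ids for a child id,
# list the children that do not point back to the parent.
def children_missing_parent(parent_id, descendant_ids, ancestor_lookup)
  descendant_ids.reject do |child_id|
    ancestor_lookup.call(child_id).include?(parent_id)
  end
end

# Stand-in data that mimics the observed behavior: the parent lists its
# children, but the split pages report no ancestors.
ancestors = { "page-1" => [], "page-2" => [], "fileset-1" => ["work-1"] }
lookup = ->(child_id) { ancestors.fetch(child_id, []) }

orphans = children_missing_parent("work-1", %w[page-1 page-2 fileset-1], lookup)
puts orphans.inspect
```

In a real console session the lookup would presumably load each id and call `IiifPrint::LineageService.ancestor_ids_for(child)`; that wiring is an assumption and is not shown here. An empty result would mean every child points back to the work.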

Yes, we could have a DerivativeRodeo initializer...but we leverage the Rodeo by way of IiifPrint, so the IiifPrint initializer makes sense as the place to configure these things. In addition, I figured I'd share one of the things I did to help in debugging the DerivativeRodeo integration. That has me thinking that perhaps we should look at doing the same with the IiifPrint gem. After all, debugging that is also challenging. Related to: - notch8/derivative_rodeo#56
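As a rough illustration of the debugging aid mentioned above, here is a minimal sketch of a dedicated DEBUG-level log file using only Ruby's standard `Logger`. The `Tempfile` stands in for `Rails.root.join("dr.log")`, and the log messages are hypothetical; in the application the logger would be assigned to `DerivativeRodeo.config.logger` as described in the issue.

```ruby
require "logger"
require "tempfile"

# Stand-in for Rails.root.join("dr.log"); a Tempfile keeps the sketch
# self-contained outside of a Rails application.
log_file = Tempfile.new(["dr", ".log"])

# Same constructor shape as the initializer line quoted in the issue.
rodeo_logger = Logger.new(log_file.path, level: Logger::DEBUG)

# Hypothetical messages; the Rodeo's real log lines will differ.
rodeo_logger.debug("hypothetical: checking pre-processed location")
rodeo_logger.info("hypothetical: found existing derivative")
rodeo_logger.close

log_lines = File.readlines(log_file.path)
puts log_lines
```

Keeping the chatty output in its own file means the application log stays readable while the Rodeo's decisions remain available for review.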

These changes reflect work done in verifying the following ticket: - notch8/derivative_rodeo#56

This commit reverts using the Derivative Rodeo for PDF splitting and derivative generation. It also leaves breadcrumbs for how to restore those functions.

We revert from the Derivative Rodeo to the already established IIIF Print pluggable derivatives derived from the Newspaper Works gem.

The reason to revert is that this branch includes several changes that went into local testing of the DerivativeRodeo, and I want to capture those wins and merge in an already long-running branch to reduce the chance of further branch drift.

For reference, local testing of the DerivativeRodeo has worked both with and without SpaceStone data, for both PDF splitting and generating derivatives (e.g. thumbnails, word coordinates, alto files, and plain text).

However, I have only done localized testing and I believe more testing is warranted; namely, how does the full text search work?

Still to consider is how we will:

- Test on staging with the Rodeo but not have it in play for Production

But that is an exercise for the person undoing this commit :)

Related to:

- notch8/derivative_rodeo#56
- https://github.com/scientist-softserv/adventist-dl/issues/500

Related to: - notch8/derivative_rodeo#56

**Context:**

We're incorporating the Derivative Rodeo into the ingest process. This is first intended to be used by the OAI importer.

The situation is as follows: In the OAI feed there are URLs for both the digital objects and a thumbnail. Due to prior constraints of ingest, we had to add the thumbnail as a FileSet to the work.

However, with the Derivative Rodeo, we can look in S3 for an existing thumbnail (that was generated via SpaceStone) and assign that thumbnail to the digital object(s)'s file set.

In other words, we can avoid adding the thumbnail file set (and running all the derivatives on that file set as well). However, this may conflict with some work done in GitLab I62 (as detailed in commit @433b66a8d95f49bd40335b5483621bb4e4a41227).

**Discussion:**

Prior to adding this change, when the derivative service was working on the TN.jpg file (the thumbnail_url that comes with the OAI feed), it was raising an `ArgumentError` exception with the message "invalid byte sequence in UTF-8".

What was happening is that the Derivative Rodeo was wrongly assuming that the TN.jpg was a HOCR file. It was reading the contents of the file and asking if it was XML, and that raised the exception.

As mentioned in [this comment][1], "the underlying problem remains." Namely, the Derivative Rodeo service in IIIF Print needs to better handle second order derivatives (e.g. generate a HOCR file).

**Question:**

- Is this the right approach?
- What is the problem we were solving in @433b66a8d95f49bd40335b5483621bb4e4a41227?
- What is the context of I62?

I believe this is best resolved in a pairing/mobbing session. However, I put this forward for conversation.

**Related to:**

- notch8/derivative_rodeo#56

[1]:notch8/derivative_rodeo#56 (comment)
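To make the failure mode in the discussion concrete: binary JPEG bytes are not valid UTF-8, so text-oriented checks on them can raise. Below is a minimal sketch of one possible guard in plain Ruby. The `xml_like?` method is hypothetical and is not how the Derivative Rodeo or IIIF Print actually detects HOCR; it only shows checking the encoding before inspecting the content.

```ruby
# Hypothetical guard: only treat content as XML-like when it is valid UTF-8
# text that begins with an XML or HTML-style prolog.
def xml_like?(content)
  text = content.dup.force_encoding(Encoding::UTF_8)
  # Binary data (e.g. a JPEG thumbnail) fails this check, so we never run
  # text operations against it.
  return false unless text.valid_encoding?

  text.lstrip.start_with?("<?xml", "<!DOCTYPE", "<html")
end

# The first bytes of a typical JPEG file, as a binary string.
jpeg_header = "\xFF\xD8\xFF\xE0\x00\x10JFIF".b

# A tiny HOCR-like document (illustrative only).
hocr_snippet = %(<?xml version="1.0" encoding="UTF-8"?>\n<html><body class="ocr_page"></body></html>)

puts xml_like?(jpeg_header)
puts xml_like?(hocr_snippet)
```

A guard like this would only sidestep the exception; as the commit message notes, the question of how second order derivatives should be handled remains open.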

Without this commit, there's nothing in Hyrax/Hyku/IIIFPrint that will extract plain text from a plain text file. Related to: - notch8/adventist-dl@1d3e1a9 - notch8/derivative_rodeo#56 - https://github.com/scientist-softserv/adventist-dl/issues/500 - https://github.com/scientist-softserv/adventist-dl/issues/538

jeremyf changed the title ~~Derivative Rodeo Integration Test~~ ☄️ Derivative Rodeo Integration Test Jun 12, 2023

jeremyf self-assigned this Jul 6, 2023

jeremyf added a commit that referenced this issue Jul 6, 2023

💸 Bumping to v0.4.1

d70ca92

Related to: - #56

jeremyf mentioned this issue Jul 6, 2023

💸 Bumping to v0.4.1 #58

Merged

jeremyf mentioned this issue Jul 6, 2023

Introducing more logging notch8/space_stone-serverless#35

Merged

jeremyf added a commit that referenced this issue Jul 6, 2023

💸 Bumping to v0.4.1

9bc595d

Related to: - #56

jeremyf mentioned this issue Jul 11, 2023

🎁 Adding handling for "reader" files for CSV loader to spacestone notch8/adventist-dl#490

Merged

jeremyf added a commit to notch8/adventist-dl that referenced this issue Jul 11, 2023

⚙️ Updating to 0.4.2 of the Rodeo

7316fee

For testing the rodeo, I want to use the version that has more verbose logging. Related to: - notch8/derivative_rodeo#56

jeremyf added a commit to notch8/adventist-dl that referenced this issue Jul 11, 2023

🎁 WIP Configuring DerivativeRodeo splitter

b15039a

Related to: - notch8/derivative_rodeo#56

jeremyf mentioned this issue Jul 13, 2023

🐛 Fixing issue with misnamed constant and using variable before declared notch8/iiif_print#261

Merged

jeremyf mentioned this issue Jul 13, 2023

🐛 Fixing incorrect method name #61

Merged

jeremyf mentioned this issue Jul 14, 2023

🎁 Adding logging to generator #62

Merged

jeremyf mentioned this issue Jul 14, 2023

🎁 Default Skip Thumbnails for OAI Feed notch8/adventist-dl#497

Closed

jeremyf mentioned this issue Jul 14, 2023

🐛 Fixing bug in how we find a derived file notch8/iiif_print#262

Merged

jeremyf mentioned this issue Jul 14, 2023

🎁 Adding logging for PDF splitting information #63

Merged

jeremyf mentioned this issue Jul 19, 2023

⚙️ Configure PDFs to split into JPG pages notch8/adventist-dl#501

Merged

jeremyf mentioned this issue Jul 21, 2023

⚙️ Adding low-level logging #66

Merged

jeremyf mentioned this issue Jul 26, 2023

🐛 Adding monkey patch for ActiveFedora dirty optimizations notch8/adventist-dl#504

Merged

This was referenced Jul 27, 2023

📚 Minor changes to improve legibility notch8/iiif_print#269

Merged

Add IiifPrint::DerivativeRodeoService#cleanup_derivatives notch8/iiif_print#270

Closed

This was referenced Jul 27, 2023

🎁 Adding DerivativeRodeoService#cleanup_derivatives ♻️ notch8/iiif_print#271

Merged

Spike: bulkrax reimporting with a new FileSet causes weird behavior notch8/utk-hyku#202

Closed

jeremyf added a commit to notch8/adventist-dl that referenced this issue Jul 27, 2023

🎁 WIP Configuring DerivativeRodeo splitter

6c28006

Related to: - notch8/derivative_rodeo#56

jeremyf added a commit to notch8/adventist-dl that referenced this issue Jul 27, 2023

🎁 Incorporating v0.5.0 of Derivative Rodeo

77d0936

These changes reflect work done in verifying the following ticket: - notch8/derivative_rodeo#56

jeremyf added a commit to notch8/adventist-dl that referenced this issue Jul 27, 2023

🎁 WIP Configuring DerivativeRodeo splitter

6e6a782

Related to: - notch8/derivative_rodeo#56

jeremyf added a commit to notch8/adventist-dl that referenced this issue Jul 27, 2023

🎁 Incorporating v0.5.0 of Derivative Rodeo

6eeced0

These changes reflect work done in verifying the following ticket: - notch8/derivative_rodeo#56

jeremyf mentioned this issue Jul 27, 2023

Adding the Rodeo without configuring to use the Rodeo notch8/adventist-dl#507

Merged

jillpe added the "needs discussion" label (has open questions or need for discussion) Aug 31, 2023

jillpe changed the title ~~☄️ Derivative Rodeo Integration Test~~ ☄️ Derivative Rodeo Integration Epic Aug 31, 2023

jeremyf removed their assignment May 30, 2024
