☄️ Derivative Rodeo Integration Epic #56

Open
22 of 24 tasks
jeremyf opened this issue Jun 12, 2023 · 3 comments
Labels
needs discussion has open questions or need for discussion

Comments


jeremyf commented Jun 12, 2023

The goal of this punch list is to outline the steps necessary to verify that IIIF Print picks up the changes.

With the above SpaceStone and Derivative Rodeo adjustments:

  • Set IIIF Print's application's logger level to :info
  • Update IIIF Print to use above Derivative Rodeo gem version
  • Update the IIIF Print configuration to leverage SpaceStone; this will require AWS credentials for the Pre Processed Buckets.
  • Run import of 20121816 entry (likely want to get a single CSV of this file)
  • Review logs; we should not see entries about generating derivatives but should instead see entries about found locations
    • See Split PDF and constituent pages as works with expected derivative files (e.g. thumbnail and JSON)
  • Run import for a single image entry that has been ingested
Derivative Rodeo Integration Tests for PDF Splitting

The following are the scenarios I’m working through for integration testing:

  • Scenario: PDF Split does not exist
Given a work with a PDF
And SpaceStone has not split the PDF
When we import the PDF
Then the application should split the PDF
And attach the resulting split pages as child works
  • Scenario: Thumbnail of PDF does not exist
Given a work with a PDF
And SpaceStone has not pre-processed the thumbnail
When we import the PDF
Then the application generates a thumbnail
And attaches the thumbnail to the work
  • Scenario: PDF Split exists
Given a work with a PDF
And SpaceStone has split the PDF
When we import the PDF
Then the application should retrieve the pre-split pages
And attach the resulting split pages as child works
  • Scenario: Thumbnail of PDF exists
Given a work with a PDF
And SpaceStone has pre-processed the thumbnail
When we import the PDF
Then the application retrieves the pre-processed thumbnail
And attaches the thumbnail to the work

After the integration test we will need to:

@jeremyf jeremyf changed the title Derivative Rodeo Integration Test ☄️ Derivative Rodeo Integration Test Jun 12, 2023
@jeremyf jeremyf self-assigned this Jul 6, 2023
jeremyf added a commit that referenced this issue Jul 6, 2023
jeremyf added a commit to notch8/space_stone-serverless that referenced this issue Jul 6, 2023
It is useful to see the inner workings of decision making regarding the
derivative rodeo.  That is to say:

- "Does the file already exist at the target location?"
- "Does the file exist at the pregenerated location?"
- "Do we need to generate the file at the target location?"

Related to:

- notch8/derivative_rodeo#56
jeremyf added a commit to notch8/space_stone-serverless that referenced this issue Jul 6, 2023
This commit contains two things:

1. Updated documentation
2. Updated submodule

Related to:

- notch8/derivative_rodeo#56

The derivative rodeo commit changes are as follows:

- notch8/derivative_rodeo@6fd304f :: 🎁 Adding logging to generators (2023-06-12)
- notch8/derivative_rodeo@795a7d2 :: 🐛 Hacking away the .mono suffix for 2nd order derivatives (2023-06-09)
- notch8/derivative_rodeo@2502c4c :: 🐛 Ensuring we submit any stray batch messages (2023-06-09)
jeremyf added a commit that referenced this issue Jul 6, 2023
jeremyf added a commit to notch8/adventist-dl that referenced this issue Jul 11, 2023
This commit incorporates the logic for handling the reader versions of
the files.

It addresses a bug in our queue management; namely needing to submit the
final buffered entries.

Related to:

- notch8/derivative_rodeo#56
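
For illustration, the queue bug described above (needing to submit the final buffered entries) can be sketched as follows. This is a minimal sketch, not the actual SpaceStone or Derivative Rodeo queue code: the `BatchBuffer` class, its `submitted` array (a stand-in for the remote queue), and the batch size of 10 are all assumptions.

```ruby
# Hypothetical illustration; not the actual SpaceStone or Derivative Rodeo code.
class BatchBuffer
  attr_reader :submitted

  def initialize(batch_size:)
    @batch_size = batch_size
    @buffer = []
    @submitted = [] # stands in for the remote queue
  end

  # Buffer a message and submit once a full batch has accumulated.
  def add(message)
    @buffer << message
    flush if @buffer.size >= @batch_size
  end

  # Submit whatever is buffered, even if it is only a partial batch.
  def flush
    return if @buffer.empty?

    @submitted << @buffer.dup
    @buffer.clear
  end
end

buffer = BatchBuffer.new(batch_size: 10)
23.times { |i| buffer.add("message-#{i}") }

# Without this final call, the last 3 messages would stay in the buffer
# and never be submitted.
buffer.flush
```

The design point is that a size-triggered flush alone only ever submits full batches, so the caller (or an `ensure` block) has to flush once more when the input is exhausted.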

jeremyf commented Jul 11, 2023

I removed the 20121816 folder from the bucket, then ran the following to re-process:

  • <2023-07-11 Tue 09:19>: I submitted a single row for processing. This row was for a single PDF with 3 pages.
  • <2023-07-11 Tue 09:24>: Reviewed the bucket folder 20121816 and saw each page had been processed.
  • <2023-07-11 Tue 09:25>: Reviewed logs to see latest Derivative Rodeo changes in place.
  • <2023-07-11 Tue 09:29>: Resubmitted the row for processing; the expected behavior is that none of the files will change. Reviewed the split-ocr-thumbnailWorker and saw that the logs mention:

DerivativeRodeo::Generators::PdfSplitGenerator#destination :: input_location file_uri s3://space-stone-dev-preprocessedbucketf21466dd-bxjjlz4251re.s3.us-west-1.amazonaws.com/20121816/20121816.ARCHIVAL–page-1.tiff: Found output_location file_uri s3://space-stone-dev-preprocessedbucketf21466dd-bxjjlz4251re.s3.us-west-1.amazonaws.com/20121816/20121816.ARCHIVAL–page-1.tiff.

In other words, the conditional generation is working.
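
As a rough sketch of what "conditional generation" means here: check the target location, then the pre-processed location, and only generate as a last resort. The `derive` method, the hash-based locations, and the file names below are hypothetical stand-ins and are not the Derivative Rodeo API.

```ruby
# Hypothetical in-memory stand-ins for the target and pre-processed locations.
def derive(name, target:, preprocessed:, log:)
  if target.key?(name)
    log << "Found #{name} at target location"
  elsif preprocessed.key?(name)
    target[name] = preprocessed[name]
    log << "Found #{name} at pre-processed location; copied to target"
  else
    # Only fall back to (expensive) generation when nothing exists yet.
    target[name] = yield(name)
    log << "Generated #{name} at target location"
  end
  target[name]
end

target = {}
preprocessed = { "page-1.tiff" => "bytes from the pre-processed bucket" }
log = []

# First request: copied from the pre-processed location.
derive("page-1.tiff", target: target, preprocessed: preprocessed, log: log) { |n| "generated #{n}" }
# Second request for the same file: already at the target, nothing changes.
derive("page-1.tiff", target: target, preprocessed: preprocessed, log: log) { |n| "generated #{n}" }
# A file that exists nowhere: generated locally.
derive("page-2.tiff", target: target, preprocessed: preprocessed, log: log) { |n| "generated #{n}" }
```

In this sketch, resubmitting the same row is a no-op for files that already exist, which matches the expectation that none of the files change on the second run.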

jeremyf added a commit to notch8/adventist-dl that referenced this issue Jul 11, 2023
For testing the rodeo, I want to use the version that has more verbose
logging.

Related to:

- notch8/derivative_rodeo#56
jeremyf added a commit to notch8/adventist-dl that referenced this issue Jul 11, 2023
jeremyf added a commit to notch8/iiif_print that referenced this issue Jul 13, 2023
This commit contains four primary changes:

1. Fixing a misnamed constant.
2. Moving setting the optional filename to a point before we use the
   filename.
3. Leveraging if include instead of case statements
4. Adding exception decorating to provide additional context.

Of these, the quality of debugging change to exceptions pays the most
dividends.  It's helped provide insight into the specific URI that's
failing.

Related to:

- notch8/derivative_rodeo#56
jeremyf added a commit that referenced this issue Jul 13, 2023
This commit includes 3 changes:

1. Replacing a misspelled name with the correct name
2. Improving logging
3. Adding a parameter to an exception.

At one point we had a method `globbed_tail_locations` however with some
refactoring I renamed that to `matching_locations_in_file_dir`; however
I missed the `globbed_tail_loations`.

In addition to fixing the misnamed method name, I'm adding some improved
logging that helps with a problem I've encountered.  Namely that I got a

`#<ArgumentError: invalid byte sequence in UTF-8>` on a string.  That
string was the contents of the file found at the `path_to_hocr`; however
the exception didn't provide insight into the filename.

With this change, I fix a method that was broken and improve logging.

Last, the exception for buckets used an unprovided variable (e.g.
`file_uri`).  This now provides that expected file_uri.

Related to:

- #56
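
To illustrate why including the filename in the exception matters, here is a minimal sketch. It assumes a hypothetical `read_hocr` helper and `HocrReadError` class, and it uses an explicit `String#valid_encoding?` guard rather than whatever the generator actually does; the point is only that the path ends up in the message.

```ruby
require "tempfile"

# Hypothetical error class for this sketch.
class HocrReadError < StandardError; end

# Read a file that is expected to hold hOCR (XML/HTML) text.  If the bytes are
# not valid UTF-8 (e.g. the file is really an image), raise an error that
# names the offending path.
def read_hocr(path_to_hocr)
  content = File.read(path_to_hocr, encoding: "UTF-8")
  return content if content.valid_encoding?

  raise HocrReadError, "invalid byte sequence in UTF-8 for path_to_hocr: #{path_to_hocr.inspect}"
end

# A JPEG begins with bytes that are not valid UTF-8.
jpeg = Tempfile.new(["thumbnail", ".jpg"])
jpeg.binmode
jpeg.write("\xFF\xD8\xFF\xE0".b)
jpeg.close

message =
  begin
    read_hocr(jpeg.path)
  rescue HocrReadError => e
    e.message
  end
```

With the path in the message, a log reader can see right away that a `.jpg` was handed to something expecting hOCR.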

jeremyf commented Jul 13, 2023

Working Notes from 2023-07-13

First, I am working a local copy that has the following Gemfile adjustment:

gem 'iiif_print', path: 'vendor/gems/iiif_print'
gem 'derivative-rodeo', path: 'vendor/gems/derivative_rodeo'

And then I have cloned the iiif_print and derivative_rodeo repositories into vendor/gems.

I’m working with AARK ID 20121816; which can be found at http://oai.adventistdigitallibrary.org/OAI-script?verb=GetRecord&identifier=20121816&metadataPrefix=oai_adl

The above AARK ID can be found in http://oai.adventistdigitallibrary.org/OAI-script?verb=ListRecords&metadataPrefix=oai_adl&set=adl:other as the 25th entry. I created an ingester for the OAI feed and have been working through problems locally.

I have set up my docker compose to start neither the web application nor the workers. Instead I bring it up and then have two terminals: one for rails console and one for good_job. I'm often restarting the good_job workers/service.

Following on https://playbook-staging.notch8.com/en/samvera/bulkrax/re-running-a-single-entry-from-an-import; I bash into the worker and run rails console and enter the following:

switch!('adventist'); entry = Bulkrax::Entry.find(136)
GoodJob::Job.where(finished_at: nil).where(GoodJob::Job.arel_table['created_at'].lteq(DateTime.current)).joins_advisory_locks.where(pg_locks: { locktype: nil }).destroy_all
entry.factory.find&.destroy(eradicate: true) 
entry.build

In the worker:

export AWS_ACCESS_KEY_ID=xxx ; export AWS_SECRET_ACCESS_KEY=xxx ; export AWS_REGION=us-west-2 ; export AWS_S3_BUCKET=space-stone-dev-preprocessedbucketf21466dd-bxjjlz4251re ;  bundle exec good_job start

Observed Problem

I am encountering an issue where the path_to_hocr is using an image file. As this is a thumbnail, I don’t really need to run derivatives on it. In fact, as part of the OAI import, we should skip attaching this file.

However, the underlying problem remains.


🤠🐮 DerivativeRodeo::Generators::WordCoordinatesGenerator#generated_files encountered `RuntimeError':
   “🤠🐮 DerivativeRodeo::Generators::WordCoordinatesGenerator#convert_to_coordinates encountered `ArgumentError' error “invalid byte sequence in UTF-8” for path_to_hocr: "/tmp/d20230713-248-1vwuyjg/adl-ebstore-repo.s3.amazonaws.com/20/1218/20121816/20121816.TN.jpg" and path_to_coordinate: "/tmp/d20230713-248-1jpg9k1/app/samvera/hyrax-webapp/tmp/derivatives/a6/2f/f0/75/-f/57/d-/43/f7/-b/7d/5-/29/dd/74/74/1d/f1-json.json"”

 for input_uri: "https://adl-ebstore-repo.s3.amazonaws.com/20/1218/20121816/20121816.TN.jpg",
 output_location_template: "file:///app/samvera/hyrax-webapp/tmp/derivatives/a6/2f/f0/75/-f/57/d-/43/f7/-b/7d/5-/29/dd/74/74/1d/f1-json.json",
 and preprocessed_location_template: "s3://space-stone-dev-preprocessedbucketf21466dd-bxjjlz4251re.s3.us-east-1.amazonaws.com/20121816/20121816.TN.jpg.coordinates.json".

jeremyf added a commit that referenced this issue Jul 14, 2023
With this commit, I'm providing insight into what we're requesting be generated.
Later logging will report more granular information.

I have found this helpful to understand what is happening in a rather expansive
and sometimes opaque process.

Related to:

- #56
jeremyf added a commit to notch8/adventist-dl that referenced this issue Jul 14, 2023
**Context:**

We're incorporating the derivative rodeo into the ingest process.  This
is first intended to be used by the OAI importer.

The situation is as follows: In the OAI feed there are URLs for both the
digital objects and a thumbnail.

Due to prior constraints of ingest, we had to add the thumbnail as a
FileSet to the work.  However, with the derivative rodeo, we can
look in S3 for an existing thumbnail (that was generated via SpaceStone)
and assign that thumbnail to the digital object(s)'s file set.

In other words, we can avoid adding the thumbnail file set (and running
all the derivatives on that file set as well).

However, this may conflict with some work done in GitLab I62 (as
detailed in commit @433b66a8d95f49bd40335b5483621bb4e4a41227).

**Discussion:**

Prior to adding this change, when the derivative service was working on
the TN.jpg file (the thumbnail_url that comes with the OAI feed), it was
raising an `ArgumentError` exception with the message “invalid byte
sequence in UTF-8”.

What was happening is that the derivative rodeo was wrongly assuming
that the TN.jpg was a HOCR file.  It was reading the contents of the
file and asking if it was XML.  And that raised the exception.

As mentioned in [this comment][1], "the underlying problem remains."
Namely, the Derivative Rodeo service in IIIF Print needs to better handle
second order derivatives (e.g. generate a HOCR file).

**Question:**

- Is this the right approach?
- What is the problem we were solving in @433b66a8d95f49bd40335b5483621bb4e4a41227?
- What is the context of I62?

I believe this is best resolved in a pairing/mobbing session.  However,
I put this forward for conversation.

**Related to:**

- notch8/derivative_rodeo#56

[1]:notch8/derivative_rodeo#56 (comment)
jeremyf added a commit to notch8/iiif_print that referenced this issue Jul 14, 2023
Prior to this commit, requesting the thumbnail for the file basename
"1234.ARCHIVAL.pdf" would result in a basename of
"1234.ARCHIVAL.pdf.thumbnail.jpeg".

With this commit, we now return the basename of
"1234.ARCHIVAL.thumbnail.jpeg".

Related to:

- notch8/derivative_rodeo#56
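
The basename change described above can be sketched with `File.basename` and its `".*"` suffix argument, which strips only the final extension. The helper names here are hypothetical; the real iiif_print method may look different.

```ruby
# Hypothetical helpers; the real iiif_print method name and signature may differ.

# Old behavior: the thumbnail suffix is appended to the full filename.
def old_thumbnail_basename_for(filename)
  "#{filename}.thumbnail.jpeg"
end

# New behavior: drop the final extension (".pdf") before appending the suffix.
def thumbnail_basename_for(filename)
  "#{File.basename(filename, '.*')}.thumbnail.jpeg"
end

old_name = old_thumbnail_basename_for("1234.ARCHIVAL.pdf") # => "1234.ARCHIVAL.pdf.thumbnail.jpeg"
new_name = thumbnail_basename_for("1234.ARCHIVAL.pdf")     # => "1234.ARCHIVAL.thumbnail.jpeg"
```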
jeremyf added a commit that referenced this issue Jul 14, 2023
Prior to this commit, during run time we didn't have much insight as to
whether or not we were finding the split pages (and if so how many).

With this commit we log information regarding the number of files we
find, the path/index for each one OR if we're generating our own.

Related to:

- #56
jeremyf added a commit that referenced this issue Jul 21, 2023
As I'm working through how files are moving through the DerivativeRodeo,
I am needing lower level information to get insight into what's
happening.

This commit introduces debug level logging regarding finding files that
were generated as part of the pre-processing of splitting the PDF.

Related to:

- #56
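
As a small illustration of how the logger level controls what shows up, here is a sketch using Ruby's standard `Logger`. The `log_split_lookup` method, the messages, and the page names are assumptions for the example and are not the Derivative Rodeo's actual log output.

```ruby
require "logger"
require "stringio"

# Hypothetical logging of a split-page lookup: a summary at info level and
# per-page detail at debug level.
def log_split_lookup(logger, found_pages)
  logger.info("Found #{found_pages.size} pre-processed page(s)")
  found_pages.each_with_index do |path, index|
    logger.debug("Page index #{index}: #{path}")
  end
end

pages = ["page-1.tiff", "page-2.tiff", "page-3.tiff"] # hypothetical names

# At :info, only the summary line is written.
info_output = StringIO.new
log_split_lookup(Logger.new(info_output, level: Logger::INFO), pages)

# At :debug, the per-page lines are written as well.
debug_output = StringIO.new
log_split_lookup(Logger.new(debug_output, level: Logger::DEBUG), pages)
```

Keeping the per-file detail at debug level means the application log stays readable at `:info`, while a separate debug-level logger can capture the chatty output when needed.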

jeremyf commented Jul 26, 2023

Starting from a fresh Hyku tenant. When I import a work, files are associated with that work. When I eradicate the work and children then reimport the work, then sometimes (but not often) no files show in the UI.

One observation in the code is that the child works and file sets have is_child set to true but their is_child_of is empty; is_child_of leverages ordered_by_ids.

Order of Operations

  • Ensure I have a clean Fedora and SOLR
    • docker compose down -v works
  • Check out the main branch
    • git checkout main
  • Build and bring up the image
    • docker compose up --force-recreate --build
  • Create the new tenant
    • Login to hyku.test and create "adventist" tenant
  • Stop the docker environment
    • docker compose stop
  • Set the correct branch and conditions
    • This is a mix of stash and different branches.
    • Branch updating-to-test-the-rodeo incorporates skipping the thumbnails
    • Branch without-slug-bug is setup to ignore the slug features of Adventist; it is based on updating-to-test-the-rodeo
    • Git Stash I'm referencing iiif_print and derivative_rodeo in vendor/gems; my stashed Gemfile and lock reflect this.
    • Note: I have bumped the relationship job interval from 10 minutes to 1 minute
    • In the config/initializers/iiif_print.rb I have added the following two lines at the bottom:
      • # DerivativeRodeo::Generators::PdfSplitGenerator.output_extension = 'jpg'
      • DerivativeRodeo.config.logger = Logger.new(Rails.root.join("dr.log").to_s, level: Logger::DEBUG)
    • The above two lines use the TIFF for splitting (the format we used in the initial SpaceStone test) and setting the DerivativeRodeo logger means we send the very chatty information from the rodeo to a durable and separate location. This is helpful for reviewing to see how and what is being used.
  • Bring up the images without starting services
    • I don't want to auto-start the Rails server nor Good Job because we need to clean up two rotten jobs
  • Bash into Worker and Delete All Jobs
    • There should be two jobs: the Embargo job and the Lease job.
    • In Rails console run GoodJob::Job.destroy_all
  • Start the Rails server and good jobs
    • Bash into web and run bundle exec puma -v -b tcp://0.0.0.0:3000
    • Bash into worker
      • export the following ENV variables
        • AWS_ACCESS_KEY_ID
        • AWS_SECRET_ACCESS_KEY
        • AWS_REGION
        • AWS_S3_BUCKET
    • Run bundle exec good_job start
  • Create an importer for the adventist tenant
  • This import should take somewhere between 5 and 10 minutes
  • Find one of the following AARK_ID import entries: 20121816 (3 page PDF) or 20121862 (1 page PDF)
    • Make note of the Bulkrax::Entry ID of the work, hereafter referred to as entry_id
  • VERIFY WORK STEP: Verify in the UI that the work has at least 3 items: the ARCHIVAL PDF, RAW.txt, and one image per page of the PDF.
  • Assuming that is the case, shell into the web console and start the rails console and run the following:
    • switch!('adventist')
    • entry = Bulkrax::Entry.find(entry_id) # NOTE entry_id
    • GoodJob::Job.where(finished_at: nil).where(GoodJob::Job.arel_table['created_at'].lteq(DateTime.current)).joins_advisory_locks.where(pg_locks: { locktype: nil }).destroy_all
    • entry.factory.find&.child_works&.map {|cw| cw.destroy(eradicate: true) }
    • This may output an ActiveTriples::ParentStrategy::UnmutableParentError message; if so, run it again
    • entry.factory.find&.destroy(eradicate: true)
    • This may output an ActiveTriples::ParentStrategy::UnmutableParentError message; if so I don't believe running it again will matter.
    • entry.build
  • Then repeat the VERIFY WORK STEP; you'll likely want to use the importer entries page for the entry_id

If we do not have all of the same items as in the VERIFY WORK STEP then we continue to have a deep-seated bug, and may need to change something in the code and repeat some aspect of the checklist, perhaps going all the way back to the first step.

What I have observed is that the parent work (e.g. the object associated with entry.factory.find) has the correct associations assigned.

IiifPrint::LineageService.descendent_member_ids_for(work) returns file_set and work ids that represent the items in the UI. However, when we reify any of those ids to an ActiveFedora object as child and call IiifPrint::LineageService.ancestor_ids_for(child), we get an empty array. We should get [work.id].

Digging deeper, each of the child works has is_child set to true but their ordered_by_ids is an empty array.
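
To make the expected invariant concrete, here is a plain-Ruby sketch of the consistency check I am doing by hand. The `Work` struct and `orphaned_child_ids` method are hypothetical stand-ins; the real objects are ActiveFedora models and the real lookups go through IiifPrint::LineageService.

```ruby
# Hypothetical stand-in for a work; the real objects are ActiveFedora models.
Work = Struct.new(:id, :member_ids, :ordered_by_ids, keyword_init: true)

# Return the ids of children that the parent lists as members but that do
# not point back at the parent via ordered_by_ids.
def orphaned_child_ids(parent, works_by_id)
  parent.member_ids.reject do |child_id|
    works_by_id.fetch(child_id).ordered_by_ids.include?(parent.id)
  end
end

parent = Work.new(id: "parent", member_ids: %w[page-1 page-2], ordered_by_ids: [])
page1  = Work.new(id: "page-1", member_ids: [], ordered_by_ids: ["parent"])
page2  = Work.new(id: "page-2", member_ids: [], ordered_by_ids: []) # the observed, broken state
works_by_id = [parent, page1, page2].to_h { |work| [work.id, work] }

orphans = orphaned_child_ids(parent, works_by_id)
```

In a healthy import the list of orphans would be empty: every descendant of the parent would report the parent among its ancestors.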

Conjectures

My current conjecture is that one of the following is true:

jeremyf added a commit to notch8/adventist-dl that referenced this issue Jul 26, 2023
This is a bit of a stretch, based on other testing.  But the included
code comment may be relevant.

Without this change I've consistently encountered the following observed behavior:

```gherkin
Given a Bulkrax imported work with child works (e.g. pages split from a PDF)
And I eradicate the imported work and child works
When I re-build (e.g. Bulkrax::Entry#build) the import entry
Then I do not see the attached files nor child works
```

With this change I've consistently encountered the following observed behavior:

```gherkin
Given a Bulkrax imported work with child works (e.g. pages split from a PDF)
And I eradicate the imported work and child works
When I re-build (e.g. Bulkrax::Entry#build) the import entry
Then I will see the work, child works, and derivatives, in the UI
```

**Context**

This is copied from
notch8/derivative_rodeo#56 (comment)
to highlight the process.

> Starting from a fresh Hyku tenant.  When I import a work, files are associated with that work.  When I eradicate the work and children then reimport the work, then no files show in the UI.
>
> One observation in the code is that the child works and file sets have `is_child` set to true but their `is_child_of` is empty; `is_child_of` leverages `ordered_by_ids`.
>
> Order of Operations
>
> -   Ensure I have a clean Fedora and SOLR
>     -   `docker compose down -v` works
> -   Check out the main branch
>     -   `git checkout main`
> -   Build and bring up the image
>     -   `docker compose up --force-recreate --build`
> -   Create the new tenant
>     -   Login to hyku.test and create "adventist" tenant
> -   Stop the docker environment
>     -   `docker compose stop`
> -   Set the correct branch and conditions
>     -   This is a mix of stash and different branches.
>     -   Branch `updating-to-test-the-rodeo` incorporates skipping the thumbnails
>     -   Branch `without-slug-bug` is setup to ignore the slug features of Adventist; it is based on `updating-to-test-the-rodeo`
>     -   Git Stash I'm referencing iiif\_print and derivative\_rodeo in `vendor/gems`; my stashed Gemfile and lock reflect this.
>     -   Note: I have bumped the relationship job interval from 10 minutes to 1 minute
>     -   In the `config/initializers/iiif_print.rb` I have added the following two lines at the bottom:
>         -   `# DerivativeRodeo::Generators::PdfSplitGenerator.output_extension = 'jpg'`
>         -   `DerivativeRodeo.config.logger = Logger.new(Rails.root.join("dr.log").to_s, level: Logger::DEBUG)`
>     -   The above two lines use the TIFF for splitting (the format we used in the initial SpaceStone test) and setting the DerivativeRodeo logger means we send the very chatty information from the rodeo to a durable and separate location.  This is helpful for reviewing to see how and what is being used.
> -   Bring up the images without starting services
>     -   I don't want to auto-start the Rails server nor Good Job because we need to clean up two rotten jobs
> -   Bash into Worker and Delete All Jobs
>     -   There should be two jobs: the Embargo job and the Lease job.
>     -   In Rails console run `GoodJob::Job.destroy_all`
> -   Start the Rails server and good jobs
>     -   Bash into web and run `bundle exec puma -v -b tcp://0.0.0.0:3000`
>     -   Bash into worker
>         -   export the following ENV variables
>             -   `AWS_ACCESS_KEY_ID`
>             -   `AWS_SECRET_ACCESS_KEY`
>             -   `AWS_REGION`
>             -   `AWS_S3_BUCKET`
>     -   Run `bundle exec good_job start`
> -   Create an importer for the adventist tenant
>     -   Using the OAI Parser
>     -   With URL of <http://oai.adventistdigitallibrary.org/OAI-script>
>     -   metadataPrefix of "oai\_adl"
>     -   set of "adl:other"
> -   This import should take somewhere between 5 and 10 minutes
> -   Find one of the following AARK\_ID import entries: 20121816 (3 page PDF) or 20121862 (1 page PDF)
>     -   Make note of the Bulkrax::Entry ID of the work, hereafter referred to as `entry_id`
> -   VERIFY WORK STEP: Verify in the UI that the work has at least 3 items: the ARCHIVAL PDF, RAW.txt, and one image per page of the PDF.
> -   Assuming that is the case, shell into the web console and start the rails console and run the following:
>     -   `switch!('adventist')`
>     -   `entry = Bulkrax::Entry.find(entry_id)` # NOTE `entry_id`
>     -   `GoodJob::Job.where(finished_at: nil).where(GoodJob::Job.arel_table['created_at'].lteq(DateTime.current)).joins_advisory_locks.where(pg_locks: { locktype: nil }).destroy_all`
>     -   `entry.factory.find&.child_works&.map {|cw| cw.destroy(eradicate: true) }`
>        - This may output an `ActiveTriples::ParentStrategy::UnmutableParentError` message; if so, run it again
>     -   `entry.factory.find&.destroy(eradicate: true)`
>        - This may output an `ActiveTriples::ParentStrategy::UnmutableParentError` message; if so I don't believe running it again will matter.
>     -   `entry.build`
> -   Then repeat the VERIFY WORK STEP; you'll likely want to use the importer entries page for the `entry_id`
>
> If we do not have all of the same items as in the VERIFY WORK STEP then we continue to have a deep-seated bug, and may need to change something in the code and repeat some aspect of the checklist, perhaps going all the way back to the first step.
>
> What I have observed is that the parent work (e.g. the object associated with `entry.factory.find`) has the correct associations assigned.
>
> `IiifPrint::LineageService.descendent_member_ids_for(work)` returns file\_set and work ids that represent the items in the UI.  However, when we reify any of those ids to an ActiveFedora object as `child` and call `IiifPrint::LineageService.ancestor_ids_for(child)`, we get an empty array.  We should get `[work.id]`.
>
> Digging deeper, each of the child works has `is_child` set to `true` but their `ordered_by_ids` is an empty array.
>
> My current conjecture is that one of the following is true:
>
> -   The [SlugBug overrides](https://github.com/scientist-softserv/adventist-dl/blob/main/app/models/concerns/slug_bug.rb) are problematic either alone or in relation to IIIF Print
>     -   We can look at <https://github.com/scientist-softserv/adventist-dl/blob/main/config/initializers/slug_override.rb#L58-L69> to see how we amend the retrieval of `ordered_by_ids`
> -   We are missing the often copied `config/initializers/active_fedora_override.rb` in which we apply some changes to `ActiveFedora::Aggregation::ListSource.attribute_will_change!`
>     -   essi
>     -   louisville-hyku
>     -   palni-palci
>     -   nnp
> -   IIIF Print has some other underlying issue with relationships

Related to:

- notch8/derivative_rodeo#56
- https://github.com/scientist-softserv/nnp/commit/dc970b910bd29918f8d5dc420ffd33f940053e0c
- notch8/palni-palci@687f361
- notch8/louisville-hyku@8bdead0

**NOTE:** We may want to add this to UTK.
jeremyf added a commit to notch8/adventist-dl that referenced this issue Jul 26, 2023
This is a bit of a stretch, based on other testing.  But the included
code comment may be relevant.

Without this change I've consistently encounter the following observed behavior:

```gherkin
Given an Bulkrax imported work with child works (e.g. pages split from a PDF)
And I eradicate the imported work and child works
When I re-build (e.g. Bulkrax::Entry#build) the import entry
Then I do not see the attached files nor child works
```

With this change I've consistently encountered the following observed behavior:

```gherkin
Given an Bulkrax imported work with child works (e.g. pages split from a PDF)
And I eradicate the imported work and child works
When I re-build (e.g. Bulkrax::Entry#build) the import entry
Then I will see the work, child works, and derivatives, in the UI
```

**Context**

This is copied from
notch8/derivative_rodeo#56 (comment)
to highlight the process.

> Starting from a fresh Hyku tenant.  When I import a work, files are associated with that work.  When I eradicate the work and children then reimport the work, then no files show in the UI.
>
> One observation in the code is that the child works and file sets have an `is_child` true but their `is_child_of` is empty.  Which leverages `ordered_by_ids`.
>
> Order of Operations
>
> -   Ensure I have a clean Fedora and SOLR
>     -   `docker compose down -v` works
> -   Check out the main branch
>     -   `git checkout main`
> -   Build and bring up the image
>     -   `docker compose up --force-recreate --build`
> -   Create the new tenant
>     -   Login to hyku.test and create "adventist" tenant
> -   Stop the docker environment
>     -   `docker compose stop`
> -   Set the correct branch and conditions
>     -   This is a mix of stash and different branches.
>     -   Branch `updating-to-test-the-rodeo` incorporates skipping the thumbnails
>     -   Branch `without-slug-bug` is setup to ignore the slug features of Adventist; it is based on `updating-to-test-the-rodeo`
>     -   Git Stash I'm referencing iiif\_print and derivative\_rodeo in `vendor/gems`; my stashed Gemfile and lock reflect this.
>     -   Note: I have bumped the relationship job interval from 10 minutes to 1 minute
>     -   In the `config/initializers/iiif_print.rb` I have add the following two lines at the bottom:
>         -   `# DerivativeRodeo::Generators::PdfSplitGenerator.output_extension = 'jpg'`
>         -   `DerivativeRodeo.config.logger = Logger.new(Rails.root.join("dr.log").to_s, level: Logger::DEBUG)`
>     -   The above two lines use the TIFF for splitting (the format we used in the initial SpaceStone test) and setting the DerivativeRodeo logger means we send the very chatty information from the rodeo to a durable and separate location.  This is helpful for reviewing to see how and what is being used.
> -   Bring up the images without starting services
>     -   I don't want to auto-start the Rails server nor Good Job because we need to clean up two rotten jobs
> -   Bash into Worker and Delete All Jobs
>     -   There should be two jobs the Embargo and Lease job.
>     -   In Rails console ru `GoodJob::Job.destroy_all`
> -   Start the Rails server and good jobs
>     -   Bash into web and run `bundle exec puma -v -b tcp://0.0.0.0:3000`
>     -   Bash into worker
>         -   export the following ENV variables
>             -   `AWS_ACCESS_KEY_ID`
>             -   `AWS_SECRET_ACCESS_KEY`
>             -   `AWS_REGION`
>             -   `AWS_S3_BUCKET`
>     -   Run `bundle exec good_job start`
> -   Create an importer for the adventist tenant
>     -   Using the OAI Parser
>     -   With URL of <http://oai.adventistdigitallibrary.org/OAI-script>
>     -   metadataPrefix of "oai\_adl"
>     -   set of "adl:other"
> -   This import should take somewhere between 5 and 10 minutes
> -   Find one of the following AARK\_ID import entries: 20121816 (3 page PDF) or 20121862 (1 page PDF)
>     -   Make note of the Bulkrax::Entry ID of the work, hereto referred to as `entry_id`
> -   VERIFY WORK STEP: Verify in the UI that the work has at least 3 items: the ARCHIVAL PDF, RAW.txt, and one image per page of the PDF.
> -   Assuming that is the case, shell into the web console and start the rails console and run the following:
>     -   `switch!('adventist')`
>     -   `entry = Bulkrax::Entry.find(entry_id)` # NOTE `entry_id`
>     -   `GoodJob::Job.where(finished_at: nil).where(GoodJob::Job.arel_table['created_at'].lteq(DateTime.current)).joins_advisory_locks.where(pg_locks: { locktype: nil }).destroy_all`
>     -   `entry.factory.find&.child_works&.map {|cw| cw.destroy(eradicate: true) }`
>         -   This may output an `ActiveTriples::ParentStrategy::UnmutableParentError` message; if so, run it again.
>     -   `entry.factory.find&.destroy(eradicate: true)`
>         -   This may output an `ActiveTriples::ParentStrategy::UnmutableParentError` message; if so, I don't believe running it again will matter.
>     -   `entry.build`
> -   Then repeat the VERIFY WORK STEP; you'll likely want to use the importer entries page for the `entry_id`
>
> If we do not have all of the same items as in the VERIFY WORK STEP, then we continue to have a deep-seated bug and may need to change something in the code and repeat some part of the checklist, perhaps going all the way back to the first step.
>
> What I have observed is that the parent work (e.g. the object associated with `entry.factory.find`) has the correct associations assigned.
>
> `IiifPrint::LineageService.descendent_member_ids_for(work)` returns file\_set and work ids that represent the items in the UI.  However, when we reify any of those ids to an ActiveFedora object as `child` and call `IiifPrint::LineageService.ancestor_ids_for(child)`, we get an empty array.  We should get `[work.id]`.
>
> Digging deeper, each of the child works has `is_child` of `true` but their `ordered_by_ids` is an empty array.
>
> My current conjecture is that one of the following is true:
>
> -   The [SlugBug overrides](https://github.com/scientist-softserv/adventist-dl/blob/main/app/models/concerns/slug_bug.rb) are problematic either alone or in relation to IIIF Print
>     -   We can look at <https://github.com/scientist-softserv/adventist-dl/blob/main/config/initializers/slug_override.rb#L58-L69> to see how we amend the retrieval of `ordered_by_ids`
> -   We are missing the often-copied `config/initializers/active_fedora_override.rb` in which we apply some changes to `ActiveFedora::Aggregation::ListSource.attribute_will_change!`
>     -   essi
>     -   louisville-hyku
>     -   palni-palci
>     -   nnp
> -   IIIF Print has some other underlying issue with relationships
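
The lineage symptom described in the quoted comment (descendants are found, but their ancestor lookup comes back empty) could be checked with a small helper. This is only a sketch: `orphaned_descendant_ids`, the `lineage:` and `finder:` parameters are hypothetical names; the two lineage methods are the ones named in the quote.

```ruby
# Hypothetical helper, not part of IIIF Print.
# `lineage` is any object responding to the two methods named in the
# quoted comment; `finder` turns an id into an object (the "reify" step).
def orphaned_descendant_ids(work, lineage:, finder:)
  lineage.descendent_member_ids_for(work).reject do |id|
    child = finder.call(id)
    # A healthy child lists the parent work among its ancestors.
    lineage.ancestor_ids_for(child).include?(work.id)
  end
end

# In the Rails console one might try (assumption, untested here):
#   orphaned_descendant_ids(work,
#     lineage: IiifPrint::LineageService,
#     finder: ActiveFedora::Base.method(:find))
```

An empty result would mean every descendant points back at the parent; any ids returned are the ones showing the behavior described above.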

Related to:

- notch8/derivative_rodeo#56
- https://github.com/scientist-softserv/nnp/commit/dc970b910bd29918f8d5dc420ffd33f940053e0c
- notch8/palni-palci@687f361
- notch8/louisville-hyku@8bdead0

**NOTE:** We may want to add this to UTK.
jeremyf added a commit to notch8/iiif_print that referenced this issue Jul 27, 2023
This commit contains 3 separate concepts but they are all in service of
improved legibility:

1. Changing the hash to assume multi-line
2. Changing the return value to be a string instead of Array of Rodeo
   Locations
3. Adding a bit of documentation

All of this is in service of helping triage what's going on.

Related to:

- notch8/derivative_rodeo#56
jeremyf added a commit to notch8/iiif_print that referenced this issue Jul 27, 2023
Apologies in advance, this commit conflates two things, but I'll
explain.

This commit is in service of completing the DerivativeService interface;
namely the `#cleanup_derivatives` method.

Originally, I was thinking I would only delete the derivatives generated
by this service.  So I began refactoring to reduce knowledge.  That
refactor meant extracting `#named_derivatives_and_generators`, and as a
matter of hygiene and legibility, I moved the method closer to the
configuration.  The hope being that if one thing changes the other
might.

This then involved rethinking the `#create_derivatives` and
`#cleanup_derivatives` to use this new method.  I was looking for
symmetry in method implementation (e.g. loop over the named derivatives
and either create them or delete them).

However, as I looked at the other reference implementations I noticed
that I could get all of the derivatives by calling
`Hyrax::DerivativePath.derivatives_for_reference` ([see code][1]).  I
spent a bit of time thinking, as the comments indicate, as to which
approach to take: delete all derivatives OR only those that would be
created by the present configuration.

It makes sense to me to delete all of them, in part due to the
implementation details of finding the correct `valid?` derivative
service but also the fact that any `valid?` service is subject to
configuration, which might change over time, and thus leave orphaned
derivatives dangling in the file system.
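
To make the trade-off concrete, here is a minimal plain-Ruby sketch of the two cleanup strategies. It is not the IIIF Print implementation: `cleanup_configured_only` and `cleanup_all` are hypothetical names, `paths` stands in for whatever `Hyrax::DerivativePath.derivatives_for_reference` would return, and the `-<name>.<ext>` file naming is an assumption made for illustration.

```ruby
require "fileutils"
require "tmpdir"

# Strategy 1 (hypothetical): delete only the derivatives that the
# current configuration would create. Files from an older configuration
# are left behind.
def cleanup_configured_only(paths, configured_names)
  paths
    .select { |path| configured_names.any? { |name| File.basename(path).include?("-#{name}.") } }
    .each { |path| FileUtils.rm_f(path) }
end

# Strategy 2 (hypothetical): delete every derivative found for the file
# set, regardless of which configuration produced it.
def cleanup_all(paths)
  paths.each { |path| FileUtils.rm_f(path) }
end
```

With the first strategy, a derivative whose name was dropped from the configuration survives cleanup, which is the orphaned-file risk described above.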

Closes #270

Related to:

- notch8/derivative_rodeo#56

[1]:https://github.com/samvera/hyrax/blob/b28d8ff35d2fb708483d2ce0c4e687450b7f5aef/app/services/hyrax/derivative_path.rb#L14-L18
jeremyf added a commit to notch8/adventist-dl that referenced this issue Jul 27, 2023
jeremyf added a commit to notch8/adventist-dl that referenced this issue Jul 27, 2023
**Context:**

We're incorporating the derivative rodeo into the ingest process.  This
is first intended to be used by the OAI importer.

The situation is as follows: In the OAI feed there are URLs for both the
digital objects and a thumbnail.

Due to prior constraints of ingest, we had to add the thumbnail as a
FileSet to the work.  However, with the derivative rodeo, we can
look in S3 for an existing thumbnail (that was generated via SpaceStone)
and assign that thumbnail to the digital object(s)'s file set.

In other words, we can avoid adding the thumbnail file set (and running
all the derivatives on that file set as well).

However, this may conflict with some work done in GitLab I62 (as
detailed in commit @433b66a8d95f49bd40335b5483621bb4e4a41227).

**Discussion:**

Prior to adding this change, when the derivative service was working on
the TN.jpg file (the thumbnail_url that comes with the OAI feed), it was
raising an `ArgumentError` exception with the message "invalid byte
sequence in UTF-8".

What was happening is that the derivative rodeo was wrongly assuming
that the TN.jpg was a HOCR file.  It was reading the contents of the
file and asking if it was XML.  And that raised the exception.

As mentioned in [this comment][1], "the underlying problem remains."
Namely the Derivative Rodeo service in IIIF Print needs to better handle
second order derivatives (e.g. generate a HOCR file).
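
One way to avoid treating binary content as XML is to check the encoding before inspecting the text. The following is a sketch only, not the Derivative Rodeo's actual check: `xml_like?` is a hypothetical name and the list of prefixes is an assumption.

```ruby
# Hypothetical guard: only consider content as candidate hOCR/XML when
# it is valid UTF-8. Binary data such as a JPEG fails this test, so we
# never run string matching against it.
def xml_like?(content)
  text = content.dup.force_encoding(Encoding::UTF_8)
  return false unless text.valid_encoding?

  text.lstrip.start_with?("<?xml", "<!DOCTYPE", "<html")
end
```

A guard like this only sidesteps the exception; it does not address the broader question of how second order derivatives should be identified.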

**Question:**

- Is this the right approach?
- What is the problem we were solving in @433b66a8d95f49bd40335b5483621bb4e4a41227?
- What is the context of I62?

I believe this is best resolved in a pairing/mobbing session.  However,
I put this forward for conversation.

**Related to:**

- notch8/derivative_rodeo#56

[1]:notch8/derivative_rodeo#56 (comment)
jeremyf added a commit to notch8/adventist-dl that referenced this issue Jul 27, 2023
This is a bit of a stretch, based on other testing.  But the included
code comment may be relevant.

Without this change I've consistently encountered the following observed behavior:

```gherkin
Given a Bulkrax imported work with child works (e.g. pages split from a PDF)
And I eradicate the imported work and child works
When I re-build (e.g. Bulkrax::Entry#build) the import entry
Then I do not see the attached files nor child works
```

With this change I've consistently encountered the following observed behavior:

```gherkin
Given a Bulkrax imported work with child works (e.g. pages split from a PDF)
And I eradicate the imported work and child works
When I re-build (e.g. Bulkrax::Entry#build) the import entry
Then I will see the work, child works, and derivatives, in the UI
```

**Context**

This is copied from
notch8/derivative_rodeo#56 (comment)
to highlight the process.

> Starting from a fresh Hyku tenant.  When I import a work, files are associated with that work.  When I eradicate the work and children then reimport the work, then no files show in the UI.
>
> One observation in the code is that the child works and file sets have `is_child` set to `true` but their `is_child_of` is empty; `is_child_of` leverages `ordered_by_ids`.
>
> Order of Operations
>
> -   Ensure I have a clean Fedora and SOLR
>     -   `docker compose down -v` works
> -   Check out the main branch
>     -   `git checkout main`
> -   Build and bring up the image
>     -   `docker compose up --force-recreate --build`
> -   Create the new tenant
>     -   Login to hyku.test and create "adventist" tenant
> -   Stop the docker environment
>     -   `docker compose stop`
> -   Set the correct branch and conditions
>     -   This is a mix of stash and different branches.
>     -   Branch `updating-to-test-the-rodeo` incorporates skipping the thumbnails
>     -   Branch `without-slug-bug` is setup to ignore the slug features of Adventist; it is based on `updating-to-test-the-rodeo`
>     -   Git Stash I'm referencing iiif\_print and derivative\_rodeo in `vendor/gems`; my stashed Gemfile and lock reflect this.
>     -   Note: I have bumped the relationship job interval from 10 minutes to 1 minute
>     -   In the `config/initializers/iiif_print.rb` I have added the following two lines at the bottom:
>         -   `# DerivativeRodeo::Generators::PdfSplitGenerator.output_extension = 'jpg'`
>         -   `DerivativeRodeo.config.logger = Logger.new(Rails.root.join("dr.log").to_s, level: Logger::DEBUG)`
>     -   Leaving the first line commented out keeps TIFF as the split format (the format we used in the initial SpaceStone test).  Setting the DerivativeRodeo logger sends the rodeo's very chatty output to a durable, separate location, which helps when reviewing how and what is being used.
> -   Bring up the images without starting services
>     -   I don't want to auto-start the Rails server nor Good Job because we need to clean up two rotten jobs
> -   Bash into Worker and Delete All Jobs
>     -   There should be two jobs: the Embargo job and the Lease job.
>     -   In the Rails console, run `GoodJob::Job.destroy_all`
> -   Start the Rails server and Good Job
>     -   Bash into web and run `bundle exec puma -v -b tcp://0.0.0.0:3000`
>     -   Bash into worker
>         -   export the following ENV variables
>             -   `AWS_ACCESS_KEY_ID`
>             -   `AWS_SECRET_ACCESS_KEY`
>             -   `AWS_REGION`
>             -   `AWS_S3_BUCKET`
>     -   Run `bundle exec good_job start`
> -   Create an importer for the adventist tenant
>     -   Using the OAI Parser
>     -   With URL of <http://oai.adventistdigitallibrary.org/OAI-script>
>     -   metadataPrefix of "oai\_adl"
>     -   set of "adl:other"
> -   This import should take somewhere between 5 and 10 minutes
> -   Find one of the following AARK\_ID import entries: 20121816 (3 page PDF) or 20121862 (1 page PDF)
>     -   Make note of the Bulkrax::Entry ID of the work, hereafter referred to as `entry_id`
> -   VERIFY WORK STEP: Verify in the UI that the work has at least 3 items: the ARCHIVAL PDF, RAW.txt, and one image per page of the PDF.
> -   Assuming that is the case, shell into the web console and start the rails console and run the following:
>     -   `switch!('adventist')`
>     -   `entry = Bulkrax::Entry.find(entry_id)` # NOTE `entry_id`
>     -   `GoodJob::Job.where(finished_at: nil).where(GoodJob::Job.arel_table['created_at'].lteq(DateTime.current)).joins_advisory_locks.where(pg_locks: { locktype: nil }).destroy_all`
>     -   `entry.factory.find&.child_works&.map {|cw| cw.destroy(eradicate: true) }`
>         -   This may output an `ActiveTriples::ParentStrategy::UnmutableParentError` message; if so, run it again.
>     -   `entry.factory.find&.destroy(eradicate: true)`
>         -   This may output an `ActiveTriples::ParentStrategy::UnmutableParentError` message; if so, I don't believe running it again will matter.
>     -   `entry.build`
> -   Then repeat the VERIFY WORK STEP; you'll likely want to use the importer entries page for the `entry_id`
>
> If we do not have all of the same items as in the VERIFY WORK STEP, then we continue to have a deep-seated bug and may need to change something in the code and repeat some part of the checklist, perhaps going all the way back to the first step.
>
> What I have observed is that the parent work (e.g. the object associated with `entry.factory.find`) has the correct associations assigned.
>
> `IiifPrint::LineageService.descendent_member_ids_for(work)` returns file\_set and work ids that represent the items in the UI.  However, when we reify any of those ids to an ActiveFedora object as `child` and call `IiifPrint::LineageService.ancestor_ids_for(child)`, we get an empty array.  We should get `[work.id]`.
>
> Digging deeper, each of the child works has `is_child` of `true` but their `ordered_by_ids` is an empty array.
>
> My current conjecture is that one of the following is true:
>
> -   The [SlugBug overrides](https://github.com/scientist-softserv/adventist-dl/blob/main/app/models/concerns/slug_bug.rb) are problematic either alone or in relation to IIIF Print
>     -   We can look at <https://github.com/scientist-softserv/adventist-dl/blob/main/config/initializers/slug_override.rb#L58-L69> to see how we amend the retrieval of `ordered_by_ids`
> -   We are missing the often-copied `config/initializers/active_fedora_override.rb` in which we apply some changes to `ActiveFedora::Aggregation::ListSource.attribute_will_change!`
>     -   essi
>     -   louisville-hyku
>     -   palni-palci
>     -   nnp
> -   IIIF Print has some other underlying issue with relationships
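
The eradicate-and-rebuild console steps in the quoted checklist could be wrapped in one helper for repeated runs. This is a sketch under assumptions: `eradicate_and_rebuild` and `retry_on:` are hypothetical names, and it assumes the `UnmutableParentError` message surfaces as a raised exception. The default of `StandardError` only keeps the sketch runnable without ActiveTriples loaded.

```ruby
# Hypothetical wrapper around the quoted console steps.
# In the application one would likely pass
#   retry_on: ActiveTriples::ParentStrategy::UnmutableParentError
def eradicate_and_rebuild(entry, retry_on: StandardError)
  work = entry.factory.find
  if work
    attempts = 0
    begin
      attempts += 1
      Array(work.child_works).each { |child| child.destroy(eradicate: true) }
    rescue retry_on
      # The checklist says to run the child cleanup again on this error.
      retry if attempts < 2
      raise
    end

    begin
      work.destroy(eradicate: true)
    rescue retry_on => e
      # The checklist notes that re-running here likely does not matter.
      warn "ignoring #{e.class} while destroying the parent work"
    end
  end
  entry.build
end
```

The GoodJob cleanup from the checklist is left out on purpose; it stays a manual step in the console.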

Related to:

- notch8/derivative_rodeo#56
- https://github.com/scientist-softserv/nnp/commit/dc970b910bd29918f8d5dc420ffd33f940053e0c
- notch8/palni-palci@687f361
- notch8/louisville-hyku@8bdead0

**NOTE:** We may want to add this to UTK.
jeremyf added a commit to notch8/adventist-dl that referenced this issue Jul 27, 2023
Yes, we could have a DerivativeRodeo initializer...but we leverage the
Rodeo by way of IiifPrint.  So this makes sense as to the place to
configure these things.

In addition, I figured I'd share one of the things I did to help in the
debugging of the DerivativeRodeo integration.  Which has me thinking
that perhaps we should look at doing this with the IiifPrint gem.  After
all, debugging that is also challenging.
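
As an illustration of the debugging aid mentioned here, the sketch below builds a dedicated debug-level logger with Ruby's standard `Logger`. The `build_rodeo_logger` name and the temp-directory path are assumptions; in the application the path would be `Rails.root.join("dr.log")` and the logger would be assigned to `DerivativeRodeo.config.logger`.

```ruby
require "logger"
require "tmpdir"

# Hypothetical helper: a separate, durable log file at DEBUG level so
# chatty output does not drown out the main application log.
def build_rodeo_logger(path = File.join(Dir.tmpdir, "dr.log"))
  Logger.new(path, level: Logger::DEBUG)
end

rodeo_logger = build_rodeo_logger
rodeo_logger.debug("example message; real entries come from the rodeo")

# In config/initializers/iiif_print.rb this would be wired up as:
#   DerivativeRodeo.config.logger = rodeo_logger
```

The same pattern would presumably work for a dedicated IiifPrint log, if the gem exposes a configurable logger.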

Related to:

- notch8/derivative_rodeo#56
jeremyf added a commit to notch8/adventist-dl that referenced this issue Jul 27, 2023
These changes reflect work done in verifying the following ticket:

- notch8/derivative_rodeo#56
jeremyf added a commit to notch8/adventist-dl that referenced this issue Jul 27, 2023
This commit reverts using the Derivative Rodeo for PDF splitting and
derivative generation.  It is also breadcrumbs for how to restore those
functions.  We revert from the Derivative Rodeo to the already
established IIIF Print pluggable derivatives derived from the Newspaper
works gem.

The reason to revert is that this branch includes several changes that
went into local testing of the DerivativeRodeo; and I want to capture
those wins and merge in an already long-running branch, to reduce the
chance of further branch drift.

For reference, local testing of the DerivativeRodeo has worked both with
and without having SpaceStone data for both PDF splitting and generating
derivatives (e.g.  thumbnails, word coordinates, alto files, and plain
text).  However, I had only done localized testing and I believe more
testing is warranted; namely how does the full text search work.

To consider is how we will:

- Test on staging with the Rodeo but not have it in play for Production

But that is an exercise for the person undoing this commit :)

Related to:

- notch8/derivative_rodeo#56
- https://github.com/scientist-softserv/adventist-dl/issues/500
@jillpe jillpe added the needs discussion has open questions or need for discussion label Aug 31, 2023
@jillpe jillpe changed the title ☄️ Derivative Rodeo Integration Test ☄️ Derivative Rodeo Integration Epic Aug 31, 2023
jeremyf added a commit to notch8/adventist_knapsack that referenced this issue Oct 4, 2023
@jeremyf jeremyf removed their assignment May 30, 2024