☄️ Derivative Rodeo Integration Epic #56
It is useful to see the inner workings of decision making regarding the derivative rodeo. That is to say: - "Does the file already exist at the target location?" - "Does the file exist at the pregenerated location?" - "Do we need to generate the file at the target location?" Related to: - notch8/derivative_rodeo#56
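The three questions above can be sketched as a single decision. This is a minimal illustration under stated assumptions, not the DerivativeRodeo implementation: `resolve_derivative`, its keyword arguments, and the returned symbols are hypothetical names, and local files stand in for whatever storage the rodeo actually talks to.

```ruby
require "fileutils"
require "logger"
require "tmpdir"

# Hypothetical helper (not part of DerivativeRodeo): answers the three questions
# in order and returns a symbol naming the decision so it can be logged.
def resolve_derivative(target_path:, pregenerated_path:, logger: Logger.new(nil))
  if File.exist?(target_path)
    logger.info("Found #{target_path}; nothing to do")
    :already_at_target
  elsif File.exist?(pregenerated_path)
    logger.info("Copying #{pregenerated_path} to #{target_path}")
    FileUtils.mkdir_p(File.dirname(target_path))
    FileUtils.cp(pregenerated_path, target_path)
    :copied_from_pregenerated
  else
    logger.info("Generating #{target_path}")
    FileUtils.mkdir_p(File.dirname(target_path))
    File.write(target_path, yield) # the block stands in for the real generator
    :generated
  end
end

Dir.mktmpdir do |dir|
  target = File.join(dir, "target", "page.txt")
  pregenerated = File.join(dir, "pregenerated", "page.txt")
  first = resolve_derivative(target_path: target, pregenerated_path: pregenerated) { "generated content" }
  second = resolve_derivative(target_path: target, pregenerated_path: pregenerated) { "generated content" }
  puts [first, second].inspect
end
```

Returning the decision as a value, and logging it at the point it is made, is what makes the otherwise opaque branch visible at run time.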
This commit contains two things: 1. Updated documentation 2. Updated submodule Related to: - notch8/derivative_rodeo#56 The derivative rodeo commit changes are as follows: - notch8/derivative_rodeo@6fd304f :: 🎁 Adding logging to generators (2023-06-12) - notch8/derivative_rodeo@795a7d2 :: 🐛 Hacking away the .mono suffix for 2nd order derivatives (2023-06-09) - notch8/derivative_rodeo@2502c4c :: 🐛 Ensuring we submit any stray batch messages (2023-06-09)
This commit incorporates the logic for handling the reader versions of the files. It addresses a bug in our queue management; namely needing to submit the final buffered entries. Related to: - notch8/derivative_rodeo#56
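The queue bug described above (buffered entries that never get submitted) is a common shape. Here is a minimal sketch of it, assuming a hypothetical `BatchSubmitter` class; the real rodeo queue adapter and its method names may differ.

```ruby
# Hypothetical batching queue (not the DerivativeRodeo API). Messages are buffered
# and submitted in groups of `batch_size`.
class BatchSubmitter
  attr_reader :submitted_batches

  def initialize(batch_size:)
    @batch_size = batch_size
    @buffer = []
    @submitted_batches = []
  end

  def enqueue(message)
    @buffer << message
    flush if @buffer.size >= @batch_size
  end

  # Submits whatever is buffered, even a partial batch.
  def flush
    return if @buffer.empty?

    @submitted_batches << @buffer.dup
    @buffer.clear
  end
end

submitter = BatchSubmitter.new(batch_size: 2)
%w[a b c].each { |message| submitter.enqueue(message) }
submitter.flush # without this final call, the stray "c" is never submitted
puts submitter.submitted_batches.inspect
```

The fix amounts to always calling the equivalent of `flush` once the producer is done, since a trailing partial batch never reaches the size threshold on its own.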
I removed the 20121816 folder from the bucket, then ran the following to re-process:
In other words, the conditional generation is working.
For testing the rodeo, I want to use the version that has more verbose logging. Related to: - notch8/derivative_rodeo#56
This commit contains four primary changes: 1. Fixing a misnamed constant. 2. Moving setting the optional filename to a point before we use the filename. 3. Leveraging if include instead of case statements 4. Adding exception decorating to provide additional context. Of these, the quality of debugging change to exceptions pays the most dividends. It's helped provide insight into the specific URI that's failing. Related to: - notch8/derivative_rodeo#56
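As an illustration of the exception decorating mentioned above, here is a minimal sketch. The names `LocationError` and `with_uri_context` are hypothetical and are not taken from the rodeo; the point is only that the re-raised error carries the URI while Ruby keeps the original exception available as `cause`.

```ruby
# Hypothetical error class: wraps an underlying failure with the URI being processed.
class LocationError < StandardError
  def initialize(uri:, original:)
    super("#{original.class}: #{original.message} (while processing #{uri})")
  end
end

# Runs the block and decorates any StandardError with the given URI.
# Raising inside `rescue` sets `cause` to the original exception automatically.
def with_uri_context(uri)
  yield
rescue StandardError => e
  raise LocationError.new(uri: uri, original: e)
end

begin
  with_uri_context("file:///tmp/1234.hocr") do
    raise ArgumentError, "invalid byte sequence in UTF-8"
  end
rescue LocationError => e
  puts e.message
  puts e.cause.class
end
```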
This commit includes 3 changes: 1. Replacing a misspelled name with the correct name 2. Improving logging 3. Adding a parameter to an exception. At one point we had a method `globbed_tail_locations`; during some refactoring I renamed that to `matching_locations_in_file_dir` but missed a misspelled reference, `globbed_tail_loations`. In addition to fixing the misnamed method, I'm adding some improved logging that helps with a problem I've encountered: I got a `#<ArgumentError: invalid byte sequence in UTF-8>` on a string. That string was the contents of the file found at the `path_to_hocr`; however, the exception didn't provide insight into the filename. With this change, I fix a method that was broken and improve logging. Last, the exception for buckets used an unprovided variable (e.g. `file_uri`). This now provides that expected `file_uri`. Related to: - #56
Working Notes from 2023-07-13

First, I am working a local copy that has the following Gemfile adjustment:

- `gem 'iiif_print', path: 'vendor/gems/iiif_print'`
- `gem 'derivative-rodeo', path: 'vendor/gems/derivative_rodeo'`

And then I have cloned the

I’m working with AARK ID 20121816, which can be found at http://oai.adventistdigitallibrary.org/OAI-script?verb=GetRecord&identifier=20121816&metadataPrefix=oai_adl

The above AARK ID can be found in http://oai.adventistdigitallibrary.org/OAI-script?verb=ListRecords&metadataPrefix=oai_adl&set=adl:other as the 25th entry.

I created an ingester for the OAI feed and have been working through problems locally. I have set up my docker compose to not start the web application nor the workers. Instead I bring it up and then have two terminals: one for

Following on https://playbook-staging.notch8.com/en/samvera/bulkrax/re-running-a-single-entry-from-an-import, I bash into the worker and run:

- `switch!('adventist'); entry = Bulkrax::Entry.find(136)`
- `GoodJob::Job.where(finished_at: nil).where(GoodJob::Job.arel_table['created_at'].lteq(DateTime.current)).joins_advisory_locks.where(pg_locks: { locktype: nil }).destroy_all`
- `entry.factory.find&.destroy(eradicate: true)`
- `entry.build`

In the worker:

- `export AWS_ACCESS_KEY_ID=xxx ; export AWS_SECRET_ACCESS_KEY=xxx ; export AWS_REGION=us-west-2 ; export AWS_S3_BUCKET=space-stone-dev-preprocessedbucketf21466dd-bxjjlz4251re ; bundle exec good_job start`

Observed Problem

I am encountering an issue where the

However, the underlying problem remains.
With this commit, I'm providing insight into what we're requesting be generated. Later logging will report more granular information. I have found this helpful to understand what is happening in a rather expansive and sometimes opaque process. Related to: - #56
**Context:** We're incorporating the derivative rodeo into the ingest process. This is first intended to be used by the OAI importer. The situation is as follows: in the OAI feed there are URLs for both the digital objects and a thumbnail. Due to prior constraints of ingest, we had to add the thumbnail as a FileSet to the work. However, with the derivative rodeo, we can look in S3 for an existing thumbnail (that was generated via SpaceStone) and assign that thumbnail to the digital object(s)'s file set. In other words, we can avoid adding the thumbnail file set (and running all the derivatives on that file set as well). However, this may conflict with some work done in GitLab I62 (as detailed in commit @433b66a8d95f49bd40335b5483621bb4e4a41227).

**Discussion:** Prior to adding this change, when the derivative service was working on the TN.jpg file (the thumbnail_url that comes with the OAI feed), it was raising an `ArgumentError` exception with the message "invalid byte sequence in UTF-8". What was happening is that the derivative rodeo was wrongly assuming that the TN.jpg was a HOCR file. It was reading the contents of the file and asking if it was XML, and that raised the exception. As mentioned in [this comment][1], "the underlying problem remains." Namely, the Derivative Rodeo service in IIIF Print needs to better handle second order derivatives (e.g. generate a HOCR file).

**Question:** - Is this the right approach? - What is the problem we were solving in @433b66a8d95f49bd40335b5483621bb4e4a41227? - What is the context of I62? I believe this is best resolved in a pairing/mobbing session. However, I put this forward for conversation.

**Related to:** - notch8/derivative_rodeo#56 [1]:notch8/derivative_rodeo#56 (comment)
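To make the failure mode in the discussion concrete: a JPEG's bytes are not valid UTF-8, so text-oriented checks on its contents can fail. The following is a minimal sketch of one way such a check could be guarded; `xml_like?` is a hypothetical helper, and this is not how IIIF Print or the rodeo decide whether a file is HOCR.

```ruby
# Hypothetical guard: only treat content as XML-like when it is valid UTF-8 text
# that starts with a typical XML/HTML preamble.
def xml_like?(content)
  text = content.dup.force_encoding(Encoding::UTF_8)
  return false unless text.valid_encoding? # binary data (e.g. a JPEG) stops here

  text.lstrip.start_with?("<?xml", "<!DOCTYPE", "<html")
end

jpeg_header = "\xFF\xD8\xFF\xE0".b # first bytes of a JPEG file
hocr_snippet = %(<?xml version="1.0" encoding="UTF-8"?>\n<html></html>)

puts xml_like?(jpeg_header)
puts xml_like?(hocr_snippet)
```

Checking `valid_encoding?` first keeps the question "is this XML?" from ever being asked of binary content, which is a different remedy from the one in this commit (not adding the thumbnail file set at all).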
Prior to this commit, requesting the thumbnail for the file basename "1234.ARCHIVAL.pdf" would result in a basename of "1234.ARCHIVAL.pdf.thumbnail.jpeg". With this commit, we now return the basename of "1234.ARCHIVAL.thumbnail.jpeg". Related to: - notch8/derivative_rodeo#56
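The basename change can be sketched with Ruby's `File` helpers. `thumbnail_basename` and its `suffix` parameter are hypothetical names for illustration; the actual method in the codebase may be shaped differently.

```ruby
# Hypothetical helper: drop only the final extension, then append the derivative suffix.
def thumbnail_basename(filename, suffix: "thumbnail.jpeg")
  stem = File.basename(filename, File.extname(filename))
  "#{stem}.#{suffix}"
end

puts thumbnail_basename("1234.ARCHIVAL.pdf") # => 1234.ARCHIVAL.thumbnail.jpeg
```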
Prior to this commit, during run time we didn't have much insight as to whether or not we were finding the split pages (and if so how many). With this commit we log information regarding the number of files we find, the path/index for each one OR if we're generating our own. Related to: - #56
As I'm working through how files are moving through the DerivativeRodeo, I need lower level information to get insight into what's happening. This commit introduces debug level logging regarding finding files that were generated as part of the pre-processing of splitting the PDF. Related to: - #56
Starting from a fresh Hyku tenant. When I import a work, files are associated with that work. When I eradicate the work and children then reimport the work, then sometimes (but not often) no files show in the UI. One observation in the code is that the child works and file sets have an

Order of Operations

If we do not have all of the same items as in the VERIFY WORK STEP then we continue to have a deep-seated bug, and may need to change something in the code and repeat some aspect of the checklist, perhaps going all the way back to the first step. What I have observed is that the parent work (e.g. the object associated with

Digging deeper, each of the child works has

Conjectures

My current conjecture is one of the following:
This is a bit of a stretch, based on other testing. But the included code comment may be relevant.

Without this change I've consistently encountered the following observed behavior:

```gherkin
Given a Bulkrax imported work with child works (e.g. pages split from a PDF)
And I eradicate the imported work and child works
When I re-build (e.g. Bulkrax::Entry#build) the import entry
Then I do not see the attached files nor child works
```

With this change I've consistently encountered the following observed behavior:

```gherkin
Given a Bulkrax imported work with child works (e.g. pages split from a PDF)
And I eradicate the imported work and child works
When I re-build (e.g. Bulkrax::Entry#build) the import entry
Then I will see the work, child works, and derivatives, in the UI
```

**Context**

This is copied from notch8/derivative_rodeo#56 (comment) to highlight the process.

> Starting from a fresh Hyku tenant. When I import a work, files are associated with that work. When I eradicate the work and children then reimport the work, then no files show in the UI.
>
> One observation in the code is that the child works and file sets have an `is_child` true but their `is_child_of` is empty; `is_child_of` leverages `ordered_by_ids`.
>
> Order of Operations
>
> - Ensure I have a clean Fedora and SOLR
>   - `docker compose down -v` works
> - Check out the main branch
>   - `git checkout main`
> - Build and bring up the image
>   - `docker compose up --force-recreate --build`
> - Create the new tenant
>   - Login to hyku.test and create "adventist" tenant
> - Stop the docker environment
>   - `docker compose stop`
> - Set the correct branch and conditions
>   - This is a mix of stash and different branches.
>     - Branch `updating-to-test-the-rodeo` incorporates skipping the thumbnails
>     - Branch `without-slug-bug` is set up to ignore the slug features of Adventist; it is based on `updating-to-test-the-rodeo`
>     - Git Stash: I'm referencing iiif\_print and derivative\_rodeo in `vendor/gems`; my stashed Gemfile and lock reflect this.
>   - Note: I have bumped the relationship job interval from 10 minutes to 1 minute
>   - In the `config/initializers/iiif_print.rb` I have added the following two lines at the bottom:
>     - `# DerivativeRodeo::Generators::PdfSplitGenerator.output_extension = 'jpg'`
>     - `DerivativeRodeo.config.logger = Logger.new(Rails.root.join("dr.log").to_s, level: Logger::DEBUG)`
>     - The above two lines use the TIFF for splitting (the format we used in the initial SpaceStone test), and setting the DerivativeRodeo logger means we send the very chatty information from the rodeo to a durable and separate location. This is helpful for reviewing how and what is being used.
> - Bring up the images without starting services
>   - I don't want to auto-start the Rails server nor Good Job because we need to clean up two rotten jobs
> - Bash into Worker and Delete All Jobs
>   - There should be two jobs: the Embargo and Lease job.
>   - In Rails console run `GoodJob::Job.destroy_all`
> - Start the Rails server and good jobs
>   - Bash into web and run `bundle exec puma -v -b tcp://0.0.0.0:3000`
>   - Bash into worker
>     - export the following ENV variables
>       - `AWS_ACCESS_KEY_ID`
>       - `AWS_SECRET_ACCESS_KEY`
>       - `AWS_REGION`
>       - `AWS_S3_BUCKET`
>     - Run `bundle exec good_job start`
> - Create an importer for the adventist tenant
>   - Using the OAI Parser
>   - With URL of <http://oai.adventistdigitallibrary.org/OAI-script>
>   - metadataPrefix of "oai\_adl"
>   - set of "adl:other"
>   - This import should take somewhere between 5 and 10 minutes
> - Find one of the following AARK\_ID import entries: 20121816 (3 page PDF) or 20121862 (1 page PDF)
>   - Make note of the Bulkrax::Entry ID of the work, hereafter referred to as `entry_id`
> - VERIFY WORK STEP: Verify in the UI that the work has at least 3 items: the ARCHIVAL PDF, RAW.txt, and one image per page of the PDF.
> - Assuming that is the case, shell into the web console and start the rails console and run the following:
>   - `switch!('adventist')`
>   - `entry = Bulkrax::Entry.find(entry_id)` # NOTE `entry_id`
>   - `GoodJob::Job.where(finished_at: nil).where(GoodJob::Job.arel_table['created_at'].lteq(DateTime.current)).joins_advisory_locks.where(pg_locks: { locktype: nil }).destroy_all`
>   - `entry.factory.find&.child_works&.map {|cw| cw.destroy(eradicate: true) }`
>     - This may output an `ActiveTriples::ParentStrategy::UnmutableParentError` message; if so run again
>   - `entry.factory.find&.destroy(eradicate: true)`
>     - This may output an `ActiveTriples::ParentStrategy::UnmutableParentError` message; if so I don't believe running it again will matter.
>   - `entry.build`
> - Then repeat the VERIFY WORK STEP; you'll likely want to use the importer entries page for the `entry_id`
>
> If we do not have all of the same items as in the VERIFY WORK STEP then we continue to have a deep-seated bug, and may need to change something in the code and repeat some aspect of the checklist, perhaps going all the way back to the first step.
>
> What I have observed is that the parent work (e.g. the object associated with `entry.factory.find`) has the correct associations assigned.
>
> `IiifPrint::LineageService.descendent_member_ids_for(work)` returns file\_set and work ids that represent the items in the UI. However, when any of those ids is reified to an ActiveFedora object as `child` and we then call `IiifPrint::LineageService.ancestor_ids_for(child)`, we get an empty array. We should get an array that is `[work.id]`
>
> Digging deeper, each of the child works has `is_child` of `true` but their `ordered_by_ids` is an empty array.
>
> My current conjecture is one of the following:
>
> - The [SlugBug overrides](https://github.com/scientist-softserv/adventist-dl/blob/main/app/models/concerns/slug_bug.rb) are problematic either alone or in relation to IIIF Print
>   - We can look at <https://github.com/scientist-softserv/adventist-dl/blob/main/config/initializers/slug_override.rb#L58-L69> to see how we amend the retrieval of `ordered_by_ids`
> - We are missing the often copied `config/initializers/active_fedora_override.rb` in which we apply some changes to `ActiveFedora::Aggregation::ListSource.attribute_will_change!`
>   - essi
>   - louisville-hyku
>   - palni-palci
>   - nnp
> - IIIF Print has some other underlying issue with relationships

Related to:

- notch8/derivative_rodeo#56
- https://github.com/scientist-softserv/nnp/commit/dc970b910bd29918f8d5dc420ffd33f940053e0c
- notch8/palni-palci@687f361
- notch8/louisville-hyku@8bdead0

**NOTE:** We may want to add this to UTK.
This commit contains 3 separate concepts but they are all in service of improved legibility: 1. Changing the hash to assume multi-line 2. Changing the return value to be a string instead of Array of Rodeo Locations 3. Adding a bit of documentation All of this is in service of helping triage what's going on. Related to: - notch8/derivative_rodeo#56
Apologies in advance, this commit conflates two things, but I'll explain. This commit is in service of completing the DerivativeService interface; namely the `#cleanup_derivatives` method. Originally, I was thinking I would only delete the derivatives generated by this service. So I began refactoring to reduce knowledge. That refactor meant extracting `#named_derivatives_and_generators`, and as a matter of hygiene and legibility, I moved the method closer to the configuration. The hope being that if one thing changes the other might. This then involved rethinking the `#create_derivatives` and `#cleanup_derivatives` to use this new method. I was looking for symmetry in method implementation (e.g. loop over the named derivatives and either create them or delete them). However, as I looked at the other reference implementations I noticed that I could get all of the derivatives by calling `Hyrax::DerivativePath.derivatives_for_reference` ([see code][1]). I spent a bit of time thinking, as the comments indicate, as to which approach to take: delete all derivatives OR only those that would be created by the present configuration. It makes sense to me to delete all of them, in part due to the implementation details of finding the correct `valid?` derivative service but also the fact that any `valid?` service is subject to configuration, which might change over time, and thus leave orphaned derivatives dangling in the file system. Closes #270 Related to: - notch8/derivative_rodeo#56 [1]:https://github.com/samvera/hyrax/blob/b28d8ff35d2fb708483d2ce0c4e687450b7f5aef/app/services/hyrax/derivative_path.rb#L14-L18
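The two cleanup options weighed above can be contrasted in a small sketch. Everything here is a stand-in: `derivative_paths_for` is a hypothetical substitute for a lookup like `Hyrax::DerivativePath.derivatives_for_reference`, and the `<id>-<name>.<ext>` file naming is an assumption made only so the example runs on a temporary directory.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical lookup: every derivative of `id` is a file named "<id>-<name>.<ext>".
def derivative_paths_for(dir, id)
  Dir.glob(File.join(dir, "#{id}-*")).sort
end

# Option A: delete only the derivatives the current configuration would create.
def cleanup_configured(dir, id, configured_names)
  derivative_paths_for(dir, id).each do |path|
    name = File.basename(path, ".*").delete_prefix("#{id}-")
    FileUtils.rm_f(path) if configured_names.include?(name)
  end
end

# Option B: delete every derivative, regardless of the current configuration.
def cleanup_all(dir, id)
  derivative_paths_for(dir, id).each { |path| FileUtils.rm_f(path) }
end

Dir.mktmpdir do |dir|
  %w[abc-thumbnail.jpeg abc-json.json].each { |name| FileUtils.touch(File.join(dir, name)) }
  cleanup_configured(dir, "abc", ["thumbnail"])
  puts derivative_paths_for(dir, "abc").map { |path| File.basename(path) }.inspect # the "json" file is orphaned
  cleanup_all(dir, "abc")
  puts derivative_paths_for(dir, "abc").inspect
end
```

Option A leaves behind anything created under an earlier configuration, which is the orphaned-derivative concern that motivates deleting them all.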
Apologies in advance, this commit conflates two things, but I'll explain. This commit is in service of completing the DerivativeService interface; namely the `#cleanup_derivatives` method. Originally, I was thinking I would only delete the derivatives generated by this service. So I began refactoring to reduce knowledge. That refactor meant extracting `#named_derivatives_and_generators`, and as a matter of hygiene and legibility, I moved the method closer to the configuration. The hope being that if one thing changes the other might. This then involved rethinking the `#create_derivatives` and `#cleanup_derivatives` to use this new method. I was looking for symmetry in method implementation (e.g. loop over the named derivatives and either create them or delete them). However, as I looked at the other reference implementations I noticed that I could get all of the derivatives by calling `Hyrax::DerivativePath.derivatives_for_reference` ([see code][1]). I spent a bit of time thinking, as the comments indicate, as to which approach to take: delete all derivatives OR only those that would be created by the present configuration. It makes sense to me to delete all of them, in part due to the implementation details of finding the correct `valid?` derivative service but also the fact that any `valid?` service is subject to configuration, which might change over time, and thus leave orphaned derivatives dangling in the file system. Closes #270 Related to: - notch8/derivative_rodeo#56 [1]:https://github.com/samvera/hyrax/blob/b28d8ff35d2fb708483d2ce0c4e687450b7f5aef/app/services/hyrax/derivative_path.rb#L14-L18
**Context:** We're incorporating the derivative rodeo into the ingest process. This is first intended to be used by the OAI importer. The situation is as follows: In the OAI feed there are URLs for both the digital objects and a thumbnail. Due to prior constraints of ingest, we had to add the thumbnail as a FileSet to the work. However, with the derivative rodeo, we can look in S3 for an existing thumbnail (that was generated via SpaceStone) and assign that thumbnail to the digital object(s)'s file set. In other words, we can avoid adding the thumbnail file set (and running all the derivatives on that file set as well). However, this may conflict with some work done in GitLab I62 (as detailed in commit @433b66a8d95f49bd40335b5483621bb4e4a41227). **Discussion:** Prior to adding this change when the derivative service was working on the TN.jpg file (the thumbnail_url that comes with the OAI feed), it was raising a `ArgumentError' error “invalid byte sequence in UTF-8` exception. What was happening is that the derivative rodeo was wrongly assuming that the TN.jpg was a HOCR file. It was reading the contents of the file and asking if it was XML. And that raised the exception. As mentioned in [this comment][1], "the underlying problem remains." Namely the Derivate Rodeo service in IIIF print needs to better handle second order derivatives (e.g. generate a HOCR file). **Question:** - Is this the right approach? - What is the problem we were solving in @433b66a8d95f49bd40335b5483621bb4e4a41227 - What is the context of I62? I believe this is best resolved in a pairing/mobbing session. However, I put this forward for conversation. **Related to:** - notch8/derivative_rodeo#56 [1]:notch8/derivative_rodeo#56 (comment)
This is a bit of a stretch, based on other testing. But the included code comment may be relevant. Without this change I've consistently encounter the following observed behavior: ```gherkin Given an Bulkrax imported work with child works (e.g. pages split from a PDF) And I eradicate the imported work and child works When I re-build (e.g. Bulkrax::Entry#build) the import entry Then I do not see the attached files nor child works ``` With this change I've consistently encountered the following observed behavior: ```gherkin Given an Bulkrax imported work with child works (e.g. pages split from a PDF) And I eradicate the imported work and child works When I re-build (e.g. Bulkrax::Entry#build) the import entry Then I will see the work, child works, and derivatives, in the UI ``` **Context** This is copied from notch8/derivative_rodeo#56 (comment) to highlight the process. > Starting from a fresh Hyku tenant. When I import a work, files are associated with that work. When I eradicate the work and children then reimport the work, then no files show in the UI. > > One observation in the code is that the child works and file sets have an `is_child` true but their `is_child_of` is empty. Which leverages `ordered_by_ids`. > > Order of Operations > > - Ensure I have a clean Fedora and SOLR > - `docker compose down -v` works > - Check out the main branch > - `git checkout main` > - Build and bring up the image > - `docker compose up --force-recreate --build` > - Create the new tenant > - Login to hyku.test and create "adventist" tenant > - Stop the docker environment > - `docker compose stop` > - Set the correct branch and conditions > - This is a mix of stash and different branches. 
> - Branch `updating-to-test-the-rodeo` incorporates skipping the thumbnails > - Branch `without-slug-bug` is setup to ignore the slug features of Adventist; it is based on `updating-to-test-the-rodeo` > - Git Stash I'm referencing iiif\_print and derivative\_rodeo in `vendor/gems`; my stashed Gemfile and lock reflect this. > - Note: I have bumped the relationship job interval from 10 minutes to 1 minute > - In the `config/initializers/iiif_print.rb` I have add the following two lines at the bottom: > - `# DerivativeRodeo::Generators::PdfSplitGenerator.output_extension = 'jpg'` > - `DerivativeRodeo.config.logger = Logger.new(Rails.root.join("dr.log").to_s, level: Logger::DEBUG)` > - The above two lines use the TIFF for splitting (the format we used in the initial SpaceStone test) and setting the DerivativeRodeo logger means we send the very chatty information from the rodeo to a durable and separate location. This is helpful for reviewing to see how and what is being used. > - Bring up the images without starting services > - I don't want to auto-start the Rails server nor Good Job because we need to clean up two rotten jobs > - Bash into Worker and Delete All Jobs > - There should be two jobs the Embargo and Lease job. 
>   - In the Rails console run `GoodJob::Job.destroy_all`
> - Start the Rails server and good jobs
>   - Bash into web and run `bundle exec puma -v -b tcp://0.0.0.0:3000`
>   - Bash into worker
>     - export the following ENV variables
>       - `AWS_ACCESS_KEY_ID`
>       - `AWS_SECRET_ACCESS_KEY`
>       - `AWS_REGION`
>       - `AWS_S3_BUCKET`
>     - Run `bundle exec good_job start`
> - Create an importer for the adventist tenant
>   - Using the OAI Parser
>   - With URL of <http://oai.adventistdigitallibrary.org/OAI-script>
>   - metadataPrefix of "oai\_adl"
>   - set of "adl:other"
>   - This import should take somewhere between 5 and 10 minutes
> - Find one of the following AARK\_ID import entries: 20121816 (3 page PDF) or 20121862 (1 page PDF)
>   - Make note of the Bulkrax::Entry ID of the work, hereafter referred to as `entry_id`
> - VERIFY WORK STEP: Verify in the UI that the work has at least 3 items: the ARCHIVAL PDF, RAW.txt, and one image per page of the PDF.
> - Assuming that is the case, shell into web and start the Rails console, then run the following:
>   - `switch!('adventist')`
>   - `entry = Bulkrax::Entry.find(entry_id)` # NOTE `entry_id`
>   - `GoodJob::Job.where(finished_at: nil).where(GoodJob::Job.arel_table['created_at'].lteq(DateTime.current)).joins_advisory_locks.where(pg_locks: { locktype: nil }).destroy_all`
>   - `entry.factory.find&.child_works&.map {|cw| cw.destroy(eradicate: true) }`
>     - This may output an `ActiveTriples::ParentStrategy::UnmutableParentError` message; if so, run it again
>   - `entry.factory.find&.destroy(eradicate: true)`
>     - This may output an `ActiveTriples::ParentStrategy::UnmutableParentError` message; if so, I don't believe running it again will matter.
>   - `entry.build`
> - Then repeat the VERIFY WORK STEP; you'll likely want to use the importer entries page for the `entry_id`
>
> If we do not have all of the same items as in the VERIFY WORK STEP then we continue to have a deep-seated bug, and may need to change something in the code and repeat some aspect of the checklist, perhaps going all the way back to the first step.
>
> What I have observed is that the parent work (e.g. the object associated with `entry.factory.find`) has the correct associations assigned.
>
> `IiifPrint::LineageService.descendent_member_ids_for(work)` returns file\_set and work ids that represent the items in the UI. However, when any of those ids is reified to an ActiveFedora object as `child` and we then call `IiifPrint::LineageService.ancestor_ids_for(child)`, we get an empty array. We should get an array that is `[work.id]`.
>
> Digging deeper, each of the child works has `is_child` of `true` but their `ordered_by_ids` is an empty array.
>
> My current conjecture is that one of the following is true:
>
> - The [SlugBug overrides](https://github.com/scientist-softserv/adventist-dl/blob/main/app/models/concerns/slug_bug.rb) are problematic either alone or in relation to IIIF Print
>   - We can look at <https://github.com/scientist-softserv/adventist-dl/blob/main/config/initializers/slug_override.rb#L58-L69> to see how we amend the retrieval of `ordered_by_ids`
> - We are missing the often copied `config/initializers/active_fedora_override.rb` in which we apply some changes to `ActiveFedora::Aggregation::ListSource.attribute_will_change!`
>   - essi
>   - louisville-hyku
>   - palni-palci
>   - nnp
> - IIIF Print has some other underlying issue with relationships

Related to:

- notch8/derivative_rodeo#56
- https://github.com/scientist-softserv/nnp/commit/dc970b910bd29918f8d5dc420ffd33f940053e0c
- notch8/palni-palci@687f361
- notch8/louisville-hyku@8bdead0

**NOTE:** We may want to add this to UTK.
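The lineage observation in the quoted comment can be turned into a small console check. This is a minimal sketch, not part of IiifPrint or Hyku: `orphaned_descendant_ids` is a hypothetical helper, and its `lineage:` and `finder:` arguments are assumptions added so it can run outside a Hyku application. It relies only on the two `IiifPrint::LineageService` methods named in the comment.

```ruby
# Hypothetical diagnostic helper (not part of IiifPrint or Hyku).
# Returns the descendant ids whose ancestor lookup does NOT include the parent work.
#
# lineage: an object responding to descendent_member_ids_for and ancestor_ids_for
# finder:  a callable that turns an id into an object
def orphaned_descendant_ids(work, lineage:, finder:)
  lineage.descendent_member_ids_for(work).reject do |id|
    child = finder.call(id)
    Array(lineage.ancestor_ids_for(child)).include?(work.id)
  end
end

# In a Rails console, assuming the real classes are loaded:
#   work = entry.factory.find
#   orphaned_descendant_ids(work,
#                           lineage: IiifPrint::LineageService,
#                           finder: ActiveFedora::Base.method(:find))
```

An empty result would mean every descendant reports the work as an ancestor. With the behavior described in the comment, every descendant id would be returned.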
Yes, we could have a DerivativeRodeo initializer...but we leverage the Rodeo by way of IiifPrint. So this makes sense as to the place to configure these things. In addition, I figured I'd share one of the things I did to help in the debugging of the DerivativeRodeo integration. Which has me thinking that perhaps we should look at doing this with the IiifPrint gem. After all, debugging that is also challenging. Related to: - notch8/derivative_rodeo#56
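For illustration, the debugging aid amounts to a dedicated DEBUG-level logger. The following is a minimal sketch using only Ruby's standard `Logger`; `build_rodeo_logger` is a hypothetical helper name, and the assignment to `DerivativeRodeo.config.logger` is shown only as a comment because it needs the application loaded.

```ruby
require "logger"

# Hypothetical helper: build a DEBUG-level logger that writes to its own file,
# keeping the rodeo's chatty output separate from the main Rails log.
def build_rodeo_logger(path)
  Logger.new(path.to_s, level: Logger::DEBUG)
end

# In config/initializers/iiif_print.rb this could be assigned as:
#   DerivativeRodeo.config.logger = build_rodeo_logger(Rails.root.join("dr.log"))
```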
These changes reflect work done in verifying the following ticket: - notch8/derivative_rodeo#56
This commit reverts using the Derivative Rodeo for PDF splitting and derivative generation. It is also breadcrumbs for how to restore those functions. We revert from the Derivative Rodeo to the already established IIIF Print pluggable derivatives derived from the Newspaper works gem. The reason to revert is that this branch includes several changes that went into local testing of the DerivativeRodeo; and I want to capture those wins and merge in an already long-running branch, to reduce the chance of further branch drift. For reference, local testing of the DerivativeRodeo has worked both with and without having SpaceStone data for both PDF splitting and generating derivatives (e.g. thumbnails, word coordinates, alto files, and plain text). However, I had only done localized testing and I believe more testing is warranted; namely how does the full text search work. To consider is how we will: - Test on staging with the Rodeo but not have it in play for Production But that is an exercise for the person undoing this commit :) Related to: - notch8/derivative_rodeo#56 - https://github.com/scientist-softserv/adventist-dl/issues/500
**Context:**

We're incorporating the derivative rodeo into the ingest process. This is first intended to be used by the OAI importer. The situation is as follows: In the OAI feed there are URLs for both the digital objects and a thumbnail. Due to prior constraints of ingest, we had to add the thumbnail as a FileSet to the work.

However, with the derivative rodeo, we can look in S3 for an existing thumbnail (that was generated via SpaceStone) and assign that thumbnail to the file set of each digital object. In other words, we can avoid adding the thumbnail file set (and running all the derivatives on that file set as well). However, this may conflict with some work done in GitLab I62 (as detailed in commit @433b66a8d95f49bd40335b5483621bb4e4a41227).

**Discussion:**

Prior to adding this change, when the derivative service was working on the TN.jpg file (the thumbnail_url that comes with the OAI feed), it was raising an `ArgumentError` exception with the message "invalid byte sequence in UTF-8". What was happening is that the derivative rodeo was wrongly assuming that the TN.jpg was a HOCR file. It was reading the contents of the file and asking if it was XML. And that raised the exception.

As mentioned in [this comment][1], "the underlying problem remains." Namely, the Derivative Rodeo service in IIIF Print needs to better handle second order derivatives (e.g. generate a HOCR file).

**Questions:**

- Is this the right approach?
- What is the problem we were solving in @433b66a8d95f49bd40335b5483621bb4e4a41227?
- What is the context of I62?

I believe this is best resolved in a pairing/mobbing session. However, I put this forward for conversation.

**Related to:**

- notch8/derivative_rodeo#56

[1]:notch8/derivative_rodeo#56 (comment)
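One way to avoid this class of failure is to check the encoding before asking whether content looks like markup. The following is a minimal sketch under that assumption; `markup_candidate?` is a hypothetical helper and is not how DerivativeRodeo or IIIF Print actually detect HOCR.

```ruby
# Hypothetical guard (not the DerivativeRodeo implementation): decide whether
# content could be XML/HTML markup before applying any text-based checks to it.
def markup_candidate?(content)
  text = content.dup.force_encoding(Encoding::UTF_8)
  # Binary data such as a JPEG is usually not valid UTF-8; bail out early
  # instead of letting a later text operation raise on invalid bytes.
  return false unless text.valid_encoding?

  text.match?(/\A\s*<(\?xml|!DOCTYPE|html)/i)
end
```

With a guard like this, a thumbnail such as TN.jpg would be rejected as a markup candidate rather than raising while its bytes are inspected.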
Without this commit, there's nothing in Hyrax/Hyku/IIIFPrint that will extract plain text from a plain text file. Related to: - notch8/adventist-dl@1d3e1a9 - notch8/derivative_rodeo#56 - https://github.com/scientist-softserv/adventist-dl/issues/500 - https://github.com/scientist-softserv/adventist-dl/issues/538
The goal of this punchlist is to outline the steps necessary to verify that IIIF Print picks up the changes
or more granular
With the above SpaceStone and Derivative Rodeo adjustments
Derivative Rodeo Integration Tests for PDF Splitting
The following are the scenarios I’m working through for integration testing:
After the integration test we will need to: