
OAI filegetter: resume if process freezes? #344

Open
bondjimbond opened this issue Mar 15, 2017 · 30 comments

@bondjimbond
Collaborator

Would it be possible to build in a way to resume downloading a large set if the process freezes? Perhaps the process could save the last-received resumption token, and a line in config could instruct MIK to use that when querying OAI the next time?

Very large sets appear to just hang up or time out, and a mechanism for recovering from a crash without re-initiating the process from the beginning would be extremely helpful.
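The resumption-token idea boils down to persisting the last resumptionToken element returned by an OAI-PMH ListRecords response, then sending it back on the next request after a crash. A minimal sketch of the extraction step (Python for illustration only, since MIK itself is PHP; the function name and sample token are invented):

```python
import xml.etree.ElementTree as ET

# OAI-PMH 2.0 namespace, used by ElementTree's {uri}tag syntax.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def extract_resumption_token(response_xml):
    """Return the resumptionToken from a ListRecords response, or None."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{OAI_NS}resumptionToken")
    if token is not None and token.text:
        return token.text.strip()
    return None

sample = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken>oai_dc.f(2017-03-15).100</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

print(extract_resumption_token(sample))  # -> oai_dc.f(2017-03-15).100
```

On a restart, the saved token would be appended to the next ListRecords request instead of beginning the harvest again from the first record.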

@bondjimbond
Collaborator Author

Another (though perhaps slower) option would be to check the output_directory for each record to see whether a file already exists before attempting to download.
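That check is cheap in practice: before fetching a payload, test whether a file for that record key already exists in output_directory. A rough sketch of the idea (Python for illustration; MIK itself is PHP, and the naming convention assumed here, payload saved as key plus extension, is hypothetical):

```python
import os

def needs_download(record_key, output_directory, extension=".pdf"):
    """True if the payload file for this record is not already on disk."""
    path = os.path.join(output_directory, record_key + extension)
    return not os.path.exists(path)

# Only fetch payloads we don't already have:
# for key in record_keys:
#     if needs_download(key, "/tmp/output"):
#         download_payload(key)
```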

@mjordan
Collaborator

mjordan commented Mar 16, 2017

I don't think resumption tokens would work here, since they are internal to the OAI harvester and represent chunks of records (usually 100, though I think that is configurable). However, we have approached this sort of problem in the CONTENTdm toolchains by using the SpecificSet fetcher manipulator, which, as the name suggests, fetches specific records from the source. It could be used in two ways:

  1. if we could determine which IDs were not successfully harvested, which sounds like it might be the case given @bondjimbond's previous comment, we could rerun MIK using the list of IDs that were not harvested as input for this manipulator;
  2. using the SpecificSet manipulator's "exclude" option, we could use a list of IDs that were successfully harvested as input and "exclude" those from the second harvest.

This manipulator currently only works with CSV and Cdm toolchains. Let me see what's involved in getting it to work with the OAI toolchains.

@mjordan
Collaborator

mjordan commented Mar 17, 2017

@bondjimbond just to clarify, when you say "to resume downloading a large set if the process freezes", do you mean the OAI part of the harvest, or the downloading of the payload (PDF, etc.) file? Rereading your initial issue, it appears you mean the OAI part, but your follow-up comments suggest the file download part.

@mjordan
Collaborator

mjordan commented Mar 17, 2017

Consulting the fine docs at https://github.com/MarcusBarnes/mik/wiki/Toolchain:-OAI-PMH-for-repositories-that-identify-resource-files-in-a-record-element I see that this toolchain already supports the SpecificSet fetcher manipulator. We could give that a try. It applies to the OAI harvesting, not the file downloading, however.

@bondjimbond
Collaborator Author

I guess the issue here is downloading the payload. I'd say the OAI part of the harvest completes rather quickly, as the "temp" directory fills with a huge number of .metadata files right away. (I don't know exactly how the toolchain works after you get all the .metadata files; it still takes some time to generate the XML files from them.)

SpecificSet is good and helpful in limiting the download - but when a "set" contains 6000+ objects, the possibility of timeouts/crashes/etc is high.

In my specific case, the last attempt to download a large set got me 95% complete but the rest failed due to a scheduled server reset. So I was able to collect all the PIDs with failed downloads. But that's kind of a unique case.

More generally... You could compare the PIDs in the .metadata files with the filenames of whatever's already in the output directory, then re-attempt downloads for every record that does not already have both an XML file and something else with the same filename.

@mjordan
Collaborator

mjordan commented Mar 17, 2017

> SpecificSet is good and helpful in limiting the download - but when a "set" contains 6000+ objects, the possibility of timeouts/crashes/etc is high.

You can split up large sets into smaller ones, which would reduce the risk.

> More generally... You could compare the PIDs in the .metadata files with the filenames of whatever's already in the output directory. Re-attempt downloads for every file that does not already have both an XML file and something else with same filename.

What you describe in the last paragraph could pretty easily be coded up as a filegetter manipulator. Given that the OAI "fetching" appears to be fairly reliable, this is probably the most direct solution to your problem. You would run MIK configured with this manipulator as many times as necessary until you got all your files.

This filegetter manipulator would be pretty specific, but that's totally OK; the reason we built in the ability to use manipulators was to address edge cases and exceptions without having to hack or write core toolchain classes. I'm happy to advise on how to write this if you want.

@mjordan
Collaborator

mjordan commented Mar 17, 2017

> You would run MIK configured with this manipulator as many times as necessary until you got all your files.

I say that on the assumption that it's OK to rerun the OAI-PMH half of this process as many times as you need to.

@bondjimbond
Collaborator Author

It would be nice to have something like that. It would even work for unexplained hiccups - for example, I'm running a download right now where an object that downloaded successfully before just failed for no reason this time. Running it again would probably succeed. It would be great if I could run the entire set again but skip all the files that have already downloaded.

Your advice on writing this would be appreciated!

@mjordan
Collaborator

mjordan commented Mar 17, 2017

Agreed that this would be nice to have. Such a filegetter manipulator may be applicable to other toolchains that get files from a remote URL, both existing and yet-to-be-imagined ones. Let me put some initial thought into how to generalize this without jeopardizing your immediate use case. I think MIK's abstraction of 'record keys' in other toolchains (in other words, unique IDs) is all we need to pay attention to.

@mjordan
Collaborator

mjordan commented Mar 17, 2017

A quick look at the two existing filegetter manipulators at https://github.com/MarcusBarnes/mik/tree/master/src/filegettermanipulators reminds me that they are additive - they return likely paths to files. The use case we are circling around in the last few comments is subtractive. Despite my having said

> What you describe in the last paragraph could pretty easily be coded up as a filegetter manipulator.

I am now flip-flopping back to implementing this as a fetcher manipulator, but one where the "set" is determined by a payload file's presence in the output directory. Let me continue to stew on this a bit.

@mjordan
Collaborator

mjordan commented Mar 17, 2017

Pondering this while eating my lunch, I think we can base this fetcher manipulator on SpecificSet pretty easily. All we need to do is replace the code inside its getSpecificSet() method with code that returns a list of keys that don't have corresponding files in the output directory. Does that make sense?
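The logic being described, keeping only the record keys whose payload file is absent from the output directory, can be sketched like so (Python for illustration; MIK itself is PHP, and the assumption that a payload is any non-.xml file whose basename matches the key is hypothetical):

```python
import glob
import os

def keys_missing_files(record_keys, output_directory):
    """Return the record keys that have no corresponding payload file.

    A key is considered 'done' if any non-.xml file in the output
    directory has that key as its basename (e.g. key.pdf, key.jpg).
    """
    present = set()
    for path in glob.glob(os.path.join(output_directory, "*")):
        name = os.path.basename(path)
        if not name.endswith(".xml"):
            present.add(os.path.splitext(name)[0])
    return [k for k in record_keys if k not in present]
```

Rerunning the harvest with only the keys this returns would skip everything already downloaded.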

@bondjimbond
Collaborator Author

Probably... I don't know exactly how SpecificSet works, but we can look it over.

@mjordan
Collaborator

mjordan commented Mar 17, 2017

I can give this a first crack over the weekend. I think it should be pretty straightforward.

@mjordan
Collaborator

mjordan commented Mar 18, 2017

@bondjimbond could you send me an .ini file for an OAI-PMH toolchain that uses a setSpec that retrieves a relatively small (< 50 or even fewer) set of objects? By email is OK if you prefer.

mjordan added a commit that referenced this issue Mar 19, 2017
@mjordan
Collaborator

mjordan commented Mar 19, 2017

I've pushed up a new fetcher manipulator that filters out records that do not have a JPEG, etc. file in the output directory. To test it,

  1. add the following line to the [MANIPULATORS] section of your .ini file: fetchermanipulators[] = OaiMissingFileSet
  2. git fetch and then git checkout issue-344
  3. composer dump-autoload or equivalent on your system
  4. run mik to create some packages
  5. delete the JPEG, etc. file from a few of the resulting packages
  6. rerun mik. It should only regenerate the packages for the ones with missing files.
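For step 1, the resulting [MANIPULATORS] section of the .ini file would look something like this (other sections omitted; this just mirrors the line given in the step above):

```ini
; Enable the fetcher manipulator from the issue-344 branch so that
; reruns skip records whose payload file already exists in output_directory.
[MANIPULATORS]
fetchermanipulators[] = OaiMissingFileSet
```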

I had hoped that this fetcher manipulator could apply to the CONTENTdm and CSV toolchains, but at the moment it doesn't.

@mjordan
Collaborator

mjordan commented Apr 28, 2017

@bondjimbond can you give this a test?

@bondjimbond
Collaborator Author

Having some weird troubles checking out your branch due to some git strangeness; can't test at the moment, sorry.

@mjordan
Collaborator

mjordan commented May 24, 2017

@bondjimbond bump, if you have time.

@mjordan
Collaborator

mjordan commented Apr 29, 2018

@bondjimbond not sure if you ever took a look at this, but if you could that would be great.

@bondjimbond
Collaborator Author

@mjordan I don't know, it hasn't QUITE been a year yet... Maybe in another month?

Working on a few things this week, but I'll try to take a look soon.

@bondjimbond
Collaborator Author

OK! Finally starting to test this... and I immediately ran into problems.

I'm using the Islandora filegetter:
class = OaipmhIslandoraObj

And here's my error:

Fatal error: Class 'mik\filegetters\OaipmhIslandoraObj' not found in /Users/Brandon/mik/mik on line 163

Looks like maybe the path to the filegetter is wrong? My filegetters directory is at mik/src/filegetters, not mik/filegetters.

@bondjimbond
Collaborator Author

Never mind the above -- it looks like in that directory I don't actually have an OaipmhIslandoraObj file. Is that not part of the git repo? Shouldn't it have been added at some point after a git pull?

@mjordan
Collaborator

mjordan commented May 2, 2018

Are you in the issue-344 branch?

@mjordan
Collaborator

mjordan commented May 2, 2018

The OaipmhIslandoraObj.php filegetter isn't in the https://github.com/MarcusBarnes/mik/tree/issue-344 branch. Not sure why, but I can create a new branch tonight that contains everything you need. I will reply with details here.

@bondjimbond
Collaborator Author

Sounds good!

mjordan added a commit that referenced this issue May 3, 2018
@mjordan
Collaborator

mjordan commented May 3, 2018

OK, I've created a new branch, issue-344-new, which adds the new fetchermanipulator to the current master. Everything should be up to date and testable following the steps outlined above.

@bondjimbond
Collaborator Author

Hmm.

Commencing MIK.
Filtering 68 records through the OaiMissingFileSet fetcher manipulator.

Fatal error: Uncaught exception 'mik\exceptions\MikErrorException' in /Users/Brandon/mik/mik:105
Stack trace:
#0 /Users/Brandon/mik/src/fetchermanipulators/OaiMissingFileSet.php(131): {closure}(8, 'Undefined varia...', '/Users/Brandon/...', 131, Array)
#1 /Users/Brandon/mik/src/fetchermanipulators/OaiMissingFileSet.php(104): mik\fetchermanipulators\OaiMissingFileSet->getFileList()
#2 /Users/Brandon/mik/src/fetchermanipulators/OaiMissingFileSet.php(68): mik\fetchermanipulators\OaiMissingFileSet->getRecordKeysWithFiles()
#3 /Users/Brandon/mik/src/fetchers/Oaipmh.php(148): mik\fetchermanipulators\OaiMissingFileSet->manipulate(Array)
#4 /Users/Brandon/mik/src/fetchers/Oaipmh.php(96): mik\fetchers\Oaipmh->applyFetchermanipulators(Array)
#5 /Users/Brandon/mik/mik(182): mik\fetchers\Oaipmh->getRecords(NULL)
#6 {main}
thrown in /Users/Brandon/mik/mik on line 105

Does this tell you anything about where the problem lies?

@mjordan
Collaborator

mjordan commented May 3, 2018

If you look in your output directory, are all the files .xml files? In other words, are there any OBJ files?

@bondjimbond
Collaborator Author

No files at all. But I do have a bunch of .metadata files in the temp directory.

Couple of errors in mik.log:

[2018-05-03 12:51:23] ErrorException.ERROR: ErrorException {"message":"Undefined index: datastream_ids","code":{"settings":{"CONFIG":{"config_id":"oai-test","last_updated_on":"2017-02-21","last_update_by":"bw"},"SYSTEM":{"date_default_timezone":"America/Vancouver","verify_ca":"0"},"FETCHER":{"class":"Oaipmh","oai_endpoint":"https://nwcc.arcabc.ca/oai2","set_spec":"nwcc_freda2","metadata_prefix":"oai_dc","temp_directory":"/tmp/oaitest_temp"},"METADATA_PARSER":{"class":"dc\\OaiToDc"},"FILE_GETTER":{"class":"OaipmhIslandoraObj","temp_directory":"/tmp/oaitest_temp"},"WRITER":{"class":"Oaipmh","output_directory":"/tmp/oaitest_output","postwritehooks":["/usr/bin/php extras/scripts/postwritehooks/oai_dc_to_mods.php"]},"MANIPULATORS":{"fetchermanipulators":["OaiMissingFileSet"]},"LOGGING":{"path_to_log":"/tmp/oaitest_output/mik.log","path_to_manipulator_log":"/tmp/oaitest_output/manipulator.log"}}},"severity":8,"file":"/Users/Brandon/mik/src/filegetters/OaipmhIslandoraObj.php","line":41} []
[2018-05-03 12:51:23] ErrorException.ERROR: ErrorException {"message":"problem instantiating fileGetterClass","details":"[object] (mik\exceptions\MikErrorException(code: 0): at /Users/Brandon/mik/mik:105)"} []
[2018-05-03 12:51:26] ErrorException.ERROR: ErrorException {"message":"Undefined variable: filtered_file_list","code":{"file_list":["/tmp/oaitest_output/mik.log"],"filetered_file_list":[],"pattern":"/tmp/oaitest_output/*","file_path":"/tmp/oaitest_output/mik.log"},"severity":8,"file":"/Users/Brandon/mik/src/fetchermanipulators/OaiMissingFileSet.php","line":131} []

@mjordan
Collaborator

mjordan commented May 3, 2018

@bondjimbond there is a change in #465 that is causing no OBJ files to be retrieved. Would you mind testing that PR first and merging it if it works so that I don't have to make the same change in this branch?
