OAI filegetter: resume if process freezes? #344
Another (though perhaps slower) option would be to check the output_directory for each record to see whether a file already exists before attempting to download.
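For illustration, a minimal sketch of that kind of pre-download check; the function and parameter names here are hypothetical, not MIK's actual API:

```php
<?php
// Hypothetical pre-download check: skip the HTTP request entirely if a
// payload file for this record already exists in the output directory.
function payloadAlreadyDownloaded($recordKey, $outputDirectory)
{
    // Look for any file named after the record key (e.g. 1234.jpg, 1234.pdf),
    // ignoring the metadata .xml file.
    $candidates = glob($outputDirectory . DIRECTORY_SEPARATOR . $recordKey . '.*') ?: array();
    foreach ($candidates as $path) {
        if (strtolower(pathinfo($path, PATHINFO_EXTENSION)) !== 'xml') {
            return true;
        }
    }
    return false;
}
```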
I don't think resumption tokens would work here since they are internal to the OAI harvester, and because they represent chunks of records, usually 100, though I think that is configurable. However, we have approached this sort of problem in the CONTENTdm toolchains by using the Specific Set fetcher manipulator, which, as the name suggests, fetches specific records from the source. It could be used in two ways:
This manipulator currently only works with CSV and Cdm toolchains. Let me see what's involved in getting it to work with the OAI toolchains.
@bondjimbond just to clarify, when you say "to resume downloading a large set if the process freezes", do you mean the OAI part of the harvest, or the downloading of the payload (PDF, etc.) file? Rereading your initial issue it appears you mean the OAI part, but your follow-up comments suggest the file download part.
Consulting the fine docs at https://github.com/MarcusBarnes/mik/wiki/Toolchain:-OAI-PMH-for-repositories-that-identify-resource-files-in-a-record-element I see that this toolchain already supports the SpecificSet fetcher manipulator. We could give that a try. It applies to the OAI harvesting, not the file downloading, however.
I guess the issue here is downloading the payload. I'd say the OAI part of the harvest completes rather quickly, as the "temp" directory generates a huge number of .metadata files right away. (I don't know exactly how the toolchain works after you get all the .metadata files; it still takes some time to generate the XML files from them.) SpecificSet is good and helpful in limiting the download - but when a "set" contains 6000+ objects, the possibility of timeouts/crashes/etc. is high.

In my specific case, the last attempt to download a large set got me 95% complete, but the rest failed due to a scheduled server reset. So I was able to collect all the PIDs with failed downloads. But that's kind of a unique case. More generally, you could compare the PIDs in the .metadata files with the filenames of whatever's already in the output directory, and re-attempt downloads for every record that does not already have both an XML file and something else with the same filename.
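Roughly, that comparison could look like the sketch below; the .metadata filename convention and the idea that PIDs appear as basenames in the output directory are assumptions taken from the description above, not MIK's actual internals:

```php
<?php
// Sketch: list record keys (PIDs) that still need a download attempt.
// A record counts as done only when the output directory holds both an
// .xml file and at least one non-XML payload file with the same basename.
function pidsNeedingRetry($tempDirectory, $outputDirectory)
{
    $needRetry = array();
    foreach (glob($tempDirectory . '/*.metadata') ?: array() as $metadataFile) {
        $pid = pathinfo($metadataFile, PATHINFO_FILENAME);
        $hasXml = file_exists($outputDirectory . '/' . $pid . '.xml');
        $hasPayload = false;
        foreach (glob($outputDirectory . '/' . $pid . '.*') ?: array() as $candidate) {
            if (strtolower(pathinfo($candidate, PATHINFO_EXTENSION)) !== 'xml') {
                $hasPayload = true;
                break;
            }
        }
        if (!$hasXml || !$hasPayload) {
            $needRetry[] = $pid;
        }
    }
    return $needRetry;
}
```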
You can split up large sets into smaller ones, which would reduce the risk.
What you describe in the last paragraph could pretty easily be coded up as a filegetter manipulator. Given that the OAI "fetching" appears to be fairly reliable, this is probably the most direct solution to your problem. You would run MIK configured with this manipulator as many times as necessary until you got all your files. This filegetter manipulator would be pretty specific, but that's totally OK; the reason we built in the ability to use manipulators was to address edge cases or exceptions without having to hack or write core toolchain classes. I'm happy to advise on how to write this if you want.
I say that on the assumption that it's OK to rerun the OAI-PMH half of this process as many times as you need to.
It would be nice to have something like that. It would even work for unexplained hiccups - for example, I'm running a download right now where an object that downloaded successfully before just failed for no reason this time. Running it again would probably be successful. It would be great if I could run the entire set again but skip all the files that already downloaded. Your advice on writing this would be appreciated!
Agreed that it would be nice to have. Such a filegetter manipulator may be applicable to other toolchains that get files from a remote URL, both existing and yet-to-be-imagined ones. Let me put some initial thought into how to generalize this without jeopardizing your immediate use case. I think MIK's abstraction of 'record keys' in other toolchains (in other words, unique IDs) is all we need to pay attention to.
A quick look at the two existing filegetter manipulators at https://github.com/MarcusBarnes/mik/tree/master/src/filegettermanipulators reminds me that they are additive - they return likely paths to files. The use case we are circling around in the last few comments is subtractive. Despite my having said above that this could be coded up as a filegetter manipulator, I am now flip-flopping back to implementing it as a fetcher manipulator, but one where the "set" is determined by a payload file's presence in the output directory. Let me continue to stew this over a bit.
Pondering this while eating my lunch, I think we can base this fetcher manipulator on SpecificSet pretty easily. All we need to do is replace the code inside its
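To make the shape of that concrete, here is a hedged sketch of the filtering logic such a fetcher manipulator might contain. The class name, constructor, and manipulate() signature are illustrative assumptions modeled loosely on how the manipulators are described in this thread, not the actual implementation:

```php
<?php
// Hypothetical fetcher manipulator: keep only records whose payload file is
// NOT already in the output directory, so a re-run retries just the misses.
class FilterMissingPayloads
{
    private $outputDirectory;

    public function __construct($outputDirectory)
    {
        $this->outputDirectory = $outputDirectory;
    }

    // Takes the full list of records from the fetcher and returns the subset
    // that still needs its payload file downloaded.
    public function manipulate(array $records)
    {
        $filtered = array();
        foreach ($records as $record) {
            $key = $record->key; // whatever unique ID the toolchain uses
            $payloads = array_filter(
                glob($this->outputDirectory . '/' . $key . '.*') ?: array(),
                function ($path) {
                    return strtolower(pathinfo($path, PATHINFO_EXTENSION)) !== 'xml';
                }
            );
            if (count($payloads) === 0) {
                $filtered[] = $record; // no payload yet, keep it in the run
            }
        }
        return $filtered;
    }
}
```

A re-run configured with something along these lines would only attempt records whose payload is still missing.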
Probably... I don't know exactly how SpecificSet works, but we can look it over.
I can give this a first crack over the weekend. I think it should be pretty straightforward.
@bondjimbond could you send me an .ini file for an OAI-PMH toolchain that uses a setSpec that retrieves a relatively small (< 50 or even fewer) set of objects? By email is OK if you prefer. |
I've pushed up a new fetcher manipulator that filters out records that do not have a JPEG, etc. file in the output directory. To test it,
I had hoped that this fetcher manipulator could apply to the CONTENTdm and CSV toolchains, but at the moment it doesn't.
@bondjimbond can you give this a test?
Having some weird trouble checking out your branch due to some git strangeness; can't test at the moment, sorry.
@bondjimbond bump, if you have time. |
@bondjimbond not sure if you ever took a look at this, but if you could that would be great. |
@mjordan I don't know, it hasn't QUITE been a year yet... Maybe in another month? Working on a few things this week, but I'll try to take a look soon. |
OK! Finally starting to test this... and immediately running into problems. I'm using the Islandora filegetter, and here's my error:
Looks like maybe the path to the filegetter is wrong? My filegetters directory is at mik/src/filegetters, not mik/filegetters.
Never mind the above -- it looks like in that directory I don't actually have an OaipmhIslandoraObj file. Is that not part of the git repo? Shouldn't it have been added at some point after a git pull? |
Are you in the issue-344 branch? |
The OaipmhIslandoraObj.php filegetter isn't in the https://github.com/MarcusBarnes/mik/tree/issue-344 branch. Not sure why, but I can create a new branch tonight that contains everything you need. Will reply with details here.
Sounds good! |
OK, I've created a new branch, issue-344-new, which adds the new fetchermanipulator to the current master. Everything should be up to date and testable following the steps outlined above. |
Hmm.
Does this tell you anything about where the problem lies?
If you look in your output directory, are all the files .xml files? In other words, are there any OBJ files?
No files at all. But I do have a bunch of .metadata files in the temp directory. Couple of errors in mik.log:
@bondjimbond there is a change in #465 that is causing no OBJ files to be retrieved. Would you mind testing that PR first and merging it if it works so that I don't have to make the same change in this branch? |
Would it be possible to build in a way to resume downloading a large set if the process freezes? Perhaps the process could save the last-received resumption token, and a line in config could instruct MIK to use that when querying OAI the next time?
Very large sets appear to just hang up or time out, and a mechanism for recovering from a crash without re-initiating the process from the beginning would be extremely helpful.
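For what it's worth, the resumption-token idea could be sketched roughly as below. Only the general OAI-PMH flow (each incomplete ListRecords response returns a resumptionToken) comes from the protocol itself; the harvestPage() helper and the checkpoint file are hypothetical:

```php
<?php
// Sketch: resumable OAI-PMH harvesting via a saved resumption token.
// harvestPage() is a hypothetical helper that issues one ListRecords
// request (with or without a resumptionToken) and returns the next
// token, or null when the list is complete.
function harvestWithResume($tokenFile)
{
    // Start from the checkpointed token if a previous run was interrupted.
    $token = file_exists($tokenFile) ? trim(file_get_contents($tokenFile)) : null;
    do {
        $token = harvestPage($token);              // fetch one chunk of records
        if ($token !== null) {
            file_put_contents($tokenFile, $token); // checkpoint after each chunk
        }
    } while ($token !== null);
    if (file_exists($tokenFile)) {
        unlink($tokenFile);                        // harvest completed cleanly
    }
}
```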