OAI filegetter: resume if process freezes? #344
Another (though perhaps slower) option would be to check the output_directory for each record to see whether a file already exists before attempting to download.
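For illustration, a minimal sketch of that kind of pre-download check; the function and parameter names here are hypothetical, not MIK's actual API:

```php
<?php
// Hypothetical pre-download check: skip the HTTP request entirely if a
// payload file for this record already exists in the output directory.
function payloadAlreadyDownloaded($recordKey, $outputDirectory)
{
    // Look for any file named after the record key (e.g. 1234.jpg, 1234.pdf),
    // ignoring the metadata .xml file.
    $candidates = glob($outputDirectory . DIRECTORY_SEPARATOR . $recordKey . '.*') ?: array();
    foreach ($candidates as $path) {
        if (strtolower(pathinfo($path, PATHINFO_EXTENSION)) !== 'xml') {
            return true;
        }
    }
    return false;
}
```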
I don't think resumption tokens would work here since they are internal to the OAI harvester, and because they represent chunks of records, usually 100, though I think that is configurable. However, we have approached this sort of problem in the CONTENTdm toolchains by using the Specific Set fetcher manipulator, which, as the name suggests, fetches specific records from the source. It could be used in two ways:
This manipulator currently only works with CSV and Cdm toolchains. Let me see what's involved in getting it to work with the OAI toolchains.
@bondjimbond just to clarify, when you say "to resume downloading a large set if the process freezes", do you mean the OAI part of the harvest, or the downloading of the payload (PDF, etc.) file? Rereading your initial issue it appears you mean the OAI part, but your follow-up comments suggest the file download part.
Consulting the fine docs at https://github.com/MarcusBarnes/mik/wiki/Toolchain:-OAI-PMH-for-repositories-that-identify-resource-files-in-a-record-element I see that this toolchain already supports the SpecificSet fetcher manipulator. We could give that a try. It applies to the OAI harvesting, not the file downloading, however.
I guess the issue here is downloading the payload. I'd say the OAI part of the harvest completes rather quickly, as the "temp" directory generates a huge number of .metadata files right away. (I don't know exactly how the toolchain works after you get all the .metadata files; it still takes some time to generate the XML files from them.) SpecificSet is good and helpful in limiting the download - but when a "set" contains 6000+ objects, the possibility of timeouts/crashes/etc. is high.

In my specific case, the last attempt to download a large set got me 95% complete, but the rest failed due to a scheduled server reset. So I was able to collect all the PIDs with failed downloads. But that's kind of a unique case. More generally, you could compare the PIDs in the .metadata files with the filenames of whatever's already in the output directory, and re-attempt downloads for every record that does not already have both an XML file and something else with the same filename.
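Roughly, that comparison could look like the sketch below; the .metadata filename convention and the idea that PIDs appear as basenames in the output directory are assumptions taken from the description above, not MIK's actual internals:

```php
<?php
// Sketch: list record keys (PIDs) that still need a download attempt.
// A record counts as done only when the output directory holds both an
// .xml file and at least one non-XML payload file with the same basename.
function pidsNeedingRetry($tempDirectory, $outputDirectory)
{
    $needRetry = array();
    foreach (glob($tempDirectory . '/*.metadata') ?: array() as $metadataFile) {
        $pid = pathinfo($metadataFile, PATHINFO_FILENAME);
        $hasXml = file_exists($outputDirectory . '/' . $pid . '.xml');
        $hasPayload = false;
        foreach (glob($outputDirectory . '/' . $pid . '.*') ?: array() as $candidate) {
            if (strtolower(pathinfo($candidate, PATHINFO_EXTENSION)) !== 'xml') {
                $hasPayload = true;
                break;
            }
        }
        if (!$hasXml || !$hasPayload) {
            $needRetry[] = $pid;
        }
    }
    return $needRetry;
}
```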
You can split up large sets into smaller ones, which would reduce the risk.
What you describe in the last paragraph could pretty easily be coded up as a filegetter manipulator. Given that the OAI "fetching" appears to be fairly reliable, this is probably the most direct solution to your problem. You would run MIK configured with this manipulator as many times as necessary until you got all your files. This filegetter manipulator would be pretty specific, but that's totally OK; the reason we built in the ability to use manipulators was to address edge cases or exceptions without having to hack or write core toolchain classes. I'm happy to advise on how to write this if you want.
I say that on the assumption that it's OK to rerun the OAI-PMH half of this process as many times as you need to.
It would be nice to have something like that. It would even work for unexplained hiccups - for example, I'm running a download right now where an object that downloaded successfully before just failed for no reason this time. Running it again would probably be successful. It would be great if I could run the entire set again but skip all the files that already downloaded. Your advice on writing this would be appreciated!
Agreed that it would be nice to have. Such a filegetter manipulator may be applicable to other toolchains that get files from a remote URL, both existing and yet-to-be-imagined ones. Let me put some initial thought into how to generalize this without jeopardizing your immediate use case. I think MIK's abstraction of 'record keys' in other toolchains (in other words, unique IDs) is all we need to pay attention to.
A quick look at the two existing filegetter manipulators at https://github.com/MarcusBarnes/mik/tree/master/src/filegettermanipulators reminds me that they are additive - they return likely paths to files. The use case we are circling around in the last few comments is subtractive. Despite my having said above that this could be coded up as a filegetter manipulator, I am now flip-flopping back to implementing it as a fetcher manipulator, but one where the "set" is determined by a payload file's presence in the output directory. Let me continue to stew this over a bit.
Pondering this while eating my lunch, I think we can base this fetcher manipulator on SpecificSet pretty easily. All we need to do is replace the code inside its
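To make the shape of that concrete, here is a hedged sketch of the filtering logic such a fetcher manipulator might contain. The class name, constructor, and manipulate() signature are illustrative assumptions modeled loosely on how the manipulators are described in this thread, not the actual implementation:

```php
<?php
// Hypothetical fetcher manipulator: keep only records whose payload file is
// NOT already in the output directory, so a re-run retries just the misses.
class FilterMissingPayloads
{
    private $outputDirectory;

    public function __construct($outputDirectory)
    {
        $this->outputDirectory = $outputDirectory;
    }

    // Takes the full list of records from the fetcher and returns the subset
    // that still needs its payload file downloaded.
    public function manipulate(array $records)
    {
        $filtered = array();
        foreach ($records as $record) {
            $key = $record->key; // whatever unique ID the toolchain uses
            $payloads = array_filter(
                glob($this->outputDirectory . '/' . $key . '.*') ?: array(),
                function ($path) {
                    return strtolower(pathinfo($path, PATHINFO_EXTENSION)) !== 'xml';
                }
            );
            if (count($payloads) === 0) {
                $filtered[] = $record; // no payload yet, keep it in the run
            }
        }
        return $filtered;
    }
}
```

A re-run configured with something along these lines would only attempt records whose payload is still missing.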
Probably... I don't know exactly how SpecificSet works, but we can look it over.
I can give this a first crack over the weekend. I think it should be pretty straightforward.
@bondjimbond could you send me an .ini file for an OAI-PMH toolchain that uses a setSpec that retrieves a relatively small (< 50 or even fewer) set of objects? By email is OK if you prefer. |
I've pushed up a new fetcher manipulator that filters out records that do not have a JPEG, etc. file in the output directory. To test it,
I had hoped that this fetcher manipulator could apply to the CONTENTdm and CSV toolchains, but at the moment it doesn't.
@bondjimbond can you give this a test?
Having some weird trouble checking out your branch due to some git strangeness; can't test at the moment, sorry.
@bondjimbond bump, if you have time. |
@bondjimbond not sure if you ever took a look at this, but if you could that would be great. |
@mjordan I don't know, it hasn't QUITE been a year yet... Maybe in another month? Working on a few things this week, but I'll try to take a look soon. |
OK! Finally starting to test this... and immediately running into problems. I'm using the Islandora filegetter, and here's my error:
Looks like maybe the path to the filegetter is wrong? My filegetters directory is at mik/src/filegetters, not mik/filegetters.
Never mind the above -- it looks like in that directory I don't actually have an OaipmhIslandoraObj file. Is that not part of the git repo? Shouldn't it have been added at some point after a git pull? |
Are you in the issue-344 branch? |
The OaipmhIslandoraObj.php filegetter isn't in the https://github.com/MarcusBarnes/mik/tree/issue-344 branch. Not sure why, but I can create a new branch tonight that contains everything you need. Will reply with details here.
Sounds good! |
OK, I've created a new branch, issue-344-new, which adds the new fetchermanipulator to the current master. Everything should be up to date and testable following the steps outlined above. |
Hmm.
Does this tell you anything about where the problem lies?
If you look in your output directory, are all the files .xml files? In other words, are there any OBJ files?
No files at all. But I do have a bunch of .metadata files in the temp directory. Couple of errors in mik.log:
@bondjimbond there is a change in #465 that is causing no OBJ files to be retrieved. Would you mind testing that PR first and merging it if it works so that I don't have to make the same change in this branch? |
Would it be possible to build in a way to resume downloading a large set if the process freezes? Perhaps the process could save the last-received resumption token, and a line in config could instruct MIK to use that when querying OAI the next time?
Very large sets appear to just hang up or time out, and a mechanism for recovering from a crash without re-initiating the process from the beginning would be extremely helpful.
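For what it's worth, the resumption-token idea could be sketched roughly as below. Only the general OAI-PMH flow (each incomplete ListRecords response returns a resumptionToken) comes from the protocol itself; the harvestPage() helper and the checkpoint file are hypothetical:

```php
<?php
// Sketch: resumable OAI-PMH harvesting via a saved resumption token.
// harvestPage() is a hypothetical helper that issues one ListRecords
// request (with or without a resumptionToken) and returns the next
// token, or null when the list is complete.
function harvestWithResume($tokenFile)
{
    // Start from the checkpointed token if a previous run was interrupted.
    $token = file_exists($tokenFile) ? trim(file_get_contents($tokenFile)) : null;
    do {
        $token = harvestPage($token);              // fetch one chunk of records
        if ($token !== null) {
            file_put_contents($tokenFile, $token); // checkpoint after each chunk
        }
    } while ($token !== null);
    if (file_exists($tokenFile)) {
        unlink($tokenFile);                        // harvest completed cleanly
    }
}
```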