Use D8's Migrate API to batch ingest content? #452
Not sure if this has come up yet, but would it make sense to use D8's Migrate API as the basis for batch ingest tools? Implementations are already starting to appear.
There is some additional discussion about the Migrate Source CSV module here.
Good question, but I don't think this is in scope for MVP. Back burner, @dannylamb?
That's fine. I added this more as a placeholder for discussion than anything else.
Oh, I've been purposefully silent on batch loading. Since we will have a REST API, it really opens up the options you can use, so I'm not keen to tie anyone to anything. But if you don't mind doing the work on the web server, the D8 Migrate API is a great option. ETL via a plugin-based architecture? That's a solid improvement over D7.
@dannylamb thanks. I take it that the alternative is to ingest content directly into F4 via its REST API. Let me mull over the pros/cons of each approach and in the new year I'll post something here.
@mjordan I'd be interested in mulling over the pros and cons of the various approaches with you in the new year, since this is of interest to me.
@MarcusBarnes yeah, when I looked up "ETL" it struck me that's exactly what MIK does. Check out https://api.drupal.org/api/drupal/core!modules!migrate!migrate.api.php/group/migration/8.2.x.
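For anyone skimming this thread later: a Migrate API migration is declared in a single definition whose three sections mirror the ETL steps mentioned above. The following is only a minimal sketch, assuming the migrate_plus and migrate_source_csv contrib modules; the file path, field names, and bundle are hypothetical.

```yaml
# Minimal, hypothetical migration definition showing the ETL shape:
# source (Extract), process (Transform), destination (Load).
id: example_islandora_objects
label: 'Example Islandora object migration'
source:
  plugin: csv                       # from the migrate_source_csv module
  path: /tmp/objects.csv            # hypothetical input file
  header_row_count: 1               # first row holds the column names
  keys:
    - identifier                    # unique key column in the CSV
process:
  title: title                      # map the "title" column to the node title
  field_description: description    # hypothetical field mapping
destination:
  plugin: 'entity:node'
  default_bundle: islandora_object  # hypothetical content type
```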
@mjordan I would load content directly through Islandora's API, which we will eventually have. In 7.x-1.x the REST API is read-only, but you will be able to POST content in 8.x. At that point you can use anything to migrate content. Personally, I'd load up an ActiveMQ queue with my content and churn on that. You'd get some nice durability and scalability that way. But if someone wants to just do up a bunch of Perl scripts... who am I to judge?
Sweet, thanks for the nudge in the right direction. Is that the Islandora REST API spec'ed out roughly at the very end of https://github.com/Islandora-CLAW/CLAW/blob/master/docs/mvp/mvp_doc.md? Unimportant, but if you're referring to https://github.com/discoverygarden/islandora_rest as the 7.x-1.x REST API, you can POST and PUT with it as well as GET.
Yes, also kept purposefully quite vague. :P Hunh... was always under the impression it was read-only. I wonder why it hasn't gotten the uptake it deserves, then.
Dunno... I have some plans for using it (and have used it already in some internal housekeeping scripts), but it kinda dropped off my horizon for the last little while. Will return to it soon. REST APIs FTW!
BTW, after the CLAW call last week I cleaned up my proof-of-concept that uses the Migrate API to load master TIFFs, their descriptive metadata, and link some authority records. I mention it here as a point of reference: CLAW Migrate Files Proof-of-Concept.
@seth-shaw-unlv That's more full-featured than what I was working on 👍
@seth-shaw-unlv Sweet. I'm going to be on a plane for a few hours tomorrow, so I might hack out a Move to Islandora Kit toolchain to fetch images from an existing Islandora (in my case, a vagrant running on my laptop) and dump them out in an arrangement like the one in your repo. That would be one example of a 7.x -> CLAW migration path.
@mjordan I could probably just update mine to match your sample CSV. Looks simple enough to do; I could probably do the update in a few minutes tomorrow morning. Edit: done. It took longer than I expected because I bumped into an issue with migrate_plus' entity_lookup plugin. While my previous example had lookups for each content type based on the CSV column, having a single column look across multiple content types wasn't possible in a single pass without a patch. (Multiple passes work, but subjects not already in Drupal would get dropped.)
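For context, migrate_plus' entity_lookup plugin is configured per field in the process section, and each configured lookup targets a single entity type and bundle, which is what makes the one-column, many-content-types case above awkward. A hedged sketch, with hypothetical field and bundle names:

```yaml
# Hypothetical process-section snippet using migrate_plus' entity_lookup.
# Each lookup is pinned to one bundle, so matching a single CSV column
# against several content types needs a patch or multiple passes.
process:
  field_subject:
    plugin: entity_lookup
    source: subject        # CSV column holding the value to match
    entity_type: node
    bundle: person         # one content type per configured lookup
    bundle_key: type
    value_key: title       # entity property compared against the source value
```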
@seth-shaw-unlv that's awesome. I've been working (it also took longer than I expected) on an end-to-end MIK toolchain that will generate a CSV like the one you have from an Islandora's OAI endpoint. I harvested a collection on my 7.x vagrant as a test; the resulting CSV still needs some work.
My goal with this is to allow someone to run MIK against their 7.x repository and get the input for a Migrate Plus ingest like the one you've created.
Hey, all-- just a naive question from a naive onlooker, but is the basic idea of this ticket to create some tooling so that people can do migrations by acting against Drupal, instead of acting against the backend?
Then I have to stick my 2¢ in; that would be AWESOME. Thanks to all of you for working on this!
OK, I've updated the CSV output. The specific DC elements that end up in the file are configurable in the MIK .ini file. To parse MODS instead of DC, all we'd need to do is replace one PHP class file. If anyone is interested in seeing an example .ini file, I pasted one into the MIK GitHub issue linked a couple of comments above.
@seth-shaw-unlv I can demo the MIK harvest part of this in tomorrow's CLAW call. You willing to go over the Migrate import stuff a little?
@mjordan Sure.
OK, based on conversation yesterday, we now have a way of outputting one larger XML file to use as Migrate Plus' input, rather than a CSV. The file is generated by concatenating all of the harvested MODS or DC XML files into one, wrapping them all in an outer element. The script is here. I've attached a sample output file (renamed to .txt so I can attach it).
This might be a little nit-picky, but the root element should probably be modsCollection.
No problem; it's configurable in the script, but I'll make that the default later today.
Folks, what is the rationale for breaking Drupal 8 (memory 😬) by passing it one huge XML file instead of a key/value set like CSV? I have seen Java apps like Oxygen use 16 GB of RAM on 2xxxxx MODS records when they're all in the same document. Do you think Drupal 8/PHP can handle this, and also survive a badly formed XML file, or are you planning on splitting it before reading it into a PHP object? It's just a question before you all go this route, because it could be amazing but also expensive and prone to fail. Maybe we can build a test case, like 100,000 MODS documents? (That would be the median number of objects people have in their Islandora 7.x deployments.)
If you use the migrate_plus XML parser, it uses SAX-style streaming, which is memory efficient. Now, if you want to go nuts with the XPath requirements, you would have to use the SimpleXML parser, which will try to load the whole thing into memory and which, I agree, is likely to die on you. That stated, if someone has a giant MODS file I can play with (all my collection records are in CSV), I'm happy to test it. The only XML testing I've done was with a relatively small set of Agent authorities (9,046 entries), which didn't give me any trouble in a default DrupalVM image. Edit: I didn't answer the initial question. The rationale is simplicity of setup: skipping the intermediary CSV file saves a step, and the single file allows us to provide a single data source in the migrate configuration entity (rather than a list of 100k records). If it doesn't work, then sure, bring on the CSV intermediary.
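To make the two options concrete: the parser is chosen by the data_parser_plugin key in the migration's source section. A hedged sketch, with hypothetical file names and selectors; the xml parser streams the document, while swapping in simple_xml loads the whole thing but allows richer XPath.

```yaml
# Hypothetical migrate_plus source section reading one concatenated file.
source:
  plugin: url
  data_fetcher_plugin: file
  data_parser_plugin: xml          # streaming parser; use simple_xml for full XPath
  urls:
    - /tmp/mods_collection.xml     # the one-XML-file-to-rule-them-all
  item_selector: /modsCollection/mods
  fields:
    - name: identifier
      label: Identifier
      selector: identifier         # hypothetical element holding a unique ID
    - name: title
      label: Title
      selector: titleInfo/title
  ids:
    identifier:
      type: string
```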
@seth-shaw-unlv, makes sense. I'll ask our folks here to generate a test set and share it as soon as possible. SAX should be able to deal with it if the XPath stays simple, but I agree that some XPath use cases can become heavy. Thanks!
It might make sense to keep (and document) the new CSV writer component of MIK and provide the script to assemble the one-XML-file-to-rule-them-all. That way, people who want one or the other could choose. Or not use MIK at all!
I've written a small script that harvests a collection from 7.x. MIK may be overkill. The script is at https://github.com/mjordan/get_islandora_content. It only supports XML input, not CSV. |