Use D8's Migrate API to batch ingest content? #452

Closed
mjordan opened this issue Dec 11, 2016 · 30 comments

@mjordan
Contributor

mjordan commented Dec 11, 2016

Not sure if this has come up yet, but would it make sense to use D8's Migrate API as the basis for batch ingest tools? Implementations are already starting to appear.

@mjordan mjordan changed the title Use the D8's Migrate API to batch ingest content? Use D8's Migrate API to batch ingest content? Dec 11, 2016
@mjordan
Contributor Author

mjordan commented Dec 11, 2016

With some additional discussion about the Migrate Source CSV module here.

@ruebot
Member

ruebot commented Dec 11, 2016

Good question, but I don't think this is in scope for MVP. Back burner @dannylamb?

@mjordan
Contributor Author

mjordan commented Dec 12, 2016

That's fine, I added this as a placeholder for discussion more than anything else.

@dannylamb
Contributor

Oh, I've been purposefully silent on batch loading. Since we will have a REST API, it really opens up the options you can use, so I'm not keen to tie anyone to anything. But if you don't mind doing the work on the web server, the D8 Migrate API is a great option. ETL via a plugin-based architecture? That's a solid improvement over D7.

@mjordan
Contributor Author

mjordan commented Dec 12, 2016

@dannylamb thanks. I take it that the alternative is to ingest content directly into F4 via its REST API. Let me mull over the pros/cons of each approach and in the new year I'll post something here.

@MarcusBarnes

MarcusBarnes commented Dec 12, 2016

@mjordan I'd be interested in mulling over the pros and cons of the various approaches with you in the new year, since this is of interest to me.

@mjordan
Contributor Author

mjordan commented Dec 12, 2016

@MarcusBarnes yeah, when I looked up "ETL" it struck me that's exactly what MIK does. Check out https://api.drupal.org/api/drupal/core!modules!migrate!migrate.api.php/group/migration/8.2.x.

@dannylamb
Contributor

@mjordan I would load content directly through Islandora's API, which we will eventually have. In 7.x-1.x the REST API is read-only, but you will be able to POST content in 8.x. At that point you can use anything to migrate content.

Personally, I'd load up an ActiveMQ queue with my content and churn on that. You'd get some nice durability and scalability that way. But if someone wants to just do up a bunch of Perl scripts... who am I to judge?

@mjordan
Contributor Author

mjordan commented Dec 12, 2016

Sweet, thanks for the nudge in the right direction. Is that the Islandora REST API spec'ed out roughly at the very end of https://github.com/Islandora-CLAW/CLAW/blob/master/docs/mvp/mvp_doc.md?

Unimportant, but if you're referring to https://github.com/discoverygarden/islandora_rest as the 7.x-1.x REST API, you can POST and PUT with it as well as GET.

@dannylamb
Contributor

Yes, also kept purposefully quite vague. :P

Hunh... was always under the impression it was read-only. I wonder why it hasn't gotten the uptake it deserves, then.

@mjordan
Contributor Author

mjordan commented Dec 12, 2016

Dunno... I have some plans for using it (and have used it already in some internal housekeeping scripts), but it kinda dropped off my horizon last little while. Will return to it soon. REST APIs FTW!

@seth-shaw-unlv
Contributor

BTW, after the CLAW call last week I cleaned up my proof-of-concept that uses the Migrate API to load master Tiffs, their descriptive metadata, and link some authority records. I mention it here as a point of reference: CLAW Migrate Files Proof-of-Concept.

@dannylamb
Contributor

@seth-shaw-unlv That's more full featured than what I was working on 👍

@mjordan
Contributor Author

mjordan commented Apr 10, 2018

@seth-shaw-unlv Sweet. I'm going to be on a plane for a few hours tomorrow so might hack out a Move to Islandora Kit toolchain to fetch images from an existing Islandora (in my case, a vagrant running on my laptop) and dump them out in an arrangement like the one in your repo. That would be one example of a 7.x -> CLAW migration path.

@seth-shaw-unlv
Contributor

seth-shaw-unlv commented Apr 10, 2018

@mjordan I could probably just update mine to match your sample csv. Looks simple enough to do. I could probably do the update in a few minutes tomorrow morning.

Edit: done. Took longer than I expected because I bumped into an issue with migrate_plus' entity_lookup plugin. While my previous example had lookups for each content type based on the CSV column, having a single column look across multiple content types wasn't possible in a single pass without a patch. (Multiple passes work, but subjects not already in Drupal would get dropped.)

@mjordan
Contributor Author

mjordan commented Apr 12, 2018

@seth-shaw-unlv that's awesome. I've been working (also took me longer than I expected) to create an end-to-end MIK toolchain that will generate a CSV like the one you have from an Islandora's OAI endpoint. Here's what the output looks like, harvested from a collection on my 7.x vagrant:

/tmp/oai_to_csv_output/
├── metadata.csv
├── mik.log
├── oai_drupal-site.org_doitest_12.png
├── oai_drupal-site.org_doitest_16.png
├── oai_drupal-site.org_doitest_4.jpeg
├── oai_drupal-site.org_doitest_5.jpeg
├── oai_drupal-site.org_doitest_6.png
└── problem_records.log

with the CSV file looking like this (so far, still needs some work):

"autogen 6 - blurg",StillImage,"nonprojected graphic",doitest:16,"This record was harvested on a Thursday."
"Church Holy Rosary, Vancouver B.C.",Churches,"Holy Rosary Church in Vancouver, B.C.",,1911,image,doitest:4,eng,"Vancouver, BC"
"Second test object.","Vanity Press","Jordan, M. (author)",(editor),2015-01-01,Text,doitest:3,,"This record was harvested on a Thursday."
"Has DOI?","Vanity Press",PhysicalObject,globe,doitest:6,"This record was harvested on a Thursday."
"autogen 6",StillImage,"nonprojected graphic",doitest:12,"This record was harvested on a Thursday."

My goal with this is to allow someone to run MIK against their 7.x repository and get the input for a Migrate Plus ingest like the one you've created.

@ajs6f

ajs6f commented Apr 12, 2018

Hey, all-- just a naive question from a naive onlooker, but is the basic idea of this ticket to create some tooling so that people can do migrations by acting against Drupal, instead of acting against the backend?

@mjordan
Contributor Author

mjordan commented Apr 12, 2018

@ajs6f yes, specifically, Drupal 8's Migrate API. A related issue is #819.

@ajs6f

ajs6f commented Apr 12, 2018

Then I have to stick my 2¢ in; that would be AWESOME. Thanks to all of you for working on this!

@mjordan
Contributor Author

mjordan commented Apr 12, 2018

OK, now the CSV file looks like this:

ID,title,identifier,description,format,File
oai%3Adrupal-site.org%3Adoitest_16,"autogen 6 - blurg",doitest:16,"This record was harvested on a Thursday.","nonprojected graphic",oai_drupal-site.org_doitest_16.png
oai%3Adrupal-site.org%3Adoitest_4,"Church Holy Rosary, Vancouver B.C.",doitest:4,"Holy Rosary Church in Vancouver, B.C.",oai_drupal-site.org_doitest_4.jpeg
oai%3Adrupal-site.org%3Adoitest_5,"Second test object.",doitest:3,"This record was harvested on a Thursday.",oai_drupal-site.org_doitest_5.jpeg
oai%3Adrupal-site.org%3Adoitest_6,"Has DOI?",doitest:6,"This record was harvested on a Thursday.",globe,oai_drupal-site.org_doitest_6.png
oai%3Adrupal-site.org%3Adoitest_12,"autogen 6",doitest:12,"This record was harvested on a Thursday.","nonprojected graphic",oai_drupal-site.org_doitest_12.png

The specific DC elements that end up in the file are configurable in the MIK .ini file. To parse MODS instead of DC, all we'd need to do is replace one PHP class file. If anyone is interested in seeing an example .ini file, I pasted one into the MIK Github issue linked a couple of comments above.
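Since not every harvested record carries every DC element, a quick pre-flight check on the CSV can catch rows that have fewer fields than the header before they reach a migration. The following is only a sketch in Python, not part of MIK: `check_rows` and `decode_id` are hypothetical helper names, and the sample text is the header plus the first two rows shown above.

```python
import csv
import io
from urllib.parse import unquote

# Header and first two data rows copied from the sample output above.
SAMPLE = '''ID,title,identifier,description,format,File
oai%3Adrupal-site.org%3Adoitest_16,"autogen 6 - blurg",doitest:16,"This record was harvested on a Thursday.","nonprojected graphic",oai_drupal-site.org_doitest_16.png
oai%3Adrupal-site.org%3Adoitest_4,"Church Holy Rosary, Vancouver B.C.",doitest:4,"Holy Rosary Church in Vancouver, B.C.",oai_drupal-site.org_doitest_4.jpeg
'''


def check_rows(csv_text):
    """Return (header, problems).

    problems is a list of (line_number, field_count) tuples for every
    data row whose field count differs from the header's.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    problems = []
    for row in reader:
        if len(row) != len(header):
            problems.append((reader.line_num, len(row)))
    return header, problems


def decode_id(raw_id):
    """Turn the percent-encoded ID column back into the OAI identifier."""
    return unquote(raw_id)


if __name__ == "__main__":
    header, problems = check_rows(SAMPLE)
    print(len(header), "columns expected")
    for line_number, count in problems:
        print(f"line {line_number}: {count} fields")
    print(decode_id("oai%3Adrupal-site.org%3Adoitest_16"))
```

On the sample, the second data row has five fields against a six-column header, so its file name would land under `format`. Whether to pad such rows or fix them at the source is a choice for whoever runs the harvest.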

@mjordan
Contributor Author

mjordan commented Apr 17, 2018

@seth-shaw-unlv I can demo the MIK harvest part of this in tomorrow's CLAW call. You willing to go over the Migrate import stuff a little?

@seth-shaw-unlv
Contributor

@mjordan Sure.

@mjordan
Contributor Author

mjordan commented Apr 19, 2018

OK, based on yesterday's conversation, we now have a way of outputting one large XML file to use as Migrate Plus' input, rather than a CSV. The file is generated by concatenating all of the harvested MODS or DC XML files into one, wrapping them all in an outer element. The script is here. I've attached a sample output file (renamed to .txt so I can attach it).

metadata.xml.txt
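For readers who want to see the shape of the concatenation step, here is a minimal Python sketch. It is not the script linked above; `combine_records` is a hypothetical name, and the default wrapper element `modsCollection` in the MODS v3 namespace is an assumption that only suits MODS input (DC records would need a different wrapper).

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"


def combine_records(record_xml_strings, root_tag="modsCollection", namespace=MODS_NS):
    """Wrap each parsed record in one outer element and return the XML as a string.

    record_xml_strings: iterable of str or bytes, one XML document per record.
    """
    # Serialize the wrapper's namespace as the default (unprefixed) namespace.
    # Note that this registration is global to ElementTree in this process.
    ET.register_namespace("", namespace)
    root = ET.Element(f"{{{namespace}}}{root_tag}")
    for xml_text in record_xml_strings:
        # Parsing each record first means a malformed file fails here,
        # rather than corrupting the combined document.
        root.append(ET.fromstring(xml_text))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    records = [
        '<mods xmlns="http://www.loc.gov/mods/v3"><titleInfo><title>A</title></titleInfo></mods>',
        '<mods xmlns="http://www.loc.gov/mods/v3"><titleInfo><title>B</title></titleInfo></mods>',
    ]
    print(combine_records(records))
```

To run this over harvested files, the records could be read with something like `p.read_bytes()` for each `p` in `pathlib.Path(harvest_dir).glob("*.xml")`, where `harvest_dir` is whatever directory the harvest wrote to.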

@seth-shaw-unlv
Contributor

This might be a little nit-picky, but the root element should probably be modsCollection.

@mjordan
Contributor Author

mjordan commented Apr 19, 2018

No problem, it's configurable in the script but I'll change that to the default later today.

@DiegoPino
Contributor

Folks, what is the rationale for breaking Drupal 8 (memory 😬) by passing it one huge XML file instead of a key-value set like CSV? I have seen Java apps like oXygen use 16 GB of RAM on 2xxxxx MODS records when they are all in the same document. Do you think Drupal 8/PHP can handle this and also survive malformed XML, or are you planning on splitting the file before reading it into a PHP object? It is just a question before you all go this route, because it could be amazing but also expensive and prone to failure. Maybe we can build a test case, like 100,000 MODS documents? (That would be the median number of objects people have in their Islandora 7.x deployments.)

@seth-shaw-unlv
Contributor

seth-shaw-unlv commented Apr 19, 2018

If you use the migrate_plus XML processor it uses SAX which is memory efficient.

Now, if you want to go nuts with the XPath requirements you would have to use the SimpleXML processor, which will try to load the whole thing into memory and which, I agree, is likely to die on you.

That stated, if someone has a giant MODS file I can play with (all my collection records are in CSV), I'm happy to test it. The only XML testing I've done was with a relatively small set of Agent authorities (9,046 entries) which didn't give me any troubles in a default drupalvm image.

Edit: I didn't answer the initial question: the rationale is simplicity of setup. Skipping the intermediary CSV file saves a step and the single file allows us to provide a single data source in the migrate configuration entity (rather than a list of 100k records). If it doesn't work, then sure, bring on the CSV intermediary.
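To make the memory trade-off concrete, here is a small Python sketch of the streaming style of parsing. It is an illustration of the general technique only, not the migrate_plus parser (which is PHP); `iter_titles` is a hypothetical name, and the `modsCollection`/`mods` layout of the input is an assumption.

```python
import io
import xml.etree.ElementTree as ET

MODS = "{http://www.loc.gov/mods/v3}"


def iter_titles(source):
    """Yield the first title of each mods record, one record at a time.

    source: a file path or a binary file-like object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == MODS + "mods":
            title = elem.find(f"{MODS}titleInfo/{MODS}title")
            yield title.text if title is not None else None
            # Drop the finished record's children so only one full record
            # is held at a time; the emptied element stays attached to the root.
            elem.clear()


if __name__ == "__main__":
    data = (
        b'<modsCollection xmlns="http://www.loc.gov/mods/v3">'
        b"<mods><titleInfo><title>A</title></titleInfo></mods>"
        b"<mods><titleInfo><title>B</title></titleInfo></mods>"
        b"</modsCollection>"
    )
    for title in iter_titles(io.BytesIO(data)):
        print(title)
```

The whole-document alternative would be `ET.parse(source)`, which builds every record in memory before any of them is processed. That is what makes arbitrary XPath across the document possible, and it is also where the memory cost comes from.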

@DiegoPino
Contributor

@seth-shaw-unlv, makes sense. I will ask our folks here to generate a test set and share it as soon as possible. SAX should be able to deal with it if the processing is simple, but I agree that some XPath use cases can become heavy. Thanks!

@mjordan
Contributor Author

mjordan commented Apr 19, 2018

It might make sense to keep (and document) the new CSV writer component of MIK and provide the script to assemble the one-XML-file-to-rule-them-all. So if people want one or the other they could choose. Or not use MIK at all!

@mjordan
Contributor Author

mjordan commented May 1, 2018

I've written a small script that harvests a collection from 7.x. MIK may be overkill. The script is at https://github.com/mjordan/get_islandora_content. It only supports XML input, not CSV.
