diff --git a/demo/metadata_migration/notebooks/migrate_8_1_2_to_9_0_4.ipynb b/demo/metadata_migration/notebooks/migrate_8_1_2_to_9_0_4.ipynb new file mode 100644 index 00000000..b2e91df3 --- /dev/null +++ b/demo/metadata_migration/notebooks/migrate_8_1_2_to_9_0_4.ipynb @@ -0,0 +1,607 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Migrate Mongo data from `nmdc-schema` [`v8.1.2`](https://github.com/microbiomedata/nmdc-schema/releases/tag/v8.1.2) to [`v9.0.4`](https://github.com/microbiomedata/nmdc-schema/releases/tag/v9.0.4)\n", + "\n", + "## Preface\n", + "\n", + "Download a copy of https://raw.githubusercontent.com/microbiomedata/nmdc-schema/main/assets/misc/study_dois_changes.yaml and save it at the path `assets/misc/study_dois_changes.yaml` relative to this notebook. This is a workaround for issue https://github.com/microbiomedata/nmdc-schema/issues/1310. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "### 1. Determine Mongo collections that will be transformed\n", + "\n", + "In this step, you will determine which Mongo collections will be transformed during this migration.\n", + "\n", + "1. In [`nmdc_schema/migration_recursion.py`](https://github.com/microbiomedata/nmdc-schema/blob/main/nmdc_schema/migration_recursion.py), locate the Python class whose name reflects the initial and final version numbers of this migration.\n", + "2. In that Python class, locate the `self.agenda` dictionary.\n", + "3. In that dictionary, make a list of the keys—these are the names of the Mongo collections that will be transformed during this migration. For example:\n", + " ```py\n", + " self.agenda = dict(\n", + " collection_name_1=[self.some_function],\n", + " collection_name_2=[self.some_function],\n", + " )\n", + " ```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 2. Coordinate with teammates that read/write to those collections\n", + "\n", + "In this step, you'll identify and reach out to the people that read/write to those collections; to agree on a migration schedule that works for you and them.\n", + "\n", + "Here's a table of Mongo collections and the components of the NMDC system that write to them (according to [a conversation that occurred on September 11, 2023](https://nmdc-group.slack.com/archives/C01SVTKM8GK/p1694465755802979?thread_ts=1694216327.234519&cid=C01SVTKM8GK)).\n", + "\n", + "| Mongo collection | NMDC system components that write to it |\n", + "|---------------------------------------------|----------------------------------------------------------|\n", + "| `biosample_set` | Workflows (via manual entry via `nmdc-runtime` HTTP API) |\n", + "| `data_object_set` | Workflows (via `nmdc-runtime` HTTP API) |\n", + "| `mags_activity_set` | Workflows (via `nmdc-runtime` HTTP API) |\n", + "| `metagenome_annotation_activity_set` | Workflows (via `nmdc-runtime` HTTP API) |\n", + "| `metagenome_assembly_set` | Workflows (via `nmdc-runtime` HTTP API) |\n", + "| `read_based_taxonomy_analysis_activity_set` | Workflows (via `nmdc-runtime` HTTP API) |\n", + "| `read_qc_analysis_activity_set` | Workflows (via `nmdc-runtime` HTTP API) |\n", + "| `jobs` | Scheduler (via Mongo directly) |\n", + "| `*` | `nmdc-runtime` (via Mongo directly) |\n", + "\n", + "You can use that table to help determine which people read/write to those collections. You can then coordinate a migration time slot with them via Slack, email, etc." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 3. Setup a migration environment\n", + "\n", + "In this step, you'll set up an environment in which you can run this notebook.\n", + "\n", + "1. Start a **Mongo server** on your local machine (and ensure it does **not** contain a database named `nmdc`).\n", + " 1. You can start a temporary, [Docker](https://hub.docker.com/_/mongo)-based Mongo server at `localhost:27055` by running this command:\n", + " ```shell\n", + " # Run in any directory:\n", + " docker run --rm --detach --name mongo-migration-transformer -p 27055:27017 mongo\n", + " ```\n", + " > Note: A Mongo server started via that command will have no access control (i.e. you will be able to access it without a username or password).\n", + "2. Create and populate a **notebook configuration file** named `.notebook.env`.\n", + " 1. You can use the `.notebook.env.example` file as a template:\n", + " ```shell\n", + " # Run in the same directory as this notebook:\n", + " $ cp .notebook.env.example .notebook.env\n", + " ```\n", + "3. Create and populate **Mongo configuration files** for connecting to the origin and transformer Mongo servers.\n", + " 1. You can use the `.mongo.yaml.example` file as a template:\n", + " ```shell\n", + " # Run in the same directory as this notebook:\n", + " $ cp .mongo.yaml.example .mongo.origin.yaml\n", + " $ cp .mongo.yaml.example .mongo.transformer.yaml\n", + " ```\n", + " > When populating the file for the origin Mongo server, use credentials that have write access to the `nmdc` database." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Procedure" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Install Python dependencies\n", + "\n", + "In this step, you'll [install](https://saturncloud.io/blog/what-is-the-difference-between-and-in-jupyter-notebooks/) the Python packages upon which this notebook depends. You can do that by running this cell.\n", + "\n", + "> Note: If the output of this cell says \"Note: you may need to restart the kernel to use updated packages\", restart the kernel (not the notebook) now." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install -r requirements.txt\n", + "%pip install nmdc-schema==9.0.4" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Import Python dependencies\n", + "\n", + "Import the Python objects upon which this notebook depends.\n", + "\n", + "> Note: One of the Python objects is a Python class that is specific to this migration." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Standard library packages:\n", + "from pathlib import Path\n", + "from shutil import rmtree\n", + "from copy import deepcopy\n", + "\n", + "# Third-party packages:\n", + "import pymongo\n", + "from nmdc_schema.nmdc_data import get_nmdc_jsonschema_dict\n", + "from nmdc_schema.migration_recursion import Migrator_from_8_1_to_9_0 as Migrator\n", + "from jsonschema import Draft7Validator\n", + "from dictdiffer import diff\n", + "\n", + "# First-party packages:\n", + "from helpers import Config" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Programmatically determine which collections will be transformed\n", + "\n", + "Here are the names of the collections this migration will transform.\n", + "\n", + "> Ensure you have coordinated with the people that read/write to them." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "agenda_collection_names = Migrator().agenda.keys()\n", + "\n", + "print(\"The following collections will be transformed:\")\n", + "print(\"\\n\".join(agenda_collection_names))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Parse configuration files\n", + "\n", + "Parse the notebook and Mongo configuration files." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "cfg = Config()\n", + "\n", + "# Define some aliases we can use to make the shell commands in this notebook easier to read.\n", + "mongodump = cfg.mongodump_path\n", + "mongorestore = cfg.mongorestore_path" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Perform a sanity test of the application paths." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!{mongodump} --version\n", + "!{mongorestore} --version" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create Mongo clients\n", + "\n", + "Create Mongo clients you can use to access the \"origin\" Mongo server (i.e. the one containing the database you want to migrate) and the \"transformer\" Mongo server (i.e. the one you want to use to perform the data transformations)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Mongo client for origin Mongo server.\n", + "origin_mongo_client = pymongo.MongoClient(host=cfg.origin_mongo_server_uri, directConnection=True)\n", + "\n", + "# Mongo client for transformer Mongo server.\n", + "transformer_mongo_client = pymongo.MongoClient(host=cfg.transformer_mongo_server_uri)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Perform a sanity test of the Mongo clients' abilities to access their respective Mongo servers." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Display the Mongo server version (running on the \"origin\" Mongo server).\n", + "print(\"Origin Mongo server version: \" + origin_mongo_client.server_info()[\"version\"])\n", + "\n", + "# Sanity test: Ensure the origin database exists.\n", + "assert \"nmdc\" in origin_mongo_client.list_database_names(), \"Origin database does not exist.\"\n", + "\n", + "# Display the Mongo server version (running on the \"transformer\" Mongo server).\n", + "print(\"Transformer Mongo server version: \" + transformer_mongo_client.server_info()[\"version\"])\n", + "\n", + "# Sanity test: Ensure the transformation database does not exist.\n", + "assert \"nmdc\" not in transformer_mongo_client.list_database_names(), \"Transformation database already exists.\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create JSON Schema validator\n", + "\n", + "In this step, you'll create a JSON Schema validator for the NMDC Schema." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "nmdc_jsonschema: dict = get_nmdc_jsonschema_dict()\n", + "nmdc_jsonschema_validator = Draft7Validator(nmdc_jsonschema)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Perform sanity tests of the NMDC Schema dictionary and the JSON Schema validator.\n", + "\n", + "> Reference: https://python-jsonschema.readthedocs.io/en/latest/api/jsonschema/protocols/#jsonschema.protocols.Validator.check_schema" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\"NMDC Schema title: \" + nmdc_jsonschema[\"title\"])\n", + "print(\"NMDC Schema version: \" + nmdc_jsonschema[\"version\"])\n", + "\n", + "nmdc_jsonschema_validator.check_schema(nmdc_jsonschema) # raises exception if schema is invalid" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Dump collections from the \"origin\" Mongo server\n", + "\n", + "In this step, you'll use `mongodump` to dump the collections that will be transformed during this migration; from the \"origin\" Mongo server.\n", + "\n", + "Since `mongodump` doesn't provide a CLI option that you can use to specify the collections you _want_ it to dump (unless that is only one collection), you can use a different CLI option to tell it all the collection you do _not_ want it to dump. The end result will be the same—there's just an extra step involved." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "That extra step is to generate an `--excludeCollection=\"{name}\"` CLI option for each collection that is not on the agenda, which you'll do now." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Build a string containing zero or more `--excludeCollection=\"...\"` options, which can be included in a `mongodump` command.\n", + "all_collection_names: list[str] = origin_mongo_client[\"nmdc\"].list_collection_names()\n", + "non_agenda_collection_names = [name for name in all_collection_names if name not in agenda_collection_names]\n", + "exclusion_options = [f\"--excludeCollection='{name}'\" for name in non_agenda_collection_names]\n", + "exclusion_options_str = \" \".join(exclusion_options) # separates each option with a space\n", + "\n", + "print(exclusion_options_str)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here, you'll run a `mongodump` command containing all those `--excludeCollection=\"{name}\"` CLI options." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Dump the not-excluded collections from the origin database.\n", + "!{mongodump} \\\n", + " --config=\"{cfg.origin_mongo_config_file_path}\" \\\n", + " --db=\"nmdc\" \\\n", + " --gzip \\\n", + " --out=\"{cfg.origin_dump_folder_path}\" \\\n", + " {exclusion_options_str}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Load the collections into the \"transformer\" Mongo server\n", + "\n", + "In this step, you'll load the collections dumped from the \"origin\" Mongo server, into the \"transformer\" MongoDB server.\n", + "\n", + "Since it's possible that the dump includes more collections than are on the agenda (due to someone creating a collection between the time you generated the exclusion list and the time you ran `mongodump`), you will use one or more of `mongorestore`'s `--nsInclude` CLI options to indicate which collections you want to load.\n", + "\n", + "Here's where you will generate the `--nsInclude=\"nmdc.{name}\"` CLI options." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "inclusion_options = [f\"--nsInclude='nmdc.{name}'\" for name in agenda_collection_names]\n", + "inclusion_options_str = \" \".join(inclusion_options) # separates each option with a space\n", + "\n", + "print(inclusion_options_str)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here, you'll run a `mongorestore` command containing all those `--nsInclude=\"nmdc.{name}\"` CLI options." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Restore the dumped collections to the transformer MongoDB server.\n", + "!{mongorestore} \\\n", + " --config=\"{cfg.transformer_mongo_config_file_path}\" \\\n", + " --gzip \\\n", + " --drop \\\n", + " --preserveUUID \\\n", + " --dir=\"{cfg.origin_dump_folder_path}\" \\\n", + " {inclusion_options_str}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Transform the collections within the \"transformer\" Mongo server\n", + "\n", + "Now that the transformer database contains a copy of each collection on the agenda, you can transform those copies.\n", + "\n", + "The transformation functions are provided by the `nmdc-schema` Python package.\n", + "> You can examine the transformation functions at: https://github.com/microbiomedata/nmdc-schema/blob/main/nmdc_schema/migration_recursion.py\n", + "\n", + "In this step, you will retrieve each documents from each collection on the agenda, pass it to the associated transformation function(s) on the agenda, then store the transformed document in place of the original one—all within the \"transformation\" database only. **The \"origin\" database is not involved with this step.**\n", + "\n", + "> Note: This step also includes validation. Reference: https://github.com/microbiomedata/nmdc-runtime/blob/main/metadata-translation/src/bin/validate_json.py\n", + "\n", + "> Note: This step also include a before-and-after comparison to facilitate manual spot checks. References: https://docs.python.org/3/library/copy.html#copy.deepcopy and https://dictdiffer.readthedocs.io/" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "migrator = Migrator()\n", + "\n", + "# Apply the transformations.\n", + "for collection_name, transformation_pipeline in migrator.agenda.items():\n", + " print(f\"Transforming documents in collection: {collection_name}\")\n", + " transformed_documents = []\n", + "\n", + " # Get each document from this collection.\n", + " collection = transformer_mongo_client[\"nmdc\"][collection_name]\n", + " for original_document in collection.find():\n", + " # Make a deep copy of the original document, to enable before-and-after comparison.\n", + " print(original_document)\n", + " copy_of_original_document = deepcopy(original_document)\n", + " \n", + " # Put the document through the transformation pipeline associated with this collection.\n", + " transformed_document = original_document # initializes the variable\n", + " for transformation_function in transformation_pipeline:\n", + " #\n", + " # THIS IS A WORKAROUND FOR ISSUE https://github.com/microbiomedata/nmdc-schema/issues/1311\n", + " #\n", + " # Note: Some of the transformation functions in the migration class specific to this migration\n", + " # do not return the transformed dictionary. As a result, transformation functions\n", + " # \"further down the pipeline\" do not receive a dictionary as input.\n", + " # \n", + " # The workaround I have employed here is:\n", + " # 1. Manually read the transformation functions and verify they all do, indeed,\n", + " # modify the dictionary \"in place\" (as opposed to returning a copy).\n", + " # 2. Once that has been verified; replace the standard notebook code\n", + " # (commented-out below) with code that ignores the return value\n", + " # of the transformation functions.\n", + " #\n", + " # transformed_document = transformation_function(transformed_document)\n", + " transformation_function(transformed_document)\n", + " print(transformed_document)\n", + " \n", + " # Compare the transformed document with a copy of the original document;\n", + " # and, if there are any differences, print those differences.\n", + " difference = diff(copy_of_original_document, transformed_document)\n", + " differences = list(difference)\n", + " if len(differences) > 0:\n", + " print(f\"✏️ {differences}\")\n", + "\n", + " # Validate the transformed document.\n", + " #\n", + " # Reference: https://github.com/microbiomedata/nmdc-schema/blob/main/src/docs/schema-validation.md\n", + " #\n", + " # Note: Dictionaries originating as Mongo documents include a Mongo-generated key named `_id`. However,\n", + " # the NMDC Schema does not describe that key and, indeed, data validators consider dictionaries\n", + " # containing that key to be invalid with respect to the NMDC Schema. So, here, we validate a\n", + " # copy (i.e. a shallow copy) of the document that lacks that specific key.\n", + " #\n", + " # Note: `root_to_validate` is a dictionary having the shape: { \"some_collection_name\": [ some_document ] }\n", + " # Reference: https://docs.python.org/3/library/stdtypes.html#dict (see the \"type constructor\" section)\n", + " #\n", + " transformed_document_without_underscore_id_key = {key: value for key, value in transformed_document.items() if key != \"_id\"}\n", + " root_to_validate = dict([(collection_name, [transformed_document_without_underscore_id_key])])\n", + " nmdc_jsonschema_validator.validate(root_to_validate) # raises exception if invalid\n", + "\n", + " # Store the transformed document.\n", + " transformed_documents.append(transformed_document) \n", + " print(\"\") \n", + "\n", + " # Replace the original documents with the transformed versions of themselves (in the transformer database).\n", + " for transformed_document in transformed_documents:\n", + " collection.replace_one({\"id\": {\"$eq\": transformed_document[\"id\"]}}, transformed_document)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Dump the transformed collections" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Dump the database from the transformer MongoDB server.\n", + "!{mongodump} \\\n", + " --config=\"{cfg.transformer_mongo_config_file_path}\" \\\n", + " --db=\"nmdc\" \\\n", + " --gzip \\\n", + " --out=\"{cfg.transformer_dump_folder_path}\" \\\n", + " {exclusion_options_str}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Load the transformed data into the \"origin\" Mongo server\n", + "\n", + "In this step, you'll put the transformed collection(s) into the origin MongoDB server, replacing the original collection(s) that have the same name(s)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Replace the same-named collection(s) on the origin server, with the transformed one(s).\n", + "!{mongorestore} \\\n", + " --config=\"{cfg.origin_mongo_config_file_path}\" \\\n", + " --gzip \\\n", + " --verbose \\\n", + " --dir=\"{cfg.transformer_dump_folder_path}\" \\\n", + " --drop \\\n", + " --preserveUUID \\\n", + " {inclusion_options_str}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### (Optional) Clean up\n", + "\n", + "Delete the temporary files and MongoDB dumps created by this notebook.\n", + "\n", + "> Note: You can skip this step, in case you want to delete them manually later (e.g. to examine them before deleting them)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "paths_to_files_to_delete = []\n", + "\n", + "paths_to_folders_to_delete = [\n", + " cfg.origin_dump_folder_path,\n", + " cfg.transformer_dump_folder_path,\n", + "]\n", + "\n", + "# Delete files.\n", + "for path in [Path(string) for string in paths_to_files_to_delete]:\n", + " try:\n", + " path.unlink()\n", + " print(f\"Deleted: {path}\")\n", + " except:\n", + " print(f\"Failed to delete: {path}\")\n", + "\n", + "# Delete folders.\n", + "for path in [Path(string) for string in paths_to_folders_to_delete]:\n", + " try:\n", + " rmtree(path)\n", + " print(f\"Deleted: {path}\")\n", + " except:\n", + " print(f\"Failed to delete: {path}\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}