Issue337 move data #339

Open
wants to merge 36 commits into
base: master
36 commits
87581eb
#337: add code for moving data
mortenwh Jul 26, 2024
996b596
#337: update MMD file
mortenwh Jul 29, 2024
c31a81b
add contact address, country
mortenwh Jul 29, 2024
cae9f8d
#337: bug fix - see line 113, and return values of get_acdd
mortenwh Jul 29, 2024
e9838c9
#337: this should work..
mortenwh Jul 29, 2024
dd7232f
#337: executable script
mortenwh Jul 29, 2024
6e8b035
#337: add dependency
mortenwh Jul 29, 2024
6998b5e
#337: fix flake errors
mortenwh Jul 29, 2024
de70609
#337: updates after real-life test
mortenwh Jul 29, 2024
21088af
#337: minor bug fix
mortenwh Jul 29, 2024
666e9ea
#337: another minor bug fix
mortenwh Jul 29, 2024
376011c
#337: changes after testing on actual data
mortenwh Jul 30, 2024
fa87f4f
#337: still updating tests
mortenwh Jul 30, 2024
57a93ad
#337: Full test coverage
mortenwh Jul 31, 2024
4859b48
Merge branch 'master' into issue337_move_data
mortenwh Aug 1, 2024
5cb79c9
#337: fix flake errors
mortenwh Aug 1, 2024
e0097cb
Add description to README
mortenwh Aug 1, 2024
edd7295
Merge branch 'master' into issue337_move_data
mortenwh Sep 30, 2024
2e3d411
#337: install new script
mortenwh Sep 30, 2024
50b4edb
#337: properly close nc files
mortenwh Sep 30, 2024
36fac4b
337: remove test, as docstring does not make sense
mortenwh Sep 30, 2024
d2a0943
#337: resolve conflicts
mortenwh Sep 30, 2024
c703e16
#337: fix flake errors
mortenwh Sep 30, 2024
6515696
#337: change use of base folders and pattern, update tests, and chang…
mortenwh Oct 2, 2024
ab47412
#337: raise exception and resolve flake errors
mortenwh Oct 2, 2024
11ac8e1
#337: don't create folders in case of dry-run, change input arg and u…
mortenwh Oct 2, 2024
cda8b89
#337: change metadata update info
mortenwh Oct 2, 2024
236d828
#337: update test
mortenwh Oct 2, 2024
e73d5b2
#337: change metadata update info
mortenwh Oct 2, 2024
18732b6
#337: catch FileExistsError
mortenwh Oct 2, 2024
bda2cba
#337: handle special chars in requests, check response text, add and …
mortenwh Oct 2, 2024
71c4ee3
#337: remove commented lines
mortenwh Oct 2, 2024
da9fc22
#337: add logging
mortenwh Oct 2, 2024
cbe816e
#337: change warning msg
mortenwh Oct 3, 2024
2a18d58
#337: delete and insert instead of update
mortenwh Oct 3, 2024
28f7e0c
Merge branch 'master' into issue337_move_data
mortenwh Oct 4, 2024
20 changes: 20 additions & 0 deletions README.md
@@ -11,6 +11,26 @@ Python tools for MMD. The package contains tools for generating MMD files from n

generates an output MMD file called `reference_nc.xml`.

In addition, the `mmd_operations` module currently contains a tool to move data
files and accordingly update MMD files registered in online catalogs. This
module can be extended with other necessary data management tools. Moving
can be done with the `move_data` script, e.g.:

```
move_data /path/to/files-from-git/mmd-xml-<env> /path/to/old/storage /path/to/new/storage --ext-pattern "%Y/%m/%d/*.nc" --dmci-update
```

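The search pattern (`--ext-pattern`) is needed when the netCDF files are
stored in subfolders of the old storage location. The `--dmci-update` flag
pushes the updated MMD files directly to the metadata catalog. If
`--dmci-update` is not provided, the script performs a dry run: the local MMD
files are updated, but they are not pushed to the catalog and no data files
are moved.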
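
As a rough illustration of what a pattern such as `%Y/%m/%d/*.nc` selects, the
sketch below walks a base folder and keeps the files whose parent folders parse
as a date. It uses only the standard library and is not how `datetime_glob`
works internally; the helper name `find_dated_files` and its parameters are
made up for this example.

```python
import datetime
import fnmatch
import os


def find_dated_files(base, date_pattern="%Y/%m/%d", file_glob="*.nc"):
    """Return sorted (date, path) pairs for base/<date_pattern>/<file_glob>."""
    depth = date_pattern.count("/") + 1
    found = []
    for root, _dirs, files in os.walk(base):
        parts = os.path.relpath(root, base).split(os.sep)
        if len(parts) != depth:
            continue  # not at the folder level described by the pattern
        try:
            day = datetime.datetime.strptime("/".join(parts), date_pattern).date()
        except ValueError:
            continue  # folder names do not form a valid date
        for name in fnmatch.filter(files, file_glob):
            found.append((day, os.path.join(root, name)))
    return sorted(found)
```

Called on a storage folder that contains `2024/09/01/reference_nc.nc`, this
returns one entry with the date 2024-09-01 and the full path to that file;
files in folders that do not match the date layout are ignored.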

The results of the `move_data` script are logged to a file, which by
default is named `move_data.log`. You can change the filename with the
`--log-file` option. Due to a bug in pycsw, the file may contain warnings about
datasets that were not found in the catalog. This can be resolved by
reingesting the MMD files. Keep the log file, and ask a data manager for help
with this.
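
To hand the affected datasets over to a data manager, the warnings can be
pulled out of the log with a few lines of Python. This is only a sketch: it
assumes the default `logging.basicConfig` line layout (`LEVEL:logger:message`)
and the warning text written by `check_csw_catalog`; the helper name
`missing_datasets` is hypothetical.

```python
import re

# Assumed layout: "WARNING:root:Could not find dataset (<id>) in CSW
# catalog: <netCDF file>, <response text>"
_NOT_FOUND = re.compile(
    r"^WARNING:[^:]*:Could not find dataset \((?P<ds_id>[^)]+)\) "
    r"in CSW catalog: (?P<nc_file>[^,]+),")


def missing_datasets(log_file="move_data.log"):
    """Return (dataset id, netCDF file) pairs for datasets not found in CSW."""
    missing = []
    with open(log_file, "r") as f:
        for line in f:
            match = _NOT_FOUND.match(line)
            if match:
                missing.append((match.group("ds_id"), match.group("nc_file")))
    return missing
```

Lines that belong to a multi-line catalog response do not start with
`WARNING:` and are skipped, so each warning yields exactly one pair.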

# Installation

To avoid problems with conflicting versions, we recommend using the [Conda](
252 changes: 252 additions & 0 deletions py_mmd_tools/mmd_operations.py
@@ -0,0 +1,252 @@
"""
License:

This file is part of the py-mmd-tools repository
<https://github.com/metno/py-mmd-tools>.

py-mmd-tools is licensed under the Apache License 2.0
<https://github.com/metno/py-mmd-tools/blob/master/LICENSE>
"""
import os
import pytz
import uuid
import shutil
import logging
import netCDF4
import datetime
import requests
import tempfile
import datetime_glob

import urllib.parse


def add_metadata_update_info(f, note, type="Minor modification"):
    """ Add update information """
    f.write(
        " <mmd:update>\n"
        " <mmd:datetime>%s</mmd:datetime>\n"
        " <mmd:type>%s</mmd:type>\n"
        " <mmd:note>%s</mmd:note>\n"
        " </mmd:update>\n" % (datetime.datetime.utcnow().replace(
            tzinfo=pytz.utc).strftime("%Y-%m-%dT%H:%M:%SZ"), type, note))


def check_csw_catalog(ds_id, nc_file, urls, env, emsg=""):
    """Search for the dataset with id 'ds_id' in the CSW metadata
    catalog.
    """
    payload = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRepositoryItem",
        "id": ds_id}

    payload_str = urllib.parse.urlencode(payload, safe=":")

    ds_found_and_accessible = False
    res = requests.get(url=f"https://{urls[env]['csw']}/csw",
                       params=payload_str)
    # TODO: check the data_access urls
    if res.status_code == 200 and "ExceptionText" not in res.text:
        ds_found_and_accessible = True
    else:
        emsg += f"Could not find dataset ({ds_id}) in CSW catalog: {nc_file}, {res.text}"

    return ds_found_and_accessible, emsg


def get_local_mmd_git_path(nc_file, mmd_repository_path):
    """Return the path to the original MMD file.
    """
    ds = netCDF4.Dataset(nc_file)
    lvlA = "arch_%s" % uuid.UUID(ds.id).hex[7]
    lvlB = "arch_%s" % uuid.UUID(ds.id).hex[6]
    lvlC = "arch_%s" % uuid.UUID(ds.id).hex[5]
    mmd_filename = ds.id + ".xml"
    ds.close()
    return os.path.join(mmd_repository_path, lvlA, lvlB, lvlC, mmd_filename)


def mmd_change_file_location(mmd, new_file_location, copy=True):
    """Copy original MMD file, and change the file_location field.
    Return the filename of the updated MMD file, and a status flag
    indicating if it has been changed or not.
    """
    if not os.path.isfile(mmd):
        raise ValueError(f"File does not exist: {mmd}")
    if copy:
        tmp_path = tempfile.gettempdir()
        shutil.copy2(mmd, tmp_path)
        # Edit copied MMD file
        mmd = os.path.join(tmp_path, os.path.basename(mmd))
    lines = mmd_readlines(mmd)
    # Open the MMD file and add updates
    status = False
    with open(mmd, "w") as f:
        for line in lines:
            if "</mmd:last_metadata_update>" in line:
                add_metadata_update_info(f, "Change storage information.")
            if "<mmd:file_location>" in line:
                f.write(f" <mmd:file_location>{new_file_location}</mmd:file_location>\n")
                status = True
            else:
                f.write(line)
    return mmd, status


def mmd_readlines(filename):
    """ Read lines in MMD file.
    """
    if not os.path.exists(filename):
        raise ValueError("File does not exist: %s" % filename)
    with open(filename, "r") as f:
        lines = f.readlines()
    return lines


def move_data_file(nc_file, nfl, emsg=""):
    """Move data file from nc_file to nfl.
    """
    nc_moved = False
    try:
        shutil.move(nc_file, nfl)
    except Exception as e:
        nc_moved = False
        emsg = f"Could not move file from {nc_file} to {nfl}.\nError message: {str(e)}\n"
    else:
        nc_moved = True
    return nc_moved, emsg


def move_data(mmd_repository_path, old_file_location_base, new_file_location_base,
              ext_pattern=None, dry_run=True, env="prod"):
    """Update MMD and move data file.
    """
    if env not in ["dev", "staging", "prod"]:
        raise ValueError("Invalid env input")
    if env not in mmd_repository_path:
        raise ValueError("Invalid mmd_repository path")

    # Hostnames only: the "https://" scheme and the endpoint path are
    # added where the request URLs are built
    urls = {
        "prod": {
            "dmci": "dmci.s-enda.k8s.met.no",
            "csw": "data.csw.met.no",
            "id_namespace": "no.met",
        },
        "staging": {
            "dmci": "dmci.s-enda-staging.k8s.met.no",
            "csw": "csw.s-enda-staging.k8s.met.no",
            "id_namespace": "no.met.staging",
        },
        "dev": {
            "dmci": "dmci.s-enda-dev.k8s.met.no",
            "csw": "csw.s-enda-dev.k8s.met.no",
            "id_namespace": "no.met.dev",
        }
    }

    if os.path.isfile(old_file_location_base):
        existing = [old_file_location_base]
    else:
        existing = [str(nc_file) for match, nc_file in
                    datetime_glob.walk(pattern=os.path.join(old_file_location_base, ext_pattern))]

    copy_mmd = True
    if dry_run:
        # Not copying the MMD file will make it easy to check changes
        # with git diff
        copy_mmd = False

    updated = []
    not_updated = {}
    for nc_file in existing:
        # Error message
        emsg = ""
        nfl = new_file_location(nc_file, new_file_location_base, old_file_location_base, dry_run)
        mmd_orig = get_local_mmd_git_path(nc_file, mmd_repository_path)

        # Check permissions before doing anything
        remove_file_allowed = os.access(nc_file, os.W_OK)
        write_file_allowed = os.access(nfl, os.W_OK)
        if not remove_file_allowed or not write_file_allowed:
            if not remove_file_allowed and not write_file_allowed:
                raise PermissionError(f"Missing permissions to delete {nc_file} "
                                      f"and to write {nfl}")
            if not remove_file_allowed:
                raise PermissionError(f"Missing permission to delete {nc_file}")
            if not write_file_allowed:
                raise PermissionError(f"Missing permission to write {nfl}")

        mmd_new, mmd_updated = mmd_change_file_location(mmd_orig, nfl, copy=copy_mmd)
        if not mmd_updated:
            raise Exception(f"Could not update MMD file for {nc_file}")

        # Get MMD content as binary data
        with open(mmd_new, "rb") as fn:
            data = fn.read()
        res = requests.post(url=f"https://{urls[env]['dmci']}/v1/validate", data=data)

        # Update with dmci update
        dmci_updated = False
        if res.status_code == 200 and "OK" in res.text and not dry_run:
            # Close the netCDF file again before it is moved
            with netCDF4.Dataset(nc_file) as ds:
                ds_id = f"{ds.naming_authority}:{ds.id}".strip()
            # Delete dataset
            del_res = requests.post(url=f"https://{urls[env]['dmci']}/v1/delete/{ds_id}")
            if del_res.status_code != 200 or "OK" not in del_res.text:
                raise Exception(f"Not able to delete dataset ({ds_id}): {del_res.text}")
            # Reingest dataset
            res = requests.post(url=f"https://{urls[env]['dmci']}/v1/insert", data=data)
            # NOTE: because of a bug in pycsw, the update below is
            # replaced by the delete and insert above
            # Update dataset
            # res = requests.post(url=f"https://{urls[env]['dmci']}/v1/update", data=data)
        if res.status_code == 200 and "OK" in res.text:
            # This should be the case for a dry-run and a valid xml
            dmci_updated = True
        else:
            raise Exception("Could not push updated MMD file to the "
                            f"DMCI API: {mmd_new}, {res.text}")

        if dmci_updated and not dry_run:
            nc_moved, emsg = move_data_file(nc_file, nfl)
            if not nc_moved:
                raise Exception(f"Could not move {nc_file} to {nfl}.")
        elif dmci_updated and dry_run:
            nc_moved = True

        ds_id = f"{urls[env]['id_namespace']}:{os.path.basename(mmd_orig).split('.')[0]}"
        if not dry_run:
            ds_found_and_accessible, emsg = check_csw_catalog(ds_id, nc_file, urls, env, emsg=emsg)
        else:
            ds_found_and_accessible = True

        if not ds_found_and_accessible:
            logging.warning(emsg)

        if all([mmd_updated, dmci_updated, nc_moved, ds_found_and_accessible]):
            updated.append(mmd_orig)
            logging.info(f"Updated {mmd_orig}.")
        else:
            not_updated[mmd_orig] = emsg

    return not_updated, updated


def new_file_location(nc_file, new_base_loc, existing_base_loc, dry_run):
    """Return the name of the new folder where the netcdf file will be
    stored. Subfolders of new_base_loc will be created.
    """
    if not os.path.isdir(new_base_loc):
        raise ValueError(f"Folder does not exist: {new_base_loc}")
    file_path = nc_file.replace(existing_base_loc, new_base_loc)
    new_folder = os.path.dirname(os.path.abspath(file_path))
    if not dry_run:
        try:
            os.makedirs(new_folder)
        except FileExistsError:
            # Do nothing
            pass
    return new_folder
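
The folder layout that `get_local_mmd_git_path` relies on can be tried out without a netCDF file. The sketch below repeats the same indexing (hex digits 7, 6 and 5 of the dataset UUID) on a plain string; the function name `mmd_repo_path`, the UUID and the repository path are arbitrary values chosen for illustration.

```python
import os
import uuid


def mmd_repo_path(dataset_id, mmd_repository_path):
    """Build <repo>/arch_X/arch_Y/arch_Z/<id>.xml from a UUID string."""
    hex_id = uuid.UUID(dataset_id).hex  # 32 hex digits, no dashes
    levels = ["arch_%s" % hex_id[i] for i in (7, 6, 5)]
    return os.path.join(mmd_repository_path, *levels, dataset_id + ".xml")


# Arbitrary example values, not taken from the repository
example = mmd_repo_path("b7cb7934-77ca-4439-812e-f560df3fe7eb", "/repo/mmd-xml-dev")
```

For this UUID the first eight hex digits are `b7cb7934`, so the three levels are `arch_4`, `arch_3` and `arch_9`.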
2 changes: 1 addition & 1 deletion py_mmd_tools/mmd_to_nc.py
@@ -119,7 +119,7 @@ def process_element(self, xml_element, translations):
raise ValueError('Multiple ACDD or ACCD extension fields provided.'
' Please use another translation function.')
# Update the dictionary containing the ACDD elements
- self.update_acdd({acdd_name[0]: xml_element.text}, {acdd_name[0]: sep})
+ self.update_acdd({acdd_name[0]: xml_element.text}, {acdd_name[0]: sep[0]})

def update_acdd(self, new_dict, sep=None):
"""
77 changes: 77 additions & 0 deletions py_mmd_tools/script/move_data.py
@@ -0,0 +1,77 @@
#!/usr/bin/env python3
"""
Script to move one or more datasets from one location to another, and
update its MMD xml file accordingly.

License:

This file is part of the py-mmd-tools repository
<https://github.com/metno/py-mmd-tools>.

py-mmd-tools is licensed under the Apache License 2.0
<https://github.com/metno/py-mmd-tools/blob/master/LICENSE>
"""
import os
import logging
import argparse

from py_mmd_tools.mmd_operations import move_data


def create_parser():
    """Create parser object"""
    parser = argparse.ArgumentParser(description="Move one or more datasets from one location to "
                                                 "another, and update its MMD xml file "
                                                 "accordingly.")
    parser.add_argument(
        "mmd_repository_path", type=str,
        help="Local folder containing all MMD files.")
    parser.add_argument(
        "old_file_location_base", type=str,
        help="Base folder from which the data file(s) will be moved, or exact path to a file.")
    parser.add_argument(
        "new_file_location_base", type=str,
        help="Base or exact path to the folder to which the data file(s) will be moved.")
    parser.add_argument(
        "--ext-pattern", type=str, default=None,
        help="Pathname pattern extending old_file_location_base, i.e., extending the "
             "existing file *base* location(s) with, e.g., the year and month as a "
             "glob pattern intertwined with date/time format akin to "
             "strptime/strftime format (e.g., '%Y/%m').")
    parser.add_argument(
        "--dmci-update", action="store_true",
        help="Directly update the online catalog with the changed MMD files."
    )
    parser.add_argument(
        "--log-file", type=str, default="move_data.log",
        help="Log filename")

    return parser


def main(args=None):
    """Move dataset(s) and update MMD.
    """
    if not os.path.isdir(args.mmd_repository_path):
        raise ValueError(f"Invalid input: {args.mmd_repository_path}")

    if not os.path.isdir(args.new_file_location_base):
        raise ValueError(f"Invalid input: {args.new_file_location_base}")

    logging.basicConfig(filename=args.log_file, level=logging.INFO)

    not_updated, updated = move_data(args.mmd_repository_path,
                                     args.old_file_location_base,
                                     args.new_file_location_base,
                                     args.ext_pattern,
                                     dry_run=not args.dmci_update)

    return updated, not_updated


def _main():  # pragma: no cover
    main(create_parser().parse_args())  # entry point in pyproject.toml


if __name__ == '__main__':  # pragma: no cover
    main(create_parser().parse_args())
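
How the command line maps to a dry run can be checked with a reduced copy of the parser. This is a sketch, not the script itself: `build_parser` is a hypothetical stand-in that keeps the argument names of `create_parser` but drops the help texts, and the paths are placeholders.

```python
import argparse


def build_parser():
    # Reduced stand-in for create_parser(): same argument names, no help texts
    parser = argparse.ArgumentParser()
    parser.add_argument("mmd_repository_path")
    parser.add_argument("old_file_location_base")
    parser.add_argument("new_file_location_base")
    parser.add_argument("--ext-pattern", default=None)
    parser.add_argument("--dmci-update", action="store_true")
    parser.add_argument("--log-file", default="move_data.log")
    return parser


args = build_parser().parse_args([
    "/path/to/mmd-xml-dev", "/path/to/old/storage", "/path/to/new/storage",
    "--ext-pattern", "%Y/%m/%d/*.nc",
])
dry_run = not args.dmci_update  # same expression as in main()
```

Without `--dmci-update` the flag is `False`, so `move_data` runs as a dry run; the search pattern has to be passed through `--ext-pattern`, since the parser defines only three positional arguments.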
2 changes: 2 additions & 0 deletions pyproject.toml
@@ -38,6 +38,7 @@ dependencies = [
"shapely",
"wget",
"xmltodict",
+ "datetime_glob",
]
name = "py-mmd-tools"
description = "These are tools for generating MMD files from netCDF-CF files with ACDD attributes, and for documenting netCDF-CF files from MMD information."
@@ -49,6 +50,7 @@ nc2mmd = "py_mmd_tools.script.nc2mmd:_main"
check_nc = "py_mmd_tools.script.check_nc:_main"
yaml2adoc = "py_mmd_tools.script.yaml2adoc:_main"
ncheader2json = "py_mmd_tools.script.ncheader2json:_main"
+ move_data = "py_mmd_tools.script.move_data:_main"

[project.urls]
source = "https://github.com/metno/py-mmd-tools"
4 changes: 4 additions & 0 deletions pytest.ini
@@ -0,0 +1,4 @@
[pytest]
markers =
    py_mmd_tools: Core functionality tests
    online: Tests requiring web access
Binary file added tests/data/2024/09/01/reference_nc.nc
Binary file not shown.