Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade script #452

Merged
merged 22 commits into from
Apr 14, 2020
Merged

Upgrade script #452

merged 22 commits into from
Apr 14, 2020

Conversation

achilleas-k
Copy link
Member

This PR is mainly for the upgrade script but also includes some fixes and cleanup.

The script first checks the file header version to see if it's older than the current. If it is, it continues to collect tasks that need to be performed to complete the upgrade.

Tasks are created using functions that check if a change is required and if so, they return a closure bound to the specific filename. The returned function is then run after the user confirms they want to apply the changes. If --force is specified, the changes are applied without prompting.

All checks and changes are made in standalone functions that open the file using h5py and close it immediately after. This makes each change atomic (or atomic-ish) and reduces the chances that a file will be left in an invalid state. There's still a chance the user will stop the
procedure after a change has been applied and before the file version in the header is updated. This will put the file in a sort of invalid state but rerunning the upgrade script should finalise the changes.

Regardless of the possibilities of corruption or invalid states, the user is warned to make sure their files and data are backed up.

If you want to test the script, you can use the files in this GIN repository: https://gin.g-node.org/achilleas/nix-files. This includes two files from the G-Node/nix-examples repository (which come from Jan) and two more I made, one using nixpy v1.4 and the other using v1.5 (current master).

Header properties shouldn't have setters.  The format and version
attributes should only ever be added on creation.  A property setter
should never be exposed, since it communicates that the user can edit
the attributes freely, which they shouldn't since it would break the
file header check.

The version attribute is not overwritten if it exists when the method is
called.  This avoids resetting the version by calling the method on an
old file with a new library, which can be done since "private" methods
aren't really private.  The same check is not made for format since it
never changes.

Removed timestamp updating from the setter methods since it's called in
the constructor, which is only run on new file creation.
Instead of setting a flag, duplicate a couple of lines to make the
differences between new and existing files clearer.
This file format version change reflects the following format changes:
- Property values: no longer compound data types
- DataFrameDimension: new dimension type for DataArray
- File ID: new header attribute
Only if they're newer than 1.2.0, the version that introduced IDs
Script to upgrade NIX files to newest file format version.
For now just a stub.
This is a reference to the underlying H5Group for the metadata root
container.  It shouldn't be accessible as 'File.metadata'.  The proper
top level container for metadata is 'File.sections'.
This is a WIP and still needs to take care of old style Property
objects.

The script first checks the file header version to see if it's older
than the current. If it is, it continues to collect tasks that need to
be performed to complete the upgrade.

Tasks are created using functions that check if a change is required and
if so, they return a closure bound to the specific filename. The
returned function is then run after the user confirms they want to apply
the changes. If --force is specified, the changes are applied without
prompting.

All checks and changes are made in standalone functions that open the
file using h5py and close it immediately after. This makes each change
atomic (or atomic-ish) and reduces the chances that a file will be left
in an invalid state. There's still a chance the user will stop the
procedure after a change has been applied and before the file version in
the header is updated. This will put the file in a sort of invalid state
but rerunning the upgrade script should finalise the changes.

Regardless of the possibilities of corruption or invalid states, the
user is warned to make sure their files and data are backed up.
Extra attributes, if set, become new properties to avoid data loss.
Uncertainties are added to the new uncertainty attribute of the
property, but only if there is one unique uncertainty for all values in
the old property.  If it is not unique and each value has its own
uncertainty, they are also added as an extra property.
@achilleas-k achilleas-k force-pushed the upgrade-script branch 2 times, most recently from 8e9f00d to ecc5121 Compare March 26, 2020 22:17
@achilleas-k
Copy link
Member Author

Forgot to mention: This PR includes #451.

jgrewe
jgrewe previously approved these changes Mar 27, 2020
Copy link
Member

@jgrewe jgrewe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lvgtm! Worked hard to find something to nitpick upon :)

def update_property_values(fname):
"""
Returns a closure that binds the filename. When the return value is
called, it modifies rewrites all the metadata Property objects to the new
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing "and" like "modifies and rewrites"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OH NO!!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

welcome

@achilleas-k
Copy link
Member Author

I had a thought about a potential edge case that might cause issues. If you supply the same file twice, the task collection procedure will put the same file on the list twice. This wont be a problem with the file ID and version update, but the property reformatting might cause issues. Most likely the script will crash when attempting to get the "uncertainty" component of the property's value, so it shouldn't cause any corruption.

This will happen regardless of whether the file is specified twice on the command line, or if it's through a symlink or hardlink.

@achilleas-k
Copy link
Member Author

Tested now and as I suspected it crashes when trying to get the first field of the compound data type in the property. I can imagine this might be annoying if someone is running an upgrade of all their files and they have a symlink to one somewhere. I guess the chances are low in general, but maybe we should guard against it.

This isn't an issue if the same file is specified twice on the command line, since the tasks are indexed by the filename and that will get deduplicated.

If you supply the same file twice (perhaps through a symlink), the task
collection procedure will put the same file on the list twice. This wont
be a problem with the file ID and version update, but the property
reformatting might cause issues.The script will crash when attempting to
get the "uncertainty" component of the property's value, so it shouldn't
cause any corruption.

Regardless, checking the property format again right before performing
the change will prevent the issue.
Task preparation functions are responsible for returning the closure or
None if it's not needed.  The task is not added if it's not needed.

Even if it is added, the task checks if it needs to make the chance
right before making it (guards against multiple runs on the same file
after task collection).
@achilleas-k
Copy link
Member Author

With the latest changes, the script is a bit more careful about what it adds to the tasks and what it performs. It will only add a task if it's needed (previously, the Property update task was added even if the number of Property objects that needed updating was 0) and before performing any change, it rechecks if it's needed.

return update_props


def create_property(hfile, name, dtype, data):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am afraid the unit got lost during upgrading

Copy link
Member Author

@achilleas-k achilleas-k Mar 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh dear. Good catch!
Same for definition.

@achilleas-k
Copy link
Member Author

I'll add an extra step at the end to offer to run h5repack on the upgraded files to reclaim the space from the deleted objects. This will create a copy of each file, so the user should be warned of space usage (and potential process time).

@jgrewe jgrewe merged commit 1d83d88 into G-Node:master Apr 14, 2020
@jgrewe jgrewe mentioned this pull request Apr 14, 2020
@achilleas-k achilleas-k deleted the upgrade-script branch April 14, 2020 16:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants