zstash seems to archive "symlink names" but not values. #247

Closed
forsyth2 opened this issue Jan 14, 2023 · 5 comments · Fixed by #261
Labels
semver: bug Bug fix (will increment patch version)

Comments

@forsyth2
Collaborator

From @TonyB9000:

It turns out that with “zstash ls -l”, you can identify symlinks, as they list 0 filesize (and “None” for md5).

Still, if the symlink points to an actual file (where the zstash archive is created), you should (I think) tar up the real file (target), not the link.

@forsyth2 forsyth2 added the semver: bug Bug fix (will increment patch version) label Jan 14, 2023
@forsyth2
Collaborator Author

@TonyB9000 zstash was in fact designed this way.

From https://github.com/E3SM-Project/zstash/blob/main/zstash/hpss_utils.py#L187:

    # Only add files or hardlinks.
    # (So don't add directories or softlinks.)

If there are two hard links pointing to the same file, the result will be two separate files; we have no way to prevent that.

Symbolic links get reproduced, but the file they're pointing to may be missing, as you discovered. We could add a command line option to include the real file, if that would be useful.
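
As a rough illustration of the two behaviors being discussed (store the link itself versus store the file it points to), here is a minimal sketch using Python's standard tarfile module. It is not zstash's code: the file names and the archive_link helper are made up for the example, and it assumes a platform where os.symlink works.

```python
import os
import tarfile
import tempfile


def archive_link(dereference: bool) -> tarfile.TarInfo:
    """Archive one symlink and return the member that was stored for it.

    Hypothetical helper for illustration only; not part of zstash.
    """
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "data.txt")
        link = os.path.join(tmp, "link.txt")
        with open(target, "w") as f:
            f.write("hello")
        os.symlink(target, link)

        tar_path = os.path.join(tmp, "demo.tar")
        # dereference=True tells tarfile to store the content of the
        # link's target; the default (False) stores the link itself.
        with tarfile.open(tar_path, "w", dereference=dereference) as tar:
            tar.add(link, arcname="link.txt")
        with tarfile.open(tar_path) as tar:
            return tar.getmember("link.txt")


as_link = archive_link(dereference=False)
as_file = archive_link(dereference=True)
print(as_link.issym(), as_link.size)   # True 0
print(as_file.isfile(), as_file.size)  # True 5
```

The zero size of the member stored as a link is consistent with the 0 filesize that "zstash ls -l" reports for symlinks.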

@TonyB9000
Collaborator

TonyB9000 commented Feb 28, 2023

I think I would make this the default behavior - and issue an error (or at least, a warning) if a symlink points to nothing.

Instead of "tar cvf tarfile targetfile", I would use the "realpath" command:

    tar cvf tarfile "$(realpath targetfile)"

or similar. There is no use I can think of for tarring-up broken links.

Granted, I don't quite know how to do this "in bulk". You might need to run a separate script:

    rm -f realfiles
    for afile in <glob-pattern>; do
        # Resolve the link; keep only targets that exist as regular files.
        target=$(realpath "$afile")
        if [ -f "$target" ]; then
            echo "$target" >>realfiles
        fi
    done

and then

    tar cvf tarfile --files-from=realfiles
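
Since zstash is written in Python, the same "in bulk" idea can be sketched there as well. The function below is a hypothetical helper (collect_real_files is not a zstash function) and the file names are invented for the demo; it mirrors the shell loop by resolving each match and keeping only targets that exist as regular files, while also reporting the paths it skipped.

```python
import glob
import os
import tempfile


def collect_real_files(pattern: str) -> tuple[list[str], list[str]]:
    """Return (targets, skipped) for every path matching the glob pattern.

    targets: resolved paths that exist as regular files.
    skipped: matched paths that do not resolve to a regular file,
             such as broken symlinks.
    """
    targets, skipped = [], []
    for path in sorted(glob.glob(pattern)):
        real = os.path.realpath(path)
        if os.path.isfile(real):
            targets.append(real)
        else:
            skipped.append(path)
    return targets, skipped


with tempfile.TemporaryDirectory() as tmp:
    data = os.path.join(tmp, "data.txt")
    with open(data, "w") as f:
        f.write("payload")
    os.symlink(data, os.path.join(tmp, "link_good"))
    os.symlink(os.path.join(tmp, "missing"), os.path.join(tmp, "link_broken"))

    targets, skipped = collect_real_files(os.path.join(tmp, "link_*"))
    expected_target = os.path.realpath(data)

print([os.path.basename(p) for p in targets])  # ['data.txt']
print([os.path.basename(p) for p in skipped])  # ['link_broken']
```

Like the shell version, this does not remove duplicates: two links to the same file would put the same target in the list twice.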

@TonyB9000
Collaborator

TonyB9000 commented Feb 28, 2023

Correction: I can think of a case where you have a directory containing symlinks, and you want to "tar it" (with other stuff) and move it to a new location on the same file system. Then, it would be reasonable to tar only the links - not the actual files.

But this use-case is very unusual. I might make THAT a command-line option (tar links as links, not the files they refer to.)

Do we know why zstash was designed this way (only files and hardlinks)? What was the rationale?

The major problem is that a cursory inspection of a zstash archive (consulting only the "index.db") gives the impression that files exist, when only the symlinks exist.

@forsyth2
Collaborator Author

forsyth2 commented Mar 8, 2023

> The major problem is that a cursory inspection of a zstash archive (consulting only the "index.db") gives the impression that files exist, when only the symlinks exist.

Yes, that makes sense.

> Granted, I don't quite know how to do this "in bulk".

From a cursory search, it looks like we could run "tar --create --dereference". From the GNU tar manual (https://www.gnu.org/software/tar/manual/html_node/dereference.html): "When ‘--dereference’ (‘-h’) is used with ‘--create’ (‘-c’), tar archives the files symbolic links point to, instead of the links themselves."

It looks like zstash doesn't use the tar command directly, though. https://github.com/E3SM-Project/zstash/blob/main/zstash/hpss_utils.py adds individual files to an existing tar. We could call tarinfo.issym() (https://docs.python.org/3/library/tarfile.html#tarfile.TarInfo) to check whether a file is a symlink, but I'm not seeing an obvious way for us to copy over the file being pointed to.

(Also note that if two symlinks point to the same file, my understanding is that you would end up with two copies of that file in the tar.)
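
That understanding can be checked with a small experiment in Python's tarfile module. This is only a sketch with invented file names, run against a temporary directory rather than a real zstash archive, and it assumes symlinks are available on the platform.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data.txt")
    with open(target, "w") as f:
        f.write("x" * 100)

    # Two symlinks that point to the same 100-byte file.
    links = [os.path.join(tmp, name) for name in ("a.lnk", "b.lnk")]
    for link in links:
        os.symlink(target, link)

    tar_path = os.path.join(tmp, "demo.tar")
    # With dereference=True, each link is stored as a regular file.
    with tarfile.open(tar_path, "w", dereference=True) as tar:
        for link in links:
            tar.add(link, arcname=os.path.basename(link))

    with tarfile.open(tar_path) as tar:
        members = [(m.name, m.isfile(), m.size) for m in tar.getmembers()]

print(members)  # [('a.lnk', True, 100), ('b.lnk', True, 100)]
```

Both members carry the full 100 bytes, so the content is indeed stored twice when the links are dereferenced.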

@golaz Let us know if you have any input on this, thanks!

@TonyB9000
Collaborator

> but I'm not seeing an obvious way for us to copy over the file being pointed to.

The Python function "os.path.realpath()" can do this. From the docs (https://docs.python.org/3/library/os.path.html): "Return the canonical path of the specified filename, eliminating any symbolic links encountered in the path (if they are supported by the operating system)."
For a path that is not a symlink (and has no symlinks in its parent directories), it simply returns the absolute form of the same path.

When it comes to cataloging the contents of a zstash archive, using "zstash ls" is much simpler than "zstash ls -l", but only the latter will reveal an "empty" symlink. Avoiding them up-front would be preferable, if possible.

It looks like the hpss_utils.py function that adds files to a tar archive would need to replace each file with os.path.realpath(file).
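
A minimal sketch of that substitution, combined with the earlier suggestion to warn about links that point to nothing, might look like the following. The add_resolved helper and the file names are hypothetical and are not taken from hpss_utils.py; the sketch keeps the original name in the archive via arcname while reading the content from the resolved path.

```python
import os
import tarfile
import tempfile
import warnings


def add_resolved(tar: tarfile.TarFile, path: str, arcname: str) -> bool:
    """Add the real file behind `path` under `arcname`.

    Hypothetical helper: returns False and warns when `path` does not
    resolve to anything that exists (for example, a broken symlink).
    """
    real = os.path.realpath(path)
    if not os.path.exists(real):
        warnings.warn(f"{path} does not resolve to an existing file; skipping")
        return False
    tar.add(real, arcname=arcname)
    return True


with tempfile.TemporaryDirectory() as tmp:
    data = os.path.join(tmp, "data.txt")
    with open(data, "w") as f:
        f.write("payload")
    good = os.path.join(tmp, "good.lnk")
    broken = os.path.join(tmp, "broken.lnk")
    os.symlink(data, good)
    os.symlink(os.path.join(tmp, "missing"), broken)

    tar_path = os.path.join(tmp, "demo.tar")
    # Record warnings so the demo can show that exactly one was raised.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        with tarfile.open(tar_path, "w") as tar:
            results = [
                add_resolved(tar, good, "good.lnk"),
                add_resolved(tar, broken, "broken.lnk"),
            ]

    with tarfile.open(tar_path) as tar:
        stored = [(m.name, m.isfile(), m.size) for m in tar.getmembers()]

print(results)      # [True, False]
print(stored)       # [('good.lnk', True, 7)]
print(len(caught))  # 1
```

Storing the content under the link's own name keeps the archive listing aligned with the source directory, at the cost of the duplicate copies noted earlier in the thread.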
