Support for extracting hard links #102

Merged
merged 1 commit into from
Apr 24, 2021
31 changes: 25 additions & 6 deletions README.md
@@ -85,6 +85,13 @@ is encountered while extracting `tarball` and the entry is only extracted if the
an archive, to skip entries that cause `extract` to throw an error, or to record
what is extracted during the extraction process.

Before it is passed to the predicate function, the `Header` object is somewhat
modified from the raw header in the tarball: the `path` field is normalized to
remove `.` entries and replace multiple consecutive slashes with a single slash.
If the entry has type `:hardlink`, the link target path is normalized the same
way so that it will match the path of the target entry; the size field is set to
the size of the target path (which must be an already-seen file).

If the `skeleton` keyword is passed then a "skeleton" of the extracted tarball
is written to the file or IO handle given. This skeleton file can be used to
recreate an identical tarball by passing the `skeleton` keyword to the `create`
@@ -156,6 +163,13 @@ is encountered while extracting `old_tarball` and the entry is skipped unless
an archive, to skip entries that would cause `extract` to throw an error, or to
record what content is encountered during the rewrite process.

Before it is passed to the predicate function, the `Header` object is somewhat
modified from the raw header in the tarball: the `path` field is normalized to
remove `.` entries and replace multiple consecutive slashes with a single slash.
If the entry has type `:hardlink`, the link target path is normalized the same
way so that it will match the path of the target entry; the size field is set to
the size of the target path (which must be an already-seen file).

### Tar.tree_hash

```jl
@@ -187,6 +201,13 @@ is encountered while processing `tarball` and an entry is only hashed if
archive, to skip entries that cause `extract` to throw an error, or to record
what is extracted during the hashing process.

Before it is passed to the predicate function, the `Header` object is somewhat
modified from the raw header in the tarball: the `path` field is normalized to
remove `.` entries and replace multiple consecutive slashes with a single slash.
If the entry has type `:hardlink`, the link target path is normalized the same
way so that it will match the path of the target entry; the size field is set to
the size of the target path (which must be an already-seen file).

Currently supported values for `algorithm` are `git-sha1` (the default) and
`git-sha256`, which uses the same basic algorithm as `git-sha1` but replaces the
SHA1 hash function with SHA2-256, the hash function that git will transition to
@@ -362,18 +383,16 @@ supports only the following file types:
* plain files
* directories
* symlinks
* hardlinks (extracted as copies)

The `Tar` package does not support other file types that the TAR format can
represent, including: hard links, character devices, block devices, and FIFOs.
If you attempt to create or extract an archive that contains any of these kinds
of entries, `Tar` will raise an error. You can, however, list the contents of a
represent, including: character devices, block devices, and FIFOs. If you
attempt to create or extract an archive that contains any of these kinds of
entries, `Tar` will raise an error. You can, however, list the contents of a
tarball containing other kinds of entries by passing the `strict=false` flag to
the `list` function; without this option, `list` raises the same error as
`extract` would.

In the future, optional support may be added for using hard links within
archives to avoid duplicating identical files.

### Time Stamps

Also in accordance with its design goal as a data transfer tool, the `Tar`
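The header normalization described in the README additions above can be illustrated with a small sketch. `normalize_tar_path` is a hypothetical helper, not part of the `Tar` API; it only mirrors the documented behavior (dropping `.` entries and collapsing consecutive slashes) and does not handle `..`, which the package rejects separately.

```julia
# Hypothetical helper for illustration only; not a function exported by Tar.
function normalize_tar_path(path::AbstractString)
    # Drop empty components (from repeated or trailing slashes) and "." entries.
    parts = filter(split(path, '/')) do part
        !isempty(part) && part != "."
    end
    return join(parts, '/')
end

normalize_tar_path("./dir//sub/./file.txt")  # "dir/sub/file.txt"
normalize_tar_path("dir/file/")              # "dir/file"
```

Applying the same function to a hard link's target is what lets the target string match the normalized path of an earlier entry.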
21 changes: 21 additions & 0 deletions src/Tar.jl
@@ -175,6 +175,13 @@ is encountered while extracting `tarball` and the entry is only extracted if the
an archive, to skip entries that cause `extract` to throw an error, or to record
what is extracted during the extraction process.

Before it is passed to the predicate function, the `Header` object is somewhat
modified from the raw header in the tarball: the `path` field is normalized to
remove `.` entries and replace multiple consecutive slashes with a single slash.
If the entry has type `:hardlink`, the link target path is normalized the same
way so that it will match the path of the target entry; the size field is set to
the size of the target path (which must be an already-seen file).

If the `skeleton` keyword is passed then a "skeleton" of the extracted tarball
is written to the file or IO handle given. This skeleton file can be used to
recreate an identical tarball by passing the `skeleton` keyword to the `create`
@@ -251,6 +258,13 @@ is encountered while extracting `old_tarball` and the entry is skipped unless
`predicate(hdr)` is true. This can be used to selectively rewrite only parts of
an archive, to skip entries that would cause `extract` to throw an error, or to
record what content is encountered during the rewrite process.

Before it is passed to the predicate function, the `Header` object is somewhat
modified from the raw header in the tarball: the `path` field is normalized to
remove `.` entries and replace multiple consecutive slashes with a single slash.
If the entry has type `:hardlink`, the link target path is normalized the same
way so that it will match the path of the target entry; the size field is set to
the size of the target path (which must be an already-seen file).
"""
function rewrite(
predicate::Function,
@@ -301,6 +315,13 @@ is encountered while processing `tarball` and an entry is only hashed if
archive, to skip entries that cause `extract` to throw an error, or to record
what is extracted during the hashing process.

Before it is passed to the predicate function, the `Header` object is somewhat
modified from the raw header in the tarball: the `path` field is normalized to
remove `.` entries and replace multiple consecutive slashes with a single slash.
If the entry has type `:hardlink`, the link target path is normalized the same
way so that it will match the path of the target entry; the size field is set to
the size of the target path (which must be an already-seen file).

Currently supported values for `algorithm` are `git-sha1` (the default) and
`git-sha256`, which uses the same basic algorithm as `git-sha1` but replaces the
SHA1 hash function with SHA2-256, the hash function that git will transition to
15 changes: 12 additions & 3 deletions src/create.jl
@@ -54,10 +54,19 @@ function rewrite_tarball(
end
node = node′
end
if !(hdr.type == :directory && get(node, name, nothing) isa Dict)
node[name] = (hdr, position(old_tar))
if hdr.type == :hardlink
node′ = tree
for part in split(hdr.link, '/')
node′ = node′[part]
end
hdr′ = Header(node′[1], path=hdr.path, mode=hdr.mode)
node[name] = (hdr′, node′[2])
else
if !(hdr.type == :directory && get(node, name, nothing) isa Dict)
node[name] = (hdr, position(old_tar))
end
skip_data(old_tar, hdr.size)
end
skip_data(old_tar, hdr.size)
end
write_tarball(new_tar, tree, buf=buf) do node, tar_path
if node isa Dict
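The hard-link branch in `rewrite_tarball` resolves `hdr.link` by walking the nested tree one path component at a time, then reuses the target's stored data position. A minimal sketch of that lookup follows; `lookup_node` and the sample tree are assumptions for illustration, and the leaf tuple is a placeholder for the real `(Header, position)` pair.

```julia
# Illustrative only: `lookup_node` is a hypothetical name, and the leaf
# tuple stands in for the (header, data position) pair stored in the tree.
function lookup_node(tree::AbstractDict, link::AbstractString)
    node = tree
    for part in split(link, '/')
        node = node[part]  # throws a KeyError if the component was never recorded
    end
    return node
end

tree = Dict{String,Any}(
    "dir" => Dict{String,Any}("file" => ("file header", 1024)),
)
lookup_node(tree, "dir/file")  # ("file header", 1024)
```

Because the link entry points at the target's data position rather than its own, the rewritten tarball can emit the link as an ordinary file with the same contents.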
87 changes: 65 additions & 22 deletions src/extract.jl
@@ -79,8 +79,16 @@ function extract_tarball(
mkdir(sys_path)
elseif hdr.type == :symlink
copy_symlinks || symlink(hdr.link, sys_path)
elseif hdr.type == :hardlink
src_path = joinpath(root, hdr.link)
cp(src_path, sys_path)
elseif hdr.type == :file
read_data(tar, sys_path, size=hdr.size, buf=buf)
else # should already be caught by check_header
error("unsupported tarball entry type: $(hdr.type)")
end
# apply tarball permissions
if hdr.type in (:file, :hardlink)
exec = 0o100 & hdr.mode != 0
tar_mode = exec ? 0o755 : 0o644
sys_mode = filemode(sys_path)
@@ -93,21 +101,19 @@ function extract_tarball(
# we don't have a way to do that afaik
end
chmod(sys_path, tar_mode & sys_mode)
else # should already be caught by check_header
error("unsupported tarball entry type: $(hdr.type)")
end
end
copy_symlinks || return

# resolve the internal targets of symlinks
for (path, what) in paths
what isa AbstractString || continue
what isa String || continue
target = link_target(paths, path, what)
paths[path] = something(target, :symlink)
end

# follow chains of symlinks
follow(seen::Vector, what::Symbol) =
follow(seen::Vector, what::Any) =
what == :symlink ? what : seen[end]
follow(seen::Vector, what::String) =
what in seen ? :symlink : follow(push!(seen, what), paths[what])
@@ -159,7 +165,7 @@ end

# resolve symlink target or nothing if not valid
function link_target(
paths::Dict{String,Union{String,Symbol}},
paths::Dict{String},
path::AbstractString,
link::AbstractString,
)
@@ -220,12 +226,18 @@ function git_tree_hash(
node[name] = Dict{String,Any}()
end
return
end
if hdr.type == :symlink
elseif hdr.type == :symlink
mode = "120000"
hash = git_object_hash("blob", HashType) do io
write(io, hdr.link)
end
elseif hdr.type == :hardlink
mode = iszero(hdr.mode & 0o100) ? "100644" : "100755"
node′ = tree
for part in split(hdr.link, '/')
node′ = node′[part]
end
hash = node′[2] # hash of linked file
elseif hdr.type == :file
mode = iszero(hdr.mode & 0o100) ? "100644" : "100755"
hash = git_file_hash(tar, hdr.size, HashType, buf=buf)
@@ -332,31 +344,62 @@ function read_tarball(
)
write_skeleton_header(skeleton, buf=buf)
# symbols for path types except symlinks store the link
paths = Dict{String,Union{Symbol,String}}()
paths = Dict{String,Any}()
globals = Dict{String,String}()
while !eof(tar)
hdr = read_header(tar, globals=globals, buf=buf, tee=skeleton)
hdr === nothing && break
# check if we should extract or skip
if !predicate(hdr)
skip_data(tar, hdr.size)
continue
end
check_header(hdr)
err = nothing
# normalize path and check for symlink attacks
path = ""
for part in split(hdr.path, '/')
# check_header checks for ".." later
(isempty(part) || part == ".") && continue
# check_header doesn't allow ".." in path
get(paths, path, nothing) isa String && error("""
Refusing to extract path with symlink prefix, possible attack
* path to extract: $(repr(hdr.path))
* symlink prefix: $(repr(path))
""")
isempty(path) || (paths[path] = :directory)
if err === nothing && get(paths, path, nothing) isa String
err = """
Tarball contains path with symlink prefix:
- path = $(repr(hdr.path))
- prefix = $(repr(path))
Refusing to extract — possible attack!
"""
end
path = isempty(path) ? part : "$path/$part"
end
paths[path] = hdr.type == :symlink ? hdr.link : hdr.type
hdr′ = Header(hdr, path=path)
# check that hardlinks refer to already-seen files
if err === nothing && hdr.type == :hardlink
parts = filter!(split(hdr.link, '/')) do part
# check_header checks for ".." later
!isempty(part) && part != "."
end
link = join(parts, '/')
hdr = Header(hdr, link=link)
hdr′ = Header(hdr′, link=link)
what = get(paths, link, Symbol("non-existent"))
if what isa Integer # plain file
hdr′ = Header(hdr′, size=what)
else
err = """
Tarball contains hardlink with $what target:
- path = $(repr(hdr.path))
- target = $(repr(hdr.link))
Refusing to extract — possible attack!
"""
end
end
# check if we should extract or skip
if !predicate(hdr′) # pass normalized header
skip_data(tar, hdr.size)
continue
end
check_header(hdr)
err === nothing || error(err)
# record info about path
paths[path] =
hdr.type == :symlink ? hdr.link :
hdr.type == :file ? hdr.size :
hdr.type
# apply callback, checking that it consumes IO correctly
before = applicable(position, tar) ? position(tar) : 0
callback(hdr, split(path, '/', keepempty=false))
applicable(position, tar) || continue
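In the `read_tarball` changes, each seen path is recorded as the link string for a symlink, the size for a plain file, and the entry type otherwise. A hard link is accepted only when its target maps to an integer size. The sketch below mimics that bookkeeping; `hardlink_target_size` and the sample `paths` entries are hypothetical and not part of `Tar`.

```julia
# Hypothetical helper mirroring the target check; not part of Tar.
# Returns the target's size, or `nothing` when the target is not an
# already-seen plain file.
function hardlink_target_size(paths::AbstractDict, link::AbstractString)
    what = get(paths, link, Symbol("non-existent"))
    return what isa Integer ? what : nothing
end

paths = Dict{String,Any}(
    "dir"      => :directory,  # directories are stored by type
    "dir/file" => 42,          # plain files are stored by size
    "dir/link" => "file",      # symlinks are stored by link string
)

hardlink_target_size(paths, "dir/file")  # 42
hardlink_target_size(paths, "dir/link")  # nothing (target is a symlink)
hardlink_target_size(paths, "missing")   # nothing (target never seen)
```

Storing the size for plain files is what allows the normalized header passed to the predicate to carry the target's size for a hard link entry.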
12 changes: 9 additions & 3 deletions src/header.jl
@@ -99,12 +99,18 @@ function check_header(hdr::Header)
err("path is absolute")
occursin(r"(^|/)\.\.(/|$)", hdr.path) &&
err("path contains '..' component")
hdr.type in (:file, :symlink, :directory) ||
hdr.type in (:file, :hardlink, :symlink, :directory) ||
err("unsupported entry type")
hdr.type ∉ (:hardlink, :symlink) && !isempty(hdr.link) &&
err("non-link with link path")
hdr.type == :symlink && hdr.size != 0 &&
err("symlink with non-zero size")
hdr.type ∈ (:hardlink, :symlink) && isempty(hdr.link) &&
err("$(hdr.type) with empty link path")
hdr.type ∈ (:hardlink, :symlink) && hdr.size != 0 &&
err("$(hdr.type) with non-zero size")
hdr.type == :hardlink && hdr.link[1] == '/' &&
err("hardlink with absolute link path")
hdr.type == :hardlink && occursin(r"(^|/)\.\.(/|$)", hdr.link) &&
err("hardlink contains '..' component")
hdr.type == :directory && hdr.size != 0 &&
err("directory with non-zero size")
hdr.type != :directory && endswith(hdr.path, "/") &&
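The new hard-link checks in `check_header` reject empty, absolute, and `..`-containing link paths. As a rough standalone sketch of just those link-path rules (the function name `hardlink_link_problem` is an assumption, and the real code raises an error through `err` rather than returning a message):

```julia
# Hypothetical standalone version of the hardlink link-path checks.
# Returns an error message, or `nothing` if the link path is acceptable.
function hardlink_link_problem(link::AbstractString)
    isempty(link) && return "hardlink with empty link path"
    first(link) == '/' && return "hardlink with absolute link path"
    occursin(r"(^|/)\.\.(/|$)", link) &&
        return "hardlink contains '..' component"
    return nothing
end

hardlink_link_problem("dir/file")     # nothing
hardlink_link_problem("/etc/passwd")  # "hardlink with absolute link path"
hardlink_link_problem("dir/../file")  # "hardlink contains '..' component"
```

The emptiness check comes first so that inspecting the first character is safe, which matches the order of the checks in the diff.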
12 changes: 11 additions & 1 deletion test/setup.jl
@@ -61,15 +61,25 @@ function make_test_tarball(tar_create::Function = Tar.create)
dir′ = joinpath(dir, "s"^b)
mkpath(dir′)
push!(paths, dir′)
path = paths[i += 1]
link = joinpath(dir, "l"^b)
target = relpath(paths[i += 1], link)
target = relpath(path, link)
symlink(target, link)
push!(paths, link)
broken = joinpath(dir, "b"^b)
if target != "."
symlink(chop(target), broken)
push!(paths, broken)
end
isfile(path) || continue
hard = joinpath(dir, "h"^b)
mode = isodd(i) ? 0o755 : 0o644
if Sys.which("ln") !== nothing
run(`ln $path $hard`)
else
cp(path, hard)
end
chmod(hard, mode)
end
end
end
Expand Down