Extract vm images #16 #20

Merged
merged 50 commits into from
Jun 1, 2021
50 commits
a2a1901
Merge pull request #19 from nexB/update-boilerplate
steven-esser Jan 19, 2021
f46bc48
Default to 64 bits windows on CI
pombredanne Jan 24, 2021
182532f
Use wheels embedded in virtualenv.pyz
pombredanne Jan 25, 2021
cd4e87b
Do not force an upgrade on virtualenv.pyz embeds
pombredanne Jan 25, 2021
51510cb
Fix .gitattributes
steven-esser Feb 11, 2021
386bb90
Merge pull request #21 from nexB/fix-gitattributes
steven-esser Feb 12, 2021
371c11e
Use trace not debug for tracing
pombredanne Apr 6, 2021
056b6c1
Add new CLI option to support extracting all formats
pombredanne Apr 6, 2021
d6f8044
Add support for VM image extraction #16
pombredanne Apr 6, 2021
4b519f8
Remove empty READMEs from skeleton
pombredanne Apr 19, 2021
1fbf4e3
Complete support for VM image extraction #16
pombredanne Apr 22, 2021
aa5da29
Add CLI scripts copied from scancode-toolkit
pombredanne Apr 22, 2021
89d01bf
Use correct paths in script
pombredanne Apr 23, 2021
d0ba2c2
Expand supported VM images types
pombredanne Apr 23, 2021
1c84e9e
Merge pull request #24 from nexB/use-venv-embeds
pombredanne May 7, 2021
d6fe59f
Update markers syntax for pytest
pombredanne May 7, 2021
ca6ab21
Add fallback version for setuptools_scm
pombredanne May 7, 2021
1364bbb
Add note for setuptools_scam fallback version
pombredanne May 11, 2021
be851b0
Use azure-posix.yml for linux and macOS
pombredanne May 11, 2021
4f0aecf
Adopt new configure script derived from ScanCode
pombredanne May 11, 2021
aa04429
Add notes on customization
pombredanne May 11, 2021
56ada8f
Adopt new configure --dev convention
pombredanne May 11, 2021
0dbcdc9
Clarify CHANGELOG to be Rst
pombredanne May 11, 2021
d21aef3
Add skeleton release notes to README.rst
pombredanne May 11, 2021
ab707b5
Improve README
pombredanne May 11, 2021
8163e2a
Implement new envt. variables approach
pombredanne May 11, 2021
a75737d
Merge pull request #25 from nexB/new-configure
pombredanne May 12, 2021
acb85c0
Streamline default kinds code
pombredanne May 28, 2021
15d43d5
Format code
pombredanne May 28, 2021
a63c849
Work towards symlinks support
pombredanne May 28, 2021
a073286
Format code and streamline license headers
pombredanne May 30, 2021
ac9c505
Merge latest skeleton
pombredanne May 30, 2021
6c00362
Use new coomadn invocation for 7zip
pombredanne May 30, 2021
f358194
Improve kernel settings doc for VMs
pombredanne May 30, 2021
93d03e0
Merge branch 'main'
pombredanne May 31, 2021
3aeb2ec
Update format
pombredanne May 31, 2021
2c412e8
Add Python 3.9 to Travis
pombredanne May 31, 2021
69eec23
Format and remove spurious spaces
pombredanne May 31, 2021
1074c50
Install and configure libguesfs in CI
pombredanne May 31, 2021
08aa847
Update doc for release
pombredanne May 31, 2021
28528df
Skip vm image tests on Travis
pombredanne May 31, 2021
bfde189
Merge latest skeleton
pombredanne May 31, 2021
0e09ad9
Bump to more modern version of setuptools_scm
pombredanne May 31, 2021
e339a70
Add space for correct syntax
pombredanne May 31, 2021
06bcb03
Merge remote-tracking branch 'skeleton/main' into 16-vm-images
pombredanne May 31, 2021
a0a1436
Sort imports
pombredanne May 31, 2021
9d9d1a1
Improve failure reporting
pombredanne May 31, 2021
6e765bf
Add test documentation
pombredanne Jun 1, 2021
bbbffbc
Update documentation
pombredanne Jun 1, 2021
4a8ef69
Improve documentation and readability
pombredanne Jun 1, 2021
3 changes: 2 additions & 1 deletion .gitattributes
@@ -1,2 +1,3 @@
# Ignore all Git auto CR/LF line endings conversions
* binary
* -text
pyproject.toml export-subst
5 changes: 3 additions & 2 deletions .travis.yml
@@ -13,9 +13,10 @@ python:
- "3.6"
- "3.7"
- "3.8"
- "3.9"

# Scripts to run at install stage
install: ./configure
install: ./configure --dev

# Scripts to run at script stage
script: tmp/bin/pytest
script: tmp/bin/pytest --ignore=tests/test_vmimage.py
4 changes: 3 additions & 1 deletion AUTHORS.rst
@@ -2,10 +2,12 @@ The following organizations or individuals have contributed to this repo:

- Abhishek Kumar @Abhishek-Dev09
- AlexB @a-tinsmith
- Konrad Weihmann @priv-kweihmann
- Maximilian Huber @maxhbr
- Michael Rupprecht @michaelrup
- Philippe Ombredanne @pombredanne
- Pierre Tardy @tardyp
- Qingmin Duanmu @qduanmu
- Rakesh Balusa @balusarakesh
- Ravi Jain @JRavi2
- Steven Esser @majurg
- Steven Esser @majurg
48 changes: 31 additions & 17 deletions CHANGELOG.rst
@@ -1,36 +1,50 @@
Release notes
=============
Changelog
=========

vNext
-----
v (next)
--------


Version 21.1.21
---------------
v21.6.1
--------

- Bump dependencies and use latest typecode and binaries. This is to fix
installation problems on multiple OSes.
- Add support for VMDK, QCOW and VDI VM image filesystems extraction
- Add new configuration mechanism to get third-party binary paths:

- Use an environment variable
- Or use a plugin-provided path
- Or use well-known system installation locations
- Or use the system PATH
- Or fail with an informative error message

Version 21.1.21
---------------
- Update to use latest skeleton

- Add new [full] extra requires that install all the dependencies
- Fix bug related to commoncode libraries loading

v2021-2-24
----------

- Fix incorrect documentation link


v2021-1-21
----------

- Fix bug related to CommonCode libraries loading
- Improve the extra requirements
- Set minimum version for dependencies
- Improve documentation
- Reorganize tests files


Version 21.1.15
---------------
v2021-1-15
----------

- Drop support for Python 2
- Use the latest CommonCode and TypeCode libraries
- Add azure-pipelines CI support


Version 20.10
-------------
v20.10
------

- Initial release.
- Initial release as a split from ScanCode toolkit
21 changes: 5 additions & 16 deletions NOTICE
@@ -1,19 +1,8 @@
#
# Copyright (c) nexB Inc. and others.
# SPDX-License-Identifier: Apache-2.0
#
# Visit https://aboutcode.org and https://github.com/nexB/ for support and download.
# Copyright (c) nexB Inc. and others. All rights reserved.
# ScanCode is a trademark of nexB Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# SPDX-License-Identifier: Apache-2.0
# See http://www.apache.org/licenses/LICENSE-2.0 for the license text.
# See https://github.com/nexB/extractcode for support or download.
# See https://aboutcode.org for more information about nexB OSS projects.
#
244 changes: 231 additions & 13 deletions README.rst
@@ -4,35 +4,253 @@ ExtractCode
- license: Apache-2.0
- copyright: copyright (c) nexB. Inc. and others
- homepage_url: https://github.com/nexB/extractcode
- keywords: archive, extraction, libarchive, 7zip, scancode-toolkit
- keywords: archive, extraction, libarchive, 7zip, scancode-toolkit, extractcode

Supports Windows, Linux and macOS on 64-bit processors and Python 3.6 to 3.9.

ExtractCode is a universal archive extractor. It uses behind the scenes
the Python standard library, a custom ctypes binding to libarchive and
the 7zip command line to extract a large number of common and
less common archives and compressed files. It tries to extract things
in the same way on all OSes, including auto-renaming files that would
not have valid names on certain filesystems or when there are multiple
copies of the same path in a given archive.
The extraction is driven from a "voting" system that considers the
file extension(s) and name, the file type and mime type (using a ctypes
binding to libmagic) to select the most appropriate extractor or
uncompressor function. It can handle multi-level archives such as tar.gz.

**ExtractCode is a (mostly) universal archive extractor.**

Install with::

pip install extractcode[full]


Why another extractor?
----------------------

**it will extract!**

ExtractCode will extract things where other extractors may fail.

- Say you want to extract the tarball of the Linux kernel source code on Windows.
It contains paths that differ only by case and therefore will not extract
correctly on Windows: some files may be overwritten or the extraction may fail.

- Or a tarball (on any OS) may contain the exact same path multiple times. In
these cases the paths showing up earlier in the archive may be "hidden" and
overwritten by the same path showing up later in the archive, giving the
impression that there is only one file.

- Or an archive may be damaged a little but most files can still be extracted.

- Or the extracted files have permissions such that you cannot read them, or
they are not owned by you.

- Or the archive may contain weird paths, including relative paths, that may be
problematic to extract.

- Or the archive may contain special file types (character/device files) that
may be problematic to extract.

- Or an archive may be a virtual disk or some file system(s) images that would
typically need to be mounted to be accessed, and may require root access
and guesswork to find out which partition and filesystem are at play and
which driver to use.

In all these cases, ExtractCode will extract and try hard to do the right thing
to obtain the actual archived content when other tools may fail.

It can also recursively extract any type of (nested) archives-in-archives.

As a downside, the extracted content may not be exactly what you would need in
order to use the contained files: for instance ... but this is perfectly OK for
file content analysis for software composition or forensic analysis.

Behind the scenes, ExtractCode uses multiple tools such as:

- the Python standard library,
- a custom ctypes binding to libarchive,
- the 7zip command line tool, and
- optionally libguestfs on Linux.

With these, it is possible to extract a large number of common and less common
archives and compressed file types. ExtractCode tries to extract things in the
same way on all supported OSes, including auto-renaming files that would have
invalid, non-extractible names on certain filesystems or when there are multiple
copies of the same path in a given archive (which is possible in a tar).
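
The auto-renaming idea can be illustrated with a minimal sketch. It is not ExtractCode's actual renaming scheme: the function name ``dedupe_paths`` and the ``_N`` suffix convention are assumptions made for this example only. It shows how paths that are identical, or that differ only by case, can each be given a distinct name so that none overwrites another on a case-insensitive filesystem.

```python
def dedupe_paths(paths):
    """
    Return a new list of paths where no two entries collide when compared
    case-insensitively. Colliding entries get a numeric suffix.
    Hypothetical helper: the suffix scheme is an assumption, not the
    scheme ExtractCode uses.
    """
    seen = set()
    result = []
    for path in paths:
        candidate = path
        counter = 0
        # Keep trying new suffixes until the lowercased name is unused.
        while candidate.lower() in seen:
            counter += 1
            candidate = f"{path}_{counter}"
        seen.add(candidate.lower())
        result.append(candidate)
    return result


if __name__ == "__main__":
    # Two entries differ only by case and one is an exact repeat.
    print(dedupe_paths(["a/README", "a/readme", "a/README"]))
```

With this approach every archive member ends up on disk, which is the property that matters for content analysis.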

The extraction is driven from a "voting" system that considers the file
extension(s) and name, the filetype and mimetype (using a ctypes binding to
libmagic) to select the most appropriate extractor or decompressor function.
It can handle multi-level archives such as tar.gz and can extract recursively
any nested archives.
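
As a rough illustration of such a "voting" selection, here is a minimal sketch. The lookup tables, the handler names and the ``pick_handler`` function are all hypothetical and much simpler than the real selection logic; the file type and mime type strings are passed in as plain arguments rather than detected with libmagic.

```python
from collections import Counter

# Hypothetical lookup tables: each maps one kind of evidence to a handler name.
EXTENSION_VOTES = {".zip": "zip", ".tar": "tar", ".gz": "gzip"}
FILETYPE_VOTES = {"zip archive data": "zip", "tar archive": "tar", "gzip compressed data": "gzip"}
MIMETYPE_VOTES = {"application/zip": "zip", "application/x-tar": "tar", "application/gzip": "gzip"}


def pick_handler(name, filetype, mimetype):
    """
    Return the handler name with the most votes, or None if nothing votes.
    Each source of evidence (extension, file type, mime type) casts at most
    one vote per matching table entry.
    """
    votes = Counter()
    lowered_name = name.lower()
    lowered_type = filetype.lower()
    for extension, handler in EXTENSION_VOTES.items():
        if lowered_name.endswith(extension):
            votes[handler] += 1
    for marker, handler in FILETYPE_VOTES.items():
        if marker in lowered_type:
            votes[handler] += 1
    handler = MIMETYPE_VOTES.get(mimetype)
    if handler:
        votes[handler] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(pick_handler("src.zip", "Zip archive data, at least v2.0", "application/zip"))
```

In this sketch a ``.tar.gz`` file would be handled as gzip first; the resulting tar would then be picked up by a second pass, which is one way to deal with multi-level archives.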

Visit https://aboutcode.org and https://github.com/nexB/ for support and download.


We run CI tests on:

- Azure pipelines https://dev.azure.com/nexB/extractcode/_build


Installation
------------

To install this package with its full capability (where the binaries for
7zip and libarchive are installed), use the `full` extra option::

pip install extractcode[full]

If you want to use the version of binaries (possibly) provided by your operating
system, use the `minimal` option::

pip install extractcode

In this case, you will need to provide a working and compatible libarchive and
7zip installed and configured in one of these ways such that ExtractCode can
find them:

- **a typecode-libarchive and typecode-7z plugin**: See the standard ones at
https://github.com/nexB/scancode-plugins/tree/main/builtins
These can either bundle a libarchive library or a 7z executable, or expose
system-installed ones.
They do so by providing plugin entry points as ``scancode_location_provider``
for ``extractcode_libarchive`` that should point to a ``LocationProviderPlugin``
subclass with a ``get_locations()`` method that must return a mapping with
this key:

- 'extractcode.libarchive.dll': the absolute path to a **libarchive** shared object/DLL

See for example:

- https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_libarchive-linux/setup.py#L40
- https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_libarchive-linux/src/extractcode_libarchive/__init__.py#L17

And in the same way, the ``scancode_location_provider`` for ``extractcode_7zip``
should point to a ``LocationProviderPlugin`` subclass with a ``get_locations()``
method that must return a mapping with this key:

- 'extractcode.sevenzip.exe': the absolute path to a **7zip** executable

See for example:

- https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_7z-linux/setup.py#L40
- https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_7z-linux/src/extractcode_7z/__init__.py#L18

- use **environment variables** to point to installed binaries:

- EXTRACTCODE_LIBARCHIVE_PATH: the absolute path to a libarchive DLL
- EXTRACTCODE_7Z_PATH: the absolute path to a 7zip executable


- **a system-installed libarchive and 7zip executable** available in the system **PATH**.
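
The plugin option described in the list above can be sketched as follows. This is only an outline under stated assumptions: the base class below is a stand-in defined locally so that the example runs on its own (a real plugin would subclass the ``LocationProviderPlugin`` used in the linked examples), and the class name ``SystemLibarchivePaths`` and the library path are hypothetical.

```python
class LocationProviderPlugin:
    """
    Stand-in for the real base class, defined here only to keep the
    sketch self-contained.
    """

    def get_locations(self):
        raise NotImplementedError


class SystemLibarchivePaths(LocationProviderPlugin):
    """
    Hypothetical plugin exposing a system-installed libarchive.
    """

    # Assumed location of the shared library; adjust for the actual system.
    LIBARCHIVE_PATH = "/usr/lib/x86_64-linux-gnu/libarchive.so.13"

    def get_locations(self):
        # The key is the one named in the README text above; the value must
        # be the absolute path to the libarchive shared object/DLL.
        return {"extractcode.libarchive.dll": self.LIBARCHIVE_PATH}


if __name__ == "__main__":
    print(SystemLibarchivePaths().get_locations())
```

A 7zip plugin would look the same, returning the ``extractcode.sevenzip.exe`` key instead. The class still has to be registered as a ``scancode_location_provider`` entry point in the plugin packaging, as in the linked ``setup.py`` examples.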


The supported binary tools versions are:

- libarchive 3.5.x
- 7zip 16.5.x


Development
-----------

To set up the development environment::

source configure
source configure --dev


To run unit tests::

pytest -vvs -n 2


To clean up development environment::

./configure --clean


To run the command line tool in the activated environment::

./extractcode -h


Configuration with environment variables
----------------------------------------

ExtractCode will use these environment variables if set:

- EXTRACTCODE_LIBARCHIVE_PATH : the path to the ``libarchive.so`` libarchive
shared library used to support some of the archive formats. If not provided,
ExtractCode will look for a plugin-provided libarchive library path. See
https://github.com/nexB/scancode-plugins/tree/main/builtins for such plugins.
If no plugin contributes libarchive, then a final attempt is made to look for
it in the PATH using standard DLL loading techniques.

- EXTRACTCODE_7Z_PATH : the path to the ``7z`` 7zip executable used to support
some of the archive formats. If not provided, ExtractCode will look for a
plugin-provided 7z executable path. See
https://github.com/nexB/scancode-plugins/tree/main/builtins for such plugins.
If no plugin contributes 7z, then a final attempt is made to look for
it in the PATH.

- EXTRACTCODE_GUESTFISH_PATH : the path to the ``guestfish`` tool from
libguestfs to use to extract VM images. If not provided, ExtractCode will look
in the PATH for an installed ``guestfish`` executable instead.
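
The lookup order described above (environment variable, then plugin, then PATH, then failure) can be sketched like this. The function ``find_tool`` and its parameters are hypothetical and only illustrate the fallback chain; the plugin step is reduced to an optional path argument and the shared-library loading case is not covered.

```python
import os
import shutil


def find_tool(env_var, exe_name, plugin_path=None, environ=None):
    """
    Return the path to a tool, trying in order: the environment variable,
    a plugin-provided path, then the system PATH.
    Raise FileNotFoundError with an informative message if all fail.
    Hypothetical helper, not the project's actual function.
    """
    environ = os.environ if environ is None else environ

    # 1. An explicitly configured environment variable wins.
    from_env = environ.get(env_var)
    if from_env:
        return from_env

    # 2. Otherwise use a path contributed by a plugin, if any.
    if plugin_path:
        return plugin_path

    # 3. Otherwise search the system PATH.
    from_path = shutil.which(exe_name)
    if from_path:
        return from_path

    # 4. Fail with a message that tells the user how to fix the problem.
    raise FileNotFoundError(
        f"{exe_name} not found: set {env_var}, install a plugin that "
        f"provides it, or add it to the PATH."
    )


if __name__ == "__main__":
    try:
        print(find_tool("EXTRACTCODE_GUESTFISH_PATH", "guestfish"))
    except FileNotFoundError as error:
        print(error)
```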



Adding support for VM images extraction
---------------------------------------

Adding support for VM images requires the manual installation of the
libguestfs-tools system package. This is supported only on Linux.
On Debian and Ubuntu you can use this command::

sudo apt-get install libguestfs-tools


On Ubuntu only, an additional manual step is required because the kernel
executable file cannot be read by regular users, as libguestfs requires.

Run this command as a temporary and immediate fix::

sudo chmod 0644 /boot/vmlinuz-*
for k in /boot/vmlinuz-*
do sudo dpkg-statoverride --add --update root root 0644 "$k"
done

You likely want both this temporary fix and a more permanent fix; otherwise each
kernel update will revert to the default permissions and ExtractCode will stop
working for VM images extraction.

Therefore follow these instructions:

1. As sudo, create the file /etc/kernel/postinst.d/statoverride with this
content, devised by Kees Cook (@kees) in
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/759725/comments/3 ::

#!/bin/sh
version="$1"
# passing the kernel version is required
[ -z "${version}" ] && exit 0
dpkg-statoverride --update --add root root 0644 /boot/vmlinuz-${version}

2. Set executable permissions::

sudo chmod +x /etc/kernel/postinst.d/statoverride

See also these links for a complete discussion:

- https://bugs.launchpad.net/ubuntu/+source/linux/+bug/759725
- https://bugzilla.redhat.com/show_bug.cgi?id=1670790
- https://bugs.launchpad.net/ubuntu/+source/libguestfs/+bug/1813662/comments/24
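
To check whether the kernel permission issue described above affects a given machine, a small script along these lines can help. It is a sketch only: the function name ``unreadable_kernels`` is made up for this example, and the default pattern simply mirrors the ``/boot/vmlinuz-*`` path used in the commands above.

```python
import glob
import os


def unreadable_kernels(pattern="/boot/vmlinuz-*"):
    """
    Return the sorted list of files matching ``pattern`` that the current
    user cannot read. An empty list means no unreadable kernel was found.
    """
    return sorted(
        path for path in glob.glob(pattern)
        if not os.access(path, os.R_OK)
    )


if __name__ == "__main__":
    blocked = unreadable_kernels()
    if blocked:
        print("Kernel files not readable by the current user:")
        for path in blocked:
            print(" ", path)
    else:
        print("No unreadable kernel files found.")
```

Running it as a regular user before and after applying the fix shows whether the permissions were changed as intended.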


Alternative
-----------

These other tools are related and were considered before creating ExtractCode:

These tools provide built-in, original extraction capabilities:

- https://libarchive.org/ (integrated in ExtractCode) (BSD license)
- https://www.7-zip.org/ (integrated in ExtractCode) (LGPL license)
- https://theunarchiver.com/command-line (maintenance status unknown) (LGPL license)

These tools are command line tools wrapping other extraction tools and are
similar to ExtractCode but with different goals:

- https://github.com/wummel/patool (wrapper on many CLI tools) (GPL license)
- https://github.com/dtrx-py/dtrx (wrapper on a few CLI tools) (recently revived) (GPL license)