This is a simple compression program, which utilizes Huffman encoding to compress text, so that it can be stored efficiently.
The compressed data is stored in the form of blocks in a custom file format.
$ ./tc compress <input_file> <output_file>
$ ./tc decompress <input_file> <output_file>
All values are stored in Little endian format
A compressed file contains the file header, which is 14
bytes in size, followed by a list of blocks, which contain the compressed data using Huffman encoding.
┌──────────────────────────┬──────────┬──────────┬─────────────┐
│ File header (14 bytes) │ Block 0 │ Block 1 │ ..... │
└──────────────────────────┴──────────┴──────────┴─────────────┘
Name | Offset | Size(in bytes) | Value | Comments |
---|---|---|---|---|
Magic number | 0 | 4 | 0xC 0x0 0xD 0xE |
Identifies the file |
File version | 4 | 2 | Stores the file format version (unsigned integer) | Currently 1 |
Reserved | 6 | 8 | - | Reserved for future use |
Blocks | 14 | - | List of blocks | - |
A block consists of a block header, followed by the compressed data, the block header contains the size of the uncompressed data along with the Huffman tree used to encode the data (Canonical Huffman code).
Note: Offsets are with respect to the start of the block and not with respect to start of the file.
Name | Offset | Size(in bytes) | Value | Comments |
---|---|---|---|---|
Block type | 0 | 1 | Identifies the block type | For future use(i.e. to store uncompressible blocks directly) |
Uncompressed size | 1 | 8 | Size of the uncompressed data(unsigned) | - |
Compressed size | 1 | 8 | Size of the compressed data(unsigned) | - |
Frequency table | 9 | 256 | Number of bits for each symbol | H[i] represents number of bits for byte with value i |
Compressed data | - | Variable length | - | - |
After Uncompressed size
number of bytes have been processed, the next block begins.
For version 1, block_size
is set to 64KB
(64000 bytes)
This program operates on the file in a chunked manner, i.e. the entire file is not read to memory at once, this prevents the application from consuming too much memory when the input file is huge. A chunk is defined as a sequence of block_size
bytes of data, which is written as one block.
The application reads one block of data from the input file and finds the frequence of occurency of each symbol in the file. This frequency table is used to create a Huffman coding table. The coding table is canonicalized so that the transmission of the Huffman tree becomes easier.
The size of the data in the block, along with the frequency table for the current block is written out. Then, the compressed bits of the block are written to the output file. Extra bits are considered padding bits and have to be ignored.
First the magic number is verified to make sure that the file was produced by this program, then the Huffman tree is constructed with the help of the frequency of table. Then the input file is processed as a bit stream, and after Uncompressed size
bytes have been processed, the next block is decompressed.