text-compression

This is a simple compression program, which utilizes Huffman encoding to compress text, so that it can be stored efficiently.

The compressed data is stored in the form of blocks in a custom file format.

Usage

$ ./tc compress <input_file> <output_file>
$ ./tc decompress <input_file> <output_file>

File format specification and design

All values are stored in Little endian format

A compressed file contains the file header, which is 14 bytes in size, followed by a list of blocks, which contain the compressed data using Huffman encoding.

┌──────────────────────────┬──────────┬──────────┬─────────────┐
│  File header (14 bytes)  │ Block 0  │  Block 1 │ .....       │
└──────────────────────────┴──────────┴──────────┴─────────────┘

File header format

Name	Offset	Size(in bytes)	Value	Comments
Magic number	0	4	`0xC 0x0 0xD 0xE`	Identifies the file
File version	4	2	Stores the file format version (unsigned integer)	Currently `1`
Reserved	6	8	-	Reserved for future use
Blocks	14	-	List of blocks	-

Block format

A block consists of a block header, followed by the compressed data, the block header contains the size of the uncompressed data along with the Huffman tree used to encode the data (Canonical Huffman code).

Note: Offsets are with respect to the start of the block and not with respect to start of the file.

Name	Offset	Size(in bytes)	Value	Comments
Block type	0	1	Identifies the block type	For future use(i.e. to store uncompressible blocks directly)
Uncompressed size	1	8	Size of the uncompressed data(unsigned)	-
Compressed size	1	8	Size of the compressed data(unsigned)	-
Frequency table	9	256	Number of bits for each symbol	H[i] represents number of bits for byte with value i
Compressed data	-	Variable length	-	-

After Uncompressed size number of bytes have been processed, the next block begins.

For version 1, block_size is set to 64KB (64000 bytes)

Compression

This program operates on the file in a chunked manner, i.e. the entire file is not read to memory at once, this prevents the application from consuming too much memory when the input file is huge. A chunk is defined as a sequence of block_size bytes of data, which is written as one block.

The application reads one block of data from the input file and finds the frequence of occurency of each symbol in the file. This frequency table is used to create a Huffman coding table. The coding table is canonicalized so that the transmission of the Huffman tree becomes easier.

The size of the data in the block, along with the frequency table for the current block is written out. Then, the compressed bits of the block are written to the output file. Extra bits are considered padding bits and have to be ignored.

Decompression

First the magic number is verified to make sure that the file was produced by this program, then the Huffman tree is constructed with the help of the frequency of table. Then the input file is processed as a bit stream, and after Uncompressed size bytes have been processed, the next block is decompressed.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
profiler-data		profiler-data
src		src
subprojects		subprojects
tests		tests
.clang-format		.clang-format
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
development.sh		development.sh
meson.build		meson.build

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

text-compression

Usage

File format specification and design

File header format

Block format

Compression

Decompression

About

Languages

License

ananthvk/text-compressor

Folders and files

Latest commit

History

Repository files navigation

text-compression

Usage

File format specification and design

File header format

Block format

Compression

Decompression

About

Topics

Resources

License

Stars

Watchers

Forks

Languages