Skip to content

A text compression program in C++, which performs basic compression using Huffman Code algorithm

License

Notifications You must be signed in to change notification settings

ananthvk/text-compressor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

text-compression

This is a simple compression program, which utilizes Huffman encoding to compress text, so that it can be stored efficiently.

The compressed data is stored in the form of blocks in a custom file format.

Usage

$ ./tc compress <input_file> <output_file>
$ ./tc decompress <input_file> <output_file>

File format specification and design

All values are stored in Little endian format

A compressed file contains the file header, which is 14 bytes in size, followed by a list of blocks, which contain the compressed data using Huffman encoding.

┌──────────────────────────┬──────────┬──────────┬─────────────┐
│  File header (14 bytes)  │ Block 0  │  Block 1 │ .....       │
└──────────────────────────┴──────────┴──────────┴─────────────┘

File header format

Name Offset Size(in bytes) Value Comments
Magic number 0 4 0xC 0x0 0xD 0xE Identifies the file
File version 4 2 Stores the file format version (unsigned integer) Currently 1
Reserved 6 8 - Reserved for future use
Blocks 14 - List of blocks -

Block format

A block consists of a block header, followed by the compressed data, the block header contains the size of the uncompressed data along with the Huffman tree used to encode the data (Canonical Huffman code).

Note: Offsets are with respect to the start of the block and not with respect to start of the file.

Name Offset Size(in bytes) Value Comments
Block type 0 1 Identifies the block type For future use(i.e. to store uncompressible blocks directly)
Uncompressed size 1 8 Size of the uncompressed data(unsigned) -
Compressed size 1 8 Size of the compressed data(unsigned) -
Frequency table 9 256 Number of bits for each symbol H[i] represents number of bits for byte with value i
Compressed data - Variable length - -

After Uncompressed size number of bytes have been processed, the next block begins.

For version 1, block_size is set to 64KB (64000 bytes)

Compression

This program operates on the file in a chunked manner, i.e. the entire file is not read to memory at once, this prevents the application from consuming too much memory when the input file is huge. A chunk is defined as a sequence of block_size bytes of data, which is written as one block.

The application reads one block of data from the input file and finds the frequence of occurency of each symbol in the file. This frequency table is used to create a Huffman coding table. The coding table is canonicalized so that the transmission of the Huffman tree becomes easier.

The size of the data in the block, along with the frequency table for the current block is written out. Then, the compressed bits of the block are written to the output file. Extra bits are considered padding bits and have to be ignored.

Decompression

First the magic number is verified to make sure that the file was produced by this program, then the Huffman tree is constructed with the help of the frequency of table. Then the input file is processed as a bit stream, and after Uncompressed size bytes have been processed, the next block is decompressed.

About

A text compression program in C++, which performs basic compression using Huffman Code algorithm

Topics

Resources

License

Stars

Watchers

Forks