A Python script to compress text generated by large language models (LLMs), shrinking data size without losing key context or meaning. It offers multiple compression types—like key points, bullet points, and paraphrasing—so you can keep exactly what you need. Perfect for applications needing efficient storage and transmission.
- Overview
- Features
- Requirements
- Installation
- Usage
- Compression Types
- Example Input File
- Customization
- Known Issues
- License
## Overview

This script reads a large text file, splits it into manageable chunks, and compresses each chunk using OpenAI's language model to reduce the text to a specified target token size. It iteratively shrinks the text until it reaches the target token count, increasing compression aggressiveness as needed to prevent endless loops.
## Features

- **Multiple Compression Techniques**: Choose from options like summarizing, outlining, paraphrasing, extracting keywords, and more.
- **Target Token Count**: Define the exact token count you want for the output.
- **JSON Support**: Output in JSON format if desired.
- **Automatic Loop Prevention**: The script increases compression aggression after each iteration to avoid getting stuck in a loop.
- **Customizable Parameters**: Modify the compression type and output format through command-line arguments.
## Requirements

- Python 3.7+
- OpenAI API key
- The following Python packages:
  - `tiktoken`
  - `openai`
  - `argparse`
## Installation

1. Clone this repository:

   ```bash
   git clone https://github.com/yourusername/llm-text-compressor.git
   cd llm-text-compressor
   ```

2. Install the required Python packages:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up your OpenAI API key as an environment variable:

   ```bash
   export OPENAI_API_KEY="your_openai_api_key"
   ```
## Usage

```bash
python llm_text_compressor.py --large_text <path_to_text_file> --token_target <token_count> --compressor_type <type> [--json] [--return_str]
```
- `--large_text`: Path to the large text file you want to compress (e.g., `us_constitution.txt`).
- `--token_target`: Target number of tokens for the final output.
- `--compressor_type`: Optional. Type of compression. Options include:
  - `auto_detect` (default)
  - `bullet_points`
  - `glossary_terms`
  - `outline`
  - `critical_analysis`
  - `facts_database`
  - `keywords_keyphrases`
- `--model_name`: Optional. Defaults to `gpt-4o-mini`.
- `--json`: Optional. Use this flag to output in JSON format.
- `--return_str`: Optional. Use this flag to return the result as a string.
```bash
python llm_text_compressor.py --large_text sample_files/us_constitution.txt --token_target 2000 --compressor_type facts_database --model_name gpt-4o-mini --json --return_str
```

This command takes the `us_constitution.txt` file, compresses it to approximately 2,000 tokens, and extracts "facts" in JSON format using the `gpt-4o-mini` model.
## Compression Types

The `compressor_type` parameter lets you choose the method of compression. Here's a breakdown of the options:

- `auto_detect`: Analyzes the text and chooses a suitable compression type. This is the default.
- `bullet_points`: Summarizes the text using bullet points for clarity and quick reading.
- `glossary_terms`: Extracts and defines key terms as a glossary.
- `outline`: Structures the summary with headings and subheadings to capture the flow and main points.
- `critical_analysis`: Analyzes the main points, discussing strengths, weaknesses, and underlying themes.
- `facts_database`: Extracts factual statements, key details, and verifiable information from the text.
- `keywords_keyphrases`: Distills the text into a list of keywords or phrases that represent the core content, ideal for quick reference.
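Internally, each compression type presumably maps to a different instruction in the prompt sent to the model. A hypothetical sketch of that mapping — the dictionary, function, and prompt wording here are illustrative, not the script's actual prompts:

```python
# Hypothetical mapping from compressor_type to a prompt instruction.
COMPRESSOR_PROMPTS = {
    "bullet_points": "Summarize the text as concise bullet points.",
    "glossary_terms": "Extract and define key terms as a glossary.",
    "outline": "Summarize the text as an outline with headings and subheadings.",
    "critical_analysis": "Analyze the main points, strengths, weaknesses, and themes.",
    "facts_database": "Extract factual statements and verifiable details.",
    "keywords_keyphrases": "Distill the text into keywords and keyphrases.",
}


def build_prompt(compressor_type, token_target, chunk):
    """Assemble the instruction, target size, and chunk into one prompt."""
    instruction = COMPRESSOR_PROMPTS.get(
        compressor_type,
        # auto_detect falls through to letting the model pick a style.
        "Choose the most suitable compression style for this text.",
    )
    return f"{instruction} Target roughly {token_target} tokens.\n\n{chunk}"
```

A table-driven design like this keeps new compression types cheap to add: one new dictionary entry rather than new branching logic.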
## Example Input File

You can use a sample text file, like the U.S. Constitution, for testing. Save it as `us_constitution.txt` and place it in the project directory.
## Customization

The compression can be fine-tuned by adjusting `token_target`, the compression type, and the aggression factor. By default, the script increases compression aggression by 10% each iteration when the target token count has not been met, which prevents endless loops; you can modify this factor within the script if needed.
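The loop-prevention behavior described above can be sketched as follows. This is a sketch of the idea, not the script's actual code: `summarize` stands in for the OpenAI call and `count_tokens` for tiktoken-based counting, and both names are hypothetical.

```python
def compress_until_target(text, token_target, summarize, count_tokens, max_iters=10):
    """Repeatedly compress text until it fits the target token count."""
    aggression = 1.0
    for _ in range(max_iters):
        if count_tokens(text) <= token_target:
            break  # target reached; stop iterating
        # Ask for a target shrunk by the current aggression factor.
        effective_target = max(1, int(token_target / aggression))
        text = summarize(text, effective_target)
        aggression *= 1.1  # become 10% more aggressive each pass
    return text
```

The hard cap on iterations plus the growing aggression factor is what keeps the loop from running forever even when the model under-compresses on a given pass.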
## Known Issues

- **Loop Prevention**: Although an aggression factor is applied, some texts may take multiple iterations to reach the target token count.
- **Accuracy of Compression**: Compression accuracy varies by text complexity and chosen compression type.
- **Token Count Limitations**: The actual output may not match the exact `token_target` due to model and tokenization limitations.
## License

This project is licensed under the MIT License. See the `LICENSE` file for details.