A Python script to compress text generated by large language models (LLMs), shrinking data size without losing key context or meaning. It offers multiple compression types—like key points, bullet points, and paraphrasing—so you can keep exactly what you need. Perfect for applications needing efficient storage and transmission.
- Overview
- Features
- Requirements
- Installation
- Usage
- Compression Types
- Example Input File
- Customization
- Known Issues
- License
## Overview

This script reads a large text file, splits it into manageable chunks, and compresses each chunk using OpenAI's language model to reduce the text to a specified target token size. It iteratively shrinks the text until it reaches the target token count, increasing compression aggressiveness as needed to prevent endless loops.
## Features

- **Multiple Compression Techniques**: Choose from options like summarizing, outlining, paraphrasing, extracting keywords, and more.
- **Target Token Count**: Define the exact token count you want for the output.
- **JSON Support**: Output in JSON format if desired.
- **Automatic Loop Prevention**: The script increases compression aggression after each iteration to avoid getting stuck in a loop.
- **Customizable Parameters**: Modify the compression type and output format through command-line arguments.
## Requirements

- Python 3.7+
- OpenAI API key
- The following Python packages:
  - `tiktoken`
  - `openai`
  - `argparse`
## Installation

1. Clone this repository:

   ```bash
   git clone https://github.com/yourusername/llm-text-compressor.git
   cd llm-text-compressor
   ```

2. Install the required Python packages:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up your OpenAI API key as an environment variable:

   ```bash
   export OPENAI_API_KEY="your_openai_api_key"
   ```
## Usage

```bash
python llm_text_compressor.py --large_text <path_to_text_file> --token_target <token_count> --compressor_type <type> [--json] [--return_str]
```
- `--large_text`: Path to the large text file you want to compress (e.g., `us_constitution.txt`).
- `--token_target`: Target number of tokens for the final output.
- `--compressor_type`: Optional. Type of compression. Options include:
  - `auto_detect` (default)
  - `bullet_points`
  - `glossary_terms`
  - `outline`
  - `critical_analysis`
  - `facts_database`
  - `keywords_keyphrases`
- `--model_name`: Optional. Defaults to `gpt-4o-mini`.
- `--json`: Optional. Use this flag to output in JSON format.
- `--return_str`: Optional. Use this flag to return the result as a string.
```bash
python llm_text_compressor.py --large_text sample_files/us_constitution.txt --token_target 2000 --compressor_type facts_database --model_name gpt-4o-mini --json --return_str
```

This command takes the `us_constitution.txt` file, compresses it to approximately 2,000 tokens, and extracts "facts" in JSON format using the `gpt-4o-mini` model.
## Compression Types

The `compressor_type` parameter lets you choose the method of compression. Here's a breakdown of the options:

- `auto_detect`: Analyzes the text and chooses a suitable compression type. This is the default.
- `bullet_points`: Summarizes the text using bullet points for clarity and quick reading.
- `glossary_terms`: Extracts and defines key terms as a glossary.
- `outline`: Structures the summary with headings and subheadings to capture the flow and main points.
- `critical_analysis`: Analyzes the main points, discussing strengths, weaknesses, and underlying themes.
- `facts_database`: Extracts factual statements, key details, and verifiable information from the text.
- `keywords_keyphrases`: Distills the text into a list of keywords or phrases that represent the core content, ideal for quick reference.
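Internally, each compression type presumably maps to a different instruction in the prompt sent to the model. A hypothetical sketch of that mapping — the dictionary, function, and prompt wording here are illustrative, not the script's actual prompts:

```python
# Hypothetical mapping from compressor_type to a prompt instruction.
COMPRESSOR_PROMPTS = {
    "bullet_points": "Summarize the text as concise bullet points.",
    "glossary_terms": "Extract and define key terms as a glossary.",
    "outline": "Summarize the text as an outline with headings and subheadings.",
    "critical_analysis": "Analyze the main points, strengths, weaknesses, and themes.",
    "facts_database": "Extract factual statements and verifiable details.",
    "keywords_keyphrases": "Distill the text into keywords and keyphrases.",
}


def build_prompt(compressor_type, token_target, chunk):
    """Assemble the instruction, target size, and chunk into one prompt."""
    instruction = COMPRESSOR_PROMPTS.get(
        compressor_type,
        # auto_detect falls through to letting the model pick a style.
        "Choose the most suitable compression style for this text.",
    )
    return f"{instruction} Target roughly {token_target} tokens.\n\n{chunk}"
```

A table-driven design like this keeps new compression types cheap to add: one new dictionary entry rather than new branching logic.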
## Example Input File

You can use a sample text file, like the U.S. Constitution, for testing. Save it as `us_constitution.txt` and place it in the project directory.
## Customization

The compression can be fine-tuned by adjusting `token_target`, the compression type, and the aggression factor. By default, the script increases compression aggression by 10% each iteration when the target token count has not been met, which prevents endless loops; you can modify this factor within the script if needed.
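The loop-prevention behavior described above can be sketched as follows. This is a sketch of the idea, not the script's actual code: `summarize` stands in for the OpenAI call and `count_tokens` for tiktoken-based counting, and both names are hypothetical.

```python
def compress_until_target(text, token_target, summarize, count_tokens, max_iters=10):
    """Repeatedly compress text until it fits the target token count."""
    aggression = 1.0
    for _ in range(max_iters):
        if count_tokens(text) <= token_target:
            break  # target reached; stop iterating
        # Ask for a target shrunk by the current aggression factor.
        effective_target = max(1, int(token_target / aggression))
        text = summarize(text, effective_target)
        aggression *= 1.1  # become 10% more aggressive each pass
    return text
```

The hard cap on iterations plus the growing aggression factor is what keeps the loop from running forever even when the model under-compresses on a given pass.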
## Known Issues

- **Loop Prevention**: Although an aggression factor is applied, some texts may take multiple iterations to reach the target token count.
- **Accuracy of Compression**: Compression accuracy varies by text complexity and chosen compression type.
- **Token Count Limitations**: The actual output may not match the exact `token_target` due to model and tokenization limitations.
## License

This project is licensed under the MIT License. See the `LICENSE` file for details.