Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models

📖 arXiv
Update:
- 2024/12/09: We have since added an updated note on structured generation libraries and found different results.
Structured generation, the process of producing content in standardized formats like JSON and XML, is widely utilized in real-world applications to extract key output information from large language models (LLMs). This study investigates whether such constraints on generation space impact LLMs' abilities, including reasoning and domain knowledge comprehension.
```bash
pip install -r requirements.txt
```
You will need Together, OpenAI, and Anthropic API keys to run properly:

```bash
export TOGETHER_API_KEY="XXX"
export OAI_KEY="sk-XXXX"
export ANTHROPIC_API_KEY="XXX"
```
You will also need a Gemini Vertex AI setup to run all the code:

```bash
pip install --upgrade google-cloud-aiplatform
gcloud auth application-default login
export GCP_PROJECT_NAME="Your Project Name"
```
To evaluate, run:

```bash
python main.py --model gpt-3.5-turbo-0125 \
    --dataset lastletter \
    --prompt_style xml \
    --num_shots 0 \
    --prompt_version tasks/templates/lastletter-v2-5.yaml
```
Note: if `prompt_version` is provided, the output path will be `logging/<prompt_version filename>/`.
To add a new task, write the prompts in `_utils.py`. Each format function must return the prompt together with a format parser function that takes the LLM response text and the original dataset row (see the sketch after the field list below).
The format parser will return:

```python
{
    'correct': correct,
    'answer': answer,                       # ground truth
    'predict': predict,                     # parsed-out answer
    'parsed_result': parsed_results,        # make sure it's a dict
    'parse_failed': parse_failed,           # 1 or 0: the response doesn't contain any answer keys
    'response_non_yaml': response_non_yml,  # 1 or 0: the response doesn't contain the structure, i.e. yaml
}
```
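Below is a minimal sketch of such a format function for a hypothetical JSON-style task. The function name, dataset fields (`question`, `answer`), and prompt wording are illustrative assumptions, not the repo's actual API; adapt them to the existing functions in `_utils.py`.

```python
import json

def format_mytask_json(row):
    # Hypothetical example: build the prompt from the dataset row.
    prompt = (
        f"Question: {row['question']}\n"
        'Answer in JSON with a single key "answer".'
    )

    def parser(response_text, original_row):
        answer = original_row["answer"]  # ground truth
        parsed_results, parse_failed, response_non_yml = {}, 0, 0
        try:
            parsed_results = json.loads(response_text)
        except json.JSONDecodeError:
            response_non_yml = 1  # response is not valid structured output
        predict = parsed_results.get("answer") if isinstance(parsed_results, dict) else None
        if predict is None:
            parse_failed = 1  # no answer key found in the response
        return {
            "correct": int(predict == answer),
            "answer": answer,
            "predict": predict,
            "parsed_result": parsed_results if isinstance(parsed_results, dict) else {},
            "parse_failed": parse_failed,
            "response_non_yaml": response_non_yml,
        }

    return prompt, parser
```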
There is a full list of commands to run all the available combinations in these files:

- `run_gsm8k.sh`
- `run_letter.sh`
- `run_shuffobj.sh`
- `run_task280.sh`
- `run_ddxplus.sh`
- `run_sports.sh`
Ideally any format should have 9 combinations (3 prompt instruction variants × 3 format variants) to obtain the full results.
If running all the combinations is too much, the executed results are available here: Drive
Currently supported datasets: gsm8k, shuffleobj, lastletter on gpt-4o-mini-2024-07-18 and gpt-4o-2024-08-06.
```bash
python main.py --model gpt-4o-mini-2024-07-18 \
    --dataset gsm8k \
    --series struct-v2 \
    --prompt_style struct-v2 \
    --num_shots 0 \
    --prompt_version tasks/templates/gsm8k-t1-f1.yaml
```
Make sure to swap the prompt version among the f1, f2, and f3 templates to cover all format variants (see the sketch below).
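A minimal sketch (not part of the repo) of sweeping the three format variants with `subprocess`; it assumes `gsm8k-t1-f2.yaml` and `gsm8k-t1-f3.yaml` exist alongside the `gsm8k-t1-f1.yaml` template shown above:

```python
import subprocess

# Assumed template naming: gsm8k-t1-f{1,2,3}.yaml under tasks/templates/.
for fmt in ("f1", "f2", "f3"):
    subprocess.run(
        [
            "python", "main.py",
            "--model", "gpt-4o-mini-2024-07-18",
            "--dataset", "gsm8k",
            "--series", "struct-v2",
            "--prompt_style", "struct-v2",
            "--num_shots", "0",
            "--prompt_version", f"tasks/templates/gsm8k-t1-{fmt}.yaml",
        ],
        check=True,
    )
```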
```bash
python visualize.py
```
Once every result file is available, this will produce the bar plot for each task.
If you find our work helpful, please cite as:

```bibtex
@article{tam2024let,
  title={Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models},
  author={Tam, Zhi Rui and Wu, Cheng-Kuang and Tsai, Yi-Lin and Lin, Chieh-Yen and Lee, Hung-yi and Chen, Yun-Nung},
  journal={arXiv preprint arXiv:2408.02442},
  year={2024}
}
```