This project extends large language models to handle longer contexts. The steps below cover setting up the project, downloading the required datasets, and running the fine-tuning (FT) and supervised fine-tuning (SFT) stages.
git clone git@github.com:ZackZikaiXiao/long_context_hub.git
Install the required libraries:
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
To download the LongAlpaca-12k dataset from Hugging Face and place it in the project directory, run the following command:
wget https://huggingface.co/datasets/Yukang/LongAlpaca-12k/resolve/main/LongAlpaca-12k.json -P ./dataset/
The file should end up at ./dataset/LongAlpaca-12k.json.
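To sanity-check the download, the Python sketch below loads the file and prints a rough length distribution; the Alpaca-style "instruction"/"output" field names are an assumption, so adjust the keys if the schema differs:

```python
import json

# Load the LongAlpaca-12k dataset downloaded above.
with open("./dataset/LongAlpaca-12k.json", encoding="utf-8") as f:
    data = json.load(f)

print(f"{len(data)} samples")
print(data[0].keys())  # assumed Alpaca-style schema

# Rough per-sample length in characters; useful later when picking FILTER_MODE ranges.
lengths = [len(s.get("instruction", "")) + len(s.get("output", "")) for s in data]
print(f"min={min(lengths)}, max={max(lengths)}, mean={sum(lengths) // len(lengths)}")
```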
Download the RedPajama dataset into ./RedPajama-Data-1T-Sample under the project root.
First, download llama3-8B-Instruct:
pip install modelscope
python download_llama3.py
When the download completes, the console prints something like:
2024-05-27 23:42:01,567 - modelscope - INFO - Loading ast index from /home/zikaixiao/.cache/modelscope/ast_indexer
Meta-Llama-3-8B-Instruct lands in the ModelScope cache, e.g. /home/zikaixiao/.cache/modelscope/hub/LLM-Research/Meta-Llama-3-8B-Instruct
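For reference, download_llama3.py can be as small as the following sketch, assuming ModelScope's snapshot_download API (the actual script in the repo may differ):

```python
from modelscope import snapshot_download

# Downloads Meta-Llama-3-8B-Instruct into the ModelScope cache and prints
# the local path (the directory referenced in the log output above).
model_dir = snapshot_download("LLM-Research/Meta-Llama-3-8B-Instruct")
print(model_dir)
```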
Run from the project root:
./scripts/run_fine_tune.sh
Reference parameters:
MODEL_NAME_OR_PATH="./models/llama3-8B" # 官方权重,用chat版本
MODEL_MAX_LENGTH=32768
OUTPUT_DIR="./llama3_weights/llama3-8B-32k-ft"
Then run supervised fine-tuning:
./scripts/run_supervised_fine_tune.sh
Reference parameters:
MODEL_NAME_OR_PATH="./llama3_weights/llama3-8B-32k-ft"
MODEL_MAX_LENGTH=32768
FILTER_MODE="all" # FILTER_MODE要从从filter_ranges的key中选择
OUTPUT_DIR="./llama3_weights/llama3-8B-32k-ft-sft"
The save path OUTPUT_DIR must be adjusted to match FILTER_MODE by appending the corresponding suffix after llama3-8B-32k-ft, e.g.:
- llama3-8B-32k-ft-sft-sft-16k-40k means: FT first, then SFT on all data, then SFT on samples of length 16k-40k
- llama3-8B-32k-ft-sft-0-16k-d means: FT first, then SFT on all samples except those of length 0-16k
declare -A filter_ranges=(
["all"]="(0, float('inf'), False)" # OUTPUT_DIR: modelName-windowLength-ft-sft
["0-16k"]="(0, 16000, False)" # OUTPUT_DIR: modelName-windowLength-ft-sft-sft-0-16k
["16k-40k"]="(16000, 40000, False)" # OUTPUT_DIR: modelName-windowLength-ft-sft-sft-16k-40k
["40k-72k"]="(40000, 72000, False)" # OUTPUT_DIR: modelName-windowLength-ft-sft-sft-40k-72k
["not-0-16k"]="(0, 16000, True)" # OUTPUT_DIR: modelName-windowLength-ft-sft-0-16k-d
["not-16k-40k"]="(16000, 40000, True)" # OUTPUT_DIR: modelName-windowLength-ft-sft-16k-40k-d
["not-40k-plus"]="(40000, float('inf'), True)" # OUTPUT_DIR: modelName-windowLength-ft-sft-40k-plus
)
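Each value is a (min_len, max_len, negate) triple. A minimal Python sketch of the intended filtering semantics; the length measure (characters of instruction plus output) and the field names are assumptions:

```python
import json

def in_range(sample_len: int, min_len: float, max_len: float, negate: bool) -> bool:
    """Keep samples inside [min_len, max_len), or outside that range when negate=True."""
    inside = min_len <= sample_len < max_len
    return not inside if negate else inside

with open("./dataset/LongAlpaca-12k.json", encoding="utf-8") as f:
    data = json.load(f)

# Example: the "not-0-16k" mode above, i.e. drop samples shorter than 16k characters.
min_len, max_len, negate = 0, 16000, True
kept = [s for s in data
        if in_range(len(s.get("instruction", "")) + len(s.get("output", "")),
                    min_len, max_len, negate)]
print(f"kept {len(kept)} / {len(data)} samples")
```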
- Extend llama3-8B to 32k via FT (dataset: RedPajama), producing llama3-8B-32k-ft
- SFT on top of llama3-8B-32k-ft (all LongAlpaca-12k samples), producing llama3-8B-32k-ft-sft
- SFT llama3-8B-32k-ft-sft separately with FILTER_MODE="0-16k", "16k-40k", and "40k-72k", producing llama3-8B-32k-ft-sft-sft-0-16k, llama3-8B-32k-ft-sft-sft-16k-40k, and llama3-8B-32k-ft-sft-sft-40k-72k
- SFT on top of llama3-8B-32k-ft with FILTER_MODE="not-0-16k", "not-16k-40k", and "not-40k-plus", producing llama3-8B-32k-ft-sft-0-16k-d, llama3-8B-32k-ft-sft-16k-40k-d, and llama3-8B-32k-ft-sft-40k-plus (see the driver sketch after this list)
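For reference, the whole grid could be driven by a loop like the one below. It assumes run_supervised_fine_tune.sh picks up MODEL_NAME_OR_PATH, FILTER_MODE, and OUTPUT_DIR from the environment, which is an assumption; in this repo the values are edited inside the script.

```python
import os
import subprocess

BASE = "./llama3_weights/llama3-8B-32k-ft"
# FILTER_MODE -> (starting checkpoint, OUTPUT_DIR), following the naming scheme above.
runs = {
    "0-16k":        (f"{BASE}-sft", f"{BASE}-sft-sft-0-16k"),
    "16k-40k":      (f"{BASE}-sft", f"{BASE}-sft-sft-16k-40k"),
    "40k-72k":      (f"{BASE}-sft", f"{BASE}-sft-sft-40k-72k"),
    "not-0-16k":    (BASE, f"{BASE}-sft-0-16k-d"),
    "not-16k-40k":  (BASE, f"{BASE}-sft-16k-40k-d"),
    "not-40k-plus": (BASE, f"{BASE}-sft-40k-plus"),
}
for mode, (ckpt, out_dir) in runs.items():
    subprocess.run(
        ["./scripts/run_supervised_fine_tune.sh"],
        env={**os.environ,
             "MODEL_NAME_OR_PATH": ckpt,
             "FILTER_MODE": mode,
             "OUTPUT_DIR": out_dir},
        check=True,
    )
```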
- Evaluation: switch to LEval, because LongBench inputs are only 1-40k characters long (likely under 16k tokens), which is not enough
- Try tinyllama on LEval across different length ranges
- Try a short-to-long SFT schedule (self-paced learning)
- Try IDF-based data selection (see the sketch below)
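One possible reading of IDF-based data selection: rank samples by the mean inverse document frequency of their tokens and keep the most informative ones. This is only a sketch of the idea, not the project's implementation:

```python
import math
from collections import Counter

def idf_scores(docs: list[list[str]]) -> dict[str, float]:
    """IDF over whitespace-tokenized documents: log(N / df)."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log(n / cnt) for tok, cnt in df.items()}

def select_by_idf(texts: list[str], keep_ratio: float = 0.5) -> list[str]:
    docs = [t.split() for t in texts]
    idf = idf_scores(docs)
    # Score each sample by the mean IDF of its tokens (rarer tokens -> higher score).
    scored = sorted(
        texts,
        key=lambda t: -sum(idf[w] for w in t.split()) / max(len(t.split()), 1),
    )
    return scored[: int(len(texts) * keep_ratio)]
```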