Open Source Chinese Pre-trained Language Model Steel-LLM

[ 中文 | English ]

👋 Introduction

Steel-LLM is a personal project to pre-train a Chinese LLM from scratch. We pre-trained a Chinese LLM with approximately 1 billion parameters on about 1 trillion tokens of data. The project took 8 months from initiation to the completion of the first version of the model. We have shared the entire process in detail, including data collection, data processing, pre-training framework selection, and model design, and have open-sourced all the code, so that anyone with 8 to a few dozen GPUs can reproduce our work. Thanks to open-source Chinese data, Steel-LLM outperforms some earlier, larger models released by research institutions on Chinese benchmarks, scoring 42 on CEVAL and 36 on CMMLU.


"Steel" was inspired by an excellent band from the North China Plain called “Omnipotent Youth Society (万青)”. When they were producing their first album under limited conditions, they referred to it as "making steel with primitive methods," but it turned out to be a legendary album. Similarly, our conditions for training the LLM are limited, but we hope to produce good "steel" nonetheless.

🔔 Announcements

Updates

Further exploration will follow in areas such as mathematical ability, reinforcement learning, and complex reasoning...

[2025/2/13] The technical report has been uploaded: https://arxiv.org/abs/2502.06635

[2025/1/17] Updated Steel-LLM-chat-v2. English data was added during fine-tuning, keeping the Chinese-to-English ratio consistent with pre-training. As a result, the CEVAL score increased from 38 to 41.9 and the CMMLU score from 33 to 36.

[2024/11/13] 🔥 Published a project summary article "My Journey of Pre-training a 1B LLM from Scratch": https://mp.weixin.qq.com/s/POUugkCNZTzmlKWZVVD1CQ 🔥

[2024/10/28] Updated the first version of the chat model, scoring 38 in CEVAL and 33 in CMMLU.

[2024/10/24] Published the details of Steel-LLM fine-tuning and evaluation. During fine-tuning, we ran exploratory experiments on aspects such as CoT data and model leaderboards. Blog address: https://mp.weixin.qq.com/s/KK0G0spNw0D9rPUESkHMew

[2024/9/2] Updated the HuggingFace checkpoints for steps 480k, 660k, 720k, 980k, and 1060k (the final checkpoint).

[2024/8/18] Pre-training completed, followed by fine-tuning and evaluation.

[2024/7/18] Continued training using 8*H800, wandb: https://api.wandb.ai/links/steel-llm-lab/vqf297nr

[2024/6/30] Released the checkpoint for 200k steps of pre-training, HuggingFace link

[2024/5/21] Official training of the model began, and checkpoints will be released periodically.

[2024/5/19] Completed model modification based on Qwen1.5, model size 1.12B:

  • The FFN layer uses a softmax MoE, giving higher training speed at the same parameter count
  • Uses a dual-layer SwiGLU

Related blog: https://zhuanlan.zhihu.com/p/700395878

[2024/5/5] Blog on modifying the pre-training program: https://zhuanlan.zhihu.com/p/694223107

[2024/4/24] Completed improvements to the training program: compatible with HuggingFace format models, support for data checkpointing, support for adding new data.

[2024/4/14] Completed data collection and processing, generating bin files required for the pre-training program. Updated blog on data collection and processing: https://zhuanlan.zhihu.com/p/687338497

🧑‍🤝‍🧑 Community

You are welcome to join our community group, which already has more than 200 members. Add the WeChat ID a1843450905 to be invited to the group.


🤖 Pre-training

Data Collection

The datasets used and their links are listed below. For more details, see this article.

Data Processing

(For details, see this article)

Step 1: Format Conversion

  • Source data: Unified format processing for 4 types of data:
    • Simple text: Baidu Encyclopedia (manual merging of title and paragraphs), Chinese Wikipedia
    • Dialogue (including single and multi-round): Baidu Encyclopedia Q&A data, BELLE dialogue data (BELLE_3_5M), MOSS project dialogue data, Zhihu Q&A data, BELLE task data (BELLE_2_5M), firefly1.1M
    • Code data: starcoder
    • Other data: No separate processing required for wanjuan and skypile datasets
  • Target format: {"text": "asdfasdf..."}, saved as a .jsonl file (a minimal conversion sketch follows this list).
  • Run command: python data/pretrain_data_prepare/step1_data_process.py
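
As a rough illustration of the Step 1 target format only (this is not the project's actual step1_data_process.py, and the record structure and file names are made up), a multi-turn dialogue record could be flattened into a single {"text": ...} line like this:

import json

def dialogue_to_text(turns):
    # turns: a list of {"role": ..., "content": ...} dicts (hypothetical input shape)
    return "\n".join(f'{t["role"]}: {t["content"]}' for t in turns)

# Flatten one dialogue into the {"text": ...} target format and append it as one JSONL line.
record = {"text": dialogue_to_text([
    {"role": "user", "content": "什么是土法炼钢?"},
    {"role": "assistant", "content": "指在简陋条件下用原始方法炼钢。"},
])}

with open("step1_output.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")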

Step 2: Data Juicer Processing

  • Run command: sh data/pretrain_data_prepare/step2/run_step2.sh
  • The specific Data-Juicer operators used are documented in this document.

Step 3: Generating Final Training Bin Format

Modify filename_sets in the code to specify the data path, then run the following program:

python pretrain_modify_from_TinyLlama/scripts/prepare_steel_llm_data.py

Input data format: jsonl file containing the 'text' field
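
Before running the script, a quick sanity check like the following can confirm that every line of an input file parses as JSON and carries the 'text' field (an illustrative sketch, not part of the repository; the path is hypothetical):

import json

def check_jsonl(path):
    # Verify each line is valid JSON and contains the 'text' field expected by the bin-generation step.
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            record = json.loads(line)
            assert "text" in record, f"line {i} is missing the 'text' field"

check_jsonl("step1_output.jsonl")  # hypothetical path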

Tokenizer

We did not train a separate tokenizer; we reuse the tokenizer from Qwen/Qwen1.5-MoE-A2.7B-Chat.
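
For reference, this tokenizer can be loaded directly from the Hugging Face Hub with the transformers library (a minimal sketch; the sample sentence is only an illustration):

from transformers import AutoTokenizer

# Load the Qwen1.5-MoE tokenizer that Steel-LLM reuses (no separate tokenizer training).
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-MoE-A2.7B-Chat")
ids = tokenizer("用土法炼钢").input_ids
print(ids)
print(tokenizer.decode(ids))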

Model Architecture

(For details, see this article)

Based on the Qwen1.5 model, the following changes were made (a rough sketch follows the list):

  • The FFN layer uses a softmax MoE, giving higher training speed at the same parameter count
  • Uses a dual-layer SwiGLU
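
The linked blog post describes the actual design; the following is only a rough, self-contained sketch of the general idea behind a "softmax MoE" FFN, with placeholder dimensions, expert count, and class names that are not Steel-LLM's real configuration: each expert is a SwiGLU block, all experts are evaluated, and their outputs are mixed by softmax router weights rather than hard top-k routing.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    # A standard SwiGLU feed-forward block: silu(x @ W_gate) * (x @ W_up), then W_down.
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class SoftmaxMoEFFN(nn.Module):
    # Illustrative softmax-mixed MoE FFN: every expert runs on every token and the
    # outputs are combined with softmax router weights (no top-k routing).
    # Dimensions and expert count are placeholders, not Steel-LLM's actual settings.
    def __init__(self, dim=1024, hidden=2048, num_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList([SwiGLU(dim, hidden) for _ in range(num_experts)])

    def forward(self, x):
        weights = F.softmax(self.router(x), dim=-1)                      # (..., num_experts)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)   # (..., dim, num_experts)
        return (expert_out * weights.unsqueeze(-2)).sum(dim=-1)          # (..., dim)

x = torch.randn(2, 16, 1024)
print(SoftmaxMoEFFN()(x).shape)  # torch.Size([2, 16, 1024])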

Pre-training Framework

(For details, see this article)

Based on the TinyLlama pre-training program, the following improvements were made:

  • Compatible with HuggingFace format models
  • Fully restores training progress when loading checkpoints
  • Data consistency checks
  • Adds new data to the dataset without affecting already trained data

Start pre-training:

python Steel-LLM/pretrain_modify_from_TinyLlama/pretrain/pretrain_steel_llm.py

Evaluation

(For details, see this article)

Steel-LLM was evaluated on CEVAL and CMMLU. Since Steel-LLM targets Chinese, with about 80% of the training data being Chinese, it was not evaluated on English benchmarks. Metrics for the other models are taken from the CEVAL paper, the MiniCPM technical report, the MAP-Neo technical report, etc. More model results can be found in the earlier blog posts.

| Model                        | CEVAL | CMMLU |
|------------------------------|-------|-------|
| Steel-LLM-chat-v2            | 41.90 | 36.08 |
| Steel-LLM-chat-v1            | 38.57 | 33.48 |
| Tiny-Llama-1.1B              | 25.02 | 24.03 |
| Gemma-2b-it                  | 32.3  | 33.07 |
| Phi2 (2B)                    | 23.37 | 24.18 |
| Deepseek-coder-1.3B-instruct | 28.33 | 27.75 |
| CT-LLM-SFT-2B                | 41.54 | 41.48 |
| MiniCPM-2B-sft-fp32          | 49.14 | 51.0  |
| Qwen1.5-1.8B-Chat            | 56.84 | 54.11 |
| ChatGLM-6B                   | 38.9  | -     |
| Moss                         | 33.1  | -     |
| LLAMA-65B                    | 34.7  | -     |
| Qwen-7B                      | 58.96 | 60.35 |
| Gemma-7B                     | 42.57 | 44.20 |
| OLMo-7B                      | 35.18 | 35.55 |
| MAP-NEO-7B                   | 56.97 | 55.01 |

⛏️ Quick Usage

from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name = "zhanshijin/Steel-LLM"

# Download the model from ModelScope; torch_dtype="auto" keeps the checkpoint's dtype
# and device_map="auto" places the weights on an available GPU (or the CPU).
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Who developed you?"
messages = [
    {"role": "user", "content": prompt}
]
# Build the chat-formatted prompt string, then tokenize it.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Strip the prompt tokens so that only the newly generated reply is decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Hardware Resources

GPU: 8 × H800 80 GB (about 30 days of training)

GPU: 8 × A100 80 GB (about 60 days of training)

Disk: 4 TB

Citation

BibTeX:

@article{gu2025steel,
  title={Steel-LLM: From Scratch to Open Source--A Personal Journey in Building a Chinese-Centric LLM},
  author={Gu, Qingshui and Li, Shu and Zheng, Tianyu and Zhang, Zhaoxiang},
  journal={arXiv preprint arXiv:2502.06635},
  year={2025}
}