[ 中文 | English ]
Steel-LLM is a personally initiated project to pre-train a large Chinese model from scratch. We pre-trained a Chinese LLM with approximately 1 billion parameters on 1 trillion tokens of data. The project took 8 months from initiation to the completion of the first version of the model. We have shared the entire process in detail, including data collection, data processing, pre-training framework selection, and model design, and have open-sourced all the code, so that anyone with 8 to a few dozen GPUs can reproduce our work. Thanks to open-source Chinese data, Steel-LLM outperforms some larger models released earlier by institutions on Chinese benchmarks, scoring 42 on CEVAL and 36 on CMMLU.
"Steel" was inspired by an excellent band from the North China Plain called “Omnipotent Youth Society (万青)”. When they were producing their first album under limited conditions, they referred to it as "making steel with primitive methods," but it turned out to be a legendary album. Similarly, our conditions for training the LLM are limited, but we hope to produce good "steel" nonetheless.
Going forward, we will further explore areas such as mathematical ability, reinforcement learning, and complex reasoning.
[2025/2/13] The technical report has been uploaded: https://arxiv.org/abs/2502.06635
[2025/1/17] Updated Steel-LLM-chat-v2. English data was added during fine-tuning, keeping the Chinese-to-English ratio consistent with pre-training. As a result, the CEVAL score increased from 38 to 41.9 and the CMMLU score from 33 to 36.
[2024/11/13] 🔥 Published a project summary article "My Journey of Pre-training a 1B LLM from Scratch": https://mp.weixin.qq.com/s/POUugkCNZTzmlKWZVVD1CQ 🔥
[2024/10/28] Updated the first version of the chat model, which scores 38 on CEVAL and 33 on CMMLU.
[2024/10/24] Published the details of Steel-LLM fine-tuning and evaluation. During fine-tuning, we ran exploratory experiments such as CoT fine-tuning and leaderboard-style model comparisons. Blog address: https://mp.weixin.qq.com/s/KK0G0spNw0D9rPUESkHMew
[2024/9/2] Updated the HuggingFace checkpoints for steps 480k, 660k, 720k, 980k, and 1060k (the final checkpoint).
[2024/8/18] Pre-training completed, followed by fine-tuning and evaluation.
[2024/7/18] Continued training using 8*H800, wandb: https://api.wandb.ai/links/steel-llm-lab/vqf297nr
[2024/6/30] Released the checkpoint for 200k steps of pre-training, HuggingFace link
[2024/5/21] Official training of the model began, and checkpoints will be released periodically.
[2024/5/19] Completed model modification based on Qwen1.5, model size 1.12B:
- The FFN layer uses a softmax MoE, providing higher training speed at the same parameter count
- Uses a dual-layer SwiGLU
Related blog: https://zhuanlan.zhihu.com/p/700395878
[2024/5/5] Blog on modifying the pre-training program: https://zhuanlan.zhihu.com/p/694223107
[2024/4/24] Completed improvements to the training program: compatible with HuggingFace format models, support for data checkpointing, support for adding new data.
[2024/4/14] Completed data collection and processing, generating bin files required for the pre-training program. Updated blog on data collection and processing: https://zhuanlan.zhihu.com/p/687338497
You are welcome to join our community group, which now has more than 200 members. Add WeChat ID a1843450905 to join the group.
The datasets used and their links are listed below. For more details, see this article.
- Skywork/Skypile-150B data
- wanjuan1.0 (NLP part)
- Filtered Chinese Wikipedia data
- Baidu Encyclopedia data
- Baidu Encyclopedia Q&A data
- Zhihu Q&A data
- BELLE dialogue data
- MOSS project dialogue data
- firefly1.1M
- starcoder
(For details, see this article)
- Source data: four types of data are unified into a single format:
    - Plain text: Baidu Encyclopedia (titles and paragraphs merged manually), Chinese Wikipedia
    - Dialogue (single- and multi-turn): Baidu Encyclopedia Q&A data, BELLE dialogue data (BELLE_3_5M), MOSS project dialogue data, Zhihu Q&A data, BELLE task data (BELLE_2_5M), firefly1.1M
    - Code data: starcoder
    - Other data: the wanjuan and skypile datasets require no separate processing
- Target format: `{"text": "asdfasdf..."}`, saved as a `.jsonl` file.
- Run step 1: `python data/pretrain_data_prepare/step1_data_process.py`
- Run step 2: `sh data/pretrain_data_prepare/step2/run_step2.sh`
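To make the conversion to the target format concrete, here is a minimal sketch of flattening a single-turn dialogue record into the unified `{"text": ...}` `.jsonl` format. The `instruction`/`output` field names and the file names are hypothetical; the actual logic lives in `step1_data_process.py`.

```python
import json

def dialogue_to_text(sample):
    # Flatten one single-turn dialogue record into plain text.
    # The 'instruction'/'output' field names are hypothetical examples.
    return f"问：{sample['instruction']}\n答：{sample['output']}"

def convert(in_path, out_path):
    # Write one {"text": ...} object per line (.jsonl), matching the target format above.
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            sample = json.loads(line)
            fout.write(json.dumps({"text": dialogue_to_text(sample)}, ensure_ascii=False) + "\n")

# Example usage (hypothetical file names):
# convert("raw/belle_dialogue.jsonl", "processed/belle_dialogue.jsonl")
```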
- The specific data-juicer operators used are documented in this document.
Modify `filename_sets` in the code to specify the data path, then run the following program:
python pretrain_modify_from_TinyLlama/scripts/prepare_steel_llm_data.py
Input data format: `.jsonl` files containing a `text` field.
We do not train a separate tokenizer; we use the tokenizer from Qwen/Qwen1.5-MoE-A2.7B-Chat.
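For illustration, here is a minimal sketch (using the HuggingFace `AutoTokenizer`) of how each `.jsonl` line can be tokenized with this tokenizer before being packed into the training `.bin` files. The function name and file name are hypothetical; the real logic is in `prepare_steel_llm_data.py`.

```python
import json
from transformers import AutoTokenizer

# Tokenizer reused from Qwen; no custom tokenizer is trained for Steel-LLM
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-MoE-A2.7B-Chat")

def tokenize_jsonl(path):
    """Yield token-id lists for each {"text": ...} line in a .jsonl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            sample = json.loads(line)
            # Append the tokenizer's end-of-text token so documents stay separated
            yield tokenizer.encode(sample["text"]) + [tokenizer.eos_token_id]

# Example usage (hypothetical file name):
# for ids in tokenize_jsonl("data/example.jsonl"):
#     ...  # pack ids into fixed-length sequences and write them to a .bin file
```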
(For details, see this article)
Based on the Qwen1.5 model, the following changes were made:
- The FFN layer uses a softmax MoE, providing higher training speed at the same parameter count
- Uses a dual-layer SwiGLU
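As a rough illustration of these two changes, here is a minimal sketch of a softmax-gated MoE FFN whose experts are SwiGLU blocks. The expert count, sizing, and the exact wiring of the dual-layer SwiGLU are assumptions made for the example; the actual Steel-LLM implementation is described in the blog linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Standard SwiGLU feed-forward block: down(SiLU(gate(x)) * up(x))."""
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

class SoftmaxMoEFFN(nn.Module):
    """Illustrative softmax-gated MoE FFN: each expert is a smaller SwiGLU block,
    and expert outputs are mixed with per-token softmax weights."""
    def __init__(self, hidden_size, intermediate_size, num_experts=4):
        super().__init__()
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [SwiGLU(hidden_size, intermediate_size // num_experts) for _ in range(num_experts)]
        )

    def forward(self, x):                                    # x: (batch, seq, hidden)
        weights = F.softmax(self.router(x), dim=-1)          # (batch, seq, num_experts)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)  # (batch, seq, hidden, E)
        return (expert_out * weights.unsqueeze(-2)).sum(dim=-1)         # (batch, seq, hidden)
```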
(For details, see this article)
Based on the TinyLlama pre-training program, the following improvements were made:
- Compatible with HuggingFace format models
- Fully restores training progress when loading checkpoints
- Data consistency checks
- Adds new data to the dataset without affecting already trained data
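As a rough illustration of what "fully restores training progress" involves, here is a minimal sketch of a checkpoint that also records the data-loader position. The field names are hypothetical; the real implementation lives in the pre-training script below.

```python
import torch

def save_checkpoint(path, model, optimizer, step, consumed_samples):
    # Besides the weights, save the optimizer state, step count, and how much
    # data has been consumed, so resuming continues exactly where training stopped.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
        "consumed_samples": consumed_samples,  # lets the data loader skip ahead on resume
    }, path)

def load_checkpoint(path, model, optimizer):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"], ckpt["consumed_samples"]
```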
Start pre-training:
python Steel-LLM/pretrain_modify_from_TinyLlama/pretrain/pretrain_steel_llm.py
(For details, see this article)
Steel-LLM was evaluated on CEVAL and CMMLU. Since Steel-LLM is aimed at Chinese and about 80% of its training data is Chinese, it was not evaluated on English benchmarks. Metrics for the other models are taken from the CEVAL paper, the MiniCPM technical report, the MAP-Neo technical report, etc. More model results can be found in the earlier blog posts.
| Model | CEVAL | CMMLU |
|---|---|---|
| Steel-LLM-chat-v2 | 41.90 | 36.08 |
| Steel-LLM-chat-v1 | 38.57 | 33.48 |
| Tiny-Llama-1.1B | 25.02 | 24.03 |
| Gemma-2b-it | 32.3 | 33.07 |
| Phi2(2B) | 23.37 | 24.18 |
| Deepseek-coder-1.3B-instruct | 28.33 | 27.75 |
| CT-LLM-SFT-2B | 41.54 | 41.48 |
| MiniCPM-2B-sft-fp32 | 49.14 | 51.0 |
| Qwen1.5-1.8B-Chat | 56.84 | 54.11 |
| ChatGLM-6B | 38.9 | - |
| Moss | 33.1 | - |
| LLAMA-65B | 34.7 | - |
| Qwen-7B | 58.96 | 60.35 |
| Gemma-7B | 42.57 | 44.20 |
| OLMo-7B | 35.18 | 35.55 |
| MAP-NEO-7B | 56.97 | 55.01 |
from modelscope import AutoModelForCausalLM, AutoTokenizer

# Load the chat model and tokenizer from ModelScope
model_name = "zhanshijin/Steel-LLM"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Build the prompt with the chat template
prompt = "Who developed you?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate a response and strip the prompt tokens from the output
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
- GPU: 8 × H800 80G (about 30 days of training)
- GPU: 8 × A100 80G (about 60 days of training)
- Disk: 4 TB
BibTeX:
@article{gu2025steel,
  title={Steel-LLM: From Scratch to Open Source--A Personal Journey in Building a Chinese-Centric LLM},
  author={Gu, Qingshui and Li, Shu and Zheng, Tianyu and Zhang, Zhaoxiang},
  journal={arXiv preprint arXiv:2502.06635},
  year={2025}
}