First, navigate to the folder where you keep your projects and clone this repository into it:
git clone https://github.com/PenGuln/llama3.1.c.git
Then, open the repository folder:
cd llama3.1.c
Now, let's chat with Llama 3.1 in pure C. First, download the Llama-3.1-8B-Instruct checkpoint from Hugging Face (the weights are gated, so you may need to accept the license on the model page and log in with huggingface-cli login first):
from huggingface_hub import snapshot_download
snapshot_download(repo_id="meta-llama/Llama-3.1-8B-Instruct", allow_patterns="original/*", local_dir="./")
Rename the checkpoint folder to Llama-3.1-8B-Instruct:
mv original Llama-3.1-8B-Instruct
Then export the weights and, optionally, the quantized weights:
python export.py llama3.1_8b_instruct.bin --meta-llama Llama-3.1-8B-Instruct
python export.py llama3.1_8b_instruct_quant.bin --meta-llama Llama-3.1-8B-Instruct --version 2
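The --version 2 export presumably produces an int8-quantized checkpoint for the runq path, storing group-wise int8 values plus one float scale per group, in the style of llama2.c's Q8_0 format. The C sketch below is only illustrative: the group size GS, the QuantizedTensor struct, and the quantize helper are assumptions for the example, not the repository's exact on-disk layout or code.

/* Hedged sketch of group-wise int8 quantization (llama2.c-style Q8_0).
 * GS and QuantizedTensor are illustrative assumptions. */
#include <math.h>
#include <stdint.h>

#define GS 64  /* assumed quantization group size */

typedef struct {
    int8_t *q;  /* quantized values */
    float  *s;  /* one scale per group of GS values */
} QuantizedTensor;

/* Quantize n floats (n divisible by GS) into qt. */
void quantize(QuantizedTensor *qt, const float *x, int n) {
    for (int g = 0; g < n / GS; g++) {
        /* find the max magnitude in this group */
        float wmax = 0.0f;
        for (int i = 0; i < GS; i++) {
            float v = fabsf(x[g * GS + i]);
            if (v > wmax) wmax = v;
        }
        /* map the group to the int8 range [-127, 127] */
        float scale = wmax / 127.0f;
        qt->s[g] = scale;
        for (int i = 0; i < GS; i++) {
            float qf = (scale != 0.0f) ? x[g * GS + i] / scale : 0.0f;
            qt->q[g * GS + i] = (int8_t) roundf(qf);
        }
    }
}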
Export the tokenizer:
python tokenizer.py --tokenizer-model=Llama-3.1-8B-Instruct/tokenizer.model
mv Llama-3.1-8B-Instruct/tokenizer.bin ./tokenizer.bin
Start the chat by compiling and running:
make run
./run llama3.1_8b_instruct.bin -z tokenizer.bin -m chat
Or, to chat with the quantized checkpoint:
make runq
./runq llama3.1_8b_instruct_quant.bin -z tokenizer.bin -m chat
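Under the hood, chat mode presumably wraps each turn in Llama 3.1's special header and end-of-turn tokens before tokenizing. The sketch below shows the expected prompt layout; render_prompt is an illustrative helper for this README, not a function taken from run.c.

/* Hedged sketch: build a Llama 3.1 chat prompt for one system + user turn. */
#include <stdio.h>

void render_prompt(char *out, size_t n, const char *system, const char *user) {
    snprintf(out, n,
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        "%s<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
        "%s<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
        system, user);
}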
Prepare test prompts from UltraChat-200k:
python ultrachat.py
A data.bin file will then appear in the repository folder. Evaluate the average inference speed on the dataset by running:
./test llama3.1_8b_instruct.bin -z tokenizer.bin -n 128
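The reported number is tokens per second averaged over the prompts. A minimal sketch of how such a measurement is typically taken follows; the actual bookkeeping in the test binary may differ.

/* Hedged sketch: wall-clock timing of token generation. */
#include <stdio.h>
#include <time.h>

long time_in_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* ... generate n_tokens tokens between the two timestamps ... */
void report_speed(long start_ms, long end_ms, int n_tokens) {
    double secs = (end_ms - start_ms) / 1000.0;
    fprintf(stderr, "achieved tok/s: %f\n", n_tokens / secs);
}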
To test the OpenMP-compiled inference speed, you'll need to compile with
make runomp
Then run the same command with the OMP_NUM_THREADS environment variable set:
OMP_NUM_THREADS=64 ./test llama3.1_8b_instruct.bin -z tokenizer.bin -n 128
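OMP_NUM_THREADS matters because the hot loop, the matmul, is parallelized across output rows with OpenMP in llama2.c-style code. A rough sketch of that pattern is shown below; the fork's actual matmul may differ in detail.

/* Hedged sketch: OpenMP-parallel matmul, W (d,n) @ x (n,) -> xout (d,). */
void matmul(float *xout, const float *x, const float *w, int n, int d) {
    int i;
    #pragma omp parallel for private(i)
    for (i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];
        }
        xout[i] = val;
    }
}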