pecca-rs

Pecca starts out as a Rust port of @karpathy's excellent llama2.c, itself a minimalistic adaptation of llama.cpp.

Compared to other Rust ports, Pecca leverages ndarray, which has several advantages:

  • Type Safety: all matrices have proper dimensions (instead of giant flat arrays) and most operations check that dimensions are compatible.
  • Speed: out of the box and single-threaded, Pecca is already slightly faster than the C version.
  • Readability: matrix operations can be written succinctly. This first version of pecca-rs is only 425 lines, including comments.
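To illustrate the type-safety point, here is a minimal std-only sketch of the idea of a matrix that carries its shape and checks it on every operation. It is not ndarray's API and not Pecca's code: the `Matrix` type and its methods are hypothetical names chosen for the example.

```rust
/// Hypothetical 2-D matrix that carries its shape (not ndarray's actual type).
#[derive(Debug, Clone, PartialEq)]
struct Matrix {
    rows: usize,
    cols: usize,
    data: Vec<f32>, // row-major storage
}

impl Matrix {
    /// Build a matrix, rejecting empty shapes and data of the wrong length.
    fn new(rows: usize, cols: usize, data: Vec<f32>) -> Result<Self, String> {
        if rows == 0 || cols == 0 {
            return Err("matrix dimensions must be non-zero".to_string());
        }
        if data.len() != rows * cols {
            return Err(format!(
                "expected {} elements for a {}x{} matrix, got {}",
                rows * cols,
                rows,
                cols,
                data.len()
            ));
        }
        Ok(Matrix { rows, cols, data })
    }

    /// Matrix-vector product; fails if the vector length differs from `cols`.
    fn dot(&self, x: &[f32]) -> Result<Vec<f32>, String> {
        if x.len() != self.cols {
            return Err(format!(
                "shape mismatch: {}x{} matrix times vector of length {}",
                self.rows,
                self.cols,
                x.len()
            ));
        }
        Ok(self
            .data
            .chunks(self.cols)
            .map(|row| row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>())
            .collect())
    }
}
```

With a flat `Vec<f32>` and manual index arithmetic, multiplying a 2x3 weight matrix by a vector of length 2 would silently read the wrong elements; here the same mistake surfaces as an error at the call site.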

Going forward, Pecca will leverage Rust and its ecosystem whenever it makes sense, rather than trying to avoid dependencies above all else, as llama.cpp does.

Usage

git clone https://github.com/rahoua/pecca-rs.git
cd pecca-rs
wget -P ./models/stories/  https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
cargo run --release generate ./models/stories/stories15M.bin

Pecca can be run the same way with the larger TinyStories models (such as the 110M one) or with the llama2 models (only 7B is recommended so far). For a full list of command-line options, run:

pecca-rs --help

To get the llama2 models, follow the instructions for llama2.c. Pecca supports the same model format. As Pecca does not use memmap, loading and quantizing the model on the fly can take some time. To speed things up, the models can also be saved quantized using the -f --write-model <path> command line switch.
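The README does not spell out how weights are quantized, so the following is only a sketch of one common approach: symmetric i8 quantization with a single scale per group of weights. The `QuantGroup` type and the function names are hypothetical, and Pecca's actual scheme may differ.

```rust
/// One quantized group: `scale` maps the i8 values back to f32.
/// Hypothetical layout, assumed for illustration only.
struct QuantGroup {
    scale: f32,
    values: Vec<i8>,
}

/// Symmetric i8 quantization of one group of weights.
fn quantize_group(weights: &[f32]) -> QuantGroup {
    // The largest magnitude in the group determines the scale.
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    // Avoid dividing by zero when every weight in the group is zero.
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let values = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    QuantGroup { scale, values }
}

/// Recover approximate f32 weights from a quantized group.
fn dequantize_group(group: &QuantGroup) -> Vec<f32> {
    group.values.iter().map(|&q| q as f32 * group.scale).collect()
}
```

Each weight then takes one byte instead of four, at the cost of a rounding error of at most about half a scale step per weight. Doing this pass over every tensor at load time is the kind of work that saving a pre-quantized model avoids.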

For codellama, the instructions are similar, except that the tokenizer is slightly different. To make the process easier, the updated tokenizer is provided. To override the default tokenizer, run pecca with the -k command line option:

./target/release/pecca-rs generate /path/to/codellama-instr-7b.bin -k "./models/tokenizer-code.bin"

Performance

There is no formal benchmark yet; the figures below are rough estimates meant to give a ballpark of overall performance.

Llama2 7B model on a MacBook Pro M2 Max:

  • llama2.c, f32: 4 tok/s
  • llama.cpp, Q4_K_M quantization: 24 tok/s
  • pecca, f32: 4 tok/s
  • pecca, i8 quantization: 11 tok/s
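For readers who want to reproduce this kind of rough figure, a tok/s estimate is just the number of generated tokens divided by wall-clock time. The sketch below shows that calculation; `measure` and its `step` closure are hypothetical stand-ins for a per-token generation step and are not part of Pecca.

```rust
use std::time::{Duration, Instant};

/// Tokens per second for `tokens` generated over `elapsed`.
/// Returns infinity if `elapsed` is zero.
fn tokens_per_second(tokens: usize, elapsed: Duration) -> f64 {
    tokens as f64 / elapsed.as_secs_f64()
}

/// Time `steps` calls of a hypothetical per-token step and report throughput.
fn measure<F: FnMut()>(steps: usize, mut step: F) -> f64 {
    let start = Instant::now();
    for _ in 0..steps {
        step();
    }
    tokens_per_second(steps, start.elapsed())
}
```

As an example, 44 tokens generated in 4 seconds corresponds to 11 tok/s.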

Future Directions

A list of possible future developments for the project:

  • Improved tokenizer.
  • Improved inference performance and a smaller memory footprint during inference.
  • Experiment with SmoothQuant.
  • Explore extending the ndarray dot operation to support cuBLAS or Metal.
  • Additional parallelization of independent operations.
  • Various refactoring.
  • Support for additional models.