**NOTICE: Deprecation.** I originally wrote this script as a makeshift solution before proper bindings came out. Since projects like llama-cpp-python now provide working bindings to the latest llama.cpp (which updates faster than I can keep up with), I no longer plan to maintain this repository and would kindly direct interested people to those solutions.
A quick-and-dirty script for calling llama.cpp from Python. Supports streaming and interactive mode.
This Python script requires the compiled `main` binary from llama.cpp. You'll need to compile llama.cpp for your own machine, grab a copy of the model weights, and quantize them according to the instructions provided in the llama.cpp repository.
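A quick way to confirm those prerequisites are in place before running anything (a minimal sketch using only the standard library; the paths are the script's defaults, described below):

```python
from pathlib import Path

# Default locations assumed by the script (see below); adjust if yours differ.
binary = Path('./llama.cpp/main')
weights = Path('./models/7B/ggml-model-q4_0.bin')

for path in (binary, weights):
    if not path.exists():
        raise FileNotFoundError(f'missing prerequisite: {path}')
```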
By default, the script assumes that your model weights are at `./models/7B/ggml-model-q4_0.bin` and the llama.cpp binary is at `./llama.cpp/main`. However, you can point the script to your own paths.
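For example (a minimal sketch; the paths here are hypothetical, and the `executable` and `model` keyword arguments come from the parameter list at the end of this README):

```python
from llama import llama

for token in llama(
    'LLaMA is a large language model that',
    executable='/opt/llama.cpp/main',           # hypothetical path to your compiled binary
    model='/data/weights/ggml-model-q4_0.bin',  # hypothetical path to your quantized weights
):
    print(token, end='', flush=True)
```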
If you just want the end result, without the streaming part:
```python
from llama import llama

output = llama('LLaMA is a large language model that', streaming=False)
print(output)
```
Simplest example:
```python
from llama import llama

for token in llama('LLaMA is a large language model that'):
    print(token, end='', flush=True)
```
If you don't want to see the prompt, just the completion:
```python
from llama import llama

for token in llama('LLaMA is a large language model that', skip_prompt=True):
    print(token, end='', flush=True)
```
Additionally, you can choose to show a small tail of the prompt by specifying the character count:
```python
from llama import llama

for token in llama(
    'LLaMA is a large language model that can:\n1.',
    skip_prompt=True,
    trim_prompt=2,  # the '1.' tail of the prompt will be shown
):
    print(token, end='', flush=True)
```
In interactive mode, generation pauses whenever the model emits the reverse prompt and control returns to the user, mirroring llama.cpp's own reverse-prompt behavior:

```python
from llama import llama

for token in llama(
    'Below is a conversation between a user and LLaMA:\nUser: Hello!\nLLaMA: Hi! I am LLaMA, a large language model.\nUser: ',
    interactive=True,
    reverse_prompt="User: ",
):
    print(token, end='', flush=True)
```
Here's the full range of parameters that you can tweak. As you can see, you can change the executable and model paths by supplying the `executable` and `model` parameters.
```python
def llama_stream(
    prompt='',
    skip_prompt=True,
    trim_prompt=0,
    executable='./llama.cpp/main',
    model='./models/7B/ggml-model-q4_0.bin',
    threads=4,
    temperature=0.7,
    top_k=40,
    top_p=0.5,
    repeat_last_n=256,
    repeat_penalty=1.17647,
    n=4096,
    interactive=False,
    reverse_prompt="User:",
)
```
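As a sketch of tweaking the sampling settings (the values below are illustrative, not recommendations, and this assumes `llama()` forwards these keyword arguments to `llama_stream()`, as the examples above suggest):

```python
from llama import llama

for token in llama(
    'LLaMA is a large language model that',
    temperature=0.2,     # lower temperature: more deterministic sampling
    top_k=20,            # consider only the 20 most likely next tokens
    top_p=0.9,           # nucleus sampling cutoff
    repeat_penalty=1.3,  # penalize recently repeated tokens more strongly
    threads=8,           # number of llama.cpp threads
    n=512,               # presumably the number of tokens to generate
):
    print(token, end='', flush=True)
```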