
fix: Update Llama2 Formatting Template #781

Closed
wants to merge 1 commit into from

Conversation

teleprint-me
Contributor

I found a bug in the llama-2 formatting template.

  • Modify _system_template to better align system messages
  • Remove [INST] tag from system message format

This change aims to fix repetitive output (for example, the repeated greeting) in conversational contexts by refining the message template used for Llama2. A rough before/after sketch of the template change follows below.

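For illustration, a rough before/after of the change (the actual constant in llama_chat_format.py may be named and formatted slightly differently):

# Hypothetical sketch only; the real constant name and string may differ.
_system_template_before = "[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n"  # [INST] wraps the system block
_system_template_after = "<<SYS>>\n{system_message}\n<</SYS>>\n\n"          # [INST] reserved for user turns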
@teleprint-me
Contributor Author

I posted details about this issue in reference to PR #711.

@abetlen
Owner

abetlen commented Oct 2, 2023

Hey @teleprint-me, do you have a reference for this being the correct prompt format? I originally based this on FastChat's system message, which references a Huggingface blog post that shows the current format:

<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_msg_1 }} [/INST] {{ model_answer_1 }} </s><s>[INST] {{ user_msg_2 }} [/INST]
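For reference, assembling that documented format from a message list looks roughly like this (an illustrative sketch, not this library's implementation):

def format_llama2_hf_style(system_prompt, turns):
    # Illustrative sketch of the Huggingface-documented format above.
    # `turns` is a list of (user_msg, model_answer) pairs; the final answer may
    # be None when the model is expected to produce the next reply.
    prompt = f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
    for i, (user_msg, answer) in enumerate(turns):
        if i > 0:
            prompt += f"<s>[INST] {user_msg} [/INST]"
        else:
            prompt += f"{user_msg} [/INST]"
        if answer is not None:
            prompt += f" {answer} </s>"
    return prompt

# e.g. format_llama2_hf_style("You are helpful.", [("Hi", "Hello!"), ("How are you?", None)])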

@teleprint-me
Contributor Author

teleprint-me commented Oct 2, 2023

@abetlen I don't have any references other than my own repeated manual testing. It took a while, and I'll admit I became frustrated before I figured it out.

I worked it out by hand using llama.cpp, starting from a template that ChatGPT helped me extract from the official Llama source code.

The prompt file template we came up with is as follows:

<<SYS>>
My name is Valerie and I am a helpful assistant.
<</SYS>>

[INST] Hello Valerie, my name is Austin. What's your name?
[/INST] Hello Austin, my name is Valerie. Nice to meet you!
[INST] What can you do?
[/INST] I can assist you with various tasks, from answering questions to helping you with your projects.
[INST] How can you assist me in my programming projects?
[/INST]

The results were the following:

12:28:41 | /mnt/valerie/llama.cpp
(.venv) git:(master | Δ) λ ./main -f prompts/llama2.txt -m models/llama-2/llama-2-7b-chat/llama-2-7b-chat.GGUF.q4_0.bin -i --log-disable 2> /dev/null     
 <<SYS>>
My name is Valerie and I am a helpful assistant.
<</SYS>>

[INST] Hello Valerie, my name is Austin. What's your name?
[/INST] Hello Austin, my name is Valerie. Nice to meet you!
[INST] What can you do?
[/INST] I can assist you with various tasks, from answering questions to helping you with your projects.
[INST] How can you assist me in my programming projects?
[/INST] I'm happy to help with your programming projects! Here are some ways I can assist:

1. Code review and debugging: I can analyze your code and identify any errors or areas for improvement.
2. Writing code: I can generate code snippets or complete programs based on your requirements.
3. Providing programming concepts explanations: I can explain complex programming concepts in a clear and concise manner, making it easier for you to understand them.
4. Offering suggestions and ideas: If you're stuck on a particular problem or project, I can provide suggestions and ideas that might help you move forward.
5. Language translation: I can translate code from one programming language to another, helping you to work with different languages.
6. Data analysis and visualization: I can assist with data manipulation, cleaning, and visualization, using various libraries and tools.
7. Web scraping and data extraction: I can help you extract data from websites or web pages using various techniques.
8. Chatbot development: I can assist with developing chatbots using various platforms and frameworks.
9. Content generation: I can generate content, such as articles, blog posts, or even entire books, based on your requirements.
10. Other tasks: Depending on your needs, I can perform other tasks such as scheduling appointments, sending emails, or making reservations. Just let me know how I can help!

I tested it multiple times and was able to reproduce the results consistently.

By contrast, the currently employed template wraps the system message in the [INST] tag:

[INST] <<SYS>>
My name is Valerie and I am a helpful assistant.
<</SYS>>

Using that template produces erroneous output, causing the model to "get confused" and "glitch out".

00:22:12 | /mnt/valerie/pygptprompt
(.venv) git:(main | Δ) λ python -m pygptprompt.chat -x -e -s test tests/config.dev.json
2023-10-02 00:22:15,318 - json.py:110 - DEBUG - JSON successfully loaded from tests/config.dev.json
system
My name is Valerie. I am a helpful programming assistant.

user
Hello Valerie, my name is Austin. What's your name?

assistant
Hello Austin, my name is Valerie. Nice to meet you!

user
What can you do?

assistant
I can assist you with various tasks, from answering questions to helping you with your projects.

user
> How can you assist me in my programming projects?

assistant
 Hello Austin! It's nice to meet you too! As a helpful programming assistant, I can help you with a variety of tasks related to programming. Here are some examples of things I can do:

1. Answer questions: If you have any questions about programming concepts or specific languages, I can try to provide explanations and examples to help you understand.
2. Provide code examples: If you're working on a project and need some example code to get started, I can provide that for you.
3. Offer troubleshooting assistance: If you're having trouble with your code or are encountering errors, I can try to help you identify the problem and suggest solutions.
4. Help with debugging: If you're having trouble tracking down a bug in your code, I can offer suggestions on how to debug and fix the issue.
5. Assist with project ideas: If you're looking for ideas for programming projects, I can provide some suggestions based on your interests and skill level.

So, what kind of programming projects are you working on right now? Do you have any specific questions or areas where you could use some help?

Context is consuming 346 tokens.
Transcript is consuming 777 tokens.

user
>

As the transcript above shows, the currently employed template produces undesirable output.

Examples of undesirable output are:

  • Repetitive outputs, e.g. the model repeats elements from earlier in the conversation (greetings, previous statements).
  • Run-on completions, e.g. the model writes out entire conversations instead of stopping and letting the user/agent respond (a stop-sequence sketch follows below).
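Independent of the template fix itself, the run-on behaviour can usually be contained with stop sequences; something along these lines (the model path and stop strings here are only illustrative):

from llama_cpp import Llama

# Illustrative only: halt generation at the next turn marker so the model
# doesn't write the user's side of the conversation as well.
llm = Llama(model_path="models/llama-2/llama-2-7b-chat/llama-2-7b-chat.GGUF.q4_0.bin")
prompt = (
    "<<SYS>>\nMy name is Valerie and I am a helpful assistant.\n<</SYS>>\n\n"
    "[INST] Hello Valerie, my name is Austin. What's your name?\n[/INST]"
)
output = llm(prompt, max_tokens=256, stop=["[INST]", "<<SYS>>"])
print(output["choices"][0]["text"])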

You can repeat these experiments yourself and I'm sure you'll reproduce similar results.

If you're looking for something more concrete and official, I would recommend reviewing Facebook's source code yourself.

@teleprint-me
Contributor Author

As an aside, something interesting to note is that Llama is perfectly capable of understanding and interpreting custom special tags.

12:58:36 | /mnt/valerie/llama.cpp
(.venv) git:(master | Δ) λ ./main -f prompts/llama2.func.txt -m models/llama-2/llama-2-7b-chat/llama-2-7b-chat.GGUF.q4_0.bin -i --log-disable 2> /dev/null
 <<SYS>>
My name is Valerie and I am a helpful assistant.
<</SYS>>

[INST] Hello Valerie, my name is Austin. What's your name?[/INST]
[ASST] Hello Austin, my name is Valerie. Nice to meet you![/ASST]
[INST] What can you do?[/INST]
[ASST] I can assist you with various tasks, including providing structured output for certain queries.[/ASST]
[INST] How can you assist me in my programming projects?[/INST]
[ASST] I can help you with a wide range of programming-related tasks. For example, I can generate code, help with debugging, and offer project structuring advice.[/ASST]
[INST] What's the weather like?[/INST]
[FUNC]
{
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g. San Francisco, CA"
            },
            "unit": { 
              "type": "string", 
              "enum": ["metric", "uscs"],
              "description": "The unit system to use for temperature measurements"
            }
          },
          "required": ["location"]
        }
}
[/FUNC]
# llama starts after [ASST]█
[ASST]  Hello Austin, the current weather in San Francisco, CA is mostly sunny with a high of 78 degrees Fahrenheit and a low of 56 degrees Fahrenheit. Would you like to know the temperature in another location?

I know llama.cpp has a CFG (grammar) implementation and that you've exposed it here as well; I haven't had the time or resources to dig into it yet, though I want to.
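If it works the way I expect, something along these lines would constrain the [FUNC] reply to valid JSON (the paths and the grammar keyword are assumptions on my part; json.gbnf ships with llama.cpp):

from llama_cpp import Llama, LlamaGrammar

# Illustrative sketch: use the grammar support to force well-formed JSON output.
grammar = LlamaGrammar.from_file("vendor/llama.cpp/grammars/json.gbnf")
llm = Llama(model_path="models/llama-2/llama-2-7b-chat/llama-2-7b-chat.GGUF.q4_0.bin")
output = llm(
    "[INST] What's the weather like in San Francisco, CA?\n[/INST]\n[FUNC]\n",
    grammar=grammar,
    max_tokens=256,
)
print(output["choices"][0]["text"])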

My rationale for bringing it up is that it would explain why the previous template(s) worked just fine.

Here's another example of something I was experimenting with:

13:01:50 | /mnt/valerie/llama.cpp
(.venv) git:(master | Δ) λ ./main -f prompts/llama2.func.txt -m models/llama-2/llama-2-7b-chat/llama-2-7b-chat.GGUF.q4_0.bin -i --log-disable 2> /dev/null
 <<SYS>>
My name is Valerie and I am a helpful assistant.
<</SYS>>

[INST] Hello Valerie, my name is Austin. What's your name?[/INST]
[ASST] Hello Austin, my name is Valerie. Nice to meet you![/ASST]
[INST] What can you do?[/INST]
[ASST] I can assist you with various tasks, including providing structured output for certain queries.[/ASST]
[INST] How can you assist me in my programming projects?[/INST]
[ASST] I can help you with a wide range of programming-related tasks. For example, I can generate code, help with debugging, and offer project structuring advice.[/ASST]
[INST] What's the weather like?[/INST]
[FUNC]
{
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g. San Francisco, CA"
            },
            "unit": { 
              "type": "string", 
              "enum": ["metric", "uscs"],
              "description": "The unit system to use for temperature measurements"
            }
          },
          "required": ["location"]
        }
}
[/FUNC]
# llama starts after [FUNC]█
[FUNC] 
{
    "name": "current_weather",
    "description": "Get the current weather in a given location",
    "parameters": {
      "location": {
        "type": "string",
        "description": "The city and state, e.g. San Francisco, CA"
      },
      "unit": { 
        "type": "string", 
        "enum": ["metric", "uscs"],
        "description": "The unit system to use for temperature measurements"
      }
    },
    "response": {
      "status": "success",
      "data": {
        "tempt": {
          "temperature": {
            "value": Math.floor(Math.random() * 10) + 10,
            "units": function () {
              if (this.unit === "metric") {
                return "°C";
              } else if (this.unit === "uscs") {
                return "°F";
              } else {
                throw new Error("Invalid unit system");
              }
            }
          },
          "description": "The current temperature in the selected location"
        }
      }
    }
}

Again, with minimal instruction, Llama interprets it just fine, even after quantization and the precision loss from k-bit tensor conversions.

I think the smaller models are much more intelligent than we give them credit for. I believe this is significant, especially for end consumer devices.

@teleprint-me
Contributor Author

Regardless, this is why I was advocating for a more flexible approach rather than the one currently implemented.

Using the factory pattern with fixed templates is a bit inflexible and requires unnecessary maintenance as a result.

I believe it would be better to let users define their own templates rather than hard-coding any of them.

@bioshazard
Contributor

bioshazard commented Oct 3, 2023

Have you looked at Huggingface Chat Templates? Maybe a universal chat templating solution rather than a manual implementation? Mistral, for example, comes with a template config as part of the model deliverable.

https://huggingface.co/docs/transformers/main/chat_templating

For example:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

>>> chat = [
...   {"role": "user", "content": "Hello, how are you?"},
...   {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
...   {"role": "user", "content": "I'd like to show off how chat templating works!"},
... ]

>>> tokenizer.use_default_system_prompt = False
>>> tokenizer.apply_chat_template(chat, tokenize=False)
"<s>[INST] Hello, how are you? [/INST] I'm doing great. How can I help you today? </s><s>[INST] I'd like to show off how chat templating works! [/INST]"

@teleprint-me
Contributor Author

I did take a look at it. For llama2 it's the same template that's already being used here.

I think a good middle ground would be figuring out how to sanely register a new template with the current factory pattern.

It's set in the constructor at the moment. If it could take a string or an instance, preferably an instance, along the lines of the example snippet in PR #711, then we could pass one in by inheriting from the base class and wrapping the new function with the decorator that's already defined. That would most likely introduce a form of dependency injection, which I'm not sure is desirable.
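Roughly what I have in mind, assuming the existing decorator works the way I remember (the decorator and response-type names may differ in the current source):

from llama_cpp import llama_chat_format

@llama_chat_format.register_chat_format("llama-2-no-inst-system")
def format_llama2_no_inst_system(messages, **kwargs):
    # Sketch only: render the system block without [INST] and wrap each turn.
    system = next((m["content"] for m in messages if m["role"] == "system"), "")
    prompt = f"<<SYS>>\n{system}\n<</SYS>>\n\n"
    for m in messages:
        if m["role"] == "user":
            prompt += f"[INST] {m['content']}\n"
        elif m["role"] == "assistant":
            prompt += f"[/INST] {m['content']}\n"
    return llama_chat_format.ChatFormatterResponse(prompt=prompt + "[/INST]")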

My main hang-up is that I'd rather iterate over the completions list myself in a function I define. That would feel more intuitive design-wise and give users fine-grained control if they want or need it. The caveat is that it potentially creates its own problem of repeated code, hence the helper functions already implemented.

Like, I get the design choices... I just feel like it can be improved and hopefully become more flexible as a result. I'm still working out the details though.

I'd prefer to avoid going in circles. Whenever I get like this, it's usually a sign that I'm not interpreting something correctly and am still working my way towards understanding.

That's why I went for a minor change. I'd have to experiment more deeply with it in a separate branch while minimizing the overall effects.

@bioshazard
Contributor

Thanks for addressing my question. I will let the HF topic go in this context and leave y'all to wrap up this generous contribution. I am going to open a fresh issue to pitch my thoughts about delivering support for both the HF autotokenizer and an arbitrary Jinja file path.
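The Jinja file path idea would be roughly this (a sketch; the file name and template variables are made up):

from pathlib import Path
from jinja2 import Template

# Illustrative: load an arbitrary Jinja template from disk and render it over
# the messages list, similar to what apply_chat_template does internally.
chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]
template = Template(Path("templates/llama-2.jinja").read_text())
print(template.render(messages=chat))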

@teleprint-me
Contributor Author

PR #808 fixes the issue.

@earonesty
Contributor

It does not fully fix the issue. It only fixes a single back-and-forth; longer user/assistant conversations followed by a completion still have problems.

@teleprint-me teleprint-me deleted the fix.llama2.template branch October 26, 2023 20:56
antoine-lizee pushed a commit to antoine-lizee/llama-cpp-python that referenced this pull request Oct 30, 2023