Configuring Custom Models
Warning
- We will now walk through the steps of finding, downloading and configuring a custom model. All these steps are required for it to (possibly) work.
- Jinja2 templates are complicated; this wiki is written for advanced users.
- Models found on Huggingface or anywhere else are "unsupported"; you should follow this guide before asking for help.
Whether you "Sideload" or "Download" a custom model, you must configure it to work properly.
- We will refer to a "Download" as being any model that you found using the "Add Models" feature.
- A custom model is one that is not provided within the "GPT4All" default models list in the "Explore Models" window. Custom models usually require configuration by the user.
- A "Sideload" is any model you get from somewhere else and then put in the models directory.
Open GPT4All and click on "Find models". In this example, we use "HuggingFace" in the Explore Models window. Searching HuggingFace here will return a list of custom GGUF models. As an example, down below, we type "GPT4All-Community", which will find models from the GPT4All-Community repository.
It is strongly recommended to use custom models from the GPT4All-Community repository, which can be found using the search feature in the Explore Models page or alternatively can be sideloaded. Be aware that those also have to be configured manually.
- The GGUF model below (GPT4All-Community/DeepSeek-R1-Distill-Llama-8B-GGUF) is an example of a custom model which at the time of this tutorial required rewriting the jinja2 template for minja compatibility.
- You will find that most custom models will require similar work for the jinja2 template.
A GPT4All-Community model may already have a compatible minja template. Click "More info can be found here.", which brings you to the HuggingFace Model Card.
Keep in mind:
- Some repos may not have fully tested the model provided.
- The model authors may not have bothered to change the model configuration files from finetuning to inferencing workflows.
- Even if they show you a template it may be wrong.
- Each model has its own tokens and its own syntax.
- The models are trained using these tokens, which is why you must use them for the model to work.
- The model uploader may not understand this either and may provide a bad model or a mismatched template.
Here, you find information that you need to configure the model and understand it better. (A model may be outdated, it may have been a failed experiment, it may not yet be compatible with GPT4All, it may be dangerous, it may also be GREAT!)
- You should learn the maximum context for the model.
- You need to know whether there are known problems. Check the Community tab on the model page; an issue reported there may not affect you, but it's a good place to find out.
GPT4All uses minja, which is not fully compatible with the Python-based Jinja2 templates that are included with models.
Using the wrong template will cause problems. You may be lucky and get some output, but it could be better; you may also get nothing at all.
Important
The chat templates must be followed on a per model basis. Every model is different.
You can imagine them to be like magic spells.
Your magic won't work if you say the wrong word. It won't work if you say it at the wrong place or time.
At this step, we need to combine the chat template that we found in the model card with a special syntax that is compatible with the GPT4All-Chat application (the format shown in the screenshot above is only an example). If you looked into the tokenizer_config.json, see Advanced Topics: Jinja2 Explained.
Special tokens like <|user|> tell the model that the user is about to talk. A token like <|end|> tells the LLM that this part is done and it should now continue on.
2025/1/31: What specific operators exist now, and what do they do? (wiki writer's note)
That example prompt should (in theory) be compatible with GPT4All; it will look like this for you...
You need a clean prompt without any jinja2:
You are a helpful AI assistant.
You could get more elaborate and write a little JSON that the LLM will interpret to dictate its behavior.
{
"talking_guidelines": {
"format": "Communication happens before or after thoughts.",
"description": "All outward communication must be outside of a thought either before or after the think tags."
},
"thinking_guidelines": {
"format": "<think>All my thoughts must happen inside think tags.</think>",
"description": "All internal thoughts of the character MUST be enclosed within these tags. This includes reactions, observations, internal monologues, and any other thought processes. Do not output thoughts outside of these tags. The tags themselves should not be modified. The content within the tags should be relevant."
}
}
The default settings are a good, safe place to start and provide good output for most models. For instance, you can't blow up your RAM on only 2048 context, and you can always increase it to whatever the model supports.
This is the maximum context that you will use with the model. Context is roughly the sum of the tokens in the system prompt + chat template + user prompts + model responses + tokens added to the model's context via retrieval augmented generation (RAG), which is the LocalDocs feature. You need to keep context length within two safe margins:
- Your system can only use so much memory. Using more than you have will cause severe slowdowns or even crashes.
- Your model is only capable of what it was trained for. Using more than that will give trash answers and gibberish.
Since we are talking about computer terminology here, 1k = 1024, not 1000. So 128k, as advertised by the phi3 model, translates to 1024 x 128 = 131072.
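If you want to sanity-check a context setting before loading anything, the arithmetic above can be done by hand. Below is a minimal sketch with made-up placeholder numbers; your system prompt, LocalDocs chunks, and chat history will differ.

```python
# Rough context-budget check. All counts below are hypothetical placeholders.
n_ctx = 128 * 1024            # "128k" advertised context = 131072 tokens
system_prompt_tokens = 200    # your system prompt
template_overhead = 50        # chat template tokens added around each message
localdocs_tokens = 2000       # tokens injected by RAG (the LocalDocs feature)
chat_history_tokens = 6000    # prior user prompts + model responses
max_response_tokens = 4096    # the response limit you set in GPT4All

used = (system_prompt_tokens + template_overhead + localdocs_tokens
        + chat_history_tokens + max_response_tokens)
print(f"{used} of {n_ctx} tokens budgeted, {n_ctx - used} to spare")
```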
I will use 4192, which allows for roughly 4k of response. I like allowing for a great response but want to stop the model at that point. (Maybe you want it longer? Try 8192.)
This is one that you need to think about if you have a small GPU or a big model.
This will be set to load all layers on the GPU. You may need to use fewer layers to get the model to work for you.
These settings are model independent. They are only for the GPT4All environment. You can play with them all you like.
The rest of these are special settings that need more training and experience to learn. They don't need to be changed most of the time.
You should now have a fully configured model. I hope it works for you!
Read on for more advanced topics such as:
- Jinja2
- Explain Jinja2 templates and how to decode them for use in GPT4All.
- Explain how the tokens work in the templates.
- Configuration Files Explained
- Explain why the model is now configured but still doesn't work.
- Explain the .json files used to make the gguf.
- Explain how the tokens work.
I see you are looking at a Jinja2 template.
Breaking down a Jinja2 template is fairly straightforward if you can follow a few rules.
You must keep the tokens as written in the Jinja2 and strip out all of the other syntax. Also watch for mistakes here; sometimes the uploader fails to include a functional Jinja2 template. The Jinja2 must have the following tokens:
- role beginning identifier tag
- role ending identifier tag
- roles
Sometimes they are combined into one, like <|user|>, which indicates both a role and a beginning tag.
Let's start at the beginning of this Jinja2.
> {% set loop_messages = messages %}
> {% for message in loop_messages %}
> {% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}
Most of this has to be removed because it's irrelevant to the LLM unless we get a Jinja2 parser from some nice contributor.
We keep <|start_header_id|>, as it states it is the starting header for the role.
We translate + message['role'] + into the role to be used for the template.
You will have to figure out what the role names used by this model are, but these are the common ones.
Sometimes the roles will be shown in the Jinja2, sometimes they won't.
- system (if model supports a system prompt)
- look for something like "if role system"
- user or human (sometimes)
- assistant or model (sometimes)
We keep <|end_header_id|>.
We keep \n\n, which translates into one new line (press enter) for each \n you see (two in this case).
Now we will translate message['content'] into the variable used by GPT4All:
- %1 for user messages
- %2 for assistant replies
We keep <|eot_id|>, which indicates the end of whatever the role was doing.
Now we have our "content" from this Jinja2 block, with all the extra stuff removed:
{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}
From what I can tell, GPT4All sends the BOS automatically and waits for the LLM to send the EOS in return.
- BOS will tell the LLM where it begins generating a new message from. You can skip the BOS token.
- "content" is also sent automatically by GPT4All. You can skip this content (not to be confused with message['content']).
This whole section is not used by the GPT4All template.
{% if loop.index0 == 0 %}
{% set content = bos_token + content %}
{% endif %}
{{ content }}
{% endfor %}
Finally, we get to the part that shows a role defined for the "assistant". The way it is written implies the other one above is for either a system or user role. (Probably both because it would simply show "user" if it wasn't dual purpose.)
This is left open-ended for the model to generate from this point forward. As we can see from its absence, the LLM is expected to provide an EOS tag when it is done generating. Follow the same rules as we did above.
{% if add_generation_prompt %}
{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
{% endif %}
This also provides us with an implied confirmation of how it should all look when it's done.
We will break this into two parts for GPT4All.
A System Prompt: (There is no variable; you will just write what you want in it.)
<|start_header_id|>system<|end_header_id|>
YOUR CUSTOM SYSTEM PROMPT TEXT HERE<|eot_id|>
A Chat Template:
<|start_header_id|>user<|end_header_id|>
%1<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
%2
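To make the placeholder substitution concrete, here is a minimal sketch (not GPT4All's actual code; the render_turn helper and the example message are made up for illustration) of how the two parts above combine into the raw text the model finally receives.

```python
# Illustration only: how the %1/%2 placeholders above could be filled in.
system_prompt = (
    "<|start_header_id|>system<|end_header_id|>\n\n"
    "YOUR CUSTOM SYSTEM PROMPT TEXT HERE<|eot_id|>"
)
chat_template = (
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "%1<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "%2"
)

def render_turn(user_message: str, assistant_reply: str = "") -> str:
    # %1 is the user message, %2 is the assistant reply (empty for the newest turn).
    return chat_template.replace("%1", user_message).replace("%2", assistant_reply)

# The assistant slot is left open so the model generates from that point on
# and is expected to emit <|eot_id|> when it is finished.
print(system_prompt + render_turn("Hello, who are you?"))
```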
Why didn't it work? It looks like it's all good!
Hint: You probably did it right and the model is not built properly. You can find out by following the next segment below.
So, the model you got from some stranger on the internet didn't work like you expected it to?
They probably didn't test it. They probably don't know it won't work for everyone else.
Some problems are caused by the settings provided in the config files used to make the gguf.
Perhaps llama.cpp doesn't support that model and GPT4All can't use it.
Sometimes the model is just bad. (maybe an experiment)
You will be lucky if they include the source files used for this exact gguf. (This person did not.)
The model used in the example above only links you to the source of their source. This means you can't tell what they did to it when they made the gguf from that source. After the gguf was made, either side, Microsoft or QuantFactory, may have changed anything.
In the following example, I will use a model with a known source. This source has an error, and they can fix it, or you can, like we did. (Expert: make your own gguf by converting and quantizing the source.)
The following relevant files were used in the making of the gguf.
- config.json (Look for "eos_token_id")
- tokenizer_config.json (Look for "eos_token" and "chat_template")
- generation_config.json (Look for "eos_token_id")
- special_tokens_map.json (Look for "eos_token" and "bos_token")
- tokenizer.json (Make sure those match this.)
We will begin with the tokenizer_config.json; it defines how the model's tokenizer should process input text.
{
"add_bos_token": false,
"add_eos_token": false,
"add_prefix_space": true,
"added_tokens_decoder": {
"0": {
"content": "<unk>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"1": {
"content": "<|startoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"2": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"7": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
}
},
"bos_token": "<|startoftext|>",
"chat_template": "{% if messages[0]['role'] == 'system' %}{% set system_message = messages[0]['content'] %}{% endif %}{% if system_message is defined %}{{ system_message }}{% endif %}{% for message in messages %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ '<|im_start|>user\\n' + content + '<|im_end|>\\n<|im_start|>assistant\\n' }}{% elif message['role'] == 'assistant' %}{{ content + '<|im_end|>' + '\\n' }}{% endif %}{% endfor %}",
"clean_up_tokenization_spaces": false,
"eos_token": "<|im_end|>",
"legacy": true,
"model_max_length": 16384,
"pad_token": "<unk>",
"padding_side": "right",
"sp_model_kwargs": {},
"spaces_between_special_tokens": false,
"split_special_tokens": false,
"tokenizer_class": "LlamaTokenizer",
"unk_token": "<unk>",
"use_default_system_prompt": false
}
Here we want to make sure that the "chat_template" exists. (It exists, good.)
"chat_template": "{% if messages[0]['role'] == 'system' %}{% set system_message = messages[0]['content'] %}{% endif %}{% if system_message is defined %}{{ system_message }}{% endif %}{% for message in messages %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ '<|im_start|>user\\n' + content + '<|im_end|>\\n<|im_start|>assistant\\n' }}{% elif message['role'] == 'assistant' %}{{ content + '<|im_end|>' + '\\n' }}{% endif %}{% endfor %}",
There is a BOS token and an EOS token. (They exist, excellent!)
"bos_token": "<|startoftext|>",
"eos_token": "<|im_end|>",
You can also see what id numbers to expect, very nice.
"7": {
"content": "<|im_end|>",
Hopefully all of those tokens match in this file and in the other files as well. (let's see)
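Before moving on: if you have Python handy, you can render the chat_template above with the jinja2 library and a couple of dummy messages to preview its output. This is only a convenience check; minja inside GPT4All may still reject constructs that Python Jinja2 accepts. A minimal sketch (the template string is copied from the value above with the JSON escaping adjusted for a Python string):

```python
# Render the model's chat_template with python jinja2 to preview its output.
from jinja2 import Template

chat_template = (
    "{% if messages[0]['role'] == 'system' %}{% set system_message = messages[0]['content'] %}{% endif %}"
    "{% if system_message is defined %}{{ system_message }}{% endif %}"
    "{% for message in messages %}{% set content = message['content'] %}"
    "{% if message['role'] == 'user' %}{{ '<|im_start|>user\\n' + content + '<|im_end|>\\n<|im_start|>assistant\\n' }}"
    "{% elif message['role'] == 'assistant' %}{{ content + '<|im_end|>' + '\\n' }}{% endif %}{% endfor %}"
)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Hello!"},
]
print(Template(chat_template).render(messages=messages))
```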
Open up the next important file, special_tokens_map.json. This file is special because, when the model is built, the tokens listed in it are treated differently from regular vocabulary tokens. For example:
- They may be exempt from subword tokenization; they can never be broken apart! (See the sketch after this list.)
- For example, the word "unhappiness" might be tokenized into "un", "happy", and "ness".
- However, special tokens like [EOS], [BOS], are typically treated as single, indivisible units.
- They have specific positions in input sequences, like the BOS and EOS, and the model has learned a special meaning for them.
- BOS (Beginning of Sequence) token:
  - Often represented as "[BOS]" or "<s>"
  - Typically placed at the very start of an input sequence.
  - Signals to the model that a new sequence is beginning.
- EOS (End of Sequence) token:
  - Often represented as "[EOS]" or "</s>"
  - Typically placed at the very end of an input sequence.
  - Signals to the model that the sequence has ended.
  - Crucial for tasks where the model needs to know when to stop generating output.
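If you want to see this "indivisible" behaviour for yourself, the sketch below uses a Hugging Face transformers tokenizer; "your/model-repo" is a placeholder for whichever repository you are inspecting, and the exact subword pieces will vary by model.

```python
# Regular words get split into subword pieces; registered special tokens
# stay as single, indivisible ids.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your/model-repo")  # placeholder repo id

print(tokenizer.tokenize("unhappiness"))   # several subword pieces, e.g. ['un', 'happiness']
print(tokenizer.all_special_tokens)        # the tokens declared in special_tokens_map.json
print(tokenizer.encode("<|im_end|>", add_special_tokens=False))  # one id, if registered as special
```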
Let's take a look at this special_tokens_map.json:
{
"bos_token": {
"content": "<|startoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": {
"content": "<unk>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"unk_token": {
"content": "<unk>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}
As you can imagine, if we are missing any special tokens here that were in the tokenizer_config.json, you can end up with gibberish as the output. The tokenizer might break those tokens up, and the model will never know it was supposed to stop or start or do whatever else was important to its training.
Next let's look at the tokenizer.json file. This file includes all the "vocabulary" of the model: all the tokens the model will use and their "mapping" to ids. We need to make sure it all matches the other files. For instance, we know the tokenizer_config.json believes a few things:
"7": {
"content": "<|im_end|>",
It must match the tokenizer.json to work. In this case, take a close look at the first few of the 64000 tokens.
"added_tokens": [
{
"id": 0,
"content": "<unk>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 1,
"content": "<|startoftext|>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 2,
"content": "<|endoftext|>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 6,
"content": "<|im_start|>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 7,
"content": "<|im_end|>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
...
"陨": 63999
},
The id number of the token is 7, and the token itself is <|im_end|>.
This must be true for it to work. Everything in the files must match, so you need to cross-check each file for errors.
Now we will see the generation_config.json.
{
"_from_model_config": true,
"bos_token_id": 1,
"eos_token_id": 2,
"transformers_version": "4.40.0"
}
If something is set here, it is enforced during generation. You may have missed it if you weren't paying attention: this doesn't match our other files!
The other files tell the model to use "eos_token": "<|im_end|>", while this one is watching for "eos_token_id": 2, and we know from tokenizer.json that "id": 2 is "content": "<|endoftext|>".
That isn't going to work. The gguf model you downloaded will have an endless generation loop unless this is corrected.
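If you are rebuilding the gguf yourself from the source files (the expert route mentioned earlier), one simple fix is to patch generation_config.json so its eos_token_id points at the <|im_end|> id before converting. A minimal sketch, assuming the file sits in the current directory:

```python
# Point generation_config.json at the eos token the model was actually trained to emit.
import json

with open("generation_config.json") as f:
    cfg = json.load(f)

cfg["eos_token_id"] = 7  # id of <|im_end|> in tokenizer.json; was 2 (<|endoftext|>)

with open("generation_config.json", "w") as f:
    json.dump(cfg, f, indent=2)
```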
Finally, let's look at the config.json file. When a model is loaded, this is what it will know about itself.
{
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 11008,
"max_position_embeddings": 16384,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 48,
"num_key_value_heads": 4,
"pretraining_tp": 1,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 5000000,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.40.0",
"use_cache": false,
"vocab_size": 64000
}
Well, here are all the things the model believes to be true. We can see it is also wrong. The model believes "eos_token_id": 2 will stop the generation, but it was trained to use "eos_token_id": 7, which the chat template is telling us to use. That token is also found in the special_tokens_map.json, so it is protected for this purpose.
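A quick way to catch this kind of mismatch before you convert (or even download) anything is to cross-check the ids programmatically. A minimal sketch, assuming the source JSON files sit in one directory and only checking the eos token; extend it to the bos token and anything else you care about:

```python
# Cross-check that every config file agrees on which token ends generation.
import json

def load(name):
    with open(name) as f:
        return json.load(f)

tok_cfg   = load("tokenizer_config.json")    # "eos_token": "<|im_end|>"
spec_map  = load("special_tokens_map.json")  # should name the same eos token
gen_cfg   = load("generation_config.json")   # "eos_token_id": 2  (the bug)
model_cfg = load("config.json")              # "eos_token_id": 2  (the bug)
tokenizer = load("tokenizer.json")           # maps token strings to ids

eos_text = tok_cfg["eos_token"]              # a plain string in this model's files
assert spec_map["eos_token"]["content"] == eos_text, "special_tokens_map.json disagrees"

ids = {t["content"]: t["id"] for t in tokenizer["added_tokens"]}
expected_id = ids[eos_text]
print(f"eos token {eos_text!r} has id {expected_id}")

for name, value in [("generation_config.json", gen_cfg.get("eos_token_id")),
                    ("config.json", model_cfg.get("eos_token_id"))]:
    status = "OK" if value == expected_id else f"MISMATCH (found {value})"
    print(f"{name}: {status}")
```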
Now you know why your model won't work; hopefully you didn't download it yet!