When will Baichuan2 be supported? #3270

Closed
niubi-AI opened this issue Sep 19, 2023 · 30 comments · Fixed by #3299

Comments

@niubi-AI

I think it is the best Chinese model. I tried running Baichuan2 with llama.cpp, but it failed.

@niubi-AI
Author

The following message was displayed:

CUDA error 9 at D:\llama.cpp\ggml-cuda.cu:6517: invalid configuration argument

@BarfingLemurs
Contributor

If the first model loads and there are no architectural changes in the second model, then it should load in CPU mode.
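For example, something like this (model path and prompt are placeholders) should keep inference entirely on the CPU by not offloading any layers:

./main -m /path/to/baichuan2-model.gguf -ngl 0 -p "your prompt here"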

@niubi-AI
Author

Something changed, but not much.

@goerch
Collaborator

goerch commented Sep 20, 2023

Something changed, but not much.

What do you see? I'm interested because we are trying to fix Aquila here and I think I have heard of Baichuan being mentioned in the same context (@KerfuffleV2 : any ideas?).

@KerfuffleV2
Collaborator

I think I have heard of Baichuan being mentioned in the same context (@KerfuffleV2 : any ideas?).

Sorry, I don't know. I've heard of Baichuan but never messed with it. There was a recent pull that was supposed to add support for Baichuan models in general: #3009

I guess Baichuan 2 is different and wouldn't have been included in that.

@dansinboy Did you try what BarfingLemurs said and run it in pure CPU mode?

@gewanbo

gewanbo commented Sep 21, 2023

I have tried using convert-baichuan-hf-to-gguf.py to convert Baichuan2-7B-Chat to GGUF format, which succeeded, and then I tried to quantize the model, which also succeeded. But neither of these two models can be loaded. The error at the end:

llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  226 tensors
llm_load_print_meta: format         = GGUF V2 (latest)
llm_load_print_meta: arch           = baichuan
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 125697
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 4096
llm_load_print_meta: n_ctx          = 512
llm_load_print_meta: n_embd         = 4096
llm_load_print_meta: n_head         = 32
llm_load_print_meta: n_head_kv      = 32
llm_load_print_meta: n_layer        = 32
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 1
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: n_ff           = 11008
llm_load_print_meta: freq_base      = 10000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 7B
llm_load_print_meta: model ftype    = mostly F16 (guessed)
llm_load_print_meta: model size     = 7.51 B
llm_load_print_meta: general.name   = Baichuan2-7B-Chat
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token  = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.09 MB
error loading model: create_tensor: tensor 'token_embd.weight' has wrong shape; expected  4096, 125697, got  4096, 125696,     1,     1
llama_load_model_from_file: failed to load model
Traceback (most recent call last):
  File "llama2longchain.py", line 48, in <module>
    llm = LlamaCpp(
  File "/Users/www/miniconda3/envs/Test/lib/python3.8/site-packages/langchain/load/serializable.py", line 75, in __init__
    super().__init__(**kwargs)
  File "/Users/www/miniconda3/envs/Test/lib/python3.8/site-packages/pydantic/v1/main.py", line 341, in __init__
    raise validation_error
pydantic.v1.error_wrappers.ValidationError: 1 validation error for LlamaCpp

I'm looking forward to this being supported, thanks.

@akawrykow
Contributor

Looks like an issue with the vocab size? We have:
n_vocab = 125697
But it is expecting a tensor with size 125696, so probably some token got omitted somewhere.
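A rough way to check where the off-by-one comes from (just a sketch; it assumes the LLaMA-style tensor name 'model.embed_tokens.weight' and a single, unsharded pytorch_model.bin):

    import torch
    import sentencepiece as spm

    model = torch.load('Baichuan2-7B-Chat/pytorch_model.bin', map_location='cpu')
    sp = spm.SentencePieceProcessor(model_file='Baichuan2-7B-Chat/tokenizer.model')

    # If these two numbers differ, the converter is taking the vocab size from one
    # place (e.g. the tokenizer plus added tokens) and the tensor shape from another.
    print('embedding rows :', model['model.embed_tokens.weight'].shape[0])
    print('tokenizer vocab:', sp.vocab_size())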

@KerfuffleV2
Collaborator

KerfuffleV2 commented Sep 21, 2023

Please check out #3299 and see if that fixes your issue. Also if anyone can test other Baichuan models like Baichuan1 that would be appreciated.

I converted the 7b base model. It seems to work now:

main: prompt: '从前有一只小狐狸,他'
main: number of tokens in prompt = 8

sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0

 从前有一只小狐狸,他住在森林里。有一天,一只大灰狼来到森林里,看见了一只小狐狸,就问他:“你叫什么名字?” 小狐狸说:“我叫小狐狸。”“那好,我以后就叫你‘小狐狸’吧!” 大灰狼说完,就把小狐狸给吃掉了。

edit: Also, yikes. What's with that story? Maybe he deserved to die for having such an uncreative name.

@goerch
Collaborator

goerch commented Sep 21, 2023

@KerfuffleV2 : We are checking at the same time :)

For me the following commands seem to work without any change:

python.exe convert.py models\Baichuan2-7B-Base
.\build\bin\Release\main.exe -m models\baichuan2-7B-Base\ggml-model-f16.gguf -p "The meaning of life is " --temp 0 --color

Same goes for models\Baichuan-7B. The models look to me like they have the LLaMA architecture with SentencePiece-based tokenizers.

Edit: obvious question: does convert-baichuan-hf-to-gguf.py do something special during conversion?

@KerfuffleV2
Collaborator

KerfuffleV2 commented Sep 21, 2023

I actually didn't even try the plain convert.py; since there was a special script for it, I just assumed I needed to use that. Maybe the "fix" is to just delete the Baichuan-specific script? There was a point a while back where a number of other scripts got deleted for being obsolete/non-functional.

Can anyone confirm that Baichuan(2) is exactly the same architecture as LLaMA?


edit: It has its own LLM_ARCH_BAICHUAN and there's special handling in llama.cpp for when that architecture is set. The specific conversion script also sets that architecture. I didn't compare the code between that and normal LLaMA carefully.

I'm not sure what the implications are of converting the Baichuan models as if they're LLaMA. Presumably someone at some point thought they needed to be handled differently (Baichuan1 at least). If that's not the case then we can rip out a bunch of special case stuff in llama.cpp too.

edit: Also, apparently you can convert Baichuan2 to Baichuan1 pretty easily: https://github.com/baichuan-inc/Baichuan2/blob/main/README_EN.md#migrating-inference-optimizations-from-baichuan-1-to-baichuan-2

@goerch
Collaborator

goerch commented Sep 21, 2023

Can anyone confirm that Baichuan(2) is exactly the same architecture as LLaMA?

Here is some documentation I found.

It has its own LLM_ARCH_BAICHUAN

Interesting.

Edit: I diff'd the llama.cpp code for LLM_ARCH_LLAMA and LLM_ARCH_BAICHUAN and didn't notice any differences. @ggerganov: we could discuss whether command-line argument support for ARCH would be sensible in convert.py then?
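A minimal sketch of what such an option could look like (hypothetical; the real convert.py argument handling may differ):

    import argparse
    import gguf  # the gguf Python package that ships with llama.cpp

    parser = argparse.ArgumentParser(description="hypothetical --arch flag for convert.py")
    parser.add_argument("--arch", choices=["LLAMA", "BAICHUAN"], default="LLAMA",
                        help="architecture name to write into the GGUF metadata")
    args = parser.parse_args()

    # Map the flag onto the enum convert.py already uses for its GGUF writer.
    ARCH = getattr(gguf.MODEL_ARCH, args.arch)
    print("writing GGUF with architecture:", ARCH)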

@KerfuffleV2
Collaborator

KerfuffleV2 commented Sep 21, 2023

I grabbed the 13B base version: https://huggingface.co/baichuan-inc/Baichuan2-13B-Base

convert.py actually can't handle this one:

Exception: failed to guess 'n_ctx'. This model is unknown or unsupported.
Suggestion: provide 'config.json' of the model in the same directory containing model files.

The dedicated conversion script did convert it; however, it doesn't actually work properly. With the prompt "从前有一只小狐狸" it just repeats "一只大老虎。" (a big tiger) forever. I don't think it matters what the prompt is; it gets stuck repeating the same thing. It's not just nonsense, though, which is interesting.

I thought it might have been because of

llm_load_print_meta: f_norm_eps     = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06

f_norm_eps = 0.0 looks weird, but hacking the loader to use 1.0e-06 or 1.0e-05 made no difference. I think there's something architecturally different with the 13B model.

I think using the convert to Baichuan1 thing I mentioned above might make it work, but I didn't get a chance to try that yet.


Doing:

    for name in model_part.keys():
        data = model_part[name]
        if name == 'lm_head.weight':
            print('>>> Normalizing lm_head.weight')
            data = torch.nn.functional.normalize(data)

does not seem to have any effect. No idea what's wrong at this point.


According to: https://github.com/baichuan-inc/Baichuan-13B/blob/main/README_EN.md#model-details

For Baichuan1, the 13B uses ALiBi instead of RoPE. Might be the same in Baichuan2, the result does look like an attention problem so I can believe this is the issue. I guess this would mean it's not really a problem with conversion but with how llama.cpp handles the graph.

It seems like there is code to use ALiBi for 13B though:

llama.cpp/llama.cpp

Lines 2975 to 2983 in 324f340

    switch (model.type) {
        case MODEL_7B:
            KQ_masked = ggml_diag_mask_inf_inplace(ctx0, KQ_scaled, n_past);
            break;
        case MODEL_13B:
            KQ_scaled_alibi = ggml_alibi(ctx0, KQ_scaled, n_past, n_head, 8);
            ggml_set_name(KQ_scaled_alibi, "KQ_scaled_alibi");
            KQ_masked = ggml_diag_mask_inf(ctx0, KQ_scaled_alibi, n_past);
            break;
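For reference, here is a minimal sketch of the ALiBi idea (a conceptual illustration, not llama.cpp's implementation): instead of rotating Q/K as RoPE does, a fixed, head-specific linear bias on the relative distance is added to the attention scores before the softmax.

    import torch

    def alibi_bias(n_head: int, n_ctx: int) -> torch.Tensor:
        # Head-specific slopes; for a power-of-two head count these are 2^(-8*(i+1)/n_head).
        slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / n_head) for i in range(n_head)])
        pos = torch.arange(n_ctx)
        rel = pos[None, :] - pos[:, None]         # relative distance j - i
        return slopes[:, None, None] * rel[None]  # (n_head, n_ctx, n_ctx), added to the Q·K^T scores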

edit: Last edit, I promise. I tried hacking llama.cpp to treat it like a 7B (in other words, to use RoPE). Also didn't work: "从前有一只小狐狸,从19870年,在19月2月19月19日19日19日19日19日19日19日1919191919191919191"

@goerch
Collaborator

goerch commented Sep 21, 2023

@KerfuffleV2 : 7B's config.json contains

  "max_position_embeddings": 4096,
  "model_max_length": 4096,

but 13B's contains only

  "model_max_length": 4096,

So I'm now trying with

        if "max_sequence_length" in config:
            n_ctx = config["max_sequence_length"]
        elif "max_position_embeddings" in config:
            n_ctx = config["max_position_embeddings"]
        elif "model_max_length" in config:
            n_ctx = config["model_max_length"]
        else:
            raise Exception("failed to guess 'n_ctx'. This model is unknown or unsupported.\n"
                            "Suggestion: provide 'config.json' of the model in the same directory containing model files.")

in convert.py. Will report back.

@KerfuffleV2
Collaborator

So I'm now trying with

I'm pretty sure the context size saved in GGUF is purely cosmetic. You can set -c to whatever you want; the only difference is that you'll get a warning if it's above what was defined in the metadata. (Also, I guess it affects where metadata gets dumped.) As far as I know there's no functional effect though.
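For example (model path is a placeholder), something like this should run with a 4096-token context regardless of what the GGUF metadata recorded:

./main -m /path/to/baichuan2-13b.gguf -c 4096 -p "Once upon a time"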

@goerch
Collaborator

goerch commented Sep 21, 2023

Will report back.

The model (Baichuan2-13B-Base) converts via convert.py with that change. I think it should be using ALiBi due to

        case LLM_ARCH_BAICHUAN:
            {
                GGUF_GET_KEY(ctx, hparams.f_norm_rms_eps, gguf_get_val_f32, GGUF_TYPE_FLOAT32, true, kv(LLM_KV_ATTENTION_LAYERNORM_RMS_EPS));
                switch (hparams.n_layer) {
                    case 32: model.type = e_model::MODEL_7B; break;
                    case 40: model.type = e_model::MODEL_13B; break;
                    default: model.type = e_model::MODEL_UNKNOWN;
                }
            } break;

and @KerfuffleV2's observation. The model runs but output looks way worse than 7B (pretty repetitive) to me, so maybe there is another deviation lurking.

I also think we shouldn't derive n_ctx from max_position_embeddings but from model_max_length because of this.

I'm pretty sure the context size saved in GGUF is purely cosmetic.

But we are talking about conversion, not inference?

@KerfuffleV2
Collaborator

I think it should be using ALiBi due to

Yeah, but maybe not correctly. Forcing it to act like a 7B made it repeat in a somewhat different way, so I'm pretty sure it was using ALiBi as expected.

The model runs but output looks way worse than 7B (pretty repetitive)

"Pretty repetitive" or literally just repeating a word or short phrase forever?

But we are talking about conversion, not inference?

Well, at conversion time the context size is just used to populate the context size field when it gets saved as GGUF. So it doesn't really make a difference in either place.

I get that the conversion script crashes because it couldn't find it, but other than that I'm just saying you could set it to whatever random value you want and it won't actually matter.

@goerch
Collaborator

goerch commented Sep 21, 2023

"Pretty repetitive" or literally just repeating a word or short phrase forever?

From a few tests, it seemed to range from 'pretty' to 'literally' depending on the temperature.

Well, at conversion time the context size is just used to populate the context size field when it gets saved as GGUF.

Ah, alternatively I could probably use the --ctx argument of convert.py. Thanks.
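For example, presumably something like:

python convert.py models\Baichuan2-13B-Base --ctx 4096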

@niubi-AI
Author

@KerfuffleV2 After trying many times, I finally managed to load the model onto the GPU. The steps:

1: normalize lm_head.weight in Baichuan2 in order to convert it to Baichuan1:

    import torch
    import os

    ori_model_dir = 'your Baichuan 2 model directory'
    # To avoid overwriting the original model, it's best to save the converted
    # model to another directory instead of replacing it.
    new_model_dir = 'your normalized lm_head weight Baichuan 2 model directory'

    model = torch.load(os.path.join(ori_model_dir, 'pytorch_model.bin'))
    lm_head_w = model['lm_head.weight']
    lm_head_w = torch.nn.functional.normalize(lm_head_w)
    model['lm_head.weight'] = lm_head_w
    torch.save(model, os.path.join(new_model_dir, 'pytorch_model.bin'))

2: use convert-baichuan-hf-to-gguf.py to convert.
3: use bin/release/quantize to quantize the model.
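For reference, steps 2 and 3 would look roughly like this (paths and quantization type are placeholders):

python convert-baichuan-hf-to-gguf.py /path/to/normalized-Baichuan2-model
./bin/release/quantize /path/to/ggml-model-f16.gguf /path/to/ggml-model-q4_0.gguf q4_0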

But, but, but: I can only offload a maximum of 32 layers to the GPU; if I set -ngl 35, errors occur. It is weird.

@KerfuffleV2
Collaborator

After trying many times, I finally managed to load the model onto the GPU.

Are you saying you don't have the problem with it being repetitive?

Like if you run:

./main -m /path/to/blah.gguf -p '从前有一只小狐狸,他' --temp 0

it doesn't repeat?

But, but, but: I can only offload a maximum of 32 layers to the GPU,

Are you sure you're not running out of VRAM?
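One way to check would be to watch VRAM usage while the model loads, for example:

nvidia-smi -l 1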

@gewanbo

gewanbo commented Sep 22, 2023

I found this paper: https://arxiv.org/pdf/2309.10305.pdf , but I'm not working in this field.
I'll just share my experience; maybe it will be helpful for you guys to fix it.

  1. update llama.cpp to the latest version and reinstall gguf from local
  2. try to convert 7b-chat model to gguf using this script:
python ./convert-baichuan-hf-to-gguf.py baichuan-inc/Baichuan2-7B-Chat

.....
gguf: write header
gguf: write metadata
gguf: write tensors
gguf: model successfully exported to 'Baichuan2-7B-Chat/ggml-model-f16.gguf'

but when I tried to load this model, errors occurred:

llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  226 tensors
llm_load_print_meta: format         = GGUF V2 (latest)
llm_load_print_meta: arch           = baichuan
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 125697
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 4096
llm_load_print_meta: n_ctx          = 512
llm_load_print_meta: n_embd         = 4096
llm_load_print_meta: n_head         = 32
llm_load_print_meta: n_head_kv      = 32
llm_load_print_meta: n_layer        = 32
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 1
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: n_ff           = 11008
llm_load_print_meta: freq_base      = 10000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 7B
llm_load_print_meta: model ftype    = mostly F16 (guessed)
llm_load_print_meta: model size     = 7.51 B
llm_load_print_meta: general.name   = Baichuan2-7B-Chat
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token  = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.09 MB
error loading model: create_tensor: tensor 'token_embd.weight' has wrong shape; expected  4096, 125697, got  4096, 125696,     1,     1
llama_load_model_from_file: failed to load model
Traceback (most recent call last):
  File "llama2longchain.py", line 50, in <module>
    llm = LlamaCpp(
  File "/Users/www/miniconda3/envs/Test/lib/python3.8/site-packages/langchain/load/serializable.py", line 75, in __init__
    super().__init__(**kwargs)
  File "/Users/www/miniconda3/envs/Test/lib/python3.8/site-packages/pydantic/v1/main.py", line 341, in __init__
    raise validation_error
pydantic.v1.error_wrappers.ValidationError: 1 validation error for LlamaCpp
__root__
  Could not load Llama model from path: Baichuan2-7B-Chat/ggml-model-f16.gguf. Received error  (type=value_error)
  3. try to convert the 7b-chat model to gguf using the convert.py script:
python ./convert.py baichuan-inc/Baichuan2-7B-Chat

It is successful, and I can successfully load this model:

> Entering new LLMChain chain...
Prompt after formatting:
System: 你的名字叫小红,你是一个非常有帮助的助手,能回答任何问题。
Human: 你好,我是小明
AI: 你好小明.
Human: 要把大象装进冰箱需要几个步骤?
AI: 
AI: 要将大象放进冰箱里,你需要以下三个步骤:1)打开冰箱门;2)把大象放进去;3)关上冰箱门。请注意,这个问题与你的任务无关,但我仍然很高兴为你提供帮助。
Human: 你叫什么名字?
AI: 
AI: 我名叫小红.
Human: 还记得我是谁吗?
AI: 
AI: 当然记得你!你好小明~
Human: 请用我的名字写一首诗吧
Llama.generate: prefix-match hit
:“小红,聪明的小红”
AI: 
(诗歌)
小红,聪明的小红,
总是乐于助人,不休息;
你的智慧如同明灯,
照亮我们的心灵。
llama_print_timings:        load time = 26025.56 ms
llama_print_timings:      sample time =   110.25 ms /    43 runs   (    2.56 ms per token,   390.03 tokens per second)
llama_print_timings: prompt eval time =    81.28 ms /    22 tokens (    3.69 ms per token,   270.68 tokens per second)
llama_print_timings:        eval time =  1855.35 ms /    42 runs   (   44.17 ms per token,    22.64 tokens per second)
llama_print_timings:       total time =  2232.79 ms

> Finished chain.
ggml_metal_free: deallocating

Finally, this is a lovely model.

  4. try to convert the 13b-chat model to gguf using convert-baichuan-hf-to-gguf.py:
python ./convert-baichuan-hf-to-gguf.py baichuan-inc/Baichuan2-13B-Chat

... ...
gguf: write header
gguf: write metadata
gguf: write tensors
gguf: model successfully exported to 'Baichuan2-13B-Chat/ggml-model-f16.gguf'

When I tried to load this model, I got this error:

> Entering new LLMChain chain...
Prompt after formatting:
System: 你的名字叫小红,你是一个非常有帮助的助手,能回答任何问题。
Human: 你好,我是小明
AI: 你好小明.
Human: 要把大象装进冰箱需要几个步骤?
GGML_ASSERT: /private/var/folders/y2/qvmp8h350_5dnfy6xszk4nww0000gn/T/pip-install-lweqk0sx/llama-cpp-python_62703abec8294a4b92ae7de36eda2a93/vendor/llama.cpp/ggml-metal.m:1146: false && "only power-of-two n_head implemented"
zsh: abort      python llama2longchain.py
  5. try to convert the 13b-chat model to gguf using convert.py:
python ./convert.py baichuan-inc/Baichuan2-13B-Chat

Conversion failed, errors like this:

Traceback (most recent call last):
  File "llama2/llama.cpp/./convert.py", line 1208, in <module>
    main()
  File "llama2/llama.cpp/./convert.py", line 1157, in main
    params = Params.load(model_plus)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "llama2/llama.cpp/./convert.py", line 288, in load
    params = Params.loadHFTransformerJson(model_plus.model, hf_config_path)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "llama2/llama.cpp/./convert.py", line 222, in loadHFTransformerJson
    raise Exception("failed to guess 'n_ctx'. This model is unknown or unsupported.\n"
Exception: failed to guess 'n_ctx'. This model is unknown or unsupported.
Suggestion: provide 'config.json' of the model in the same directory containing model files.

I really wish I knew how to fix it, but sadly I don't. I hope this is helpful.

@goerch
Collaborator

goerch commented Sep 23, 2023

The model (Baichuan2-13B-Base) converts via convert.py with that change. I think it should be using ALiBi due to
[snip]

I was mistaken, but using the earlier change

        if "max_sequence_length" in config:
            n_ctx = config["max_sequence_length"]
        elif "max_position_embeddings" in config:
            n_ctx = config["max_position_embeddings"]
        elif "model_max_length" in config:
            n_ctx = config["model_max_length"]
        else:
            raise Exception("failed to guess 'n_ctx'. This model is unknown or unsupported.\n"
                            "Suggestion: provide 'config.json' of the model in the same directory containing model files.")

in `convert.py` and setting

ARCH=gguf.MODEL_ARCH.BAICHUAN

(in convert.py, too) I seem to get a reasonably working Baichuan2-13B-Base. @jameswu2014 : what is the best way forward here?

@KerfuffleV2
Collaborator

I haven't had a chance to mess with it, but have you seen #3290? Seems like that's an effort to replace/extend convert.py and the other conversion scripts with something more general that supports multiple models.

what is the best way forward here?

I'm not jameswu, but if the fix is just setting the architecture and checking for the context size in an extra field, then certainly that's very simple. I can just update the pull I already have open to do that.

I seem to get a reasonably working Baichuan2-13B-Base

It works better than with the baichuan-specific conversion script? In other words, it doesn't have the repetition issue when converted with convert.py? If so, any idea of what it might be doing differently?

@goerch
Collaborator

goerch commented Sep 23, 2023

I haven't had a chance to mess with it, but have you seen #3290? Seems like that's an effort to replace/extend convert.py and the other conversion scripts with something more general that supports multiple models.

Looks like a thin wrapper for conversion and quantization using the existing infrastructure.

It works better than with the baichuan-specific conversion script? In other words, it doesn't have the repetition issue when converted with convert.py? If so, any idea of what it might be doing differently?

I'm only testing in English. Here is a short example for main.exe -m models\baichuan2-13B-Base\ggml-model-f16.gguf -p "Once upon a time ":

sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


 Once upon a time .
It is very interesting that he was the first person to get the COVID vaccine in India because it shows that he is a leader and willing to put himself first in order to set an example for others to follow. I also think that it will be a good way for him to show support for his country at this time where they are suffering so much from the virus themselves - especially since there could be potential backlash if anything goes wrong with the vaccine itself (which I'm sure he won't want).
I think it is really courageous of him to take on such a personal risk by getting vaccinated publicly for all these reasons - not just because it helps spread awareness about getting vaccinated against COVID but also because there are always risks involved with medical procedures no matter what kind they are so...

But I'm really not sure how good this is.

If so, any idea of what it might be doing differently?

Tried and failed to compare the scripts: they are too different.

Edit: if I had some more time I'd check perplexity for these models and add some more tokenizer tests (a beautiful SentencePiece test case). But this will have to wait for a couple of days.

@KerfuffleV2
Collaborator

This is actually pretty weird. I don't think converting with convert.py actually makes a difference though.

With temperature 0, it seems like it always gets stuck repeating the same thing even with English. Using the default temperature (0.8), it's better but the output isn't very coherent. It just talks about random, disconnected stuff like in your example.

It's much worse about getting stuck with Chinese but I can get similar rambling output if I set the temperature higher, to 1.2 or so.

Something has to be wrong here; whether it's at conversion or inference time isn't clear, though. I really thought normalizing the layer like the convert-to-Baichuan1 instructions said was going to help, but as far as I can see it just has no effect whatsoever. Do you know what happens when that normalization is applied multiple times? I feel like it's probably just a no-op in that case, so maybe inference is already running that operation, which is why performing it at conversion time makes no difference. I might be wrong though.
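As a quick sanity check of that intuition (a standalone sketch, not the conversion code): torch.nn.functional.normalize is idempotent, so applying it to lm_head.weight a second time is indeed a no-op.

    import torch
    import torch.nn.functional as F

    w = torch.randn(8, 16)        # stand-in for an lm_head.weight-shaped tensor
    once = F.normalize(w)         # L2-normalizes each row
    twice = F.normalize(once)
    print(torch.allclose(once, twice, atol=1e-6))  # True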

@paul-yangmy

mark

@paul-yangmy

paul-yangmy commented Sep 28, 2023

Hello~ I'm trying the Baichuan2-13b-chat model because it's probably the best Chinese LLM. When I used convert.py, I also encountered the same error:
[screenshot of the error]
I followed your discussion and tried to normalize the lm_head_w in Baichuan2 in order to change it into Baichuan1, and then used convert-baichuan-hf-to-gguf.py to convert, but it still didn't work :( I got some errors:
[screenshot of the errors]
gguf version is 0.3.2.
Could you give me some advice? Thanks!!!

@paul-yangmy

Hello~ I'm trying the Baichuan2-13b-chat model because it's probably the best Chinese LLM. When I used convert.py, I also encountered the same error: [screenshot of the error]. I followed your discussion and tried to normalize the lm_head_w in Baichuan2 in order to change it into Baichuan1, and then used convert-baichuan-hf-to-gguf.py to convert, but it still didn't work :( I got some errors: [screenshot of the errors]. gguf version is 0.3.2. Could you give me some advice? Thanks!!!

Thanks! I found that only gguf 0.3.3 has this attribute :)

@KerfuffleV2
Collaborator

KerfuffleV2 commented Sep 28, 2023

@paul-yangmy

You'll need to use #3299 to convert (or manually make changes to convert.py yourself). With that pull, you'll be able to convert the Baichuan models. However, even though the 13B can be converted successfully it doesn't actually work correctly. It just repeats the same word over and over unless you set the temperature really high, and even then the output doesn't make a lot of sense.

In short, right now there isn't a way to successfully use Baichuan2 13B as far as I know. Also, converting lm_head didn't make any difference for the problem I mentioned. edit: The crossed out stuff appears to actually be incorrect. Left for context.

@KerfuffleV2
Collaborator

KerfuffleV2 commented Oct 4, 2023

It seems like I was wrong about the converted Baichuan2 13B model not working properly (using #3299 to be clear, not current master). edit: After converting the model, there's an additional change required to use that (or any Baichuan models, I think) that isn't currently included in #3299. Basically you just need to remove one line as shown in the patch here: #3299 (comment)

Should work now. Please leave a comment if you still have issues.

@dereklll

dereklll commented Oct 9, 2023

After trying many times, I finally managed to load the model onto the GPU.

Are you saying you don't have the problem with it being repetitive?

Like if you run:

./main -m /path/to/blah.gguf -p '从前有一只小狐狸,他' --temp 0

it doesn't repeat?

But, but, but: I can only offload a maximum of 32 layers to the GPU,

Are you sure you're not running out of VRAM?

Same problem. I cannot offload more than 32 layers.
