
RuntimeError: CUDA error: device-side assert triggered when using Llama 2 from HF #38

Closed
andreasbinder opened this issue Sep 16, 2023 · 3 comments


@andreasbinder

andreasbinder commented Sep 16, 2023

Good day!
I tried to run the GSM8K example with the model from HF as you described (I only adjusted the log and prompt paths):

  CUDA_VISIBLE_DEVICES=0,1 python examples/rap_gsm8k/inference.py --base_lm hf --hf_path meta-llama/Llama-2-70b-hf --hf_peft_path None --hf_quantized 'nf4'

However, I receive the following error:

    RuntimeError: CUDA error: device-side assert triggered
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I think this is related to the warning also mentioned in the log trace:

    llm-reasoners/reasoners/lm/hf_model.py:137: UserWarning: the eos_token '\n' is encoded into [29871, 13] with length != 1, using 13 as the eos_token_id
      warnings.warn(f'the eos_token {repr(token)} is encoded into {tokenized} with length != 1, '

From searching on GitHub, I think it is related to an input mismatch caused by incorrect tokenisation (several similar issues describe this).
Did you also encounter this problem, and if so, how did you work around it?
I will try the other versions of Llama in the meantime.

I am using transformers 4.33.1.

Thanks!

@Ber666
Collaborator

Ber666 commented Sep 18, 2023

Hi, for the CUDA error, could you try following the message and rerun with CUDA_LAUNCH_BLOCKING=1 for debugging?
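
For reference, that just means rerunning the same command with the environment variable prepended, which makes kernel launches synchronous so the error is reported at the actual failing call instead of asynchronously:

    CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0,1 python examples/rap_gsm8k/inference.py --base_lm hf --hf_path meta-llama/Llama-2-70b-hf --hf_peft_path None --hf_quantized 'nf4'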

The warning you showed shouldn't matter; it's expected in this example. We want the generation to stop at \n, and 13 is the token id of \n. For some reason, it's encoded into two tokens ([29871, 13]), so we just use 13 as the eos_token_id.
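
For reference, here is a minimal sketch of that behaviour (assuming the transformers AutoTokenizer for the same checkpoint and a 4.33.x-era tokenizer; the repo is gated, but other Llama-2 tokenizers behave the same):

    from transformers import AutoTokenizer

    # Minimal sketch: inspect how the Llama-2 tokenizer encodes '\n'.
    tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")

    ids = tok.encode("\n", add_special_tokens=False)
    print(ids)                              # [29871, 13] -- SentencePiece prefix piece plus the newline byte token
    print(tok.convert_ids_to_tokens(ids))   # ['▁', '<0x0A>']

    # The generation loop needs a single stopping id, so the last token (13) is taken as the eos_token_id.
    eos_token_id = ids[-1]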

@Ber666 Ber666 closed this as completed Sep 18, 2023
@Ber666 Ber666 reopened this Sep 18, 2023
@Rem1L
Collaborator

Rem1L commented Oct 6, 2023

Please send us more detailed information about the error; the RuntimeError and the warning alone don't give us enough to go on. We'd be delighted to help you get our work running :p

@andreasbinder
Author

Hi! I am sorry for the late reply :(
I have worked with TheBloke/Llama-2-13B-GPTQ for most experiments so far. I now tried Llama 2 again, and this time I did not run into the problem ^^
If I encounter the error again and find the corresponding solution, I will let you know!
