
OSError: failed to fill whole buffer #19

Open
Siegfried-qgf opened this issue Sep 25, 2024 · 7 comments

@Siegfried-qgf

When I initialize draftretriever.Reader, I get this error:

```
python3 gen_model_answer_rest.py
loading the datastore ...
Traceback (most recent call last):
  File "/mnt/gefei/REST/llm_judge/gen_model_answer_rest.py", line 493, in <module>
    run_eval(
  File "/mnt/gefei/REST/llm_judge/gen_model_answer_rest.py", line 135, in run_eval
    datastore = draftretriever.Reader(
  File "/root/anaconda3/envs/rest/lib/python3.9/site-packages/draftretriever/__init__.py", line 43, in __init__
    self.reader = draftretriever.Reader(
OSError: failed to fill whole buffer
```

@zhenyuhe00
Collaborator

Hi,
I've not encountered this error before. I wonder if you've fully built the datastore without any interruptions. ("failed to fill whole buffer" is the message Rust's read_exact gives when a file ends earlier than expected, which is what a partially written datastore would produce.)

@Siegfried-qgf
Author

I checked the datastore build and found a segmentation fault:
```
Namespace(model_path='/mnt/tianlian/deployment/llm_task_flows/model_original/hugging_face_finetune/Qwen2.5-14B-Instruct', large_datastore=False)
number of samples: 68623
100%|██████████| 68623/68623 [04:13<00:00, 271.09it/s]
[1] 32657 segmentation fault (core dumped) python3 get_datastore_chat.py
```

@Siegfried-qgf
Author

When I limit the dataset to 100 samples, it works fine, but with 2500 samples it crashes:
```
python3 get_datastore_chat.py
Namespace(model_path='/mnt/tianlian/deployment/llm_task_flows/model_original/hugging_face_finetune/Qwen2.5-14B-Instruct', large_datastore=False)
number of samples: 100
100%|██████████| 100/100 [00:00<00:00, 342.49it/s]

python3 get_datastore_chat.py
Namespace(model_path='/mnt/tianlian/deployment/llm_task_flows/model_original/hugging_face_finetune/Qwen2.5-14B-Instruct', large_datastore=False)
number of samples: 2500
100%|██████████| 2500/2500 [00:08<00:00, 307.74it/s]
[1] 56767 segmentation fault (core dumped) python3 get_datastore_chat.py
```

Could this be related to how my image was created?

@zhenyuhe00
Collaborator

zhenyuhe00 commented Sep 25, 2024

Hi,
I suppose it's because the vocab size of Qwen2.5 is 151936, which exceeds the range of the u16 I manually set in DraftRetriever.
To fix the issue, you may change this line in the writer from `self.index_file.write_u16::<LittleEndian>(item as u16)?;` to `self.index_file.write_u32::<LittleEndian>(item as u32)?;`.
Besides, change these two lines in the Reader from `for i in (0..data_u8.len()).step_by(2) {` and `let int = LittleEndian::read_u16(&data_u8[i..i+2]) as i32;` to `for i in (0..data_u8.len()).step_by(4) {` and `let int = LittleEndian::read_u32(&data_u8[i..i+4]) as i32;`.

Hope these changes may fix the bug. If you have any further questions, please feel free to contact me.
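Put together, the two changes look like this. This is a minimal sketch assuming the byteorder crate that the quoted lines use; the function names are illustrative, not DraftRetriever's actual API:

```rust
use byteorder::{ByteOrder, LittleEndian, WriteBytesExt};
use std::io::{self, Write};

// Writer side: 4 bytes per token id, so ids >= 65536 (e.g. from Qwen2.5's
// 151936-entry vocab) are no longer silently truncated.
fn write_token_ids<W: Write>(index_file: &mut W, items: &[i32]) -> io::Result<()> {
    for &item in items {
        // was: index_file.write_u16::<LittleEndian>(item as u16)?;
        index_file.write_u32::<LittleEndian>(item as u32)?;
    }
    Ok(())
}

// Reader side: step by 4 bytes and decode u32, mirroring the writer.
fn read_token_ids(data_u8: &[u8]) -> Vec<i32> {
    let mut ids = Vec::with_capacity(data_u8.len() / 4);
    for i in (0..data_u8.len()).step_by(4) {
        // was: step_by(2) with LittleEndian::read_u16(&data_u8[i..i+2])
        let int = LittleEndian::read_u32(&data_u8[i..i + 4]) as i32;
        ids.push(int);
    }
    ids
}
```

Note that the 2-byte and 4-byte layouts are incompatible, so any datastore built before the change has to be rebuilt.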

@Siegfried-qgf
Copy link
Author

Thanks! I have fixed the bug!

@whisperzqh

> Hi, I suppose it's because the vocab size of Qwen2.5 is 151936, which exceeds the range of u16 as I manually set in the DraftRetriever. [...] Hope these changes may fix the bug. If you have any further questions, please feel free to contact me.

Hi, when I try to use deepseek-coder-6.7b-base to construct the datastore for code-related tasks, I also encounter a segmentation fault. However, the vocab size of deepseek-coder-6.7b-base is 32000, which is smaller than 65535. How can I resolve this problem? Thank you!

@zhenyuhe00
Collaborator

> Hi, when I try to use deepseek-coder-6.7b-base to construct the datastore for code-related tasks, I also encounter a segmentation fault. However, the vocab size of deepseek-coder-6.7b-base is 32000, which is smaller than 65535. How can I resolve this problem? Thank you!

Hi,
I assume the issue lies in the following code:

```python
writer = draftretriever.Writer(
    index_file_path=datastore_path,
    max_chunk_len=512 * 1024 * 1024,
    vocab_size=tokenizer.vocab_size,
)
```

Here, tokenizer.vocab_size is 32,000 for deepseek-coder-6.7b-base. However, the actual vocabulary size is 32,000 plus the number of added tokens, which totals 32,021.

I've changed the code from `vocab_size=tokenizer.vocab_size` to `vocab_size=tokenizer.vocab_size + len(tokenizer.get_added_vocab())`. Sorry for the bug.
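As a side note on why an understated vocab_size can crash instead of raising a clean error, here is a hypothetical sketch (not DraftRetriever's actual code) of the failure mode, where a per-token table sized by the declared vocab is indexed with an added-token id:

```rust
// Hypothetical sketch: a table sized by an understated vocab_size is
// overrun by an added-token id.
fn main() {
    let vocab_size = 32_000; // tokenizer.vocab_size for deepseek-coder-6.7b-base
    let mut table = vec![0u32; vocab_size];

    let token_id = 32_015; // one of the 21 added-token ids (32000..=32020)
    // Safe indexing panics on out-of-bounds access; unchecked indexing, as
    // performance-sensitive native code often uses, can segfault instead.
    table[token_id] += 1;
}
```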
