When I try to test this low-VRAM method locally, I only get garbage output. I use the standard "unicorns" copypasta as input, but the output looks like this:

> The U. Toe I think it Toe there areح To re the whole. </p)

Sometimes it's even worse than this.
Here's how I tested:

1. Create a new virtualenv for generating the `.pkl` file. Install this:
2. Run `neo_gen.py`; this script generates `gptneo.pkl` (5.1G).
3. Copy/move the `gptneo.pkl` file to this new virtualenv.
4. Run the following script:

`neo_test.py`:
```python
import os
from transformers import GPTNeoForCausalLM, AutoTokenizer
import torch
import copy
import gc
import pickle
import torch.cuda.comm
import time

# Pickle file for low ram loading
if True:
    print("Setting up model, this will take a few minutes")
    with open('gptneo.pkl', 'rb') as f:
        model = pickle.load(f)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left'

breakmodel = True
ram_blocks = 22

from transformers import GPTNeoForCausalLM, GPTNeoModel
from transformers.modeling_outputs import BaseModelOutputWithPast
from transformers.models.gpt_neo.modeling_gpt_neo import GPTNeoAttentionMixin

# Define a new forward pass
def new_forward(
    self,
    input_ids=None,
    past_key_values=None,
    attention_mask=None,
    token_type_ids=None,
    position_ids=None,
    head_mask=None,
    inputs_embeds=None,
    use_cache=None,
    output_attentions=None,
    output_hidden_states=None,
    return_dict=None,
):
    global breakmodel
    if breakmodel:
        global ram_blocks
        if not hasattr(self, 'extrastorage'):
            import copy
            setattr(self, "extrastorage", {})
            self.wte.to("cuda")
            self.wpe.to("cuda")
            self.ln_f.to("cuda")
            torch.cuda.empty_cache()

            for i in range(ram_blocks):
                self.h[i].to("cpu")
                self.extrastorage[i] = copy.deepcopy(self.h[i])
                smalltensor = torch.tensor(0).to("cuda")
                for param1 in self.h[i].parameters():
                    param1.data = smalltensor
                self.h[i].to("cuda")

            for i in range(ram_blocks, len(self.h)):
                self.h[i].to("cuda")

            for param in self.wte.parameters():
                param.requires_grad = False
                param.data = param.data.detach()
                gc.collect()
                torch.cuda.empty_cache()

            for param in self.wpe.parameters():
                param.requires_grad = False
                param.data = param.data.detach()
                gc.collect()
                torch.cuda.empty_cache()

            for i in range(len(self.h)):
                for param in self.h[i].parameters():
                    param.requires_grad = False
                    param.data = param.data.detach()
                    gc.collect()
                    torch.cuda.empty_cache()

            for param in self.ln_f.parameters():
                param.requires_grad = False

            for i in range(ram_blocks):
                for param in self.extrastorage[i].parameters():
                    param.requires_grad = False
                    param.data = param.data.detach().pin_memory()
                    gc.collect()
                    torch.cuda.empty_cache()

            for param1, param2 in zip(self.h[0].parameters(), self.extrastorage[0].parameters()):
                param1.data = param2.data.to("cuda", non_blocking=False).detach()

            for param1, param2 in zip(self.h[ram_blocks - 1].parameters(), self.extrastorage[ram_blocks - 1].parameters()):
                param1.data = param2.data.to("cuda", non_blocking=False).detach()

    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    use_cache = use_cache if use_cache is not None else self.config.use_cache
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    if input_ids is not None and inputs_embeds is not None:
        raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
    elif input_ids is not None:
        input_shape = input_ids.size()
        input_ids = input_ids.view(-1, input_shape[-1])
        batch_size = input_ids.shape[0]
    elif inputs_embeds is not None:
        input_shape = inputs_embeds.size()[:-1]
        batch_size = inputs_embeds.shape[0]
    else:
        raise ValueError("You have to specify either input_ids or inputs_embeds")

    device = input_ids.device if input_ids is not None else inputs_embeds.device

    if token_type_ids is not None:
        token_type_ids = token_type_ids.view(-1, input_shape[-1])
    if position_ids is not None:
        position_ids = position_ids.view(-1, input_shape[-1])

    if past_key_values is None:
        past_length = 0
        past_key_values = tuple([None] * len(self.h))
    else:
        past_length = past_key_values[0][0].size(-2)

    device = input_ids.device if input_ids is not None else inputs_embeds.device

    if position_ids is None:
        position_ids = torch.arange(past_length, input_shape[-1] + past_length, dtype=torch.long, device=device)
        position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])

    # Attention mask.
    if attention_mask is not None:
        assert batch_size > 0, "batch_size has to be defined and > 0"
        global_attention_mask = attention_mask.view(batch_size, -1)
        # We create a 3D attention mask from a 2D tensor mask.
        # Sizes are [batch_size, 1, 1, to_seq_length]
        # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]
        # this attention mask is more simple than the triangular masking of causal attention
        # used in OpenAI GPT, we just need to prepare the broadcast dimension here.
        global_attention_mask = global_attention_mask[:, None, None, :]

        # Since global_attention_mask is 1.0 for positions we want to attend and 0.0 for
        # masked positions, this operation will create a tensor which is 0.0 for
        # positions we want to attend and -10000.0 for masked positions.
        # Since we are adding it to the raw scores before the softmax, this is
        # effectively the same as removing these entirely.
        global_attention_mask = global_attention_mask.to(dtype=self.dtype)  # fp16 compatibility
        global_attention_mask = (1.0 - global_attention_mask) * -10000.0
    else:
        global_attention_mask = None

    # Local causal attention mask
    batch_size, seq_length = input_shape
    full_seq_length = seq_length + past_length
    local_attention_mask = GPTNeoAttentionMixin.create_local_attention_mask(
        batch_size, full_seq_length, self.config.window_size, device, attention_mask
    )

    # Prepare head mask if needed
    # 1.0 in head_mask indicate we keep the head
    # attention_probs has shape bsz x num_heads x N x N
    # head_mask has shape n_layer x batch x num_heads x N x N
    head_mask = self.get_head_mask(head_mask, self.config.num_layers)

    if inputs_embeds is None:
        inputs_embeds = self.wte(input_ids)

    position_embeds = self.wpe(position_ids)
    hidden_states = inputs_embeds + position_embeds

    if token_type_ids is not None:
        token_type_embeds = self.wte(token_type_ids)
        hidden_states = hidden_states + token_type_embeds

    hidden_states = self.drop(hidden_states)

    output_shape = input_shape + (hidden_states.size(-1),)

    presents = () if use_cache else None
    all_self_attentions = () if output_attentions else None
    all_hidden_states = () if output_hidden_states else None

    if breakmodel:
        copystream = torch.cuda.Stream(device=0, priority=-1)

    for i, (block, layer_past) in enumerate(zip(self.h, past_key_values)):
        if breakmodel:
            if i in range(ram_blocks):
                index1 = (i + 1) % ram_blocks
                for param1, param2 in zip(self.h[index1].parameters(), self.h[(i - 1) % ram_blocks].parameters()):
                    param1.data = param2.data
                for param1, param2 in zip(self.h[index1].parameters(), self.extrastorage[index1].parameters()):
                    with torch.cuda.stream(copystream):
                        torch.cuda.comm.broadcast(param2.data, out=[param1.data])

        attn_type = self.config.attention_layers[i]
        attn_mask = global_attention_mask if attn_type == "global" else local_attention_mask

        if output_hidden_states:
            all_hidden_states = all_hidden_states + (hidden_states,)

        if getattr(self.config, "gradient_checkpointing", False) and self.training:
            if use_cache:
                logger.warning(
                    "`use_cache=True` is incompatible with `config.gradient_checkpointing=True`. Setting "
                    "`use_cache=False`..."
                )
                use_cache = False

            def create_custom_forward(module):
                def custom_forward(*inputs):
                    # None for past_key_value
                    return module(*inputs, use_cache, output_attentions)

                return custom_forward

            outputs = torch.utils.checkpoint.checkpoint(
                create_custom_forward(block),
                hidden_states,
                None,
                attn_mask,
                head_mask[i],
            )
        else:
            outputs = block(
                hidden_states,
                layer_past=layer_past,
                attention_mask=attn_mask,
                head_mask=head_mask[i],
                use_cache=use_cache,
                output_attentions=output_attentions,
            )

        hidden_states = outputs[0]
        if use_cache is True:
            presents = presents + (outputs[1],)

        if output_attentions:
            all_self_attentions = all_self_attentions + (outputs[2 if use_cache else 1],)

        if breakmodel:
            if i in range(ram_blocks):
                torch.cuda.synchronize()

    if breakmodel:
        del copystream
        torch.cuda.empty_cache()

    hidden_states = self.ln_f(hidden_states)

    hidden_states = hidden_states.view(*output_shape)
    # Add last hidden state
    if output_hidden_states:
        all_hidden_states = all_hidden_states + (hidden_states,)

    if not return_dict:
        return tuple(v for v in [hidden_states, presents, all_hidden_states, all_self_attentions] if v is not None)

    return BaseModelOutputWithPast(
        last_hidden_state=hidden_states,
        past_key_values=presents,
        hidden_states=all_hidden_states,
        attentions=all_self_attentions,
    )


if breakmodel:
    model.eval().half().to("cpu")
    model.lm_head.to("cuda")
    model.transformer.wte.to("cuda")
    model.transformer.wpe.to("cuda")
    model.transformer.ln_f.to("cuda")
    gc.collect()

print(GPTNeoModel.forward)
print(new_forward)
GPTNeoModel.forward = new_forward
print(GPTNeoModel.forward)

#@title Sampling settings (DO NOT SKIP)
#@markdown You can modify sampling settings here. Don't forget to run the cell again after changing. The number of generated tokens is subtracted from the context window size, don't set it high.
tail_free_sampling = 0.95  #@param {type:"number"}
top_k = 80  #@param {type:"number"}
top_p = 0.8  #@param {type:"number"}
temperature = 0.7  #@param {type:"number"}
number_generated_tokens = 25  #@param {type:"integer"}
repetition_penalty = 1.1  #@param {type:"number"}
repetition_penalty_range = 512  #@param {type:"number"}
repetition_penalty_slope = 3.33  #@param {type:"number"}
#@markdown If tail free sampling is enabled, top_p and top_k should probably not be used.
enable_tfs = True  #@param {type:"boolean"}
enable_top_k = False  #@param {type:"boolean"}
enable_top_p = False  #@param {type:"boolean"}

if not enable_tfs:
    tail_free_sampling = None
if not enable_top_k:
    top_k = None
if not enable_top_p:
    top_p = None
#@markdown Temperatures seem to give results different from those in AID, so play around with it. Even 0.5 can give good results.

basic_prompt = "test " * 10
inputs = tokenizer(basic_prompt, return_tensors="pt", truncation=True, max_length=2000).to("cuda")
outputs = model(**inputs)
start_time = time.time()
with torch.no_grad():
    for i in range(1):
        outputs = model(**inputs)
print(time.time() - start_time)
del inputs, outputs
torch.cuda.empty_cache()


def more_text(inputtext):
    #return "Epictest"
    with torch.no_grad():
        with torch.cuda.amp.autocast(enabled=True):
            #start_time = time.time()
            context = 2000
            overhead = 50
            currpoint = len(inputtext)
            inputs = tokenizer(inputtext[-currpoint:], return_tensors="pt", truncation=True, max_length=context + overhead)
            if inputs.input_ids[0].size()[0] == context + overhead:
                low = 0
                high = len(inputtext)
                currpoint = 0
                # BINARY SEARCH FOR A POINT WHERE TOKENIZER RETURNS BETWEEN CONTEXT AND CONTEXT + OVERHEAD TOKENS
                while low <= high:
                    currpoint = (high + low) // 2
                    # If x is greater, ignore left half
                    inputs = tokenizer(inputtext[-currpoint:], return_tensors="pt", truncation=True, max_length=context + overhead)
                    if inputs.input_ids[0].size()[0] < context:
                        low = currpoint + 1
                    # If x is smaller, ignore right half
                    elif inputs.input_ids[0].size()[0] == context + overhead:
                        high = currpoint - 1
                    # means x is present at mid
                    else:
                        break
                ids = tokenizer(inputtext[-currpoint:], return_tensors="pt", truncation=True, max_length=context + overhead).input_ids
            else:
                ids = tokenizer(inputtext[-currpoint:], return_tensors="pt", truncation=True, max_length=context + overhead, padding='max_length').input_ids
            ids = ids[:, -context:]
            n_ids = ids.shape[1]
            if n_ids < 1:
                n_ids = 1
                ids = torch.tensor([[tokenizer.eos_token_id]])
            max_length = n_ids + number_generated_tokens
            gc.collect()
            basic_output = model.generate(
                ids.long().to("cuda"),
                do_sample=True,
                num_beams=1,
                min_length=max_length,
                max_length=max_length,
                temperature=temperature,
                top_k=top_k,
                top_p=top_p,
                repetition_penalty=repetition_penalty,
                repetition_penalty_range=repetition_penalty_range,
                repetition_penalty_slope=repetition_penalty_slope,
                use_cache=True,
                pad_token_id=tokenizer.eos_token_id,
                num_return_sequences=1
            ).long()
            gc.collect()
            torch.cuda.empty_cache()
            return tokenizer.decode(basic_output[0][-number_generated_tokens:])

#print(time.time() - start_time)
#print(number_generated_tokens)

initial_text = "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English."
new_text = more_text(initial_text)
print(new_text)
print("DONE")
```
To create this script I basically just copy-pasted all the relevant parts from here: https://github.com/arrmansa/Basic-UI-for-GPT-Neo-with-low-vram/blob/main/Basic%20UI.ipynb. I also changed two lines and then added the test text. When I run this script I get nonsensical output.

I've also tried removing `.half()` when generating the `.pkl` file. The resulting file is twice as big and takes much longer to load, but the result is the same. I've also fiddled with `tail_free_sampling`, `top_k`, `top_p` and the `enable_*` params, to no avail.
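For reference, here is a minimal sketch of the kind of pickle-generation step being described (this is an assumption of what a script like `neo_gen.py` could look like, not the actual file; the 2.7B checkpoint name is also an assumption):

```python
# Hypothetical sketch only: the real neo_gen.py from the steps above is not reproduced here.
import pickle

from transformers import GPTNeoForCausalLM

# Assumed checkpoint; enough system RAM is needed to hold the full model.
model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")
model = model.eval().half()  # omitting .half() roughly doubles the pickle size

with open("gptneo.pkl", "wb") as f:
    pickle.dump(model, f)
```

A half-precision 2.7B model serialized this way comes out at roughly 5 GB, which would line up with the 5.1G file mentioned in the steps above.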