Memory when passing external memories
Hi (again :)),
I'm having trouble running model.generate with a lot of external memories (10 documents that together contain approximately 100,000 words). Even with topk=0 it runs out of memory after an hour without finishing a single question. Ideally, I would like to be able to run the model using the 100,000 tokens with topk=10. I am using an instance with 72 GiB of memory.
Here is how I am loading my model:
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

configuration = transformers.AutoConfig.from_pretrained("normalcomputing/extended-mind-mpt-7b", trust_remote_code=True)
configuration.max_seq_len = 2048
configuration.init_device = "meta"
configuration.attn_config['alibi'] = True
# The attention implementation is selected by name, so it must be a string
configuration.attn_config['attn_impl'] = 'torch'
configuration.use_cache = True
generator = AutoModelForCausalLM.from_pretrained("normalcomputing/extended-mind-mpt-7b", device_map="cpu", config=configuration, trust_remote_code=True)
generator.empty_memories()
tokenizer = AutoTokenizer.from_pretrained("normalcomputing/extended-mind-mpt-7b", padding_side='left')
# Fall back to the EOS token when the tokenizer defines no padding token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
And this is the tokenisation and generation:
import random
from tqdm import tqdm

for question, question_index in tqdm(zip(question_data, question_indices), total=len(question_indices)):
    print(question['answers'])
    userprompt = question['question']
    # Get the documents (copy so that extend() does not mutate the question's own context list)
    docs = list(question['contexts'])
    doc_indices = random.sample(range(10), 10)
    for i in doc_indices:
        docs.extend(data[i]['contexts'])
    # Create external memories
    external_memories = " ".join(docs)
    memory_ids = tokenizer(external_memories, return_tensors='pt')['input_ids'].to(device)
If you have any suggestions for what I should change to be able to run the model with this many memories, I would be very happy to hear them! For example, is it possible to send the tokenised external memories into the model in batches?
Hey! I'd recommend using memory_type=faiss, for starters. You can also try increasing the stride parameter in the generate_cache method. This may result in lower-quality memories, but will be faster! The stride is used analogously to the one in this tutorial, if you want to check it out: https://huggingface.co/docs/transformers/en/perplexity. Let me know if that helps!
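For illustration, here is a minimal sketch of the sliding-window idea from that tutorial. It is not the actual generate_cache implementation; the function strided_windows and its parameter names are hypothetical and only show how a larger stride reduces the number of windows (and therefore forward passes) over a long token sequence.

```python
def strided_windows(seq_len, max_length, stride):
    """Return (begin, end) index pairs for windows over a token sequence.

    Each window holds at most max_length tokens and starts stride tokens
    after the previous one, so stride < max_length gives overlapping windows.
    """
    windows = []
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        windows.append((begin, end))
        if end == seq_len:
            # The last window already reaches the end of the sequence
            break
    return windows


# Hypothetical sizes: a larger stride means fewer windows to run through the model
for stride in (512, 1024, 2048):
    print(stride, len(strided_windows(seq_len=130_000, max_length=2048, stride=stride)))
```

With stride equal to max_length the windows do not overlap at all, which is the cheapest setting but gives each token the least surrounding context.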
Thanks a lot for your quick response!
I have tried setting memory_type=faiss and increasing stride to 2048; however, it still runs out of memory. Is there a way to estimate how much memory is expected to be used with large external memories? Then I can try to upgrade my resources to match these requirements :)
If you're using faiss, the main cost is generating the cache before you pass the vectors to the db store. That cost (if you're using stride=2048) is roughly n=input_length//2048 passes through the model. (You'll need memory for the model + ~2048 inputs, as well as for the growing vector db.) Hope that helps!
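As a back-of-the-envelope sketch of that estimate: the helper names below are hypothetical, the token count is a made-up example, and the weight figure assumes roughly 7e9 parameters stored as float32. It covers only the number of passes and the memory for the weights themselves, not the activations for each ~2048-token window or the growing vector db.

```python
def estimate_cache_passes(input_length, stride=2048):
    """Approximate number of forward passes needed to build the memory cache."""
    # Floor division as in the estimate above; a trailing partial window
    # adds at most one more pass.
    return input_length // stride


def weights_gib(n_params, bytes_per_param):
    """Memory needed just to hold the model weights, in GiB."""
    return n_params * bytes_per_param / 2**30


# Hypothetical token count: in practice, read it from memory_ids.shape[-1]
n_tokens = 130_000
print("passes:", estimate_cache_passes(n_tokens))

# Assumption: roughly 7e9 parameters stored as float32 (4 bytes each)
print("weights (GiB): %.1f" % weights_gib(7e9, 4))
```

Loading the weights in a 2-byte dtype would halve the second number, if the model and hardware support it.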
It does indeed! Thanks :)