Memory when passing external memories

#3
by xmrt - opened

Hi (again :)),
I'm having trouble running model.generate with a lot of external memories (10 documents that together contain approximately 100,000 words). Even with topk=0, it runs out of memory after an hour without finishing a single question. Ideally, I would like to run the model on all 100,000 tokens with topk=10. I am using an instance with 72 GiB of memory.

Here is how I am loading my model:

import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

configuration = transformers.AutoConfig.from_pretrained("normalcomputing/extended-mind-mpt-7b", trust_remote_code=True)
configuration.max_seq_len = 2048
configuration.init_device = "meta"
configuration.attn_config['alibi'] = True
configuration.attn_config['attn_impl'] = 'torch'  # implementation name is a string, not the torch module
configuration.use_cache = True

generator = AutoModelForCausalLM.from_pretrained("normalcomputing/extended-mind-mpt-7b", device_map="cpu", config=configuration, trust_remote_code=True)
generator.empty_memories()

tokenizer = AutoTokenizer.from_pretrained("normalcomputing/extended-mind-mpt-7b", padding_side='left')

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

And this is the tokenisation and generation:

for question, question_index in tqdm(zip(question_data, question_indices), total=len(question_indices)):
    print(question['answers'])
    userprompt = question['question']

    # Get the documents
    docs = question['contexts']
    doc_indices = random.sample(range(10), 10)
    for i in doc_indices:
        docs.extend(data[i]['contexts'])

    # Create external memories
    external_memories = " ".join(docs)
    memory_ids = tokenizer(external_memories, return_tensors='pt')['input_ids'].to(device)

If you have any input on what should be changed to run the model with this many memories, I would be very happy to hear it! For example, is it possible to send the tokenised external memories into the model in batches?

Normal Computing org

Hey! I'd recommend using memory_type=faiss, for starters. You can also try increasing the stride parameter in the generate_cache method. This may result in lower quality memories, but will be faster! The stride is used in much the same way as in this tutorial, if you want to check it out: https://huggingface.co/docs/transformers/en/perplexity. Let me know if that helps!
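
For readers who have not seen the perplexity tutorial, here is a minimal sketch of the sliding-window idea it describes, in plain Python. It is not the model's generate_cache implementation: the function name strided_windows and its parameters are made up for illustration, and max_length=2048 is assumed only because that is the max_seq_len set in the configuration above.

```python
def strided_windows(seq_len, max_length=2048, stride=512):
    """Return (begin, end, new_tokens) for each window over a long sequence.

    Hypothetical helper: each window covers at most max_length tokens and
    starts stride tokens after the previous one, so a smaller stride means
    more overlap (more context per token) and more forward passes.
    """
    windows = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        new_tokens = end - prev_end  # tokens not covered by earlier windows
        windows.append((begin, end, new_tokens))
        prev_end = end
        if end == seq_len:
            break
    return windows


if __name__ == "__main__":
    # With stride == max_length the windows do not overlap at all.
    for window in strided_windows(5000, max_length=2048, stride=2048):
        print(window)
```

With stride equal to max_length, every token is processed exactly once, which is the cheapest setting; lowering the stride trades extra passes for more context behind each cached token.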

Thanks a lot for your quick response!

I have tried setting memory_type=faiss and increasing stride to 2048, but it still runs out of memory. Is there a way to estimate how much memory to expect with large external memories? Then I can try to upgrade my resources to match these requirements :)

Normal Computing org

If you're using faiss, the main cost is generating the cache before you pass the vectors to the db store. With stride=2048, that cost is roughly n = input_length // 2048 passes through the model. You'll need memory for the model plus ~2048 inputs, as well as for the growing vector db. Hope that helps!
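
As a rough back-of-the-envelope version of that estimate, the sketch below counts the forward passes and guesses the size of the stored vectors. Everything beyond the n = input_length // stride rule is an assumption rather than something confirmed in this thread: it supposes one key and one value vector of width d_model is kept per layer per memory token in float32, and it uses n_layers=32 and d_model=4096 as assumed MPT-7B dimensions (check the model config). It ignores the model weights, activations, and any faiss index overhead.

```python
import math


def estimate_cache_cost(input_length, stride=2048, n_layers=32,
                        d_model=4096, bytes_per_value=4):
    """Hypothetical estimate of passes and stored key/value bytes.

    Assumes (not confirmed by the model authors) that every layer stores
    one key and one value vector of size d_model per memory token.
    """
    # Rounded up so a trailing partial window still counts as one pass.
    n_passes = math.ceil(input_length / stride)
    # 2 = one key vector + one value vector per layer per token.
    kv_bytes = input_length * n_layers * 2 * d_model * bytes_per_value
    return n_passes, kv_bytes


if __name__ == "__main__":
    passes, kv_bytes = estimate_cache_cost(100_000, stride=2048)
    print(f"forward passes: {passes}")
    print(f"stored keys/values: {kv_bytes / 2**30:.1f} GiB")
```

Under these assumptions, 100,000 memory tokens need 49 passes and about 97.7 GiB for the stored vectors alone, which would already exceed a 72 GiB instance; half precision or fewer memory tokens would shrink that figure proportionally.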

It does indeed! Thanks :)
