
Intro

Activation Beacon compresses the original KV cache into fewer yet more compact states (a.k.a. beacons), which enables the LLM to perceive a longer context within its fixed context window. It is known for the following features:

  • Effective
    • there is little information loss at compression ratios of 2, 4, and 8;
  • Efficient
    • it drastically reduces the GPU memory consumed by the KV cache (a rough estimate follows this list);
  • Compatible
    • it can work together with position extrapolation (e.g. YaRN) to further extend the context length; it can also work with grouped query attention to further reduce the KV cache size;
  • Low-Cost
    • it is lightweight and can be efficiently trained with roughly 1B tokens.
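As a back-of-the-envelope illustration of the KV cache savings, here is a minimal sketch. The layer, head, and dimension numbers below are assumed placeholders for a 7B-class GQA model, not values read from this checkpoint:

# Rough KV cache size estimate (illustrative assumptions, not the exact model config).
def kv_cache_bytes(seq_len, num_layers=28, num_kv_heads=4, head_dim=128, bytes_per_elem=2):
    # the leading 2 accounts for storing both a key and a value tensor per layer
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len

seq_len = 128_000
for ratio in (1, 2, 4, 8):  # 1 = no compression; 2/4/8 = beacon compression ratios
    gb = kv_cache_bytes(seq_len // ratio) / 1024**3
    print(f"compression x{ratio}: ~{gb:.2f} GB of KV cache")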

Environment

pip install transformers
pip install flash-attn --no-build-isolation
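Optionally, you can sanity-check the environment before loading the model. This snippet is not from the original card; it only confirms that a CUDA device is visible and that flash-attn imports cleanly:

# Optional sanity check: confirm CUDA and flash-attn are usable.
import torch
assert torch.cuda.is_available(), "A CUDA GPU is required for flash_attention_2"
import flash_attn
print("torch", torch.__version__, "| flash-attn", flash_attn.__version__, "| device", torch.cuda.get_device_name(0))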

Usage

import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "namespace-Pt/beacon-qwen-2-7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    trust_remote_code=True, 
    torch_dtype=torch.bfloat16, 
    attn_implementation="flash_attention_2"
)

model = model.cuda().eval()

with torch.no_grad():
  # short context
  messages = [{"role": "user", "content": "Tell me about yourself."}]
  inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True).to("cuda")
  outputs = model.generate(**inputs, max_new_tokens=50)
  print(f"Input Length: {inputs['input_ids'].shape[1]}")
  print(f"Output:       {repr(tokenizer.decode(outputs[0], skip_special_tokens=True))}")

  # reset memory before new generation task
  model.memory.reset()

  # long context
  with open("infbench.json", encoding="utf-8") as f:
    example = json.load(f)
  messages = [{"role": "user", "content": example["context"]}]
  inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True).to("cuda")
  outputs = model.generate(**inputs, do_sample=False, top_p=1, temperature=1, max_new_tokens=20)[:, inputs["input_ids"].shape[1]:]
  print("*"*20)
  print(f"Input Length: {inputs['input_ids'].shape[1]}")
  print(f"Answers:      {example['answer']}")
  print(f"Prediction:   {tokenizer.decode(outputs[0], skip_special_tokens=True)}")

NOTE: It's okay to see warnings like "This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (32768). Depending on the model, you may observe exceptions, performance degradation, or nothing at all." You can safely ignore them.
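If you would rather not see such reminders at all, one option (not part of the original card) is to lower the transformers logging verbosity before calling generate:

# Optional: suppress transformers warnings such as the max-length reminder above.
from transformers.utils import logging as hf_logging
hf_logging.set_verbosity_error()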

Results

LongBench

| Model                     | Single QA | Multi QA | Summarization | Few-Shot | Code  | AVG   |
|---------------------------|-----------|----------|---------------|----------|-------|-------|
| qwen-2-7b-instruct        | 39.60     | 36.92    | 27.97         | 71.12    | 62.34 | 47.59 |
| beacon-qwen-2-7b-instruct | 40.76     | 43.73    | 27.23         | 68.87    | 68.47 | 49.81 |

NIAH (Needle-In-A-Haystack)
