numind/NuExtract-2-8B

@@ -1,199 +1,524 @@
 ---
-library_name: transformers
-tags: []
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
-## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

 ---
+license: mit
+language:
+- multilingual
+tags:
+- nlp
+base_model: OpenGVLab/InternVL2_5-8B
+pipeline_tag: text-generation
+inference: true
 ---
+# NuExtract-2-8B by NuMind 🔥
+NuExtract 2.0 is a family of models trained specifically for structured information extraction tasks. It supports multimodal (text and image) inputs and is multilingual.
+We provide several versions of different sizes, all based on the InternVL2.5 family.
+| Model Size | Model Name | Base Model | Huggingface Link |
+|------------|------------|------------|------------------|
+| 2B | NuExtract-2.0-2B | [InternVL2_5-2B](https://huggingface.co/OpenGVLab/InternVL2_5-2B) | [NuExtract-2-2B](https://huggingface.co/numind/NuExtract-2-2B) |
+| 4B | NuExtract-2.0-4B | [InternVL2_5-4B](https://huggingface.co/OpenGVLab/InternVL2_5-4B) | [NuExtract-2-4B](https://huggingface.co/numind/NuExtract-2-4B) |
+| 8B | NuExtract-2.0-8B | [InternVL2_5-8B](https://huggingface.co/OpenGVLab/InternVL2_5-8B) | [NuExtract-2-8B](https://huggingface.co/numind/NuExtract-2-8B) |
+## Overview
+To use the model, provide an input text/image and a JSON template describing the information you need to extract. The template should be a JSON object, specifying field names and their expected type.
+Supported types include:
+* `verbatim-string` - instructs the model to extract text that is present verbatim in the input.
+* `string` - a generic string field that can incorporate paraphrasing/abstraction.
+* `integer` - a whole number.
+* `number` - a whole or decimal number.
+* `date-time` - an ISO-formatted date.
+* `enum` - a choice from a set of possible answers (represented in the template as an array of options, e.g. `["yes", "no", "maybe"]`).
+* `multi-label` - an enum that can have multiple possible answers (represented in the template as a double-wrapped array, e.g. `[["A", "B", "C"]]`).
+The following is an example template:
+```json
+{
+  "first_name": "verbatim-string",
+  "last_name": "verbatim-string",
+  "description": "string",
+  "age": "integer",
+  "gpa": "number",
+  "birth_date": "date-time",
+  "nationality": ["France", "England", "Japan", "USA", "China"],
+  "languages_spoken": [["English", "French", "Japanese", "Mandarin", "Spanish"]]
+}
+```
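Because the template is plain JSON, it can be built as a Python dict, checked, and then serialized before being passed to the model. The sketch below is one way to catch typos in type names early. `validate_template` is a hypothetical helper, not part of NuExtract, and it only covers the forms shown in this card (scalar types, enums, multi-label arrays, and typed arrays such as `["verbatim-string"]`).

```python
import json

# Scalar type names listed in this model card.
SCALAR_TYPES = {"verbatim-string", "string", "integer", "number", "date-time"}


def validate_template(node, path="$"):
    """Hypothetical helper: return a list of problems found in a template."""
    problems = []
    if isinstance(node, dict):
        # Nested objects: check every field recursively.
        for key, value in node.items():
            problems += validate_template(value, f"{path}.{key}")
    elif isinstance(node, str):
        if node not in SCALAR_TYPES:
            problems.append(f"{path}: unknown type {node!r}")
    elif isinstance(node, list):
        if len(node) == 1 and isinstance(node[0], list):
            # Double-wrapped array: multi-label, options must be strings.
            options = node[0]
            if not options or not all(isinstance(o, str) for o in options):
                problems.append(f"{path}: multi-label options must be non-empty strings")
        elif node and all(isinstance(o, str) for o in node):
            # Either an enum or a typed array like ["verbatim-string"].
            pass
        else:
            problems.append(f"{path}: array form not covered by this sketch")
    else:
        problems.append(f"{path}: unsupported template value {node!r}")
    return problems


template = {
    "first_name": "verbatim-string",
    "age": "integer",
    "nationality": ["France", "England", "Japan", "USA", "China"],
    "languages_spoken": [["English", "French", "Japanese", "Mandarin", "Spanish"]],
}
print(validate_template(template))        # no problems reported
print(validate_template({"age": "int"}))  # reports the unknown type name
template_str = json.dumps(template, indent=2)  # string form to pass to the model
```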
+⚠️ We recommend using NuExtract with a temperature at or very close to 0. Some inference frameworks, such as Ollama, use a default of 0.7 which is not well suited to many extraction tasks.
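As an illustration of overriding a framework default, the sketch below builds a request body for Ollama's `/api/generate` endpoint with the temperature set explicitly under `options`. The model tag `nuextract-2-8b-local` and the prompt are placeholders assumed for this example, and the request is only constructed, not sent. With Transformers, passing `"do_sample": False` in the generation config selects greedy decoding, which is the temperature-0 equivalent.

```python
import json


def build_ollama_request(model_tag, prompt, temperature=0.0):
    """Build a JSON body for Ollama's /api/generate with an explicit temperature."""
    return {
        "model": model_tag,          # hypothetical local model tag
        "prompt": prompt,
        "stream": False,             # ask for a single response object
        "options": {"temperature": temperature},
    }


payload = build_ollama_request(
    "nuextract-2-8b-local",          # assumption: whatever tag the model was imported under
    '# Template:\n{"names": ["verbatim-string"]}\n# Context:\nJohn met Mary.',
)
print(json.dumps(payload, indent=2))
```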
+## Inference
+Use the following code to handle loading and preprocessing of input data:
+```python
+import torch
+import torchvision.transforms as T
+from PIL import Image
+from torchvision.transforms.functional import InterpolationMode
+IMAGENET_MEAN = (0.485, 0.456, 0.406)
+IMAGENET_STD = (0.229, 0.224, 0.225)
+def build_transform(input_size):
+    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
+    transform = T.Compose([
+        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
+        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
+        T.ToTensor(),
+        T.Normalize(mean=MEAN, std=STD)
+    ])
+    return transform
+def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
+    best_ratio_diff = float('inf')
+    best_ratio = (1, 1)
+    area = width * height
+    for ratio in target_ratios:
+        target_aspect_ratio = ratio[0] / ratio[1]
+        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
+        if ratio_diff < best_ratio_diff:
+            best_ratio_diff = ratio_diff
+            best_ratio = ratio
+        elif ratio_diff == best_ratio_diff:
+            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
+                best_ratio = ratio
+    return best_ratio
+def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
+    orig_width, orig_height = image.size
+    # calculate the existing image aspect ratio
+    aspect_ratio = orig_width / orig_height
+    # enumerate the candidate tile grids (columns, rows) holding min_num..max_num tiles
+    target_ratios = set(
+        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
+        i * j <= max_num and i * j >= min_num)
+    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
+    # pick the candidate grid whose aspect ratio is closest to the image's
+    target_aspect_ratio = find_closest_aspect_ratio(
+        aspect_ratio, target_ratios, orig_width, orig_height, image_size)
+    # calculate the target width and height
+    target_width = image_size * target_aspect_ratio[0]
+    target_height = image_size * target_aspect_ratio[1]
+    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
+    # resize the image
+    resized_img = image.resize((target_width, target_height))
+    processed_images = []
+    for i in range(blocks):
+        box = (
+            (i % (target_width // image_size)) * image_size,
+            (i // (target_width // image_size)) * image_size,
+            ((i % (target_width // image_size)) + 1) * image_size,
+            ((i // (target_width // image_size)) + 1) * image_size
+        )
+        # split the image
+        split_img = resized_img.crop(box)
+        processed_images.append(split_img)
+    assert len(processed_images) == blocks
+    if use_thumbnail and len(processed_images) != 1:
+        thumbnail_img = image.resize((image_size, image_size))
+        processed_images.append(thumbnail_img)
+    return processed_images
+def load_image(image_file, input_size=448, max_num=12):
+    image = Image.open(image_file).convert('RGB')
+    transform = build_transform(input_size=input_size)
+    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
+    pixel_values = [transform(image) for image in images]
+    pixel_values = torch.stack(pixel_values)
+    return pixel_values
+def prepare_inputs(messages, image_paths, tokenizer, device='cuda', dtype=torch.bfloat16):
+    """
+    Prepares multi-modal input components (supports multiple images per prompt).
+    Args:
+        messages: List of input messages/prompts (strings or dicts with 'role' and 'content')
+        image_paths: List where each element is either None (for text-only) or a list of image paths
+        tokenizer: The tokenizer to use for applying chat templates
+        device: Device to place tensors on ('cuda', 'cpu', etc.)
+        dtype: Data type for image tensors (default: torch.bfloat16)
+    Returns:
+        dict: Contains 'prompts', 'pixel_values_list', and 'num_patches_list' ready for the model
+    """
+    # Make sure image_paths list is at least as long as messages
+    if len(image_paths) < len(messages):
+        # Pad with None for text-only messages
+        image_paths = image_paths + [None] * (len(messages) - len(image_paths))
+    # Process images and collect patch information
+    loaded_images = []
+    num_patches_list = []
+    for paths in image_paths:
+        if paths and isinstance(paths, list) and len(paths) > 0:
+            # Load each image in this prompt
+            prompt_images = []
+            prompt_patches = []
+            for path in paths:
+                # Load the image
+                img = load_image(path).to(dtype=dtype, device=device)
+                # Ensure img has correct shape [patches, C, H, W]
+                if len(img.shape) == 3:  # [C, H, W] -> [1, C, H, W]
+                    img = img.unsqueeze(0)
+                prompt_images.append(img)
+                # Record the number of patches for this image
+                prompt_patches.append(img.shape[0])
+            loaded_images.append(prompt_images)
+            num_patches_list.append(prompt_patches)
+        else:
+            # Text-only prompt
+            loaded_images.append(None)
+            num_patches_list.append([])
+    # Create the concatenated pixel_values_list
+    pixel_values_list = []
+    for prompt_images in loaded_images:
+        if prompt_images:
+            # Concatenate all images for this prompt
+            pixel_values_list.append(torch.cat(prompt_images, dim=0))
+        else:
+            # Text-only prompt
+            pixel_values_list.append(None)
+    # Format messages for the model
+    if all(isinstance(m, str) for m in messages):
+        # Simple string messages: convert to chat format
+        batch_messages = [
+            [{"role": "user", "content": message}]
+            for message in messages
+        ]
+    else:
+        # Assume messages are already in the right format
+        batch_messages = messages
+    # Apply chat template
+    prompts = tokenizer.apply_chat_template(
+        batch_messages,
+        tokenize=False,
+        add_generation_prompt=True
+    )
+    return {
+        'prompts': prompts,
+        'pixel_values_list': pixel_values_list,
+        'num_patches_list': num_patches_list
+    }
+def construct_message(text, template, examples=None):
+    """
+    Construct the individual NuExtract message texts, prior to chat template formatting.
+    """
+    # add few-shot examples if needed
+    if examples is not None and len(examples) > 0:
+        icl = "# Examples:\n"
+        for row in examples:
+            icl += f"## Input:\n{row['input']}\n## Output:\n{row['output']}\n"
+    else:
+        icl = ""
+    return f"""# Template:\n{template}\n{icl}# Context:\n{text}"""
+```
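To make the tiling in `dynamic_preprocess` more concrete, the following sketch mirrors the grid selection of `find_closest_aspect_ratio` in pure Python (no image loading) and reports how many 448x448 patches an image of a given size would produce. `pick_tile_grid` and `patch_count` are illustrative names introduced here. The thumbnail is counted because `load_image` calls `dynamic_preprocess` with `use_thumbnail=True`.

```python
def pick_tile_grid(width, height, image_size=448, min_num=1, max_num=12):
    """Return the (columns, rows) grid chosen for an image of the given size."""
    aspect_ratio = width / height
    # All grids whose tile count lies in [min_num, max_num], smallest first.
    grids = sorted(
        {(i, j) for i in range(1, max_num + 1) for j in range(1, max_num + 1)
         if min_num <= i * j <= max_num},
        key=lambda g: g[0] * g[1],
    )
    best, best_diff = (1, 1), float("inf")
    for cols, rows in grids:
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff and width * height > 0.5 * image_size * image_size * cols * rows:
            # On a tie, prefer the larger grid when the image is big enough.
            best = (cols, rows)
    return best


def patch_count(width, height, use_thumbnail=True, **kwargs):
    """Number of patches produced, including the optional thumbnail."""
    cols, rows = pick_tile_grid(width, height, **kwargs)
    tiles = cols * rows
    return tiles + 1 if use_thumbnail and tiles != 1 else tiles


for size in [(448, 448), (800, 600), (1000, 1000)]:
    print(size, pick_tile_grid(*size), patch_count(*size))
```

The patch count matters because each patch is later expanded into a fixed number of image tokens in the prompt, so large images with `max_num=12` consume noticeably more context than small ones.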
+To handle inference:
+```python
+IMG_START_TOKEN='<img>'
+IMG_END_TOKEN='</img>'
+IMG_CONTEXT_TOKEN='<IMG_CONTEXT>'
+def nuextract_generate(model, tokenizer, prompts, generation_config, pixel_values_list=None, num_patches_list=None):
+    """
+    Generate responses for a batch of NuExtract inputs.
+    Support for multiple and varying numbers of images per prompt.
+    Args:
+        model: The vision-language model
+        tokenizer: The tokenizer for the model
+        pixel_values_list: List of tensor batches, one per prompt
+                          Each batch has shape [num_images, channels, height, width] or None for text-only prompts
+        prompts: List of text prompts
+        generation_config: Configuration for text generation
+        num_patches_list: List of lists, each containing patch counts for images in a prompt
+    Returns:
+        List of generated responses
+    """
+    img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
+    model.img_context_token_id = img_context_token_id
+    # Replace all image placeholders with appropriate tokens
+    modified_prompts = []
+    total_image_files = 0
+    total_patches = 0
+    image_containing_prompts = []
+    for idx, prompt in enumerate(prompts):
+        # check if this prompt has images
+        has_images = (pixel_values_list and
+                      idx < len(pixel_values_list) and
+                      pixel_values_list[idx] is not None and
+                      isinstance(pixel_values_list[idx], torch.Tensor) and
+                      pixel_values_list[idx].shape[0] > 0)
+        if has_images:
+            # prompt with image placeholders
+            image_containing_prompts.append(idx)
+            modified_prompt = prompt
+            patches = num_patches_list[idx] if (num_patches_list and idx < len(num_patches_list)) else []
+            num_images = len(patches)
+            total_image_files += num_images
+            total_patches += sum(patches)
+            # replace each <image> placeholder with image tokens
+            for i, num_patches in enumerate(patches):
+                image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * model.num_image_token * num_patches + IMG_END_TOKEN
+                modified_prompt = modified_prompt.replace('<image>', image_tokens, 1)
+        else:
+            # text-only prompt
+            modified_prompt = prompt
+        modified_prompts.append(modified_prompt)
+    # process all prompts in a single batch
+    tokenizer.padding_side = 'left'
+    model_inputs = tokenizer(modified_prompts, return_tensors='pt', padding=True)
+    input_ids = model_inputs['input_ids'].to(model.device)
+    attention_mask = model_inputs['attention_mask'].to(model.device)
+    eos_token_id = tokenizer.convert_tokens_to_ids("<|im_end|>\n".strip())
+    # copy the config so the caller's dict is not mutated
+    generation_config = {**generation_config, 'eos_token_id': eos_token_id}
+    # prepare pixel values
+    flattened_pixel_values = None
+    if image_containing_prompts:
+        # collect and concatenate all image tensors
+        all_pixel_values = []
+        for idx in image_containing_prompts:
+            all_pixel_values.append(pixel_values_list[idx])
+        flattened_pixel_values = torch.cat(all_pixel_values, dim=0)
+        print(f"Processing batch with {len(prompts)} prompts, {total_image_files} actual images, and {total_patches} total patches")
+    else:
+        print(f"Processing text-only batch with {len(prompts)} prompts")
+    # generate outputs
+    outputs = model.generate(
+        pixel_values=flattened_pixel_values,  # will be None for text-only prompts
+        input_ids=input_ids,
+        attention_mask=attention_mask,
+        **generation_config
+    )
+    # Decode responses
+    responses = tokenizer.batch_decode(outputs, skip_special_tokens=True)
+    return responses
+```
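The placeholder expansion inside `nuextract_generate` can be examined in isolation. The sketch below reproduces just that step with plain strings: each `<image>` placeholder is replaced, in order, by a start token, a run of context tokens, and an end token. The function name `expand_image_placeholders` is introduced for illustration, and `num_image_token=2` is an artificially small value chosen for readability. In the real code the value comes from `model.num_image_token`.

```python
IMG_START_TOKEN = "<img>"
IMG_END_TOKEN = "</img>"
IMG_CONTEXT_TOKEN = "<IMG_CONTEXT>"


def expand_image_placeholders(prompt, patches_per_image, num_image_token=2):
    """Replace each <image> placeholder, in order, with its image token run."""
    for num_patches in patches_per_image:
        image_tokens = (
            IMG_START_TOKEN
            + IMG_CONTEXT_TOKEN * (num_image_token * num_patches)
            + IMG_END_TOKEN
        )
        # count=1 so that the i-th image fills the i-th remaining placeholder
        prompt = prompt.replace("<image>", image_tokens, 1)
    return prompt


prompt = "## Input:\n<image>\n## Output:\n{}\n# Context:\n<image>"
expanded = expand_image_placeholders(prompt, patches_per_image=[1, 3])
print(expanded)
```

This is why the order of paths in `image_paths` has to match the order in which `<image>` placeholders appear in the prompt, and why the length of each `num_patches_list` entry should equal the number of placeholders.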
+To load the model:
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model_name = "numind/NuExtract-2-8B"
+tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, padding_side='left')
+model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True,
+                                             torch_dtype=torch.bfloat16,
+                                             attn_implementation="flash_attention_2" # we recommend using flash attention
+                                            ).to("cuda")
+```
+Simple 0-shot text-only example:
+```python
+template = """{"names": ["verbatim-string"]}"""
+text = "John went to the restaurant with Mary. James went to the cinema."
+input_messages = [construct_message(text, template)]
+input_content = prepare_inputs(
+    messages=input_messages,
+    image_paths=[],
+    tokenizer=tokenizer,
+)
+generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}
+with torch.no_grad():
+    result = nuextract_generate(
+        model=model,
+        tokenizer=tokenizer,
+        prompts=input_content['prompts'],
+        pixel_values_list=input_content['pixel_values_list'],
+        num_patches_list=input_content['num_patches_list'],
+        generation_config=generation_config
+    )
+for y in result:
+    print(y)
+# {"names": ["John", "Mary", "James"]}
+```
+Text-only input with an in-context example:
+```python
+template = """{"names": ["verbatim-string"], "female_names": ["verbatim-string"]}"""
+text = "John went to the restaurant with Mary. James went to the cinema."
+examples = [
+    {
+        "input": "Stephen is the manager at Susan's store.",
+        "output": """{"names": ["STEPHEN", "SUSAN"], "female_names": ["SUSAN"]}"""
+    }
+]
+input_messages = [construct_message(text, template, examples)]
+input_content = prepare_inputs(
+    messages=input_messages,
+    image_paths=[],
+    tokenizer=tokenizer,
+)
+generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}
+with torch.no_grad():
+    result = nuextract_generate(
+        model=model,
+        tokenizer=tokenizer,
+        prompts=input_content['prompts'],
+        pixel_values_list=input_content['pixel_values_list'],
+        num_patches_list=input_content['num_patches_list'],
+        generation_config=generation_config
+    )
+for y in result:
+    print(y)
+# {"names": ["JOHN", "MARY", "JAMES"], "female_names": ["MARY"]}
+```
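For reference, this is the raw prompt text that `construct_message` assembles for the in-context example above, before the chat template is applied. The function body is copied from the preprocessing code so the snippet runs on its own.

```python
def construct_message(text, template, examples=None):
    """Copy of the helper defined earlier, repeated here so the snippet is self-contained."""
    if examples is not None and len(examples) > 0:
        icl = "# Examples:\n"
        for row in examples:
            icl += f"## Input:\n{row['input']}\n## Output:\n{row['output']}\n"
    else:
        icl = ""
    return f"""# Template:\n{template}\n{icl}# Context:\n{text}"""


template = """{"names": ["verbatim-string"], "female_names": ["verbatim-string"]}"""
text = "John went to the restaurant with Mary. James went to the cinema."
examples = [
    {
        "input": "Stephen is the manager at Susan's store.",
        "output": """{"names": ["STEPHEN", "SUSAN"], "female_names": ["SUSAN"]}""",
    }
]
message = construct_message(text, template, examples)
print(message)
# # Template:
# {"names": ["verbatim-string"], "female_names": ["verbatim-string"]}
# # Examples:
# ## Input:
# Stephen is the manager at Susan's store.
# ## Output:
# {"names": ["STEPHEN", "SUSAN"], "female_names": ["SUSAN"]}
# # Context:
# John went to the restaurant with Mary. James went to the cinema.
```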
+Example with image input and an in-context example. Image inputs should use the `<image>` placeholder in place of text, and image paths should be provided in a list in their order of appearance in the prompt (in this example, `0.jpg` belongs to the in-context example and `1.jpg` is the true input).
+```python
+template = """{"store": "verbatim-string"}"""
+text = "<image>"
+examples = [
+    {
+        "input": "<image>",
+        "output": """{"store": "Walmart"}"""
+    }
+]
+input_messages = [construct_message(text, template, examples)]
+images = [
+    ["0.jpg", "1.jpg"]
+]
+input_content = prepare_inputs(
+    messages=input_messages,
+    image_paths=images,
+    tokenizer=tokenizer,
+)
+generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}
+with torch.no_grad():
+    result = nuextract_generate(
+        model=model,
+        tokenizer=tokenizer,
+        prompts=input_content['prompts'],
+        pixel_values_list=input_content['pixel_values_list'],
+        num_patches_list=input_content['num_patches_list'],
+        generation_config=generation_config
+    )
+for y in result:
+    print(y)
+# {"store": "Trader Joe's"}
+```
+Multi-modal batched input:
+```python
+inputs = [
+    # image input with no ICL examples
+    {
+        "text": "<image>",
+        "template": """{"store_name": "verbatim-string"}""",
+        "examples": None,
+    },
+    # image input with 1 ICL example
+    {
+        "text": "<image>",
+        "template": """{"store_name": "verbatim-string"}""",
+        "examples": [
+            {
+                "input": "<image>",
+                "output": """{"store_name": "Walmart"}""",
+            }
+        ],
+    },
+    # text input with no ICL examples
+    {
+        "text": "John went to the restaurant with Mary. James went to the cinema.",
+        "template": """{"names": ["verbatim-string"]}""",
+        "examples": None,
+    },
+    # text input with ICL example
+    {
+        "text": "John went to the restaurant with Mary. James went to the cinema.",
+        "template": """{"names": ["verbatim-string"], "female_names": ["verbatim-string"]}""",
+        "examples": [
+            {
+                "input": "Stephen is the manager at Susan's store.",
+                "output": """{"names": ["STEPHEN", "SUSAN"], "female_names": ["SUSAN"]}"""
+            }
+        ],
+    },
+]
+input_messages = [
+    construct_message(
+        x["text"],
+        x["template"],
+        x["examples"]
+    ) for x in inputs
+]
+images = [
+    ["0.jpg"],
+    ["0.jpg", "1.jpg"],
+    None,
+    None
+]
+input_content = prepare_inputs(
+    messages=input_messages,
+    image_paths=images,
+    tokenizer=tokenizer,
+)
+generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}
+with torch.no_grad():
+    result = nuextract_generate(
+        model=model,
+        tokenizer=tokenizer,
+        prompts=input_content['prompts'],
+        pixel_values_list=input_content['pixel_values_list'],
+        num_patches_list=input_content['num_patches_list'],
+        generation_config=generation_config
+    )
+for y in result:
+    print(y)
+# {"store_name": "WAL*MART"}
+# {"store_name": "Trader Joe's"}
+# {"names": ["John", "Mary", "James"]}
+# {"names": ["JOHN", "MARY", "JAMES"], "female_names": ["MARY"]}
+```
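The responses are JSON strings, so a typical next step is to parse them into Python objects. The sketch below is one possible post-processing step, not part of the NuExtract code. `parse_extraction` is a hypothetical helper that returns `None` for anything that is not a JSON object, for example an output truncated by `max_new_tokens`.

```python
import json


def parse_extraction(raw):
    """Parse one model response into a dict, or return None if it is not a JSON object."""
    try:
        parsed = json.loads(raw.strip())
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


# Sample responses taken from the batched example above, plus one truncated output.
responses = [
    '{"store_name": "WAL*MART"}',
    '{"names": ["John", "Mary", "James"]}',
    '{"names": ["JOHN", "MARY"',  # simulated truncation
]
parsed = [parse_extraction(r) for r in responses]
for raw, obj in zip(responses, parsed):
    print("ok  " if obj is not None else "FAIL", raw)
```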