liamcripwell committed
Commit f1dd4b5 · verified · 1 Parent(s): 12316a2

Update README.md

Files changed (1):
  1. README.md +521 -196

README.md CHANGED
@@ -1,199 +1,524 @@
1
  ---
2
- library_name: transformers
3
- tags: []
4
  ---
5
 
6
- # Model Card for Model ID
7
-
8
- <!-- Provide a quick summary of what the model is/does. -->
9
-
10
-
11
-
12
- ## Model Details
13
-
14
- ### Model Description
15
-
16
- <!-- Provide a longer summary of what this model is. -->
17
-
18
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
19
-
20
- - **Developed by:** [More Information Needed]
21
- - **Funded by [optional]:** [More Information Needed]
22
- - **Shared by [optional]:** [More Information Needed]
23
- - **Model type:** [More Information Needed]
24
- - **Language(s) (NLP):** [More Information Needed]
25
- - **License:** [More Information Needed]
26
- - **Finetuned from model [optional]:** [More Information Needed]
27
-
28
- ### Model Sources [optional]
29
-
30
- <!-- Provide the basic links for the model. -->
31
-
32
- - **Repository:** [More Information Needed]
33
- - **Paper [optional]:** [More Information Needed]
34
- - **Demo [optional]:** [More Information Needed]
35
-
36
- ## Uses
37
-
38
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
39
-
40
- ### Direct Use
41
-
42
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
43
-
44
- [More Information Needed]
45
-
46
- ### Downstream Use [optional]
47
-
48
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
-
50
- [More Information Needed]
51
-
52
- ### Out-of-Scope Use
53
-
54
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
-
56
- [More Information Needed]
57
-
58
- ## Bias, Risks, and Limitations
59
-
60
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
61
-
62
- [More Information Needed]
63
-
64
- ### Recommendations
65
-
66
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
67
-
68
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
69
-
70
- ## How to Get Started with the Model
71
-
72
- Use the code below to get started with the model.
73
-
74
- [More Information Needed]
75
-
76
- ## Training Details
77
-
78
- ### Training Data
79
-
80
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
-
82
- [More Information Needed]
83
-
84
- ### Training Procedure
85
-
86
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
87
-
88
- #### Preprocessing [optional]
89
-
90
- [More Information Needed]
91
-
92
-
93
- #### Training Hyperparameters
94
-
95
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
96
-
97
- #### Speeds, Sizes, Times [optional]
98
-
99
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
100
-
101
- [More Information Needed]
102
-
103
- ## Evaluation
104
-
105
- <!-- This section describes the evaluation protocols and provides the results. -->
106
-
107
- ### Testing Data, Factors & Metrics
108
-
109
- #### Testing Data
110
-
111
- <!-- This should link to a Dataset Card if possible. -->
112
-
113
- [More Information Needed]
114
-
115
- #### Factors
116
-
117
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
118
-
119
- [More Information Needed]
120
-
121
- #### Metrics
122
-
123
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
124
-
125
- [More Information Needed]
126
-
127
- ### Results
128
-
129
- [More Information Needed]
130
-
131
- #### Summary
132
-
133
-
134
-
135
- ## Model Examination [optional]
136
-
137
- <!-- Relevant interpretability work for the model goes here -->
138
-
139
- [More Information Needed]
140
-
141
- ## Environmental Impact
142
-
143
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
144
-
145
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
146
-
147
- - **Hardware Type:** [More Information Needed]
148
- - **Hours used:** [More Information Needed]
149
- - **Cloud Provider:** [More Information Needed]
150
- - **Compute Region:** [More Information Needed]
151
- - **Carbon Emitted:** [More Information Needed]
152
-
153
- ## Technical Specifications [optional]
154
-
155
- ### Model Architecture and Objective
156
-
157
- [More Information Needed]
158
-
159
- ### Compute Infrastructure
160
-
161
- [More Information Needed]
162
-
163
- #### Hardware
164
-
165
- [More Information Needed]
166
-
167
- #### Software
168
-
169
- [More Information Needed]
170
-
171
- ## Citation [optional]
172
-
173
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
174
-
175
- **BibTeX:**
176
-
177
- [More Information Needed]
178
-
179
- **APA:**
180
-
181
- [More Information Needed]
182
-
183
- ## Glossary [optional]
184
-
185
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
186
-
187
- [More Information Needed]
188
-
189
- ## More Information [optional]
190
-
191
- [More Information Needed]
192
-
193
- ## Model Card Authors [optional]
194
-
195
- [More Information Needed]
196
-
197
- ## Model Card Contact
198
-
199
- [More Information Needed]

---
license: mit
language:
- multilingual
tags:
- nlp
base_model: OpenGVLab/InternVL2_5-8B
pipeline_tag: text-generation
inference: true
---

# NuExtract-2-8B by NuMind 🔥

NuExtract 2.0 is a family of models trained specifically for structured information extraction tasks. It supports multimodal inputs and is multilingual.

We provide several versions of different sizes, all based on the InternVL2.5 family.

| Model Size | Model Name | Base Model | Huggingface Link |
|------------|------------|------------|------------------|
| 2B | NuExtract-2.0-2B | [InternVL2_5-2B](https://huggingface.co/OpenGVLab/InternVL2_5-2B) | [NuExtract-2-2B](https://huggingface.co/numind/NuExtract-2-2B) |
| 4B | NuExtract-2.0-4B | [InternVL2_5-4B](https://huggingface.co/OpenGVLab/InternVL2_5-4B) | [NuExtract-2-4B](https://huggingface.co/numind/NuExtract-2-4B) |
| 8B | NuExtract-2.0-8B | [InternVL2_5-8B](https://huggingface.co/OpenGVLab/InternVL2_5-8B) | [NuExtract-2-8B](https://huggingface.co/numind/NuExtract-2-8B) |

## Overview

To use the model, provide an input text or image along with a JSON template describing the information you need to extract. The template should be a JSON object specifying field names and their expected types.

Supported types include:
* `verbatim-string` - instructs the model to extract text that is present verbatim in the input.
* `string` - a generic string field that can incorporate paraphrasing/abstraction.
* `integer` - a whole number.
* `number` - a whole or decimal number.
* `date-time` - an ISO-formatted date.
* `enum` - a choice from a set of possible answers (represented in the template as an array of options, e.g. `["yes", "no", "maybe"]`).
* `multi-label` - an enum that can have multiple possible answers (represented in the template as a double-wrapped array, e.g. `[["A", "B", "C"]]`).

The following is an example template:
```json
{
    "first_name": "verbatim-string",
    "last_name": "verbatim-string",
    "description": "string",
    "age": "integer",
    "gpa": "number",
    "birth_date": "date-time",
    "nationality": ["France", "England", "Japan", "USA", "China"],
    "languages_spoken": [["English", "French", "Japanese", "Mandarin", "Spanish"]]
}
```
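
Given an input document about a person, the model is expected to return a JSON object following this template. The output below is purely illustrative (all values are invented for demonstration) and only shows the expected shape:
```json
{
    "first_name": "Sarah",
    "last_name": "Dubois",
    "description": "A software engineer who recently moved to Tokyo.",
    "age": 29,
    "gpa": 3.8,
    "birth_date": "1995-04-12",
    "nationality": "France",
    "languages_spoken": ["English", "French", "Japanese"]
}
```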

⚠️ We recommend using NuExtract with a temperature at or very close to 0. Some inference frameworks, such as Ollama, use a default of 0.7, which is not well suited to many extraction tasks.

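With `transformers`, the simplest way to follow this recommendation is to disable sampling entirely; this is just a restatement of the generation settings used in the examples later in this card, not an additional requirement:
```python
# Greedy decoding (equivalent to temperature 0): deterministic output, no sampling.
generation_config = {
    "do_sample": False,      # disable sampling
    "num_beams": 1,          # plain greedy search
    "max_new_tokens": 2048,  # leave room for large JSON outputs
}
```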
## Inference

Use the following code to handle loading and preprocessing of input data:

```python
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

def prepare_inputs(messages, image_paths, tokenizer, device='cuda', dtype=torch.bfloat16):
    """
    Prepares multi-modal input components (supports multiple images per prompt).

    Args:
        messages: List of input messages/prompts (strings or dicts with 'role' and 'content')
        image_paths: List where each element is either None (for text-only) or a list of image paths
        tokenizer: The tokenizer to use for applying chat templates
        device: Device to place tensors on ('cuda', 'cpu', etc.)
        dtype: Data type for image tensors (default: torch.bfloat16)

    Returns:
        dict: Contains 'prompts', 'pixel_values_list', and 'num_patches_list' ready for the model
    """
    # Make sure image_paths list is at least as long as messages
    if len(image_paths) < len(messages):
        # Pad with None for text-only messages
        image_paths = image_paths + [None] * (len(messages) - len(image_paths))

    # Process images and collect patch information
    loaded_images = []
    num_patches_list = []
    for paths in image_paths:
        if paths and isinstance(paths, list) and len(paths) > 0:
            # Load each image in this prompt
            prompt_images = []
            prompt_patches = []

            for path in paths:
                # Load the image
                img = load_image(path).to(dtype=dtype, device=device)

                # Ensure img has correct shape [patches, C, H, W]
                if len(img.shape) == 3:  # [C, H, W] -> [1, C, H, W]
                    img = img.unsqueeze(0)

                prompt_images.append(img)
                # Record the number of patches for this image
                prompt_patches.append(img.shape[0])

            loaded_images.append(prompt_images)
            num_patches_list.append(prompt_patches)
        else:
            # Text-only prompt
            loaded_images.append(None)
            num_patches_list.append([])

    # Create the concatenated pixel_values_list
    pixel_values_list = []
    for prompt_images in loaded_images:
        if prompt_images:
            # Concatenate all images for this prompt
            pixel_values_list.append(torch.cat(prompt_images, dim=0))
        else:
            # Text-only prompt
            pixel_values_list.append(None)

    # Format messages for the model
    if all(isinstance(m, str) for m in messages):
        # Simple string messages: convert to chat format
        batch_messages = [
            [{"role": "user", "content": message}]
            for message in messages
        ]
    else:
        # Assume messages are already in the right format
        batch_messages = messages

    # Apply chat template
    prompts = tokenizer.apply_chat_template(
        batch_messages,
        tokenize=False,
        add_generation_prompt=True
    )

    return {
        'prompts': prompts,
        'pixel_values_list': pixel_values_list,
        'num_patches_list': num_patches_list
    }

def construct_message(text, template, examples=None):
    """
    Construct the individual NuExtract message texts, prior to chat template formatting.
    """
    # add few-shot examples if needed
    if examples is not None and len(examples) > 0:
        icl = "# Examples:\n"
        for row in examples:
            icl += f"## Input:\n{row['input']}\n## Output:\n{row['output']}\n"
    else:
        icl = ""

    return f"""# Template:\n{template}\n{icl}# Context:\n{text}"""
```
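
To see the prompt format these helpers produce, you can call `construct_message` directly; the template and text below are purely illustrative:
```python
template = """{"names": ["verbatim-string"]}"""
text = "John went to the restaurant with Mary."

print(construct_message(text, template))
# # Template:
# {"names": ["verbatim-string"]}
# # Context:
# John went to the restaurant with Mary.
```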

To handle inference:

```python
IMG_START_TOKEN='<img>'
IMG_END_TOKEN='</img>'
IMG_CONTEXT_TOKEN='<IMG_CONTEXT>'

def nuextract_generate(model, tokenizer, prompts, generation_config, pixel_values_list=None, num_patches_list=None):
    """
    Generate responses for a batch of NuExtract inputs.
    Supports multiple and varying numbers of images per prompt.

    Args:
        model: The vision-language model
        tokenizer: The tokenizer for the model
        prompts: List of text prompts
        generation_config: Configuration for text generation
        pixel_values_list: List of tensor batches, one per prompt
            Each batch has shape [num_images, channels, height, width] or None for text-only prompts
        num_patches_list: List of lists, each containing patch counts for images in a prompt

    Returns:
        List of generated responses
    """
    img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
    model.img_context_token_id = img_context_token_id

    # Replace all image placeholders with appropriate tokens
    modified_prompts = []
    total_image_files = 0
    total_patches = 0
    image_containing_prompts = []
    for idx, prompt in enumerate(prompts):
        # check if this prompt has images
        has_images = (pixel_values_list and
                      idx < len(pixel_values_list) and
                      pixel_values_list[idx] is not None and
                      isinstance(pixel_values_list[idx], torch.Tensor) and
                      pixel_values_list[idx].shape[0] > 0)

        if has_images:
            # prompt with image placeholders
            image_containing_prompts.append(idx)
            modified_prompt = prompt

            patches = num_patches_list[idx] if (num_patches_list and idx < len(num_patches_list)) else []
            num_images = len(patches)
            total_image_files += num_images
            total_patches += sum(patches)

            # replace each <image> placeholder with image tokens
            for i, num_patches in enumerate(patches):
                image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * model.num_image_token * num_patches + IMG_END_TOKEN
                modified_prompt = modified_prompt.replace('<image>', image_tokens, 1)
        else:
            # text-only prompt
            modified_prompt = prompt

        modified_prompts.append(modified_prompt)

    # process all prompts in a single batch
    tokenizer.padding_side = 'left'
    model_inputs = tokenizer(modified_prompts, return_tensors='pt', padding=True)
    input_ids = model_inputs['input_ids'].to(model.device)
    attention_mask = model_inputs['attention_mask'].to(model.device)

    eos_token_id = tokenizer.convert_tokens_to_ids("<|im_end|>\n".strip())
    generation_config['eos_token_id'] = eos_token_id

    # prepare pixel values
    flattened_pixel_values = None
    if image_containing_prompts:
        # collect and concatenate all image tensors
        all_pixel_values = []
        for idx in image_containing_prompts:
            all_pixel_values.append(pixel_values_list[idx])

        flattened_pixel_values = torch.cat(all_pixel_values, dim=0)
        print(f"Processing batch with {len(prompts)} prompts, {total_image_files} actual images, and {total_patches} total patches")
    else:
        print(f"Processing text-only batch with {len(prompts)} prompts")

    # generate outputs
    outputs = model.generate(
        pixel_values=flattened_pixel_values,  # will be None for text-only prompts
        input_ids=input_ids,
        attention_mask=attention_mask,
        **generation_config
    )

    # Decode responses
    responses = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return responses
```
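
Since the model returns its extraction as a JSON string, you will usually want to parse it. A minimal sketch (the helper below is ours, not part of the model's API):
```python
import json

def parse_extraction(response):
    """Parse a NuExtract response into a Python dict, returning None if it is not valid JSON."""
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        return None
```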

To load the model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "numind/NuExtract-2-8B"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, padding_side='left')
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True,
                                             torch_dtype=torch.bfloat16,
                                             attn_implementation="flash_attention_2"  # we recommend using flash attention
                                             ).to("cuda")
```
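
If flash attention is not installed in your environment, a reasonable fallback (our assumption, not an official recommendation) is to omit `attn_implementation` and let `transformers` choose its default attention backend:
```python
# Fallback loading without flash attention (slower, but avoids the extra dependency).
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to("cuda")
```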

Simple 0-shot text-only example:
```python
template = """{"names": ["verbatim-string"]}"""
text = "John went to the restaurant with Mary. James went to the cinema."

input_messages = [construct_message(text, template)]

input_content = prepare_inputs(
    messages=input_messages,
    image_paths=[],
    tokenizer=tokenizer,
)

generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

with torch.no_grad():
    result = nuextract_generate(
        model=model,
        tokenizer=tokenizer,
        prompts=input_content['prompts'],
        pixel_values_list=input_content['pixel_values_list'],
        num_patches_list=input_content['num_patches_list'],
        generation_config=generation_config
    )
for y in result:
    print(y)
# {"names": ["John", "Mary", "James"]}
```

Text-only input with an in-context example:
```python
template = """{"names": ["verbatim-string"], "female_names": ["verbatim-string"]}"""
text = "John went to the restaurant with Mary. James went to the cinema."
examples = [
    {
        "input": "Stephen is the manager at Susan's store.",
        "output": """{"names": ["STEPHEN", "SUSAN"], "female_names": ["SUSAN"]}"""
    }
]

input_messages = [construct_message(text, template, examples)]

input_content = prepare_inputs(
    messages=input_messages,
    image_paths=[],
    tokenizer=tokenizer,
)

generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

with torch.no_grad():
    result = nuextract_generate(
        model=model,
        tokenizer=tokenizer,
        prompts=input_content['prompts'],
        pixel_values_list=input_content['pixel_values_list'],
        num_patches_list=input_content['num_patches_list'],
        generation_config=generation_config
    )
for y in result:
    print(y)
# {"names": ["JOHN", "MARY", "JAMES"], "female_names": ["MARY"]}
```

Example with image input and an in-context example. Image inputs should use the `<image>` placeholder in place of text, and image paths should be provided as a list in their order of appearance in the prompt (in this example, `0.jpg` is used for the in-context example and `1.jpg` for the true input).
```python
template = """{"store": "verbatim-string"}"""
text = "<image>"
examples = [
    {
        "input": "<image>",
        "output": """{"store": "Walmart"}"""
    }
]

input_messages = [construct_message(text, template, examples)]

images = [
    ["0.jpg", "1.jpg"]
]

input_content = prepare_inputs(
    messages=input_messages,
    image_paths=images,
    tokenizer=tokenizer,
)

generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

with torch.no_grad():
    result = nuextract_generate(
        model=model,
        tokenizer=tokenizer,
        prompts=input_content['prompts'],
        pixel_values_list=input_content['pixel_values_list'],
        num_patches_list=input_content['num_patches_list'],
        generation_config=generation_config
    )
for y in result:
    print(y)
# {"store": "Trader Joe's"}
```
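
The same ordering rule applies with more images: paths are consumed in the order the `<image>` placeholders appear in the constructed prompt (in-context examples first, then the context). A hypothetical sketch with two in-context examples (all file names are illustrative):
```python
examples = [
    {"input": "<image>", "output": """{"store": "Walmart"}"""},  # consumes the 1st image path
    {"input": "<image>", "output": """{"store": "Costco"}"""},   # consumes the 2nd image path
]
text = "<image>"  # the true input consumes the 3rd image path

input_messages = [construct_message(text, template, examples)]
images = [["walmart.jpg", "costco.jpg", "receipt.jpg"]]  # one list per prompt, in placeholder order
```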

Multi-modal batched input:
```python
inputs = [
    # image input with no ICL examples
    {
        "text": "<image>",
        "template": """{"store_name": "verbatim-string"}""",
        "examples": None,
    },
    # image input with 1 ICL example
    {
        "text": "<image>",
        "template": """{"store_name": "verbatim-string"}""",
        "examples": [
            {
                "input": "<image>",
                "output": """{"store_name": "Walmart"}""",
            }
        ],
    },
    # text input with no ICL examples
    {
        "text": "John went to the restaurant with Mary. James went to the cinema.",
        "template": """{"names": ["verbatim-string"]}""",
        "examples": None,
    },
    # text input with ICL example
    {
        "text": "John went to the restaurant with Mary. James went to the cinema.",
        "template": """{"names": ["verbatim-string"], "female_names": ["verbatim-string"]}""",
        "examples": [
            {
                "input": "Stephen is the manager at Susan's store.",
                "output": """{"names": ["STEPHEN", "SUSAN"], "female_names": ["SUSAN"]}"""
            }
        ],
    },
]

input_messages = [
    construct_message(
        x["text"],
        x["template"],
        x["examples"]
    ) for x in inputs
]

images = [
    ["0.jpg"],
    ["0.jpg", "1.jpg"],
    None,
    None
]

input_content = prepare_inputs(
    messages=input_messages,
    image_paths=images,
    tokenizer=tokenizer,
)

generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

with torch.no_grad():
    result = nuextract_generate(
        model=model,
        tokenizer=tokenizer,
        prompts=input_content['prompts'],
        pixel_values_list=input_content['pixel_values_list'],
        num_patches_list=input_content['num_patches_list'],
        generation_config=generation_config
    )
for y in result:
    print(y)
# {"store_name": "WAL*MART"}
# {"store_name": "Trader Joe's"}
# {"names": ["John", "Mary", "James"]}
# {"names": ["JOHN", "MARY", "JAMES"], "female_names": ["MARY"]}
```
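
Putting the pieces together, a convenience wrapper along the following lines can be useful. This is our own sketch built from the helpers above (not part of the model's API), and it assumes the batch is already sized to fit in memory:
```python
import json

def extract(model, tokenizer, texts, templates, examples=None, image_paths=None, max_new_tokens=2048):
    """Run NuExtract on a batch of inputs and return parsed JSON objects (or None on parse failure)."""
    # Default to no in-context examples and text-only inputs.
    examples = examples or [None] * len(texts)
    image_paths = image_paths or [None] * len(texts)

    # Build prompts and preprocess any images.
    messages = [construct_message(t, tpl, ex) for t, tpl, ex in zip(texts, templates, examples)]
    content = prepare_inputs(messages=messages, image_paths=image_paths, tokenizer=tokenizer)

    # Greedy decoding, as recommended above.
    generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": max_new_tokens}
    with torch.no_grad():
        responses = nuextract_generate(
            model=model,
            tokenizer=tokenizer,
            prompts=content['prompts'],
            pixel_values_list=content['pixel_values_list'],
            num_patches_list=content['num_patches_list'],
            generation_config=generation_config,
        )

    # Parse each response, tolerating malformed JSON.
    results = []
    for r in responses:
        try:
            results.append(json.loads(r))
        except json.JSONDecodeError:
            results.append(None)
    return results
```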