TypeError: BackendCompilerFailed.__init__() missing 1 required positional argument: 'inner_exception'

#58
by pofce - opened
from vllm import LLM
from transformers import AutoTokenizer

MODEL_ID = "google/gemma-2-2b-it"
TOKENIZER_ID = "google/gemma-2-2b-it"

llm = LLM(model=MODEL_ID, dtype="float16", device="cuda")
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_ID)

I am using an A100 SXM4 on Vast.ai.

config.json: 100%
 838/838 [00:00<00:00, 54.0kB/s]
WARNING 01-21 19:29:08 config.py:2276] Casting torch.bfloat16 to torch.float16.
INFO 01-21 19:29:16 config.py:510] This model supports multiple tasks: {'classify', 'reward', 'generate', 'embed', 'score'}. Defaulting to 'generate'.
INFO 01-21 19:29:16 llm_engine.py:234] Initializing an LLM engine (v0.6.6.post1) with config: model='google/gemma-2-2b-it', speculative_config=None, tokenizer='google/gemma-2-2b-it', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=google/gemma-2-2b-it, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
tokenizer_config.json: 100%
 47.0k/47.0k [00:00<00:00, 4.44MB/s]
tokenizer.model: 100%
 4.24M/4.24M [00:00<00:00, 29.6MB/s]
tokenizer.json: 100%
 17.5M/17.5M [00:00<00:00, 39.9MB/s]
special_tokens_map.json: 100%
 636/636 [00:00<00:00, 59.1kB/s]
generation_config.json: 100%
 187/187 [00:00<00:00, 17.2kB/s]
INFO 01-21 19:29:20 selector.py:120] Using Flash Attention backend.
INFO 01-21 19:29:20 model_runner.py:1094] Starting to load model google/gemma-2-2b-it...
INFO 01-21 19:29:21 weight_utils.py:251] Using model weights format ['*.safetensors']
model-00002-of-00002.safetensors: 100%
 241M/241M [00:05<00:00, 41.5MB/s]
model-00001-of-00002.safetensors: 100%
 4.99G/4.99G [02:08<00:00, 37.5MB/s]
model.safetensors.index.json: 100%
 24.2k/24.2k [00:00<00:00, 1.95MB/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00,  1.23s/it]
INFO 01-21 19:31:33 model_runner.py:1099] Loading model weights took 4.8999 GB
INFO 01-21 19:31:34 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250121-193134.pkl...
INFO 01-21 19:31:34 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20250121-193134.pkl.

RuntimeError Traceback (most recent call last)
File /opt/conda/lib/python3.11/site-packages/torch/_dynamo/output_graph.py:1446, in OutputGraph._call_user_compiler(self, gm)
1445 compiler_fn = WrapperBackend(compiler_fn)
-> 1446 compiled_fn = compiler_fn(gm, self.example_inputs())
1447 _step_logger()(logging.INFO, f"done compiler function {name}")

File /opt/conda/lib/python3.11/site-packages/torch/_dynamo/repro/after_dynamo.py:129, in WrapBackendDebug.__call__(self, gm, example_inputs, **kwargs)
128 else:
--> 129 compiled_gm = compiler_fn(gm, example_inputs)
131 return compiled_gm

File /opt/conda/lib/python3.11/site-packages/torch/__init__.py:2234, in TorchCompileInductorWrapper.__call__(self, model_, inputs_)
2232 from torch._inductor.compile_fx import compile_fx
-> 2234 return compile_fx(model_, inputs_, config_patches=self.config)

File /opt/conda/lib/python3.11/site-packages/torch/_inductor/compile_fx.py:1521, in compile_fx(model_, example_inputs_, inner_compile, config_patches, decompositions)
1516 with V.set_fake_mode(fake_mode), torch.guards.tracing(
1517 tracing_context
1518 ), compiled_autograd.disable(), functorch_config.patch(
1519 unlift_effect_tokens=True
1520 ):
-> 1521 return aot_autograd(
1522 fw_compiler=fw_compiler,
1523 bw_compiler=bw_compiler,
1524 inference_compiler=inference_compiler,
1525 decompositions=decompositions,
1526 partition_fn=partition_fn,
1527 keep_inference_input_mutations=True,
1528 cudagraphs=cudagraphs,
1529 )(model_, example_inputs_)

File /opt/conda/lib/python3.11/site-packages/torch/_dynamo/backends/common.py:72, in AotAutograd.__call__(self, gm, example_inputs, **kwargs)
71 with enable_aot_logging(), patch_config:
---> 72 cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
73 counters["aot_autograd"]["ok"] += 1

File /opt/conda/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py:1071, in aot_module_simplified(mod, args, fw_compiler, bw_compiler, partition_fn, decompositions, keep_inference_input_mutations, inference_compiler, cudagraphs)
1070 else:
-> 1071 compiled_fn = dispatch_and_compile()
1073 if isinstance(mod, torch._dynamo.utils.GmWrapper):
1074 # This function is called by the flatten_graph_inputs wrapper, which boxes
1075 # the inputs so that they can be freed before the end of this scope.
1076 # For overhead reasons, this is not the default wrapper, see comment:
1077 # https://github.com/pytorch/pytorch/pull/122535/files#r1560096481

File /opt/conda/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py:1056, in aot_module_simplified.<locals>.dispatch_and_compile()
1055 with compiled_autograd.disable():
-> 1056 compiled_fn, _ = create_aot_dispatcher_function(
1057 functional_call,
1058 fake_flat_args,
1059 aot_config,
1060 fake_mode,
1061 shape_env,
1062 )
1063 return compiled_fn

File /opt/conda/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py:522, in create_aot_dispatcher_function(flat_fn, fake_flat_args, aot_config, fake_mode, shape_env)
521 with dynamo_timed("create_aot_dispatcher_function"):
--> 522 return _create_aot_dispatcher_function(
523 flat_fn, fake_flat_args, aot_config, fake_mode, shape_env
524 )

File /opt/conda/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py:759, in _create_aot_dispatcher_function(flat_fn, fake_flat_args, aot_config, fake_mode, shape_env)
757 compiler_fn = choose_dispatcher(needs_autograd, aot_config)
--> 759 compiled_fn, fw_metadata = compiler_fn(
760 flat_fn,
761 _dup_fake_script_obj(fake_flat_args),
762 aot_config,
763 fw_metadata=fw_metadata,
764 )
765 return compiled_fn, fw_metadata

File /opt/conda/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py:179, in aot_dispatch_base(flat_fn, flat_args, aot_config, fw_metadata)
178 with TracingContext.report_output_strides() as fwd_output_strides:
--> 179 compiled_fw = compiler(fw_module, updated_flat_args)
181 if fakified_out_wrapper.needs_post_compile:

File /opt/conda/lib/python3.11/site-packages/torch/_inductor/compile_fx.py:1350, in compile_fx.<locals>.fw_compiler_base(model, example_inputs, is_inference)
1349 with dynamo_utils.dynamo_timed("compile_fx.<locals>.fw_compiler_base"):
-> 1350 return _fw_compiler_base(model, example_inputs, is_inference)

File /opt/conda/lib/python3.11/site-packages/torch/_inductor/compile_fx.py:1421, in compile_fx.<locals>._fw_compiler_base(model, example_inputs, is_inference)
1413 user_visible_outputs = dict.fromkeys(
1414 n.name
1415 for n in model_outputs[
(...)
1418 if isinstance(n, torch.fx.Node)
1419 )
-> 1421 return inner_compile(
1422 model,
1423 example_inputs,
1424 static_input_idxs=get_static_input_idxs(fixed),
1425 cudagraphs=cudagraphs,
1426 graph_id=graph_id,
1427 is_inference=is_inference,
1428 boxed_forward_device_index=forward_device,
1429 user_visible_outputs=user_visible_outputs,
1430 )

File /opt/conda/lib/python3.11/site-packages/torch/_inductor/compile_fx.py:475, in compile_fx_inner(*args, **kwargs)
473 stack.enter_context(DebugContext())
--> 475 return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
476 *args, **kwargs
477 )

File /opt/conda/lib/python3.11/site-packages/torch/_dynamo/repro/after_aot.py:85, in wrap_compiler_debug.<locals>.debug_wrapper(gm, example_inputs, **kwargs)
82 try:
83 # Call the compiler_fn - which is either aot_autograd or inductor
84 # with fake inputs
---> 85 inner_compiled_fn = compiler_fn(gm, example_inputs)
86 except Exception as e:
87 # TODO: Failures here are troublesome because no real inputs,
88 # need a different serialization strategy

File /opt/conda/lib/python3.11/site-packages/torch/_inductor/compile_fx.py:661, in _compile_fx_inner(gm, example_inputs, cudagraphs, static_input_idxs, is_backward, graph_id, cpp_wrapper, aot_mode, is_inference, boxed_forward_device_index, user_visible_outputs, layout_opt, extern_node_serializer)
659 input._is_inductor_static = True # type: ignore[attr-defined]
--> 661 compiled_graph = FxGraphCache.load(
662 codegen_and_compile,
663 gm,
664 example_inputs,
665 graph_kwargs,
666 inputs_to_check,
667 local=config.fx_graph_cache,
668 remote=fx_graph_remote_cache,
669 )
670 else:

File /opt/conda/lib/python3.11/site-packages/torch/_inductor/codecache.py:1334, in FxGraphCache.load(compile_fx_fn, gm, example_inputs, fx_kwargs, inputs_to_check, local, remote)
1333 cache_event_time = start_time
-> 1334 compiled_graph = compile_fx_fn(
1335 gm, example_inputs, inputs_to_check, fx_kwargs
1336 )
1337 compiled_graph._time_taken_ns = time_ns() - start_time

File /opt/conda/lib/python3.11/site-packages/torch/_inductor/compile_fx.py:570, in _compile_fx_inner.<locals>.codegen_and_compile(gm, example_inputs, inputs_to_check, fx_kwargs)
566 """
567 This function calls fx_codegen_and_compile and also adds some extra metadata to the resulting
568 compiled fx graph. The metadata is saved to FXGraphCache.
569 """
--> 570 compiled_graph = fx_codegen_and_compile(gm, example_inputs, **fx_kwargs)
571 if isinstance(compiled_graph, str):
572 # We only return a string in aot mode, in which case we don't
573 # need to do any post-compilation steps: we just return the string,
574 # which is the filename of the compiled code.

File /opt/conda/lib/python3.11/site-packages/torch/_inductor/compile_fx.py:878, in fx_codegen_and_compile(gm, example_inputs, cudagraphs, static_input_idxs, is_backward, graph_id, cpp_wrapper, aot_mode, is_inference, user_visible_outputs, layout_opt, extern_node_serializer)
877 _check_triton_bf16_support(graph)
--> 878 compiled_fn = graph.compile_to_fn()
879 num_bytes, nodes_num_elem, node_runtimes = graph.count_bytes()

File /opt/conda/lib/python3.11/site-packages/torch/_inductor/graph.py:1913, in GraphLowering.compile_to_fn(self)
1912 else:
-> 1913 return self.compile_to_module().call

File /opt/conda/lib/python3.11/site-packages/torch/_inductor/graph.py:1839, in GraphLowering.compile_to_module(self)
1836 with dynamo_timed(
1837 "GraphLowering.compile_to_module", phase_name="code_gen", fwd_only=False
1838 ):
-> 1839 return self._compile_to_module()

File /opt/conda/lib/python3.11/site-packages/torch/_inductor/graph.py:1845, in GraphLowering._compile_to_module(self)
1842 from .codecache import PyCodeCache
1844 code, linemap = (
-> 1845 self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
1846 )
1848 GraphLowering.save_output_code(code)

File /opt/conda/lib/python3.11/site-packages/torch/_inductor/graph.py:1784, in GraphLowering.codegen(self)
1783 self.wrapper_code.push_codegened_graph(self)
-> 1784 self.scheduler.codegen()
1786 log.debug(
1787 "Finished codegen for all nodes. The list of kernel names available: %s",
1788 V.graph.all_codegen_kernel_names,
1789 )

File /opt/conda/lib/python3.11/site-packages/torch/_inductor/scheduler.py:3383, in Scheduler.codegen(self)
3382 with dynamo_timed("Scheduler.codegen"):
-> 3383 return self._codegen()

File /opt/conda/lib/python3.11/site-packages/torch/_inductor/scheduler.py:3461, in Scheduler._codegen(self)
3460 elif isinstance(node, (FusedSchedulerNode, SchedulerNode)):
-> 3461 self.get_backend(device).codegen_node(node)
3462 else:

File /opt/conda/lib/python3.11/site-packages/torch/_inductor/codegen/cuda_combined_scheduling.py:80, in CUDACombinedScheduling.codegen_node(self, node)
79 def codegen_node(self, node: Union[FusedSchedulerNode, SchedulerNode]):
---> 80 return self._triton_scheduling.codegen_node(node)

File /opt/conda/lib/python3.11/site-packages/torch/_inductor/codegen/simd.py:1155, in SIMDScheduling.codegen_node(self, node)
1153 schedule_log.debug("Schedule:\n %s", node_schedule)
-> 1155 return self.codegen_node_schedule(node_schedule, buf_accesses, numel, rnumel)

File /opt/conda/lib/python3.11/site-packages/torch/_inductor/codegen/simd.py:1364, in SIMDScheduling.codegen_node_schedule(self, node_schedule, buf_accesses, numel, reduction_numel)
1363 with V.set_kernel_handler(kernel):
-> 1364 src_code = kernel.codegen_kernel()
1366 kernel_name = self.define_kernel(src_code, node_schedule, kernel)

File /opt/conda/lib/python3.11/site-packages/torch/_inductor/codegen/triton.py:2661, in TritonKernel.codegen_kernel(self, name)
2646 triton_meta = {
2647 "signature": triton_meta_signature,
2648 "device": DeviceProperties.create(
(...)
2651 "constants": {},
2652 }
2654 inductor_meta = {
2655 "autotune_hints": set(self.autotune_hints),
2656 "kernel_name": str(Placeholder.DESCRIPTIVE_NAME),
2657 "mutated_arg_names": mutated_args,
2658 "no_x_dim": self.no_x_dim,
2659 "num_load": self.num_load,
2660 "num_reduction": self.num_reduction,
-> 2661 **self.inductor_meta_common(),
2662 }
2664 num_gb = None

File /opt/conda/lib/python3.11/site-packages/torch/_inductor/codegen/triton.py:2532, in TritonKernel.inductor_meta_common()
2529 @staticmethod
2530 def inductor_meta_common():
2531 inductor_meta = {
-> 2532 "backend_hash": torch.utils._triton.triton_hash_with_backend(),
2533 "are_deterministic_algorithms_enabled": torch.are_deterministic_algorithms_enabled(),
2534 "assert_indirect_indexing": config.assert_indirect_indexing,
2535 "autotune_local_cache": config.autotune_local_cache,
2536 "autotune_pointwise": config.triton.autotune_pointwise,
2537 "autotune_remote_cache": config.autotune_remote_cache,
2538 "force_disable_caches": config.force_disable_caches,
2539 "dynamic_scale_rblock": config.dynamic_scale_rblock,
2540 "max_autotune": config.max_autotune,
2541 "max_autotune_pointwise": config.max_autotune_pointwise,
2542 "min_split_scan_rblock": config.triton.min_split_scan_rblock,
2543 "spill_threshold": config.triton.spill_threshold,
2544 "store_cubin": config.triton.store_cubin,
2545 }
2546 if torch.version.hip is not None:

File /opt/conda/lib/python3.11/site-packages/torch/utils/_triton.py:53, in triton_hash_with_backend()
51 from triton.compiler.compiler import triton_key
---> 53 backend = triton_backend()
54 key = f"{triton_key()}-{backend.hash()}"

File /opt/conda/lib/python3.11/site-packages/torch/utils/_triton.py:45, in triton_backend()
43 from triton.runtime.driver import driver
---> 45 target = driver.active.get_current_target()
46 return make_backend(target)

File /opt/conda/lib/python3.11/site-packages/triton/runtime/driver.py:23, in LazyProxy.__getattr__(self, name)
22 def __getattr__(self, name):
---> 23 self._initialize_obj()
24 return getattr(self._obj, name)

File /opt/conda/lib/python3.11/site-packages/triton/runtime/driver.py:20, in LazyProxy._initialize_obj(self)
19 if self._obj is None:
---> 20 self._obj = self._init_fn()

File /opt/conda/lib/python3.11/site-packages/triton/runtime/driver.py:9, in _create_driver()
8 raise RuntimeError(f"{len(actives)} active drivers ({actives}). There should only be one.")
----> 9 return actives[0]

File /opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/driver.py:371, in CudaDriver.__init__(self)
370 def __init__(self):
--> 371 self.utils = CudaUtils() # TODO: make static
372 self.launcher_cls = CudaLauncher

File /opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/driver.py:80, in CudaUtils.__init__(self)
79 def __init__(self):
---> 80 mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
81 self.load_binary = mod.load_binary

File /opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/driver.py:57, in compile_module_from_src(src, name)
56 f.write(src)
---> 57 so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
58 with open(so, "rb") as f:

File /opt/conda/lib/python3.11/site-packages/triton/runtime/build.py:32, in _build(name, src, srcdir, library_dirs, include_dirs, libraries)
31 if cc is None:
---> 32 raise RuntimeError("Failed to find C compiler. Please specify via CC environment variable.")
33 # This function was renamed and made public in Python 3.10

RuntimeError: Failed to find C compiler. Please specify via CC environment variable.

The above exception was the direct cause of the following exception:

BackendCompilerFailed Traceback (most recent call last)
File /opt/conda/lib/python3.11/site-packages/vllm/worker/model_runner_base.py:116, in dump_input_when_exception.<locals>._inner.<locals>._wrapper(*args, **kwargs)
115 try:
--> 116 return func(*args, **kwargs)
117 except Exception as err:

File /opt/conda/lib/python3.11/site-packages/vllm/worker/model_runner.py:1691, in ModelRunner.execute_model(self, model_input, kv_caches, intermediate_tensors, num_steps)
1689 with set_forward_context(model_input.attn_metadata,
1690 self.vllm_config):
-> 1691 hidden_or_intermediate_states = model_executable(
1692 input_ids=model_input.input_tokens,
1693 positions=model_input.input_positions,
1694 kv_caches=kv_caches,
1695 attn_metadata=model_input.attn_metadata,
1696 intermediate_tensors=intermediate_tensors,
1697 **MultiModalKwargs.as_kwargs(multi_modal_kwargs,
1698 device=self.device),
1699 **seqlen_agnostic_kwargs)
1701 if (self.observability_config is not None
1702 and self.observability_config.collect_model_forward_time):

File /opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
1735 else:
-> 1736 return self._call_impl(*args, **kwargs)

File /opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1745 or _global_backward_pre_hooks or _global_backward_hooks
1746 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747 return forward_call(*args, **kwargs)
1749 result = None

File /opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/gemma2.py:442, in Gemma2ForCausalLM.forward(self, input_ids, positions, kv_caches, attn_metadata, intermediate_tensors, inputs_embeds)
433 def forward(
434 self,
435 input_ids: torch.Tensor,
(...)
440 inputs_embeds: Optional[torch.Tensor] = None,
441 ) -> Union[torch.Tensor, IntermediateTensors]:
--> 442 hidden_states = self.model(input_ids, positions, kv_caches,
443 attn_metadata, intermediate_tensors,
444 inputs_embeds)
445 return hidden_states

File /opt/conda/lib/python3.11/site-packages/vllm/compilation/decorators.py:168, in _support_torch_compile.<locals>.__call__(self, *args, **kwargs)
167 if self.do_not_compile or torch.compiler.is_compiling():
--> 168 return self.forward(*args, **kwargs)
170 # the first compilation needs to have dynamic shapes marked

File /opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/gemma2.py:304, in Gemma2Model.forward(self, input_ids, positions, kv_caches, attn_metadata, intermediate_tensors, inputs_embeds)
303 layer = self.layers[i]
--> 304 hidden_states, residual = layer(
305 positions,
306 hidden_states,
307 kv_caches[i - self.start_layer],
308 attn_metadata,
309 residual,
310 )
311 if not get_pp_group().is_last_rank:

File /opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
1735 else:
-> 1736 return self._call_impl(*args, **kwargs)

File /opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1745 or _global_backward_pre_hooks or _global_backward_hooks
1746 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747 return forward_call(*args, **kwargs)
1749 result = None

File /opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/gemma2.py:229, in Gemma2DecoderLayer.forward(self, positions, hidden_states, kv_cache, attn_metadata, residual)
228 residual = hidden_states
--> 229 hidden_states = self.input_layernorm(hidden_states)
230 else:

File /opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
1735 else:
-> 1736 return self._call_impl(*args, **kwargs)

File /opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1745 or _global_backward_pre_hooks or _global_backward_hooks
1746 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747 return forward_call(*args, **kwargs)
1749 result = None

File /opt/conda/lib/python3.11/site-packages/vllm/model_executor/custom_op.py:24, in CustomOp.forward(self, *args, **kwargs)
23 def forward(self, *args, **kwargs):
---> 24 return self._forward_method(*args, **kwargs)

File /opt/conda/lib/python3.11/site-packages/vllm/model_executor/layers/layernorm.py:212, in GemmaRMSNorm.forward_cuda(self, x, residual)
211 self._is_compiled = True
--> 212 return self.forward_native(x, residual)

File /opt/conda/lib/python3.11/site-packages/vllm/model_executor/layers/layernorm.py:197, in GemmaRMSNorm.forward_native(self, x, residual)
196 """PyTorch-native implementation equivalent to forward()."""
--> 197 return self.forward_static(self.weight.data, self.variance_epsilon, x,
198 residual)

File /opt/conda/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:465, in _TorchDynamoContext.__call__.<locals>._fn(*args, **kwargs)
464 try:
--> 465 return fn(*args, **kwargs)
466 finally:
467 # Restore the dynamic layer stack depth if necessary.

File /opt/conda/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py:1269, in CatchErrorsWrapper.__call__(self, frame, cache_entry, frame_state)
1267 with compile_lock, _disable_current_modes():
1268 # skip=1: skip this frame
-> 1269 return self._torchdynamo_orig_callable(
1270 frame, cache_entry, self.hooks, frame_state, skip=1
1271 )

File /opt/conda/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py:1064, in ConvertFrame.__call__(self, frame, cache_entry, hooks, frame_state, skip)
1063 try:
-> 1064 result = self._inner_convert(
1065 frame, cache_entry, hooks, frame_state, skip=skip + 1
1066 )
1067 counters["frames"]["ok"] += 1

File /opt/conda/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py:526, in ConvertFrameAssert.__call__(self, frame, cache_entry, hooks, frame_state, skip)
512 signpost_event(
513 "dynamo",
514 "_convert_frame_assert._compile",
(...)
523 },
524 )
--> 526 return _compile(
527 frame.f_code,
528 frame.f_globals,
529 frame.f_locals,
530 frame.f_builtins,
531 self._torchdynamo_orig_callable,
532 self._one_graph,
533 self._export,
534 self._export_constraints,
535 hooks,
536 cache_entry,
537 cache_size,
538 frame,
539 frame_state=frame_state,
540 compile_id=compile_id,
541 skip=skip + 1,
542 )

File /opt/conda/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py:924, in _compile(code, globals, locals, builtins, compiler_fn, one_graph, export, export_constraints, hooks, cache_entry, cache_size, frame, frame_state, compile_id, skip)
923 try:
--> 924 guarded_code = compile_inner(code, one_graph, hooks, transform)
925 return guarded_code

File /opt/conda/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py:666, in _compile.<locals>.compile_inner(code, one_graph, hooks, transform)
665 with CompileTimeInstructionCounter.record():
--> 666 return _compile_inner(code, one_graph, hooks, transform)

File /opt/conda/lib/python3.11/site-packages/torch/_utils_internal.py:87, in compile_time_strobelight_meta.<locals>.compile_time_strobelight_meta_inner.<locals>.wrapper_function(*args, **kwargs)
86 if not StrobelightCompileTimeProfiler.enabled:
---> 87 return function(*args, **kwargs)
89 return StrobelightCompileTimeProfiler.profile_compile_time(
90 function, phase_name, *args, **kwargs
91 )

File /opt/conda/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py:699, in _compile.<locals>._compile_inner(code, one_graph, hooks, transform)
698 try:
--> 699 out_code = transform_code_object(code, transform)
700 break

File /opt/conda/lib/python3.11/site-packages/torch/_dynamo/bytecode_transformation.py:1322, in transform_code_object(code, transformations, safe)
1320 propagate_line_nums(instructions)
-> 1322 transformations(instructions, code_options)
1323 return clean_and_assemble_instructions(instructions, keys, code_options)[1]

File /opt/conda/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py:219, in preserve_global_state.<locals>._fn(*args, **kwargs)
218 try:
--> 219 return fn(*args, **kwargs)
220 finally:

File /opt/conda/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py:634, in _compile.<locals>.transform(instructions, code_options)
633 with tracing(tracer.output.tracing_context), tracer.set_current_tx():
--> 634 tracer.run()
635 except exc.UnspecializeRestartAnalysis:

File /opt/conda/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py:2796, in InstructionTranslator.run(self)
2795 def run(self):
-> 2796 super().run()

File /opt/conda/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py:983, in InstructionTranslatorBase.run(self)
982 self.output.push_tx(self)
--> 983 while self.step():
984 pass

File /opt/conda/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py:895, in InstructionTranslatorBase.step(self)
894 try:
--> 895 self.dispatch_table[inst.opcode](self, inst)
896 return not self.output.should_exit

File /opt/conda/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py:2987, in InstructionTranslator.RETURN_VALUE(self, inst)
2986 def RETURN_VALUE(self, inst):
-> 2987 self._return(inst)

File /opt/conda/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py:2972, in InstructionTranslator._return(self, inst)
2971 log.debug("%s triggered compile", inst.opname)
-> 2972 self.output.compile_subgraph(
2973 self,
2974 reason=GraphCompileReason(
2975 "return_value", [self.frame_summary()], graph_break=False
2976 ),
2977 )
2978 return_inst = (
2979 create_instruction("RETURN_VALUE")
2980 if inst.opname == "RETURN_VALUE"
2981 else create_instruction("RETURN_CONST", argval=inst.argval)
2982 )

File /opt/conda/lib/python3.11/site-packages/torch/_dynamo/output_graph.py:1117, in OutputGraph.compile_subgraph(self, tx, partial_convert, reason)
1115 # optimization to generate better code in a common case
1116 self.add_output_instructions(
-> 1117 self.compile_and_call_fx_graph(tx, list(reversed(stack_values)), root)
1118 + [create_instruction("UNPACK_SEQUENCE", arg=len(stack_values))]
1119 )
1120 # restore all the live local vars

File /opt/conda/lib/python3.11/site-packages/torch/_dynamo/output_graph.py:1369, in OutputGraph.compile_and_call_fx_graph(self, tx, rv, root)
1368 with self.restore_global_state():
-> 1369 compiled_fn = self.call_user_compiler(gm)
1371 from torch.fx._lazy_graph_module import _LazyGraphModule

File /opt/conda/lib/python3.11/site-packages/torch/_dynamo/output_graph.py:1416, in OutputGraph.call_user_compiler(self, gm)
1413 with dynamo_timed(
1414 "OutputGraph.call_user_compiler", phase_name="backend_compile"
1415 ):
-> 1416 return self._call_user_compiler(gm)

File /opt/conda/lib/python3.11/site-packages/torch/_dynamo/output_graph.py:1465, in OutputGraph._call_user_compiler(self, gm)
1464 except Exception as e:
-> 1465 raise BackendCompilerFailed(self.compiler_fn, e) from e
1467 signpost_event(
1468 "dynamo",
1469 "OutputGraph.call_user_compiler",
(...)
1475 },
1476 )

BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Failed to find C compiler. Please specify via CC environment variable.

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True

During handling of the above exception, another exception occurred:

TypeError Traceback (most recent call last)
Cell In[9], line 1
----> 1 llm = LLM(model=MODEL_ID, dtype="float16", device="cuda");
2 tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_ID);

File /opt/conda/lib/python3.11/site-packages/vllm/utils.py:986, in deprecate_args.<locals>.wrapper.<locals>.inner(*args, **kwargs)
979 msg += f" {additional_message}"
981 warnings.warn(
982 DeprecationWarning(msg),
983 stacklevel=3, # The inner function takes up one level
984 )
--> 986 return fn(*args, **kwargs)

File /opt/conda/lib/python3.11/site-packages/vllm/entrypoints/llm.py:230, in LLM.__init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, allowed_local_media_path, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, max_seq_len_to_capture, disable_custom_all_reduce, disable_async_output_proc, hf_overrides, mm_processor_kwargs, task, override_pooler_config, compilation_config, **kwargs)
227 self.engine_class = self.get_engine_class()
229 # TODO(rob): enable mp by default (issue with fork vs spawn)
--> 230 self.llm_engine = self.engine_class.from_engine_args(
231 engine_args, usage_context=UsageContext.LLM_CLASS)
233 self.request_counter = Counter()

File /opt/conda/lib/python3.11/site-packages/vllm/engine/llm_engine.py:517, in LLMEngine.from_engine_args(cls, engine_args, usage_context, stat_loggers)
515 executor_class = cls._get_executor_cls(engine_config)
516 # Create the LLM engine.
--> 517 engine = cls(
518 vllm_config=engine_config,
519 executor_class=executor_class,
520 log_stats=not engine_args.disable_log_stats,
521 usage_context=usage_context,
522 stat_loggers=stat_loggers,
523 )
525 return engine

File /opt/conda/lib/python3.11/site-packages/vllm/engine/llm_engine.py:276, in LLMEngine.__init__(self, vllm_config, executor_class, log_stats, usage_context, stat_loggers, input_registry, mm_registry, use_cached_outputs)
273 self.model_executor = executor_class(vllm_config=vllm_config, )
275 if self.model_config.runner_type != "pooling":
--> 276 self._initialize_kv_caches()
278 # If usage stat is enabled, collect relevant info.
279 if is_usage_stats_enabled():

File /opt/conda/lib/python3.11/site-packages/vllm/engine/llm_engine.py:416, in LLMEngine._initialize_kv_caches(self)
409 """Initialize the KV cache in the worker(s).
410
411 The workers will determine the number of blocks in both the GPU cache
412 and the swap CPU cache.
413 """
414 start = time.time()
415 num_gpu_blocks, num_cpu_blocks = (
--> 416 self.model_executor.determine_num_available_blocks())
418 if self.cache_config.num_gpu_blocks_override is not None:
419 num_gpu_blocks_override = self.cache_config.num_gpu_blocks_override

File /opt/conda/lib/python3.11/site-packages/vllm/executor/gpu_executor.py:68, in GPUExecutor.determine_num_available_blocks(self)
64 def determine_num_available_blocks(self) -> Tuple[int, int]:
65 """Determine the number of available KV blocks by invoking the
66 underlying worker.
67 """
---> 68 return self.driver_worker.determine_num_available_blocks()

File /opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
113 @functools.wraps(func)
114 def decorate_context(*args, **kwargs):
115 with ctx_factory():
--> 116 return func(*args, **kwargs)

File /opt/conda/lib/python3.11/site-packages/vllm/worker/worker.py:202, in Worker.determine_num_available_blocks(self)
196 # Execute a forward pass with dummy inputs to profile the memory usage
197 # of the model.
198 with memory_profiling(baseline_memory_in_bytes=total_gpu_memory -
199 self.init_gpu_memory,
200 weights_memory_in_bytes=self.model_runner.
201 model_memory_usage) as result:
--> 202 self.model_runner.profile_run()
203 torch.cuda.synchronize()
205 self._assert_memory_footprint_increased_during_profiling()

File /opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
113 @functools.wraps(func)
114 def decorate_context(*args, **kwargs):
115 with ctx_factory():
--> 116 return func(*args, **kwargs)

File /opt/conda/lib/python3.11/site-packages/vllm/worker/model_runner.py:1331, in GPUModelRunnerBase.profile_run(self)
1325 if not get_pp_group().is_first_rank:
1326 intermediate_tensors = self.model.make_empty_intermediate_tensors(
1327 batch_size=batch_size,
1328 dtype=self.model_config.dtype,
1329 device=self.device)
-> 1331 self.execute_model(model_input, kv_caches, intermediate_tensors)
1332 torch.cuda.synchronize()
1333 return

File /opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
113 @functools.wraps(func)
114 def decorate_context(*args, **kwargs):
115 with ctx_factory():
--> 116 return func(*args, **kwargs)

File /opt/conda/lib/python3.11/site-packages/vllm/worker/model_runner_base.py:152, in dump_input_when_exception.<locals>._inner.<locals>._wrapper(*args, **kwargs)
146 raise type(err)(f"Error in model execution: "
147 f"{str(err)}") from err
149 logger.info(
150 "Completed writing input of failed execution to %s.",
151 filename)
--> 152 raise type(err)(
153 f"Error in model execution (input dumped to {filename}): "
154 f"{str(err)}") from err

TypeError: BackendCompilerFailed.__init__() missing 1 required positional argument: 'inner_exception'
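
Note: the chained exceptions above show that the actual root cause is the inner RuntimeError: triton cannot JIT-compile its CUDA utility module because the container has no C compiler, and the TypeError about 'inner_exception' only appears while vLLM re-raises that failure. A minimal workaround sketch, assuming a Debian/Ubuntu base image where gcc can be installed and that /usr/bin/gcc is where it ends up (check with `which gcc`):

    # Assumption: a C compiler has been installed in the container first, e.g.
    #   apt-get update && apt-get install -y build-essential
    import os

    # Triton's runtime build step reads CC to find a C compiler; set it before
    # anything triggers torch.compile / inductor code generation.
    os.environ["CC"] = "/usr/bin/gcc"

    from vllm import LLM

    llm = LLM(model="google/gemma-2-2b-it", dtype="float16")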

Google org

Hi @pofce ,

I successfully executed the code on an NVIDIA Tesla A100 (40 GB) with 1 GPU. Could you please refer to this gist file and ensure that the library versions specified below are used?

    transformers = 4.47.1
    torch = 2.5.1+cu121

If the issue still persists, please let us know.

Thank you.

Installing the required libraries didn't resolve my issue.

[screenshot of the error attached: image.png]

Google org

Hi @pofce ,

To resolve the above error, could you please install triton using the command below:

    pip install triton

and also include the code below:

    TORCH_LOGS="+dynamo"
    TORCHDYNAMO_VERBOSE=1
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
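
As written, the first two lines are shell environment variables and the last two are Python. A small sketch of applying all four from a notebook cell, assuming it runs before anything triggers compilation:

    import os

    # Extra dynamo logging; must be set before dynamo/inductor run.
    os.environ["TORCH_LOGS"] = "+dynamo"
    os.environ["TORCHDYNAMO_VERBOSE"] = "1"

    import torch._dynamo

    # Fall back to eager execution instead of raising when the inductor backend fails.
    torch._dynamo.config.suppress_errors = True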

Even after running the code, you will get another error:

   ValueError: Model architectures ['Gemma2ForCausalLM'] failed to be inspected. Please check the logs for more details.

This means vLLM could not load the Gemma2ForCausalLM architecture in this environment. To resolve this issue, you can use the Hugging Face transformers library, which supports the Gemma2ForCausalLM architecture. Please modify your code as follows:

[screenshot of the modified code attached: image.png]
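
The screenshot with the modified code is not reproduced here; a minimal transformers-based sketch of the suggested change (the prompt, dtype, and device choices below are assumptions) could look like:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "google/gemma-2-2b-it"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.float16
    ).to("cuda")

    # Quick generation to confirm the model loads and runs.
    inputs = tokenizer("Write a haiku about GPUs.", return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))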

Thank you.
