Issue when using this build for unsloth

#4
by OygenValue - opened

Hi! I am trying to run the following notebook to fine-tune Llama 3.1 8B with the Unsloth API:
https://colab.research.google.com/drive/1Ys44kVvmeZtnICzWz0xgpRnrIOjZAuxp?usp=sharing

I am running on Windows 11 with Python 3.10.
The Triton install seemed to work, in the sense that I can import the library, and I don't get any problems up until:

#TRAIN
trainer_stats = trainer.train()

Then I get the following gigantic error:
C:/msys64/ucrt64/bin/../lib/gcc/x86_64-w64-mingw32/13.2.0/../../../../x86_64-w64-mingw32/bin/ld.exe: C:\Users\david\AppData\Local\Temp\cclK14Ke.o:main.c:(.text+0x69): undefined reference to `__imp__PyArg_ParseTuple_SizeT'
C:/msys64/ucrt64/bin/../lib/gcc/x86_64-w64-mingw32/13.2.0/../../../../x86_64-w64-mingw32/bin/ld.exe: C:\Users\david\AppData\Local\Temp\cclK14Ke.o:main.c:(.text+0x105): undefined reference to `__imp_PyGILState_Ensure'
C:/msys64/ucrt64/bin/../lib/gcc/x86_64-w64-mingw32/13.2.0/../../../../x86_64-w64-mingw32/bin/ld.exe: C:\Users\david\AppData\Local\Temp\cclK14Ke.o:main.c:(.text+0x12f3): undefined reference to `__imp_PyExc_RuntimeError'
.... .... .... .....
C:/msys64/ucrt64/bin/../lib/gcc/x86_64-w64-mingw32/13.2.0/../../../../x86_64-w64-mingw32/bin/ld.exe: C:\Users\david\AppData\Local\Temp\cclK14Ke.o:main.c:(.text+0x1303): undefined reference to `__imp_PyErr_SetString'
collect2.exe: error: ld returned 1 exit status
Traceback (most recent call last):
File "C:\Users\david\Projects\local_llms\model_scripts\unsloth_test.py", line 118, in
trainer_stats = trainer.train()
File "", line 145, in train
File "", line 363, in _fast_inner_training_loop
File "C:\ProgramData\anaconda3\envs\llm_local\lib\site-packages\transformers\trainer.py", line 3318, in training_step
loss = self.compute_loss(model, inputs)
File "C:\ProgramData\anaconda3\envs\llm_local\lib\site-packages\transformers\trainer.py", line 3363, in compute_loss
outputs = model(**inputs)
File "C:\Users\david\AppData\Roaming\Python\Python310\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\david\AppData\Roaming\Python\Python310\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
.....
.....
File "C:\ProgramData\anaconda3\envs\llm_local\lib\site-packages\accelerate\hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "C:\ProgramData\anaconda3\envs\llm_local\lib\site-packages\unsloth\models\llama.py", line 771, in LlamaModel_fast_forward
hidden_states = Unsloth_Offloaded_Gradient_Checkpointer.apply(
File "C:\Users\david\AppData\Roaming\Python\Python310\site-packages\torch\autograd\function.py", line 574, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "C:\Users\david\AppData\Roaming\Python\Python310\site-packages\torch\amp\autocast_mode.py", line 455, in decorate_fwd
return fwd(*args, **kwargs)
File "C:\ProgramData\anaconda3\envs\llm_local\lib\site-packages\unsloth\models_utils.py", line 782, in forward
output = forward_function(hidden_states, *args)
File "C:\Users\david\AppData\Roaming\Python\Python310\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\david\AppData\Roaming\Python\Python310\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "C:\ProgramData\anaconda3\envs\llm_local\lib\site-packages\accelerate\hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "C:\ProgramData\anaconda3\envs\llm_local\lib\site-packages\unsloth\models\llama.py", line 485, in LlamaDecoderLayer_fast_forward
hidden_states = fast_rms_layernorm(self.input_layernorm, hidden_states)
File "C:\ProgramData\anaconda3\envs\llm_local\lib\site-packages\unsloth\kernels\rms_layernorm.py", line 192, in fast_rms_layernorm
out = Fast_RMS_Layernorm.apply(X, W, eps, gemma)
File "C:\Users\david\AppData\Roaming\Python\Python310\site-packages\torch\autograd\function.py", line 574, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "C:\ProgramData\anaconda3\envs\llm_local\lib\site-packages\unsloth\kernels\rms_layernorm.py", line 144, in forward
fx[(n_rows,)](
File "C:\ProgramData\anaconda3\envs\llm_local\lib\site-packages\triton\runtime\jit.py", line 166, in
return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
File "C:\ProgramData\anaconda3\envs\llm_local\lib\site-packages\triton\runtime\jit.py", line 348, in run
device = driver.get_current_device()
File "C:\ProgramData\anaconda3\envs\llm_local\lib\site-packages\triton\runtime\driver.py", line 230, in getattr
self._initialize_obj()
File "C:\ProgramData\anaconda3\envs\llm_local\lib\site-packages\triton\runtime\driver.py", line 227, in _initialize_obj
self._obj = self._init_fn()
File "C:\ProgramData\anaconda3\envs\llm_local\lib\site-packages\triton\runtime\driver.py", line 260, in initialize_driver
return CudaDriver()
File "C:\ProgramData\anaconda3\envs\llm_local\lib\site-packages\triton\runtime\driver.py", line 122, in init
self.utils = CudaUtils()
File "C:\ProgramData\anaconda3\envs\llm_local\lib\site-packages\triton\runtime\driver.py", line 69, in init
so = _build("cuda_utils", src_path, tmpdir)
File "C:\ProgramData\anaconda3\envs\llm_local\lib\site-packages\triton\common\build.py", line 124, in _build
ret = subprocess.check_call(cc_cmd)
File "C:\ProgramData\anaconda3\envs\llm_local\lib\subprocess.py", line 369, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['C:\msys64\ucrt64\bin\g++.EXE', 'C:\Users\david\AppData\Local\Temp\tmpbcorowqg\main.c', '-O3', '-shared', '-IC:\ProgramData\anaconda3\envs\llm_local\lib\site-packages\triton\common\..\third_party\cuda\include', '-IC:\ProgramData\anaconda3\envs\llm_local\Include', '-IC:\Users\david\AppData\Local\Temp\tmpbcorowqg', '-LC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\lib\x64', '-LC:\ProgramData\anaconda3\envs\llm_local\libs', '-lcuda', '-o', 'C:\Users\david\AppData\Local\Temp\tmpbcorowqg\cuda_utils.cp310-win_amd64.pyd']' returned non-zero exit status 1.
0%| | 0/60 [00:01<?, ?it/s]

This seems to be a problem with running the build with g++. I tried setting an environment variable so that cc in Triton's build.py would become g++ instead of gcc, but that didn't fix the problem.
Any ideas of what I could be missing? Thanks and great work! :)
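
For anyone debugging the same setup, it can help to confirm which compiler the build step would actually pick up. The sketch below is not Triton's code. It assumes a lookup order of the CC environment variable first, then gcc and clang on PATH, and resolve_compiler is a hypothetical helper name:

```python
import os
import shutil


def resolve_compiler(env=None, which=shutil.which):
    """Return (source, path) for the compiler a CC-then-PATH lookup would pick.

    Assumption: the build honours CC first and otherwise searches PATH for
    gcc, then clang. This mirrors the behaviour described in the post; it is
    not copied from Triton.
    """
    env = os.environ if env is None else env
    cc = env.get("CC")
    if cc:
        return ("CC", cc)
    for name in ("gcc", "clang"):
        path = which(name)
        if path:
            return (name, path)
    return (None, None)


if __name__ == "__main__":
    source, path = resolve_compiler()
    print(f"compiler source: {source}, path: {path}")
```

If this reports the MSYS2 g++ from the traceback, the CC override took effect, and the link failure is more likely caused by something other than the choice between gcc and g++.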

Pretty sure you need to use CUDA 11 for this Triton build.
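
To check this suggestion against the log, note that the failing command links against a CUDA v12.5 directory. A minimal sketch for pulling the toolkit version out of such a command line follows. cuda_versions_in_command is a hypothetical helper, and it assumes the default Windows install layout, where the path contains CUDA\vMAJOR.MINOR:

```python
import re


def cuda_versions_in_command(args):
    """Return (major, minor) tuples for CUDA toolkit paths found in args.

    Assumption: paths follow the default Windows layout seen in the
    traceback, e.g. ...\\CUDA\\v12.5\\lib\\x64.
    """
    found = []
    for arg in args:
        match = re.search(r"CUDA[\\/]v(\d+)\.(\d+)", arg)
        if match:
            found.append((int(match.group(1)), int(match.group(2))))
    return found


if __name__ == "__main__":
    # One argument taken from the failing command in the post above.
    cmd = [r"-LC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\lib\x64", "-lcuda"]
    print(cuda_versions_in_command(cmd))
```

This only reports which toolkit the linker was pointed at. It does not confirm that the CUDA version is the cause of the undefined references.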

madbuda changed discussion status to closed
