vkerkez committed on
Commit e7dfeda · verified · 1 Parent(s): 38799c8

Update README.md

Files changed (1)
  1. README.md +5 -11
README.md CHANGED
@@ -1,9 +1,3 @@
- ---
- license: mit
- tags:
- - unsloth
- ---
-
  # GitVac
  Don't forget to vacuum your git repo.
 
@@ -16,7 +10,7 @@ GitVac is like a vacuum cleaner for code fixes. It's a series of 3B, 8B, 14B, an
 
 
  # How were the models made?
- I distilled samples from r1 and o3 through multiple rounds of trial and error. About 2.4k questions fired off, with 1.1k making the verification cut. My rough estimate puts o3 at about 30% success rate, with deepseek cruising around 15%.
+ I distilled samples from r1 through multiple rounds of trial and error. About 2.4k questions fired off, with 1.1k making the verification cut. My rough estimate puts its success rate at around 45%.
 
  # How is verification done?
  A lot of models are already trained on function calling syntax.
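
The verification explanation is cut off by the hunk above. As a rough sketch only — the schema and helper below are hypothetical, not GitVac's actual code — a check of this kind can parse the model's function calls and confirm they cover every file touched by the ground-truth patch:

```python
import json

def verify_tool_calls(model_output: str, patch_files: set) -> bool:
    """Hypothetical check: every file the ground-truth patch touches
    must appear in at least one of the model's tool calls."""
    try:
        calls = json.loads(model_output)  # assume a JSON list of {"name", "arguments"}
    except json.JSONDecodeError:
        return False
    if not isinstance(calls, list):
        return False
    touched = {call.get("arguments", {}).get("path") for call in calls if isinstance(call, dict)}
    return patch_files.issubset(touched)
```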
@@ -342,13 +336,14 @@ These models create pre-made actions that are higher quality than the turbo mode
  # Benchmarks
  I started with 2,400 patches/issues.
 
- - Only 1,100 problems could be solved by o3/r1
+ - Only 1,100 problems could be solved by r1
  - Each problem was attempted up to 3 times. earlier scripts were doing up to 10.
  - The remaining 1,300 problems were ones that these top models failed to solve even after 9,000 total attempts
 
  To evaluate GitVac models, I randomly selected a sample size from the unfinished dataset.
 
  ## Performance Results
+ Tested against the remaining 1,300 problems that r1 could not pass. These examples were never seen during the models' training.
 
  | Model | Success Rate | Notes |
  |-------|--------------|-------|
@@ -363,7 +358,7 @@ Start by gathering your patches and extracting all the necessary components - st
 
  Combine the tool calls into a list and shuffle them randomly. This randomization turned out to be a crucial factor in improving dataset quality.
 
- Initially, I presented the function calls in a fixed order (writes, reads, deletes). The models would blindly follow this pattern - making changes before reading files, which makes no logical sense. Simply instructing them to do otherwise in the prompt had no effect. Both r1 and o3 exhibited this same behavior.
+ Initially, I presented the function calls in a fixed order (writes, reads, deletes). The models would blindly follow this pattern - making changes before reading files, which makes no logical sense. Simply instructing them to do otherwise in the prompt had no effect.
 
  A breakthrough came when I randomized the function calls. This seemed to break the models out of their rigid patterns and activate more natural problem-solving behaviors. They started properly reading files before modifying them and demonstrated more realistic roleplay capabilities.
 
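A minimal sketch of the shuffling step described in the hunk above — the tool-call structure and file paths here are made up for illustration, not GitVac's actual format:

```python
import random

# Hypothetical tool calls extracted from one patch: writes, reads, and
# deletes derived from the files the ground-truth fix touches.
tool_calls = [
    {"name": "write_file", "arguments": {"path": "src/app.py", "content": "..."}},
    {"name": "read_file", "arguments": {"path": "src/app.py"}},
    {"name": "delete_file", "arguments": {"path": "src/old_util.py"}},
]

# Shuffle so training samples don't always present the fixed
# writes -> reads -> deletes order the teacher models latched onto.
random.shuffle(tool_calls)
```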
@@ -529,7 +524,7 @@ Think before you respond.
  <br>
 
  # Cost & Details
- The total cost for this project was approximately $400, with the majority spent on OpenAI.
+ The total cost for this project was approximately $400, with the majority spent on inference.
  I used an automated script to handle the full training pipeline - from finetuning through evaluation across all model sizes up to 32B parameters. The hardware setup included an A100 80GB GPU and a rented H200 140GB+ GPU from RunPod.
  Training times varied from 1.5 hours for smaller models up to 8 hours for the largest ones. All reasoning models went through 3 epochs of training.
 
@@ -543,4 +538,3 @@ This does a few things:
 
  With this dataset, we can fine-tune to get a base model. This model can then be further improved through RLHF (Reinforcement Learning from Human Feedback) and GRPO (Group Relative Policy Optimization) training, where it will continuously learn from new datasets generated by the pipeline. This creates a virtuous cycle of improvement, with each iteration building on the knowledge gained from previous runs.
  I should probably write up a whole separate post on this extended pipeline someday. For now enjoy this repo!
-