vkerkez committed on
Commit e7dfeda · verified · 1 Parent(s): 38799c8

Update README.md

Files changed (1)
  1. README.md +5 -11
README.md CHANGED
@@ -1,9 +1,3 @@
- ---
- license: mit
- tags:
- - unsloth
- ---
-
  # GitVac
  Don't forget to vacuum your git repo.
 
@@ -16,7 +10,7 @@ GitVac is like a vacuum cleaner for code fixes. It's a series of 3B, 8B, 14B, an
 
 
  # How were the models made?
- I distilled samples from r1 and o3 through multiple rounds of trial and error. About 2.4k questions fired off, with 1.1k making the verification cut. My rough estimate puts o3 at about 30% success rate, with deepseek cruising around 15%.
+ I distilled samples from r1 through multiple rounds of trial and error. About 2.4k questions fired off, with 1.1k making the verification cut. My rough estimate puts its success rate at around 45%.
 
  # How is verification done?
  A lot of models are already trained on function calling syntax.
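
The verification explanation is cut off by the hunk above. As a rough sketch only — the schema and helper below are hypothetical, not GitVac's actual code — a check of this kind can parse the model's function calls and confirm they cover every file touched by the ground-truth patch:

```python
import json

def verify_tool_calls(model_output: str, patch_files: set) -> bool:
    """Hypothetical check: every file the ground-truth patch touches
    must appear in at least one of the model's tool calls."""
    try:
        calls = json.loads(model_output)  # assume a JSON list of {"name", "arguments"}
    except json.JSONDecodeError:
        return False
    if not isinstance(calls, list):
        return False
    touched = {call.get("arguments", {}).get("path") for call in calls if isinstance(call, dict)}
    return patch_files.issubset(touched)
```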
@@ -342,13 +336,14 @@ These models create pre-made actions that are higher quality than the turbo mode
  # Benchmarks
  I started with 2,400 patches/issues.
 
- - Only 1,100 problems could be solved by o3/r1
+ - Only 1,100 problems could be solved by r1
  - Each problem was attempted up to 3 times. earlier scripts were doing up to 10.
  - The remaining 1,300 problems were ones that these top models failed to solve even after 9,000 total attempts
 
  To evaluate GitVac models, I randomly selected a sample size from the unfinished dataset.
 
  ## Performance Results
+ Tested against the remaining 1,300 problems that r1 could not pass. These examples were never seen during the models' training.
 
  | Model | Success Rate | Notes |
  |-------|--------------|-------|
@@ -363,7 +358,7 @@ Start by gathering your patches and extracting all the necessary components - st
 
  Combine the tool calls into a list and shuffle them randomly. This randomization turned out to be a crucial factor in improving dataset quality.
 
- Initially, I presented the function calls in a fixed order (writes, reads, deletes). The models would blindly follow this pattern - making changes before reading files, which makes no logical sense. Simply instructing them to do otherwise in the prompt had no effect. Both r1 and o3 exhibited this same behavior.
+ Initially, I presented the function calls in a fixed order (writes, reads, deletes). The models would blindly follow this pattern - making changes before reading files, which makes no logical sense. Simply instructing them to do otherwise in the prompt had no effect.
 
  A breakthrough came when I randomized the function calls. This seemed to break the models out of their rigid patterns and activate more natural problem-solving behaviors. They started properly reading files before modifying them and demonstrated more realistic roleplay capabilities.
 
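A minimal sketch of the shuffling step described in the hunk above — the tool-call structure and file paths here are made up for illustration, not GitVac's actual format:

```python
import random

# Hypothetical tool calls extracted from one patch: writes, reads, and
# deletes derived from the files the ground-truth fix touches.
tool_calls = [
    {"name": "write_file", "arguments": {"path": "src/app.py", "content": "..."}},
    {"name": "read_file", "arguments": {"path": "src/app.py"}},
    {"name": "delete_file", "arguments": {"path": "src/old_util.py"}},
]

# Shuffle so training samples don't always present the fixed
# writes -> reads -> deletes order the teacher models latched onto.
random.shuffle(tool_calls)
```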
@@ -529,7 +524,7 @@ Think before you respond.
  <br>
 
  # Cost & Details
- The total cost for this project was approximately $400, with the majority spent on OpenAI.
+ The total cost for this project was approximately $400, with the majority spent on inference.
  I used an automated script to handle the full training pipeline - from finetuning through evaluation across all model sizes up to 32B parameters. The hardware setup included an A100 80GB GPU and a rented H200 140GB+ GPU from RunPod.
  Training times varied from 1.5 hours for smaller models up to 8 hours for the largest ones. All reasoning models went through 3 epochs of training.
 
@@ -543,4 +538,3 @@ This does a few things:
 
  With this dataset, we can fine-tune to get a base model. This model can then be further improved through RLHF (Reinforcement Learning from Human Feedback) and GRPO (Group Relative Policy Optimization) training, where it will continuously learn from new datasets generated by the pipeline. This creates a virtuous cycle of improvement, with each iteration building on the knowledge gained from previous runs.
  I should probably write up a whole separate post on this extended pipeline someday. For now enjoy this repo!
-