
Local Distillation Experiment

Ongoing Experiment

Distilling GPT-5.4's Code Repair Behavior into a Local Qwen3.5 35B

The real question was not whether distillation could run, but how to make a local model improve on real code-repair tasks in a way that stayed interpretable and repeatable.

This experiment used GPT-5.4 as the teacher and covered the full loop from sampling and response cleaning to verified best-of-k selection, MLX LoRA training, and repeat-3 holdout evaluation. At this stage, I care less about any single score and more about which effects look stable versus which ones are really coming from upstream data formation.

Strict 800
14/24

better than non-strict at 12/24

Non-strict 1200
14/24

better than strict at 13/24

Repeat-3
stdev = 0

the evaluation stage itself stayed stable

Why I Ran This

I wanted to distill GPT-5.4's code-repair behavior into a local model rather than stop at an online-only workflow.

The hard part was not getting data and training scripts to run. The hard part was understanding where improvement actually came from and whether the result could be trusted.

Repeat-3 evaluation showed the holdout result itself was stable, which pushed the real uncertainty back toward teacher sampling and dataset formation.

How I Structured the Pipeline

01

Teacher sampling

Sampled multiple GPT-5.4 teacher candidates per training task and kept the raw responses so formatting noise and content noise stayed visible.

02

Cleaning + verification

Cleaned GPT-5.4 teacher outputs into executable file updates, then verified them against task checks so only real passing fixes survived.

03

Verified best-of-k

Selected only verified passing candidates for each task instead of feeding plan text, JSON, or invalid repairs into the student.

04

MLX LoRA + holdout repeat-3

Trained adapters locally, then repeated the same holdout evaluation three times to check whether the result was actually stable.
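The first three stages can be sketched as one loop per task. Everything here is a hypothetical stand-in, not the actual scripts: `sample_teacher` would call the GPT-5.4 API, `clean` would extract an executable file update, and `verify` is a placeholder where the real version executes the task's checks.

```python
def sample_teacher(task, k):
    """Sample k raw GPT-5.4 candidate responses for one task."""
    # Placeholder: the real call hits the teacher API and keeps raw text.
    return [f"raw response {i} for {task}" for i in range(k)]

def clean(raw):
    """Normalize a raw response into an executable file update, or None."""
    text = raw.strip()
    return text or None

def verify(fix, task):
    """Run the task's checks against the cleaned fix."""
    return True  # placeholder: the real version executes the task checks

def verified_best_of_k(task, k=4):
    """Keep only cleaned candidates that pass verification, then pick one."""
    survivors = [
        fix for fix in (clean(raw) for raw in sample_teacher(task, k))
        if fix is not None and verify(fix, task)
    ]
    # First survivor here; any ranking rule could slot in instead.
    return survivors[0] if survivors else None
```

The point of the structure is that plan text, JSON, and invalid repairs are filtered out before the student ever sees them, rather than hoping the student learns to ignore them.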

How the Experiment Evolved

Start with best-of-k

The first goal was to improve sample quality through verified selection rather than simply increase teacher volume.

Add a strict prompt profile

I tightened both the system prompt and the task-local output contract to suppress plans, JSON, wrapper prose, and shell commands.

Finish with repeat-3

I did not want to trust a single report, so I ran the exact same holdout set three times at each budget to see where the variance really lived.
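Repeat-3 itself is trivial to sketch. `run_holdout` below is a hypothetical callable standing in for the real evaluation harness; the only substance is that the same holdout set is scored multiple times and the spread is reported.

```python
import statistics

def repeat_eval(run_holdout, n_repeats=3):
    """Run the same holdout evaluation n_repeats times and report spread."""
    scores = [run_holdout() for _ in range(n_repeats)]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        # Population stdev: 0.0 means the eval stage itself is deterministic.
        "stdev": statistics.pstdev(scores),
    }

# Stand-in for the real harness: a run that always solves 14 of 24 tasks.
report = repeat_eval(lambda: 14)
assert report["stdev"] == 0.0  # matches the stdev = 0 observed in this round
```

A stdev of 0 across repeats does not prove the pipeline is trustworthy; it only rules out the evaluation stage as the source of variance, which is exactly how it gets used below.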

What Strict vs Non-strict Means

Non-strict means a looser output contract for the GPT-5.4 teacher. It gives the model more room to elaborate, which also makes it more likely to emit plans, JSON, explanatory prose, or other wrapper text.

Strict means a tighter output contract. I explicitly push the GPT-5.4 teacher to produce content that is closer to the final repair so cleaning and verification fail less often.

The comparison is not between two models. It is the same GPT-5.4 teacher under two output contracts, distilled into the same local Qwen3.5 35B.
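As a rough illustration of that setup: both contract strings below are assumptions (the write-up does not publish the exact prompts), but the shape is the point. The task prompt is identical; only the system-level output contract changes.

```python
# Both contract strings are illustrative; the real wording is not published.
PROFILES = {
    "non_strict": "Fix the failing code. You may explain your reasoning.",
    "strict": (
        "Output only the repaired file contents. "
        "No plans, no JSON, no wrapper prose, no shell commands."
    ),
}

def build_messages(profile, task_prompt):
    """Same teacher, same task; only the output contract differs."""
    return [
        {"role": "system", "content": PROFILES[profile]},
        {"role": "user", "content": task_prompt},
    ]
```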

Latest Results

Data quality

Non-strict cleaned
110/120

10 clean failures, 1 verify failure

Strict cleaned
120/120

0 clean failures, 0 verify failures

Selected tasks
30/30

both runs covered the full train set

Repeat-3 holdout

Non-strict 800
12/24

base 11/24

Non-strict 1200
14/24

base 13/24

Strict 800
14/24

base 11/24

Strict 1200
13/24

base 13/24

What the Numbers Mean

800-token budget

Non-strict
12/24
Strict
14/24

Strict prompting clearly helped at the lower budget, which suggests a cleaner teacher contract improved reliability when output room was tight.

1200-token budget

Non-strict
14/24
Strict
13/24

The gain was not universal. Once the budget opened up, the strict profile lost some flexibility on a subset of tasks.
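One way to read the four runs together is as gains over their matched base scores, using only the repeat-3 numbers reported above:

```python
# Repeat-3 holdout numbers from the runs above (adapter vs matched base).
runs = {
    ("non_strict", 800):  {"adapter": 12, "base": 11},
    ("non_strict", 1200): {"adapter": 14, "base": 13},
    ("strict", 800):      {"adapter": 14, "base": 11},
    ("strict", 1200):     {"adapter": 13, "base": 13},
}

# Gain over base isolates the adapter's contribution at each budget.
deltas = {key: r["adapter"] - r["base"] for key, r in runs.items()}
# strict@800 is the only large gain (+3); strict@1200 gains nothing,
# which is the regression worth chasing.
```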

What I'm Seeing

After running repeat-3, I'm fairly confident the holdout evaluation itself is stable. At least in this round, it does not look like the main source of variation.

The bigger uncertainty is still upstream: changes in teacher sampling affect cleaning quality, which affects what gets selected, which then shows up in student performance.

So if I keep pushing this experiment, I would not keep tightening the prompt. I would test a medium-constraint variant and focus on the regression at the 1200-token budget.

How My Read Changed During the Experiment

At the beginning, I leaned toward the idea that tighter prompting would simply make the distillation cleaner and more reliable.

After the repeat-3 runs and task-level comparisons, that read changed. Strict prompting clearly helped at lower budget, but once the budget opened up, the real question became whether I was over-constraining the model and suppressing useful flexibility.

So the next step now looks different to me: spend less effort tightening the prompt even further, and more effort on teacher sampling plus a medium-constraint variant.