Why I Ran This
I wanted to distill GPT-5.4's code-repair behavior into a local model rather than stop at an online-only workflow.
The hard part was not getting data and training scripts to run. The hard part was understanding where improvement actually came from and whether the result could be trusted.
Repeat-3 evaluation (running the full holdout evaluation three times on the same checkpoint and comparing the scores) showed the holdout result itself was stable, which pushed the real uncertainty back toward teacher sampling and dataset formation.
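A minimal sketch of that stability check, assuming the three repeated runs each produce a single holdout pass rate (the function name, threshold, and the example scores below are illustrative, not the actual numbers from this experiment):

```python
from statistics import mean, stdev

def repeat_eval_summary(scores, stability_threshold=0.02):
    """Summarize repeated holdout runs.

    Returns the mean score, sample standard deviation, the max-min
    spread across repeats, and a stability flag: stable when the
    run-to-run spread stays below the chosen threshold.
    """
    spread = max(scores) - min(scores)
    return {
        "mean": mean(scores),
        "stdev": stdev(scores) if len(scores) > 1 else 0.0,
        "spread": spread,
        "stable": spread <= stability_threshold,
    }

# Hypothetical pass rates from three repeats of the same checkpoint.
summary = repeat_eval_summary([0.612, 0.618, 0.615])
```

If the spread across repeats is small relative to the differences you see between checkpoints or datasets, the evaluation itself is not the noise source, and attention can shift upstream to how the teacher samples were drawn and filtered.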