Introduction

In our previous blog post, “The LLM Hangover: Why SLMs Are Making a Comeback,” the central thesis was simple: the future of enterprise AI is not just about building with larger models, but about compressing the right behavior into smaller, deployable ones. That naturally leads to the next question.

If Small Language Models (SLMs) are increasingly being used to power production agents, how exactly should they be trained?

The default instinct has been straightforward. Take a large LLM teacher model, collect its chain-of-thought traces, and fine-tune the smaller student model to imitate the teacher’s reasoning. This works, but only up to a point, because distilling from teacher outputs alone has a hidden problem.

A student model does not usually fail because it has never seen a correct answer. It fails because, during generation, it still assigns too much probability to its own wrong reasoning paths. Standard supervised fine-tuning teaches what the right trace looks like, but it does not explicitly teach the model what kinds of reasoning paths it should move away from.

This is precisely what our work, “ORPO-Distill: Mixed-Policy Preference Optimization for Cross-Architecture LLM Distillation,” published at NeurIPS Efficient Reasoning 2025, aims to address.

Instead of viewing black-box distillation as pure imitation, our work reformulates it as a preference optimization problem. The student is not only shown what the teacher did correctly, but is also trained to prefer those teacher traces over its own incorrect generations.

Our proposed distillation pipeline, ORPO-Distill, has the following key differentiators:

- Architecture-agnostic: Does not require access to teacher logits or internal states, enabling black-box, cross-architecture distillation.
- Addresses the autoregressive training–inference mismatch: Mitigates the bias caused by training on fixed teacher outputs while inference relies on the student’s own generations.
- Consistently stronger performance: Across five datasets and two student model architectures, it shows consistent gains over distilling or fine-tuning directly on teacher reasoning traces.

Figure 1: Comparison of ORPO-Distill against prior distillation pipelines

Key Contributions and Approach: ORPO-Distill

ORPO-Distill reframes black-box reasoning distillation as a preference learning problem rather than pure imitation.

For each prompt, the training example is a triplet ⟨Prompt, Chosen, Rejected⟩ where:

- Chosen output is a teacher-generated correct reasoning trace.
- Rejected output is a student-generated incorrect reasoning trace.

Instead of only learning to imitate the teacher, the student is explicitly trained to assign higher preference to correct teacher reasoning than to its own wrong generations.
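A minimal sketch of how such triplets could be assembled is shown below. The `Trace` and `Triplet` classes, the `build_triplets` helper, and the choice to pair every correct teacher trace with every incorrect student trace are illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trace:
    text: str    # full reasoning trace
    answer: str  # final answer extracted from the trace


@dataclass
class Triplet:
    prompt: str
    chosen: str    # teacher trace that reaches the correct answer
    rejected: str  # student trace that reaches an incorrect answer


def build_triplets(prompt: str, gold: str,
                   teacher_traces: List[Trace],
                   student_traces: List[Trace]) -> List[Triplet]:
    """Build <Prompt, Chosen, Rejected> triplets for one prompt."""
    chosen = [t.text for t in teacher_traces if t.answer == gold]
    rejected = [s.text for s in student_traces if s.answer != gold]
    # Assumption: pair every correct teacher trace with every wrong student trace.
    return [Triplet(prompt, c, r) for c in chosen for r in rejected]
```

A prompt for which the teacher has no correct trace, or the student has no incorrect one, simply yields no triplets under this sketch.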

ORPO loss penalizes the student’s negative generations

The core of the method is the use of the Odds Ratio Preference Optimization (ORPO) objective in place of a plain cross-entropy objective. ORPO combines standard supervised learning on the teacher’s correct trace with a contrastive preference term that penalizes the student when it assigns relatively high probability to its own incorrect output.
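For reference, the standard ORPO objective from the original ORPO formulation can be written as follows, using this post's symbols for the positive and negative traces. The weight λ and the length-normalized likelihood are part of the standard formulation; the specific value of λ used in ORPO-Distill is not stated in this post.

```latex
\begin{align}
\mathcal{L}_{\mathrm{ORPO}} &= \mathcal{L}_{\mathrm{SFT}}(x, y_P) + \lambda \, \mathcal{L}_{\mathrm{OR}}(x, y_P, y_N) \\
\mathcal{L}_{\mathrm{OR}} &= -\log \sigma\!\left( \log \frac{\mathrm{odds}_\theta(y_P \mid x)}{\mathrm{odds}_\theta(y_N \mid x)} \right) \\
\mathrm{odds}_\theta(y \mid x) &= \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}, \qquad
\log P_\theta(y \mid x) = \frac{1}{|y|} \sum_{t=1}^{|y|} \log P_\theta(y_t \mid x, y_{<t})
\end{align}
```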

Here, yP is the teacher’s correct reasoning trace and yN is the student’s incorrect generation. This makes the learning signal explicitly contrastive. Standard cross-entropy only says “increase probability on the correct output,” whereas ORPO says “increase probability on the correct output relative to the incorrect student output,” which directly suppresses the student’s own failure modes.
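To make the contrast concrete, here is a minimal sketch of the per-example ORPO loss in plain Python. It assumes the inputs are average per-token log-probabilities of each trace under the student, and the default weight `lam=0.1` is an illustrative assumption rather than the value used in the paper.

```python
import math


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) where p = exp(avg_logp) and avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float,
              lam: float = 0.1) -> float:
    # Supervised term: negative log-likelihood of the teacher trace (yP).
    sft = -avg_logp_chosen
    # Contrastive term: -log(sigmoid(log odds ratio)) = softplus(-log odds ratio).
    log_odds_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    return sft + lam * softplus(-log_odds_ratio)
```

With the chosen trace held fixed, the loss grows as the student assigns more probability to its own incorrect trace, which is exactly the signal that plain cross-entropy lacks.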

Diverse sampling and rejection sampling improve the distillation set

ORPO-Distill does not rely on a single teacher chain-of-thought per prompt. Instead, both teacher and student generate multiple reasoning traces using temperature sampling. This increases coverage over different valid reasoning paths and avoids overfitting the student to one narrow phrasing or trajectory.

To keep only informative samples, the pipeline applies rejection sampling based on ROUGE-L overlap. Highly similar traces are discarded, reducing redundancy and preserving diversity in the training set. The key idea is simple: in reasoning distillation, more data is not always better. A smaller set of high-quality, diverse, non-duplicate traces often provides a stronger learning signal than a much larger pool of repetitive generations.
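The filtering step can be sketched as below. This is an illustration under stated assumptions: whitespace tokenization, the F1 variant of ROUGE-L, a greedy keep-or-drop pass, and the `max_overlap=0.7` threshold are all choices made for this example and are not specified in the post.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 between two strings, using whitespace tokens."""
    c, r = candidate.split(), reference.split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def filter_diverse(traces: List[str], max_overlap: float = 0.7) -> List[str]:
    """Greedily keep a trace only if it is not too similar to any kept trace."""
    kept: List[str] = []
    for trace in traces:
        if all(rouge_l_f1(trace, k) < max_overlap for k in kept):
            kept.append(trace)
    return kept
```

Because the pass is greedy, the result depends on the order of `traces`; a production pipeline might sort or shuffle candidates first.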

Off-policy, on-policy, and mixed-policy updates

Since the negative samples come from the student itself, the sampled dataset can be updated during training. This matters because the student’s error distribution changes as learning progresses.
- Off-policy updates: negative traces are generated once using the initial student model and kept fixed throughout training.
- On-policy updates: negative traces are regenerated at every epoch using the latest student checkpoint.
- Mixed-policy updates: negative traces are drawn from a mixture of the initial student and the latest student checkpoint.

These updates are needed because fixed negatives may become stale as the student improves, while fully on-policy negatives can become too narrow and too close to correct reasoning, reducing contrastive diversity. The paper shows that mixed-policy updates strike the best balance: they preserve the broader failure distribution of the base student while still incorporating harder, more recent mistakes from the current student. This helps avoid narrow learning dynamics and gives stronger performance than both purely off-policy and purely on-policy training.
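The three update strategies can be sketched with a single sampling function. The `Sampler` interface, the function name, and the `initial_fraction` parameter are hypothetical and chosen for illustration; the mixing ratio used in the paper is not stated in this post.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: a sampler maps a prompt to (trace, final_answer).
Sampler = Callable[[str], Tuple[str, str]]


def sample_negatives(prompt: str, gold: str,
                     initial_student: Sampler,
                     latest_student: Sampler,
                     n_samples: int = 8,
                     initial_fraction: float = 0.5,
                     rng: Optional[random.Random] = None) -> List[str]:
    """Draw incorrect student traces from a mix of two checkpoints.

    initial_fraction=1.0 reproduces off-policy sampling (initial student only),
    initial_fraction=0.0 reproduces on-policy sampling (latest student only),
    and values in between give mixed-policy sampling.
    """
    rng = rng or random.Random()
    negatives = []
    for _ in range(n_samples):
        use_initial = rng.random() < initial_fraction
        sampler = initial_student if use_initial else latest_student
        trace, answer = sampler(prompt)
        if answer != gold:  # keep only traces with a wrong final answer
            negatives.append(trace)
    return negatives
```

Calling this once before training corresponds to the off-policy setting, while calling it at every epoch with the newest checkpoint as `latest_student` corresponds to the on-policy or mixed-policy settings.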

In short, ORPO-Distill’s key contribution is not just better data collection, but a more effective learning recipe: diverse trace sampling, explicit contrastive preference optimization, and intelligently refreshed student negatives. All these insights work together to create a practical enterprise-ready distillation pipeline.

Figure 2: Architecture pipeline of ORPO-Distill. Proposed pipelines use contrastive distillation between reasoning traces of Teacher and Student models.

What the Experimental Results Actually Show

Experiment	Accuracy (%)	Training Tokens	Training Time (hr)	Training Cost ($)
Zero-Shot CoT Eval (Student)	35.82	–	–	–
Label Fine Tuning	42.5	20 M	3.5	12.84
CoT Fine Tuning	40.56	160 M	8	29.36
Pre-training + Label Fine Tuning	45.4	4020 M	144	528.48
ORPO-Distill	50.43	160 M	10	36.7
Zero-Shot CoT Eval (Teacher)	50.98	–	–	–

Table 1: Accuracy, training tokens, training time, and cost on the MedQA benchmark, highlighting the performance-cost tradeoff between fine-tuning and the ORPO-Distill strategy.

The results from our work can be divided into three clear takeaways:

ORPO-Distill beats direct fine-tuning, and not just because of better data

Across five benchmark datasets and two different student model families, ORPO-Distill consistently outperforms direct fine-tuning on teacher CoT traces.
This remains true even against diverse CoT fine-tuning, which already controls for the simple “more or better teacher samples” argument by using the same scale of reasoning data.

The conclusion is important: the gains do not come only from data diversity, but from a contrastive, preference-based learning objective that explicitly teaches the student to prefer the teacher’s correct reasoning over its own incorrect generations.

ORPO-Distill delivers a much better cost-to-performance tradeoff

Table 1 shows that ORPO-Distill achieves the strongest student accuracy with only a modest increase in training cost over standard CoT fine-tuning (10 hours and $36.70 versus 8 hours and $29.36).
In contrast, closing the gap through heavier alternatives such as additional domain pre-training is far more expensive in both tokens and training time.

The practical takeaway is that ORPO-Distill extracts more value from roughly the same supervision budget, making it a much more attractive recipe for enterprise-scale reasoning distillation.
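One way to read the MedQA numbers is to ask how much of the gap between the zero-shot student and the zero-shot teacher each recipe closes, and at what cost. The short script below does that arithmetic on the Table 1 values; the "gap closed" framing and the `gap_closed` helper are our illustration and not a metric reported in the paper.

```python
# (accuracy %, training cost in $) copied from Table 1 (MedQA).
RESULTS = {
    "Label Fine Tuning": (42.5, 12.84),
    "CoT Fine Tuning": (40.56, 29.36),
    "Pre-training + Label Fine Tuning": (45.4, 528.48),
    "ORPO-Distill": (50.43, 36.7),
}
STUDENT_ZERO_SHOT = 35.82
TEACHER_ZERO_SHOT = 50.98


def gap_closed(accuracy: float) -> float:
    """Fraction of the student-to-teacher accuracy gap that is closed."""
    return (accuracy - STUDENT_ZERO_SHOT) / (TEACHER_ZERO_SHOT - STUDENT_ZERO_SHOT)


for name, (acc, cost) in RESULTS.items():
    gain = acc - STUDENT_ZERO_SHOT
    print(f"{name}: +{gain:.2f} pts, {gap_closed(acc):.0%} of gap, ${cost:.2f}")
```

On these numbers, ORPO-Distill closes roughly 96% of the gap for $36.70, while pre-training plus label fine-tuning closes roughly 63% for $528.48.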

The MedQA results shown here are one representative example. The full paper reports the same broader pattern across datasets and model families.

Mixed-policy updates are the most stable way to refresh student negatives

The policy ablation (Figure 3) shows that how the rejected student samples are updated during training matters significantly.

Off-policy updates eventually become stale because they keep training against an older student error distribution. On-policy updates can become too narrow over time, since the latest student generations start collapsing toward a smaller set of mistakes, weakening contrastive diversity. Mixed-policy updates strike the best balance: they retain coverage over the student’s broader failure modes while still incorporating harder, more recent negatives.

This is why mixed-policy training shows the strongest and most stable behavior across epochs, and why it is a key ingredient of the final ORPO-Distill pipeline.

Figure 3: Ablation study over impact of Policy update strategy

Key Takeaways for Enterprise Distillation Workflows

Prioritize quality over volume

Enterprise distillation works best when the supervision set is aggressively curated for diversity, quality, and non-redundancy rather than simply scaled up.

Optimize for preference, not just imitation

Training the student to prefer correct teacher reasoning over its own incorrect generations is more effective than only fine-tuning on teacher traces.

Refresh distillation dataset with balance

Rejected student samples should be updated during training, but with a balanced strategy that preserves both fresh mistakes and broad failure diversity.

If this work resonates with how you are approaching distillation for your Enterprise LLMs, you can refer to our full paper for more details. Stay tuned for more research and updates from Phi Labs, Quantiphi on advancing model distillation and post-training for Small Language Models.