Experimental Setup & Reproducibility
Open Source Implementation: Complete TDLM codebase available at https://github.com/jibarix/tdlm5 with comprehensive documentation, test suite, and configuration files for full reproducibility.
Hardware Configuration: NVIDIA GeForce RTX 3070 Ti Laptop GPU (8.59GB VRAM), Intel 12th Gen processor, Windows 10. All experiments conducted on consumer-grade hardware to validate that diffusion advantages extend beyond high-end research clusters.
Tiny Model Architecture: Model scaled down to 2 layers, 64 hidden dimensions, 4 attention heads, 128 max sequence length (6.54M total parameters) for accessible experimentation. Training: 5 epochs, 64 batch size, 1 gradient accumulation step, AdamW optimizer (3e-4 learning rate), ~30 minutes per run.
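For reference, the setup above can be summarized as a minimal Python sketch (key names here are illustrative, not necessarily the repository's actual config schema):

    # Hypothetical summary of the tiny-scale setup described above.
    TINY_CONFIG = dict(
        n_layers=2, d_model=64, n_heads=4, max_seq_len=128,  # ~6.54M parameters
        epochs=5, batch_size=64, grad_accum_steps=1,
        optimizer="AdamW", lr=3e-4, seed=42,
        training_mode="diffusion",  # flipped to "autoregressive" for the baseline
    )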
Fair Comparison Protocol: Only training mode changed between experiments - identical architecture, data (WikiText-2), optimizer settings, batch configuration, random seed (42), and training duration. The sole difference: discrete diffusion vs autoregressive training objective.
Reproduction Instructions: git clone https://github.com/jibarix/tdlm5 && cd tdlm5 && python main.py --config config/quick_test.yaml runs the diffusion experiment; then set training_mode: "autoregressive" in the config and re-run for the comparison. A complete test guide is included in the repository documentation.
Theoretical Foundation: Latest Discrete Diffusion Research
Key Research Papers: Austin et al. (2021), foundational framework; Ni et al. (2025), "super data learners" analysis; Prabhudesai et al. (2025), data-efficiency findings; Nie et al. (2025), LLaDA's competitive performance; Zhang (2025), optimal scheduling theory.
Implementation Repository: All theoretical components implemented and validated in working code at github.com/jibarix/tdlm5 with comprehensive documentation and testing framework.
• Austin et al. (2021): Theoretical Foundation
Core Framework: Established discrete diffusion via absorbing state masking with theoretically grounded time-dependent loss weighting. Provides the mathematical foundation for training diffusion models on discrete sequences.
Critical Implementation: Proper weighted ELBO formulation essential for fair comparison with autoregressive models - incorrect unweighted loss leads to unfair evaluations.
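A minimal PyTorch sketch of this weighting, assuming the continuous-time absorbing-state formulation (a per-sequence masking ratio t, with masked-token cross-entropy reweighted by 1/t; function and variable names are ours, not the repository's):

    import torch
    import torch.nn.functional as F

    def weighted_elbo_loss(logits, targets, mask, t):
        # logits: (B, L, V) predictions on the corrupted sequence
        # targets: (B, L) clean token ids; mask: (B, L) True at masked positions
        # t: (B,) per-sequence masking ratio, sampled from U(0, 1]
        ce = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
        ce = (ce * mask).sum(dim=1)  # cross-entropy over masked tokens only
        return (ce / t).mean()       # the 1/t weight keeps the loss a valid ELBO

Dropping the 1/t factor yields the "incorrect unweighted loss" noted above, which biases evaluations against a properly trained autoregressive baseline.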
• Ni et al. (2025): "Diffusion Language Models are Super Data Learners"
Key Finding: Diffusion models are "super data learners" that excel through bidirectional modeling and computational super-density, extracting more than 3x the data potential of autoregressive models by trading additional FLOPs for improved learning.
Methodological Framework: Emphasizes downstream task performance over validation loss for fair comparison, addressing limitations in traditional evaluation metrics.
• Prabhudesai et al. (2025): "Diffusion Beats Autoregressive"
Data-Constrained Analysis: Demonstrates that diffusion models significantly outperform autoregressive models when data, not compute, is the bottleneck. Shows diffusion models continue improving beyond 100 epochs while AR models saturate.
Critical Compute Threshold: Establishes power-law relationship for the compute point where diffusion begins outperforming autoregressive models in data-constrained settings.
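Schematically, the threshold takes a power-law form in the amount of unique data, with the exponent fitted empirically in the paper (the fitted constants are not reproduced here):

    C_crit(U) ∝ U^α,  where U = number of unique training tokens

Compute budgets above C_crit(U) favor diffusion in data-constrained settings; below it, autoregressive training stays ahead.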
• LLaDA (2025): Variable Masking Strategy
Training Innovation: Variable masking ratios sampled per sequence (rather than a single ratio per batch) are crucial for competitive performance; each sequence receives a different corruption level during training (see the sketch below).
Scaling Validation: LLaDA 8B achieves competitive performance with LLaMA3 8B, proving discrete diffusion viability at scale.
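A minimal sketch of this per-sequence corruption, assuming an absorbing [MASK] token id (function and variable names are ours):

    import torch

    def corrupt_batch(tokens, mask_id, eps=1e-3):
        # tokens: (B, L) clean ids; each sequence draws its own ratio t ~ U(eps, 1]
        B, L = tokens.shape
        t = eps + (1.0 - eps) * torch.rand(B, device=tokens.device)   # (B,)
        mask = torch.rand(B, L, device=tokens.device) < t[:, None]    # (B, L)
        noisy = torch.where(mask, torch.full_like(tokens, mask_id), tokens)
        return noisy, mask, t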
• Zhang (2025): Optimal Scheduling Theory
Theoretical Optimality: Proves that the cosine schedule is Fisher-Rao optimal for discrete diffusion inference, creating equally difficult denoising steps for maximum generation quality (sketched below).
Practical Impact: Provides theoretical foundation for schedule choice, moving beyond experimental tuning to principled optimization.
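One common way to realize the schedule at inference time, read as the fraction of tokens still masked after each of K denoising steps (a sketch of a standard parameterization, not necessarily the paper's exact construction):

    import math

    def cosine_mask_schedule(K):
        # Fraction of tokens still masked after step k: cos(pi/2 * k / K).
        # Early steps unmask few tokens; difficulty stays balanced across steps.
        return [math.cos(0.5 * math.pi * k / K) for k in range(K + 1)]

For example, cosine_mask_schedule(4) gives [1.0, 0.92, 0.71, 0.38, 0.0].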
Head-to-Head Comparison Results
Identical model architectures, training data (WikiText-2), and optimization settings. Both models were trained for the same duration with the same batch configuration. The only difference: training objective (discrete diffusion vs autoregressive). Results demonstrate a clear diffusion advantage across all language modeling metrics. Data verified from actual experimental runs with Weights & Biases tracking.
Key Observation: Despite identical training conditions and computational budget, diffusion achieved 37.4% lower test perplexity (382.74 vs 611.01). This demonstrates diffusion's superior learning efficiency per gradient step, validating the "super data learner" hypothesis - extracting more signal from the same training data through bidirectional modeling.
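The figure follows directly from the two perplexities:

    (611.01 - 382.74) / 611.01 ≈ 0.374, i.e. a 37.4% reduction.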
[Metric cards: Test Perplexity (diffusion 382.74 vs autoregressive 611.01) and Best Val Loss for each run]
Early Training Dominance: Diffusion Superiority from First Steps
Diffusion shows superiority from the very beginning, not just at convergence, demonstrating that its learning-efficiency advantage is not a late-stage phenomenon.
Advanced Comparison Analysis: Perfect Hyperparameter Mirroring
Hyperparameter-Mirrored Protocol: Our validation employs perfect hyperparameter mirroring - identical architecture, optimization settings, batch configuration, random seed (42), and training duration. The sole experimental variable: training mode (diffusion vs autoregressive). This eliminates confounding factors and provides the cleanest possible comparison.
Methodological Rigor: Following recommendations from Ni et al. (2025), we complement traditional validation loss metrics with downstream task performance. This addresses known limitations where autoregressive models compute exact likelihood while diffusion models provide upper bounds, ensuring fair evaluation protocols.
Advanced Evaluation Results: Beyond Traditional Metrics
Comprehensive comparison following latest research recommendations
Downstream task results reflect 6.54M parameter scale limitations - both tasks typically require 100M+ parameters for meaningful performance
*Limited-run evaluations on hardcoded sample questions designed for quick validation (samples available in code)
• Super Data Learner Validation: Bidirectional Modeling Advantage
Theoretical Confirmation: Our results are consistent with Ni et al.'s "super data learner" hypothesis at tiny scale. Despite identical training data, diffusion's bidirectional attention mechanism achieves 37.4% lower test perplexity, learning from diverse token orderings rather than a fixed left-to-right factorization.
Computational Super-Density: Diffusion models achieve superior performance by trading computational efficiency for data efficiency - exactly the trade-off predicted by recent theoretical analysis.
• Perfect Hyperparameter Mirroring: Isolating Training Objective Impact
Experimental Control: Our validation uses identical hyperparameters for both models - same architecture (2 layers, 64 hidden, 4 heads), optimizer settings (AdamW, 3e-4 LR), batch configuration (64), and random seed (42). Only the training mode differs.
Pure Comparison: This hyperparameter-mirrored approach controls for confounding variables, providing the cleanest possible assessment of diffusion vs. autoregressive training objectives under identical computational budgets.
• Enhanced Evaluation Framework: Beyond Traditional Metrics
Multi-Metric Assessment: Following Ni et al.'s recommendations, we evaluate models on downstream tasks (HellaSwag, MMLU) alongside traditional validation loss, providing a more reliable comparison framework (a scoring sketch follows below).
Scale Considerations: At 6.54M parameters, both models show expected limitations on complex reasoning tasks. In limited-run evaluations on hardcoded sample questions, both models scored 0% on HellaSwag, while diffusion answered one of two MMLU questions correctly (50% vs 0% for AR). With so few questions these figures are anecdotal at best; they hint at, but do not establish, benefits that may emerge with model size.
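For concreteness, a sketch of one standard way to score such multiple-choice items for the AR model: sum the log-probabilities of each candidate completion given the prompt and pick the highest. (For the diffusion model, the weighted ELBO would stand in for the exact log-likelihood, giving an upper bound.) The function is illustrative, not the repository's evaluation code:

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def choice_logprob(model, prompt_ids, choice_ids):
        # Score a candidate answer: sum log-probs of its tokens given the prompt.
        ids = torch.cat([prompt_ids, choice_ids]).unsqueeze(0)         # (1, T)
        logits = model(ids[:, :-1])                                    # (1, T-1, V)
        logp = F.log_softmax(logits, dim=-1)
        tok_lp = logp.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
        return tok_lp[0, -choice_ids.numel():].sum()                   # choice span only

    # pred = max(range(len(choices)),
    #            key=lambda i: choice_logprob(model, prompt, choices[i]))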
• Research Methodology Compliance: Mirroring Latest Standards
Implementation Correctness: Our evaluation addresses methodological concerns raised in recent literature by implementing proper time-dependent loss weighting (Austin et al. 2021) and variable masking strategies (LLaDA 2025).
Comparative Rigor: Identical experimental conditions, comprehensive metric collection, and downstream task evaluation ensure our results contribute meaningfully to the diffusion vs. autoregressive debate.
Novel Finding: Immediate Diffusion Dominance at Tiny Scale
Anomalous Results: Our experimental results reveal a phenomenon that contradicts established research patterns. Unlike larger-scale studies where AR models initially outperform diffusion models before potential crossover, we observe immediate diffusion dominance from the very first gradient steps. This represents a fundamentally different dynamic than the documented "late-stage crossover" pattern.
The Contradiction: Established research shows AR models starting better, then diffusion potentially crossing over due to overfitting with data repetition. Our 6.54M parameter results show diffusion achieving 4.3x better perplexity at step 2, sustained throughout training - no crossover event, just immediate and persistent dominance.
• Established Pattern: Late-Stage Crossover
Literature Consensus: Prabhudesai et al. and Ni et al. demonstrate that AR models initially outperform diffusion models, with diffusion potentially crossing over only after AR models begin overfitting from excessive data repetition.
Mechanism: The documented advantage emerges through AR degradation, not inherent diffusion superiority. Single-epoch regimes consistently favor AR models across all documented scales.
• Our Anomalous Finding: Immediate Dominance
Contradictory Evidence: At 6.54M parameters, diffusion demonstrates immediate superiority from the first evaluation step (4.3x better at step 2), sustained throughout training. No crossover event occurs because diffusion never trails.
Scale-Dependent Behavior: This suggests that at sufficiently small scales, diffusion may possess an inherent efficiency advantage that is masked, or disappears entirely, at the larger scales documented in the literature.
• Potential Scale-Dependent Mechanisms
Tiny-Scale Hypothesis: At 6.54M parameters, bidirectional attention overhead may be minimized while data efficiency advantages remain intact, creating a favorable efficiency trade-off unique to tiny scales.
Hyperparameter Sensitivity: Our specific architecture (2 layers, 64 hidden dimensions) may represent a sweet spot where diffusion's training objective alignment outweighs computational disadvantages observed at larger scales.
• Research Implications: Novel Scale Dynamics
Paradigm Challenge: Our results suggest that diffusion advantages may not solely depend on AR overfitting but could emerge from scale-dependent efficiency dynamics previously unobserved in larger model studies.
Future Investigation: This finding warrants systematic investigation across the 1M-100M parameter range to identify the precise scale threshold where diffusion behavior transitions from immediate dominance to delayed crossover patterns.
Immediate Dominance: A Novel Phenomenon
Unlike established research showing initial AR superiority, we observe diffusion dominance from the first evaluation step
Results suggest previously undocumented scale-dependent diffusion advantages that merit further investigation