News
Meet Harness-1: A 20B Retrieval Subagent Trained With Reinforcement Learning Inside a Stateful Search Harness on gpt-oss-20b
1+ hour, 16+ min ago (671+ words) Harness-1 reaches 0.730 average curated recall across eight benchmarks, trailing only Opus-4.6 among the searchers tested. Their answer is Harness-1, a 20B retrieval subagent built on gpt-oss-20b. It was trained with reinforcement learning inside a stateful search harness. The harness holds…
Parallax: A Parameterized Local Linear Attention That Keeps Softmax and Adds a Learned Covariance Correction Branch
6+ day, 3+ hour ago (1040+ words) The Transformer's attention mechanism has barely changed since 2017. Most efficiency work has tried to replace softmax attention outright. A new paper takes a different route. It keeps softmax attention and bolts on a correction branch. A team of researchers from…
Trajectory Releases a Concurrent Multi-LoRA Training Stack for Continual Learning, Reporting a 2.81× Experiment-Throughput Gain
1+ week, 5+ hour ago (962+ words) Most language models improve in discontinuous jumps. A team collects data, trains, and ships a new version. This takes months and produces remarkable or catastrophic behavior for users. Trajectory wants to replace that cycle with continual learning. The Trajectory team…
NVIDIA Introduces X-Token: Projection-Guided Cross-Tokenizer KD That Outperforms GOLD by +3.82 Average Points on Llama-3.2-1B
1+ week, 1+ day ago (917+ words) Knowledge distillation (KD) transfers "dark knowledge" from a large teacher model to a smaller student. The student learns from the teacher's full output probability distribution over tokens, not just correct answers. This is done via per-position Kullback-Leibler (KL) divergence over…
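As a rough illustration of the per-position KL objective mentioned in the teaser above, the following minimal sketch averages KL(teacher || student) over sequence positions. It assumes the teacher and student share one vocabulary, which is the simple case and not the cross-tokenizer setting X-Token targets. The function names (`softmax`, `kl_divergence`, `per_position_kd_loss`) are hypothetical and not taken from the article.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def per_position_kd_loss(teacher_logits, student_logits):
    """Average KL(teacher || student) over sequence positions.

    Both arguments are lists of per-position logit vectors and are
    assumed (hypothetically) to be aligned over a shared vocabulary.
    """
    losses = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(losses) / len(losses)


# Two positions, vocabulary of three tokens.
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
student = [[1.5, 0.7, -0.5], [0.0, 0.0, 2.0]]
loss = per_position_kd_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's distribution at every position and grows as the distributions diverge.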
Sakana AI Proposes Diffusion Blocks: a Block-wise Training Framework That Converts Residual Networks into Independently Trainable Denoising Modules
1+ week, 3+ day ago (900+ words) Researchers from Sakana AI and the University of Tokyo propose Diffusion Blocks. It trains transformer-based networks one block at a time. Training memory is reduced by a factor of B, where B is the number of blocks. Performance is maintained…
MEMO: A Modular Framework for Training a Dedicated Memory Model on New Knowledge Without Modifying LLM Parameters
1+ week, 4+ day ago (876+ words) Large language models become static after pretraining. Their knowledge does not update as the world changes. Retraining a full LLM is too expensive at modern scales. Fine-tuning risks degrading previously learned knowledge. Retrieval-augmented generation (RAG) struggles when answers require reasoning…
Design a Complete Multimodal RLVR Pipeline with Open-MM-RL, Vision-Language Prompting, Reward Scoring, and GRPO Export
1+ week, 5+ day ago (935+ words) In this tutorial, we explore the TuringEnterprises/Open-MM-RL dataset as a practical foundation for multimodal reasoning and reinforcement learning with verifiable rewards. We load the dataset, inspect its schema, analyze domains, formats, question lengths, answer types, and image distributions,…
Step-by-Step Guide to Build and Compare FedAvg and FedProx Federated Learning on Non-IID CIFAR-10 with NVIDIA FLARE
1+ week, 5+ day ago (628+ words) In this tutorial, we build an advanced federated learning experiment with NVIDIA FLARE. We compare FedAvg and FedProx on a non-IID CIFAR-10 setup, where client data is split using a Dirichlet distribution to simulate realistic label imbalance across…
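The Dirichlet-based non-IID split described in the teaser above can be sketched roughly as follows. This is a standard-library illustration only: it does not use NVIDIA FLARE, and the function name `dirichlet_partition` and the parameters `num_clients`, `alpha`, and `seed` are hypothetical choices, not the tutorial's code. Smaller `alpha` values produce more skewed label distributions per client.

```python
import random


def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with Dirichlet label skew.

    For each class, client proportions are drawn from Dirichlet(alpha)
    (sampled as normalized Gamma draws), and that class's indices are
    cut into contiguous chunks according to those proportions.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    clients = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)

        # Dirichlet sample via normalized Gamma(alpha, 1) variates.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(gammas)

        start, cumulative = 0, 0.0
        for c in range(num_clients):
            cumulative += gammas[c] / total
            if c == num_clients - 1:
                end = len(idxs)  # last client takes the remainder
            else:
                end = max(start, int(round(cumulative * len(idxs))))
            clients[c].extend(idxs[start:end])
            start = end
    return clients


# Toy stand-in for CIFAR-10 labels: 1000 samples, 10 balanced classes.
toy_labels = [i % 10 for i in range(1000)]
parts = dirichlet_partition(toy_labels, num_clients=5, alpha=0.5)
```

Every sample index lands in exactly one client's list, so the client shards together form a partition of the dataset.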
Stochastic Gradient Descent (SGD's) Frequency Bias and How Adam Fixes It
2+ week, 5+ day ago (868+ words) Modern language models are trained on data with extremely uneven token distributions. A small number of words appear in almost every sentence, while many rare but meaningful tokens occur only occasionally. This creates a hidden optimization challenge: parameters associated with…
Nous Research Proposes Lighthouse Attention: A Training-Only Selection-Based Hierarchical Attention That Delivers 1.4-1.7× Pretraining Speedup at Long Context
3+ week, 9+ hour ago (1602+ words) Lighthouse takes a different approach on both design decisions. It pools queries, keys, and values symmetrically across a multi-level pyramid, and it places selection entirely outside the attention kernel. After selection, the system gathers the chosen entries into a contiguous,…