Explore | alphaXiv
Image Generators are Generalist Vision Learners
22 Apr 2026
Valentin Gabeur
Shangbang Long
Songyou Peng
Google's Vision Banana model, created by instruction-tuning a pretrained image generator, demonstrates that generative models can achieve state-of-the-art performance in both visual understanding and generation. It surpasses existing specialized models on tasks like semantic segmentation and metric depth estimation while maintaining high-quality image generation capabilities.
computer-science
artificial-intelligence
computer-vision-and-pattern-recognition
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
20 Apr 2026
Guanting Dong
Junting Lu
Junjie Huang
Renmin University of China and ByteDance Seed researchers introduced Agent-World, a framework for advancing general agent intelligence through scalable real-world environment synthesis and a continuous self-evolving training mechanism. The approach enables agents to learn and adapt by autonomously discovering and constructing a diverse ecosystem of stateful, executable tools and environments from real-world sources, achieving consistent performance improvements across 23 challenging agent benchmarks and outperforming prior environment-scaling methods.
computer-science
artificial-intelligence
computation-and-language
Neural Garbage Collection: Learning to Forget while Learning to Reason
20 Apr 2026
Michael Y. Li
Jubayer Ibn Hamid
Emily B. Fox
Researchers at Stanford University developed Neural Garbage Collection (NGC), an end-to-end reinforcement learning framework that allows large language models to jointly learn chain-of-thought reasoning and KV cache memory management. This method effectively reduces peak KV cache size by 2.4x to 5x while consistently outperforming heuristic eviction strategies across various arithmetic and mathematical reasoning tasks.
agents
attention-mechanisms
computer-science
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
22 Apr 2026
Inclusion AI
Tiwei Bie
Haoxing Chen
LLaDA2.0-Uni introduces a unified discrete diffusion large language model (dLLM) framework, featuring a semantic visual tokenizer and MoE backbone, to integrate multimodal understanding and generation. The model achieves competitive performance across a broad spectrum of benchmarks, demonstrating advancements in image generation fidelity, reasoning capabilities, and inference efficiency.
computer-science
computer-vision-and-pattern-recognition
fine-tuning
Scaling Self-Play with Self-Guidance
22 Apr 2026
Luke Bailey
Kaiyue Wen
Kefan Dong
Researchers at Stanford University introduced Self-Guided Self-Play (SGS), an algorithm that uses an integrated "Guide" language model to ensure the generation of high-quality, relevant synthetic problems for a Solver model. This approach enabled a 7B parameter model to exceed the pass@4 performance of a 671B parameter model in formal theorem proving and to achieve a 7% higher asymptotic solve rate than baseline reinforcement learning methods.
agents
computer-science
machine-learning
Qwen3.5-Omni Technical Report
21 Apr 2026
Qwen Team
The Qwen Team introduced Qwen3.5-Omni, a fully omnimodal Large Language Model capable of understanding, reasoning, and generating content across text, images, audio, and video modalities. This model achieved state-of-the-art performance across 215 audio and audio-visual benchmarks, demonstrated enhanced real-time interaction with low latency, and exhibited emergent capabilities like "Audio-Visual Vibe Coding."
computer-science
computation-and-language
audio-and-speech-processing
OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
20 Apr 2026
Jinghui Lu
Jiayi Guan
Zhijian Huang
The Xiaomi Embodied Intelligence Team developed OneVL, a framework that integrates a Vision-Language-Action model with an auxiliary world model for autonomous driving. It is the first latent Chain-of-Thought (CoT) method to surpass explicit autoregressive CoT in trajectory prediction, achieving an 88.84 PDM-score on NAVSIM and outperforming prior 8B models by up to 2.64 points, with inference latency comparable to answer-only prediction.
autonomous-vehicles
causal-inference
chain-of-thought
AI scientists produce results without reasoning scientifically
20 Apr 2026
Martiño Ríos-García
Nawaf Alampara
Chandan Gupta
A study found that large language model-based scientific agents consistently fail to employ disciplined scientific reasoning patterns, such as refutation or hypothesis testing, even when completing tasks successfully. This research demonstrated that current outcome-focused benchmarks do not reveal these epistemic deficiencies, which are primarily driven by the base model and lead to unreliable performance on complex, novel problems.
agentic-frameworks
agents
computer-science
Near-Future Policy Optimization
22 Apr 2026
Chuanyu Qin
Chenxu Yang
Qingyi Si
Near-Future Policy Optimization (NPO) introduces a mixed-policy Reinforcement Learning with Verifiable Rewards (RLVR) scheme that leverages verified trajectories from a model's own near-future self to guide current policy training. The automated variant, AutoNPO, achieved an average performance of 63.15% across eight multimodal reasoning benchmarks, demonstrating improved convergence speed and exploration compared to existing methods.
agents
computer-science
machine-learning
FASTER: Value-Guided Sampling for Fast RL
21 Apr 2026
Perry Dong
Alexander Swerdlow
Dorsa Sadigh
FASTER is a framework that improves the computational efficiency of diffusion-based reinforcement learning policies by filtering unpromising action candidates at the initial noise level. It achieves up to an 8x reduction in inference FLOPs and a 4.5x speedup in training time, enabling large Vision-Language-Action (VLA) models to maintain comparable task success rates to standard best-of-N sampling with significantly lower overhead.
computer-science
artificial-intelligence
machine-learning
Context Unrolling in Omni Models
23 Apr 2026
Ceyuan Yang
Zhijie Lin
Yang Zhao
Omni, a unified multimodal model, introduces "Context Unrolling" as an emergent capability for explicit reasoning across heterogeneous modal representations. This mechanism dynamically constructs and composes task-relevant contexts, leading to improved fidelity across multimodal understanding, generation, and 3D geometry tasks.
computer-science
computer-vision-and-pattern-recognition
generative-models
Wan-Image: Pushing the Boundaries of Generative Visual Intelligence
21 Apr 2026
Chaojie Mao
Chen-Wei Xie
Chongyang Zhong
The Wan-Image system from Alibaba Group introduces a unified visual generation framework designed to meet the demands of professional design workflows, going beyond general image synthesis. It delivers precise spatial and logical alignment, robust typography rendering, and strict identity preservation, achieving an approximately 80% pass rate in human evaluations for interactive editing and image series generation.
computer-science
computer-vision-and-pattern-recognition
data-curation
Generalization at the Edge of Stability
21 Apr 2026
Mario Tuci
Caner Korkmaz
Umut Şimşekli
This research introduces the Sharpness Dimension (SD) as a novel complexity measure derived from random dynamical systems theory, aiming to explain generalization in neural networks operating at the 'edge of stability' (EoS). The study establishes a theoretical generalization bound linked to SD and empirically demonstrates its strong correlation with generalization performance across various architectures, including MLPs and GPT-2, while also providing insights into grokking.
computer-science
artificial-intelligence
computer-vision-and-pattern-recognition
MultiWorld: Scalable Multi-Agent Multi-View Video World Models
21 Apr 2026
Haoyu Wu
Jiwen Yu
Yingtian Zou
MultiWorld presents a framework for scalable multi-agent, multi-view video world modeling, integrating a Multi-Agent Condition Module (MACM) and a Global State Encoder (GSE). The model demonstrates improved action controllability, multi-view consistency, and scalability across varying numbers of agents and camera views, outperforming baselines on multi-player game and multi-robot manipulation tasks.
agent-based-systems
computer-science
computer-vision-and-pattern-recognition
OpenGame: Open Agentic Coding for Games
20 Apr 2026
Yilei Jiang
Jinyuan Hu
Qianyin Xiao
OpenGame is an open-source agentic framework that autonomously generates fully playable 2D web games from natural language specifications. It leverages a domain-specialized code model, a structured multimodal coding agent, and "Game Skill" to achieve 72.4 Build Health, 67.2 Visual Usability, and 65.1 Intent Alignment on a custom benchmark, surpassing prior methods.
computer-science
software-engineering
A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression
21 Apr 2026
Jincheng Ren
Siwei Wu
Yizhi Li
The TACO framework introduces a self-evolving, training-free approach to dynamically compress terminal outputs for large language model agents, mitigating context window saturation. This system improves agent performance across diverse software engineering tasks while simultaneously reducing token costs by managing interaction history more efficiently.
agentic-frameworks
agents
computer-science
Synthesizing Multi-Agent Harnesses for Vulnerability Discovery
22 Apr 2026
Hanzhi Liu
Chaofan Shou
Xiaonan Liu
AgentFlow is a system designed to automatically synthesize multi-agent harnesses for LLM-driven vulnerability discovery, unifying the search space for all harness components and leveraging fine-grained runtime diagnostic feedback. It achieved an 84.3% pass rate on the TerminalBench-2 leaderboard and identified ten zero-day vulnerabilities, including two critical sandbox escapes, in Google Chrome.
computer-science
cryptography-and-security
DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data
21 Apr 2026
Venus Team
Sunhao Dai
Yong Deng
The Venus Team at Ant Group developed DR-Venus, a framework for training frontier edge-scale deep research agents using only 10K open data. This approach combines a meticulously filtered supervised fine-tuning stage with an information gain-based reinforcement learning method, enabling a 4B parameter model to achieve competitive performance on deep research benchmarks, sometimes matching or exceeding much larger 30B-class models.
agents
computer-science
artificial-intelligence
Exploring Spatial Intelligence from a Generative Perspective
22 Apr 2026
Muzhi Zhu
Shunyao Jiang
Huanyi Zheng
Researchers from Zhejiang University and collaborators introduce Generative Spatial Intelligence (GSI), a new capability for multimodal large language models (MLLMs) to manipulate 3D spatial constraints during image generation. They develop GSI-Bench, a benchmark with synthetic and real-world data, demonstrating that training MLLMs on synthetic GSI data improves generative spatial editing and boosts spatial understanding performance across various benchmarks.
computer-science
computer-vision-and-pattern-recognition
fine-tuning
PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance
22 Apr 2026
Yupeng Zheng
Xiang Li
Songen Gu
PokeVLA introduces a lightweight Vision-Language-Action model (1.22B parameters) that achieves state-of-the-art performance in language-instructed robot manipulation. It integrates comprehensive world knowledge through specialized pre-training, multi-view geometry alignment, and goal-aware semantic segmentation, resulting in an 83.5% success rate on the LIBERO-Plus benchmark and improved robustness in real-world tasks.
computer-science
robotics