Paper Radar

Focused on large-model architecture, training, inference, evaluation, and multimodal work, distilling each paper's contributions, limitations, and engineering takeaways.

Automated Radar Dashboard

Generated automatically by the crawling and cleaning scripts; each update rebuilds the daily stream, the weekly picks, and the keyword trends.

Total entries: 363
Visible window: 120
Source breakdown (total): Manual 3 · arXiv 360
Source breakdown (window): Manual 0 · arXiv 120
Last build: 2026/03/17 16:44

Weekly Picks

Range: 2026-03-10 to 2026-03-17

Natural Language Processing (NLP)

From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation

Accurate process supervision remains a critical challenge for long-horizon robotic manipulation. A primary bottleneck is that current video MLLMs, trained primarily under a Supervised Fine-Tuning (SFT) paradigm, function as passive "Observers" that recognize ongoing events rather than evaluating the current state relative to the final task goal. In this paper, we introduce PRIMO R1 (Process Reasoning Induced Monitoring), a 7B framework that transforms video MLLMs into active "Critics". We leverage outcome-based Reinforcement Learning to incentivize explicit Chain-of-Thought generation for progress estimation. Furthermore, our architecture constructs a structured temporal input by explicitly anchoring the video sequence between initial and current state images. Supported by the proposed PRIMO Dataset and Benchmark, extensive experiments across diverse in-domain environments and out-of-domain real-world humanoid scenarios demonstrate that PRIMO R1 achieves state-of-the-art performance. Quantitatively, our 7B model achieves a 50% reduction in the mean absolute error of specialized reasoning baselines, demonstrating significant relative accuracy improvements over 72B-scale general MLLMs. Furthermore, PRIMO R1 exhibits strong zero-shot generalization on difficult failure detection tasks. We establish state-of-the-art performance on RoboFail benchmark with 67.0% accuracy, surpassing closed-source models like OpenAI o1 by 6.0%.

2026 · arXiv

Updated within the last 7 days · topic: NLP · composite score 30.8

General

The PokeAgent Challenge: Competitive and Long-Context Learning at Scale

We present the PokeAgent Challenge, a large-scale benchmark for decision-making research built on Pokemon's multi-agent battle system and expansive role-playing game (RPG) environment. Partial observability, game-theoretic reasoning, and long-horizon planning remain open problems for frontier AI, yet few benchmarks stress all three simultaneously under realistic conditions. PokeAgent targets these limitations at scale through two complementary tracks: our Battling Track, which calls for strategic reasoning and generalization under partial observability in competitive Pokemon battles, and our Speedrunning Track, which requires long-horizon planning and sequential decision-making in the Pokemon RPG. Our Battling Track supplies a dataset of 20M+ battle trajectories alongside a suite of heuristic, RL, and LLM-based baselines capable of high-level competitive play. Our Speedrunning Track provides the first standardized evaluation framework for RPG speedrunning, including an open-source multi-agent orchestration system for modular, reproducible comparisons of harness-based LLM approaches. Our NeurIPS 2025 competition validates both the quality of our resources and the research community's interest in Pokemon, with over 100 teams competing across both tracks and winning solutions detailed in our paper. Participant submissions and our baselines reveal considerable gaps between generalist (LLM), specialist (RL), and elite human performance. Analysis against the BenchPress evaluation matrix shows that Pokemon battling is nearly orthogonal to standard LLM benchmarks, measuring capabilities not captured by existing suites and positioning Pokemon as an unsolved benchmark that can drive RL and LLM research forward. We transition to a living benchmark with a live leaderboard for Battling and self-contained evaluation for Speedrunning at https://pokeagentchallenge.com.

2026 · arXiv

Updated within the last 7 days · topic: General · composite score 30.8

Natural Language Processing (NLP)

TrinityGuard: A Unified Framework for Safeguarding Multi-Agent Systems

With the rapid development of LLM-based multi-agent systems (MAS), their significant safety and security concerns have emerged, which introduce novel risks going beyond single agents or LLMs. Despite attempts to address these issues, the existing literature lacks a cohesive safeguarding system specialized for MAS risks. In this work, we introduce TrinityGuard, a comprehensive safety evaluation and monitoring framework for LLM-based MAS, grounded in the OWASP standards. Specifically, TrinityGuard encompasses a three-tier fine-grained risk taxonomy that identifies 20 risk types, covering single-agent vulnerabilities, inter-agent communication threats, and system-level emergent hazards. Designed for scalability across various MAS structures and platforms, TrinityGuard is organized in a trinity manner, involving an MAS abstraction layer that can be adapted to any MAS structures, an evaluation layer containing risk-specific test modules, alongside runtime monitor agents coordinated by a unified LLM Judge Factory. During Evaluation, TrinityGuard executes curated attack probes to generate detailed vulnerability reports for each risk type, where monitor agents analyze structured execution traces and issue real-time alerts, enabling both pre-development evaluation and runtime monitoring. We further formalize these safety metrics and present detailed case studies across various representative MAS examples, showcasing the versatility and reliability of TrinityGuard. Overall, TrinityGuard acts as a comprehensive framework for evaluating and monitoring various risks in MAS, paving the way for further research into their safety and security.

2026 · arXiv

Updated within the last 7 days · topic: NLP · composite score 30.7

General

SWE-Skills-Bench: Do Agent Skills Actually Help in Real-World Software Engineering?

Agent skills, structured procedural knowledge packages injected at inference time, are increasingly used to augment LLM agents on software engineering tasks. However, their real utility in end-to-end development settings remains unclear. We present SWE-Skills-Bench, the first requirement-driven benchmark that isolates the marginal utility of agent skills in real-world software engineering (SWE). It pairs 49 public SWE skills with authentic GitHub repositories pinned at fixed commits and requirement documents with explicit acceptance criteria, yielding approximately 565 task instances across six SWE subdomains. We introduce a deterministic verification framework that maps each task's acceptance criteria to execution-based tests, enabling controlled paired evaluation with and without the skill. Our results show that skill injection benefits are far more limited than rapid adoption suggests: 39 of 49 skills yield zero pass-rate improvement, and the average gain is only +1.2%. Token overhead varies from modest savings to a 451% increase while pass rates remain unchanged. Only seven specialized skills produce meaningful gains (up to +30%), while three degrade performance (up to -10%) due to version-mismatched guidance conflicting with project context. These findings suggest that agent skills are a narrow intervention whose utility depends strongly on domain fit, abstraction level, and contextual compatibility. SWE-Skills-Bench provides a testbed for evaluating the design, selection, and deployment of skills in software engineering agents. SWE-Skills-Bench is available at https://github.com/GeniusHTX/SWE-Skills-Bench.

2026 · arXiv

Updated within the last 7 days · topic: General · composite score 30.7

Computer Vision

What Matters for Scalable and Robust Learning in End-to-End Driving Planners?

End-to-end autonomous driving has gained significant attention for its potential to learn robust behavior in interactive scenarios and scale with data. Popular architectures often build on separate modules for perception and planning connected through latent representations, such as bird's eye view feature grids, to maintain end-to-end differentiability. This paradigm emerged mostly on open-loop datasets, with evaluation focusing not only on driving performance, but also intermediate perception tasks. Unfortunately, architectural advances that excel in open-loop often fail to translate to scalable learning of robust closed-loop driving. In this paper, we systematically re-examine the impact of common architectural patterns on closed-loop performance: (1) high-resolution perceptual representations, (2) disentangled trajectory representations, and (3) generative planning. Crucially, our analysis evaluates the combined impact of these patterns, revealing both unexpected limitations as well as underexplored synergies. Building on these insights, we introduce BevAD, a novel lightweight and highly scalable end-to-end driving architecture. BevAD achieves 72.7% success rate on the Bench2Drive benchmark and demonstrates strong data-scaling behavior using pure imitation learning. Our code and models are publicly available here: https://dmholtz.github.io/bevad/

2026 · arXiv

Updated within the last 7 days · topic: Vision · composite score 30.6

Computer Vision

Towards Generalizable Robotic Manipulation in Dynamic Environments

Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. Results demonstrate that PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Moreover, we show that training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks. All code and data are available at https://github.com/H-EmbodVis/DOMINO.

2026 · arXiv

Updated within the last 7 days · topic: Vision · composite score 30.5

Keyword Trends

Last 7 days vs. the prior 23 days (30-day overall window)

Keywords: 16
Rising: 10
Stable: 3
Falling: 3
Sampling windows (ranking fallback window): 84 / 276 items
Topic clusters: Applications & scenarios 5 · Foundation models 4 · Safety & governance 2





Each line: keyword · cluster · hits (last 7 d / prior 23 d) · trend · Δ · R

General AI · Research frontier · 36/100 · rising · Δ 0.066 · R 1.18
Inference · Systems engineering · 9/54 · falling · Δ -0.088 · R 0.55
Benchmark · Evaluation & tooling · 21/50 · rising · Δ 0.069 · R 1.38
Multimodal · Applications & scenarios · 4/35 · falling · Δ -0.079 · R 0.38
Machine Learning · Foundation models · 31/109 · falling · Δ -0.026 · R 0.93
Long Context · Foundation models · 7/7 · rising · Δ 0.058 · R 3.29
Vision · Applications & scenarios · 31/105 · stable · Δ -0.011 · R 0.97
Agent · Applications & scenarios · 15/40 · rising · Δ 0.034 · R 1.23
Robotics · Applications & scenarios · 8/14 · rising · Δ 0.044 · R 1.88
Alignment · Safety & governance · 10/25 · rising · Δ 0.029 · R 1.31
NLP · Foundation models · 15/46 · stable · Δ 0.012 · R 1.07
AR/VR · Applications & scenarios · 2/0 · rising · Δ 0.024 · R 3.00
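The Δ and R values on the keyword cards are consistent with simple window-share arithmetic: Δ as the keyword's share of the 84-item recent window minus its share of the 276-item prior window, and R as the ratio of those two shares. The sketch below reproduces the published numbers under that reading; the function name, the ±0.02 band for the "stable" label, and the fixed fallback R of 3.0 when the prior window has zero hits are all assumptions, not the site's actual code.

```python
def keyword_trend(recent_hits, prior_hits, recent_n=84, prior_n=276,
                  stable_band=0.02, zero_prior_ratio=3.0):
    """Classify a keyword's movement between the two sampling windows.

    All defaults are guesses read off the published card values,
    not the site's real settings.
    """
    recent_rate = recent_hits / recent_n   # share of the recent window
    prior_rate = prior_hits / prior_n      # share of the prior window
    delta = recent_rate - prior_rate
    # With no prior-window hits the ratio is undefined; fall back to a fixed value.
    ratio = recent_rate / prior_rate if prior_hits else zero_prior_ratio
    if delta > stable_band:
        label = "rising"
    elif delta < -stable_band:
        label = "falling"
    else:
        label = "stable"
    return round(delta, 3), round(ratio, 2), label
```

Checked against the cards above, `keyword_trend(36, 100)` reproduces General AI's Δ 0.066 and R 1.18, and `keyword_trend(2, 0)` reproduces AR/VR's capped R 3.00.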

On-Site Paper List

Filter by keyword, year, and topic, with support for sorted browsing.
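A filter-then-sort pass over the entry list is all this view needs. A minimal sketch follows; the record fields (`title`, `year`, `topic`, `score`) are illustrative, not the site's actual schema.

```python
def filter_papers(papers, keyword=None, year=None, topic=None, sort_key="score"):
    """Filter radar entries by keyword, year, and topic, then sort for browsing.

    Each entry in `papers` is a dict; any filter left as None is skipped.
    """
    hits = [
        p for p in papers
        if (keyword is None or keyword.lower() in p["title"].lower())
        and (year is None or p["year"] == year)
        and (topic is None or p["topic"] == topic)
    ]
    # Highest score first, matching the digest's ranked presentation.
    return sorted(hits, key=lambda p: p[sort_key], reverse=True)
```

For example, `filter_papers(entries, topic="NLP")` would return the NLP picks ordered by composite score.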
