HyperAI

Main

GPU

Console
Docs
Pricing

Pulse

News

Resources

Papers
Notebooks
Datasets
Wiki

Benchmarks

SOTA
LLM Models
GPU Leaderboard

Community

Events

Utility

About Terms of Service Privacy Policy
English



Papers

Cutting-edge AI research papers, updated daily, to help you keep up with the latest AI trends

Build the Future of Artificial Intelligence

About

About Us Dataset Help

Products

News Papers Notebooks Datasets Wiki

Links

© HyperAI

GitHub Discord X (formerly Twitter)


AutoHarness: Improving LLM Agents by Automatically Synthesizing a Code Harness

Code Generation

Xinghua Lou, Miguel Lázaro-Gredilla, Antoine Dedieu, et al.

GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents

Video Understanding

Visual Question Answering

Yunzhe Wang, Runhui Xu, Kexin Zheng, et al.

Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

Jeonghye Kim, Xufang Luo, Minbeom Kim, et al.

UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience

Zichuan Lin, Feiyu Liu, Yijun Yang, et al.

T-MAP: Red-Teaming LLM Agents with Trajectory-aware Evolutionary Search

Hyomin Lee, Sangwoo Park, Yumin Choi, et al.

CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents

Video Understanding

Xiangru Jian, Shravan Nayak, Kevin Qinghong Lin, et al.

EVA: Efficient Reinforcement Learning for End-to-End Video Agent

Video Understanding

Yaolun Zhang, Ruohui Wang, Jiahao Wang, et al.

Foveated Diffusion: Efficient Spatially Adaptive Image and Video Generation

Diffusion Model

Image Inpainting

Brian Chao, Lior Yariv, Howard Xiao, et al.

Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos

Video Understanding

Shoubin Yu, Lei Shu, Antoine Yang, et al.

From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents

Ling Yue, Kushal Raj Bhandari, Ching-Yun Ko, et al.

SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

Haoyu Huang, Jinfa Huang, Zhongwei Wan, et al.

DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models

Diffusion Model

Video Processing

Jaewon Min, Jaeeun Lee, Yeji Choi, et al.

PEARL: Personalized Streaming Video Understanding Model

Video Understanding

Yuanhong Zheng, Ruichuan An, Xiaopeng Lin, et al.

WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG

Video Generation

Action Recognition

Zhen Li, Zian Meng, Shuwei Shi, et al.

MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding

Diffusion Model

Hejun Dong, Junbo Niu, Bin Wang, et al.

PivotRL: High Accuracy Agentic Post-Training at Low Compute Cost

Supervised Fine-Tuning

Reinforcement Learning

Junkeun Yi, Damon Mosk-Aoyama, Baihe Huang, et al.

F4Splat: Feed-Forward Predictive Densification for Feed-Forward 3D Gaussian Splatting

Injae Kim, Chaehyeon Kim, Minseong Bae, et al.

SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning

Multimodal Representation

Byungwoo Jeon, Dongyoung Kim, Huiwon Jang, et al.

VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding

Video Understanding

Visual Question Answering

Ruoliu Yang, Chu Wu, Caifeng Shan, et al.

LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning

Jianing Wang, Jianfei Zhang, Qi Guo, et al.

Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model

SII-GAIR, Sand.ai, Ethan Chern, et al.

Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models

Video Generation

Meiqi Wu, Zhixin Cai, Fufangchen Zhao, et al.

PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation

Huadai Liu, Kaicheng Luo, Wen Wang, et al.

LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

Multimodal Representation

Lucas Maes, Quentin Le Lidec, Damien Scieur, et al.

FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow

Zhifei Yang, Guangyao Zhai, Keyang Lu, et al.

LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation

Diffusion Model

Jiazheng Xing, Fei Du, Hangjie Yuan, et al.

The Y-Combinator for LLMs: Solving Long-Context Rot with λ-Calculus

Text Generation

Amartya Roy, Rasul Tutunov, Xiaotong Ji, et al.

ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models

Visual Question Answering

Thomas De Min, Subhankar Roy, Stéphane Lathuilière, et al.

TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation

Visual Question Answering

Yan Shu, Bin Ren, Zhitong Xiong, et al.

Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models

Video Generation

Songchun Zhang, Zeyue Xue, Siming Fu, et al.

HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning

Visual Question Answering

Shenzhi Wang, Shixuan Liu, Jing Zhou, et al.

Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer

Diffusion Model

Video Generation

Chenyang Gu, Mingyuan Zhang, Haozhe Xie, et al.
