SemanticMoments: Training-Free Motion Similarity via Third Moment Features

Saar Huberman Kfir Goldberg Or Patashnik Sagie Benaim Ron Mokady

Abstract

Retrieving videos based on semantic motion is a fundamental, yet unsolved, problem. Existing video representation approaches overly rely on static appearance and scene context rather than motion dynamics, a bias inherited from their training data and objectives. Conversely, traditional motion-centric inputs like optical flow lack the semantic grounding needed to understand high-level motion. To demonstrate this inherent bias, we introduce the SimMotion benchmarks, combining controlled synthetic data with a new human-annotated real-world dataset. We show that existing models perform poorly on these benchmarks, often failing to disentangle motion from appearance. To address this gap, we propose SemanticMoments, a simple, training-free method that computes temporal statistics (specifically, higher-order moments) over features from pre-trained semantic models. Across our benchmarks, SemanticMoments consistently outperforms existing RGB, flow, and text-supervised methods. This demonstrates that temporal statistics in a semantic feature space provide a scalable and perceptually grounded foundation for motion-centric video understanding.

One-sentence Summary

Researchers from BRIA AI, Tel Aviv University, and Hebrew University propose SemanticMoments, a training-free method using higher-order temporal statistics over semantic features to disentangle motion from appearance, enabling accurate motion-centric video retrieval where prior models fail.

Key Contributions

  • We identify a critical bias in existing video representations that prioritize static appearance over motion dynamics, revealing their failure to disentangle motion from visual context in retrieval tasks.
  • We introduce the SimMotion benchmarks—combining synthetic and human-annotated real-world datasets—to rigorously evaluate motion-centric video similarity and expose the limitations of current methods.
  • We propose SemanticMoments, a training-free method that computes higher-order temporal statistics over semantic features, achieving state-of-the-art performance on motion retrieval without requiring optical flow or labeled motion data.

Introduction

The authors tackle the challenge of retrieving videos based on semantic motion rather than static appearance or scene context — a capability critical for applications like motion-aware video editing, generative modeling, and dataset curation. Prior methods, whether supervised, self-supervised, or flow-based, inherit biases that prioritize visual consistency over temporal dynamics, often failing to disentangle motion from appearance even when motion is identical. To address this, they introduce SemanticMoments, a training-free approach that computes higher-order temporal moments (variance, skewness) over patch-level embeddings from pretrained semantic models like DINO. This yields compact, motion-sensitive descriptors that outperform existing methods across synthetic and real-world benchmarks, without requiring optical flow, labeled data, or additional training.

Dataset

  • The authors use two benchmarks — SimMotion-Synthetic and SimMotion-Real — to evaluate motion similarity beyond categorical labels, focusing on structural and dynamic properties through relative comparisons.

  • SimMotion-Synthetic contains 250 triplets (750 videos total), each with a reference, a motion-preserving positive, and a hard negative with matching appearance but different motion. Videos are 5 seconds long, sampled at 16 fps, 512x512 resolution. Triplets are grouped into five categories: Static Object, Dynamic Object, Dynamic Appearance, Scene Style, and View — each isolating a specific visual variation while preserving motion.

  • Videos are generated using GPT-4.1 to craft prompts, then synthesized via Gemini2.5-Flash for images and WAN 2.2 for temporally synchronized videos. Hard negatives are generated from the same base image with a different motion prompt. This ensures precise motion alignment while varying appearance, subject, or viewpoint.

  • SimMotion-Real contains 40 reference-positive-negative triplets curated from real-world videos. Positives are sourced from Pexels via text queries and ranked by human annotators for motion similarity, ignoring appearance. Negatives are drawn from the same source video but show different motions, plus random clips from Kinetics-400 to increase retrieval difficulty.

  • Both benchmarks are used to test whether models capture motion structure rather than visual cues; a sketch of the triplet retrieval protocol follows this list. The authors extract patch-wise features per frame (e.g., using DINO), summarize them over time via the first three temporal moments (mean, variance, skewness), and then spatially aggregate them to form global motion-centric embeddings for evaluation.
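To make the evaluation protocol concrete, here is a minimal sketch of triplet-based retrieval accuracy: a triplet counts as correct when the reference embedding is closer to the motion-preserving positive than to the hard negative. The function names (`triplet_accuracy`, `embed_video`) and the use of cosine similarity are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two 1-D embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def triplet_accuracy(triplets, embed_video) -> float:
    """Fraction of triplets where the reference is closer to the
    motion-preserving positive than to the hard negative.

    `triplets` is an iterable of (reference, positive, negative) videos and
    `embed_video` maps a video to a 1-D embedding (e.g., SemanticMoments).
    Both are placeholders for illustration.
    """
    correct = 0
    for ref, pos, neg in triplets:
        e_ref, e_pos, e_neg = map(embed_video, (ref, pos, neg))
        correct += cosine_sim(e_ref, e_pos) > cosine_sim(e_ref, e_neg)
    return correct / len(triplets)
```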

Method

The authors leverage a parametric moment-based representation, termed $\mathcal{M}^{+}$, to encode temporal dynamics in video data by preserving structured statistical descriptors rather than collapsing temporal information into a single pooled vector. The framework begins with a patch-wise embedding stage, where each frame of a video is processed through a pretrained backbone (e.g., DINOv2, Video-MAE, or VideoPrism) to extract spatial patch features. These features are then used to compute temporal moments independently for each spatial patch across the video's temporal axis.

Refer to the framework diagram: the process starts with a sequence of video frames fed into a patch-wise embedder, which outputs a spatiotemporal tensor of patch features. For each patch $p$, the first temporal moment $\mu_p^{(1)}$ is computed as the mean feature vector across time, capturing average appearance. Higher-order moments $\mu_p^{(k)}$ for $k > 1$ are computed as central moments, encoding the magnitude of variation ($k = 2$) and directional asymmetry ($k = 3$) of temporal change. These patch-wise moments are then spatially aggregated via averaging to produce global moment descriptors $M^{(k)} \in \mathbb{R}^d$ for each order $k$.

The final video-level representation $\phi_{\mathrm{video}}$ is constructed by concatenating the first three moment descriptors (mean, variance, and skewness), each scaled by a learnable or fixed weight $\alpha_k$. In practice, the authors fix $\alpha_1 = 1$, $\alpha_2 = 8$, and $\alpha_3 = 4$ to emphasize motion-related statistics. The resulting embedding $\phi_{\mathrm{video}} \in \mathbb{R}^{3d}$ is computed without additional training, relying solely on pretrained backbones and moment aggregation, enabling efficient and scalable deployment on large video datasets.
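The computation above can be summarized in a short sketch, assuming patch features have already been extracted as a (T, P, d) tensor by running each sampled frame through a frozen backbone such as DINOv2. The weights (1, 8, 4) follow the paper, while the function name and the use of raw (unnormalized) central moments for the second- and third-order terms are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

def semantic_moments(patch_feats: np.ndarray,
                     alphas=(1.0, 8.0, 4.0)) -> np.ndarray:
    """Training-free motion descriptor from per-frame patch features.

    patch_feats: array of shape (T, P, d) -- T frames, P patches per frame,
    d-dimensional features from a pretrained backbone (e.g., DINOv2).
    Returns a 3d-dimensional embedding: the weighted concatenation of the
    temporal mean, second, and third central moments, averaged over patches.
    """
    mu = patch_feats.mean(axis=0)           # (P, d) first moment: mean appearance
    centered = patch_feats - mu             # temporal deviations per patch
    var = (centered ** 2).mean(axis=0)      # (P, d) second central moment
    skew = (centered ** 3).mean(axis=0)     # (P, d) third central moment

    # Spatially aggregate each per-patch moment, then scale and concatenate.
    moments = [m.mean(axis=0) for m in (mu, var, skew)]   # three (d,) vectors
    return np.concatenate([a * m for a, m in zip(alphas, moments)])

# Example: 32 uniformly sampled frames, 256 patches, 768-dim features.
feats = np.random.randn(32, 256, 768).astype(np.float32)
embedding = semantic_moments(feats)         # shape: (3 * 768,) = (2304,)
```

The relative weights down-weight the appearance-dominated mean and emphasize the motion-sensitive higher-order terms, consistent with the fixed $\alpha$ values reported above.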

Experiment

  • Existing self-supervised and appearance-focused video embeddings fail to consistently capture fine-grained motion similarity, often conflating motion with style or background.
  • A controlled synthetic benchmark reveals that motion-preserving variants are poorly clustered by prior methods, while the proposed SemanticMoments approach successfully isolates motion by summarizing temporal statistics over semantic features.
  • On real-world motion retrieval tasks, SemanticMoments outperforms baselines—including flow-based, CLIP-based, and action-trained models—by maintaining robustness to appearance shifts, camera motion, and unsynchronized timing.
  • Applied to gesture classification, SemanticMoments enhances separability in embedding space without additional training, improving kNN accuracy across multiple backbones.
  • Ablation studies confirm that higher-order moments, patch-level granularity, and additive fusion yield optimal motion alignment, with performance peaking at 32 uniformly sampled frames.
  • Limitations include challenges with subtle or absence-defined motions, dependency on upstream backbone quality, and inability to adapt to rare motion types without targeted training.

The authors evaluate motion-sensitive video representations across synthetic benchmarks that isolate appearance variations while preserving motion. Results show that existing methods—including CLIP-based, RGB-supervised, and optical-flow models—struggle to consistently capture motion similarity under style, object, or viewpoint changes, while their proposed SemanticMoments approach achieves superior or competitive performance by summarizing temporal dynamics through higher-order statistics over semantic features. This indicates that motion-aware embeddings can be effectively derived without explicit motion supervision or training, by leveraging temporal moments of semantic frame representations.

The authors evaluate how temporal sampling density affects motion retrieval performance, finding that accuracy improves as frame count increases up to 32 frames, beyond which performance plateaus or declines. This suggests that moderate temporal resolution captures sufficient motion dynamics without introducing redundancy or noise.
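For illustration, uniform temporal sampling of the kind evaluated here can be implemented as a small helper (the function name and defaults are assumptions):

```python
import numpy as np

def uniform_frame_indices(num_video_frames: int, num_samples: int = 32) -> np.ndarray:
    # Evenly spaced frame indices spanning the whole clip.
    return np.linspace(0, num_video_frames - 1, num_samples).round().astype(int)
```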

The authors evaluate different configurations of their moment-based representation on SimMotion-Real, finding that combining multiple temporal moments (mean, variance, skewness) at the patch level with concatenation yields the highest retrieval accuracy. Results show that higher-order moments and localized patch features significantly improve motion alignment over single-moment or frame-level approaches. The best-performing variant, DINO(1,8,4)-patch-concat, achieves 42.50% accuracy, demonstrating the value of richer temporal statistics and spatial granularity in capturing motion similarity.

The authors use SemanticMoments to enhance motion-sensitive video representations by applying temporal statistics to semantic features from pretrained encoders. Results show that this approach consistently outperforms baseline methods across multiple evaluation metrics, particularly when using V-JEPA2 as the backbone, indicating stronger gesture-level separability without additional training. The gains highlight the effectiveness of higher-order temporal moments in capturing motion structure beyond appearance or coarse semantics.
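Below is a minimal sketch of a kNN evaluation over precomputed SemanticMoments embeddings, assuming scikit-learn and cosine distance; the variable names and the choice of k are placeholders rather than the authors' exact protocol.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# train_embs / test_embs: SemanticMoments embeddings of shape (N, 3d), computed
# as in the method sketch; train_labels / test_labels: gesture class labels.
def knn_accuracy(train_embs, train_labels, test_embs, test_labels, k=5) -> float:
    clf = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    clf.fit(train_embs, train_labels)
    return float((clf.predict(test_embs) == np.asarray(test_labels)).mean())
```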

The authors evaluate motion similarity retrieval using a synthetic benchmark that isolates motion from appearance variations. Results show that existing methods, including CLIP-based, flow-based, and self-supervised models, struggle to consistently capture motion equivalence across style changes. In contrast, their SemanticMoments approach, which aggregates temporal statistics over semantic features, achieves significantly higher retrieval accuracy, demonstrating stronger robustness to appearance shifts while preserving motion structure.

