
HyTRec: A Hybrid Temporal-Aware Attention Architecture for Long Behavior Sequential Recommendation

Lei Xin Yuhao Zheng Ke Cheng Changjiang Jiang Zifan Zhang Fanhu Zeng

Abstract

Modeling long sequences of user behaviors has emerged as a critical frontier in generative recommendation. However, existing solutions face a dilemma: linear attention mechanisms achieve efficiency at the cost of retrieval precision due to limited state capacity, while softmax attention suffers from prohibitive computational overhead. To address this challenge, we propose HyTRec, a model featuring a Hybrid Attention architecture that explicitly decouples long-term stable preferences from short-term intent spikes. By assigning massive historical sequences to a linear attention branch and reserving a specialized softmax attention branch for recent interactions, our approach restores precise retrieval capabilities within industrial-scale contexts involving ten thousand interactions. To mitigate the lag in capturing rapid interest drifts within the linear layers, we further design a Temporal-Aware Delta Network (TADN) that dynamically upweights fresh behavioral signals while suppressing historical noise. Empirical results on industrial-scale datasets confirm the superiority of our model: it maintains linear inference speed and outperforms strong baselines, notably delivering an over 8% improvement in Hit Rate for users with ultra-long sequences.

One-sentence Summary

Researchers from Dewu, Wuhan University, USTC, and Beihang propose HyTRec, a hybrid attention model that decouples long-term preferences from short-term intent using linear and softmax branches, enhanced by TADN for dynamic signal weighting, achieving 8%+ Hit Rate gains on ultra-long sequences with linear inference speed.

Key Contributions

  • HyTRec introduces a Hybrid Attention architecture that assigns long-term historical sequences to linear attention and recent interactions to softmax attention, resolving the efficiency-precision trade-off while handling sequences of up to ten thousand interactions.
  • The Temporal-Aware Delta Network (TADN) dynamically amplifies fresh behavioral signals using an exponential gating mechanism, reducing lag in capturing rapid interest drifts and suppressing noise from outdated interactions.
  • Evaluated on industrial-scale e-commerce datasets, HyTRec achieves over 8% higher Hit Rate for users with ultra-long sequences while preserving linear inference speed, outperforming strong baselines in both efficiency and accuracy.

Introduction

The authors address the growing need to model ultra-long user behavior sequences in generative recommendation systems, where capturing both long-term preferences and short-term intent shifts is critical for accurate next-item prediction. Prior approaches face a trade-off: linear attention models scale efficiently but lose retrieval precision, while softmax attention preserves fidelity at prohibitive computational cost, and neither handles rapid interest drift well. HyTRec addresses this by introducing a hybrid attention architecture that routes long-term history through linear attention and recent interactions through softmax attention, preserving linear complexity while restoring precision. They further propose the Temporal-Aware Delta Network to dynamically emphasize fresh signals and suppress stale noise, improving responsiveness to intent changes. Empirically, HyTRec outperforms baselines by over 8% in Hit Rate for users with long sequences, without sacrificing inference speed.

Dataset

  • The authors use publicly available datasets with no sensitive information, focusing solely on research with no commercial intent.
  • Data is merged across partitions by user ID to reconstruct full behavior histories from registration, enabling richer long-sequence modeling.
  • User behavior sequences are structured using ad attribution IDs as numerical identifiers, tracking states across funnel stages (click, redirect, activation, wake-up, product click, stay, collection, add-to-cart, payment).
  • Behaviors are sorted by correlation strength with payment to build hierarchical long-cycle sequences, prioritizing signals most predictive of conversion.
  • The dataset excludes or handles problematic cases: cold-start users (sparse/no interactions), inactive long-term users (old/limited records), and scalper accounts (high interaction volume but low business value) to avoid misleading model training.
  • These filtering and structuring strategies ensure the training data supports efficient, accurate modeling of long user sequences without being skewed by edge cases.

Method

The authors leverage a dual-branch architecture to model long user behavior sequences, explicitly decoupling short-term intent spikes from long-term stable preferences. This stratification begins with sequence decomposition: the full historical sequence $S_u$ is split into a short-term subsequence $S_u^{short}$ of fixed length $K$, capturing recent behaviors, and a long-term subsequence $S_u^{long}$ of length $n - K$, encoding stable consumption patterns. Each subsequence is processed independently through dedicated branches before fusion.
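The decomposition step can be sketched in a few lines. This is a minimal illustration of the split rule described above; the function name and the choice of `K` are assumptions, not code from the paper.

```python
# Sketch of the sequence decomposition: the most recent K interactions
# feed the short-term branch, the remaining prefix feeds the long-term
# branch. Names (split_sequence, K) are illustrative.
def split_sequence(seq, K):
    """Split a user's full history (oldest -> newest) into a long-term
    prefix of length n - K and a short-term window of length K."""
    long_term = seq[:-K] if len(seq) > K else []
    short_term = seq[-K:]
    return long_term, short_term

history = list(range(1, 13))               # 12 interactions, oldest -> newest
long_part, short_part = split_sequence(history, K=4)
print(long_part)   # [1, 2, 3, 4, 5, 6, 7, 8]
print(short_part)  # [9, 10, 11, 12]
```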

The short-term branch employs standard multi-head self-attention (MHSA) to preserve fine-grained temporal dynamics and ensure high precision for recent interactions. In contrast, the long-term branch is built upon a novel hybrid attention architecture designed to break the $O(n^2)$ complexity bottleneck while retaining global context awareness. This branch consists of $N$ encoder layers, predominantly using the proposed Temporal-Aware Delta Network (TADN) for linear complexity, with sparse interleaving of standard softmax attention layers (e.g., at a 7:1 ratio) to maintain retrieval fidelity.
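The interleaving pattern of the long-term branch can be made concrete with a small schedule builder. This is a hypothetical sketch, assuming the ratio means "seven TADN layers per softmax layer"; the function and layer labels are illustrative, not the paper's implementation.

```python
# Hypothetical layer schedule for the long-term branch: mostly linear
# TADN layers, with a full softmax attention layer interleaved at a
# fixed ratio (e.g., 7:1 as mentioned in the text).
def build_layer_schedule(n_layers, linear_per_softmax=7):
    layers = []
    for i in range(n_layers):
        # every (linear_per_softmax + 1)-th layer is softmax attention
        if (i + 1) % (linear_per_softmax + 1) == 0:
            layers.append("softmax")
        else:
            layers.append("tadn")
    return layers

schedule = build_layer_schedule(16)
print(schedule)  # 7x tadn, softmax, 7x tadn, softmax
```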

Refer to the framework diagram for a visual representation of this dual-path processing and fusion mechanism.

At the core of the long-term branch is TADN, which introduces a Temporal-Aware Gating Mechanism to dynamically weight historical behaviors based on their temporal proximity to the target action. The temporal decay factor $\tau_t$ quantifies relevance:

$$\tau_t = \exp\left(-\frac{t_{\mathrm{current}} - t_{\mathrm{behavior}}}{T}\right),$$

where $T$ is the decay period. This factor is fused with feature similarity to generate dynamic gating weights $g_t$:

$$g_t = \alpha \cdot \left[\sigma\left(\mathbf{W}_g \cdot \mathrm{Concat}(\mathbf{h}_t, \Delta\mathbf{h}_t) + \mathbf{b}\right) \odot \tau_t\right] + (1 - \alpha) \cdot g_{\mathrm{static}},$$

where $\Delta\mathbf{h}_t = \mathbf{h}_t - \bar{\mathbf{h}}$ represents short-term deviations, and $g_{\mathrm{static}}$ encodes long-term preferences. The fused representation $\tilde{\mathbf{h}}_t$ is then computed as:

$$\tilde{\mathbf{h}}_t = g_t \odot \Delta\mathbf{h}_t + (\mathbf{1} - g_t) \odot \mathbf{h}_t.$$
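The decay, gate, and fusion steps above can be traced numerically. The following numpy sketch is illustrative only: the dimensionality, $\mathbf{W}_g$, $\mathbf{b}$, $\alpha$, $g_{\mathrm{static}}$, and the decay period $T$ are assumed values, not the paper's parameters.

```python
import numpy as np

# Toy walk-through of the temporal-aware gate and fused representation.
# All constants and shapes below are assumptions for illustration.
rng = np.random.default_rng(0)
d = 4

# Temporal decay: a behavior 3 time units old, decay period T = 7.
T = 7.0
tau_t = np.exp(-3.0 / T)

h_t = rng.standard_normal(d)      # current hidden state h_t
h_bar = rng.standard_normal(d)    # running mean of hidden states
delta_h = h_t - h_bar             # short-term deviation Delta h_t

alpha, g_static = 0.7, 0.5
W_g = rng.standard_normal((d, 2 * d))
b = np.zeros(d)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Dynamic gate: similarity-driven sigmoid modulated by temporal decay,
# blended with a static long-term preference gate.
g_t = alpha * (sigmoid(W_g @ np.concatenate([h_t, delta_h]) + b) * tau_t) \
      + (1 - alpha) * g_static

# Fused state: the gate trades off recent deviation vs. the raw state.
h_tilde = g_t * delta_h + (1 - g_t) * h_t
print(h_tilde.shape)  # (4,)
```

Note that because the sigmoid output and $\tau_t$ both lie in $(0, 1)$, the resulting gate stays in $(0, 1)$ as well, so the fusion is always a convex-style mix of the two terms.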

This gating mechanism is integrated into a linear attention recurrence via the state update rule:

$$\mathbf{S}_t = \mathbf{S}_{t-1}\left(\mathbf{I} - g_t \beta_t \mathbf{k}_t \mathbf{k}_t^{\top}\right) + \beta_t \mathbf{v}_t \mathbf{k}_t^{\top},$$

which expands into a linear attention formulation with a temporal-aware decay mask $\mathcal{D}(t,i)$:

$$\mathbf{o}_t = \sum_{i=1}^{t} \beta_i \left(\mathbf{v}_i \mathbf{k}_i^{\top}\right) \mathcal{D}(t,i)\, \mathbf{q}_t,$$

where $\mathcal{D}(t,i) = \prod_{j=i+1}^{t}\left(\mathbf{I} - g_j \beta_j \mathbf{k}_j \mathbf{k}_j^{\top}\right)$. The inclusion of $\tau_j$ in $g_j$ ensures that recent interactions are prioritized while long-term patterns are preserved.
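The recurrent form of this update can be rolled out directly. The sketch below is a minimal numpy illustration of the gated delta-rule state update and readout, assuming fixed scalar values for $g_t$ and $\beta_t$ and toy dimensions; it is not the paper's implementation.

```python
import numpy as np

# Toy rollout of the TADN recurrence:
#   S_t = S_{t-1} (I - g_t * beta_t * k_t k_t^T) + beta_t * v_t k_t^T
# followed by the readout o_t = S_t q_t.
rng = np.random.default_rng(1)
d, n_steps = 4, 6
S = np.zeros((d, d))                  # associative (value x key) state
I = np.eye(d)

for t in range(n_steps):
    k = rng.standard_normal(d)
    k /= np.linalg.norm(k)            # unit-norm key
    v = rng.standard_normal(d)
    beta = 0.9                        # write strength (assumed constant)
    g = 0.8                           # temporal-aware gate (assumed constant)
    # Partially erase the old association along direction k,
    # then write the new key-value association.
    S = S @ (I - g * beta * np.outer(k, k)) + beta * np.outer(v, k)

q = rng.standard_normal(d)            # query for the current target
o = S @ q                             # output o_t
print(o.shape)  # (4,)
```

Unrolling the loop recovers the summation form: each past write is attenuated by the product of the erase terms applied after it, which is exactly the decay mask $\mathcal{D}(t,i)$.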

The outputs of both branches are fused and passed through a linear layer and softmax to generate the final recommendation prediction. This hybrid design enables efficient processing of long sequences without sacrificing the semantic richness required for accurate intent modeling.

Experiment

  • HyTRec outperforms state-of-the-art baselines in modeling long user behavior sequences, achieving top or near-top performance across multiple datasets, particularly excelling in capturing user interests and adapting to sparse or scattered behavioral patterns.
  • The model scales efficiently to ultra-long sequences due to its linear attention mechanism, maintaining high throughput even at sequence lengths up to 12k, while transformer-based models suffer sharp efficiency drops beyond 1k.
  • Ablation studies confirm that both the TADN branch (for long-term dependencies) and the short-term attention branch (for immediate interest drift) are essential, with their combination yielding the best overall performance through complementary modeling.
  • A 3:1 hybrid attention ratio between linear and short-term attention provides the optimal balance between recommendation accuracy and inference efficiency, outperforming other ratios in both performance and latency trade-offs.
  • HyTRec demonstrates robustness in cold-start and sparse interaction scenarios by leveraging similar user behavior patterns, showing strong generalization and adaptability to challenging user cases.
  • Additional parameter studies indicate that 2 attention heads and 4 experts deliver optimal performance-efficiency trade-offs, aligning with user group heterogeneity and computational constraints.
  • The model also shows strong cross-domain transfer capability, suggesting structural advantages in handling domain shifts and motivating future work on broader generalization and noise resilience.

The authors use HyTRec to address long user behavior modeling by combining linear and short-term attention mechanisms, achieving competitive or superior performance across multiple datasets compared to transformer and linear attention baselines. Results show that HyTRec maintains high efficiency at ultra-long sequence lengths while delivering strong recommendation accuracy, particularly in capturing both long-term patterns and short-term interest drift. Ablation and ratio studies confirm that the hybrid architecture’s dual-branch design and 3:1 attention ratio offer the best balance between performance and computational cost.

The authors use ablation experiments to isolate the contributions of HyTRec’s two key components: the TADN branch for long-term sequence modeling and the short-term attention branch for capturing immediate interest drift. Results show that each component independently improves performance over the baseline, but their combination yields the highest scores across all metrics, confirming their complementary roles. This design enables the model to effectively balance long-range dependency modeling with responsiveness to recent user behavior.

The authors evaluate the impact of varying the number of experts in their recommendation model and find that performance peaks at 4 experts, with both recommendation accuracy and inference efficiency declining as the number increases to 6 or 8. Results show that higher expert counts introduce unnecessary computational overhead without meaningful gains in key metrics like H@500, NDCG@500, or AUC. This supports the design choice of aligning expert count with user group heterogeneity for optimal balance between effectiveness and efficiency.

The authors evaluate how different hybrid attention ratios affect recommendation accuracy and inference latency, finding that a 3:1 ratio delivers the best balance between performance gains and computational cost. While higher ratios like 6:1 achieve marginally better metrics, they incur significantly increased latency, making them less practical. Results confirm that moderate hybrid configurations optimize both effectiveness and efficiency in long-sequence modeling.

The authors evaluate HyTRec on challenging user scenarios including cold-start new users and silent old users, showing that the model maintains strong retrieval and ranking performance across both cases. Results indicate consistent AUC scores above 0.86 and competitive H@500 values, demonstrating the model’s robustness in handling sparse interaction data through effective user similarity augmentation and generalization.

