Query as Anchor: Scenario-Adaptive User Representation via Large Language Model
Abstract
Industrial-scale user representation learning requires balancing robust universality with acute task-sensitivity. However, existing paradigms primarily yield static, task-agnostic embeddings that struggle to reconcile the divergent requirements of downstream scenarios within unified vector spaces. Furthermore, heterogeneous multi-source data introduces inherent noise and modality conflicts, degrading representation quality. We propose Query-as-Anchor, a framework shifting user modeling from static encoding to dynamic, query-aware synthesis. To empower Large Language Models (LLMs) with deep user understanding, we first construct UserU, an industrial-scale pre-training dataset that aligns multi-modal behavioral sequences with user understanding semantics; our Q-Anchor Embedding architecture then integrates hierarchical coarse-to-fine encoders into dual-tower LLMs via joint contrastive-autoregressive optimization for query-aware user representation. To bridge the gap between general pre-training and specialized business logic, we further introduce Cluster-based Soft Prompt Tuning to enforce discriminative latent structures, effectively aligning model attention with scenario-specific modalities. For deployment, anchoring queries at sequence termini enables KV-cache-accelerated inference with negligible incremental latency. Evaluations on 10 Alipay industrial benchmarks show consistent SOTA performance, strong scalability, and efficient deployment. Large-scale online A/B testing in Alipay's production system across two real-world scenarios further validates its practical effectiveness. Our code is prepared for public release and will be available at: https://github.com/JhCircle/Q-Anchor.
One-sentence Summary
Researchers from Ant Group and Zhejiang University propose Q-Anchor, a dynamic query-aware user representation framework that outperforms static embeddings by integrating multi-modal behavioral data with LLMs via hierarchical encoders and soft prompt tuning, enabling efficient, scalable deployment validated across 10 Alipay benchmarks and real-world A/B tests.
Key Contributions
- We introduce UserU, an industrial-scale pre-training dataset that aligns multi-modal user behaviors with semantic understanding through future behavior prediction and QA supervision, enabling LLMs to learn robust, generalizable user representations despite sparse and heterogeneous data.
- We propose Query-as-Anchor, a dynamic framework that generates scenario-specific user embeddings by conditioning LLM-based dual-tower encoders on natural language queries, allowing a single model to adaptively serve diverse downstream tasks without retraining.
- We implement Cluster-based Soft Prompt Tuning and KV-cache-accelerated inference to enforce discriminative latent structures and maintain low-latency deployment, validated by SOTA results across 10 Alipay benchmarks and online A/B tests in production.
Introduction
The authors leverage large language models to address the limitations of static, task-agnostic user embeddings that struggle with cross-scenario adaptability and noisy, sparse behavioral data. Prior methods either produce fixed representations ill-suited for diverse downstream tasks or fail to bridge the modality gap between language-centric LLM pretraining and symbolic user logs. Their main contribution is Query-as-Anchor, a framework that dynamically generates scenario-specific user embeddings by conditioning LLMs on natural language queries, supported by UserU—a large-scale pretraining dataset with behavior prediction and QA supervision—and optimized with cluster-based soft prompt tuning for efficient, low-latency inference. This enables unified, adaptable user modeling across marketing, risk, and engagement tasks without retraining.
Dataset

- The authors use the UserU Pretraining Dataset to improve user embedding performance across diverse tasks by combining behavior prediction and synthetic query-answer data.
- The dataset has two main subsets: D_future (behavior prediction) and D_uqa (LLM-generated user queries and answers).
- D_future contains N samples derived from three-month user behavior profiles, aggregated into future action summaries using frequency- and diversity-aware selection; each sample pairs a user profile + fixed template query with a predicted future action.
- D_uqa contains M synthetic samples generated via LLM: for each user profile, the model retrieves 10 relevant life topics, generates grounded queries, and produces answers with a post-generation reflection step to ensure faithfulness and context validity.
- Input data includes heterogeneous sources like PayBill transactions, Mini Program logs, App lists, and search queries, encoded as hierarchical user tokens followed by an optional instruction and a <USER_EMB> token to signal embedding extraction (see the input-assembly sketch after this list).
- The authors train on a mixture of both subsets (no explicit mixing ratio is specified) and emphasize decoupling pretraining from downstream tasks to improve generalization.
- No cropping is applied; instead, user profiles are contextualized via concatenation with queries, and all data is anonymized or synthetic for privacy.
- Downstream evaluation uses 10 real-world binary classification tasks with ~500k test samples each, where labels are assigned based on whether a user triggers a target event within a prediction window.
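To make the sample construction concrete, the following Python sketch illustrates how D_future and D_uqa samples and the anchor-side input might be assembled. The field names, template query, and helper functions are illustrative assumptions, not the authors' released pipeline.

```python
# A minimal sketch of pretraining sample assembly; all names and templates are hypothetical.
from dataclasses import dataclass
from typing import List

USER_EMB_TOKEN = "<USER_EMB>"  # sentinel marking where the embedding is read out

@dataclass
class PretrainSample:
    profile_tokens: List[str]   # hierarchical user tokens (user / modality / event levels)
    query: str                  # natural-language anchor query
    answer: str                 # target text (future action summary or QA answer)

def build_future_sample(profile_tokens: List[str], future_action: str) -> PretrainSample:
    # D_future: fixed template query paired with a frequency/diversity-selected future action.
    query = "What is this user most likely to do next?"  # assumed template wording
    return PretrainSample(profile_tokens, query, future_action)

def build_uqa_sample(profile_tokens: List[str], topic_query: str, answer: str) -> PretrainSample:
    # D_uqa: LLM-generated query/answer grounded in retrieved life topics, kept only if a
    # post-generation reflection step judges the answer faithful to the profile.
    return PretrainSample(profile_tokens, topic_query, answer)

def to_anchor_input(sample: PretrainSample) -> str:
    # Profile tokens first, the query as a trailing anchor, then the embedding sentinel.
    return " ".join(sample.profile_tokens) + " " + sample.query + " " + USER_EMB_TOKEN
```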
Method
The authors leverage a hierarchical, query-driven framework called Q-Anchor Embedding to unify multi-modal user behavior into a semantically grounded, task-adaptive representation space. The architecture is structured around three core components: a coarse-to-fine user encoder, a dual-tower pretraining mechanism, and a soft prompt tuning module for downstream adaptation—all designed to operate efficiently at industrial scale.
The hierarchical user encoder processes raw behavioral signals from six modalities—Bill, Mini, SPM, App, Search, and Tabular—each represented as sequences of events over a 90-day window. For each modality $m$, event-level embeddings $h_{m,t}$ are first projected via modality-specific MLPs into refined event tokens $z_{m,t}^{(\mathrm{evt})}$. These are then aggregated via mean-pooling into modality-level summaries $\bar{z}_{m}^{(\mathrm{evt})}$, which are further transformed by a shared modal adapter into $z_{m}^{(\mathrm{mdl})}$. A global user representation $z^{(\mathrm{usr})}$ is derived by consolidating all modality embeddings through a dedicated user adapter. The final input token sequence $e_i$ is constructed by concatenating representations across all three levels: user, modality, and event, enabling the LLM to dynamically attend to either granular actions or high-level summaries based on the query context.
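A minimal PyTorch sketch of this coarse-to-fine encoder is given below. The hidden sizes, module names, and adapter designs are assumptions chosen for illustration; only the overall event-to-modality-to-user flow follows the description above.

```python
# Illustrative sketch of the coarse-to-fine user encoder (dimensions are assumptions).
import torch
import torch.nn as nn

MODALITIES = ["bill", "mini", "spm", "app", "search", "tabular"]

class HierarchicalUserEncoder(nn.Module):
    def __init__(self, raw_dim: int = 128, d_model: int = 896):
        super().__init__()
        # Modality-specific MLPs: raw event features h_{m,t} -> event tokens z_{m,t}^(evt)
        self.event_mlps = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(raw_dim, d_model), nn.GELU(), nn.Linear(d_model, d_model))
            for m in MODALITIES
        })
        # Shared modal adapter: pooled summary \bar z_m^(evt) -> z_m^(mdl)
        self.modal_adapter = nn.Linear(d_model, d_model)
        # User adapter: consolidated modality embeddings -> global z^(usr)
        self.user_adapter = nn.Linear(d_model * len(MODALITIES), d_model)

    def forward(self, events: dict) -> torch.Tensor:
        # events[m]: (T_m, raw_dim) raw event features for modality m over the 90-day window
        event_tokens, modality_tokens = [], []
        for m in MODALITIES:
            z_evt = self.event_mlps[m](events[m])          # (T_m, d_model) event tokens
            z_bar = z_evt.mean(dim=0)                      # mean-pool to modality summary
            modality_tokens.append(self.modal_adapter(z_bar))
            event_tokens.append(z_evt)
        z_usr = self.user_adapter(torch.cat(modality_tokens, dim=-1))
        # Input sequence e_i: user token, then modality tokens, then event tokens
        return torch.cat([z_usr.unsqueeze(0), torch.stack(modality_tokens), *event_tokens], dim=0)
```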

The pretraining phase employs a dual-tower architecture that aligns user behavior with semantic intent. The Anchor Tower ingests the hierarchical user tokens $e_i$ and appends a natural language query $q_i$ as a trailing anchor, producing a query-conditioned embedding $u_{i,q} = \mathrm{LLM}_{\mathrm{anc}}(e_i, q_i)$. The Semantic Tower encodes the target answer $a_i$ into $v_{a_i} = \mathrm{LLM}_{\mathrm{sem}}(a_i)$, using the same LLM backbone to ensure alignment in a shared latent space. Training is driven by a joint objective: a contrastive loss $\mathcal{L}_{\mathrm{cl}}$ that pulls positive pairs $(u_{i,q}, v_{a_i})$ together while pushing apart negatives using a margin-based mask $m_{ij}$, and a generative Next-Token Prediction loss $\mathcal{L}_{\mathrm{ntp}}$ that reconstructs the answer sequence autoregressively. The total loss $\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{cl}} + \mathcal{L}_{\mathrm{ntp}}$ ensures embeddings are both discriminative and semantically dense.
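The joint objective can be sketched as follows, assuming in-batch negatives, cosine similarity with a temperature, and a simple margin rule for the mask $m_{ij}$; the exact masking rule, temperature, and loss weighting are assumptions rather than the paper's published formulation.

```python
# Sketch of the joint contrastive + next-token-prediction objective (assumed hyperparameters).
import torch
import torch.nn.functional as F

def joint_loss(u, v, lm_logits, answer_ids, tau: float = 0.05, margin: float = 0.1):
    # u: (B, d) query-conditioned user embeddings from the Anchor Tower
    # v: (B, d) answer embeddings from the Semantic Tower
    # lm_logits: (B, T, V) decoder logits assumed already aligned to answer_ids (B, T)
    u, v = F.normalize(u, dim=-1), F.normalize(v, dim=-1)
    sim = u @ v.t() / tau                                   # (B, B) scaled similarity matrix
    pos = sim.diag()
    # Margin-based mask m_ij: drop negatives whose score is within `margin` of the positive.
    mask = (sim > (pos.unsqueeze(1) - margin)) & ~torch.eye(len(u), dtype=torch.bool, device=u.device)
    sim = sim.masked_fill(mask, float("-inf"))
    l_cl = F.cross_entropy(sim, torch.arange(len(u), device=u.device))
    # Generative NTP loss reconstructing the answer sequence autoregressively.
    l_ntp = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)), answer_ids.reshape(-1))
    return l_cl + l_ntp
```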
To adapt the pretrained model to downstream tasks without full fine-tuning, the authors introduce a soft prompt tuning mechanism. Learnable prompt tokens are inserted into the LLM’s input space to modulate the latent representation $u_{i,q}$, while class prototypes $\{p_k\}$ are optimized via a prototypical contrastive loss $\mathcal{L}_{\mathrm{pt}}$ that pulls user embeddings toward their assigned class centers. This enables task-specific alignment—such as distinguishing high-risk from low-risk users—while preserving the foundational multi-modal alignment learned during pretraining.
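A rough sketch of the cluster-based soft prompt tuning head is shown below, assuming a frozen backbone, a small number of learnable prompt tokens, and one prototype per class; all sizes and the temperature are illustrative.

```python
# Sketch of soft prompt tokens plus prototypical contrastive loss (backbone assumed frozen).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftPromptHead(nn.Module):
    def __init__(self, d_model: int = 896, n_prompt: int = 6, n_classes: int = 2, tau: float = 0.07):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_prompt, d_model) * 0.02)   # learnable prompt tokens
        self.prototypes = nn.Parameter(torch.randn(n_classes, d_model))     # class prototypes p_k
        self.tau = tau

    def prepend_prompt(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # Insert the prompt tokens into the frozen LLM's input embedding sequence.
        batch = input_embeds.size(0)
        return torch.cat([self.prompt.unsqueeze(0).expand(batch, -1, -1), input_embeds], dim=1)

    def prototypical_loss(self, u: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Pull each user embedding u_{i,q} toward its class prototype and away from the others.
        logits = F.normalize(u, dim=-1) @ F.normalize(self.prototypes, dim=-1).t() / self.tau
        return F.cross_entropy(logits, labels)
```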

For deployment, the system leverages a KV-cache optimization to decouple user encoding from query processing. The hierarchical prefix $e_i$ is encoded once and cached, allowing multiple downstream queries $\{q_1, \ldots, q_n\}$ to be processed sequentially with only incremental computation for the query tokens. This amortizes the cost to $O(L_{q_j})$ per query, enabling high-throughput, multi-scenario inference. Daily updates are performed incrementally: only modalities with new events are re-encoded, and their summary tokens are refreshed in a rolling buffer, ensuring fresh, bounded, and cost-efficient representations at scale.
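The prefix-caching pattern can be approximated with a standard Hugging Face causal LM as sketched below; the backbone checkpoint, tokenizer, and last-hidden-state read-out are assumptions about the deployment, not the authors' serving stack.

```python
# Sketch of encoding one cached user prefix and serving many queries against it.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B"  # hypothetical 0.5B backbone choice
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()
tok = AutoTokenizer.from_pretrained(MODEL_NAME)

@torch.no_grad()
def encode_with_cached_prefix(prefix_text: str, queries: list) -> list:
    # Encode the hierarchical user prefix e_i once and cache its key/value states.
    prefix_ids = tok(prefix_text, return_tensors="pt").input_ids
    prefix_out = model(prefix_ids, use_cache=True)
    embeddings = []
    for q in queries:
        q_ids = tok(" " + q, return_tensors="pt", add_special_tokens=False).input_ids
        # Copy the cache so every query starts from the same frozen prefix state.
        past = copy.deepcopy(prefix_out.past_key_values)
        attn = torch.ones(1, prefix_ids.size(1) + q_ids.size(1), dtype=torch.long)
        out = model(q_ids, past_key_values=past, attention_mask=attn,
                    use_cache=True, output_hidden_states=True)
        # Only the query tokens are newly computed: O(L_q) incremental cost per scenario.
        embeddings.append(out.hidden_states[-1][:, -1])
    return embeddings
```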
Experiment
- Q-Anchor consistently outperforms general-purpose and user-specific baselines across 10 real-world binary classification tasks, demonstrating superior AUC and KS scores, especially in Risk and Marketing domains, validating that representation alignment matters more than semantic capacity alone.
- The model generalizes robustly across Engagement, Risk, and Marketing domains without task-specific architectures, supporting a “one-model-for-many” paradigm enabled by query-conditioned anchoring.
- Pretraining on larger datasets yields steady gains, while scaling model size beyond 0.5B parameters shows no consistent improvement and can even regress, confirming that data scale matters more than parameter scale for embedding quality.
- Prompt tuning with just 6 learnable tokens and 500 steps delivers most of the performance gain, saturating quickly and enabling efficient, interpretable specialization via attention shifts toward scenario-relevant modalities.
- Ablations confirm contrastive alignment is essential for embedding structure, while modality and user tokens provide critical inductive bias; pretraining is foundational, not optional, for capturing behavioral priors.
- Online A/B tests show tangible business impact: improved cash-reserve outreach engagement and credit delinquency detection, validating real-world efficacy.
- Deployment at scale leverages shared prefix computation, enabling multi-scenario serving with minimal incremental cost per scenario.
- PCA and t-SNE visualizations confirm prompt tuning sharpens cluster separation, aligning representations with downstream decision boundaries without architectural changes.
The authors use a lightweight 0.5B LLM backbone with modality-specific encoders and contrastive pretraining to generate user representations, then apply soft prompt tuning for scenario adaptation. Results show that their Q-Anchor method consistently outperforms both general text embeddings and specialized user representation models across all 10 tasks, with prompt tuning delivering significant gains in both AUC and KS while maintaining efficiency. The approach proves robust across domains and scales effectively with data and prompt tokens, but not with model size, highlighting that representation quality depends more on training signal alignment than parameter count.

Results show that increasing the number of learnable prompt tokens from 1 to 6 consistently improves both AUC and KS across all evaluated scenarios, with performance plateauing beyond 6 tokens. The 6-token configuration achieves the highest average AUC and KS, indicating that minimal parameter updates are sufficient to specialize the universal representation for diverse downstream tasks. This efficiency supports scalable deployment, as further token increases yield diminishing returns and occasional performance degradation.

Q-Anchor embeddings outperform existing business-specific models in two key Alipay scenarios, with prompt tuning delivering additional gains through minimal parameter updates. Results show consistent improvements in both AUC and KS metrics, confirming that a universal representation combined with lightweight scenario conditioning can surpass handcrafted features while maintaining computational efficiency. The method proves effective even without task-specific architecture, supporting scalable deployment across diverse business objectives.

The authors use a 0.5B-parameter LLM backbone with fixed user representations and scale pretraining data from 20.5M to 102.4M samples, observing consistent gains in both AUC and KS across all 10 tasks as data increases. Results show that performance improves steadily with more pretraining data, with the highest average AUC (0.8105) and KS (0.5044) achieved at 50k steps, while larger model sizes do not yield consistent improvements under the same training budget. This indicates that for user representation learning, data scale contributes more reliably to performance than model scale.

The authors use a structured ablation study to isolate the contributions of modality tokens, training objectives, and prompt tuning in their Q-Anchor framework. Results show that removing modality or user tokens causes minor performance drops, while eliminating contrastive learning leads to the largest degradation, confirming its role in shaping the embedding space. Prompt tuning consistently boosts performance, but omitting pretraining causes a systemic collapse, underscoring that pretraining provides essential behavioral priors for effective downstream adaptation.
