InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem

Abstract

The rapid evolution of Large Language Models has catalyzed a surge in scientific idea production, yet this leap has not been matched by a comparable advance in idea evaluation. Scientific evaluation fundamentally requires knowledgeable grounding, collective deliberation, and multi-criteria decision-making. However, existing idea evaluation methods often suffer from narrow knowledge horizons, flattened evaluation dimensions, and the inherent bias of LLM-as-a-Judge. To address these issues, we regard idea evaluation as a knowledge-grounded, multi-perspective reasoning problem and introduce InnoEval, a deep innovation evaluation framework designed to emulate human-level idea assessment. We employ a heterogeneous deep knowledge search engine that retrieves and grounds dynamic evidence from diverse online sources. We further reach review consensus through an innovation review board whose reviewers have distinct academic backgrounds, enabling a decoupled, multi-dimensional evaluation across multiple metrics. We construct comprehensive datasets derived from authoritative peer-reviewed submissions to benchmark InnoEval. Experiments demonstrate that InnoEval consistently outperforms baselines on point-wise, pair-wise, and group-wise evaluation tasks, exhibiting judgment patterns and consensus highly aligned with those of human experts.

One-sentence Summary

Researchers from Zhejiang University and University of Amsterdam propose InnoEval, a knowledge-grounded multi-perspective framework using dynamic evidence retrieval and diverse reviewer boards to overcome LLM biases in scientific idea evaluation, achieving human-aligned consensus across benchmark tasks.

Key Contributions

  • InnoEval addresses the gap between rapid LLM-driven idea generation and reliable idea evaluation by framing evaluation as a knowledge-grounded, multi-perspective reasoning task; a heterogeneous deep knowledge search engine retrieves dynamic evidence from literature, web, and code sources, countering narrow knowledge horizons and LLM-as-a-Judge bias.
  • The framework simulates human review consensus via an innovation review board of persona-based evaluators with distinct academic backgrounds, each independently assessing ideas across five decoupled dimensions—Clarity, Novelty, Feasibility, Validity, and Significance—to preserve multi-criteria decision-making and mitigate single-judge bias.
  • Evaluated on datasets derived from peer-reviewed submissions, InnoEval outperforms baselines by 16.18% F1 in point-wise prediction, 5% in pair-wise accuracy, and 7.56% in group-wise ranking, with human evaluations confirming high alignment in judgment patterns and consensus.

Introduction

The authors leverage large language models to address the growing gap between automated idea generation and human-level idea evaluation in scientific research. Existing methods suffer from narrow knowledge bases limited to static papers, biased single-judge evaluations, and flattened metrics that ignore the multidimensional nature of innovation. In response, they introduce InnoEval — a framework that treats idea evaluation as a knowledge-grounded, multi-perspective reasoning problem. It combines a heterogeneous deep knowledge search engine that pulls dynamic evidence from literature, web, and code sources, with a simulated innovation review board of diverse academic personas to generate consensus-driven, multi-dimensional assessments across clarity, novelty, feasibility, validity, and significance. Their approach outperforms baselines in point-wise, pair-wise, and group-wise tasks while aligning closely with human expert judgments.

Dataset

The authors use a multi-tiered dataset built from NeurIPS 2025 and ICLR 2025 papers via OpenReview, filtered to exclude withdrawn or placeholder submissions. They construct three core subsets:

  • Point-wise Dataset (D_point, 217 samples):

    • Composed of 136 ICLR 2025 and 81 NeurIPS 2025 ideas.
    • Stratified by final decision: 138 Reject, 66 Poster, 9 Spotlight, 4 Oral (Reject: 61.3%, Poster: 29.3%, Highlight: 9.4%).
    • Each idea is extracted via agent M_e, then manually verified.
    • Tasks: Binary classification (Reject vs. Accept) and ternary classification (Reject, Poster, Highlight).
    • Metrics: Accuracy and macro F1.
  • Group-wise Dataset (D_group, 172 instances):

    • Built by using each D_point idea's abstract as a query: the top 800 similar papers are retrieved with bge-base-en-v1.5, then reranked to 120 with bge-reranker-base.
    • One highest-similarity paper per decision stratum forms a group, enabling label-based ranking.
    • Tasks: Best idea selection (Accuracy) and full ranking (LIS score + Accuracy).
    • LIS measures alignment between the predicted and gold rankings as the length of the longest increasing subsequence divided by the group size (see the sketch after this list).
  • Pair-wise Dataset (D_pair, 372 samples):

    • Derived from D_group: 172 easy pairs (e.g., Reject vs. Highlight) and 200 hard pairs (e.g., Poster vs. Highlight).
    • Tasks: Binary comparison, evaluated via Accuracy for each difficulty level separately.
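A minimal sketch of how the LIS-based ranking score above could be computed, assuming the score counts the longest increasing run of gold ranks when ideas are read in the predicted order and divides by the group size; function and variable names are illustrative, not from the paper.

```python
from bisect import bisect_left

def lis_length(seq):
    """Length of the longest strictly increasing subsequence (O(n log n) patience sorting)."""
    tails = []  # tails[k] = smallest possible tail of an increasing subsequence of length k+1
    for x in seq:
        i = bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)

def lis_score(predicted_order, gold_rank):
    """LIS score: longest increasing run of gold ranks under the predicted order, over group size."""
    gold_in_predicted_order = [gold_rank[idea] for idea in predicted_order]
    return lis_length(gold_in_predicted_order) / len(predicted_order)

# Example: gold ranking A > B > C > D (rank 1 is best); predicted order B, A, C, D
gold = {"A": 1, "B": 2, "C": 3, "D": 4}
print(lis_score(["B", "A", "C", "D"], gold))  # 0.75
```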

All datasets are used for evaluation only; no training is performed. The authors also deploy a persona-based review system (the Innovation Review Board) to mitigate LLM bias, in which synthetic reviewers with tailored backgrounds have a fraction of the evidence masked according to their expertise scores (e.g., 20% masking for Literature Familiarity = 8), as sketched below. These personas are generated with DeepSeek-V3.2 and embedded into the prompts of dimension-specific agents.
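The exact score-to-masking mapping is not spelled out beyond the 20%-for-score-8 example, so the linear rule below (mask (10 - score)/10 of the evidence) is an assumption, and all names are illustrative.

```python
import random

def mask_evidence_for_persona(evidence_items, familiarity_score, max_score=10, seed=0):
    """Hide a fraction of retrieved evidence from a persona according to its familiarity.
    Assumed rule: mask_ratio = (max_score - score) / max_score, which reproduces the
    paper's example of 20% masking for Literature Familiarity = 8."""
    mask_ratio = (max_score - familiarity_score) / max_score
    n_masked = round(mask_ratio * len(evidence_items))
    rng = random.Random(seed)
    masked_idx = set(rng.sample(range(len(evidence_items)), n_masked))
    return [e for i, e in enumerate(evidence_items) if i not in masked_idx]

# A reviewer with Literature Familiarity = 8 sees ~80% of the literature evidence.
visible = mask_evidence_for_persona([f"paper_{i}" for i in range(10)], familiarity_score=8)
print(len(visible))  # 8
```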

Method

The authors leverage a multi-stage, agent-driven architecture called InnoEval to systematically evaluate research ideas by grounding them in dynamic, heterogeneous knowledge sources and synthesizing multi-perspective assessments. The framework begins with an Extraction Agent that parses raw textual input into a structured six-tuple representation—comprising TLDR, motivations, research questions, methods, experimental settings, and expected results—enabling precise downstream processing. This structured idea then feeds into a heterogeneous deep-knowledge search engine, which operates through iterative cycles of fast and slow search phases, query refinement, and knowledge ranking.
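For concreteness, the Extraction Agent's structured output might be represented as below; the field names mirror the six components listed above, while the class itself is an illustrative assumption rather than the authors' exact schema.

```python
from dataclasses import dataclass

@dataclass
class StructuredIdea:
    """Six-tuple representation produced by the Extraction Agent (illustrative schema)."""
    tldr: str                   # one-line summary of the idea
    motivations: str            # why the problem matters
    research_questions: str     # questions the idea aims to answer
    methods: str                # proposed approach
    experimental_settings: str  # planned datasets, baselines, protocols
    expected_results: str       # anticipated outcomes and claims
```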

Refer to the framework diagram, which illustrates the end-to-end pipeline. The process starts with Query Generation, where tailored queries are constructed for each idea component and dispatched across multiple search tools, including Google, arXiv, Semantic Scholar, and GitHub. These queries are enriched with synonym expansions to account for terminological variance across domains. The Fast Search phase retrieves brief results from these tools, which are then ranked and filtered with a hybrid scoring function that combines semantic similarity ($\mathcal{S}^{\text{sem}}$) and model-as-judge scores ($\mathcal{S}^{\text{llm}}$), weighted by a coefficient $\alpha$. The top-$m$ results per knowledge type (literature, web, code) are retained for subsequent enrichment.
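A minimal sketch of the hybrid ranking step, assuming the two scores are combined linearly with weight $\alpha$; the paper does not spell out the exact combination, so this form and all names are assumptions.

```python
def rank_fast_results(results, alpha=0.5, top_m=5):
    """Rank fast-search results by a weighted mix of semantic similarity (s_sem)
    and model-as-judge relevance (s_llm), then keep the top-m per knowledge type.
    Each result is a dict with keys: 'type' in {'literature', 'web', 'code'},
    's_sem', 's_llm' (both assumed normalized to [0, 1])."""
    by_type = {}
    for r in results:
        score = alpha * r["s_sem"] + (1 - alpha) * r["s_llm"]
        by_type.setdefault(r["type"], []).append((score, r))
    kept = {}
    for ktype, scored in by_type.items():
        scored.sort(key=lambda t: t[0], reverse=True)  # highest combined score first
        kept[ktype] = [r for _, r in scored[:top_m]]
    return kept
```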

The Slow Search phase enriches the filtered results: for literature, full PDFs are parsed into structured text; for web content, pages are summarized; and for code repositories, call graphs and READMEs are analyzed to generate executable context. This enriched knowledge is then subjected to iterative Query Refinement, where the Search Agent revises queries based on prior retrieval outcomes (rewriting, generalizing, or concretizing them) to uncover deeper or more relevant background material. This loop repeats $N$ times, progressively expanding the knowledge corpus.
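Putting the two phases and the refinement loop together, the control flow could look roughly like this; the search_agent and tools objects are placeholders for LLM-backed components, the rank_fast_results helper is reused from the previous sketch, and this is a sketch of the described loop rather than the authors' implementation.

```python
def deep_knowledge_search(idea, search_agent, tools, n_iters=3, alpha=0.5, top_m=5):
    """Iterative fast/slow search: generate queries, retrieve and rank brief results,
    enrich the survivors, then let the Search Agent refine queries for the next round."""
    corpus = []
    queries = search_agent.generate_queries(idea)  # per-component queries plus synonym expansions
    for _ in range(n_iters):
        fast = [hit for tool in tools for q in queries for hit in tool.search(q)]
        kept = rank_fast_results(fast, alpha=alpha, top_m=top_m)      # hybrid score, top-m per type
        enriched = search_agent.slow_enrich(kept)                     # parse PDFs, summarize pages, analyze repos
        corpus.extend(enriched)
        queries = search_agent.refine_queries(queries, enriched)      # rewrite / generalize / concretize
    return corpus
```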

Following retrieval, a Grounding Agent aligns each idea component with specific evidence from the retrieved knowledge, producing fine-grained relevance analyses that explicitly support or contradict the idea’s claims. These grounding results serve as the foundation for Multi-dimensional Multi-perspective Evaluation, conducted by a panel of persona-based evaluators. Each persona, drawn from the Innovation Review Board, is assigned a subset of knowledge based on its simulated familiarity and evaluates the idea across five dimensions (Clarity, Validity, Novelty, Feasibility, and Significance) using dedicated agent evaluators $\mathcal{M}_{\psi}$. Each evaluation yields a score and a narrative rationale, which are aggregated into a meta-review.
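As a rough illustration of how per-persona, per-dimension scores might be pooled into a meta-review: the aggregation rule below (a simple mean per dimension, then across dimensions) is an assumption, and the data layout is illustrative.

```python
from collections import defaultdict
from statistics import mean

def aggregate_reviews(reviews):
    """Pool persona reviews into per-dimension averages plus an overall score.
    Each review is a dict: {'persona': str, 'scores': {dimension: float}, 'rationale': str}."""
    per_dim = defaultdict(list)
    for review in reviews:
        for dim, score in review["scores"].items():
            per_dim[dim].append(score)
    dim_means = {dim: mean(scores) for dim, scores in per_dim.items()}
    overall = mean(dim_means.values())
    return {"dimension_scores": dim_means, "overall": overall}

reviews = [
    {"persona": "theorist", "scores": {"Clarity": 7, "Novelty": 8, "Feasibility": 5,
                                       "Validity": 6, "Significance": 7}, "rationale": "..."},
    {"persona": "practitioner", "scores": {"Clarity": 6, "Novelty": 7, "Feasibility": 7,
                                           "Validity": 6, "Significance": 6}, "rationale": "..."},
]
print(aggregate_reviews(reviews)["dimension_scores"]["Novelty"])  # 7.5
```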

Finally, a Report Agent synthesizes the enriched knowledge, grounding results, and evaluation outputs into a structured final report. For point-wise evaluation, this includes background knowledge, revision suggestions derived from future-oriented knowledge, and a meta-review with a final decision. For group-wise evaluation, the system compares all ideas across dimensions and produces a ranked list alongside comparative analyses. The entire architecture is designed to be modular, extensible, and grounded in real-time, diverse knowledge sources, ensuring both timeliness and depth in evaluation.

Experiment

  • InnoEval outperforms multiple baselines across point-wise, pair-wise, and group-wise evaluation tasks by leveraging multi-source retrieval, multi-perspective review, and grounded analysis, achieving higher F1 scores and better label dispersion.
  • Qualitative evaluations show InnoEval’s reports are more rational, well-supported, deeper, and more constructive than baselines, with strong alignment to human and peer-review judgments across five dimensions.
  • Ablation studies confirm that grounding, personalized reviewer personas, and diverse retrieval sources (including web and code) are critical to performance, while removing them degrades results.
  • InnoEval’s multi-perspective test-time scaling improves with more personas, demonstrating that authentic reviewer diversity outperforms synthetic opinion generation.
  • The system’s deep knowledge search engine uniquely balances relevance, coverage, and diversity better than baseline search modules.
  • InnoEval’s feedback enhances idea generation pipelines, producing more refined problem formulations, methodologies, and experimental designs than ScholarEval or vanilla baselines.
  • Novelty is the strongest predictor of idea acceptance, while feasibility becomes critical for achieving highlight status, indicating the need for well-rounded evaluation.
  • Dimensional correlations reveal that clarity and significance are foundational, while novelty trades off slightly with validity and feasibility, aligning with scholarly intuition.
  • InnoEval demonstrates robustness across different backbone models (DeepSeek-V3.2 and o4-mini) and can recognize real-world innovation through comprehensive, multi-angle review, mitigating single-viewpoint bias.

The authors use InnoEval to evaluate scientific ideas across point-wise, pair-wise, and group-wise tasks, outperforming multiple baselines including CoT, RAG, ResearchAgent, and ScholarEval. Results show that InnoEval achieves state-of-the-art performance by leveraging multi-dimensional, multi-perspective evaluation and evidence-rich grounding, which mitigates label collapse and improves alignment with human judgments. Ablation studies confirm that components like personalized personas, diverse retrieval sources, and grounding are critical to its robustness and effectiveness.

The authors use an LLM-as-judge protocol to compare InnoEval against multiple baselines across five qualitative dimensions, finding that InnoEval consistently outperforms them in rationality, supportiveness, depth, and constructiveness. Results show that InnoEval’s multi-source search and multi-perspective evaluation yield significantly higher win rates, especially in depth and overall quality, while ScholarEval remains competitive in some dimensions but lags in constructiveness. The findings highlight that integrating diverse evidence and structured review personas enhances evaluation robustness and actionable feedback.

