[Reading] Weekly Reading Selections, 2026-05-25 → 2026-05-31

Posted by Cao Zihang on June 1, 2026

Academic

Intern-Atlas

One-sentence positioning: Extracts “method entities + typed causal edges + bottleneck/mechanism evidence” from 1M+ AI papers to construct a queryable methodological evolution graph, serving as the underlying knowledge infrastructure for AI research agents.

Key innovation: Upgrades flat citation networks into a “method-method causal graph” where each causal edge is accompanied by verbatim citations and structured bottleneck/mechanism annotations; proposes the SGT-MCTS algorithm to reconstruct method evolution lineages on this graph, enabling idea evaluation and generation based on explicit structural evidence rather than LLM parametric memory.


0. Execution Overview

Offline Phase (graph construction)
  ├─ ① Paper corpus processing (1,030,314 papers)
  ├─ ② Method entity extraction + alias resolution
  │     ├─ Seed method curation
  │     ├─ LLM Proposer scanning expansion
  │     └─ Alias registry A: 8,155 canonical / 9,545 aliases
  ├─ ③ Citation edge semantic classification (7 types)
  │     ├─ strong-causal: extends / improves / replaces / adapts
  │     ├─ non-strong: uses_component / compares
  │     └─ non-causal: background
  └─ ④ Causal edge evidence filling
        ├─ 14-category bottleneck taxonomy b_e
        ├─ Mechanism m_e / trade-off t_e
        ├─ LLM confidence c_e ∈ [0,1]
        └─ Verbatim citation grounding
              ↓
Online Phase (retrieval + generation)
  ├─ ⑤ Node matching (keywords + BM25)
  ├─ ⑥ Lineage Reconstruction: SGT-MCTS
  │     ├─ Forward/backward dual trees
  │     ├─ SGT-UCT selection = standard UCT + λ·graph-aware prior
  │     ├─ Temporal coherence TC + edge confidence conf joint prior
  │     └─ Branch discovery + Jaccard deduplication
  ├─ ⑦ Lineage Rerank (length + evidence strength + search consensus)
  └─ ⑧ Generation layer
        ├─ Graph-Grounded Idea Evaluator (5-dimension scoring)
        └─ Graph-Grounded Idea Generator (4 strategy types + evidence certificate)

1. High-level Design (Indexing → Retrieval → Generation)

1.1 Indexing

Dimension Approach
Chunking strategy Method-level entity extraction, replacing traditional paper/paragraph-level extraction
Index structure Method evolution graph: method nodes + typed causal edges + structured evidence attributes
Knowledge representation Directed causal network: nodes = methods/papers/stubs, edges = 7 semantic types (extends/improves/replaces/adapts/uses_component/compares/background), each edge carrying bottleneck/mechanism/trade-off/confidence/verbatim citation
Construction cost High: 1M+ paper corpus processing, LLM extraction (Qwen3.6-35B-A3B), alias resolution, edge classification, evidence filling
Core characteristic Method-level atomic units replace paper-level; each causal edge mandatorily carries verbatim citations and structured evidence

1.2 Retrieval

Dimension Approach
Retrieval method Graph traversal (SGT-MCTS searches evolution paths on the strong-causal subgraph)
Retrieval granularity Method nodes + Evolution paths (evolution chains)
Iteration strategy Multi-hop (forward/backward exploration along strong-causal edges) + Branch discovery restart
Query processing Keyword matching (canonical/alias) + BM25 semantic matching
Core characteristic SGT-MCTS dual-tree search with graph-aware and time-aware priors; branch discovery prevents greedy collapse

1.3 Generation

Dimension Approach
Context injection Lineage paths injected into Idea Evaluator / Generator
Citation tracing Verbatim citation grounding + evidence certificate (specific causal edge, bottleneck original text, unresolved explanation)
Quality control Graph statistics replace LLM free-text judgment; evidence certificate prevents hallucination; verification failure triggers fallback
Core characteristic Graph-grounded idea evaluation/generation; structural evidence replaces parametric memory

2. Offline Construction: Indexing (Detailed Execution)

Step 2.1 Corpus Preparation

Item Description
Input 1,030,314 AI papers (covering AI conferences, journals, arXiv preprints)
Operation Collect and preprocess paper corpus, construct raw document repository
Output Structured paper corpus

Step 2.2 Method Entity Extraction and Alias Resolution

Step 2.2.1 Seed Method Construction
Item Description
Input Human-curated list of well-known methods
Operation Manually establish initial method seed set
Output Seed method set
Step 2.2.2 Method Expansion
Item Description
Input Seed method set + paper corpus
Operation LLM Proposer scans entire corpus, identifies and supplements additional candidate method entities
Output Expanded candidate method set
Step 2.2.3 Alias Registry Construction
Item Description
Input Expanded method entity set
Operation Build alias registry A: V_M → 2^Σ*, mapping each canonical method to a set of surface forms
Matching rules Substring matching + case/punctuation normalization + word boundary enforcement (prevents “GPT” matching inside “lgpto”) + longest match priority (“GPT-4 Turbo” > “GPT-4” > “GPT”)
Version merging “-v2”, “-Large” etc. appended to parent unless an independent canonical node already exists
Ambiguity handling Manually maintained negative-surface list (e.g., state space model “Mamba” vs. Python linter “Mamba”)
Scale 8,155 canonical methods / 9,545 aliases
Output Alias registry A
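
The matching rules above (normalization, word-boundary enforcement, longest-match priority) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the tiny `ALIAS_REGISTRY`, the function names, and the choice to map punctuation to spaces are all assumptions made for the example.

```python
import re

# Hypothetical mini-registry: canonical method -> surface forms (illustrative only).
ALIAS_REGISTRY = {
    "GPT": {"GPT"},
    "GPT-4": {"GPT-4", "GPT4"},
    "GPT-4 Turbo": {"GPT-4 Turbo"},
}


def normalize(text):
    """Lowercase and collapse punctuation/whitespace runs into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def build_matcher(registry):
    """Compile one regex; longer surface forms come first so they win."""
    surface_to_canonical = {}
    for canonical, surfaces in registry.items():
        for surface in surfaces:
            surface_to_canonical[normalize(surface)] = canonical
    ordered = sorted(surface_to_canonical, key=len, reverse=True)
    # Lookarounds enforce word boundaries, so "gpt" never matches inside "lgpto".
    pattern = re.compile(
        r"(?<![a-z0-9])(?:"
        + "|".join(re.escape(s) for s in ordered)
        + r")(?![a-z0-9])"
    )
    return pattern, surface_to_canonical


def match_methods(text, registry=ALIAS_REGISTRY):
    """Return canonical method names in order of appearance in the text."""
    pattern, surface_to_canonical = build_matcher(registry)
    return [surface_to_canonical[m.group(0)]
            for m in pattern.finditer(normalize(text))]
```

Sorting the alternation by length is what gives "GPT-4 Turbo" priority over "GPT-4" and "GPT". The negative-surface list and version merging are left out of this sketch.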

Step 2.3 Citation Edge Semantic Classification (Two-Phase LLM Extraction)

Step 2.3.1 Phase 1: Edge Type Classification
Item Description
Input Citation relationships between papers
Operation Use Qwen3.6-35B-A3B to classify each citation edge into semantic types
Classification system 7 types: strong-causal (extends / improves / replaces / adapts), non-strong (uses_component / compares), non-causal (background)
Accuracy Production model 70.4%; audit model (Claude-Sonnet-4.6) 93.0%
Output Semantically typed edges
Step 2.3.2 Phase 2: Structured Record Completion
Item Description
Input Non-background causal edges
Operation Complete structured evidence records for each edge
Output Causal edges carrying structured attributes

Step 2.4 Causal Edge Evidence Filling

Step 2.4.1 14-Category Bottleneck Taxonomy
| Bottleneck Dimension | Operational Definition |
|---|---|
| computational complexity | asymptotic or wall-clock compute at fixed scale |
| memory efficiency | peak activation / parameter memory footprint |
| parallelization | degree of across-device or across-token parallelism |
| accuracy | task-level correctness or quality metric |
| generalization | out-of-distribution / cross-domain transfer |
| scalability | behavior as model / data / context size grows |
| data efficiency | sample complexity at fixed quality target |
| training stability | variance / divergence risk during optimization |
| inference speed | runtime latency or throughput |
| expressiveness | function class or representational capacity |
| simplicity | implementation, conceptual, or interface simplicity |
| robustness | behavior under perturbation or adversarial input |
| hyperparameter sensitivity | outcome variance w.r.t. hyperparameter choice |
| training complexity | engineering difficulty of the training recipe |
Step 2.4.2 Structured Evidence Quadruple

Each causal edge e carries quadruple ρ(e) = (b_e, m_e, t_e, c_e):

| Attribute | Meaning | Key Design |
|---|---|---|
| b_e | Bottleneck addressed | 14-axis bottleneck taxonomy (fixed at publication time) |
| m_e | Mechanism employed | LLM-extracted structured field |
| t_e | Trade-off | Cost/limitation of the mechanism |
| c_e | LLM-reported confidence | ∈ [0,1], used later by SGT-MCTS |
| Citation grounding | Verbatim excerpt | Every non-background edge must be paired with a verbatim quote |
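
One way to picture the edge record is as a small typed structure. The sketch below is an assumption about layout, not the paper's schema: the class and field names are hypothetical, and the convention that `source` is the earlier method and `target` the later one is chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class EdgeType(Enum):
    EXTENDS = "extends"
    IMPROVES = "improves"
    REPLACES = "replaces"
    ADAPTS = "adapts"
    USES_COMPONENT = "uses_component"
    COMPARES = "compares"
    BACKGROUND = "background"


# The four strong-causal types named in the classification system.
STRONG_CAUSAL = {EdgeType.EXTENDS, EdgeType.IMPROVES,
                 EdgeType.REPLACES, EdgeType.ADAPTS}


@dataclass(frozen=True)
class CausalEdge:
    source: str          # earlier method (assumed direction convention)
    target: str          # later method
    edge_type: EdgeType
    bottleneck: str      # b_e: one of the 14 taxonomy axes
    mechanism: str       # m_e
    trade_off: str       # t_e
    confidence: float    # c_e in [0, 1]
    quote: str           # verbatim excerpt grounding the edge

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        # Non-background edges must carry a verbatim quote.
        if self.edge_type is not EdgeType.BACKGROUND and not self.quote.strip():
            raise ValueError("non-background edges need a verbatim quote")

    @property
    def is_strong_causal(self):
        return self.edge_type in STRONG_CAUSAL
```

Validating in `__post_init__` keeps malformed LLM extractions out of the graph at construction time, which is one plausible place to enforce the mandatory-quote rule.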
Step 2.4.3 Verbatim Citation Verification
Item Description
Input LLM-extracted citations + original papers
Operation Search Match + Symmetry Check
Purpose Ensure citations actually exist in the original text, preventing hallucination
Output Verified verbatim citations
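
The "Search Match" half of this check amounts to confirming that the quote occurs in the source text after light normalization. The sketch below covers only that half; the notes do not define the "Symmetry Check", so it is omitted, and the normalization choices (NFKC, whitespace collapsing, lowercasing) are assumptions.

```python
import re
import unicodedata


def _normalize(text):
    """Unify Unicode forms, collapse whitespace, and lowercase."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_is_grounded(quote, paper_text):
    """True if the normalized quote occurs verbatim in the normalized paper text."""
    q = _normalize(quote)
    return bool(q) and q in _normalize(paper_text)
```

A quote that fails this test would be treated as a possible hallucination and dropped or re-extracted.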
Step 2.4.4 Graph Scale
Metric Value
Papers 1,030,314
Method nodes 8,155 canonical
Aliases 9,545
Semantically typed edges 9,430,201


3. Online Query: Retrieval (Detailed Execution)

3.1 Retrieval Mode Overview

Intern-Atlas employs a single lineage reconstruction retrieval mode, with SGT-MCTS searching on the method evolution graph:

| Mode | Applicable Scenario | Core Mechanism | Characteristic |
|---|---|---|---|
| Lineage Reconstruction | Query method’s evolution history | SGT-MCTS forward/backward dual-tree search on strong-causal subgraph | Graph-aware + time-aware; branch discovery prevents greedy collapse |

3.2 Retrieval Procedure

Step 3.1: Node Matching
Item Description
Input User query q
Operation Parse query, construct seed method set S(q)
Matching method 1 Vocabulary keyword matching (exact hit on canonical / alias)
Matching method 2 BM25 semantic matching (handles semantically ambiguous queries)
Output Seed method set S(q) ⊆ C_q
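
A rough sketch of the two matching methods follows, assuming exact alias hits are tried first and BM25 is used as a fallback (the notes do not say whether the two are combined or sequential). BM25 here is the standard Okapi lexical ranking function written from scratch; the method descriptions it ranks, and all names in the code, are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 over a small list of documents."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_tf = [Counter(t) for t in self.doc_tokens]
        self.avg_len = sum(len(t) for t in self.doc_tokens) / len(docs)
        df = Counter(term for tf in self.doc_tf for term in tf)
        n = len(docs)
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5))
                    for t, c in df.items()}

    def score(self, query, idx):
        tf, length = self.doc_tf[idx], len(self.doc_tokens[idx])
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total


def seed_methods(query, alias_to_canonical, descriptions, top_k=2):
    """Exact alias hits first; otherwise BM25 over method descriptions."""
    query_tokens = set(tokenize(query))
    hits = {canon for alias, canon in alias_to_canonical.items()
            if alias in query_tokens}
    if hits:
        return sorted(hits)
    names = list(descriptions)
    bm25 = BM25([descriptions[n] for n in names])
    scored = sorted(((bm25.score(query, i), n) for i, n in enumerate(names)),
                    reverse=True)
    return [n for s, n in scored[:top_k] if s > 0]
```

The returned list plays the role of the seed set S(q) handed to the lineage search.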
Step 3.2: SGT-MCTS Lineage Reconstruction
Item Description
Input Seed method set S(q)
Operation On strong-causal subgraph (V_M, ε_sc), construct directed evolution path set Π_q along publication chronology
Why not standard MCTS? Central nodes have extremely high branching → standard UCT easily gets trapped in high-visit branches; standard UCT does not perceive graph structure and temporal direction, producing implausible paths where “ancestors come from descendants”
Step 3.2.1: Dual-tree Construction
Item Description
Operation Each seed simultaneously constructs forward + backward trees
Forward tree Searches for predecessor methods
Backward tree Searches for successor developments
Key constraint Operates only on the strong-causal edge subgraph
Step 3.2.2: SGT-UCT Selection
\[\text{SGT-UCT}(u, v) = \underbrace{\text{UCT}(u, v)}_{\text{standard exploration-exploitation}} + \lambda \cdot \underbrace{\alpha_G(u, v)}_{\text{graph-aware prior}}\]
\[\alpha_G(u, v) = \underbrace{\text{conf}(e_{u \to v})}_{\text{edge confidence}} \cdot \underbrace{\text{TC}(\Delta\tau_{uv})}_{\text{temporal coherence}}\]
Component Source
conf(e) Edge confidence c_e reported by LLM during graph construction
TC(Δτ) Hand-crafted piecewise function that scores the publication-year difference Δτ
\[\text{TC}(\Delta\tau) = \begin{cases} 0.40 & -1 \leq \Delta\tau < 0 \quad \text{(slight temporal overlap, e.g., preprints)} \\ 0.85 & \Delta\tau = 0 \quad \text{(same year)} \\ 1.00 & 1 \leq \Delta\tau \leq 3 \quad \text{(optimal: 1-3 year natural evolution)} \\ 0.80 & 4 \leq \Delta\tau \leq 6 \quad \text{(slightly distant but still reasonable)} \\ \max(0.30,\ 1.00 - 0.08(\Delta\tau - 6)) & \Delta\tau > 6 \\ 0.70 & \tau \text{ missing} \end{cases}\]

Calibration range limitation: TC is only calibrated on post-2015 AI literature; domains with different research paces require recalibration.
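
The TC table and the SGT-UCT score translate almost directly into code. In this sketch the UCT term uses the common form Q/N + c·sqrt(ln N_parent / N_child), which is an assumption since the notes only say "standard UCT". The values of `c_explore` and `lam`, the treatment of Δτ < -1 (returned as 0, on the assumption that the hard temporal filter removes such edges anyway), and the reading of Δτ as child year minus parent year in whole years are likewise assumptions.

```python
import math


def temporal_coherence(delta_tau):
    """Piecewise TC score for an integer publication-year gap (None = missing)."""
    if delta_tau is None:
        return 0.70
    if delta_tau < -1:
        return 0.0          # assumption: not covered by the table
    if delta_tau < 0:
        return 0.40         # slight temporal overlap, e.g. preprints
    if delta_tau == 0:
        return 0.85         # same year
    if delta_tau <= 3:
        return 1.00         # 1-3 years
    if delta_tau <= 6:
        return 0.80         # 4-6 years
    return max(0.30, 1.00 - 0.08 * (delta_tau - 6))


def sgt_uct(total_reward, child_visits, parent_visits, conf, delta_tau,
            c_explore=1.4, lam=1.0):
    """Standard UCT plus lam times the graph-aware prior conf * TC."""
    if child_visits == 0:
        return math.inf     # unvisited children are tried first
    prior = conf * temporal_coherence(delta_tau)
    exploit = total_reward / child_visits
    explore = c_explore * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore + lam * prior
```

With everything else equal, an edge with a 2-year gap outscores one with a 5-year gap by λ · conf · 0.2, which is how the prior nudges selection toward temporally natural steps.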

Step 3.2.3: Expansion (Confidence-prioritized)
Item Description
Loop exclusion Discard nodes that would create cycles in the path
Hard temporal filtering Discard nodes with reversed temporal direction (prevents “descendants producing ancestors” paradox)
Expansion strategy Select the unexplored child node with highest confidence (no random expansion)
Step 3.2.4: Rollout (Greedy)
Item Description
Strategy Greedy rollout, no MC random sampling
Selection Directly select child node with highest confidence, avoiding noise introduction
Termination conditions Leaf node with no outgoing edges / cycle detected / maximum depth reached
Scoring R(π): paper does not provide specific form
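
The expansion filters and the greedy rollout can be sketched together. The graph layout (`node -> [(child, confidence), ...]`), the function names, and the cut-off for "reversed temporal direction" (a child more than one year older than its parent, mirroring the lowest TC bracket) are assumptions; the rollout reward R(π) is not implemented because the paper does not give its form.

```python
def admissible_children(graph, years, path):
    """Children of the path tip that pass loop exclusion and the temporal filter."""
    tip = path[-1]
    result = []
    for child, conf in graph.get(tip, []):
        if child in path:
            continue                      # loop exclusion
        y_tip, y_child = years.get(tip), years.get(child)
        if y_tip is not None and y_child is not None and y_child - y_tip < -1:
            continue                      # hard temporal filter (assumed cut-off)
        result.append((child, conf))
    return result


def greedy_rollout(graph, years, path, max_depth=10):
    """Extend the path by always taking the highest-confidence admissible child."""
    path = list(path)
    while len(path) < max_depth:
        children = admissible_children(graph, years, path)
        if not children:
            break                         # leaf, or every child was filtered out
        best, _ = max(children, key=lambda item: item[1])
        path.append(best)
    return path
```

Because no random sampling is involved, repeated rollouts from the same node give the same path, which matches the stated goal of not adding noise on top of LLM-extracted confidences.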
Step 3.2.5: Backpropagation
Item Description
Operation Backpropagate to update cumulative reward and visit count
Special penalty Additional score deduction for ancestors of non-expandable leaf nodes → reduces dead ends
Step 3.2.6: Path Stitching and Deduplication
Item Description
Operation Take top-5 cumulative reward paths from forward/backward each → stitch through seed node
Deduplication Jaccard similarity threshold 0.8 to determine path homogeneity; keep only the higher-ranked path for homogeneous ones
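
A small sketch of stitching and deduplication, assuming Jaccard similarity is computed over the node sets of two paths and that a similarity at or above 0.8 counts as homogeneous (the notes give the threshold but not these details). Function names are hypothetical.

```python
def stitch(backward, forward):
    """Join two paths that both start at the seed; backward runs toward ancestors."""
    return list(reversed(backward)) + forward[1:]


def jaccard(a, b):
    """Jaccard similarity of the node sets of two paths."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def deduplicate(ranked_paths, threshold=0.8):
    """Keep a path only if it is not homogeneous with a higher-ranked kept path."""
    kept = []
    for path in ranked_paths:             # assumed sorted best-first
        if all(jaccard(path, k) < threshold for k in kept):
            kept.append(path)
    return kept
```

Processing paths best-first means that when two paths overlap heavily, the higher-ranked one is the survivor.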
Step 3.2.7: Branch Discovery
Item Description
Problem Addresses greedy collapse preventing discovery of branch paths
Branch node definition Node has at least 2 strong-causal child nodes in the search direction, but the current lineage final path includes only 1
Restart search Restart algorithm from the branch node
Constraint Edges already in the main lineage are forcibly blocked; search budget halved
Output merge Aggregate main path + branch paths
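
Branch discovery can be read as a scan over the main lineage followed by constrained restarts. The sketch below only locates branch nodes and prepares restart settings; it assumes `children_of` already holds strong-causal children in the search direction, and the dictionary layout of a restart plan is hypothetical.

```python
def find_branch_nodes(children_of, lineage):
    """Lineage nodes with >= 2 strong-causal children of which the lineage uses one."""
    used_edges = set(zip(lineage, lineage[1:]))
    branches = {}
    for node in lineage:
        children = children_of.get(node, [])
        on_path = [c for c in children if (node, c) in used_edges]
        if len(children) >= 2 and len(on_path) == 1:
            branches[node] = [c for c in children if c not in on_path]
    return branches


def restart_plans(children_of, lineage, budget):
    """One restart per branch node: main-lineage edges blocked, budget halved."""
    blocked = set(zip(lineage, lineage[1:]))
    return [
        {"start": node, "blocked_edges": blocked, "budget": budget // 2}
        for node in find_branch_nodes(children_of, lineage)
    ]
```

Each plan would then be passed to the same search routine, and the resulting branch paths merged with the main path.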
Step 3.3: Lineage Rerank
Item Description    
Input Candidate evolution path π    
Ranking formula rank(π) = w_ℓ · |π| / L_max + w_c · conf̄(π) + w_m · N̄(π), where |π| is the path length
First term Rewards long paths (sufficient evolution)    
Second term Path average evidence strength    
Third term Path node average visit count (search consensus)    
Output Re-ranked evolution paths    
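
The rerank score is a weighted sum of three terms, sketched here assuming the first term is the path length divided by L_max. The weight values are placeholders, and dividing the average visit count by `total_visits` to bring it onto a 0-1 scale is an added assumption; the notes give neither.

```python
def rank_score(path_len, edge_confs, node_visits, l_max, total_visits,
               w_len=0.3, w_conf=0.5, w_visit=0.2):
    """Weighted sum of relative length, mean edge confidence, and mean visit share."""
    mean_conf = sum(edge_confs) / len(edge_confs) if edge_confs else 0.0
    mean_visits = sum(node_visits) / len(node_visits) if node_visits else 0.0
    visit_term = mean_visits / total_visits if total_visits else 0.0
    return w_len * path_len / l_max + w_conf * mean_conf + w_visit * visit_term
```

For a 4-node path with edge confidences 0.9, 0.8, 0.7, visit counts 10, 20, 30, 40, `l_max=8` and `total_visits=100`, the three terms are 0.15, 0.40 and 0.05.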

3.3 Retrieval Algorithm Comparison

Method NR ER CAS
Beam@1 41.0 18.6 41.0
Beam@5 43.4 21.6 43.4
Beam@10 44.9 23.2 44.9
RW@5 28.1 0.7 28.1
SGT-MCTS 84.8 79.0 84.8


4. Online Generation: Generation (Detailed Execution)

4.1 Generation Procedure

Input: query q, retrieved evolution paths Π_q
    │
    ▼
┌─────────────────────────────┐
│ Parse methods mentioned     │
│ in idea d, map to nodes M_d │
│ via lookup table A          │
└─────────────┬───────────────┘
              │
         ┌────┴────┐
         ▼         ▼
   Evaluator   Generator
   (Evaluate)   (Generate)
      │           │
      ▼           ▼
  5-dim scoring  4 strategy types
  +cross penalty +evidence certificate

Step 4.1: Graph-Grounded Idea Evaluator

Step 4.1.1: Motivation

Free-text LLM judges favor ideas that stack popular methods, and their novelty scores correlate negatively with actual scientific impact; graph statistics can supply structural evidence directly.

Step 4.1.2: Per-dimension Scoring
Item Description
Input Methods mentioned in idea d → resolved to nodes M_d ⊆ V_M via lookup table A
Operation Independently evaluate 5 dimensions based on graph statistics
Dimensions Novelty (N), Feasibility (F), Significance (S), Validity (V), Clarity (C)
Calculation Each dimension score directly based on graph statistics of M_d’s position/connectivity structure in retrieval context C_d

Formula:

\[s_k(d, G) = \text{clip}_{[1,10]}\left(b_k + \sum_j w_j^{(k)} \cdot \phi_j^{(k)}(M_d, C_d)\right), \quad k \in \{N, F, S, V, C\}\]
Step 4.1.3: Cross-dimensional Aggregation
Item Description
Operation Fixed weight vector w + 4 hand-crafted conjunctive penalties Ω_cross
Representative penalty Strong deduction when high Novelty + low Feasibility — “novel but infeasible” typically indicates a flawed core proposal

Formula:

\[s^*(d, G) = \text{clip}_{[1,10]}\left(\mathbf{w}^\top \mathbf{s} + \Omega_{\text{cross}}(\mathbf{s})\right), \quad \mathbf{s} = (s_N, s_F, s_S, s_V, s_C)\]
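
Both formulas reduce to a clipped affine combination. The sketch below mirrors them with placeholder numbers: the bias, weights, feature values, and the single "high Novelty, low Feasibility" penalty with its thresholds and size are all hypothetical, since the notes describe the four penalties only qualitatively.

```python
def clip(x, lo=1.0, hi=10.0):
    return max(lo, min(hi, x))


def dimension_score(bias, weights, features):
    """s_k = clip(b_k + sum_j w_j * phi_j) for one dimension k."""
    return clip(bias + sum(w * f for w, f in zip(weights, features)))


def cross_penalty(s):
    """One illustrative conjunctive penalty: novel but infeasible."""
    return -2.0 if s["N"] >= 8.0 and s["F"] <= 4.0 else 0.0


def overall_score(s, w):
    """s* = clip(w^T s + Omega_cross(s)) over the five dimensions."""
    return clip(sum(w[k] * s[k] for k in "NFSVC") + cross_penalty(s))
```

With equal weights of 0.2, the scores N=9, F=3, S=V=C=6 average to 6.0 and then lose 2 points to the penalty, which illustrates how a conjunctive term can override an otherwise decent weighted sum.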
Step 4.1.4: Idea Evaluation Results
Publication tier Overall Score
Top-tier 8.48 ± 1 s.d.
Core 7.83 ± 1 s.d.
Workshop 6.85 ± 1 s.d.
Rejected 5.84 ± 1 s.d.

Step 4.2: Graph-Grounded Idea Generator

Step 4.2.1: Structural Gap Pattern → Generation Strategy Mapping
| Gap Pattern | Corresponding Generation Strategy | Description |
|---|---|---|
| Open axes | Bottleneck Resolution | Address identified but unresolved bottlenecks in the graph |
| Recent improvement direction | Trend Extrapolation | Extrapolate along recent improvement directions |
| Sacrifice axes | Cross-pollination | Cross-domain/cross-method combination, leveraging trade-offs between different methods |
| Disconnected pairs | Paradigm Challenge | Challenge existing paradigms, connecting disconnected method pairs in the graph |
Step 4.2.2: Evidence Certificate (Core Anti-hallucination Mechanism)
Field Content
Certificate tuple (specific causal edge, bottleneck original text in the graph, explanation of why this bottleneck remains unresolved)
Verification Exact match of bottleneck original text in the certificate against actual graph content
Failure handling Discard result, fallback
Design goal Ensure ideas are traceable without over-constraining generation
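
Certificate verification is essentially an exact-match lookup against the graph. A minimal sketch, assuming the graph exposes bottleneck text keyed by edge; the `Certificate` class, the `graph_bottlenecks` mapping, and the fallback value are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Certificate:
    edge: tuple             # (source method, target method)
    bottleneck_text: str    # bottleneck wording copied from the graph
    unresolved_reason: str  # why the bottleneck is still open


def verify(cert, graph_bottlenecks):
    """Exact match of the certificate's bottleneck text against the graph."""
    return graph_bottlenecks.get(cert.edge) == cert.bottleneck_text


def accept_or_fallback(idea, cert, graph_bottlenecks, fallback=None):
    """Return the idea only if its certificate verifies; otherwise the fallback."""
    return idea if verify(cert, graph_bottlenecks) else fallback
```

Only the bottleneck text is checked against the graph; the explanation field stays free-form, in line with the goal of traceability without over-constraining generation.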

5. Key Design Decisions

| Decision Point | Intern-Atlas’s Choice | Alternative | Rationale |
|---|---|---|---|
| Graph atomic unit | Method entities (method-level) | Papers (paper-level) | Per the notes, paper-level citations cannot distinguish extends/improves/replaces/compares/background |
| Causal edge evidence | Mandatory verbatim citations + structured bottleneck/mechanism | Edge type labels only | Provides grounded evidence and supports certificate verification for idea generation |
| Lineage search algorithm | SGT-MCTS (graph-aware + time-aware) | Beam search / standard MCTS / Random Walk | Per the notes, high branching at central nodes causes beam collapse; standard MCTS does not perceive publication-year direction; Random Walk performs extremely poorly (ER only 0.7) |
| Rollout strategy | Greedy (highest confidence) | MC random sampling | Random sampling introduces LLM extraction noise |
| Dead-end handling | Additional backpropagation penalty + branch discovery restart | Pure reliance on UCT exploration term | Explicitly suppresses greedy collapse and proactively recalls ignored branches |
| Idea evaluation | Per-dimension graph statistics + hand-crafted conjunctive penalties | Free-text LLM judge | The free-text judge’s novelty score is negatively correlated with actual impact |
| Idea generation | Structural gap patterns + evidence certificate | Direct LLM free generation | Forces grounding in specific causal edges and bottlenecks; avoids method-name stacking |

6. Evaluation

6.1 Graph Construction Quality

Benchmark: 30 high-impact surveys → 30 method evolution graphs, containing 2,268 nodes / 1,462 edges / 133 evolution chains.

Metric Meaning Result
NMR (Node Match Ratio) Node match rate 91.0%
ERR (Edge Reachable Ratio) Edge reachability rate 89.7%
PSC (Path Semantic Correctness) Path semantic correctness 92.0%
NR (Node Recall) Node recall 84.8% (SGT-MCTS)
ER (Edge Recall) Edge recall 79.0% (SGT-MCTS)
CAS (Chain Alignment Score) Lineage chain alignment score 84.8% (SGT-MCTS)

6.2 Idea Evaluator (Strata Dataset)

Category Paper Count Overall Score
Top-tier (ICLR 2026, ICML 2025, NeurIPS 2025) 300 8.48
Core (AAAI 2026, IJCAI 2025) 300 7.83
Workshop (ICLR 2026) 300 6.85
Rejected (ICLR 2026) 300 5.84

Evaluation method: Publication tier verification (across publication strata) + human evaluation.

6.3 Idea Generator Comparative Experiments

Baseline Retrieval/Knowledge Source
Direct LLM generation No external retrieval
External search OpenAlex / Semantic Scholar
Local RAG BM25
Intern-Atlas Method evolution graph + evidence certificate

6.4 SGT-MCTS Retrieval Algorithm Comparison

Method NR ER CAS
Beam@1 41.0 18.6 41.0
Beam@5 43.4 21.6 43.4
Beam@10 44.9 23.2 44.9
RW@5 28.1 0.7 28.1
SGT-MCTS 84.8 79.0 84.8


7. Limitations and Applicability

| Limitation | Specific Manifestation | Mitigation |
|---|---|---|
| Edge type classification accuracy | Phase-1 production model 70.4% / audit model 93.0% | Key decisions can undergo secondary audit; improve production model |
| Bottleneck taxonomy rigidity | 14-axis taxonomy D fixed at publication time; new dimensions mapped to nearest existing axis | Await future taxonomy revision |
| Alias resolution coverage | Substring matching favors precision over recall; ambiguity handled via manual negative list | Biased toward high-quality nodes |
| Temporal coherence calibration range | TC calibrated on post-2015 AI literature | Domains with a different research pace require recalibration |
| Rollout scoring opacity | The paper does not give the specific form of R(π) | Reproduction requires custom design |

Best Applicable Scenarios

  • AI agents performing idea generation/evaluation that require structured method evolution priors
  • Traceable scientific idea generation (evidence certificate prevents hallucination)
  • Method lineage research / survey automation within AI subfields
  • Evaluating ideas while avoiding the “popular method stacking” bias of LLM free-text judges

Unsuitable Scenarios

  • Disciplines with research paces significantly different from AI (TC calibration fails)
  • Extremely niche subfields without human survey references (limited graph coverage)
  • Research focused on paper-level citation network properties (e.g., PageRank, influence propagation)

8. Quick Reference

What You Want to Know See Which Section
What is the complete pipeline? 0. Execution Overview
High-level design comparison? 1. High-level Design
How is the graph constructed? 2. Offline Construction
How to retrieve evolution paths from the graph? 3. Online Query
How to use the graph to evaluate/generate ideas? 4. Online Generation
How does it differ from existing approaches? 5. Key Design Decisions
How does it perform? 6. Evaluation
When should it NOT be used? 7. Limitations and Applicability

Search-R1

One-sentence positioning: A framework that uses reinforcement learning to train LLMs to issue multi-turn search queries autonomously during step-by-step reasoning and to retrieve in real time; loss masking of retrieved tokens and outcome-oriented rewards address the suboptimal performance of prompt-engineered search.

Key innovation: Unlike prompt-driven search methods, Search-R1 models the search engine as part of the environment and trains the LLM via RL to autonomously learn the optimal search interaction strategy; simultaneously introduces retrieval token masking to prevent gradient updates on retrieved content, ensuring training stability.


0. Execution Overview

Training Phase
  ├─ ① Preparation: search engine as environment, policy model π_θ, reference model π_ref
  ├─ ② Rollout: model generates reasoning sequences, interleaving token generation with search engine retrieval
  │   ├─ Encountering uncertain knowledge → generate <search>... </search> query
  │   ├─ Search engine returns documents → wrap as <information>... </information>
  │   ├─ Reasoning process → wrap as <think>... </think>
  │   └─ Final answer → wrap as <answer>... </answer>
  ├─ ③ Reward computation: outcome-oriented reward r_φ(x, y) = EM(a_pred, a_gold)
  ├─ ④ Loss Masking: policy gradient computed only on LLM-generated tokens; retrieved content excluded from optimization
  └─ ⑤ Policy update: PPO or GRPO updates the policy model

Inference Phase (per query)
  ├─ ① Input question q
  ├─ ② Iterative generation: text generation ↔ search query alternation
  │   ├─ Generate response tokens
  │   ├─ Detect <search> / </answer> / <eos> → determine next action
  │   ├─ Search query → retrieve documents → inject into reasoning chain
  │   └─ Continue generation
  ├─ ③ Termination condition: maximum action count B reached or <answer> generated
  └─ ④ Output final answer

1. High-level Design (Indexing → Retrieval → Generation)

1.1 Indexing

Dimension Approach
Chunking strategy Not recorded in notes
Index structure — (relies on external search engine; offline index construction not explicitly recorded in notes)
Knowledge representation Not recorded in notes
Construction cost Not recorded in notes
Core characteristic Notes do not explicitly record index construction; the offline phase focuses on model training rather than knowledge base construction

1.2 Retrieval

Dimension Approach
Retrieval method Dynamic search triggering (model learns via RL to autonomously decide when to retrieve)
Retrieval granularity Documents (external documents returned by the search engine)
Iteration strategy Multi-turn iteration (search may be triggered multiple times during reasoning until termination conditions are met)
Query processing Model autonomously generates search queries based on current reasoning state
Core characteristic RL-driven agentic search: model learns the optimal search interaction strategy rather than relying on manual prompt engineering

1.3 Generation

Dimension Approach
Context injection Retrieved documents injected into the reasoning chain via <information> tags
Citation tracing Not recorded in notes
Quality control Outcome-oriented reward (EM matching) drives generation quality; retrieval token masking ensures training stability
Core characteristic Structured token system (<think> / <search> / <information> / <answer>); reasoning and retrieval are interleaved

2. Offline Construction: RL Training (Detailed Execution)

The core of Search-R1’s offline phase is RL training rather than traditional knowledge base index construction. The system relies on external search engines (e.g., Bing) for document retrieval, and the notes do not explicitly record traditional index construction steps.

Step 2.1 Search Engine Environment Modeling

Item Description
Input Policy LLM π_θ, search engine R
Operation Model the search engine as part of the environment; sampled trajectory sequences interleave LLM token generation with search engine retrieval
Key decision Unlike previous methods that rely solely on the policy LLM for rollout generation, explicitly introduces retrieval-interleaved reasoning: π_θ(· | x; R)
Output Rollout environment with interleaved retrieval and reasoning

Step 2.2 Structured Token System Definition

Item Description
Operation Define four types of special tokens to structure the interaction
Key decision Uses special tokens rather than free text, facilitating rule-based parsing and training control
Output Token system: <think>/</think>, <search>/</search>, <information>/</information>, <answer>/</answer>

Step 2.3 Rollout Sequence Generation

Item Description
Input Query q, policy model π_θ, search engine R
Operation Model generates response tokens y_t ~ π_θ(· | x, y_{<t}; R), appended to the rollout sequence
Key decision The sequence contains two types of tokens: LLM-generated tokens and retrieved tokens
Output Complete rollout sequence y

Step 2.4 Retrieval Token Masking (Loss Masking)

Item Description
Problem Optimizing retrieved tokens in the same way as generated tokens leads to unexpected learning dynamics
Operation Introduce a loss mask ensuring the policy gradient objective is computed only on LLM-generated tokens
Key decision Indicator function I(y_t): model-generated token = 1, retrieved content (within <information> tags) = 0
Output Masked policy gradient computation
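
The indicator I(y_t) can be built with a single pass over the token sequence. This sketch works on string tokens for readability; treating each tag as one token and masking the `<information>` tags themselves along with their contents are assumptions, and a real trainer would apply the same mask to per-token log-probabilities.

```python
def build_loss_mask(tokens):
    """Return I(y_t): 1 for model-generated tokens, 0 for retrieved content."""
    mask, inside = [], False
    for tok in tokens:
        if tok == "<information>":
            inside = True
            mask.append(0)      # tag is inserted by the environment (assumed masked)
        elif tok == "</information>":
            mask.append(0)
            inside = False
        else:
            mask.append(0 if inside else 1)
    return mask


def masked_mean(per_token_loss, mask):
    """Average the loss over unmasked (model-generated) tokens only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(l * m for l, m in zip(per_token_loss, mask)) / kept
```

Normalizing by the number of unmasked tokens keeps long retrieved passages from diluting the gradient signal on the model's own tokens.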

Step 2.5 Reward Computation

Item Description
Input Model’s final answer a_pred, ground-truth answer a_gold
Operation Apply a rule-based outcome-oriented reward: r_φ(x, y) = EM(a_pred, a_gold)
Key decision Does not use format reward (model already demonstrates strong structural compliance); does not train a neural reward model (avoids sensitivity and additional cost)
Output Reward value
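
The rule-based reward needs only answer extraction and a string comparison. In this sketch the SQuAD-style normalization (lowercasing, stripping punctuation and articles) and the support for several acceptable gold answers are assumptions; the notes state only r_φ(x, y) = EM(a_pred, a_gold).

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def extract_answer(rollout):
    """Return the content of the last <answer>...</answer> span, if any."""
    matches = re.findall(r"<answer>(.*?)</answer>", rollout, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def em_reward(rollout, gold_answers):
    """1.0 if the predicted answer exactly matches any gold answer, else 0.0."""
    pred = extract_answer(rollout)
    if pred is None:
        return 0.0
    target = normalize_answer(pred)
    return float(any(target == normalize_answer(g) for g in gold_answers))
```

A rollout with no `<answer>` span scores 0 here, so no separate format reward is needed for the model to learn to emit the tags.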

Step 2.6 Policy Update (PPO / GRPO)

Item Description
PPO Actor-critic method; Actor generates answers, Critic value network estimates advantage, advantage function uses GAE
GRPO Group-relative advantage estimation, no Critic network needed; samples G outputs per group to compute group baseline
Key decision Both are compatible; GRPO converges faster, PPO trains more stably, final rewards are comparable
Output Updated policy model parameters
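
The group baseline in GRPO replaces the critic with statistics of the G sampled outputs. A minimal sketch of the usual formulation, where each reward is standardized within its group; using the population standard deviation and the `eps` value are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

With binary EM rewards, a group in which every sample is right or every sample is wrong yields all-zero advantages, so only mixed groups contribute to the update.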

3. Online Query: Retrieval (Detailed Execution)

3.1 Retrieval Mode Overview

Search-R1 employs a single RL-driven search mode, autonomously triggered by the trained model during inference:

| Mode | Applicable Scenario | Core Mechanism | Characteristic |
|---|---|---|---|
| RL-driven Search | Knowledge uncertainty encountered during reasoning | Model autonomously generates queries → retrieves → injects | Learned, not manually predefined strategy |

3.2 Retrieval Procedure

Step 3.1: Reasoning and Search Alternation
Item Description
Input Query q, current reasoning chain state
Operation LLM alternates between text generation and external search engine queries
Key decision Iterative framework: generates at each step until <search>, </answer>, or <eos> is detected
Output Response token sequence
Step 3.2: Search Query Detection and Execution
Item Description
Input Query wrapped in <search>...</search> tags
Operation Extract search query from rollout sequence, invoke search engine to retrieve documents D
Output Retrieved documents, wrapped as <information>...</information> and injected into the sequence
Step 3.3: Iterative Termination Judgment
Item Description
Termination condition 1 Maximum action count B reached
Termination condition 2 Model generates final response wrapped in <answer>...</answer> tags
Fallback strategy If content is empty (model outputs "My action is not correct. Let me rethink."), continue iteration
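
Steps 3.1 to 3.3 form one loop, sketched here with stand-in callables. `generate` and `search` are hypothetical placeholders for the trained policy and the search engine, `generate` is assumed to stop right after a closing `</search>` or `</answer>` tag, and having the loop itself append the rethink message is an assumption about which side emits it.

```python
import re


def run_search_loop(question, generate, search, max_actions=4):
    """Alternate generation and retrieval until an answer or the action budget B."""
    rollout = ""
    for _ in range(max_actions):
        segment = generate(question + rollout)
        rollout += segment
        if "</answer>" in segment:
            break                                   # termination condition 2
        query = re.search(r"<search>(.*?)</search>", segment, flags=re.DOTALL)
        if query:
            docs = search(query.group(1).strip())
            rollout += f"<information>{docs}</information>"
        else:
            rollout += "My action is not correct. Let me rethink."
    answers = re.findall(r"<answer>(.*?)</answer>", rollout, flags=re.DOTALL)
    return rollout, (answers[-1].strip() if answers else None)
```

The same loop serves training rollouts and inference; at inference time no gradients are computed on the resulting sequence.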

4. Online Generation: Generation (Detailed Execution)

4.1 Reasoning Procedure

Input question q
    │
    ▼
┌────────────────────────────────┐
│ Initialize: action count b ← 0 │
│ Initialize: response y ← ∅     │
└───────────┬────────────────────┘
            │
            ▼
    ┌───────┴───────┐
    │  while b < B  │
    └───────┬───────┘
            │
            ▼
    ┌──────────────────────────┐
    │ Generate response tokens │
    │ until termination signal │
    └─────────┬────────────────┘
              │
         ┌────┴────┐
         ▼         ▼
      <search>   <answer>
         │         │
         ▼         ▼
      Retrieve    Return y
      documents
      inject
         │
         ▼
      b ← b + 1
         │
         └──────→ Continue generating

Step 4.1: Online Reasoning Sequence Generation

Item Description
Input Query q, trained policy model π_θ, search engine R
Operation Model autoregressively generates response tokens; pauses upon encountering <search> to trigger retrieval, then continues generation after injecting results
Key decision Reasoning process is structurally consistent with the training Rollout, but no longer computes gradient updates
Output Final response sequence y, containing the answer wrapped in <answer> tags

5. Key Design Decisions

| Decision Point | Search-R1’s Choice | Alternative | Rationale |
|---|---|---|---|
| Training paradigm | Reinforcement learning (PPO / GRPO) | Prompt engineering / SFT / Rejection sampling | Per the notes, prompting advanced LLMs to use search engines is often suboptimal; LLMs may not fully possess the ability to interact optimally with search engines |
| Retrieval triggering | Model learns autonomously (<search> token) | Fixed retrieval strategy / manually predefined trigger conditions | Per the notes, RL training enables the model to learn autonomously when search is needed |
| Reward design | Outcome-oriented reward (EM matching) | Process reward / neural reward model / format reward | Per the notes, a simple outcome reward avoids complexity; no neural reward model is trained, avoiding sensitivity and cost |
| Training stability | Retrieval token masking | No special handling of retrieved content | Per the notes, optimizing retrieved tokens in the same way as generated tokens leads to unexpected learning dynamics |
| RL algorithm | Supports both PPO and GRPO | Single RL algorithm | Per the notes, both algorithms are compatible, and an empirical comparison is provided |

6. Evaluation

6.1 Evaluation Metrics

| Metric | Meaning | This System vs. Baseline |
|---|---|---|
| EM | Exact Match | Qwen2.5-7B +41% over RAG baselines; Qwen2.5-3B +20% over RAG baselines |

6.2 RL Training Comparison

Condition Description
PPO Actor-critic RL; higher training stability
GRPO Group Relative Policy Optimization; faster convergence
Conclusion GRPO converges faster, PPO is more stable, final training rewards are comparable

6.3 Experimental Setup

Condition Description
Search-R1 Full system (RL training + retrieval masking + outcome reward)
CoT Chain-of-thought baseline
vanilla RAG Standard retrieval-augmented generation baseline
IRCoT Iterative retrieval chain-of-thought baseline
Search-o1 Inference-time search augmentation baseline
R1 DeepSeek-R1 baseline
SFT Supervised fine-tuning baseline
Rejection Sampling Rejection sampling baseline

6.4 Datasets

Dataset Description
Natural Questions (NQ) Open-domain QA benchmark
TriviaQA Open-domain QA benchmark
PopQA Knowledge-intensive QA benchmark
HotpotQA Multi-hop QA benchmark
2WikiMultiHopQA Multi-hop QA benchmark
MuSiQue Multi-hop QA benchmark
Bamboogle Open-domain QA benchmark

7. Limitations and Applicability

The notes do not explicitly record Search-R1’s limitations.

Best Applicable Scenarios

  • Reasoning tasks requiring efficient acquisition of external knowledge and up-to-date information
  • Open-domain QA (NQ, TriviaQA, etc.)
  • Multi-hop QA (HotpotQA, 2WikiMultiHopQA, MuSiQue, etc.)
  • Knowledge-intensive reasoning tasks

Unsuitable Scenarios

  • Not explicitly recorded in the notes.


8. Quick Reference

What You Want to Know See Which Section
What is the complete pipeline? 0. Execution Overview
High-level design comparison? 1. High-level Design
How is the model trained? 2. Offline Construction
How is retrieval triggered? 3. Online Query
RL training details? 2. Offline Construction
Why this design? 5. Key Design Decisions
How does it perform? 6. Evaluation
When should it NOT be used? 7. Limitations and Applicability