Weekly Reading Selections 2026-05-25 → 2026-05-31 Table of Contents

Academic
- Intern-Atlas
- Search-R1

Academic

Intern-Atlas

One-sentence positioning: Extracts “method entities + typed causal edges + bottleneck/mechanism evidence” from 1M+ AI papers to construct a queryable methodological evolution graph, serving as the underlying knowledge infrastructure for AI research agents.

Key innovation: Upgrades flat citation networks into a “method-method causal graph” where each causal edge is accompanied by verbatim citations and structured bottleneck/mechanism annotations; proposes the SGT-MCTS algorithm to reconstruct method evolution lineages on this graph, enabling idea evaluation and generation based on explicit structural evidence rather than LLM parametric memory.

0. Execution Overview

Offline Phase (graph construction)
  ├─ ① Paper corpus processing (1,030,314 papers)
  ├─ ② Method entity extraction + alias resolution
  │     ├─ Seed method curation
  │     ├─ LLM Proposer scanning expansion
  │     └─ Alias registry A: 8,155 canonical / 9,545 aliases
  ├─ ③ Citation edge semantic classification (7 types)
  │     ├─ strong-causal: extends / improves / replaces / adapts
  │     ├─ non-strong: uses_component / compares
  │     └─ non-causal: background
  └─ ④ Causal edge evidence filling
        ├─ 14-category bottleneck taxonomy b_e
        ├─ Mechanism m_e / trade-off t_e
        ├─ LLM confidence c_e ∈ [0,1]
        └─ Verbatim citation grounding
              ↓
Online Phase (retrieval + generation)
  ├─ ⑤ Node matching (keywords + BM25)
  ├─ ⑥ Lineage Reconstruction: SGT-MCTS
  │     ├─ Forward/backward dual trees
  │     ├─ SGT-UCT selection = standard UCT + λ·graph-aware prior
  │     ├─ Temporal coherence TC + edge confidence conf joint prior
  │     └─ Branch discovery + Jaccard deduplication
  ├─ ⑦ Lineage Rerank (length + evidence strength + search consensus)
  └─ ⑧ Generation layer
        ├─ Graph-Grounded Idea Evaluator (5-dimension scoring)
        └─ Graph-Grounded Idea Generator (4 strategy types + evidence certificate)

1. High-level Design (Indexing → Retrieval → Generation)

1.1 Indexing

Dimension	Approach
Chunking strategy	Method-level entity extraction, replacing traditional paper/paragraph-level extraction
Index structure	Method evolution graph: method nodes + typed causal edges + structured evidence attributes
Knowledge representation	Directed causal network: nodes = methods/papers/stubs, edges = 7 semantic types (extends/improves/replaces/adapts/uses_component/compares/background), each edge carrying bottleneck/mechanism/trade-off/confidence/verbatim citation
Construction cost	High: 1M+ paper corpus processing, LLM extraction (Qwen3.6-35B-A3B), alias resolution, edge classification, evidence filling
Core characteristic	Method-level atomic units replace paper-level; each causal edge mandatorily carries verbatim citations and structured evidence

1.2 Retrieval

Dimension	Approach
Retrieval method	Graph traversal (SGT-MCTS searches evolution paths on the strong-causal subgraph)
Retrieval granularity	Method nodes + Evolution paths (evolution chains)
Iteration strategy	Multi-hop (forward/backward exploration along strong-causal edges) + Branch discovery restart
Query processing	Keyword matching (canonical/alias) + BM25 lexical matching
Core characteristic	SGT-MCTS dual-tree search with graph-aware and time-aware priors; branch discovery prevents greedy collapse

1.3 Generation

Dimension	Approach
Context injection	Lineage paths injected into Idea Evaluator / Generator
Citation tracing	Verbatim citation grounding + evidence certificate (specific causal edge, bottleneck original text, unresolved explanation)
Quality control	Graph statistics replace LLM free-text judgment; evidence certificate prevents hallucination; verification failure triggers fallback
Core characteristic	Graph-grounded idea evaluation/generation; structural evidence replaces parametric memory

2. Offline Construction: Indexing (Detailed Execution)

Step 2.1 Corpus Preparation

Item	Description
Input	1,030,314 AI papers (covering AI conferences, journals, arXiv preprints)
Operation	Collect and preprocess paper corpus, construct raw document repository
Output	Structured paper corpus

Step 2.2 Method Entity Extraction and Alias Resolution

Step 2.2.1 Seed Method Construction

Item	Description
Input	Human-curated list of well-known methods
Operation	Manually establish initial method seed set
Output	Seed method set

Step 2.2.2 Method Expansion

Item	Description
Input	Seed method set + paper corpus
Operation	LLM Proposer scans entire corpus, identifies and supplements additional candidate method entities
Output	Expanded candidate method set

Step 2.2.3 Alias Registry Construction

Item	Description
Input	Expanded method entity set
Operation	Build alias registry A: V_M → 2^Σ*, mapping each canonical method to a set of surface forms
Matching rules	Substring matching + case/punctuation normalization + word boundary enforcement (prevents “GPT” matching inside “lgpto”) + longest match priority (“GPT-4 Turbo” > “GPT-4” > “GPT”)
Version merging	“-v2”, “-Large” etc. appended to parent unless an independent canonical node already exists
Ambiguity handling	Manually maintained negative-surface list (e.g., state space model “Mamba” vs. Python linter “Mamba”)
Scale	8,155 canonical methods / 9,545 aliases
Output	Alias registry A
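
The matching rules above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `ALIASES` registry, the normalization regex, and the function names are all assumptions made for the example.

```python
import re

# Hypothetical mini-registry: canonical method -> surface forms (illustrative only).
ALIASES = {
    "GPT": ["GPT"],
    "GPT-4": ["GPT-4", "GPT4"],
    "GPT-4 Turbo": ["GPT-4 Turbo"],
}

def normalize(text):
    """Lowercase and collapse punctuation/whitespace runs into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def build_patterns(aliases):
    """Compile one word-bounded pattern per surface form, longest form first."""
    entries = []
    for canonical, surfaces in aliases.items():
        for surface in surfaces:
            norm = normalize(surface)
            # Lookarounds enforce word boundaries, so "gpt" cannot match inside "lgpto".
            pattern = re.compile(r"(?<![a-z0-9])" + re.escape(norm) + r"(?![a-z0-9])")
            entries.append((len(norm), canonical, pattern))
    entries.sort(key=lambda entry: -entry[0])
    return entries

def match_methods(text, entries):
    """Return canonical methods in order of appearance, preferring the longest match."""
    norm = normalize(text)
    taken = [False] * len(norm)  # characters already claimed by a longer match
    found = []
    for _, canonical, pattern in entries:
        for m in pattern.finditer(norm):
            if not any(taken[m.start():m.end()]):
                for i in range(m.start(), m.end()):
                    taken[i] = True
                found.append((m.start(), canonical))
    return [canonical for _, canonical in sorted(found)]

ENTRIES = build_patterns(ALIASES)
print(match_methods("We compare GPT-4 Turbo with GPT-4 and lgpto.", ENTRIES))
```

Processing surface forms longest-first and marking claimed characters is one simple way to get the "GPT-4 Turbo" > "GPT-4" > "GPT" priority; the negative-surface list and version merging are left out of this sketch.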

Step 2.3 Citation Edge Semantic Classification (Two-Phase LLM Extraction)

Step 2.3.1 Phase 1: Edge Type Classification

Item	Description
Input	Citation relationships between papers
Operation	Use Qwen3.6-35B-A3B to classify each citation edge into semantic types
Classification system	7 types: strong-causal (extends / improves / replaces / adapts), non-strong (uses_component / compares), non-causal (background)
Accuracy	Production model 70.4%; audit model (Claude-Sonnet-4.6) 93.0%
Output	Semantically typed edges

Step 2.3.2 Phase 2: Structured Record Completion

Item	Description
Input	Non-background causal edges
Operation	Complete structured evidence records for each edge
Output	Causal edges carrying structured attributes

Step 2.4 Causal Edge Evidence Filling

Step 2.4.1 14-Category Bottleneck Taxonomy

Bottleneck Dimension	Operational Definition
computational complexity	asymptotic or wall-clock compute at fixed scale
memory efficiency	peak activation / parameter memory footprint
parallelization	degree of across-device or across-token parallelism
accuracy	task-level correctness or quality metric
generalization	out-of-distribution / cross-domain transfer
scalability	behavior as model / data / context size grows
data efficiency	sample complexity at fixed quality target
training stability	variance / divergence risk during optimization
inference speed	runtime latency or throughput
expressiveness	function class or representational capacity
simplicity	implementation, conceptual, or interface simplicity
robustness	behavior under perturbation or adversarial input
hyperparameter sensitivity	outcome variance w.r.t. hyperparameter choice
training complexity	engineering difficulty of the training recipe

Step 2.4.2 Structured Evidence Quadruple

Each causal edge e carries quadruple ρ(e) = (b_e, m_e, t_e, c_e):

Attribute	Meaning	Key Design
b_e	Bottleneck addressed	14-axis bottleneck taxonomy (fixed at publication time)
m_e	Mechanism employed	LLM-extracted structured field
t_e	Trade-off	Cost/limitation of the mechanism
c_e	LLM-reported confidence	∈ [0,1], used later by SGT-MCTS
Citation grounding	Verbatim excerpt	All non-background edges mandatorily paired with verbatim quote
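
One possible in-memory representation of an edge and its quadruple ρ(e) is sketched below. The class name, field names, and the `source`/`target` orientation are assumptions for illustration; the notes only specify the seven edge types, the 14 axes, the range of c_e, and the mandatory quote on non-background edges.

```python
from dataclasses import dataclass

STRONG_CAUSAL = {"extends", "improves", "replaces", "adapts"}
EDGE_TYPES = STRONG_CAUSAL | {"uses_component", "compares", "background"}
BOTTLENECK_AXES = {
    "computational complexity", "memory efficiency", "parallelization",
    "accuracy", "generalization", "scalability", "data efficiency",
    "training stability", "inference speed", "expressiveness", "simplicity",
    "robustness", "hyperparameter sensitivity", "training complexity",
}

@dataclass(frozen=True)
class EvidenceEdge:
    source: str              # hypothetical orientation: earlier method
    target: str              # hypothetical orientation: later method
    edge_type: str           # one of the 7 semantic types
    bottleneck: str = ""     # b_e, one of the 14 axes when provided
    mechanism: str = ""      # m_e
    trade_off: str = ""      # t_e
    confidence: float = 0.0  # c_e in [0, 1]
    quote: str = ""          # verbatim citation grounding

    def __post_init__(self):
        if self.edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {self.edge_type}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        if self.bottleneck and self.bottleneck not in BOTTLENECK_AXES:
            raise ValueError(f"unknown bottleneck axis: {self.bottleneck}")
        # Every non-background edge must carry a verbatim quote.
        if self.edge_type != "background" and not self.quote.strip():
            raise ValueError("non-background edges require a verbatim quote")

    @property
    def is_strong_causal(self):
        return self.edge_type in STRONG_CAUSAL
```

Validating at construction time keeps malformed LLM extractions out of the graph, which matters because c_e is later consumed by SGT-MCTS.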

Step 2.4.3 Verbatim Citation Verification

Item	Description
Input	LLM-extracted citations + original papers
Operation	Search Match + Symmetry Check
Purpose	Ensure citations actually exist in the original text, preventing hallucination
Output	Verified verbatim citations
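
The "search match" half of this step can be approximated as a normalized substring test. The sketch below is an assumption-laden illustration: the normalization choices and the `min_chars` threshold are invented here, and the symmetry check is not implemented because the notes do not describe it.

```python
import re
import unicodedata

def canonical(text):
    """Unicode-normalize, collapse whitespace (including line breaks), and lowercase."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip().lower()

def quote_is_verbatim(quote, paper_text, min_chars=20):
    """True if the quote occurs in the paper text after light normalization.

    min_chars is a hypothetical guard against trivially short quotes.
    """
    q = canonical(quote)
    return len(q) >= min_chars and q in canonical(paper_text)

paper = "This method\nreduces memory footprint of\nattention layers."
print(quote_is_verbatim("reduces memory   footprint of attention", paper))
print(quote_is_verbatim("reduces the compute cost of attention", paper))
```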

Step 2.4.4 Graph Scale

Metric	Value
Papers	1,030,314
Method nodes	8,155 canonical
Aliases	9,545
Semantically typed edges	9,430,201

3. Online Query: Retrieval (Detailed Execution)

3.1 Retrieval Mode Overview

Intern-Atlas employs a single lineage reconstruction retrieval mode, with SGT-MCTS searching on the method evolution graph:

Mode	Applicable Scenario	Core Mechanism	Characteristic
Lineage Reconstruction	Query method’s evolution history	SGT-MCTS forward/backward dual-tree search on strong-causal subgraph	Graph-aware + time-aware; branch discovery prevents greedy collapse

3.2 Retrieval Procedure

Step 3.1: Node Matching

Item	Description
Input	User query q
Operation	Parse query, construct seed method set S(q)
Matching method 1	Vocabulary keyword matching (exact hit on canonical / alias)
Matching method 2	BM25 matching, a lexical ranking function (handles semantically ambiguous queries)
Output	Seed method set S(q) ⊆ C_q

Step 3.2: SGT-MCTS Lineage Reconstruction

Item	Description
Input	Seed method set S(q)
Operation	On strong-causal subgraph (V_M, ε_sc), construct directed evolution path set Π_q along publication chronology
Why not standard MCTS?	Central nodes have extremely high branching → standard UCT easily gets trapped in high-visit branches; standard UCT does not perceive graph structure and temporal direction, producing implausible paths where “ancestors come from descendants”

Step 3.2.1: Dual-tree Construction

Item	Description
Operation	Each seed simultaneously constructs forward + backward trees
Forward tree	Searches for predecessor methods
Backward tree	Searches for successor developments
Key constraint	Operates only on the strong-causal edge subgraph

Step 3.2.2: SGT-UCT Selection

\[\text{SGT-UCT}(u, v) = \underbrace{\text{UCT}(u, v)}_{\text{standard exploration-exploitation}} + \lambda \cdot \underbrace{\alpha_G(u, v)}_{\text{graph-aware prior}}\]
\[\alpha_G(u, v) = \underbrace{\text{conf}(e_{u \to v})}_{\text{edge confidence}} \cdot \underbrace{\text{TC}(\Delta\tau_{uv})}_{\text{temporal coherence}}\]

Component	Source
conf(e)	Edge confidence c_e reported by LLM during graph construction
TC(Δτ)	Manually segmented function scoring publication year difference

\[\text{TC}(\Delta\tau) = \begin{cases} 0.40 & -1 \leq \Delta\tau < 0 \quad \text{(slight temporal overlap, e.g., preprints)} \\ 0.85 & \Delta\tau = 0 \quad \text{(same year)} \\ 1.00 & 1 \leq \Delta\tau \leq 3 \quad \text{(optimal: 1-3 year natural evolution)} \\ 0.80 & 4 \leq \Delta\tau \leq 6 \quad \text{(slightly distant but still reasonable)} \\ \max(0.30,\ 1.00 - 0.08(\Delta\tau - 6)) & \Delta\tau > 6 \\ 0.70 & \tau \text{ missing} \end{cases}\]

Calibration range limitation: TC is only calibrated on post-2015 AI literature; domains with different research paces require recalibration.
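
The TC schedule and the SGT-UCT score translate directly into code. In the sketch below the piecewise values come from the formula above; the UCB1 form of the standard UCT term, the default `c_explore` and `lam`, integer year differences, and the return value 0.0 for Δτ < -1 (a case the formula does not cover) are assumptions.

```python
import math

def temporal_coherence(delta_years):
    """Piecewise TC score for an integer publication-year difference (None = missing)."""
    if delta_years is None:
        return 0.70
    if delta_years < -1:
        return 0.0  # assumption: not covered by the formula; treated as filtered out
    if delta_years < 0:
        return 0.40
    if delta_years == 0:
        return 0.85
    if delta_years <= 3:
        return 1.00
    if delta_years <= 6:
        return 0.80
    return max(0.30, 1.00 - 0.08 * (delta_years - 6))

def sgt_uct(total_reward, visits, parent_visits, conf, delta_years,
            c_explore=math.sqrt(2), lam=1.0):
    """Standard UCT (UCB1 form, assumed) plus lam * conf * TC as the graph-aware prior."""
    if visits == 0:
        return math.inf  # unvisited children are tried first
    uct = total_reward / visits + c_explore * math.sqrt(math.log(parent_visits) / visits)
    return uct + lam * conf * temporal_coherence(delta_years)

print(temporal_coherence(2), temporal_coherence(5), temporal_coherence(7))
```

Reproduced as written, the schedule is not monotone: a 7-year gap scores about 0.92, above the 0.80 assigned to 4-6 years. Anyone reproducing the method may want to check this against the paper.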

Step 3.2.3: Expansion (Confidence-prioritized)

Item	Description
Loop exclusion	Discard nodes that would create cycles in the path
Hard temporal filtering	Discard nodes with reversed temporal direction (prevents “descendants producing ancestors” paradox)
Expansion strategy	Select the unexplored child node with highest confidence (no random expansion)
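
A minimal sketch of the expansion rule follows. It assumes an adjacency-list graph of `(child, confidence)` pairs, a search direction in which children should not predate the parent, and a one-year slack mirroring the -1 ≤ Δτ < 0 case of TC; all of these are illustrative choices rather than details from the paper.

```python
def expand(path, graph, year, explored, slack_years=1):
    """Pick the unexplored child of the path tail with the highest edge confidence.

    graph: node -> list of (child, confidence); year: node -> publication year.
    Returns None when no admissible child remains.
    """
    tail = path[-1]
    best = None
    for child, conf in graph.get(tail, []):
        if child in path or child in explored:
            continue  # loop exclusion, or already expanded
        t_tail, t_child = year.get(tail), year.get(child)
        if t_tail is not None and t_child is not None and t_child - t_tail < -slack_years:
            continue  # hard temporal filter: child is clearly older than its parent
        if best is None or conf > best[1]:
            best = (child, conf)
    return None if best is None else best[0]

graph = {"A": [("B", 0.9), ("C", 0.95), ("D", 0.5)]}
year = {"A": 2018, "B": 2019, "C": 2015, "D": 2020}
print(expand(["A"], graph, year, set()))
```

In the toy graph, "C" has the highest confidence but is discarded by the temporal filter, so "B" is expanded first.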

Step 3.2.4: Rollout (Greedy)

Item	Description
Strategy	Greedy rollout, no MC random sampling
Selection	Directly select child node with highest confidence, avoiding noise introduction
Termination conditions	Leaf node with no outgoing edges / cycle detected / maximum depth reached
Scoring	R(π): paper does not provide specific form
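
The greedy rollout and its three termination conditions can be sketched as follows. Because the form of R(π) is not given, the mean edge confidence returned here is only a placeholder reward, and the graph layout is the same assumed adjacency list of `(child, confidence)` pairs.

```python
def greedy_rollout(start, graph, max_depth):
    """Follow the highest-confidence child until a leaf, a cycle, or max_depth."""
    path, confs = [start], []
    while len(confs) < max_depth:            # termination: maximum depth reached
        children = graph.get(path[-1], [])
        if not children:                     # termination: leaf without outgoing edges
            break
        nxt, conf = max(children, key=lambda vc: vc[1])
        if nxt in path:                      # termination: cycle detected
            break
        path.append(nxt)
        confs.append(conf)
    # Placeholder reward (assumption): mean confidence along the rollout.
    reward = sum(confs) / len(confs) if confs else 0.0
    return path, reward

graph = {"A": [("B", 0.9), ("C", 0.4)], "B": [("D", 0.7)], "D": [("A", 0.8)]}
print(greedy_rollout("A", graph, max_depth=10))
```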

Step 3.2.5: Backpropagation

Item	Description
Operation	Backpropagate to update cumulative reward and visit count
Special penalty	Additional score deduction for ancestors of non-expandable leaf nodes → reduces dead ends

Step 3.2.6: Path Stitching and Deduplication

Item	Description
Operation	Take top-5 cumulative reward paths from forward/backward each → stitch through seed node
Deduplication	Jaccard similarity threshold 0.8 to determine path homogeneity; keep only the higher-ranked path for homogeneous ones
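
Stitching and deduplication are simple enough to sketch directly. The code assumes that both trees return paths starting at the seed, that Jaccard similarity is computed over node sets, and that a similarity at or above 0.8 counts as homogeneous; the notes do not settle any of these three points.

```python
def stitch(backward, forward):
    """Join a backward and a forward path that both start at the seed node."""
    return list(reversed(backward)) + forward[1:]

def jaccard(a, b):
    """Jaccard similarity of the node sets of two paths."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

def deduplicate(ranked_paths, threshold=0.8):
    """Keep a path only if it is not homogeneous with a higher-ranked kept path."""
    kept = []
    for path in ranked_paths:  # assumed sorted from best to worst
        if all(jaccard(path, other) < threshold for other in kept):
            kept.append(path)
    return kept

paths = [["A", "B", "C", "D", "E"], ["A", "B", "C", "D", "E", "F"], ["A", "X", "Y"]]
print(deduplicate(paths))
print(stitch(["S", "P1", "P2"], ["S", "C1"]))
```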

Step 3.2.7: Branch Discovery

Item	Description
Problem	Greedy collapse prevents branch paths from being discovered
Branch node definition	A node with at least 2 strong-causal child nodes in the search direction, of which the final lineage path includes only 1
Restart search	Restart algorithm from the branch node
Constraint	Edges already in the main lineage are forcibly blocked; search budget halved
Output merge	Aggregate main path + branch paths
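
Branch-node detection and the restart constraints can be sketched as below. The function name, the returned dictionary layout, and the integer halving of the budget are illustrative assumptions; the graph is again an adjacency list of `(child, confidence)` pairs restricted to strong-causal edges.

```python
def branch_restarts(lineage, graph, budget):
    """List restart jobs for nodes where the lineage used only one of >= 2 children."""
    on_path = set(zip(lineage, lineage[1:]))  # edges of the main lineage
    restarts = []
    for u in lineage:
        children = [v for v, _ in graph.get(u, [])]
        used = [v for v in children if (u, v) in on_path]
        if len(children) >= 2 and len(used) == 1:
            restarts.append({
                "node": u,
                "blocked_edges": on_path,  # main-lineage edges are blocked
                "budget": budget // 2,     # search budget halved
            })
    return restarts

graph = {"A": [("B", 0.9), ("C", 0.6)], "B": [("D", 0.8)]}
print(branch_restarts(["A", "B", "D"], graph, budget=100))
```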

Step 3.3: Lineage Rerank

Item	Description
Input	Candidate evolution path π
Ranking formula	rank(π) = w_ℓ · |π| / L_max + w_c · conf̄(π) + w_m · N̄(π)
First term	Rewards long paths (sufficient evolution)
Second term	Path average evidence strength
Third term	Path node average visit count (search consensus)
Output	Re-ranked evolution paths
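
The rerank score is a weighted sum of three terms, which the sketch below computes for a list of candidate paths. The weights are made-up values, |π| is taken as the node count, and the mean visit count is divided by the maximum over candidates so that all three terms share a scale; none of these choices is specified in the notes.

```python
def rerank(paths, w_len=0.3, w_conf=0.5, w_visit=0.2):
    """Sort candidate paths by w_len*|pi|/L_max + w_conf*mean_conf + w_visit*mean_visits.

    Each path is a dict with keys "nodes", "mean_conf", and "mean_visits".
    """
    l_max = max(len(p["nodes"]) for p in paths)
    n_max = max(p["mean_visits"] for p in paths) or 1  # avoid division by zero

    def score(p):
        return (w_len * len(p["nodes"]) / l_max
                + w_conf * p["mean_conf"]
                + w_visit * p["mean_visits"] / n_max)

    return sorted(paths, key=score, reverse=True)

candidates = [
    {"nodes": ["A", "B"], "mean_conf": 0.95, "mean_visits": 4},
    {"nodes": ["A", "B", "C", "D", "E"], "mean_conf": 0.90, "mean_visits": 10},
]
print([p["nodes"] for p in rerank(candidates)])
```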

3.3 Retrieval Algorithm Comparison

Method	NR	ER	CAS
Beam@1	41.0	18.6	41.0
Beam@5	43.4	21.6	43.4
Beam@10	44.9	23.2	44.9
RW@5	28.1	0.7	28.1
SGT-MCTS	84.8	79.0	84.8

4. Online Generation: Generation (Detailed Execution)

4.1 Generation Procedure

Input: query q, retrieved evolution paths Π_q
    │
    ▼
┌─────────────────────────────┐
│ Parse methods mentioned     │
│ in idea d, map to nodes M_d │
│ via lookup table A          │
└─────────────┬───────────────┘
              │
         ┌────┴────┐
         ▼         ▼
   Evaluator   Generator
   (Evaluate)   (Generate)
      │           │
      ▼           ▼
  5-dim scoring  4 strategy types
  +cross penalty +evidence certificate

Step 4.1: Graph-Grounded Idea Evaluator

Step 4.1.1: Motivation

Free-text LLM judges prefer stacking popular methods; novelty scores are negatively correlated with actual scientific impact; graph statistics can directly provide structural evidence.

Step 4.1.2: Per-dimension Scoring

Item	Description
Input	Methods mentioned in idea d → resolved to nodes M_d ⊆ V_M via lookup table A
Operation	Independently evaluate 5 dimensions based on graph statistics
Dimensions	Novelty (N), Feasibility (F), Significance (S), Validity (V), Clarity (C)
Calculation	Each dimension score is computed directly from graph statistics describing the position and connectivity of M_d within the retrieval context C_d

Formula:

\[s_k(d, G) = \text{clip}_{[1,10]}\left(b_k + \sum_j w_j^{(k)} \cdot \phi_j^{(k)}(M_d, C_d)\right), \quad k \in \{N, F, S, V, C\}\]

Step 4.1.3: Cross-dimensional Aggregation

Item	Description
Operation	Fixed weight vector w + 4 hand-crafted conjunctive penalties Ω_cross
Representative penalty	Strong deduction when high Novelty + low Feasibility — “novel but infeasible” typically indicates a flawed core proposal

Formula:

\[s^*(d, G) = \text{clip}_{[1,10]}\left(\mathbf{w}^\top \mathbf{s} + \Omega_{\text{cross}}(\mathbf{s})\right), \quad \mathbf{s} = (s_N, s_F, s_S, s_V, s_C)\]
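
Both formulas are clipped linear combinations, so a small sketch makes the aggregation concrete. The uniform weights, the penalty thresholds, and the penalty size below are invented for illustration; the notes state only that w is fixed and that there are 4 hand-crafted conjunctive penalties.

```python
def clip(x, lo=1.0, hi=10.0):
    """Clamp a score to the [1, 10] range used by both formulas."""
    return max(lo, min(hi, x))

def dimension_score(bias, weights, features):
    """s_k = clip(b_k + sum_j w_j * phi_j) for one dimension k."""
    return clip(bias + sum(w * f for w, f in zip(weights, features)))

def novel_but_infeasible(scores):
    """Hypothetical conjunctive penalty: high Novelty together with low Feasibility."""
    return -2.0 if scores["N"] >= 8 and scores["F"] <= 4 else 0.0

def overall_score(scores, weights, penalties):
    """s* = clip(w^T s + Omega_cross(s)), with Omega_cross as a sum of penalties."""
    base = sum(weights[k] * scores[k] for k in scores)
    omega = sum(penalty(scores) for penalty in penalties)
    return clip(base + omega)

scores = {"N": 9, "F": 3, "S": 7, "V": 7, "C": 7}
weights = {k: 0.2 for k in scores}  # illustrative uniform weights
print(overall_score(scores, weights, [novel_but_infeasible]))
```

With these toy numbers the weighted sum is 6.6 and the penalty lowers it to 4.6, which shows how a single conjunctive rule can outweigh an otherwise reasonable average.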

Step 4.1.4: Idea Evaluation Results

Publication tier	Overall Score (mean, reported with ± 1 s.d.)
Top-tier	8.48
Core	7.83
Workshop	6.85
Rejected	5.84

Step 4.2: Graph-Grounded Idea Generator

Step 4.2.1: Structural Gap Pattern → Generation Strategy Mapping

Gap Pattern	Corresponding Generation Strategy	Description
Open axes	Bottleneck Resolution	Address identified but unresolved bottlenecks in the graph
Recent improvement direction	Trend Extrapolation	Extrapolate along recent improvement directions
Sacrifice axes	Cross-pollination	Cross-domain/cross-method combination, leveraging trade-offs between different methods
Disconnected pairs	Paradigm Challenge	Challenge existing paradigms, connecting disconnected method pairs in the graph

Step 4.2.2: Evidence Certificate (Core Anti-hallucination Mechanism)

Field	Content
Certificate tuple	(specific causal edge, bottleneck original text in the graph, explanation of why this bottleneck remains unresolved)
Verification	Exact match of bottleneck original text in the certificate against actual graph content
Failure handling	Discard result, fallback
Design goal	Ensure ideas are traceable without over-constraining generation
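
Certificate verification reduces to an exact string comparison against the graph. The sketch below assumes the graph exposes a mapping from edge to stored bottleneck text and that "fallback" means trying the next candidate and otherwise returning a default; both are assumptions, as are all names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Certificate:
    edge: tuple             # (source, target) of the specific causal edge
    bottleneck_text: str    # bottleneck original text quoted from the graph
    unresolved_reason: str  # why this bottleneck remains unresolved

def verify(cert, graph_bottlenecks):
    """Exact-match the certificate's bottleneck text against the graph content."""
    stored = graph_bottlenecks.get(cert.edge)
    return (stored is not None
            and stored == cert.bottleneck_text
            and bool(cert.unresolved_reason.strip()))

def first_verified(candidates, graph_bottlenecks, fallback=None):
    """Discard ideas whose certificate fails; return the first that passes."""
    for idea, cert in candidates:
        if verify(cert, graph_bottlenecks):
            return idea
    return fallback

gb = {("MethodA", "MethodB"): "quadratic memory growth with sequence length"}
good = Certificate(("MethodA", "MethodB"),
                   "quadratic memory growth with sequence length",
                   "no later edge addresses it")
bad = Certificate(("MethodA", "MethodB"), "quadratic memory growth", "paraphrased")
print(first_verified([("idea-1", bad), ("idea-2", good)], gb))
```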

5. Key Design Decisions

Decision Point	Intern-Atlas’s Choice	Alternative	Rationale
Graph atomic unit	Method entities (method-level)	Papers (paper-level)	Paper-level citations cannot distinguish extends/improves/replaces/compares/background
Causal edge evidence	Mandatory verbatim citations + structured bottleneck/mechanism	Edge type labels only	Provides grounded evidence, supports certificate verification for idea generation
Lineage search algorithm	SGT-MCTS (graph-aware + time-aware)	Beam search / standard MCTS / Random Walk	High branching at central nodes causes beam collapse; standard MCTS does not perceive publication-year direction; Random Walk performs extremely poorly (ER of only 0.7)
Rollout strategy	Greedy (highest confidence)	MC random sampling	Random sampling introduces LLM extraction noise
Dead-end handling	Backpropagation additional penalty + branch discovery restart	Pure reliance on UCT exploration term	Explicitly suppresses greedy collapse, proactively recalls ignored branches
Idea evaluation	Per-dimension graph statistics + hand-crafted conjunctive penalties	Free-text LLM judge	The latter’s novelty score is negatively correlated with actual impact
Idea generation	Structural gap patterns + evidence certificate	Direct LLM free generation	Forces grounding to specific causal edges and bottlenecks, avoids method name stacking

6. Evaluation

6.1 Graph Construction Quality

Benchmark: 30 high-impact surveys → 30 method evolution graphs, containing 2,268 nodes / 1,462 edges / 133 evolution chains.

Metric	Meaning	Result
NMR (Node Match Ratio)	Node match rate	91.0%
ERR (Edge Reachable Ratio)	Edge reachability rate	89.7%
PSC (Path Semantic Correctness)	Path semantic correctness	92.0%
NR (Node Recall)	Node recall	84.8% (SGT-MCTS)
ER (Edge Recall)	Edge recall	79.0% (SGT-MCTS)
CAS (Chain Alignment Score)	Lineage chain alignment score	84.8% (SGT-MCTS)

6.2 Idea Evaluator (Strata Dataset)

Category	Paper Count	Overall Score
Top-tier (ICLR 2026, ICML 2025, NeurIPS 2025)	300	8.48
Core (AAAI 2026, IJCAI 2025)	300	7.83
Workshop (ICLR 2026)	300	6.85
Rejected (ICLR 2026)	300	5.84

Evaluation method: Publication tier verification (across publication strata) + human evaluation.

6.3 Idea Generator Comparative Experiments

Baseline	Retrieval/Knowledge Source
Direct LLM generation	No external retrieval
External search	OpenAlex / Semantic Scholar
Local RAG	BM25
Intern-Atlas	Method evolution graph + evidence certificate

6.4 SGT-MCTS Retrieval Algorithm Comparison

Method	NR	ER	CAS
Beam@1	41.0	18.6	41.0
Beam@5	43.4	21.6	43.4
Beam@10	44.9	23.2	44.9
RW@5	28.1	0.7	28.1
SGT-MCTS	84.8	79.0	84.8

7. Limitations and Applicability

Limitation	Specific Manifestation	Mitigation
Edge type classification accuracy	Phase-1 production model 70.4% / audit model 93.0%	Key decisions can undergo secondary audit; improve production model
Bottleneck taxonomy rigidity	14-axis taxonomy D fixed at publication time; new dimensions mapped to nearest existing axis	Await future taxonomy revision
Alias resolution coverage	Substring matching favors precision over recall; ambiguity handled via manual negative list	Biased toward high-quality nodes
Temporal coherence calibration range	TC calibrated on post-2015 AI literature	Cross-pace domains require recalibration
Rollout scoring opacity	The paper does not give the specific form of R(π)	Reproduction requires custom design

Best Applicable Scenarios

AI agents performing idea generation/evaluation that require structured method evolution priors
Traceable scientific idea generation (evidence certificate prevents hallucination)
Method lineage research / survey automation within AI subfields
Evaluating ideas while avoiding the “popular method stacking” bias of LLM free-text judges

Unsuitable Scenarios

Disciplines with research paces significantly different from AI (TC calibration fails)
Extremely niche subfields without human survey references (limited graph coverage)
Research focused on paper-level citation network properties (e.g., PageRank, influence propagation)

8. Quick Reference

What You Want to Know	See Which Section
What is the complete pipeline?	0. Execution Overview
High-level design comparison?	1. High-level Design
How is the graph constructed?	2. Offline Construction
How to retrieve evolution paths from the graph?	3. Online Query
How to use the graph to evaluate/generate ideas?	4. Online Generation
How does it differ from existing approaches?	5. Key Design Decisions
How does it perform?	6. Evaluation
When should it NOT be used?	7. Limitations and Applicability

Search-R1

One-sentence positioning: A framework that uses reinforcement learning to train LLMs to issue multi-turn search queries autonomously during step-by-step reasoning and to retrieve in real time; it addresses the suboptimal performance of prompt-engineered search by masking retrieved tokens in the loss and using an outcome-oriented reward.

Key innovation: Unlike prompt-driven search methods, Search-R1 models the search engine as part of the environment and trains the LLM via RL to autonomously learn the optimal search interaction strategy; simultaneously introduces retrieval token masking to prevent gradient updates on retrieved content, ensuring training stability.

0. Execution Overview

Training Phase
  ├─ ① Preparation: search engine as environment, policy model π_θ, reference model π_ref
  ├─ ② Rollout: model generates reasoning sequences, interleaving token generation with search engine retrieval
  │   ├─ Encountering uncertain knowledge → generate <search>... </search> query
  │   ├─ Search engine returns documents → wrap as <information>... </information>
  │   ├─ Reasoning process → wrap as <think>... </think>
  │   └─ Final answer → wrap as <answer>... </answer>
  ├─ ③ Reward computation: outcome-oriented reward r_φ(x, y) = EM(a_pred, a_gold)
  ├─ ④ Loss Masking: policy gradient computed only on LLM-generated tokens; retrieved content excluded from optimization
  └─ ⑤ Policy update: PPO or GRPO updates the policy model

Inference Phase (per query)
  ├─ ① Input question q
  ├─ ② Iterative generation: text generation ↔ search query alternation
  │   ├─ Generate response tokens
  │   ├─ Detect </search> / </answer> / <eos> → determine next action
  │   ├─ Search query → retrieve documents → inject into reasoning chain
  │   └─ Continue generation
  ├─ ③ Termination condition: maximum action count B reached or <answer> generated
  └─ ④ Output final answer

1. High-level Design (Indexing → Retrieval → Generation)

1.1 Indexing

Dimension	Approach
Chunking strategy	—
Index structure	— (relies on external search engine; offline index construction not explicitly recorded in notes)
Knowledge representation	—
Construction cost	—
Core characteristic	Notes do not explicitly record index construction; the offline phase focuses on model training rather than knowledge base construction

1.2 Retrieval

Dimension	Approach
Retrieval method	Dynamic search triggering (model learns via RL to autonomously decide when to retrieve)
Retrieval granularity	Documents (external documents returned by the search engine)
Iteration strategy	Multi-turn iteration (search may be triggered multiple times during reasoning until termination conditions are met)
Query processing	Model autonomously generates search queries based on current reasoning state
Core characteristic	RL-driven agentic search: model learns the optimal search interaction strategy rather than relying on manual prompt engineering

1.3 Generation

Dimension	Approach
Context injection	Retrieved documents injected into the reasoning chain via tags
Citation tracing	—
Quality control	Outcome-oriented reward (EM matching) drives generation quality; retrieval token masking ensures training stability
Core characteristic	Structured token system (<search> / <information> / <think> / <answer>); reasoning and retrieval proceed in an interleaved manner

2. Offline Construction: RL Training (Detailed Execution)

The core of Search-R1’s offline phase is RL training rather than traditional knowledge base index construction. The system relies on external search engines (e.g., Bing) for document retrieval, and the notes do not explicitly record traditional index construction steps.

Step 2.1 Search Engine Environment Modeling

Item	Description
Input	Policy LLM π_θ, search engine R
Operation	Model the search engine as part of the environment; sampled trajectory sequences interleave LLM token generation with search engine retrieval
Key decision	Unlike previous methods that rely solely on the policy LLM for rollout generation, explicitly introduces retrieval-interleaved reasoning: π_θ(· | x; R)
Output	Rollout environment with interleaved retrieval and reasoning

Step 2.2 Structured Token System Definition

Item	Description
Operation	Define four types of special tokens to structure the interaction
Key decision	Uses special tokens rather than free text, facilitating rule-based parsing and training control
Output	Token system: <search>/</search>, <information>/</information>, <think>/</think>, <answer>/</answer>

Step 2.3 Rollout Sequence Generation

Item	Description
Input	Query q, policy model π_θ, search engine R
Operation	Model generates response tokens y_t ~ π_θ(· | x, y_{<t}; R), appended to the rollout sequence
Key decision	The sequence contains two types of tokens: LLM-generated tokens and retrieved tokens
Output	Complete rollout sequence y

Step 2.4 Retrieval Token Masking (Loss Masking)

Item	Description
Problem	Optimizing retrieved tokens equally leads to unexpected learning dynamics
Operation	Introduce a loss mask ensuring the policy gradient objective is computed only on LLM-generated tokens
Key decision	Indicator function I(y_t): model-generated token = 1, retrieved content (within <information>...</information>) = 0
Output	Masked policy gradient computation
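
The indicator I(y_t) amounts to a 0/1 mask over the rollout. The sketch below uses string tokens and plain lists for readability, whereas a real trainer would work on token ids and tensors; masking the `<information>` tags themselves (since the environment inserts them) is an assumption.

```python
def build_loss_mask(tokens):
    """Return I(y_t): 1 for LLM-generated tokens, 0 for retrieved content."""
    mask, inside = [], False
    for tok in tokens:
        if tok == "<information>":
            inside = True
            mask.append(0)   # tag inserted by the environment (assumed masked)
        elif tok == "</information>":
            mask.append(0)
            inside = False
        else:
            mask.append(0 if inside else 1)
    return mask

def masked_mean(per_token_loss, mask):
    """Average the loss over unmasked (LLM-generated) tokens only."""
    total = sum(mask)
    return sum(l * m for l, m in zip(per_token_loss, mask)) / total if total else 0.0

tokens = ["<search>", "q", "</search>", "<information>", "doc", "</information>",
          "<answer>", "a", "</answer>"]
mask = build_loss_mask(tokens)
print(mask)
print(masked_mean([1, 1, 1, 9, 9, 9, 1, 1, 1], mask))
```

In the toy run the large losses on retrieved tokens do not affect the average, which is the effect the masking is meant to achieve.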

Step 2.5 Reward Computation

Item	Description
Input	Model’s final answer a_pred, ground-truth answer a_gold
Operation	Apply a rule-based outcome-oriented reward: r_φ(x, y) = EM(a_pred, a_gold)
Key decision	Does not use format reward (model already demonstrates strong structural compliance); does not train a neural reward model (avoids sensitivity and additional cost)
Output	Reward value
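
A rule-based EM reward needs only answer extraction and string comparison. In the sketch below the normalization (lowercasing, dropping punctuation and English articles) follows common QA evaluation practice and is an assumption, as is taking the last `<answer>` span when several occur.

```python
import re
import string

def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def extract_answer(rollout):
    """Return the content of the last <answer>...</answer> span, or None."""
    spans = re.findall(r"<answer>(.*?)</answer>", rollout, flags=re.DOTALL)
    return spans[-1].strip() if spans else None

def em_reward(rollout, gold_answers):
    """r(x, y) = EM(a_pred, a_gold): 1.0 on an exact normalized match, else 0.0."""
    pred = extract_answer(rollout)
    if pred is None:
        return 0.0
    target = normalize_answer(pred)
    return 1.0 if any(target == normalize_answer(g) for g in gold_answers) else 0.0

print(em_reward("<think>...</think><answer> The Eiffel Tower. </answer>", ["Eiffel Tower"]))
```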

Step 2.6 Policy Update (PPO / GRPO)

Item	Description
PPO	Actor-critic method; Actor generates answers, Critic value network estimates advantage, advantage function uses GAE
GRPO	Group-relative advantage estimation, no Critic network needed; samples G outputs per group to compute group baseline
Key decision	Both are compatible; GRPO converges faster, PPO trains more stably, final rewards are comparable
Output	Updated policy model parameters
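
The group-relative baseline in GRPO can be illustrated without any RL library: each of the G sampled outputs for a prompt is scored, and its advantage is its reward standardized within the group. The use of the population standard deviation and the `eps` stabilizer are implementation assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of G sampled outputs."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (assumed convention)
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for one question, scored with a 0/1 exact-match reward.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient, which is one reason a sparse EM reward benefits from sampling several outputs per question.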

3. Online Query: Retrieval (Detailed Execution)

3.1 Retrieval Mode Overview

Search-R1 employs a single RL-driven search mode, autonomously triggered by the trained model during inference:

Mode	Applicable Scenario	Core Mechanism	Characteristic
RL-driven Search	Knowledge uncertainty encountered during reasoning	Model autonomously generates queries → retrieves → injects	Learned, not manually predefined strategy

3.2 Retrieval Procedure

Step 3.1: Reasoning and Search Alternation

Item	Description
Input	Query q, current reasoning chain state
Operation	LLM alternates between text generation and external search engine queries
Key decision	Iterative framework: generates at each step until </search>, </answer>, or <eos> is detected
Output	Response token sequence

Step 3.2: Search Query Detection and Execution

Item	Description
Input	Query wrapped in <search>...</search> tags
Operation	Extract search query from rollout sequence, invoke search engine to retrieve documents D
Output	Retrieved documents, wrapped as <information>...</information> and injected into the sequence

Step 3.3: Iterative Termination Judgment

Item	Description
Termination condition 1	Maximum action count B reached
Termination condition 2	Model generates final response wrapped in <answer>...</answer> tags
Fallback strategy	If a generation step yields neither a search query nor an answer, the message "My action is not correct. Let me rethink." is appended to the sequence and iteration continues

4. Online Generation: Generation (Detailed Execution)

4.1 Reasoning Procedure

Input question q
    │
    ▼
┌─────────────────────────┐
│ Initialize: action count b ← 0 │
│ Initialize: response y ← ∅      │
└───────────┬─────────────┘
            │
            ▼
    ┌───────┴───────┐
    │  while b < B  │
    └───────┬───────┘
            │
            ▼
    ┌───────────────────┐
    │ Generate response tokens │
    │ until termination signal │
    └─────────┬─────────┘
              │
         ┌────┴────┐
         ▼         ▼
      <search>   <answer>
         │         │
         ▼         ▼
      Retrieve    Return y
      documents
      inject
         │
         ▼
      b ← b + 1
         │
         └──────→ Continue generating

Step 4.1: Online Reasoning Sequence Generation

Item	Description
Input	Query q, trained policy model π_θ, search engine R
Operation	Model autoregressively generates response tokens; pauses upon encountering a completed <search>...</search> query to trigger retrieval, then continues generation after injecting results
Key decision	Reasoning process is structurally consistent with the training Rollout, but no longer computes gradient updates
Output	Final response sequence y, containing the answer wrapped in <answer>...</answer> tags
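
The inference loop can be sketched with the model and the search engine passed in as plain callables. `generate` and `search` are hypothetical stand-ins (a real system would call the trained policy and a retriever), the default budget is arbitrary, and appending the rethink message when no action is detected is an assumption based on the fallback mentioned in the notes.

```python
import re

def run_search_r1(question, generate, search, max_actions=4):
    """Alternate generation and retrieval until an answer appears or the budget B is spent.

    generate(prompt) -> next text segment; search(query) -> list of document strings.
    """
    y = ""
    for _ in range(max_actions):                       # while b < B
        segment = generate(question + y)
        y += segment
        if re.search(r"<answer>(.*?)</answer>", segment, re.DOTALL):
            return y                                   # final answer produced
        query = re.search(r"<search>(.*?)</search>", segment, re.DOTALL)
        if query:
            docs = search(query.group(1).strip())
            y += "<information>" + " ".join(docs) + "</information>"
        else:
            # Assumed fallback: nudge the model when no valid action was detected.
            y += "My action is not correct. Let me rethink."
    return y

# Scripted stand-ins for the policy model and the search engine.
outputs = iter([
    "<think>I need facts.</think><search>capital of France</search>",
    "<think>Found it.</think><answer>Paris</answer>",
])
result = run_search_r1("Q: What is the capital of France?",
                       lambda prompt: next(outputs),
                       lambda q: ["Paris is the capital of France."])
print(result)
```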

5. Key Design Decisions

Decision Point	Search-R1’s Choice	Alternative	Rationale
Training paradigm	Reinforcement learning (PPO / GRPO)	Prompt engineering / SFT / Rejection sampling	Prompting advanced LLMs to use search engines is often suboptimal; LLMs may not fully possess the ability to interact optimally with search engines
Retrieval triggering	Model learns autonomously (<search> token)	Fixed retrieval strategy / manually predefined trigger conditions	Explicitly stated in notes: RL training enables the model to autonomously learn when search is needed
Reward design	Outcome-oriented reward (EM matching)	Process reward / neural reward model / format reward	A simple outcome reward avoids complexity; no neural reward model is trained, which avoids its sensitivity and cost
Training stability	Retrieval token masking	No special handling of retrieved content	Optimizing retrieved tokens on an equal footing with generated tokens leads to unexpected learning dynamics
RL algorithm	Supports both PPO and GRPO	Single RL algorithm	Both algorithms are compatible with the framework, which allows an empirical comparison

6. Evaluation

6.1 Evaluation Metrics

Metric	Meaning	This System vs. Baseline
EM	Exact Match	Qwen2.5-7B +41% over RAG baselines; Qwen2.5-3B +20% over RAG baselines

6.2 RL Training Comparison

Condition	Description
PPO	Actor-critic RL; higher training stability
GRPO	Group Relative Policy Optimization; faster convergence
Conclusion	GRPO converges faster, PPO is more stable, final training rewards are comparable

6.3 Experimental Setup

Condition	Description
Search-R1	Full system (RL training + retrieval masking + outcome reward)
CoT	Chain-of-thought baseline
vanilla RAG	Standard retrieval-augmented generation baseline
IRCoT	Iterative retrieval chain-of-thought baseline
Search-o1	Inference-time search augmentation baseline
R1	DeepSeek-R1 baseline
SFT	Supervised fine-tuning baseline
Rejection Sampling	Rejection sampling baseline

6.4 Datasets

Dataset	Description
Natural Questions (NQ)	Open-domain QA benchmark
TriviaQA	Open-domain QA benchmark
PopQA	Knowledge-intensive QA benchmark
HotpotQA	Multi-hop QA benchmark
2WikiMultiHopQA	Multi-hop QA benchmark
MuSiQue	Multi-hop QA benchmark
Bamboogle	Open-domain QA benchmark

7. Limitations and Applicability

The notes do not explicitly record Search-R1’s limitations.

Best Applicable Scenarios

Reasoning tasks requiring efficient acquisition of external knowledge and up-to-date information
Open-domain QA (NQ, TriviaQA, etc.)
Multi-hop QA (HotpotQA, 2WikiMultiHopQA, MuSiQue, etc.)
Knowledge-intensive reasoning tasks

Unsuitable Scenarios

8. Quick Reference

What You Want to Know	See Which Section
What is the complete pipeline?	0. Execution Overview
High-level design comparison?	1. High-level Design
How is the model trained?	2. Offline Construction
How is retrieval triggered?	3. Online Query
RL training details?	2. Offline Construction
Why this design?	5. Key Design Decisions
How does it perform?	6. Evaluation
When should it NOT be used?	7. Limitations and Applicability