Weekly Reading Selections
2026-05-25 → 2026-05-31
Academic
Intern-Atlas
One-sentence positioning: Extracts “method entities + typed causal edges + bottleneck/mechanism evidence” from 1M+ AI papers to construct a queryable methodological evolution graph, serving as the underlying knowledge infrastructure for AI research agents.
Key innovation: Upgrades flat citation networks into a “method-method causal graph” where each causal edge is accompanied by verbatim citations and structured bottleneck/mechanism annotations; proposes the SGT-MCTS algorithm to reconstruct method evolution lineages on this graph, enabling idea evaluation and generation based on explicit structural evidence rather than LLM parametric memory.
0. Execution Overview
Offline Phase (graph construction)
├─ ① Paper corpus processing (1,030,314 papers)
├─ ② Method entity extraction + alias resolution
│  ├─ Seed method curation
│  ├─ LLM Proposer scanning expansion
│  └─ Alias registry A: 8,155 canonical / 9,545 aliases
├─ ③ Citation edge semantic classification (7 types)
│  ├─ strong-causal: extends / improves / replaces / adapts
│  ├─ non-strong: uses_component / compares
│  └─ non-causal: background
└─ ④ Causal edge evidence filling
   ├─ 14-category bottleneck taxonomy b_e
   ├─ Mechanism m_e / trade-off t_e
   ├─ LLM confidence c_e ∈ [0,1]
   └─ Verbatim citation grounding
        ↓
Online Phase (retrieval + generation)
├─ ⑤ Node matching (keywords + BM25)
├─ ⑥ Lineage Reconstruction: SGT-MCTS
│  ├─ Forward/backward dual trees
│  ├─ SGT-UCT selection = standard UCT + λ·graph-aware prior
│  ├─ Temporal coherence TC + edge confidence conf joint prior
│  └─ Branch discovery + Jaccard deduplication
├─ ⑦ Lineage Rerank (length + evidence strength + search consensus)
└─ ⑧ Generation layer
   ├─ Graph-Grounded Idea Evaluator (5-dimension scoring)
   └─ Graph-Grounded Idea Generator (4 strategy types + evidence certificate)
1. High-level Design (Indexing → Retrieval → Generation)
1.1 Indexing
| Dimension | Approach |
|---|---|
| Chunking strategy | Method-level entity extraction, replacing traditional paper/paragraph-level extraction |
| Index structure | Method evolution graph: method nodes + typed causal edges + structured evidence attributes |
| Knowledge representation | Directed causal network: nodes = methods/papers/stubs, edges = 7 semantic types (extends/improves/replaces/adapts/uses_component/compares/background), each edge carrying bottleneck/mechanism/trade-off/confidence/verbatim citation |
| Construction cost | High: 1M+ paper corpus processing, LLM extraction (Qwen3.6-35B-A3B), alias resolution, edge classification, evidence filling |
| Core characteristic | Method-level atomic units replace paper-level; each causal edge mandatorily carries verbatim citations and structured evidence |
1.2 Retrieval
| Dimension | Approach |
|---|---|
| Retrieval method | Graph traversal (SGT-MCTS searches evolution paths on the strong-causal subgraph) |
| Retrieval granularity | Method nodes + evolution paths (evolution chains) |
| Iteration strategy | Multi-hop (forward/backward exploration along strong-causal edges) + branch discovery restart |
| Query processing | Keyword matching (canonical/alias) + BM25 matching |
| Core characteristic | SGT-MCTS dual-tree search with graph-aware and time-aware priors; branch discovery prevents greedy collapse |
1.3 Generation
| Dimension | Approach |
|---|---|
| Context injection | Lineage paths injected into Idea Evaluator / Generator |
| Citation tracing | Verbatim citation grounding + evidence certificate (specific causal edge, bottleneck original text, unresolved explanation) |
| Quality control | Graph statistics replace LLM free-text judgment; evidence certificate prevents hallucination; verification failure triggers fallback |
| Core characteristic | Graph-grounded idea evaluation/generation; structural evidence replaces parametric memory |
2. Offline Construction: Indexing (Detailed Execution)
Step 2.1 Corpus Preparation
| Item | Description |
|---|---|
| Input | 1,030,314 AI papers (covering AI conferences, journals, arXiv preprints) |
| Operation | Collect and preprocess the paper corpus, construct a raw document repository |
| Output | Structured paper corpus |
Step 2.2 Method Entity Extraction and Alias Resolution
Step 2.2.1 Seed Method Construction
| Item | Description |
|---|---|
| Input | Human-curated list of well-known methods |
| Operation | Manually establish the initial method seed set |
| Output | Seed method set |
Step 2.2.2 Method Expansion
| Item | Description |
|---|---|
| Input | Seed method set + paper corpus |
| Operation | LLM Proposer scans the entire corpus, identifies and supplements additional candidate method entities |
| Output | Expanded candidate method set |
Step 2.2.3 Alias Registry Construction
| Item | Description |
|---|---|
| Input | Expanded method entity set |
| Operation | Build alias registry A: V_M → 2^Σ*, mapping each canonical method to a set of surface forms |
| Matching rules | Substring matching + case/punctuation normalization + word boundary enforcement (prevents “GPT” matching inside “lgpto”) + longest match priority (“GPT-4 Turbo” > “GPT-4” > “GPT”) |
| Version merging | Suffixes such as “-v2” and “-Large” are folded into the parent method unless an independent canonical node already exists |
| Ambiguity handling | Manually maintained negative-surface list (e.g., state space model “Mamba” vs. Python linter “Mamba”) |
| Scale | 8,155 canonical methods / 9,545 aliases |
| Output | Alias registry A |
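The matching rules above (normalization, word-boundary enforcement, longest-match priority) can be illustrated with a minimal sketch. The `ALIASES` registry, the function names, and the regex-based normalization are all hypothetical choices made for illustration; the notes do not record the actual implementation, and the negative-surface list is not modeled here.

```python
import re

# Hypothetical miniature registry: canonical method -> surface forms.
ALIASES = {
    "GPT": ["GPT"],
    "GPT-4": ["GPT-4", "GPT4"],
    "GPT-4 Turbo": ["GPT-4 Turbo"],
}

def normalize(text):
    """Lowercase and collapse punctuation/whitespace runs into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def build_patterns(aliases):
    """Compile one word-bounded pattern per surface form, longest form first."""
    pairs = [(normalize(s), canon) for canon, forms in aliases.items() for s in forms]
    pairs.sort(key=lambda p: len(p[0]), reverse=True)
    return [
        (re.compile(rf"(?<![a-z0-9]){re.escape(s)}(?![a-z0-9])"), canon)
        for s, canon in pairs
    ]

def match_methods(text, patterns):
    """Return canonical methods in order of appearance; longer matches win."""
    norm = normalize(text)
    taken = [False] * len(norm)  # characters already claimed by a longer match
    found = []
    for pattern, canon in patterns:
        for m in pattern.finditer(norm):
            if not any(taken[m.start():m.end()]):
                taken[m.start():m.end()] = [True] * (m.end() - m.start())
                found.append((m.start(), canon))
    return [canon for _, canon in sorted(found)]
```

Sorting surface forms by length before matching is one simple way to get longest-match priority: once “GPT-4 Turbo” has claimed a span, the shorter “GPT-4” and “GPT” patterns cannot claim it again.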
Step 2.3 Citation Edge Semantic Classification
Step 2.3.1 Phase 1: Edge Type Classification
| Item | Description |
|---|---|
| Input | Citation relationships between papers |
| Operation | Use Qwen3.6-35B-A3B to classify each citation edge into semantic types |
| Classification system | 7 types: strong-causal (extends / improves / replaces / adapts), non-strong (uses_component / compares), non-causal (background) |
| Accuracy | Production model 70.4%; audit model (Claude-Sonnet-4.6) 93.0% |
| Output | Semantically typed edges |
Step 2.3.2 Phase 2: Structured Record Completion
| Item | Description |
|---|---|
| Input | Non-background causal edges |
| Operation | Complete structured evidence records for each edge |
| Output | Causal edges carrying structured attributes |
Step 2.4 Causal Edge Evidence Filling
Step 2.4.1 14-Category Bottleneck Taxonomy
| Bottleneck Dimension | Operational Definition |
|---|---|
| computational complexity | asymptotic or wall-clock compute at fixed scale |
| memory efficiency | peak activation / parameter memory footprint |
| parallelization | degree of across-device or across-token parallelism |
| accuracy | task-level correctness or quality metric |
| generalization | out-of-distribution / cross-domain transfer |
| scalability | behavior as model / data / context size grows |
| data efficiency | sample complexity at fixed quality target |
| training stability | variance / divergence risk during optimization |
| inference speed | runtime latency or throughput |
| expressiveness | function class or representational capacity |
| simplicity | implementation, conceptual, or interface simplicity |
| robustness | behavior under perturbation or adversarial input |
| hyperparameter sensitivity | outcome variance w.r.t. hyperparameter choice |
| training complexity | engineering difficulty of the training recipe |
Step 2.4.2 Structured Evidence Quadruple
Each causal edge e carries a quadruple ρ(e) = (b_e, m_e, t_e, c_e), plus a verbatim excerpt for citation grounding:
| Attribute | Meaning | Key Design |
|---|---|---|
| b_e | Bottleneck addressed | 14-axis bottleneck taxonomy (fixed at publication time) |
| m_e | Mechanism employed | LLM-extracted structured field |
| t_e | Trade-off | Cost/limitation of the mechanism |
| c_e | LLM-reported confidence | ∈ [0,1], used later by SGT-MCTS |
| Citation grounding | Verbatim excerpt | All non-background edges mandatorily paired with a verbatim quote |
Step 2.4.3 Verbatim Citation Verification
| Item | Description |
|---|---|
| Input | LLM-extracted citations + original papers |
| Operation | Search Match + Symmetry Check |
| Purpose | Ensure citations actually exist in the original text, preventing hallucination |
| Output | Verified verbatim citations |
Step 2.4.4 Graph Scale
| Metric | Value |
|---|---|
| Papers | 1,030,314 |
| Method nodes | 8,155 canonical |
| Aliases | 9,545 |
| Semantically typed edges | 9,430,201 |

3. Online Query: Retrieval (Detailed Execution)
3.1 Retrieval Mode Overview
Intern-Atlas employs a single lineage reconstruction retrieval mode, with SGT-MCTS searching on the method evolution graph:
| Mode | Applicable Scenario | Core Mechanism | Characteristic |
|---|---|---|---|
| Lineage Reconstruction | Querying a method’s evolution history | SGT-MCTS forward/backward dual-tree search on the strong-causal subgraph | Graph-aware + time-aware; branch discovery prevents greedy collapse |
3.2 Retrieval Procedure
Step 3.1: Node Matching
| Item | Description |
|---|---|
| Input | User query q |
| Operation | Parse the query, construct seed method set S(q) |
| Matching method 1 | Vocabulary keyword matching (exact hit on canonical / alias) |
| Matching method 2 | BM25 matching (handles semantically ambiguous queries) |
| Output | Seed method set S(q) ⊆ C_q |
Step 3.2: SGT-MCTS Lineage Reconstruction
| Item | Description |
|---|---|
| Input | Seed method set S(q) |
| Operation | On the strong-causal subgraph (V_M, ε_sc), construct the directed evolution path set Π_q along publication chronology |
| Why not standard MCTS? | Central nodes have extremely high branching factors, so standard UCT easily gets trapped in high-visit branches; standard UCT is also blind to graph structure and temporal direction, producing implausible paths where “ancestors come from descendants” |
Step 3.2.1: Dual-tree Construction
| Item | Description |
|---|---|
| Operation | Each seed simultaneously constructs forward + backward trees |
| Forward tree | Searches for predecessor methods |
| Backward tree | Searches for successor developments |
| Key constraint | Operates only on the strong-causal edge subgraph |
Step 3.2.2: SGT-UCT Selection
\[\text{SGT-UCT}(u, v) = \underbrace{\text{UCT}(u, v)}_{\text{standard exploration-exploitation}} + \lambda \cdot \underbrace{\alpha_G(u, v)}_{\text{graph-aware prior}}\]
\[\alpha_G(u, v) = \underbrace{\text{conf}(e_{u \to v})}_{\text{edge confidence}} \cdot \underbrace{\text{TC}(\Delta\tau_{uv})}_{\text{temporal coherence}}\]
| Component | Source |
|---|---|
| conf(e) | Edge confidence c_e reported by the LLM during graph construction |
| TC(Δτ) | Hand-specified piecewise function scoring the publication-year difference |
\[\text{TC}(\Delta\tau) =
\begin{cases}
0.40 & -1 \leq \Delta\tau < 0 \quad \text{(slight temporal overlap, e.g., preprints)} \\
0.85 & \Delta\tau = 0 \quad \text{(same year)} \\
1.00 & 1 \leq \Delta\tau \leq 3 \quad \text{(optimal: 1-3 year natural evolution)} \\
0.80 & 4 \leq \Delta\tau \leq 6 \quad \text{(slightly distant but still reasonable)} \\
\max(0.30,\ 1.00 - 0.08(\Delta\tau - 6)) & \Delta\tau > 6 \\
0.70 & \tau \text{ missing}
\end{cases}\]
Calibration range limitation: TC is only calibrated on post-2015 AI literature; domains with different research paces require recalibration.
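The selection rule and the piecewise TC function above can be written out as a short sketch. The exploration constant `c`, the weight `lam`, the `+1` smoothing inside the UCT term, and the treatment of Δτ < -1 (returned as 0.0 here, since the formula leaves it undefined and such nodes are hard-filtered during expansion) are assumptions made for illustration; integer year gaps are also assumed.

```python
import math

def temporal_coherence(delta_tau):
    """Piecewise TC score; delta_tau is an integer year gap, or None if missing."""
    if delta_tau is None:
        return 0.70
    if delta_tau < -1:
        return 0.0  # assumption: range not covered by the formula
    if delta_tau < 0:
        return 0.40
    if delta_tau == 0:
        return 0.85
    if delta_tau <= 3:
        return 1.00
    if delta_tau <= 6:
        return 0.80
    return max(0.30, 1.00 - 0.08 * (delta_tau - 6))

def sgt_uct(q_value, n_parent, n_child, edge_conf, delta_tau, c=1.4, lam=0.5):
    """Standard UCT plus lam times the graph-aware prior conf(e) * TC(delta_tau)."""
    exploit = q_value / n_child if n_child else 0.0
    explore = c * math.sqrt(math.log(n_parent + 1) / (n_child + 1))
    prior = edge_conf * temporal_coherence(delta_tau)
    return exploit + explore + lam * prior
```

With identical visit statistics, a child reached through a high-confidence edge with a 1-3 year gap scores higher than one reached through an edge spanning a much longer gap, which is how the prior steers selection toward temporally plausible steps.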
Step 3.2.3: Expansion (Confidence-prioritized)
| Item | Description |
|---|---|
| Loop exclusion | Discard nodes that would create cycles in the path |
| Hard temporal filtering | Discard nodes with reversed temporal direction (prevents the “descendants producing ancestors” paradox) |
| Expansion strategy | Select the unexplored child node with the highest confidence (no random expansion) |
Step 3.2.4: Rollout (Greedy)
| Item | Description |
|---|---|
| Strategy | Greedy rollout, no Monte Carlo random sampling |
| Selection | Directly select the child node with the highest confidence, avoiding the introduction of noise |
| Termination conditions | Leaf node with no outgoing edges / cycle detected / maximum depth reached |
| Scoring | R(π): the paper does not give its specific form |
Step 3.2.5: Backpropagation
| Item | Description |
|---|---|
| Operation | Backpropagate to update cumulative reward and visit count |
| Special penalty | Additional score deduction for ancestors of non-expandable leaf nodes → reduces dead ends |
Step 3.2.6: Path Stitching and Deduplication
| Item | Description |
|---|---|
| Operation | Take the top-5 paths by cumulative reward from each of the forward and backward trees → stitch them through the seed node |
| Deduplication | Jaccard similarity threshold of 0.8 determines path homogeneity; among homogeneous paths, keep only the higher-ranked one |
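A minimal sketch of the Jaccard deduplication step follows. It assumes similarity is computed over the sets of nodes on each path and that a pair counts as homogeneous when similarity is at least 0.8; the notes state only the threshold, so both choices are assumptions.

```python
def jaccard(path_a, path_b):
    """Jaccard similarity between the node sets of two paths."""
    a, b = set(path_a), set(path_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def dedup_paths(ranked_paths, threshold=0.8):
    """Keep a path only if no higher-ranked kept path is homogeneous with it.

    ranked_paths is assumed to be sorted best-first.
    """
    kept = []
    for path in ranked_paths:
        if all(jaccard(path, other) < threshold for other in kept):
            kept.append(path)
    return kept
```

Because the input is processed best-first, the survivor of any homogeneous group is automatically the higher-ranked path.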
Step 3.2.7: Branch Discovery
| Item | Description |
|---|---|
| Problem | Addresses greedy collapse preventing discovery of branch paths |
| Branch node definition | Node has at least 2 strong-causal child nodes in the search direction, but the final path of the current lineage includes only 1 |
| Restart search | Restart the algorithm from the branch node |
| Constraint | Edges already in the main lineage are forcibly blocked; search budget halved |
| Output merge | Aggregate main path + branch paths |
Step 3.3: Lineage Rerank
| Item | Description |
|---|---|
| Input | Candidate evolution path π |
| Ranking formula | rank(π) = w_ℓ · \|π\| / L_max + w_c · conf̄(π) + w_m · N̄(π) |
| First term | Rewards long paths (sufficient evolution) |
| Second term | Path average evidence strength |
| Third term | Path node average visit count (search consensus) |
| Output | Re-ranked evolution paths |
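The three-term ranking formula can be sketched as a small function. The weight values, the parameter names, and the normalization of the average visit count by `max_visits` (so that all three terms lie in [0, 1]) are illustrative assumptions; the notes do not record the actual weights or normalization.

```python
def rank_score(n_nodes, edge_confs, node_visits, l_max, max_visits,
               w_len=0.2, w_conf=0.5, w_visit=0.3):
    """rank(pi) = w_len * |pi| / L_max + w_conf * mean conf + w_visit * mean visits."""
    mean_conf = sum(edge_confs) / len(edge_confs)
    # Assumption: visits are normalized by the largest visit count observed.
    mean_visits = sum(node_visits) / len(node_visits) / max_visits
    return w_len * n_nodes / l_max + w_conf * mean_conf + w_visit * mean_visits
```

A path that is as long as L_max, has fully confident edges, and was visited as often as the most-visited node would score the sum of the three weights.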
3.3 Retrieval Algorithm Comparison
| Method | NR | ER | CAS |
|---|---|---|---|
| Beam@1 | 41.0 | 18.6 | 41.0 |
| Beam@5 | 43.4 | 21.6 | 43.4 |
| Beam@10 | 44.9 | 23.2 | 44.9 |
| RW@5 | 28.1 | 0.7 | 28.1 |
| SGT-MCTS | 84.8 | 79.0 | 84.8 |

4. Online Generation: Generation (Detailed Execution)
4.1 Generation Procedure
Input: query q, retrieved evolution paths Π_q
              │
              ▼
┌─────────────────────────────┐
│ Parse methods mentioned     │
│ in idea d, map to nodes M_d │
│ via lookup table A          │
└─────────────┬───────────────┘
              │
     ┌────────┴────────┐
     ▼                 ▼
 Evaluator         Generator
 (Evaluate)        (Generate)
     │                 │
     ▼                 ▼
5-dim scoring   4 strategy types
+cross penalty  +evidence certificate
Step 4.1: Graph-Grounded Idea Evaluator
Step 4.1.1: Motivation
Free-text LLM judges prefer stacking popular methods; novelty scores are negatively correlated with actual scientific impact; graph statistics can directly provide structural evidence.
Step 4.1.2: Per-dimension Scoring
| Item | Description |
|---|---|
| Input | Methods mentioned in idea d → resolved to nodes M_d ⊆ V_M via lookup table A |
| Operation | Independently evaluate 5 dimensions based on graph statistics |
| Dimensions | Novelty (N), Feasibility (F), Significance (S), Validity (V), Clarity (C) |
| Calculation | Each dimension score is computed directly from graph statistics of M_d’s position/connectivity structure in retrieval context C_d |
Formula:
\[s_k(d, G) = \text{clip}_{[1,10]}\left(b_k + \sum_j w_j^{(k)} \cdot \phi_j^{(k)}(M_d, C_d)\right), \quad k \in \{N, F, S, V, C\}\]
Step 4.1.3: Cross-dimensional Aggregation
| Item | Description |
|---|---|
| Operation | Fixed weight vector w + 4 hand-crafted conjunctive penalties Ω_cross |
| Representative penalty | Strong deduction when Novelty is high and Feasibility is low: “novel but infeasible” typically indicates a flawed core proposal |
Formula:
\[s^*(d, G) = \text{clip}_{[1,10]}\left(\mathbf{w}^\top \mathbf{s} + \Omega_{\text{cross}}(\mathbf{s})\right), \quad \mathbf{s} = (s_N, s_F, s_S, s_V, s_C)\]
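The two formulas (per-dimension clipped linear scores, then a weighted sum plus conjunctive penalties) can be sketched as follows. The penalty thresholds, the penalty magnitude, and every numeric weight here are hypothetical; the notes record the functional form only.

```python
def clip(x, lo=1.0, hi=10.0):
    """Clamp a score into the [1, 10] range used by both formulas."""
    return max(lo, min(hi, x))

def dimension_score(bias, weights, features):
    """s_k = clip(b_k + sum_j w_j * phi_j) for one dimension k."""
    return clip(bias + sum(w * f for w, f in zip(weights, features)))

def novel_but_infeasible(s):
    """Hypothetical conjunctive penalty: high Novelty together with low Feasibility."""
    return -2.0 if s["N"] >= 8 and s["F"] <= 4 else 0.0

def overall_score(scores, weights, penalties):
    """s* = clip(w^T s + Omega_cross(s)), with Omega_cross as a sum of penalties."""
    base = sum(weights[k] * scores[k] for k in scores)
    return clip(base + sum(p(scores) for p in penalties))
```

For example, with equal weights of 0.2, an idea scoring N=9 and F=3 loses two points relative to its plain weighted average, which is the “novel but infeasible” deduction in miniature.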
Step 4.1.4: Idea Evaluation Results
| Publication tier | Overall Score |
|---|---|
| Top-tier | 8.48 ± 1 s.d. |
| Core | 7.83 ± 1 s.d. |
| Workshop | 6.85 ± 1 s.d. |
| Rejected | 5.84 ± 1 s.d. |

Step 4.2: Graph-Grounded Idea Generator
Step 4.2.1: Structural Gap Pattern → Generation Strategy Mapping
| Gap Pattern | Corresponding Generation Strategy | Description |
|---|---|---|
| Open axes | Bottleneck Resolution | Address identified but unresolved bottlenecks in the graph |
| Recent improvement direction | Trend Extrapolation | Extrapolate along recent improvement directions |
| Sacrifice axes | Cross-pollination | Cross-domain/cross-method combination, leveraging trade-offs between different methods |
| Disconnected pairs | Paradigm Challenge | Challenge existing paradigms, connecting disconnected method pairs in the graph |
Step 4.2.2: Evidence Certificate (Core Anti-hallucination Mechanism)
| Field | Content |
|---|---|
| Certificate tuple | (specific causal edge, bottleneck original text in the graph, explanation of why this bottleneck remains unresolved) |
| Verification | Exact match of the bottleneck original text in the certificate against actual graph content |
| Failure handling | Discard result, fallback |
| Design goal | Ensure ideas are traceable without over-constraining generation |
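The certificate check described above amounts to an exact-match lookup, which the following sketch illustrates. The `Certificate` dataclass, the dictionary representation of the graph, and the shape of the fallback are hypothetical; the notes say only that a failed verification discards the result and falls back.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Certificate:
    edge: tuple             # (source_method, target_method), hypothetical encoding
    bottleneck_text: str    # bottleneck text the generator claims is in the graph
    unresolved_reason: str  # why the bottleneck is still considered open

def verify(cert, graph_bottlenecks):
    """Exact match of the claimed bottleneck text against the stored graph text."""
    return graph_bottlenecks.get(cert.edge) == cert.bottleneck_text

def first_verified_idea(candidates, graph_bottlenecks, fallback=None):
    """Return the first idea whose certificate verifies; otherwise the fallback."""
    for idea, cert in candidates:
        if verify(cert, graph_bottlenecks):
            return idea
    return fallback
```

Exact string matching is deliberately strict: a paraphrased bottleneck fails verification, which keeps each generated idea tied to text that really exists in the graph.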
5. Key Design Decisions
| Decision Point | Intern-Atlas’s Choice | Alternative | Rationale |
|---|---|---|---|
| Graph atomic unit | Method entities (method-level) | Papers (paper-level) | Notes explicitly document: paper-level citations cannot distinguish extends/improves/replaces/compares/background |
| Causal edge evidence | Mandatory verbatim citations + structured bottleneck/mechanism | Edge type labels only | Provides grounded evidence, supports certificate verification for idea generation |
| Lineage search algorithm | SGT-MCTS (graph-aware + time-aware) | Beam search / standard MCTS / Random Walk | Notes explicitly document: high branching at central nodes causes beam collapse; standard MCTS does not perceive publication year direction; Random Walk performs extremely poorly (ER only 0.7) |
| Rollout strategy | Greedy (highest confidence) | MC random sampling | Random sampling introduces LLM extraction noise |
| Dead-end handling | Backpropagation additional penalty + branch discovery restart | Pure reliance on UCT exploration term | Explicitly suppresses greedy collapse, proactively recalls ignored branches |
| Idea evaluation | Per-dimension graph statistics + hand-crafted conjunctive penalties | Free-text LLM judge | The latter’s novelty score is negatively correlated with actual impact |
| Idea generation | Structural gap patterns + evidence certificate | Direct LLM free generation | Forces grounding to specific causal edges and bottlenecks, avoids method name stacking |
6. Evaluation
6.1 Graph Construction Quality
Benchmark: 30 high-impact surveys → 30 method evolution graphs, containing 2,268 nodes / 1,462 edges / 133 evolution chains.
| Metric | Meaning | Result |
|---|---|---|
| NMR (Node Match Ratio) | Node match rate | 91.0% |
| ERR (Edge Reachable Ratio) | Edge reachability rate | 89.7% |
| PSC (Path Semantic Correctness) | Path semantic correctness | 92.0% |
| NR (Node Recall) | Node recall | 84.8% (SGT-MCTS) |
| ER (Edge Recall) | Edge recall | 79.0% (SGT-MCTS) |
| CAS (Chain Alignment Score) | Lineage chain alignment score | 84.8% (SGT-MCTS) |

6.2 Idea Evaluator (Strata Dataset)
| Category | Paper Count | Overall Score |
|---|---|---|
| Top-tier (ICLR 2026, ICML 2025, NeurIPS 2025) | 300 | 8.48 |
| Core (AAAI 2026, IJCAI 2025) | 300 | 7.83 |
| Workshop (ICLR 2026) | 300 | 6.85 |
| Rejected (ICLR 2026) | 300 | 5.84 |
Evaluation method: Publication tier verification (across publication strata) + human evaluation.
6.3 Idea Generator Comparative Experiments
| Baseline | Retrieval/Knowledge Source |
|---|---|
| Direct LLM generation | No external retrieval |
| External search | OpenAlex / Semantic Scholar |
| Local RAG | BM25 |
| Intern-Atlas | Method evolution graph + evidence certificate |
6.4 SGT-MCTS Retrieval Algorithm Comparison
| Method | NR | ER | CAS |
|---|---|---|---|
| Beam@1 | 41.0 | 18.6 | 41.0 |
| Beam@5 | 43.4 | 21.6 | 43.4 |
| Beam@10 | 44.9 | 23.2 | 44.9 |
| RW@5 | 28.1 | 0.7 | 28.1 |
| SGT-MCTS | 84.8 | 79.0 | 84.8 |

7. Limitations and Applicability
| Limitation | Specific Manifestation | Mitigation |
|---|---|---|
| Edge type classification accuracy | Phase-1 production model 70.4% / audit model 93.0% | Key decisions can undergo secondary audit; improve production model |
| Bottleneck taxonomy rigidity | 14-axis taxonomy D fixed at publication time; new dimensions mapped to nearest existing axis | Await future taxonomy revision |
| Alias resolution coverage | Substring matching favors precision over recall; ambiguity handled via manual negative list | Biased toward high-quality nodes |
| Temporal coherence calibration range | TC calibrated on post-2015 AI literature | Domains with a different research pace require recalibration |
| Rollout scoring opacity | The paper does not give the specific form of R(π) | Reproduction requires custom design |
Best Applicable Scenarios
- AI agents performing idea generation/evaluation that require structured method evolution priors
- Traceable scientific idea generation (evidence certificate prevents hallucination)
- Method lineage research / survey automation within AI subfields
- Evaluating ideas while avoiding the “popular method stacking” bias of LLM free-text judges
Unsuitable Scenarios
- Disciplines with research paces significantly different from AI (TC calibration fails)
- Extremely niche subfields without human survey references (limited graph coverage)
- Research focused on paper-level citation network properties (e.g., PageRank, influence propagation)
8. Quick Reference
Search-R1
One-sentence positioning: A framework that trains LLMs via reinforcement learning to autonomously generate multi-turn search queries during step-by-step reasoning and perform real-time retrieval, addressing the suboptimal performance of prompt-engineering-driven search through retrieval result loss masking and outcome-oriented rewards.
Key innovation: Unlike prompt-driven search methods, Search-R1 models the search engine as part of the environment and trains the LLM via RL to autonomously learn the optimal search interaction strategy; simultaneously introduces retrieval token masking to prevent gradient updates on retrieved content, ensuring training stability.
0. Execution Overview
Training Phase
├─ ① Preparation: search engine as environment, policy model π_θ, reference model π_ref
├─ ② Rollout: model generates reasoning sequences, interleaving token generation with search engine retrieval
│  ├─ Encountering uncertain knowledge → generate <search>... </search> query
│  ├─ Search engine returns documents → wrap as <information>... </information>
│  ├─ Reasoning process → wrap as <think>... </think>
│  └─ Final answer → wrap as <answer>... </answer>
├─ ③ Reward computation: outcome-oriented reward r_φ(x, y) = EM(a_pred, a_gold)
├─ ④ Loss Masking: policy gradient computed only on LLM-generated tokens; retrieved content excluded from optimization
└─ ⑤ Policy update: PPO or GRPO updates the policy model
Inference Phase (per query)
├─ ① Input question q
├─ ② Iterative generation: text generation ↔ search query alternation
│  ├─ Generate response tokens
│  ├─ Detect <search> / </answer> / <eos> → determine next action
│  ├─ Search query → retrieve documents → inject into reasoning chain
│  └─ Continue generation
├─ ③ Termination condition: maximum action count B reached or <answer> generated
└─ ④ Output final answer
1. High-level Design (Indexing → Retrieval → Generation)
1.1 Indexing
| Dimension | Approach |
|---|---|
| Chunking strategy | — |
| Index structure | — (relies on external search engine; offline index construction not explicitly recorded in notes) |
| Knowledge representation | — |
| Construction cost | — |
| Core characteristic | Notes do not explicitly record index construction; the offline phase focuses on model training rather than knowledge base construction |
1.2 Retrieval
| Dimension | Approach |
|---|---|
| Retrieval method | Dynamic search triggering (model learns via RL to autonomously decide when to retrieve) |
| Retrieval granularity | Documents (external documents returned by the search engine) |
| Iteration strategy | Multi-turn iteration (search may be triggered multiple times during reasoning until termination conditions are met) |
| Query processing | Model autonomously generates search queries based on current reasoning state |
| Core characteristic | RL-driven agentic search: model learns the optimal search interaction strategy rather than relying on manual prompt engineering |
1.3 Generation
| Dimension | Approach |
|---|---|
| Context injection | Retrieved documents injected into the reasoning chain via `<information>` tags |
| Citation tracing | — |
| Quality control | Outcome-oriented reward (EM matching) drives generation quality; retrieval token masking ensures training stability |
| Core characteristic | Structured token system (`<search>` / `<information>` / `<think>` / `<answer>`); reasoning and retrieval proceed in an interleaved manner |
2. Offline Construction: RL Training (Detailed Execution)
The core of Search-R1’s offline phase is RL training rather than traditional knowledge base index construction. The system relies on external search engines (e.g., Bing) for document retrieval, and the notes do not explicitly record traditional index construction steps.
Step 2.1 Search Engine Environment Modeling
| Item | Description |
|---|---|
| Input | Policy LLM π_θ, search engine R |
| Operation | Model the search engine as part of the environment; sampled trajectory sequences interleave LLM token generation with search engine retrieval |
| Key decision | Unlike previous methods that rely solely on the policy LLM for rollout generation, explicitly introduces retrieval-interleaved reasoning: π_θ(· \| x; R) |
| Output | Rollout environment with interleaved retrieval and reasoning |
Step 2.2 Structured Token System Definition
| Item | Description |
|---|---|
| Operation | Define four types of special tokens to structure the interaction |
| Key decision | Uses special tokens rather than free text, facilitating rule-based parsing and training control |
| Output | Token system: `<search>` / `</search>`, `<information>` / `</information>`, `<think>` / `</think>`, `<answer>` / `</answer>` |
Step 2.3 Rollout Sequence Generation
| Item | Description |
|---|---|
| Input | Query q, policy model π_θ, search engine R |
| Operation | Model generates response tokens y_t ~ π_θ(· \| x, y_{<t}; R), which are appended to the rollout sequence |
| Key decision | The sequence contains two types of tokens: LLM-generated tokens and retrieved tokens |
| Output | Complete rollout sequence y |
Step 2.4 Retrieval Token Masking (Loss Masking)
| Item | Description |
|---|---|
| Problem | Optimizing retrieved tokens in the same way as generated tokens leads to unintended learning dynamics |
| Operation | Introduce a loss mask ensuring the policy gradient objective is computed only on LLM-generated tokens |
| Key decision | Indicator function I(y_t): model-generated token = 1, retrieved content (within `<information>` ... `</information>`) = 0 |
| Output | Masked policy gradient computation |
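The indicator function I(y_t) can be illustrated on a token list. This sketch assumes whitespace-level tokens and assumes the `<information>` tags themselves are masked along with their contents (they are inserted by the environment, not generated by the policy); neither detail is recorded in the notes, and `loss_mask` and `masked_mean` are hypothetical helper names.

```python
def loss_mask(tokens, open_tag="<information>", close_tag="</information>"):
    """Return 1 for LLM-generated tokens and 0 for retrieved content."""
    mask, inside = [], False
    for tok in tokens:
        if tok == open_tag:
            inside = True
        mask.append(0 if inside else 1)
        if tok == close_tag:
            inside = False
    return mask

def masked_mean(per_token_loss, mask):
    """Average the per-token loss over unmasked (LLM-generated) positions only."""
    total = sum(l * m for l, m in zip(per_token_loss, mask))
    return total / max(1, sum(mask))
```

In a real training loop the same 0/1 vector would multiply the per-token policy-gradient terms, so retrieved passages contribute nothing to the update.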
Step 2.5 Reward Computation
| Item | Description |
|---|---|
| Input | Model’s final answer a_pred, ground-truth answer a_gold |
| Operation | Apply a rule-based outcome-oriented reward: r_φ(x, y) = EM(a_pred, a_gold) |
| Key decision | Does not use format reward (model already demonstrates strong structural compliance); does not train a neural reward model (avoids sensitivity and additional cost) |
| Output | Reward value |
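A rule-based EM reward of this kind might look like the sketch below. The answer normalization (lowercasing, stripping punctuation and English articles) follows a common open-domain QA convention and is an assumption here, as is taking the last `<answer>` span and accepting a list of gold answers.

```python
import re
import string

def normalize_answer(s):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def extract_answer(rollout):
    """Return the content of the last <answer>...</answer> span, if any."""
    spans = re.findall(r"<answer>(.*?)</answer>", rollout, flags=re.DOTALL)
    return spans[-1].strip() if spans else None

def em_reward(rollout, gold_answers):
    """1.0 if the extracted answer exactly matches any gold answer, else 0.0."""
    pred = extract_answer(rollout)
    if pred is None:
        return 0.0
    return float(any(normalize_answer(pred) == normalize_answer(g) for g in gold_answers))
```

Because the reward depends only on the final answer span, nothing in it scores the intermediate reasoning or the search queries directly.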
Step 2.6 Policy Update (PPO / GRPO)
| Item | Description |
|---|---|
| PPO | Actor-critic method; Actor generates answers, Critic value network estimates advantage, advantage function uses GAE |
| GRPO | Group-relative advantage estimation, no Critic network needed; samples G outputs per group to compute group baseline |
| Key decision | Both are compatible; GRPO converges faster, PPO trains more stably, final rewards are comparable |
| Output | Updated policy model parameters |
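The group baseline used by GRPO can be shown with a few lines of arithmetic. This sketch standardizes each reward against the mean and population standard deviation of its group; the `eps` term and the choice of population (rather than sample) standard deviation are assumptions, and the function name is hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each of the G sampled outputs relative to its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With an EM reward, a group in which every sample is right (or every sample is wrong) yields all-zero advantages, so only groups with mixed outcomes produce a learning signal; no critic network is involved at any point.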
3. Online Query: Retrieval (Detailed Execution)
3.1 Retrieval Mode Overview
Search-R1 employs a single RL-driven search mode, autonomously triggered by the trained model during inference:
| Mode | Applicable Scenario | Core Mechanism | Characteristic |
|---|---|---|---|
| RL-driven Search | Knowledge uncertainty encountered during reasoning | Model autonomously generates queries → retrieves → injects | Learned, not manually predefined strategy |
3.2 Retrieval Procedure

Step 3.1: Reasoning and Search Alternation
| Item | Description |
|---|---|
| Input | Query q, current reasoning chain state |
| Operation | LLM alternates between text generation and external search engine queries |
| Key decision | Iterative framework: generates at each step until `<search>`, `</answer>`, or `<eos>` is detected |
| Output | Response token sequence |
Step 3.2: Search Query Detection and Execution
| Item | Description |
|---|---|
| Input | Query wrapped in `<search>` ... `</search>` tags |
| Operation | Extract the search query from the rollout sequence, invoke the search engine to retrieve documents D |
| Output | Retrieved documents, wrapped as `<information>` ... `</information>` and injected into the sequence |
Step 3.3: Iterative Termination Judgment
| Item | Description |
|---|---|
| Termination condition 1 | Maximum action count B reached |
| Termination condition 2 | Model generates a final response wrapped in `<answer>` ... `</answer>` tags |
| Fallback strategy | If neither a search query nor an answer is detected, the message "My action is not correct. Let me rethink." is appended to the sequence and iteration continues |
4. Online Generation: Generation (Detailed Execution)
4.1 Reasoning Procedure
Input question q
                │
                ▼
┌────────────────────────────────┐
│ Initialize: action count b ← 0 │
│ Initialize: response y ← ∅     │
└───────────────┬────────────────┘
                │
                ▼
        ┌───────┴───────┐
        │  while b < B  │
        └───────┬───────┘
                │
                ▼
   ┌──────────────────────────┐
   │ Generate response tokens │
   │ until termination signal │
   └────────────┬─────────────┘
                │
           ┌────┴────┐
           ▼         ▼
       <search>   <answer>
           │         │
           ▼         ▼
       Retrieve   Return y
       documents
       and inject
           │
           ▼
       b ← b + 1
           │
           └──────→ Continue generating
Step 4.1: Online Reasoning Sequence Generation
| Item | Description |
|---|---|
| Input | Query q, trained policy model π_θ, search engine R |
| Operation | Model autoregressively generates response tokens; pauses upon encountering a `<search>` query to trigger retrieval, then continues generation after injecting the results |
| Key decision | The reasoning process is structurally consistent with the training rollout, but no gradient updates are computed |
| Output | Final response sequence y, containing the answer wrapped in `<answer>` ... `</answer>` tags |
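The inference loop can be sketched with the model and search engine passed in as plain callables. Here `generate`, `search`, and `max_actions` are hypothetical stand-ins (for π_θ, R, and the budget B), and counting every loop iteration as one action is an assumption; this is a control-flow illustration, not the authors' code.

```python
import re

def run_episode(question, generate, search, max_actions=4):
    """Interleave generation and retrieval until an answer or the budget is hit.

    generate(context) -> next text chunk, assumed to stop at a search query,
                         an answer, or end of sequence.
    search(query)     -> retrieved text for the query.
    """
    y = ""
    for _ in range(max_actions):
        chunk = generate(question + y)
        y += chunk
        answer = re.search(r"<answer>(.*?)</answer>", chunk, re.DOTALL)
        if answer:
            return y, answer.group(1).strip()
        query = re.search(r"<search>(.*?)</search>", chunk, re.DOTALL)
        if query:
            docs = search(query.group(1).strip())
            y += f"<information>{docs}</information>"
        else:
            # Neither a query nor an answer: nudge the model to retry.
            y += "My action is not correct. Let me rethink."
    return y, None
```

Passing the model and retriever in as functions keeps the sketch testable with scripted stubs, and the same loop shape serves both training rollouts and inference.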
5. Key Design Decisions
| Decision Point | Search-R1’s Choice | Alternative | Rationale |
|---|---|---|---|
| Training paradigm | Reinforcement learning (PPO / GRPO) | Prompt engineering / SFT / Rejection sampling | Explicitly stated in notes: prompting advanced LLMs to use search engines is often suboptimal; LLMs may not fully possess the ability to optimally interact with search engines |
| Retrieval triggering | Model learns autonomously (`<search>` token) | Fixed retrieval strategy / manually predefined trigger conditions | Explicitly stated in notes: RL training enables the model to autonomously learn when search is needed |
| Reward design | Outcome-oriented reward (EM matching) | Process reward / neural reward model / format reward | Explicitly stated in notes: uses simple outcome reward to avoid complexity; does not train a neural reward model to avoid sensitivity and cost |
| Training stability | Retrieval token masking | No special handling of retrieved content | Explicitly stated in notes: optimizing retrieved tokens in the same way as generated tokens leads to unintended learning dynamics |
| RL algorithm | Supports both PPO and GRPO | Single RL algorithm | Explicitly stated in notes: both algorithms are compatible, providing empirical comparison |
6. Evaluation
6.1 Evaluation Metrics
| Metric | Meaning | This System vs. Baseline |
|---|---|---|
| EM | Exact Match | Qwen2.5-7B +41% over RAG baselines; Qwen2.5-3B +20% over RAG baselines |
6.2 RL Training Comparison
| Condition | Description |
|---|---|
| PPO | Actor-critic RL; higher training stability |
| GRPO | Group Relative Policy Optimization; faster convergence |
| Conclusion | GRPO converges faster, PPO is more stable, final training rewards are comparable |
6.3 Experimental Setup
| Condition | Description |
|---|---|
| Search-R1 | Full system (RL training + retrieval masking + outcome reward) |
| CoT | Chain-of-thought baseline |
| vanilla RAG | Standard retrieval-augmented generation baseline |
| IRCoT | Iterative retrieval chain-of-thought baseline |
| Search-o1 | Inference-time search augmentation baseline |
| R1 | DeepSeek-R1 baseline |
| SFT | Supervised fine-tuning baseline |
| Rejection Sampling | Rejection sampling baseline |
6.4 Datasets
| Dataset | Description |
|---|---|
| Natural Questions (NQ) | Open-domain QA benchmark |
| TriviaQA | Open-domain QA benchmark |
| PopQA | Knowledge-intensive QA benchmark |
| HotpotQA | Multi-hop QA benchmark |
| 2WikiMultiHopQA | Multi-hop QA benchmark |
| MuSiQue | Multi-hop QA benchmark |
| Bamboogle | Multi-hop QA benchmark |
7. Limitations and Applicability
The notes do not explicitly record Search-R1’s limitations.
Best Applicable Scenarios
- Reasoning tasks requiring efficient acquisition of external knowledge and up-to-date information
- Open-domain QA (NQ, TriviaQA, etc.)
- Multi-hop QA (HotpotQA, 2WikiMultiHopQA, MuSiQue, etc.)
- Knowledge-intensive reasoning tasks
Unsuitable Scenarios
8. Quick Reference