Weekly Reading Picks
2026-05-11 → 2026-05-17
Contents
Academic
PaSa
One-sentence positioning: An LLM-powered academic literature search agent that autonomously invokes search tools, reads papers, and expands citations through a Crawler-Selector dual-component architecture, optimized via reinforcement learning for complex academic queries.
Key innovation: Models academic search as an agentic decision process in which the Crawler maximizes recall across citation networks and the Selector precisely determines relevance. Both components are optimized within the AGILE RL framework, and a dedicated synthetic dataset (AutoScholarQuery) and a real-world benchmark (RealScholarQuery) were constructed for training and evaluation.
0. Execution Overview
Offline Phase (one-time training)
├─ ① Dataset preparation
│  ├─ AutoScholarQuery: synthetic academic query-paper pairs (35k), generated by GPT-4o from Related Work
│  └─ RealScholarQuery: real-world academic query benchmark
├─ ② Selector training (SFT)
│  ├─ Input: academic query + paper
│  ├─ Output: decision token (True/False) + supporting rationale
│  └─ Decision token placed before rationale, enabling it to serve as a single-token auxiliary reward model for the Crawler
└─ ③ Crawler training (SFT → RL)
   ├─ SFT: generate trajectories session by session (Search session / Expand session)
   └─ RL (PPO): token-level MDP, each session independently optimized
      ├─ Action space: [Search] generate query and retrieve / [Expand] extract citations / [Stop] switch to next paper
      ├─ Reward function: α × Σ_i I(q, p_i, t) − c(a_t)
      └─ Auxiliary reward: Selector as reward model to mitigate sparse rewards
Online Phase (per query)
├─ ① Input academic query q
├─ ② Crawler iterative execution
│  ├─ [Search] → generate search query → invoke retrieval tool → add results to paper queue
│  ├─ [Expand] → extract sub-section citations from current paper → add to paper queue
│  ├─ [Stop] → switch to next paper in queue
│  └─ Navigate through citation network, continuously discovering more relevant papers
├─ ③ Selector per-paper judgment
│  ├─ Input: query q + each paper in paper queue
│  ├─ Output: True/False judgment + rationale
│  └─ Decision token probability used for ranking search results
└─ ④ Return final retrieval result set
1. High-level Design (Indexing → Retrieval → Generation)
1.1 Indexing
| Dimension | Approach |
| --- | --- |
| Chunking strategy | — |
| Index structure | — (relies on external search engine; notes do not explicitly document offline index construction) |
| Knowledge representation | — |
| Construction cost | — |
| Core characteristic | Notes do not document index construction; offline phase focuses on model training and dataset construction |
1.2 Retrieval
| Dimension | Approach |
| --- | --- |
| Retrieval method | Agentic dynamic search (Crawler autonomously decides Search / Expand / Stop) |
| Retrieval granularity | Paper level (collects entire papers through citation networks) |
| Iteration strategy | Multi-round iteration (Crawler continuously navigates citation network, expanding paper queue) |
| Query processing | Model automatically generates search queries based on current context and query |
| Core characteristic | Crawler-Selector dual-component: Crawler maximizes recall, Selector maximizes precision; RL optimization teaches the agent optimal search strategies |
1.3 Generation
| Dimension | Approach |
| --- | --- |
| Context injection | Selector matches retrieved papers with query, generating True/False judgments and rationales |
| Citation tracing | Selector outputs rationale supporting its judgment for each paper |
| Quality control | Dual-component division: Crawler recall + Selector filtering; Selector decision token probability used for result ranking |
| Core characteristic | Selector serves as single-token reward model, used both for online judgment and as auxiliary reward for Crawler RL training |
2. Offline Construction: Training (Detailed Execution)
Step 2.1 Dataset Construction
AutoScholarQuery
| Item | Description |
| --- | --- |
| Input | Related Work section of each paper |
| Operation | Use GPT-4o to generate academic queries; answers correspond to references cited in Related Work |
| Key decision | Synthetic but high-quality dataset, curated for the AI domain |
| Output | 35k query-paper pairs |
| Quality verification | Sample 100 query-paper pairs to assess reasonableness and relevance |
RealScholarQuery
| Item | Description |
| --- | --- |
| Input | Real-world academic queries |
| Operation | Collect real academic queries to construct benchmark |
| Output | Real academic query benchmark for evaluating more realistic scenarios |
Step 2.2 Selector Training
| Item | Description |
| --- | --- |
| Input | Academic query and a paper |
| Operation | Fine-tune model to generate two outputs: (1) decision token d (True/False); (2) supporting rationale r |
| Key decision | Decision token placed before rationale, enabling Selector to serve as single-token reward model for Crawler training; token probability can be used for ranking search results |
| Training config | SFT, 1 epoch, learning rate 1e-5, batch size 4 |
| Output | Trained Selector model |
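
Because the decision token comes first, a relevance score can be read from a single decoding step. The sketch below shows one way to turn the two decision-token logits into a probability. The notes do not say how the probability is normalized, so renormalizing over only the True/False tokens is an assumption, and `decision_probability` is a hypothetical helper name.

```python
import math


def decision_probability(logit_true: float, logit_false: float) -> float:
    """Return P(True) from a two-way softmax over the decision-token logits.

    Assumption: only the "True" and "False" tokens are considered, so the
    probability is renormalized over these two logits.
    """
    # Subtract the maximum logit for numerical stability.
    m = max(logit_true, logit_false)
    e_true = math.exp(logit_true - m)
    e_false = math.exp(logit_false - m)
    return e_true / (e_true + e_false)
```

The resulting value can serve both as a ranking score at query time and as a scalar signal for Crawler training, which is the dual use the table describes.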
Step 2.3 Crawler Training: Imitation Learning
| Item | Description |
| --- | --- |
| Input | AutoScholarQuery training data subset |
| Operation | Generate trajectories session by session for imitation learning |
| Key decision | Two session types: Search session (starting from state $S_q$) and Expand session (starting from state $S_{q+p}$) |
| Search Session | GPT-4o generates search query trajectories based on the user query |
| Expand Session | Given the query and papers sampled from Google retrieval results, inspect each paper's sub-sections; a sub-section that cites a paper in the training data is always included, otherwise it is selected with 10% probability to increase diversity |
| Output | SFT-initialized Crawler policy model |
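
The Expand-session sampling rule can be sketched as follows. This is a minimal illustration of the "always include, otherwise 10%" rule from the table. The function name, the dictionary layout of `subsections`, and the `rng` parameter are assumptions made for the example.

```python
import random


def select_subsections(subsections, training_papers, keep_prob=0.1, rng=None):
    """Pick the sub-sections to expand when building an SFT trajectory.

    subsections: dict mapping sub-section name -> list of cited paper ids
                 (hypothetical layout).
    training_papers: ids of papers labeled as answers for the query.
    """
    rng = rng or random.Random()
    targets = set(training_papers)
    selected = []
    for name, cited in subsections.items():
        if targets.intersection(cited):
            # The sub-section cites a labeled paper: always include it.
            selected.append(name)
        elif rng.random() < keep_prob:
            # Otherwise include it with low probability for diversity.
            selected.append(name)
    return selected
```

Passing a seeded `random.Random` makes the generated trajectories reproducible.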
Step 2.4 Crawler Training: Reinforcement Learning
| Item | Description |
| --- | --- |
| Input | Query $q$, SFT-initialized policy model $\pi_\theta$ |
| Operation | Model Crawler as token-level MDP, optimize with PPO |
| State | Current LLM context + paper queue |
| Action space | LLM vocabulary; each token represents an action; when action matches function name, execute corresponding function ([Search]/[Expand]/[Stop]) |
| Reward function | $r(s_t, a_t) = \alpha \times \sum_{i=1}^{n_t} \mathbb{I}(q, p_i, t) - c(a_t)$, where $\mathbb{I}$ is the indicator function for new papers matching the query and not in the queue |
| Auxiliary reward | Selector as auxiliary reward model, mitigating sparse reward problem from AutoScholarQuery matching only |
| Key decision | Independent PPO per session; session defined as sub-trajectory ending with [Stop]; Monte Carlo sampling estimates returns |
| Total objective | $\mathcal{L}_{RL}(\theta, \phi) = \mathcal{L}_{policy}(\theta) + \eta \cdot \mathcal{L}_{value}(\phi)$ |
| Output | Trained Crawler policy model |
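
To make the reward concrete, here is a small sketch of the per-step reward and a Monte Carlo return over one session. Several details are assumptions: the `matches` set stands in for "papers that match the query" (for example, labeled papers combined with papers the Selector accepts), and the discount factor `gamma` is a placeholder because the notes do not state one.

```python
def step_reward(new_papers, queue, matches, alpha, action_cost):
    """r(s_t, a_t) = alpha * sum_i I(q, p_i, t) - c(a_t).

    new_papers: papers returned by the action at step t.
    queue: papers already in the paper queue before the action.
    matches: set of papers considered relevant to the query (assumption:
             labeled papers plus Selector-accepted papers).
    """
    seen = set(queue)
    hits = 0
    for paper in new_papers:
        # The indicator is 1 only for matching papers not yet in the queue.
        if paper in matches and paper not in seen:
            hits += 1
            seen.add(paper)  # a duplicate within the same step counts once
    return alpha * hits - action_cost


def monte_carlo_returns(rewards, gamma=1.0):
    """Return-to-go for each step of one session, computed backwards."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns
```

In this sketch a session's rewards would be collected up to its [Stop] action and passed to `monte_carlo_returns`, mirroring the per-session optimization described in the table.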
Step 2.5 Model Assembly
| Item | Description |
| --- | --- |
| Input | Trained Selector and Crawler |
| Operation | Train the two components sequentially; both are based on Qwen2.5-7B |
| Output | Final agent PaSa-7B |
3. Online Query: Retrieval (Detailed Execution)
3.1 Retrieval Mode Overview
PaSa uses a single agentic search mode, with the Crawler autonomously navigating through citation networks:
| Mode | Applicable Scenario | Core Mechanism | Characteristic |
| --- | --- | --- | --- |
| Agentic Search | Complex academic queries | Crawler autonomously Search/Expand/Stop; Selector judges per paper | Learned optimal search strategy, not manually predefined |
3.2 Retrieval Procedure
Step 3.1: Crawler Initialization and Execution
| Item | Description |
| --- | --- |
| Input | User academic query q |
| Operation | Crawler executes token-level MDP, autonomously deciding the next action |
| Action [Search] | Generate search query, invoke retrieval tool, add all results to paper queue |
| Action [Expand] | Extract sub-sections from current paper context, parse all citations and add to paper queue |
| Action [Stop] | Stop current paper processing, reset context, begin processing next paper in queue |
| Key decision | Crawler aims to maximize recall of relevant papers, exploring increasingly relevant papers through citation networks |
| Output | Continuously growing paper queue (potentially containing hundreds or even thousands of papers) |
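
The Search/Expand/Stop loop above can be sketched with a plain queue. This is an illustrative skeleton, not PaSa's implementation: `policy`, `search_tool`, and `expand_tool` are hypothetical callables standing in for the trained Crawler and its tools, and `max_papers` is an added cap so the example always terminates.

```python
from collections import deque


def crawl(query, policy, search_tool, expand_tool, max_papers=50):
    """Run Search/Expand/Stop sessions over a growing paper queue.

    policy(query, current) yields (action, argument) pairs for one session;
    current is None for the initial query-only context.
    """
    queue = []               # every paper discovered so far, in order
    seen = set()
    pending = deque([None])  # contexts still waiting for a session

    def enqueue(papers):
        for paper in papers:
            if paper not in seen and len(queue) < max_papers:
                seen.add(paper)
                queue.append(paper)
                pending.append(paper)

    while pending:
        current = pending.popleft()  # each session starts from a fresh context
        for action, argument in policy(query, current):
            if action == "Search":
                enqueue(search_tool(argument))
            elif action == "Expand" and current is not None:
                enqueue(expand_tool(current, argument))
            elif action == "Stop":
                break  # move on to the next paper in the queue
    return queue
```

De-duplicating with `seen` keeps cyclic citations from re-entering the queue, which matters once the queue grows to hundreds of papers.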
Step 3.2: Selector Judgment and Ranking
| Item | Description |
| --- | --- |
| Input | Query q + each paper in paper queue |
| Operation | Selector carefully reads each paper, determining whether it meets query requirements |
| Output | (1) True/False judgment; (2) supporting rationale; (3) decision token probability for ranking |
| Key decision | Selector emphasizes precise identification of papers meeting user needs, complementing Crawler’s recall objective |
4. Online Generation: Generation (Detailed Execution)
PaSa’s “generation” phase is not traditional RAG answer generation, but rather relevance judgment and ranking of retrieval results through the Selector, ultimately outputting the set of papers satisfying the query.
4.1 Judgment and Ranking Procedure
Input: query q, paper queue Q = {p_1, p_2, ..., p_n} collected by Crawler
              │
              ▼
┌─────────────────────────────┐
│ For each paper p_i in Q     │
│ Call Selector(q, p_i)       │
└─────────────┬───────────────┘
              │
         ┌────┴────┐
         ▼         ▼
       True      False
         │         │
         ▼         ▼
    Retain and  Filter
       rank
(by decision probability)
         │
         ▼
Output final paper set
Step 4.1: Relevance Judgment
| Item | Description |
| --- | --- |
| Input | Query $q$, paper $p_i$ |
| Operation | Selector generates decision token d (True/False) and supporting rationale r |
| Key decision | Decision token probability can be used for ranking search results, enabling fine-grained sorting |
| Output | Relevance judgment + rationale + ranking score |
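
The filter-then-rank step can be written in a few lines. In this sketch `selector` is a hypothetical callable returning a `(decision, probability, rationale)` tuple; the real Selector is an LLM, and its interface is not documented in these notes.

```python
def select_and_rank(query, papers, selector):
    """Keep papers judged True and sort them by decision probability.

    selector(query, paper) -> (decision: bool, probability: float,
                               rationale: str)  # hypothetical interface
    """
    kept = []
    for paper in papers:
        decision, probability, rationale = selector(query, paper)
        if decision:
            kept.append((paper, probability, rationale))
    # Highest decision probability first; ties keep queue order.
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept
```

Since Python's sort is stable, papers with equal probability stay in the order the Crawler discovered them.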
5. Key Design Decisions
| Decision Point | PaSa’s Choice | Alternative | Rationale |
| --- | --- | --- | --- |
| System architecture | Crawler-Selector dual-component | Single-component end-to-end | Crawler maximizes recall, Selector maximizes precision; clear division of labor |
| Training paradigm | RL (AGILE framework, PPO) | SFT / prompt engineering | Optimize PaSa in the AGILE RL framework, teaching the agent optimal search strategies |
| Crawler training | Imitation Learning + RL two-stage | SFT only / RL only | First generate trajectories for imitation learning, then apply RL |
| Reward design | Outcome-oriented + Selector auxiliary reward | Sparse matching reward only | AutoScholarQuery matching alone causes sparse rewards; Selector as auxiliary reward model mitigates this |
| RL optimization granularity | Independent PPO per session | Unified optimization over entire trajectory | Session defined as sub-trajectory ending with [Stop]; each session independently optimized |
| Selector output | Decision token first, followed by rationale | Rationale before decision token | Decision token placement enables Selector to serve as single-token reward model; probability usable for ranking |
| Data generation | GPT-4o synthesizes from Related Work | Manual annotation | Generates queries from Related Work sections; answers correspond to cited references |
6. Evaluation
6.1 Evaluation Metrics
| Metric | Meaning | PaSa vs Baseline |
| --- | --- | --- |
| Recall@20 | Relevant paper recall in top 20 results | PaSa-7B vs Google with GPT-4o: +37.78% |
| Recall@50 | Relevant paper recall in top 50 results | PaSa-7B vs Google with GPT-4o: +39.90% |
| Recall | Overall recall | PaSa-7B vs PaSa-GPT-4o: +30.36% |
| Precision | Fraction of returned papers that are relevant | PaSa-7B vs PaSa-GPT-4o: +4.25% |
| F1 | Composite F1 score | — |
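
For reference, the metrics in the table follow the standard definitions, sketched below. The function names are chosen for this example, and returning 0.0 when a denominator is empty is a convention picked here, not something the notes specify.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of relevant papers that appear in the top-k ranked results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision_recall_f1(returned, relevant):
    """Set-based precision, recall, and F1 for the final returned papers."""
    returned, relevant = set(returned), set(relevant)
    true_positives = len(returned & relevant)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Recall@k depends on the ranking produced from the decision token probability, while precision, recall, and F1 depend only on which papers the Selector retains.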

6.2 Comparative Experimental Setup
| Condition | Description |
| --- | --- |
| PaSa-7B | Full system (Qwen2.5-7B base, SFT + RL training) |
| Google | Traditional search engine baseline |
| Google Scholar | Academic search engine baseline |
| Google with GPT-4o | Google + GPT-4o query rewriting baseline |
| ChatGPT (search-enabled GPT-4o) | Search-augmented ChatGPT baseline |
| GPT-o1 | OpenAI o1 reasoning model baseline |
| PaSa-GPT-4o | PaSa baseline implemented by prompting GPT-4o |
| PaSa w/o RL | No-RL version baseline |
6.3 Datasets
| Dataset | Description |
| --- | --- |
| AutoScholarQuery | Synthetic academic query dataset (35k), for training |
| RealScholarQuery | Real-world academic query benchmark, for evaluation |
7. Limitations and Applicability
Notes do not explicitly document PaSa’s limitations.
Best Applicable Scenarios
- Complex academic queries that require long-tail expertise, survey-level coverage, or fine-grained matching
- Literature retrieval tasks requiring deep exploration within citation networks
- Academic search scenarios with high demands on both recall and precision
Unsuitable Scenarios
8. Quick Reference