[Reading] Weekly Reading Selections, 2026-05-11 → 2026-05-17

Posted by Cao Zihang on May 11, 2026

Academic

PaSa

One-sentence positioning: An LLM-powered academic literature search agent that autonomously completes search tool invocations, paper reading, and citation expansion through a Crawler-Selector dual-component architecture, optimized via reinforcement learning for complex academic queries.

Key innovation: Models academic search as an agentic decision process in which the Crawler maximizes recall over citation networks and the Selector judges relevance precisely. The agent is optimized within the AGILE RL framework. A synthetic dataset (AutoScholarQuery) and a real-world benchmark (RealScholarQuery) were constructed for training and evaluation.


0. Execution Overview

Offline Phase (one-time training)
  ├─ ① Dataset preparation
  │   ├─ AutoScholarQuery: synthetic academic query-paper pairs (35k), generated by GPT-4o from Related Work
  │   └─ RealScholarQuery: real-world academic query benchmark
  ├─ ② Selector training (SFT)
  │   ├─ Input: academic query + paper
  │   ├─ Output: decision token (True/False) + supporting rationale
  │   └─ Decision token placed before rationale, enabling it to serve as a single-token auxiliary reward model for the Crawler
  └─ ③ Crawler training (SFT → RL)
      ├─ SFT: generate trajectories session by session (Search session / Expand session)
      └─ RL (PPO): token-level MDP, each session independently optimized
         ├─ Action space: [Search] generate query and retrieve / [Expand] extract citations / [Stop] switch to next paper
         ├─ Reward function: α × Σ I(q, p_i, t) − c(a_t)
         └─ Auxiliary reward: Selector as reward model to mitigate sparse rewards

Online Phase (per query)
  ├─ ① Input academic query q
  ├─ ② Crawler iterative execution
  │   ├─ [Search] → generate search query → invoke retrieval tool → add results to paper queue
  │   ├─ [Expand] → extract sub-section citations from current paper → add to paper queue
  │   ├─ [Stop] → switch to next paper in queue
  │   └─ Navigate through citation network, continuously discovering more relevant papers
  ├─ ③ Selector per-paper judgment
  │   ├─ Input: query q + each paper in paper queue
  │   ├─ Output: True/False judgment + rationale
  │   └─ Decision token probability used for ranking search results
  └─ ④ Return final retrieval result set

1. High-level Design (Indexing → Retrieval → Generation)

1.1 Indexing

| Dimension | Approach |
| --- | --- |
| Chunking strategy | |
| Index structure | None documented; relies on an external search engine (the notes do not explicitly describe offline index construction) |
| Knowledge representation | |
| Construction cost | |
| Core characteristic | The notes do not document index construction; the offline phase focuses on model training and dataset construction |

1.2 Retrieval

| Dimension | Approach |
| --- | --- |
| Retrieval method | Agentic dynamic search (Crawler autonomously decides Search / Expand / Stop) |
| Retrieval granularity | Paper level (collects entire papers through citation networks) |
| Iteration strategy | Multi-round iteration (Crawler continuously navigates the citation network, expanding the paper queue) |
| Query processing | Model automatically generates search queries based on the current context and the user query |
| Core characteristic | Crawler-Selector dual-component: Crawler maximizes recall, Selector maximizes precision; RL optimization teaches the agent optimal search strategies |

1.3 Generation

| Dimension | Approach |
| --- | --- |
| Context injection | Selector matches retrieved papers against the query, generating True/False judgments and rationales |
| Citation tracing | Selector outputs a rationale supporting its judgment for each paper |
| Quality control | Dual-component division: Crawler recall + Selector filtering; Selector decision token probability used for result ranking |
| Core characteristic | Selector serves as a single-token reward model, used both for online judgment and as an auxiliary reward for Crawler RL training |

2. Offline Construction: Training (Detailed Execution)

Step 2.1 Dataset Construction

AutoScholarQuery

| Item | Description |
| --- | --- |
| Input | Related Work section of each paper |
| Operation | Use GPT-4o to generate academic queries; answers correspond to the references cited in Related Work |
| Key decision | Synthetic but high-quality dataset, curated for the AI domain |
| Output | 35k query-paper pairs |
| Quality verification | Sample 100 query-paper pairs to assess reasonableness and relevance |

RealScholarQuery

| Item | Description |
| --- | --- |
| Input | Real-world academic queries |
| Operation | Collect real academic queries to construct a benchmark |
| Output | Real academic query benchmark for evaluating more realistic scenarios |

Step 2.2 Selector Training

| Item | Description |
| --- | --- |
| Input | Academic query and a paper |
| Operation | Fine-tune the model to generate two outputs: (1) decision token $d$ (True/False); (2) supporting rationale $r$ |
| Key decision | Decision token is placed before the rationale, enabling the Selector to serve as a single-token reward model for Crawler training; the token probability can be used for ranking search results |
| Training config | SFT, 1 epoch, learning rate 1e-5, batch size 4 |
| Output | Trained Selector model |
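
Because the decision token comes first, a score can be read from a single decoding step. The following is a minimal sketch of that idea, not PaSa's actual implementation: it assumes we already have the logits of the `True` and `False` tokens at the first output position, and it normalizes over just those two tokens (that two-way restriction and the function name `decision_probability` are assumptions made for illustration).

```python
import math


def decision_probability(logit_true: float, logit_false: float) -> float:
    """Return P(True) from a two-way softmax over the decision-token logits.

    Hypothetical helper: restricting the softmax to the True/False tokens
    is an assumption, not something the notes specify.
    """
    # Subtract the larger logit before exponentiating for numerical stability.
    m = max(logit_true, logit_false)
    e_true = math.exp(logit_true - m)
    e_false = math.exp(logit_false - m)
    return e_true / (e_true + e_false)
```

The same scalar could plausibly serve both roles mentioned above: thresholded as a True/False reward signal during Crawler training, and used directly as a ranking score at query time.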

Step 2.3 Crawler Training: Imitation Learning

| Item | Description |
| --- | --- |
| Input | AutoScholarQuery training data subset |
| Operation | Generate trajectories session by session for imitation learning |
| Key decision | Two session types: Search session (starting from state $S_q$) and Expand session (starting from state $S_{q+p}$) |
| Search Session | GPT-4o generates search query trajectories based on the user query |
| Expand Session | Given the query, sample papers retrieved via Google and inspect their sub-sections; a sub-section that cites a paper from the training data must be included, otherwise it is selected at random with 10% probability to increase diversity |
| Output | SFT-initialized Crawler policy model |
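
The Expand-session sampling rule can be sketched as follows. This is an illustrative reading of the rule in the notes, not the authors' code: `select_sections`, its arguments, and the dictionary layout (section name mapped to the IDs of the papers it cites) are all hypothetical.

```python
import random


def select_sections(section_citations, target_ids, keep_prob=0.1, rng=None):
    """Pick the sub-sections to expand when building an imitation trajectory.

    section_citations: dict mapping section name -> iterable of cited paper IDs
    target_ids: set of paper IDs that answer the query in the training data
    keep_prob: chance of keeping a section with no target citation (10% in the notes)
    """
    rng = rng or random.Random()
    selected = []
    for name, cited in section_citations.items():
        if target_ids & set(cited):
            # Sections citing a known target paper are always included.
            selected.append(name)
        elif rng.random() < keep_prob:
            # Otherwise keep the section occasionally to add diversity.
            selected.append(name)
    return selected
```

Passing a seeded `random.Random` makes the sampled trajectories reproducible.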

Step 2.4 Crawler Training: Reinforcement Learning

| Item | Description |
| --- | --- |
| Input | Query $q$, SFT-initialized policy model $\pi_\theta$ |
| Operation | Model the Crawler as a token-level MDP and optimize it with PPO |
| State | Current LLM context + paper queue |
| Action space | LLM vocabulary; each token is an action; when the generated tokens match a function name, the corresponding function is executed ([Search]/[Expand]/[Stop]) |
| Reward function | $r(s_t, a_t) = \alpha \times \sum_{i=1}^{n_t} \mathbb{I}(q, p_i, t) - c(a_t)$, where $\mathbb{I}$ is the indicator function for a new paper that matches the query and is not already in the queue |
| Auxiliary reward | Selector acts as an auxiliary reward model, mitigating the sparse rewards that arise when matching against AutoScholarQuery alone |
| Key decision | Independent PPO per session; a session is a sub-trajectory ending with [Stop]; Monte Carlo sampling estimates returns |
| Total objective | $L_{RL}(\theta, \phi) = L_{policy}(\theta) + \eta \cdot L_{value}(\phi)$ |
| Output | Trained Crawler policy model |
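
To make the reward and the Monte Carlo return concrete, here is a small sketch under stated assumptions. The function names, the `is_relevant` callback (standing in for "matches AutoScholarQuery or is accepted by the Selector"), the discount `gamma`, and every numeric value are illustrative choices, not values taken from the notes.

```python
def step_reward(new_papers, queue, is_relevant, action_cost, alpha=1.0):
    """r(s_t, a_t) = alpha * sum_i I(q, p_i, t) - c(a_t).

    The indicator is 1 for a paper that is relevant to the query and not
    already in the queue; duplicates returned by one action count once.
    """
    fresh = set(new_papers) - set(queue)
    hits = sum(1 for paper in fresh if is_relevant(paper))
    return alpha * hits - action_cost


def monte_carlo_returns(rewards, gamma=1.0):
    """Return-to-go for each step of one session (sub-trajectory ending in [Stop])."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns
```

For example, `monte_carlo_returns([1.0, 0.0, 2.0])` gives `[3.0, 2.0, 2.0]`: each entry is the sum of the rewards from that step to the end of the session.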

Step 2.5 Model Assembly

| Item | Description |
| --- | --- |
| Input | Trained Selector and Crawler |
| Operation | Train the two components sequentially; both are based on Qwen2.5-7B |
| Output | Final agent PaSa-7B |

3. Online Query: Retrieval (Detailed Execution)

3.1 Retrieval Mode Overview

PaSa uses a single agentic search mode, with the Crawler autonomously navigating through citation networks:

| Mode | Applicable Scenario | Core Mechanism | Characteristic |
| --- | --- | --- | --- |
| Agentic Search | Complex academic queries | Crawler autonomously chooses Search/Expand/Stop; Selector judges each paper | Learned optimal search strategy, not manually predefined |

3.2 Retrieval Procedure

Step 3.1: Crawler Initialization and Execution

| Item | Description |
| --- | --- |
| Input | User academic query $q$ |
| Operation | Crawler executes the token-level MDP, autonomously deciding the next action |
| Action [Search] | Generate a search query, invoke the retrieval tool, add all results to the paper queue |
| Action [Expand] | Extract sub-sections from the current paper context, parse all citations and add them to the paper queue |
| Action [Stop] | Stop processing the current paper, reset the context, begin processing the next paper in the queue |
| Key decision | Crawler aims to maximize recall of relevant papers, exploring increasingly relevant papers through citation networks |
| Output | Continuously growing paper queue (potentially containing hundreds or even thousands of papers) |

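
The Search / Expand / Stop loop can be sketched as a queue traversal. This is a simplified illustration, not the real system: in PaSa the actions are emitted token by token by an LLM, whereas here `policy`, `search_tool`, and `expand_tool` are hypothetical callables, and the `max_papers` cap is an added assumption to guarantee termination.

```python
def crawl(query, policy, search_tool, expand_tool, max_papers=100):
    """Collect papers by running one session per context until the queue is exhausted.

    policy(query, paper) yields (action, argument) pairs; paper is None for
    the initial session whose context holds only the user query.
    """
    queue, seen = [], set()

    def enqueue(papers):
        for paper in papers:
            if paper not in seen and len(queue) < max_papers:
                seen.add(paper)
                queue.append(paper)

    cursor = -1  # -1 denotes the initial query-only context
    while cursor < len(queue):
        current = None if cursor < 0 else queue[cursor]
        for action, arg in policy(query, current):
            if action == "search":
                enqueue(search_tool(arg))           # arg: generated search query
            elif action == "expand" and current is not None:
                enqueue(expand_tool(current, arg))  # arg: sub-section to expand
            elif action == "stop":
                break                               # reset context, move to next paper
        cursor += 1
    return queue
```

De-duplicating through `seen` mirrors the "not already in the queue" condition of the reward indicator: a paper reached again through another citation path is not re-queued.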
Step 3.2: Selector Judgment and Ranking

| Item | Description |
| --- | --- |
| Input | Query $q$ + each paper in the paper queue |
| Operation | Selector carefully reads each paper, determining whether it meets the query requirements |
| Output | (1) True/False judgment; (2) supporting rationale; (3) decision token probability for ranking |
| Key decision | Selector emphasizes precise identification of papers meeting user needs, complementing the Crawler's recall objective |

4. Online Generation: Generation (Detailed Execution)

PaSa’s “generation” phase is not traditional RAG answer generation, but rather relevance judgment and ranking of retrieval results through the Selector, ultimately outputting the set of papers satisfying the query.

4.1 Judgment and Ranking Procedure

Input: query q, paper queue Q = {p_1, p_2, ..., p_n} collected by Crawler
    │
    ▼
┌─────────────────────────────┐
│ For each paper p_i in Q     │
│ Call Selector(q, p_i)       │
└─────────────┬───────────────┘
              │
         ┌────┴────┐
         ▼         ▼
      True       False
         │         │
         ▼         ▼
    Retain and   Filter
    rank
    (by decision probability)
         │
         ▼
    Output final paper set

Step 4.1: Relevance Judgment

| Item | Description |
| --- | --- |
| Input | Query $q$, paper $p_i$ |
| Operation | Selector generates decision token $d$ (True/False) and supporting rationale $r$ |
| Key decision | Decision token probability can be used for ranking search results, enabling fine-grained sorting |
| Output | Relevance judgment + rationale + ranking score |
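
The filter-then-rank step can be sketched in a few lines. The `selector` callable below is a hypothetical stand-in that returns a `(decision, probability_of_true)` pair for a query and a paper; the real Selector is a fine-tuned LLM that also produces a rationale.

```python
def filter_and_rank(query, papers, selector):
    """Keep papers the Selector judges True and sort them by decision probability."""
    kept = []
    for paper in papers:
        decision, prob_true = selector(query, paper)
        if decision:
            kept.append((paper, prob_true))
    # Higher P(True) first; sorted() is stable, so ties keep queue order.
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

Keeping the probability next to each paper lets a caller truncate the result to the top 20 or top 50 entries.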

5. Key Design Decisions

| Decision Point | PaSa's Choice | Alternative | Rationale |
| --- | --- | --- | --- |
| System architecture | Crawler-Selector dual-component | Single-component end-to-end | Crawler maximizes recall, Selector maximizes precision; clear division of labor |
| Training paradigm | RL (AGILE framework, PPO) | SFT / prompt engineering | Optimizing PaSa in the AGILE RL framework teaches the agent optimal search strategies |
| Crawler training | Two-stage: imitation learning, then RL | SFT only / RL only | First generate trajectories for imitation learning, then apply RL |
| Reward design | Outcome-oriented + Selector auxiliary reward | Sparse matching reward only | AutoScholarQuery matching alone causes sparse rewards; the Selector as auxiliary reward model mitigates this |
| RL optimization granularity | Independent PPO per session | Unified optimization over the entire trajectory | A session is a sub-trajectory ending with [Stop]; each session is optimized independently |
| Selector output | Decision token first, then rationale | Rationale before decision token | Placing the decision token first lets the Selector serve as a single-token reward model; its probability is usable for ranking |
| Data generation | GPT-4o synthesizes from Related Work | Manual annotation | Queries are generated from Related Work sections; answers correspond to the cited references |

6. Evaluation

6.1 Evaluation Metrics

| Metric | Meaning | PaSa vs Baseline |
| --- | --- | --- |
| Recall@20 | Recall of relevant papers in the top 20 results | PaSa-7B vs Google with GPT-4o: +37.78% |
| Recall@50 | Recall of relevant papers in the top 50 results | PaSa-7B vs Google with GPT-4o: +39.90% |
| Recall | Overall recall | PaSa-7B vs PaSa-GPT-4o: +30.36% |
| Precision | Precision | PaSa-7B vs PaSa-GPT-4o: +4.25% |
| F1 | Composite F1 score | |
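
For reference, the metrics in the table follow the standard set-based definitions, which can be computed as below. This is a generic sketch and not the benchmark's evaluation script; the function names are chosen here for illustration.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of relevant papers that appear among the top-k ranked results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision_recall_f1(returned, relevant):
    """Set-based precision, recall, and their harmonic mean (F1)."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```

Recall@k depends on the ranking order, which is why the decision-token probability matters for it, while overall recall, precision, and F1 are computed over the whole returned set.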


6.2 Comparative Experimental Setup

| Condition | Description |
| --- | --- |
| PaSa-7B | Full system (Qwen2.5-7B base, SFT + RL training) |
| Google | Traditional search engine baseline |
| Google Scholar | Academic search engine baseline |
| Google with GPT-4o | Google + GPT-4o query rewriting baseline |
| ChatGPT (search-enabled GPT-4o) | Search-augmented ChatGPT baseline |
| GPT-o1 | OpenAI o1 reasoning model baseline |
| PaSa-GPT-4o | PaSa baseline implemented by prompting GPT-4o |
| PaSa w/o RL | No-RL version baseline |

6.3 Datasets

| Dataset | Description |
| --- | --- |
| AutoScholarQuery | Synthetic academic query dataset (35k), for training |
| RealScholarQuery | Real-world academic query benchmark, for evaluation |

7. Limitations and Applicability

Notes do not explicitly document PaSa’s limitations.

Best Applicable Scenarios

  • Complex academic queries that involve long-tail expertise, comprehensive survey-level coverage, or fine-grained requirements
  • Literature retrieval tasks requiring deep exploration within citation networks
  • Academic search scenarios with high demands on both recall and precision

Unsuitable Scenarios


8. Quick Reference

| What You Want to Know | See Which Section |
| --- | --- |
| What is the complete pipeline? | 0. Execution Overview |
| High-level design comparison? | 1. High-level Design |
| How is the model trained? | 2. Offline Construction |
| How is retrieval executed? | 3. Online Query |
| How is relevance judged? | 4. Online Generation |
| Why is it designed this way? | 5. Key Design Decisions |
| How does it perform? | 6. Evaluation |
| When should it NOT be used? | 7. Limitations and Applicability |