arxiv

arXiv Research

Search and retrieve academic papers from arXiv via their free REST API. No API key, no dependencies — just curl.

Quick Reference

| Action | Command |
| --- | --- |
| Search papers | `curl "https://export.arxiv.org/api/query?search_query=all:QUERY&max_results=5"` |
| Get specific paper | `curl "https://export.arxiv.org/api/query?id_list=2402.03300"` |
| Read abstract (web) | `web_extract(urls=["https://arxiv.org/abs/2402.03300"])` |
| Read full paper (PDF) | `web_extract(urls=["https://arxiv.org/pdf/2402.03300"])` |

Searching Papers

The API returns Atom XML. Parse with `grep`/`sed` or pipe through `python3` for clean output.

Basic search

bash
curl -s "https://export.arxiv.org/api/query?search_query=all:GRPO+reinforcement+learning&max_results=5"

Clean output (parse XML to readable format)

bash
curl -s "https://export.arxiv.org/api/query?search_query=all:GRPO+reinforcement+learning&max_results=5&sortBy=submittedDate&sortOrder=descending" | python3 -c "
import sys, xml.etree.ElementTree as ET
ns = {'a': 'http://www.w3.org/2005/Atom'}
root = ET.parse(sys.stdin).getroot()
for i, entry in enumerate(root.findall('a:entry', ns)):
    title = entry.find('a:title', ns).text.strip().replace('\n', ' ')
    arxiv_id = entry.find('a:id', ns).text.strip().split('/abs/')[-1]
    published = entry.find('a:published', ns).text[:10]
    authors = ', '.join(a.find('a:name', ns).text for a in entry.findall('a:author', ns))
    summary = entry.find('a:summary', ns).text.strip()[:200]
    cats = ', '.join(c.get('term') for c in entry.findall('a:category', ns))
    print(f'{i+1}. [{arxiv_id}] {title}')
    print(f'   Authors: {authors}')
    print(f'   Published: {published} | Categories: {cats}')
    print(f'   Abstract: {summary}...')
    print(f'   PDF: https://arxiv.org/pdf/{arxiv_id}')
    print()
"

Search Query Syntax

| Prefix | Searches | Example |
| --- | --- | --- |
| `all:` | All fields | `all:transformer+attention` |
| `ti:` | Title | `ti:large+language+models` |
| `au:` | Author | `au:vaswani` |
| `abs:` | Abstract | `abs:reinforcement+learning` |
| `cat:` | Category | `cat:cs.AI` |
| `co:` | Comment | `co:accepted+NeurIPS` |

Boolean operators

AND (default when using +)

search_query=all:transformer+attention

OR

search_query=all:GPT+OR+all:BERT

AND NOT

search_query=all:language+model+ANDNOT+all:vision

Exact phrase

search_query=ti:"chain+of+thought"

Combined

search_query=au:hinton+AND+cat:cs.LG
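
The operators above can also be composed in code before the query is URL-encoded. The following is a minimal sketch rather than part of the skill: the helper names `term`, `combine`, and `search_url` are hypothetical, and quoting every multi-word value as an exact phrase is a design choice made here for illustration.

```python
from urllib.parse import quote_plus

BASE_URL = "https://export.arxiv.org/api/query"


def term(field: str, text: str) -> str:
    """Build a fielded term; multi-word text is quoted as an exact phrase."""
    if " " in text:
        text = f'"{text}"'
    return f"{field}:{text}"


def combine(operator: str, *terms: str) -> str:
    """Join terms with one of the operators shown above: AND, OR, ANDNOT."""
    if operator not in {"AND", "OR", "ANDNOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return f" {operator} ".join(terms)


def search_url(query: str, max_results: int = 5) -> str:
    """URL-encode the query (spaces become +) and attach it to the endpoint."""
    encoded = quote_plus(query, safe=":")  # keep field prefixes readable
    return f"{BASE_URL}?search_query={encoded}&max_results={max_results}"


query = combine("AND", term("au", "hinton"), term("cat", "cs.LG"))
print(search_url(query))
print(search_url(term("ti", "chain of thought")))
```

The first URL reproduces the combined query shown above; in the second, the quotes around the phrase are percent-encoded as `%22`.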

Sort and Pagination

| Parameter | Options |
| --- | --- |
| `sortBy` | `relevance`, `lastUpdatedDate`, `submittedDate` |
| `sortOrder` | `ascending`, `descending` |
| `start` | Result offset (0-based) |
| `max_results` | Number of results (default 10, max 30000) |

Latest 10 papers in cs.AI
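
The command for this example did not survive in the text, so the following is a plausible reconstruction rather than the original. It is a sketch that uses only the `sortBy`, `sortOrder`, `start`, and `max_results` parameters listed in this section; the helper name `build_query_url` is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://export.arxiv.org/api/query"


def build_query_url(search_query, start=0, max_results=10,
                    sort_by="relevance", sort_order="descending"):
    """Assemble an arXiv API URL from the query parameters in this section."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
        "sortBy": sort_by,
        "sortOrder": sort_order,
    }
    # safe=":" keeps field prefixes such as cat: readable in the URL
    return f"{BASE_URL}?{urlencode(params, safe=':')}"


# Latest 10 papers in cs.AI, newest submissions first
print(build_query_url("cat:cs.AI", sort_by="submittedDate"))

# Next page of the same listing (offset 10)
print(build_query_url("cat:cs.AI", start=10, sort_by="submittedDate"))
```

Either URL can be passed to `curl -s` and piped into the parsing snippet from the search section.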

Fetching Specific Papers

By arXiv ID

Multiple papers
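
The commands for these two cases are missing from the text. As a sketch, and assuming `id_list` accepts a comma-separated list in the same way the helper script's `--id 2402.03300,2401.12345` form suggests, both lookups could be built like this (the function name `id_list_url` is hypothetical):

```python
BASE_URL = "https://export.arxiv.org/api/query"


def id_list_url(*arxiv_ids: str) -> str:
    """Build a lookup URL for one or more arXiv IDs (comma-separated id_list)."""
    if not arxiv_ids:
        raise ValueError("at least one arXiv ID is required")
    cleaned = [arxiv_id.strip() for arxiv_id in arxiv_ids]
    return f"{BASE_URL}?id_list={','.join(cleaned)}"


# By arXiv ID
print(id_list_url("2402.03300"))

# Multiple papers in one request
print(id_list_url("2402.03300", "2401.12345"))
```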

BibTeX Generation

After fetching metadata for a paper, generate a BibTeX entry:
{% raw %}
bash
curl -s "https://export.arxiv.org/api/query?id_list=1706.03762" | python3 -c "
import sys, xml.etree.ElementTree as ET
ns = {'a': 'http://www.w3.org/2005/Atom', 'arxiv': 'http://arxiv.org/schemas/atom'}
root = ET.parse(sys.stdin).getroot()
entry = root.find('a:entry', ns)
if entry is None: sys.exit('Paper not found')
title = entry.find('a:title', ns).text.strip().replace('\n', ' ')
authors = ' and '.join(a.find('a:name', ns).text for a in entry.findall('a:author', ns))
year = entry.find('a:published', ns).text[:4]
raw_id = entry.find('a:id', ns).text.strip().split('/abs/')[-1]
cat = entry.find('arxiv:primary_category', ns)
primary = cat.get('term') if cat is not None else 'cs.LG'
last_name = entry.find('a:author', ns).find('a:name', ns).text.split()[-1]
print(f'@article{{{last_name}{year}_{raw_id.replace(\".\", \"\")},')
print(f'  title     = {{{title}}},')
print(f'  author    = {{{authors}}},')
print(f'  year      = {{{year}}},')
print(f'  eprint    = {{{raw_id}}},')
print(f'  archivePrefix = {{arXiv}},')
print(f'  primaryClass  = {{{primary}}},')
print(f'  url       = {{https://arxiv.org/abs/{raw_id}}}')
print('}')
"
{% endraw %}

Reading Paper Content

After finding a paper, read it:

Abstract page (fast, metadata + abstract)

web_extract(urls=["https://arxiv.org/abs/2402.03300"])

Full paper (PDF → markdown via Firecrawl)

web_extract(urls=["https://arxiv.org/pdf/2402.03300"])

For local PDF processing, see the `ocr-and-documents` skill.

Common Categories

| Category | Field |
| --- | --- |
| `cs.AI` | Artificial Intelligence |
| `cs.CL` | Computation and Language (NLP) |
| `cs.CV` | Computer Vision |
| `cs.LG` | Machine Learning |
| `cs.CR` | Cryptography and Security |
| `stat.ML` | Machine Learning (Statistics) |
| `math.OC` | Optimization and Control |
| `physics.comp-ph` | Computational Physics |

Helper Script

The `scripts/search_arxiv.py` script handles XML parsing and provides clean output:
bash
python scripts/search_arxiv.py "GRPO reinforcement learning"
python scripts/search_arxiv.py "transformer attention" --max 10 --sort date
python scripts/search_arxiv.py --author "Yann LeCun" --max 5
python scripts/search_arxiv.py --category cs.AI --sort date
python scripts/search_arxiv.py --id 2402.03300
python scripts/search_arxiv.py --id 2402.03300,2401.12345
No dependencies — uses only Python stdlib.

Semantic Scholar (Citations, Related Papers, Author Profiles)

arXiv doesn't provide citation data or recommendations. Use the Semantic Scholar API for that — free, no key needed for basic use (1 req/sec), returns JSON.

Get paper details + citations

By arXiv ID

By Semantic Scholar paper ID or DOI
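
The commands for these two lookups are missing from the text. The sketch below assumes the same `/graph/v1/paper/{id}` endpoint and `arXiv:` prefix used by the citation and reference examples in this section; the `DOI:` prefix is an assumption about the ID format, the DOI shown is a made-up placeholder, and `paper_url` is a hypothetical helper.

```python
from urllib.parse import quote, urlencode

S2_BASE = "https://api.semanticscholar.org/graph/v1/paper"
DEFAULT_FIELDS = ("title", "authors", "year", "citationCount")


def paper_url(paper_id: str, fields=DEFAULT_FIELDS) -> str:
    """Build a paper-details URL for a prefixed or raw Semantic Scholar ID."""
    # Keep ':' and '/' literal so ID prefixes and DOIs stay readable
    path = quote(paper_id, safe=":/")
    query = urlencode({"fields": ",".join(fields)}, safe=",")
    return f"{S2_BASE}/{path}?{query}"


# By arXiv ID
print(paper_url("arXiv:2402.03300"))

# By DOI (placeholder value, not a real paper)
print(paper_url("DOI:10.1234/example"))
```

The resulting URL can be fetched with `curl -s` and piped through `python3 -m json.tool`, as in the other examples in this section.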

Get citations OF a paper (who cited it)

bash
curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:2402.03300/citations?fields=title,authors,year,citationCount&limit=10" | python3 -m json.tool

Get references FROM a paper (what it cites)

bash
curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:2402.03300/references?fields=title,authors,year,citationCount&limit=10" | python3 -m json.tool

Search papers (alternative to arXiv search, returns JSON)

bash
curl -s "https://api.semanticscholar.org/graph/v1/paper/search?query=GRPO+reinforcement+learning&limit=5&fields=title,authors,year,citationCount,externalIds" | python3 -m json.tool

Get paper recommendations

bash
curl -s -X POST "https://api.semanticscholar.org/recommendations/v1/papers/" \
  -H "Content-Type: application/json" \
  -d '{"positivePaperIds": ["arXiv:2402.03300"], "negativePaperIds": []}' | python3 -m json.tool
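
For scripts that already run in Python, the same POST can be prepared with the standard library instead of curl. This is a sketch that mirrors the curl call above; the function name `build_recommendation_request` is hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request

RECS_URL = "https://api.semanticscholar.org/recommendations/v1/papers/"


def build_recommendation_request(positive_ids, negative_ids=()):
    """Prepare (but do not send) the POST request for recommendations."""
    body = json.dumps({
        "positivePaperIds": list(positive_ids),
        "negativePaperIds": list(negative_ids),
    }).encode("utf-8")
    return urllib.request.Request(
        RECS_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_recommendation_request(["arXiv:2402.03300"])
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req, timeout=30), then json.load() the response
```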

Author profile

bash
curl -s "https://api.semanticscholar.org/graph/v1/author/search?query=Yann+LeCun&fields=name,hIndex,citationCount,paperCount" | python3 -m json.tool

Useful Semantic Scholar fields

`title`, `authors`, `year`, `abstract`, `citationCount`, `referenceCount`, `influentialCitationCount`, `isOpenAccess`, `openAccessPdf`, `fieldsOfStudy`, `publicationVenue`, `externalIds` (contains arXiv ID, DOI, etc.)

Complete Research Workflow

  1. Discover:
    python scripts/search_arxiv.py "your topic" --sort date --max 10
  2. Assess impact:
    curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:ID?fields=citationCount,influentialCitationCount"
  3. Read abstract:
    web_extract(urls=["https://arxiv.org/abs/ID"])
  4. Read full paper:
    web_extract(urls=["https://arxiv.org/pdf/ID"])
  5. Find related work:
    curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:ID/references?fields=title,citationCount&limit=20"
  6. Get recommendations: POST to Semantic Scholar recommendations endpoint
  7. Track authors:
    curl -s "https://api.semanticscholar.org/graph/v1/author/search?query=NAME"
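
The URL-based steps of the workflow (2 to 5 and 7) differ only in the paper ID and author name, so they can be generated from a single function. This is an illustrative sketch; `workflow_urls` and its dictionary keys are hypothetical names, and the URLs are the ones listed in the steps above.

```python
from urllib.parse import quote_plus

S2 = "https://api.semanticscholar.org/graph/v1"


def workflow_urls(arxiv_id: str, author: str) -> dict:
    """Map workflow steps 2-5 and 7 to the URLs they fetch."""
    return {
        "impact": f"{S2}/paper/arXiv:{arxiv_id}?fields=citationCount,influentialCitationCount",
        "abstract": f"https://arxiv.org/abs/{arxiv_id}",
        "pdf": f"https://arxiv.org/pdf/{arxiv_id}",
        "references": f"{S2}/paper/arXiv:{arxiv_id}/references?fields=title,citationCount&limit=20",
        "author": f"{S2}/author/search?query={quote_plus(author)}",
    }


for step, url in workflow_urls("2402.03300", "Yann LeCun").items():
    print(f"{step:>10}: {url}")
```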

Rate Limits

| API | Rate | Auth |
| --- | --- | --- |
| arXiv | ~1 req / 3 seconds | None needed |
| Semantic Scholar | 1 req / second | None (100/sec with API key) |
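
One simple way to stay within these limits in a script is to enforce a minimum interval between calls. The following is a minimal sketch, not part of the skill: the `Throttle` class is hypothetical, and the intervals are taken from the rates listed above.

```python
import time


class Throttle:
    """Block until at least `min_interval` seconds have passed since the last call."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


arxiv_throttle = Throttle(3.0)  # about 1 request every 3 seconds
s2_throttle = Throttle(1.0)     # 1 request per second without an API key
```

Calling `arxiv_throttle.wait()` immediately before each arXiv request keeps consecutive requests at least three seconds apart.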

Notes

  • arXiv returns Atom XML; use the helper script or parsing snippet for clean output
  • Semantic Scholar returns JSON; pipe through `python3 -m json.tool` for readability
  • arXiv IDs: old format (`hep-th/0601001`) vs new (`2402.03300`)
  • PDF: `https://arxiv.org/pdf/{id}`; Abstract: `https://arxiv.org/abs/{id}`
  • HTML (when available): `https://arxiv.org/html/{id}`
  • For local PDF processing, see the `ocr-and-documents` skill

ID Versioning

  • `arxiv.org/abs/1706.03762` always resolves to the latest version
  • `arxiv.org/abs/1706.03762v1` points to a specific immutable version
  • When generating citations, preserve the version suffix you actually read to prevent citation drift (a later version may substantially change content)
  • The API `<id>` field returns the versioned URL (e.g., `http://arxiv.org/abs/1706.03762v7`)
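
To keep or drop the version suffix deliberately, the ID can be split into its base and version parts. This is a sketch under stated assumptions: the regular expression covers the new-style and old-style ID shapes mentioned in this document and has not been checked against every historical format, and `split_arxiv_id` is a hypothetical helper.

```python
import re

# New-style IDs look like 2402.03300, old-style like hep-th/0601001;
# either may carry a version suffix such as v7.
ID_PATTERN = re.compile(
    r"(?P<base>\d{4}\.\d{4,5}|[a-z\-]+(?:\.[A-Z]{2})?/\d{7})(?:v(?P<version>\d+))?$"
)


def split_arxiv_id(id_or_url: str):
    """Return (base_id, version); version is None when no suffix is present."""
    match = ID_PATTERN.search(id_or_url.strip())
    if match is None:
        raise ValueError(f"not an arXiv ID: {id_or_url!r}")
    version = match.group("version")
    return match.group("base"), int(version) if version else None


print(split_arxiv_id("http://arxiv.org/abs/1706.03762v7"))
print(split_arxiv_id("hep-th/0601001"))
```

When citing, the two parts can be rejoined as `f"{base}v{version}"` so the reference points at the version that was actually read.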

Withdrawn Papers

Papers can be withdrawn after submission. When this happens:
  • The `<summary>` field contains a withdrawal notice (look for "withdrawn" or "retracted")
  • Metadata fields may be incomplete
  • Always check the summary before treating a result as a valid paper
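
The summary check described above can be automated as a simple keyword filter. The sketch below is a heuristic, not a definitive test: a legitimate paper whose abstract discusses retractions would also be flagged. The sample feed is hand-written for illustration and is not real API output, and `looks_withdrawn` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

ATOM = {"a": "http://www.w3.org/2005/Atom"}
WITHDRAWAL_MARKERS = ("withdrawn", "retracted")


def looks_withdrawn(entry: ET.Element) -> bool:
    """Heuristic: flag an entry whose summary mentions a withdrawal marker."""
    summary = entry.findtext("a:summary", default="", namespaces=ATOM)
    return any(marker in summary.lower() for marker in WITHDRAWAL_MARKERS)


# Hand-written sample feed, not real API output
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>Paper A</title><summary>We study X.</summary></entry>
  <entry><title>Paper B</title><summary>This paper has been withdrawn by the author.</summary></entry>
</feed>"""

for entry in ET.fromstring(SAMPLE).findall("a:entry", ATOM):
    status = "check manually (withdrawn?)" if looks_withdrawn(entry) else "ok"
    print(entry.findtext("a:title", namespaces=ATOM), "->", status)
```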