literature-search-arxiv

arXiv Search and Retrieval

Prerequisites

  1. uv: Read the uv skill and follow its Setup instructions to ensure uv is installed and on PATH.
  2. User Notification: If LICENSE_NOTIFICATION.txt does not already exist in this skill directory then (1) prominently notify the user to check the terms at https://info.arxiv.org/help/api/index.html and to always check the license of the papers retrieved by the skill for any restrictions, then (2) create the file recording the notification text and timestamp.
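
The check-then-record step in item 2 could be sketched as follows. The function name `ensure_license_notice`, the notice wording, and the file layout (notice text followed by a UTC timestamp line) are assumptions for illustration; the skill itself only requires that the file record the notification text and a timestamp.

```python
from datetime import datetime, timezone
from pathlib import Path

# Wording is illustrative; only the URL comes from the skill text.
NOTICE = (
    "Please check the arXiv API terms at "
    "https://info.arxiv.org/help/api/index.html and always check the "
    "license of each retrieved paper for restrictions."
)


def ensure_license_notice(skill_dir: str) -> bool:
    """Show the notice once; return True if it was shown on this call."""
    marker = Path(skill_dir) / "LICENSE_NOTIFICATION.txt"
    if marker.exists():
        return False  # the user was already notified
    print("=" * 72)
    print(NOTICE)
    print("=" * 72)
    timestamp = datetime.now(timezone.utc).isoformat()
    marker.write_text(f"{NOTICE}\nNotified at: {timestamp}\n", encoding="utf-8")
    return True
```

Calling the function a second time with the same directory prints nothing and returns `False`, because the marker file already exists.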

Core Rules

  • Terms of Use: You MUST respect arXiv's Terms of Use.
    • Maximum 1 request every 3 seconds.
    • The provided utility scripts handle rate limiting automatically. Always use these scripts rather than writing your own curl/python requests.
  • If this skill is used, ensure this is mentioned in the output AND list the URLs of all papers that were used in producing the output.
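
One way to satisfy the attribution rule is to append a short source list to the output. The sketch below is an assumption about presentation, not a format the skill prescribes: `format_attribution` is a hypothetical helper, it uses the `title` and `pdf_url` fields from the search results as the link for each paper, and the record in `example` is a made-up sample.

```python
def format_attribution(papers: list[dict]) -> str:
    """Build a closing note naming the skill and listing each paper's URL."""
    lines = ["Sources (retrieved with the literature-search-arxiv skill):"]
    for paper in papers:
        # Collapse line breaks and repeated spaces inside titles.
        title = " ".join(paper.get("title", "Untitled").split())
        url = paper.get("pdf_url", "URL unavailable")
        lines.append(f"- {title}: {url}")
    return "\n".join(lines)


# Made-up sample record, shaped like one entry of the search output.
example = [{"title": "Attention Is All\n You Need",
            "pdf_url": "https://arxiv.org/pdf/1706.03762v5"}]
print(format_attribution(example))
```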

Utility Scripts

1. Search and Extract Metadata
Search arXiv and return a clean JSON array of matching papers.
bash
uv run scripts/search_arxiv.py --query "au:einstein AND ti:relativity" \
  --max_results 5 2>/dev/null > /tmp/arxiv_search_results.json
Important: The tool outputs a large JSON result to stdout. Requesting 100+ results will produce a massive JSON that might exceed your context length. Limit --max_results (e.g., 5-10) or paginate carefully using --start. Always redirect output to a file and parse it separately, otherwise terminal output will be truncated.
Returned Metadata: JSON results include id, title, summary, published, authors, pdf_url, primary_category, doi, journal_ref, and comment. Note: the doi field is populated only when the paper has an external DOI; if only an arXiv-issued DOI exists, that DOI is not returned.
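
Because the results are redirected to a file, a small loader can trim them before they are read. This is a minimal sketch: `load_results` is a hypothetical helper, and it assumes the file holds a JSON array of objects with the field names listed above; the exact value types are not specified by the skill.

```python
import json
from pathlib import Path


def load_results(path: str, summary_chars: int = 200) -> list[dict]:
    """Load the saved search output and shorten each summary for review."""
    papers = json.loads(Path(path).read_text(encoding="utf-8"))
    compact = []
    for paper in papers:
        # Collapse whitespace so a truncated summary stays on one line.
        summary = " ".join((paper.get("summary") or "").split())
        compact.append({
            "id": paper.get("id"),
            "title": paper.get("title"),
            "published": paper.get("published"),
            # doi may be missing or empty when only an arXiv-issued DOI exists
            "doi": paper.get("doi") or None,
            "summary": summary[:summary_chars],
        })
    return compact
```

For the search command above, `load_results("/tmp/arxiv_search_results.json")` would return one compact record per paper, which keeps the amount of text read into context small.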
Options:
  • --query: Search string. See references/query_syntax.md for advanced syntax.
  • --id_list: Comma-separated list of arXiv IDs to fetch directly (e.g., 1706.03762v5).
  • --start: Pagination offset (default 0).
  • --max_results: Number of results to return (default 10).
  • --sort_by: relevance, lastUpdatedDate, or submittedDate. (Use --sort_by submittedDate --sort_order descending for the most recent papers.)
  • --sort_order: ascending or descending.
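
Paging with --start and --max_results can be planned ahead of time. The sketch below only builds the command lines and does not contact arXiv, so the provided script remains responsible for rate limiting. `page_commands` and the per-page output file names are assumptions for illustration.

```python
import shlex


def page_commands(query: str, total: int, page_size: int = 10) -> list[str]:
    """Return one shell command per page, each writing to its own file."""
    commands = []
    for start in range(0, total, page_size):
        count = min(page_size, total - start)  # last page may be shorter
        out_file = f"/tmp/arxiv_search_results_{start}.json"  # assumed naming
        commands.append(
            "uv run scripts/search_arxiv.py"
            f" --query {shlex.quote(query)}"
            f" --start {start} --max_results {count}"
            f" 2>/dev/null > {out_file}"
        )
    return commands


for command in page_commands("au:einstein AND ti:relativity", total=25):
    print(command)
```

Running the commands one at a time and parsing each output file separately keeps every page small enough to review.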
2. Download Paper (PDF or HTML)
Download the full text of a paper to your local workspace for reading.
bash
uv run scripts/download_paper.py --id 1706.03762 --format pdf --output attention.pdf
Options:
  • --id: The arXiv ID (e.g., 1706.03762 or 1706.03762v5).
  • --format: pdf or html. Note: HTML is only available for newer papers.
  • --output: Filepath to save the downloaded document.
Important: When downloading papers, choose a location where you will not overwrite other files or clutter the existing directory structure.
3. Download Paper Source (tar.gz)
Download the LaTeX source files of a paper to your local workspace. Note that not all papers have source available.
bash
uv run scripts/download_paper_source.py --id 2010.11645 --output source.tar.gz
Options:
  • --id: The arXiv ID (e.g., 2010.11645).
  • --output: Filepath to save the downloaded tar.gz file.
Caution: Take care when extracting the downloaded archive, both for security and to avoid cluttering your filesystem. Archives may contain many files or unexpected directory structures.
Safe Extraction Requirements: NEVER extract directly into your working directory! Always extract into a dedicated new directory:
bash
mkdir paper_source && tar -xzf source.tar.gz -C paper_source
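
Where stricter checks are wanted than plain tar offers, the same extraction could be sketched in Python. This is not part of the skill's scripts: `safe_extract` and the `max_members` limit are hypothetical, and the checks shown (fresh directory, no paths outside the destination, regular files and directories only) are one reasonable reading of the safety requirements above.

```python
import tarfile
from pathlib import Path


def safe_extract(archive: str, dest: str, max_members: int = 2000) -> list[str]:
    """Extract archive into a new directory, refusing suspicious members."""
    dest_path = Path(dest)
    dest_path.mkdir(parents=True, exist_ok=False)  # must be a fresh directory
    root = dest_path.resolve()
    with tarfile.open(archive, "r:*") as tar:
        members = tar.getmembers()
        if len(members) > max_members:
            raise ValueError(f"archive has {len(members)} members")
        for member in members:
            target = (root / member.name).resolve()
            # Reject absolute paths and ".." components that leave dest.
            if target != root and root not in target.parents:
                raise ValueError(f"path escapes destination: {member.name}")
            # Reject symlinks, hard links, devices, and other special entries.
            if not (member.isfile() or member.isdir()):
                raise ValueError(f"unsupported member type: {member.name}")
        tar.extractall(root, members=members)
    return [m.name for m in members]
```

A call such as `safe_extract("source.tar.gz", "paper_source")` raises `FileExistsError` if `paper_source` already exists, which guards against extracting into a directory that is already in use. On Python versions that support it, the `filter="data"` argument of `extractall` is a built-in alternative to the manual checks.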

Reference

  • Advanced Query Syntax: See references/query_syntax.md for prefixes (au, ti, abs), booleans, and date filtering.

Workflow

  1. Search for papers using search_arxiv.py. Review the JSON summaries.
  2. If full text is needed, use download_paper.py to fetch the PDF or HTML.
  3. If downloading a PDF, verify the PDF is not empty or corrupted.
  4. Read the downloaded file using standard file reading tools.
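
The verification in step 3 can be approximated with a cheap structural check. The sketch below assumes a hypothetical helper, `looks_like_pdf`, that tests for a non-empty file, a `%PDF-` header near the start, and an `%%EOF` marker near the end. It catches empty downloads and error pages saved under a .pdf name, but it does not prove the file renders correctly.

```python
from pathlib import Path


def looks_like_pdf(path: str) -> bool:
    """Cheap sanity check: non-empty, PDF header present, EOF marker near end."""
    file = Path(path)
    if not file.is_file() or file.stat().st_size == 0:
        return False
    data = file.read_bytes()
    has_header = b"%PDF-" in data[:1024]   # header should appear near the start
    has_trailer = b"%%EOF" in data[-2048:]  # trailer should appear near the end
    return has_header and has_trailer
```

If `looks_like_pdf("attention.pdf")` returns `False`, downloading the paper again with download_paper.py is a sensible next step before attempting to read it.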