ncbi-sequence-fetch
NCBI Sequence Fetch
Prerequisites
- `uv`: Read the `uv` skill and follow its Setup instructions to ensure `uv` is installed and on PATH.
- User Notification: If LICENSE_NOTIFICATION.txt does not already exist in this skill directory, then (1) prominently notify the user to check the terms at https://www.ncbi.nlm.nih.gov/ and https://www.ncbi.nlm.nih.gov/home/about/policies/, and (2) create the file, recording the notification text and timestamp.
- `.env` file: Make sure the `.env` file exists in your home directory. Create one if it does not exist.
- `NCBI_API_KEY` (optional): Raises the NCBI rate limit from 3 to 10 requests/second. The skill works without it, but a key is recommended if the user plans many queries or encounters a 429 error. The user can obtain one for free by registering at https://www.ncbi.nlm.nih.gov/account/settings/. If the variable is missing from `.env`, do NOT ask the user to paste it into the chat (this would leak the key into the agent's context). Instead, give the user this command, substituting `ENV_FILE` with the resolved literal path to the `.env` file:
  ```bash
  printf "Enter NCBI API key (typing hidden): " && read -s key && echo && echo "NCBI_API_KEY=$key" >> "ENV_FILE" && echo "Saved."
  ```
  The scripts load credentials automatically via `dotenv`. NEVER read, print, or inspect the `.env` file or its variables (e.g. no `cat`, `grep`, `echo`, `printenv`, or `os.environ.get` on keys). Credentials must stay out of the agent's context.
Core Rules
- Use the Wrapper: ALWAYS execute the provided helper scripts to query the database rather than accessing the database directly. The scripts enforce the required rate limit automatically.
- API Key Support: If the user provides an `NCBI_API_KEY` in their environment, the request rate limit is raised automatically (from 3 to 10 requests/second).
- Notification: Whenever this skill is used, say so in the output.
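The rate limiting that the wrapper enforces can be pictured with a small sketch. This is not the code in `scripts/ncbi_fetch.py`: `MinIntervalLimiter` and `limiter_for` are hypothetical names, and only the 3 and 10 requests/second figures come from the Prerequisites. The sketch takes a boolean rather than the key itself, so the credential never has to be read.

```python
import time


class MinIntervalLimiter:
    """Spaces calls so that at most `rate_per_second` happen each second."""

    def __init__(self, rate_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until the next request is allowed, then record its time."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


def limiter_for(has_api_key):
    """Hypothetical helper: 10 requests/second with a key, 3 without."""
    return MinIntervalLimiter(10 if has_api_key else 3)
```

Calling `wait()` before each request is enough to stay under the limit in a single-threaded script; a multi-threaded caller would also need a lock around it.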
Overview
Wraps NCBI's Entrez E-utilities (efetch, esearch, elink, esummary) for
retrieving protein and nucleotide sequences. Provides 10 subcommands covering
the full range of sequence retrieval workflows:
- `fetch-protein`: Direct protein accession lookup (GenPept, RefSeq)
- `fetch-nucleotide`: Direct nucleotide accession lookup
- `cds-translate`: Fetch CDS and translate to protein (3 methods)
- `search`: Free-text search of any NCBI database
- `elink`: Follow cross-database links (PubMed→Protein, etc.)
- `gene-protein`: Search protein by gene name + organism
- `locus-protein`: Search protein by locus tag + organism
- `pubmed-proteins`: Find proteins linked to a PubMed article
- `patent-search`: Extract protein sequences from patents
- `organism-length`: Last-resort search by organism + exact AA length
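For orientation, here is the kind of request that sits underneath a subcommand such as `fetch-protein`. The sketch only builds a URL and sends nothing; per the Core Rules, real queries should still go through the helper script. `build_efetch_url` is a hypothetical name. The endpoint and the `db`, `id`, `rettype`, and `retmode` parameters follow the public E-utilities documentation, but whether the script uses exactly these values is an assumption.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_efetch_url(db, accessions, rettype="fasta", retmode="text"):
    """Build (but do not send) an efetch URL for one or more accessions."""
    params = {
        "db": db,                     # e.g. "protein" or "nuccore"
        "id": ",".join(accessions),   # efetch accepts a comma-separated ID list
        "rettype": rettype,
        "retmode": retmode,
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


url = build_efetch_url("protein", ["XP_022033624", "NP_001234567"])
```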
Utility Scripts
`scripts/ncbi_fetch.py`
All subcommands write structured JSON output. Use `--output FILE` to save to a
file, or omit it to print to stdout. A human-readable summary is always printed
to stdout.
1. Fetch Protein by Accession
Fetches protein FASTA from NCBI by accession (XP_, NP_, GenPept, etc.).
```bash
uv run scripts/ncbi_fetch.py fetch-protein XP_022033624 -o /tmp/result.json
uv run scripts/ncbi_fetch.py fetch-protein NP_001234567 ABC12345.1
```
2. Fetch Nucleotide by Accession
Fetches nucleotide FASTA from NCBI by accession.
```bash
uv run scripts/ncbi_fetch.py fetch-nucleotide MK034466 -o /tmp/result.json
```
3. CDS Translate
Fetches a CDS/nucleotide accession and translates it to a protein sequence.
Tries three approaches in order:
1. NCBI's pre-translated CDS protein (`fasta_cds_aa`)
2. GenBank XML CDS annotation translations
3. Raw nucleotide → 6-frame ORF finding
```bash
uv run scripts/ncbi_fetch.py cds-translate MK034466 -o /tmp/result.json
uv run scripts/ncbi_fetch.py cds-translate HQ662330 --target-length 1043
```
If the accession is a genomic record (not mRNA/CDS), the tool will report
`is_genomic: true` so you can fall back to a homology-based approach instead.
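The third fallback, 6-frame ORF finding, can be illustrated with a minimal sketch. It assumes the standard genetic code and treats an ORF as the stretch from the first ATG to the next stop codon in each frame; the criteria the script actually applies are not documented here, and `six_frame_orfs` and `pick_orf` are hypothetical names.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_orfs(dna):
    """Yield every Met-to-stop protein found in the six reading frames."""
    dna = dna.upper()
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            protein = translate(strand[offset:])
            # Drop the last segment: it is not closed by a stop codon.
            for segment in protein.split("*")[:-1]:
                start = segment.find("M")
                if start != -1:
                    yield segment[start:]


def pick_orf(dna, target_length=None):
    """Return the longest ORF, or the one closest to target_length if given."""
    orfs = list(six_frame_orfs(dna))
    if not orfs:
        return None
    if target_length is None:
        return max(orfs, key=len)
    return min(orfs, key=lambda p: abs(len(p) - target_length))
```

Choosing by distance to `target_length` mirrors the role of the `--target-length` flag shown above, although how the script uses that flag internally is an assumption.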
4. Search Any Database
Free-text search using Entrez query syntax. Supports all NCBI databases.
```bash
# Search protein database
uv run scripts/ncbi_fetch.py search "WRR4B[Gene Name] AND Arabidopsis[Organism]" \
    --database protein --retmax 5 --fetch-sequences

# Search nucleotide database
uv run scripts/ncbi_fetch.py search "Rz2[Gene Name] AND Beta vulgaris[Organism]" \
    --database nuccore --retmax 10

# Search with patent filter
uv run scripts/ncbi_fetch.py search "disease resistance AND Solanum[Organism] AND patent[Properties]" \
    --database protein --fetch-sequences

# Search by sequence length
uv run scripts/ncbi_fetch.py search '"Oryza sativa"[Organism] AND 1043[SLEN]' \
    --database protein --fetch-sequences --retmax 50
```
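The queries in this section share one pattern: terms tagged with a field in square brackets, joined with `AND`. A small helper can assemble such strings before they are passed to `search`. `field_term` and `entrez_query` are hypothetical names, and quoting multi-word values is a choice made for this sketch (the examples in this section show both quoted and unquoted organism names).

```python
def field_term(value, field):
    """Return value[Field], quoting values that contain spaces."""
    if " " in value and not value.startswith('"'):
        value = f'"{value}"'
    return f"{value}[{field}]"


def entrez_query(gene=None, organism=None, length=None, patents_only=False):
    """Join the supplied criteria with AND, using the field tags shown above."""
    terms = []
    if gene:
        terms.append(field_term(gene, "Gene Name"))
    if organism:
        terms.append(field_term(organism, "Organism"))
    if length is not None:
        terms.append(f"{int(length)}[SLEN]")
    if patents_only:
        terms.append("patent[Properties]")
    if not terms:
        raise ValueError("at least one search criterion is required")
    return " AND ".join(terms)


query = entrez_query(organism="Oryza sativa", length=1043)
```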
5. Cross-Database Links (elink)
Follows NCBI's cross-database links (e.g., PubMed article → linked proteins).
```bash
uv run scripts/ncbi_fetch.py elink 24896089 --dbfrom pubmed --db protein \
    --fetch-sequences -o /tmp/linked.json
```
6. Gene + Organism Search
Searches for protein sequences by gene name and organism. Searches NCBI Protein
with `[Gene Name]` and `[Organism]` qualifiers.
```bash
uv run scripts/ncbi_fetch.py gene-protein WRR4B --organism "Arabidopsis thaliana"
uv run scripts/ncbi_fetch.py gene-protein Pikh-2 --organism "Oryza sativa" \
    --target-length 1043 -o /tmp/result.json
```
7. Locus Tag Search
Searches by locus tag in both NCBI Protein and Nuccore databases. Extracts CDS
translations from GenBank XML when direct protein hits aren't available.
```bash
uv run scripts/ncbi_fetch.py locus-protein At1g56540 --organism "Arabidopsis thaliana"
uv run scripts/ncbi_fetch.py locus-protein Niben101Scf02422g02015.1 \
    --organism "Nicotiana benthamiana" -o /tmp/result.json
```
8. PubMed-Linked Proteins
Finds protein sequences linked to a PubMed article. Searches NCBI Protein by
PMID, follows elink PubMed→Protein, and extracts CDS translations from linked
Nuccore records.
```bash
uv run scripts/ncbi_fetch.py pubmed-proteins 30692254 --identifier WRR4B
uv run scripts/ncbi_fetch.py pubmed-proteins 24896089 --identifier "K2" \
    -o /tmp/result.json
```
9. Patent Sequence Search
Two modes:
By patent number: fetches all protein sequences from a specific patent:
```bash
uv run scripts/ncbi_fetch.py patent-search --patent-number US10123456 -o /tmp/patent.json
```
By keywords: searches NCBI Protein with the `patent[Properties]` filter:
```bash
uv run scripts/ncbi_fetch.py patent-search --keywords WRR4B Albugo --organism "Arabidopsis thaliana" -o /tmp/patent.json
```
> [!IMPORTANT]
> Patent convention: In molecular biology patents, SEQ ID NO: 1 is typically the DNA sequence and SEQ ID NO: 2 is the primary protein. Higher SEQ ID NOs are variants or related sequences. Prefer Sequence 2 when selecting the primary protein of interest.
10. Organism + Length Search
Last-resort search when only organism and expected protein length are known.
Uses NCBI's `[SLEN]` filter for exact length matching.
```bash
uv run scripts/ncbi_fetch.py organism-length \
    --organism "Arabidopsis thaliana" --length 1048 --retmax 50 \
    -o /tmp/result.json
```
> [!NOTE]
> This often returns multiple candidates. Use the JSON output headers to identify the correct protein.
Workflow
Standard Sequence Retrieval Cascade
When trying to find a protein sequence, follow this priority order:
- Direct accession: `fetch-protein` with GenPept/RefSeq accession
- CDS translation: `cds-translate` with nucleotide/CDS accession
- PubMed-linked: `pubmed-proteins` with PMID + gene name
- Locus lookup: `locus-protein` with locus tag + organism
- Gene + organism: `gene-protein` with gene name + organism
- Patent search: `patent-search` with patent number or keywords
- Organism + length: `organism-length` as last resort
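The cascade above can be sketched as a lookup that, given whatever identifiers are known, returns the applicable subcommands in priority order. The dictionary keys (`protein_accession`, `pmid`, and so on) and the function name are hypothetical; only the subcommand names and their ordering come from the list above.

```python
# Each entry: (subcommand, alternative sets of inputs that make it usable).
CASCADE = [
    ("fetch-protein", [{"protein_accession"}]),
    ("cds-translate", [{"nucleotide_accession"}]),
    ("pubmed-proteins", [{"pmid", "gene_name"}]),
    ("locus-protein", [{"locus_tag", "organism"}]),
    ("gene-protein", [{"gene_name", "organism"}]),
    ("patent-search", [{"patent_number"}, {"keywords"}]),
    ("organism-length", [{"organism", "length"}]),
]


def applicable_subcommands(known):
    """Return subcommands, in priority order, whose inputs are all available."""
    have = {name for name, value in known.items() if value}
    return [
        command
        for command, options in CASCADE
        if any(required <= have for required in options)
    ]


plan = applicable_subcommands(
    {"gene_name": "WRR4B", "organism": "Arabidopsis thaliana", "pmid": "30692254"}
)
```

With a gene name, organism, and PMID, `plan` is `["pubmed-proteins", "gene-protein"]`: try the first, and move to the next only if it returns nothing useful.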
Interpreting Results
- All subcommands return JSON with a `results` array
- Each result has `sequence` (AA string), `length`, and `header`/metadata
- When multiple results are returned, select by:
  - Closest match to expected length (`target_length`)
  - Header relevance (matching gene name, "disease resistance" keywords)
  - Source priority (RefSeq > GenPept > patent)
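The selection criteria above can be applied as a single sort key. This is a sketch under assumptions: `results`, `length`, and `header` are the fields named above, but the `source` key used for the RefSeq > GenPept > patent ranking is hypothetical, since the exact metadata layout is not specified here. `load_results` and `pick_result` are also hypothetical names.

```python
import json

# Lower rank wins; unknown sources sort last. The "source" key is an assumption.
SOURCE_RANK = {"refseq": 0, "genpept": 1, "patent": 2}


def load_results(path):
    """Read the `results` array from a file written with --output."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)["results"]


def pick_result(results, target_length=None, keywords=()):
    """Choose by length gap first, then header keywords, then source rank."""

    def score(result):
        if target_length is None:
            length_gap = 0
        else:
            length_gap = abs(result["length"] - target_length)
        header = result.get("header", "").lower()
        keyword_misses = sum(1 for word in keywords if word.lower() not in header)
        source = SOURCE_RANK.get(result.get("source", ""), len(SOURCE_RANK))
        return (length_gap, keyword_misses, source)

    return min(results, key=score) if results else None
```

Ranking length ahead of header relevance follows the order of the list above; if the expected length is only approximate, swapping the first two tuple entries may give better picks.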
Reference
- NCBI E-utilities docs: https://www.ncbi.nlm.nih.gov/books/NBK25499/
- Entrez search syntax: https://www.ncbi.nlm.nih.gov/books/NBK49540/
- Database list: protein, nuccore, gene, pubmed, pmc, biosample, etc.
- Common accession formats:
  - `XP_` / `NP_`: NCBI RefSeq protein
  - `AAA` to `AZZ` + digits: GenPept (translated GenBank)
  - `MK`, `MN`, `HQ`, etc. + digits: GenBank nucleotide
  - `ENSG`, `ENST`, `ENSP`: Ensembl (use `ensembl-database` skill instead)
  - `Q`, `P`, `O` + digits: UniProt (use `uniprot-database` skill instead)
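As a rough aid for routing an identifier to the right subcommand or skill, the formats listed above can be turned into a heuristic classifier. The regular expressions below are simplifications written for this sketch, not an official grammar: they follow the prefixes listed above, real accessions are more varied, and `classify_accession` is a hypothetical name.

```python
import re

# Order matters: more specific prefixes are checked first.
ACCESSION_PATTERNS = [
    ("refseq_protein", re.compile(r"^[XN]P_\d+(\.\d+)?$")),
    ("ensembl", re.compile(r"^ENS[GTP]\d+(\.\d+)?$")),
    ("genpept", re.compile(r"^A[A-Z]{2}\d+(\.\d+)?$")),
    ("genbank_nucleotide", re.compile(r"^[A-Z]{2}\d+(\.\d+)?$")),
    ("uniprot", re.compile(r"^[OPQ]\d[A-Z0-9]{3}\d$")),
]


def classify_accession(accession):
    """Return a coarse label for an accession, or 'unknown' if nothing matches."""
    candidate = accession.strip().upper()
    for label, pattern in ACCESSION_PATTERNS:
        if pattern.match(candidate):
            return label
    return "unknown"
```

A label of `ensembl` or `uniprot` is a cue to switch to the corresponding skill rather than to query NCBI.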