# XCrawl

## Overview

This skill is the default XCrawl entry point when the user asks for XCrawl directly without naming a specific API or sub-skill. It currently targets single-page extraction through the XCrawl Scrape API. The default behavior is raw passthrough: return upstream API response bodies as-is.

## Routing Guidance

- If the user wants to extract one or more specific URLs, use this skill and default to XCrawl Scrape.
- If the user wants site URL discovery, prefer the XCrawl Map APIs.
- If the user wants multi-page or site-wide crawling, prefer the XCrawl Crawl APIs.
- If the user wants keyword-based discovery, prefer the XCrawl Search APIs.

## Required Local Config

Before using this skill, the user must create a local config file and write `XCRAWL_API_KEY` into it.

Path: `~/.xcrawl/config.json`

```json
{
  "XCRAWL_API_KEY": "<your_api_key>"
}
```

Read the API key from the local config file only. Do not require global environment variables.
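A one-time setup sketch for creating this file (the key value is a placeholder to fill in; the `chmod 600` step is a general precaution for credential files, not an XCrawl requirement):

```shell
# Create the config directory and write the key file this skill reads.
# Replace <your_api_key> with the key from the XCrawl dashboard.
mkdir -p "$HOME/.xcrawl"
cat > "$HOME/.xcrawl/config.json" <<'EOF'
{
  "XCRAWL_API_KEY": "<your_api_key>"
}
EOF
# Restrict the file to the current user, since it holds a credential.
chmod 600 "$HOME/.xcrawl/config.json"
```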

## Credits and Account Setup

Using XCrawl APIs consumes credits. If the user does not have an account or available credits, guide them to register at https://dash.xcrawl.com/. After registration, they can activate the free 1000-credit plan before running requests.

## Tool Permission Policy

Request runtime permissions for `curl` and `node` only. Do not request Python, shell helper scripts, or other runtime permissions.

## API Surface

- Start scrape: `POST /v1/scrape`
- Read async result: `GET /v1/scrape/{scrape_id}`
- Base URL: `https://run.xcrawl.com`
- Required header: `Authorization: Bearer <XCRAWL_API_KEY>`

## Usage Examples

### cURL (sync)

```bash
API_KEY="$(node -e "const fs=require('fs');const p=process.env.HOME+'/.xcrawl/config.json';const k=JSON.parse(fs.readFileSync(p,'utf8')).XCRAWL_API_KEY||'';process.stdout.write(k)")"

curl -sS -X POST "https://run.xcrawl.com/v1/scrape" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${API_KEY}" \
  -d '{"url":"https://example.com","mode":"sync","output":{"formats":["markdown","links"]}}'
```

### cURL (async create + result)

```bash
API_KEY="$(node -e "const fs=require('fs');const p=process.env.HOME+'/.xcrawl/config.json';const k=JSON.parse(fs.readFileSync(p,'utf8')).XCRAWL_API_KEY||'';process.stdout.write(k)")"

CREATE_RESP="$(curl -sS -X POST "https://run.xcrawl.com/v1/scrape" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${API_KEY}" \
  -d '{"url":"https://example.com/product/1","mode":"async","output":{"formats":["json"]},"json":{"prompt":"Extract title and price."}}')"

echo "$CREATE_RESP"

SCRAPE_ID="$(node -e 'const s=process.argv[1];const j=JSON.parse(s);process.stdout.write(j.scrape_id||"")' "$CREATE_RESP")"

curl -sS -X GET "https://run.xcrawl.com/v1/scrape/${SCRAPE_ID}" \
  -H "Authorization: Bearer ${API_KEY}"
```

### Node

```bash
node -e '
const fs=require("fs");
const apiKey=JSON.parse(fs.readFileSync(process.env.HOME+"/.xcrawl/config.json","utf8")).XCRAWL_API_KEY;
const body={url:"https://example.com",mode:"sync",output:{formats:["markdown","json"]},json:{prompt:"Extract title and publish date."}};
fetch("https://run.xcrawl.com/v1/scrape",{
  method:"POST",
  headers:{"Content-Type":"application/json",Authorization:`Bearer ${apiKey}`},
  body:JSON.stringify(body)
}).then(async r=>{console.log(await r.text());});
'
```

## Request Parameters

### Request endpoint and headers

- Endpoint: `POST https://run.xcrawl.com/v1/scrape`
- Headers:
  - `Content-Type: application/json`
  - `Authorization: Bearer <api_key>`

### Request body: top-level fields

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | Yes | - | Target URL |
| `mode` | string | No | `sync` | `sync` or `async` |
| `proxy` | object | No | - | Proxy config |
| `request` | object | No | - | Request config |
| `js_render` | object | No | - | JS rendering config |
| `output` | object | No | - | Output config |
| `webhook` | object | No | - | Async webhook config (`mode=async` only) |
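Putting the top-level fields together, a request body might look like the following sketch; the URL and option values are illustrative, and every field other than `url` is optional:

```json
{
  "url": "https://example.com",
  "mode": "sync",
  "proxy": { "location": "US" },
  "request": { "device": "desktop", "only_main_content": true },
  "js_render": { "enabled": true, "wait_until": "load" },
  "output": { "formats": ["markdown", "links"] }
}
```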

### proxy

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `location` | string | No | `US` | ISO-3166-1 alpha-2 country code, e.g. `US` / `JP` / `SG` |
| `sticky_session` | string | No | Auto-generated | Sticky session ID; the same ID attempts to reuse the same exit node |

### request

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `locale` | string | No | `en-US,en;q=0.9` | Affects the `Accept-Language` header |
| `device` | string | No | `desktop` | `desktop` / `mobile`; affects User-Agent and viewport |
| `cookies` | object map | No | - | Cookie key/value pairs |
| `headers` | object map | No | - | Header key/value pairs |
| `only_main_content` | boolean | No | `true` | Return main content only |
| `block_ads` | boolean | No | `true` | Attempt to block ad resources |
| `skip_tls_verification` | boolean | No | `true` | Skip TLS verification |
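A `request` block overriding a few of these defaults might look like this; the cookie, header, and locale values are purely illustrative:

```json
{
  "request": {
    "locale": "en-GB,en;q=0.8",
    "device": "mobile",
    "cookies": { "session": "<cookie_value>" },
    "headers": { "X-Requested-With": "XMLHttpRequest" },
    "block_ads": false
  }
}
```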

### js_render

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `enabled` | boolean | No | `true` | Enable browser rendering |
| `wait_until` | string | No | `load` | `load` / `domcontentloaded` / `networkidle` |
| `viewport.width` | integer | No | - | Viewport width (desktop default `1920`, mobile `402`) |
| `viewport.height` | integer | No | - | Viewport height (desktop default `1080`, mobile `874`) |

### output

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `formats` | string[] | No | `["markdown"]` | Output formats |
| `screenshot` | string | No | `viewport` | `full_page` / `viewport` (only if `formats` includes `screenshot`) |
| `json.prompt` | string | No | - | Extraction prompt |
| `json.json_schema` | object | No | - | JSON Schema |

`output.formats` enum:

- `html`
- `raw_html`
- `markdown`
- `links`
- `summary`
- `screenshot`
- `json`
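In the usage examples above, `json` appears as a top-level sibling of `output` in the request body; following that shape, a schema-driven extraction request fragment might look like this (the schema contents are illustrative, not a documented default):

```json
{
  "output": { "formats": ["json"] },
  "json": {
    "prompt": "Extract title and price.",
    "json_schema": {
      "type": "object",
      "properties": {
        "title": { "type": "string" },
        "price": { "type": "number" }
      },
      "required": ["title", "price"]
    }
  }
}
```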

### webhook

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | No | - | Callback URL |
| `headers` | object map | No | - | Custom callback headers |
| `events` | string[] | No | `["started","completed","failed"]` | Events: `started` / `completed` / `failed` |
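An async request carrying a webhook might look like the following sketch; the callback URL and header token are placeholders for the user's own endpoint:

```json
{
  "url": "https://example.com",
  "mode": "async",
  "webhook": {
    "url": "<your_callback_url>",
    "headers": { "X-Callback-Token": "<your_token>" },
    "events": ["completed", "failed"]
  }
}
```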

## Response Parameters

### Sync create response (`mode=sync`)
| Field | Type | Description |
|---|---|---|
| `scrape_id` | string | Task ID |
| `endpoint` | string | Always `scrape` |
| `version` | string | Version |
| `status` | string | `completed` / `failed` |
| `url` | string | Target URL |
| `data` | object | Result data |
| `started_at` | string | Start time (ISO 8601) |
| `ended_at` | string | End time (ISO 8601) |
| `total_credits_used` | integer | Total credits used |

`data` fields (based on `output.formats`):

- `html`, `raw_html`, `markdown`, `links`, `summary`, `screenshot`, `json`
- `metadata` (page metadata)
- `traffic_bytes`
- `credits_used`
- `credits_detail`

`credits_detail` fields:

| Field | Type | Description |
|---|---|---|
| `base_cost` | integer | Base scrape cost |
| `traffic_cost` | integer | Traffic cost |
| `json_extract_cost` | integer | JSON extraction cost |

### Async create response (`mode=async`)
| Field | Type | Description |
|---|---|---|
| `scrape_id` | string | Task ID |
| `endpoint` | string | Always `scrape` |
| `version` | string | Version |
| `status` | string | Always `pending` |

### Async result response (`GET /v1/scrape/{scrape_id}`)
| Field | Type | Description |
|---|---|---|
| `scrape_id` | string | Task ID |
| `endpoint` | string | Always `scrape` |
| `version` | string | Version |
| `status` | string | `pending` / `crawling` / `completed` / `failed` |
| `url` | string | Target URL |
| `data` | object | Same shape as the sync `data` |
| `started_at` | string | Start time (ISO 8601) |
| `ended_at` | string | End time (ISO 8601) |
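The poll-until-terminal loop implied by these status values can be sketched as a small helper. `pollScrape` and its injected `fetchJson` parameter are hypothetical names, not part of the XCrawl API; a real `fetchJson` would wrap `fetch` with the `Authorization: Bearer <XCRAWL_API_KEY>` header as in the Node example:

```javascript
// Hypothetical helper: poll GET /v1/scrape/{scrape_id} until a terminal status.
// fetchJson is injected so the HTTP layer (and auth header) can be supplied
// by the caller, or stubbed out in tests.
async function pollScrape(scrapeId, fetchJson, { intervalMs = 2000, maxAttempts = 30 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await fetchJson(`https://run.xcrawl.com/v1/scrape/${scrapeId}`);
    // completed and failed are the terminal states in the table above.
    if (result.status === "completed" || result.status === "failed") {
      return result;
    }
    // pending / crawling: wait before the next poll.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`scrape ${scrapeId} not terminal after ${maxAttempts} polls`);
}
```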

## Workflow

1. Classify the request through the default XCrawl entry behavior.
   - If the user provides specific URLs for extraction, default to XCrawl Scrape.
   - If the user clearly asks for map, crawl, or search behavior, route to the dedicated XCrawl API instead of pretending this endpoint covers it.
2. Restate the user goal as an extraction contract.
   - URL scope, required fields, accepted nulls, and precision expectations.
3. Build the scrape request body.
   - Keep only necessary options.
   - Prefer explicit `output.formats`.
4. Execute the scrape and capture task metadata.
   - Track `scrape_id`, `status`, and timestamps.
   - If async, poll until `completed` or `failed`.
5. Return raw API responses directly.
   - Do not synthesize or compress fields by default.

## Output Contract

Return:

- Endpoint(s) used and mode (`sync` or `async`)
- The `request_payload` used for the request
- The raw response body from each API call
- Error details when a request fails

Do not generate summaries unless the user explicitly requests a summary.

## Guardrails

- Do not present XCrawl Scrape as if it also covers map, crawl, or search semantics.
- Default to scrape only when user intent is URL extraction.
- Do not invent unsupported output fields.
- Do not hardcode provider-specific tool schemas in core logic.
- Call out uncertainty when page structure is unstable.