apify-actor-development
Apify Actor Development
Important: Before you begin, fill in the `generatedBy` property in the meta section of `.actor/actor.json`. Replace it with the tool and model you're currently using, such as "Claude Code with Claude Sonnet 4.5". This helps Apify monitor and improve AGENTS.md for specific AI tools and models.
What are Apify Actors?
Actors are serverless programs inspired by the UNIX philosophy - programs that do one thing well and can be easily combined to build complex systems. They're packaged as Docker images and run in isolated containers in the cloud.
Core Concepts:
- Accept well-defined JSON input
- Perform isolated tasks (web scraping, automation, data processing)
- Produce structured JSON output to datasets and/or store data in key-value stores
- Can run from seconds to hours or even indefinitely
- Persist state and can be restarted
Prerequisites & Setup (MANDATORY)
Before creating or modifying Actors, verify that the `apify` CLI is installed by running `apify --help`.
If it is not installed, you can run:
```bash
curl -fsSL https://apify.com/install-cli.sh | bash
```
Or (Mac): `brew install apify-cli`
Or (Windows): `irm https://apify.com/install-cli.ps1 | iex`
Or: `npm install -g apify-cli`
When the apify CLI is installed, check that it is logged in with:
```bash
apify info # Should return your username
```
If it is not logged in, check if the APIFY_TOKEN environment variable is defined (if not, ask the user to generate one on https://console.apify.com/settings/integrations and then define APIFY_TOKEN with it).
Then run:
```bash
apify login -t $APIFY_TOKEN
```
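The setup checks above can be sketched as a small decision helper. This is a minimal Python sketch under stated assumptions, not part of the Apify CLI: the names `next_setup_step` and `detect_cli_and_token` are hypothetical, and the login state is passed in as a flag (in practice it would come from running `apify info`).

```python
import os
import shutil
from typing import Optional, Tuple


def detect_cli_and_token() -> Tuple[Optional[str], Optional[str]]:
    """Return the path of the `apify` executable (if any) and the APIFY_TOKEN value (if any)."""
    return shutil.which("apify"), os.environ.get("APIFY_TOKEN")


def next_setup_step(cli_path: Optional[str], logged_in: bool, token: Optional[str]) -> str:
    """Map the current state to the next action in the setup checklist (hypothetical helper)."""
    if cli_path is None:
        # CLI missing: any of the documented install commands applies.
        return "install the CLI, e.g. npm install -g apify-cli"
    if logged_in:
        return "ready"
    if not token:
        # No token available: the user has to generate one in the Apify Console.
        return "ask the user to generate a token and define APIFY_TOKEN"
    return "apify login -t $APIFY_TOKEN"
```

For example, `next_setup_step(*detect_cli_and_token()[:1], False, None)` is not needed in practice; the intended call is `next_setup_step(cli_path, logged_in, token)` with the two detected values and the result of the login check.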
Template Selection
IMPORTANT: Before starting actor development, always ask the user which programming language they prefer:
- JavaScript - Use `apify create <actor-name> -t project_empty`
- TypeScript - Use `apify create <actor-name> -t ts_empty`
- Python - Use `apify create <actor-name> -t python-empty`
Use the appropriate CLI command based on the user's language choice. Additional packages (Crawlee, Playwright, etc.) can be installed later as needed.
Quick Start Workflow
- Create actor project - Run the appropriate `apify create` command based on the user's language preference (see Template Selection above)
- Install dependencies
  - JavaScript/TypeScript: `npm install`
  - Python: `pip install -r requirements.txt`
- Implement logic - Write the actor code in `src/main.py`, `src/main.js`, or `src/main.ts`
- Configure schemas - Update input/output schemas in `.actor/input_schema.json`, `.actor/output_schema.json`, `.actor/dataset_schema.json`
- Configure platform settings - Update `.actor/actor.json` with actor metadata (see references/actor-json.md)
- Write documentation - Create comprehensive README.md for the marketplace
- Test locally - Run `apify run` to verify functionality (see Local Testing section below)
- Deploy - Run `apify push` to deploy the actor on the Apify platform (actor name is defined in `.actor/actor.json`)
Best Practices
✓ Do:
- Use `apify run` to test actors locally (configures Apify environment and storage)
- Use Apify SDK (`apify`) for code running ON Apify platform
- Validate input early with proper error handling and fail gracefully
- Use CheerioCrawler for static HTML (10x faster than browsers)
- Use PlaywrightCrawler only for JavaScript-heavy sites
- Use router pattern (createCheerioRouter/createPlaywrightRouter) for complex crawls
- Implement retry strategies with exponential backoff
- Use proper concurrency: HTTP (10-50), Browser (1-5)
- Set sensible defaults in `.actor/input_schema.json`
- Define output schema in `.actor/output_schema.json`
- Clean and validate data before pushing to dataset
- Use semantic CSS selectors with fallback strategies
- Respect robots.txt, ToS, and implement rate limiting
- Always use `apify/log` package - censors sensitive data (API keys, tokens, credentials)
- Implement readiness probe handler (required if your Actor uses standby mode)
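The "retry strategies with exponential backoff" item in the list above can be illustrated with a generic helper. This is a minimal Python sketch using only the standard library; it is not the Crawlee or Apify SDK retry mechanism, and the names `backoff_delays` and `retry_with_backoff` as well as the default values are assumptions chosen for illustration.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(max_attempts: int, base_delay: float = 1.0, max_delay: float = 30.0) -> List[float]:
    """Delays between attempts: base, 2*base, 4*base, ... each capped at max_delay."""
    return [min(base_delay * (2 ** i), max_delay) for i in range(max_attempts - 1)]


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or `max_attempts` calls have failed."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = backoff_delays(max_attempts, base_delay, max_delay)
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(delays[attempt])  # wait longer after each consecutive failure
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the waiting behaviour can be tested without real delays. A production version would usually add random jitter and retry only on errors known to be transient.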
✗ Don't:
- Use `npm start`, `npm run start`, `npx apify run`, or similar commands to run actors (use `apify run` instead)
- Rely on `Dataset.getInfo()` for final counts on Cloud
- Use browser crawlers when HTTP/Cheerio works
- Hard code values that should be in input schema or environment variables
- Skip input validation or error handling
- Overload servers - use appropriate concurrency and delays
- Scrape prohibited content or ignore Terms of Service
- Store personal/sensitive data unless explicitly permitted
- Use deprecated options like `requestHandlerTimeoutMillis` on CheerioCrawler (v3.x)
- Use `additionalHttpHeaders` - use `preNavigationHooks` instead
- Disable standby mode without explicit permission
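As an illustration of validating input early instead of skipping it, the following minimal Python sketch checks a raw input object before any work starts. The field names `startUrls` and `maxItems`, the default of 100, and the `InputError` class are hypothetical; a real Actor would mirror whatever its own `.actor/input_schema.json` defines.

```python
from typing import Any, Dict, List


class InputError(ValueError):
    """Raised when the Actor input is missing or malformed (hypothetical helper class)."""


def validate_input(raw: Any) -> Dict[str, Any]:
    """Return a normalized input dict or raise InputError with a clear message."""
    if not isinstance(raw, dict):
        raise InputError("Input must be a JSON object")

    start_urls = raw.get("startUrls")
    if not isinstance(start_urls, list) or not start_urls:
        raise InputError("'startUrls' must be a non-empty list")

    urls: List[str] = []
    for entry in start_urls:
        # Accept either {"url": "..."} objects or plain strings.
        url = entry.get("url") if isinstance(entry, dict) else entry
        if not isinstance(url, str) or not url.startswith(("http://", "https://")):
            raise InputError(f"Invalid start URL: {entry!r}")
        urls.append(url)

    max_items = raw.get("maxItems", 100)  # assumed default for this sketch
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(max_items, bool) or not isinstance(max_items, int) or max_items < 1:
        raise InputError("'maxItems' must be a positive integer")

    return {"startUrls": urls, "maxItems": max_items}
```

Failing with one specific message at the start of the run makes the problem visible in the log immediately, rather than partway through a crawl.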
Logging
See references/logging.md for complete logging documentation including available log levels and best practices for JavaScript/TypeScript and Python.
Check `usesStandbyMode` in `.actor/actor.json` - only implement the readiness probe handler if it is set to `true`.
Commands
```bash
apify run # Run Actor locally
apify login # Authenticate account
apify push # Deploy to Apify platform (uses name from .actor/actor.json)
apify help # List all commands
```
IMPORTANT: Always use `apify run` to test actors locally. Do not use `npm run start`, `npm start`, `yarn start`, or other package manager commands - these will not properly configure the Apify environment and storage.
Local Testing
When testing an actor locally with `apify run`, provide input data by creating a JSON file at:
`storage/key_value_stores/default/INPUT.json`
This file should contain the input parameters defined in your `.actor/input_schema.json`. The actor will read this input when running locally, mirroring how it receives input on the Apify platform.
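Creating that file can be scripted. The following is a minimal Python sketch using only the standard library; the helper name `write_local_input` and the example fields are hypothetical, while the relative path is the one given above.

```python
import json
from pathlib import Path
from typing import Any, Dict


def write_local_input(project_dir: str, actor_input: Dict[str, Any]) -> Path:
    """Write the input JSON where a local `apify run` reads it from and return the file path."""
    input_path = Path(project_dir) / "storage" / "key_value_stores" / "default" / "INPUT.json"
    # The storage directories may not exist yet in a fresh project.
    input_path.parent.mkdir(parents=True, exist_ok=True)
    input_path.write_text(json.dumps(actor_input, indent=2), encoding="utf-8")
    return input_path
```

For example, calling `write_local_input(".", {"startUrls": [{"url": "https://example.com"}]})` from the project root and then running `apify run` would start the Actor with that input, assuming the input schema defines a `startUrls` field.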
Standby Mode
See references/standby-mode.md for complete standby mode documentation including readiness probe implementation for JavaScript/TypeScript and Python.
Project Structure
.actor/
├── actor.json # Actor config: name, version, env vars, runtime
├── input_schema.json # Input validation & Console form definition
└── output_schema.json # Output storage and display templates
src/
└── main.js/ts/py # Actor entry point
storage/ # Local storage (mirrors Cloud)
├── datasets/ # Output items (JSON objects)
├── key_value_stores/ # Files, config, INPUT
└── request_queues/ # Pending crawl requests
Dockerfile # Container image definition
Actor Configuration
See references/actor-json.md for complete actor.json structure and configuration options.
Input Schema
See references/input-schema.md for input schema structure and examples.
Output Schema
See references/output-schema.md for output schema structure, examples, and template variables.
Dataset Schema
See references/dataset-schema.md for dataset schema structure, configuration, and display properties.
Key-Value Store Schema
See references/key-value-store-schema.md for key-value store schema structure, collections, and configuration.
Apify MCP Tools
If the MCP server is configured, use these tools for documentation:
- `search-apify-docs` - Search documentation
- `fetch-apify-docs` - Get full doc pages
Otherwise, the MCP server URL is `https://mcp.apify.com/?tools=docs`.
Resources
- docs.apify.com/llms.txt - Apify quick reference documentation
- docs.apify.com/llms-full.txt - Apify complete documentation
- https://crawlee.dev/llms.txt - Crawlee quick reference documentation
- https://crawlee.dev/llms-full.txt - Crawlee complete documentation
- whitepaper.actor - Complete Actor specification