alibabacloud-pai-rec-diagnosis
PAI-Rec Engine Diagnosis and Configuration Validation
This skill provides comprehensive diagnostic and validation capabilities for Alibaba Cloud PAI-Rec (Programmable Recommendation System) engines, including interface troubleshooting and configuration analysis.
Scenario Description
PAI-Rec is Alibaba Cloud's programmable recommendation system that provides intelligent recommendation capabilities. This skill helps users:
- Diagnose PAI-Rec Engine Interface Issues: When engine API returns errors or unexpected results, trace the request through EAS service logs and engine configurations to identify root causes.
- Validate Engine Configurations: Analyze engine configuration files for potential issues, inconsistencies, or misconfigurations before deployment.
Architecture: PAI-EAS Service + PAI-Rec Engine + Engine Configuration Management
Key Components
- PAI-EAS Service: Elastic Algorithm Service hosting the recommendation engine
- PAI-Rec Engine: The recommendation engine processing requests
- Engine Configuration: Configuration files defining engine behavior
- Service Logs: EAS service logs containing request traces
Installation
Pre-check: Aliyun CLI >= 3.3.3 required
Run `aliyun version` to verify >= 3.3.3. If not installed or version too low, run `curl -fsSL https://aliyuncli.alicdn.com/setup.sh | bash` to install/update, or see `references/cli-installation-guide.md` for installation instructions.
Pre-check: Aliyun CLI plugin update required
[MUST] Run `aliyun configure set --auto-plugin-install true` to enable automatic plugin installation. [MUST] Run `aliyun plugin update` to ensure that any existing plugins are always up-to-date.
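The version pre-check can be sketched as a small comparison helper. This is a minimal sketch, assuming the output of `aliyun version` contains a dotted numeric version such as `3.3.3`; the helper names `parse_version` and `meets_minimum` are illustrative and not part of the CLI.

```python
import re

MIN_CLI_VERSION = (3, 3, 3)  # minimum stated in the pre-check


def parse_version(text: str) -> tuple[int, ...]:
    """Extract the first dotted numeric version (e.g. '3.3.3') from text."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(version_output: str, minimum: tuple[int, ...] = MIN_CLI_VERSION) -> bool:
    """Return True when the reported version is >= the required minimum."""
    # Tuples compare element by element, so (3, 10, 0) > (3, 3, 3).
    return parse_version(version_output) >= minimum
```

Comparing integer tuples rather than strings avoids the usual pitfall where `"3.10.0"` sorts before `"3.3.3"` lexicographically.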
Authentication
Pre-check: Alibaba Cloud Credentials Required
Security Rules:
- NEVER read, echo, or print AK/SK values (e.g., `echo $ALIBABA_CLOUD_ACCESS_KEY_ID` is FORBIDDEN)
- NEVER ask the user to input AK/SK directly in the conversation or command line
- NEVER use `aliyun configure set` with literal credential values
- ONLY use `aliyun configure list` to check credential status
```bash
aliyun configure list
```
Check the output for a valid profile (AK, STS, or OAuth identity).
If no valid profile exists, STOP here.
- Obtain credentials from Alibaba Cloud Console
- Configure credentials outside of this session (via `aliyun configure` in terminal or environment variables in shell profile)
- Return and re-run after `aliyun configure list` shows a valid profile
RAM Policy
[MUST] Permission Failure Handling: When any command or API call fails due to permission errors at any point during execution, follow this process:
- Read `references/ram-policies.md` to get the full list of permissions required by this SKILL
- Use `ram-permission-diagnose` skill to guide the user through requesting the necessary permissions
- Pause and wait until the user confirms that the required permissions have been granted
For detailed RAM policies required by this skill, see references/ram-policies.md.
Parameter Confirmation
IMPORTANT: Parameter Confirmation — Before executing any command or API call, ALL user-customizable parameters (e.g., RegionId, service names, instance IDs, environment names, config names, etc.) MUST be confirmed with the user. Do NOT assume or use default values without explicit user approval.
Required Parameters
| Parameter Name | Required/Optional | Description | Default Value |
|---|---|---|---|
| | Required | PAI-EAS service name | N/A |
| | Required | EAS cluster/region ID | N/A |
| | Required | PAI-Rec instance ID | N/A |
| | Required (for diagnosis) | Request ID from engine API response | N/A |
| | Optional | Engine configuration name | N/A |
| | Optional | Engine environment (Prod/Pre) | N/A |
| | Optional | Specific engine config version ID | N/A |
| | Optional | Alibaba Cloud region | N/A |
Core Workflow
At the start of the Core Workflow (before any CLI invocation):
[MUST] Enable AI-Mode: AI-mode is required for Agent Skill execution. Run the following commands before any CLI invocation:
```bash
aliyun configure ai-mode enable
aliyun configure ai-mode set-user-agent --user-agent "AlibabaCloud-Agent-Skills/alibabacloud-pai-rec-diagnosis"
```
[MUST] Disable AI-Mode at EVERY exit point: Before delivering the final response for ANY reason, always disable AI-mode first. This applies to ALL exit paths: workflow success, workflow failure, error/exception, user cancellation, session end, or any other scenario where no further CLI commands will be executed. AI-mode is only used for Agent Skill invocation scenarios and MUST NOT remain enabled after the skill stops running.
```bash
aliyun configure ai-mode disable
```
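One way to honor the "disable at every exit point" rule is to wrap the workflow in a construct that always runs the disable command. The following is a minimal sketch, not part of the skill itself: `ai_mode` and `run_cli` are hypothetical helper names, and only the three `aliyun configure ai-mode` commands quoted above are assumed to exist.

```python
import subprocess
from contextlib import contextmanager

USER_AGENT = "AlibabaCloud-Agent-Skills/alibabacloud-pai-rec-diagnosis"


def run_cli(args):
    """Default runner: invoke the aliyun CLI and raise on a non-zero exit."""
    subprocess.run(["aliyun", *args], check=True)


@contextmanager
def ai_mode(runner=run_cli):
    """Enable AI-mode on entry and always disable it on exit."""
    runner(["configure", "ai-mode", "enable"])
    try:
        runner(["configure", "ai-mode", "set-user-agent", "--user-agent", USER_AGENT])
        yield
    finally:
        # Runs on success, on exceptions, and on early returns from the with-block.
        runner(["configure", "ai-mode", "disable"])
```

A workflow would then run inside `with ai_mode(): ...`, so success, failure, and cancellation paths all pass through the `finally` clause. The `runner` parameter exists only so the behavior can be exercised without the CLI installed.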
Workflow 1: PAI-Rec Engine Interface Diagnosis
This workflow helps diagnose issues when a PAI-Rec engine API returns errors or unexpected results.
Input Example:
```
Service Name: embedding_recall
API Response:
{
  "code": 299,
  "msg": "items size not enough",
  "request_id": "941b4e14-d1c5-489f-a184-b2b17f8b4fdb",
  "size": 0,
  "experiment_id": "",
  "items": []
}
```
Step 1: Retrieve EAS Service Information
Get the service details to find the EAS service ID and configuration:
```bash
aliyun eas describe-service \
  --cluster-id <cluster-id> \
  --service-name <service-name>
```
What to extract:
- `Resource`: EAS service resource ID (e.g., `eas-r-1v4qb1yan3qmnjwxqe`)
- `ServiceConfig.envs`: Environment variables containing:
  - `REGION`: The region
  - `INSTANCE_ID`: PAI-Rec instance ID
  - `CONFIG_NAME`: Engine configuration name
  - `PAIREC_ENVIRONMENT`: Environment (product/prepub)
Step 2: Extract Request ID from API Response
Parse the API response JSON to get the `request_id` field. This will be used to search service logs.
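As an illustration of this step, the sketch below pulls `request_id` out of the sample response from the input example. The function name `extract_request_id` is hypothetical; the only assumption is that the response body is a JSON object with a string `request_id` field, as in that sample.

```python
import json


def extract_request_id(response_text: str) -> str:
    """Return the request_id field from an engine API response body."""
    payload = json.loads(response_text)
    if not isinstance(payload, dict):
        raise ValueError("response body is not a JSON object")
    request_id = payload.get("request_id")
    if not isinstance(request_id, str) or not request_id:
        raise ValueError("response has no usable request_id")
    return request_id


sample = """{
  "code": 299,
  "msg": "items size not enough",
  "request_id": "941b4e14-d1c5-489f-a184-b2b17f8b4fdb",
  "size": 0,
  "experiment_id": "",
  "items": []
}"""
print(extract_request_id(sample))  # prints 941b4e14-d1c5-489f-a184-b2b17f8b4fdb
```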
Step 3: Query EAS Service Logs
Use the request ID as the sole filter to search service logs. Do NOT pass `--start-time` / `--end-time` when searching PAI-Rec business logs:
```bash
aliyun eas describe-service-log \
  --cluster-id <cluster-id> \
  --service-name <service-name> \
  --keyword <request-id> \
  --page-size 500
```
[CRITICAL] Known CLI pitfall: keyword-only lookup is required for business logs:
- When only `--keyword` is supplied (no time range), the CLI returns the full PAI-Rec application trace (`controller.go` / `feed.go` / `recall.go` / `rank_service.go` etc.) matching the request_id.
- As soon as `--start-time` / `--end-time` are added (even if the window covers the real log timestamp), the CLI silently drops business logs and only returns infrastructure noise (`/bin/sh` wrapper heartbeats, `502 Bad Gateway` retries, `postgres.go dbstat`).
- Therefore: for request-level diagnosis, always omit the time range and rely on `--keyword <request-id>` alone.
Notes:
- `--keyword`: Use the full `request_id` extracted from the API response (case-sensitive exact match).
- `--page-size`: Raise to 500 to capture the entire trace in a single page; total matched entries for one request is usually < 30.
- `--start-time` / `--end-time`: Only use these for broad time-window scans without `--keyword` (e.g., when investigating non-request-specific issues). Required format is `yyyy-MM-dd HH:mm:ss` in UTC (space separator, no `T` / no `Z`). ISO-8601 forms like `2025-04-28T00:00:00Z` will be rejected with `InvalidParameter`.
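The two query modes and the timestamp format can be captured in a small argument builder. This is a sketch under stated assumptions: `build_log_query` and `format_log_time` are hypothetical helpers, and the only CLI flags used are the ones quoted in this step. The builder refuses to combine a keyword with a time range, mirroring the pitfall described above.

```python
from datetime import datetime, timezone


def format_log_time(moment: datetime) -> str:
    """Render an aware datetime as UTC 'yyyy-MM-dd HH:mm:ss' (no 'T', no 'Z')."""
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


def build_log_query(cluster_id, service_name, request_id=None, start=None, end=None):
    """Build argv for `aliyun eas describe-service-log`.

    Request-level lookups use --keyword only; time ranges are reserved for
    broad scans without a keyword.
    """
    if request_id and (start or end):
        raise ValueError("do not combine --keyword with --start-time/--end-time")
    argv = ["aliyun", "eas", "describe-service-log",
            "--cluster-id", cluster_id, "--service-name", service_name]
    if request_id:
        argv += ["--keyword", request_id, "--page-size", "500"]
    if start is not None:
        argv += ["--start-time", format_log_time(start)]
    if end is not None:
        argv += ["--end-time", format_log_time(end)]
    return argv
```

Returning an argument list rather than a shell string keeps the space inside the timestamp intact when the list is handed to `subprocess.run`.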
Step 4: List Engine Configurations
Map the environment and list matching configurations:
Environment Mapping:
- `product` → `Prod`
- `prepub` → `Pre`
```bash
aliyun pairecservice list-engine-configs \
  --instance-id <instance-id> \
  --environment <Prod|Pre> \
  --status Released \
  --name <config-name>
```
What to extract:
- Find the configuration with `Status: Released`
- Get `EngineConfigId` and `Version`
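The environment mapping and the extraction of released versions can be sketched as follows. The shape of the CLI response is an assumption here: the sketch takes an already-parsed list of config objects carrying the `Status`, `EngineConfigId`, and `Version` fields named above, and the function names are illustrative only.

```python
ENVIRONMENT_MAP = {"product": "Prod", "prepub": "Pre"}


def to_cli_environment(pairec_environment: str) -> str:
    """Translate a PAIREC_ENVIRONMENT value into the --environment argument."""
    try:
        return ENVIRONMENT_MAP[pairec_environment]
    except KeyError:
        raise ValueError(f"unmapped PAIREC_ENVIRONMENT: {pairec_environment!r}") from None


def released_configs(configs):
    """Return (EngineConfigId, Version) for every config with Status 'Released'."""
    return [
        (c.get("EngineConfigId"), c.get("Version"))
        for c in configs
        if c.get("Status") == "Released"
    ]
```

Raising on an unmapped value, instead of defaulting to `Prod`, keeps the lookup consistent with the rule that parameters are never assumed without confirmation.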
Step 5: Get Engine Configuration Details
```bash
aliyun pairecservice get-engine-config \
  --instance-id <instance-id> \
  --engine-config-id <engine-config-id>
```
What to extract:
- `ConfigValue`: The actual engine configuration (JSON/YAML)
Step 5.5 (Optional): Static Config Sanity Check
Optionally run `scripts/validate.py` against the retrieved `ConfigValue` to quickly rule out structural / reference / naming errors in the engine configuration before diving into the log trace. See Workflow 2 § Step 3 and references/config-validation.md for usage, exit codes, and the full rule list.
```bash
printf '%s' "$CONFIG_VALUE" | python3 scripts/validate.py --stdin
```
When to run: when the log trace points at a specific configuration element (e.g. a `RecallConfs` / `FilterConfs` / `SceneConfs` entry), or when the configuration is being diagnosed for the first time in this skill session.
When to skip: when the log trace already shows a decisive non-config root cause (e.g. a `scene_id` not present in `SceneConfs`, a 5xx from an upstream EAS dependency, a missing feature table). `validate.py` is a static checker and cannot detect request-time mismatches between client input and configuration.
[MUST] Scoping rule for the final report:
- `validate.py` findings may enter the final diagnosis ONLY when they are directly tied to the log evidence for the current `request_id` (e.g. the log blames a `RecallConf` name that `validate.py` flags as duplicated or dangling).
- Findings unrelated to the current `request_id` trace MUST NOT be added to the final conclusion. They remain an internal sanity-check signal only. This preserves the evidence-only reporting rule in Step 6.
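The scoping rule amounts to a filter over the checker's findings. The sketch below is purely illustrative: the output format of `validate.py` is not shown in this document, so the finding shape (`{"severity", "name", "message"}`) and the function `scope_findings` are assumptions, and plain substring matching stands in for whatever correlation a real implementation would use.

```python
def scope_findings(findings, log_lines):
    """Keep only findings whose subject name occurs in the request's log trace."""
    trace = "\n".join(log_lines)
    # A finding without a name cannot be tied to log evidence, so it is dropped.
    return [f for f in findings if f.get("name") and f["name"] in trace]


findings = [
    {"severity": "error", "name": "u2i_recall", "message": "duplicate RecallConfs name"},
    {"severity": "warning", "name": "hot_recall", "message": "unused recall"},
]
log_lines = ["recall.go:88 [ERROR] recall u2i_recall returned 0 items"]
in_scope = scope_findings(findings, log_lines)
```

With the sample data, only the `u2i_recall` finding survives, because that is the only name the trace mentions; the `hot_recall` warning stays an internal signal.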
Step 6: Comprehensive Analysis
Analyze the following components together:
- API Response: Error code, message, and returned data
- Service Logs: Trace logs for the request_id showing processing flow
- Engine Configuration: Settings that may affect the behavior
Common Issues to Check:
- Configuration mismatches (e.g., recall settings, filtering rules)
- Resource limitations (e.g., insufficient items, timeout settings)
- Data source issues (e.g., table access, feature availability)
- Environment inconsistencies (e.g., prod config in prepub environment)
[MUST] Evidence-only reporting rule:
The final diagnosis delivered to the user MUST be grounded strictly in what the EAS service logs and the engine configuration directly show. Apply the following constraints:
- Report only what is observed. Quote the exact log line (file:line, level, message) and the exact config fragment that proves each claim.
- State the direct causal chain from log evidence to the API response, and stop there.
- Do NOT add any of the following unless the user explicitly asks:
  - Speculative root causes not visible in logs/config (e.g., "client probably sent wrong X")
  - Fix recommendations or remediation steps
  - Conditional "if X then Y" scenarios
  - Tangential best-practice advice (security, fallback design, naming, etc.)
  - Guesses about upstream systems, client code, or data sources not covered by the logs/config
- If the evidence is insufficient to reach a conclusion, state explicitly what additional data (specific log lines, other config versions, other environments) is needed, instead of guessing.
- Recommendations are opt-in only. Provide fixes/suggestions only when the user explicitly requests them in a follow-up.
Workflow 2: PAI-Rec Engine Configuration Validation
This workflow validates engine configurations for potential issues.
Input: Configuration name and environment (Prod/Pre)
Step 1: List Configuration Versions
If user doesn't provide `engine-config-id`, list available versions:
```bash
aliyun pairecservice list-engine-configs \
  --instance-id <instance-id> \
  --environment <Prod|Pre> \
  --name <config-name>
```
Display to user:
- `Version`: Version number
- `Status`: Configuration status (Released/Draft/Archived)
- `GmtCreateTime`: Creation timestamp
- `EngineConfigId`: Version ID
Ask user to select a version or provide the `engine-config-id`.
Step 2: Retrieve Configuration Details
```bash
aliyun pairecservice get-engine-config \
  --instance-id <instance-id> \
  --engine-config-id <engine-config-id>
```
Step 3: Run Schema + Rule Validation
[MUST] Feed the extracted `ConfigValue` JSON into `scripts/validate.py`. The script enforces JSON Schema (`references/schema.json`) + reference-consistency rules and exits with status 0 on pass, 1 on failure.
```bash
# From stdin (recommended when ConfigValue is already in memory)
printf '%s' "$CONFIG_VALUE" | python3 scripts/validate.py --stdin

# From a saved JSON file
python3 scripts/validate.py /tmp/engine-config.json

# From an inline JSON string
python3 scripts/validate.py '{"RunMode":"product","RecallConfs":[...]}'
```
Requires `jsonschema` (`pip install jsonschema`); if missing the script falls back to
rule-only validation without Schema checks.
**What the script checks (summary):**
1. **Structure** — JSON well-formedness, required fields, types (`RunMode`,
`RecallConfs`, `FilterConfs`, `SortConfs`, `AlgoConfs`, `SceneConfs`, `RankConf`,
`FeatureConfs`, `UserFeatureConfs`, `DebugConfs`, `FeatureLogConfs`,
`CallBackConfs`, `PipelineConfs`, etc.)
2. **Enum values** — `RecallType` / `FilterType` / `SortType` / `RunMode` /
`DebugConfs.OutputType` / `GeneralRankConfs.ActionConfs[].ActionType`
3. **Reference consistency** — `SceneConfs.RecallNames` → `RecallConfs`;
`FilterNames` → `FilterConfs`; `SortNames` → `SortConfs`;
`RankConf.RankAlgoList` → `AlgoConfs`; any `DaoConf.AdapterType` +
`*Name` → the corresponding `*Confs` (Hologres / Redis / MySQL / TableStore /
FeatureStore / …)
4. **Business rules**
- `User2ItemExposureFilter` with `WriteLog=true` + FeatureStore adapter: must set
`TimeInterval > 0`
- `PriorityAdjustCountFilter` in `accumulator` mode: `Count` must be strictly
increasing (use `Type="fix"` for independent per-recall caps)
- `PipelineConfs.*.Name` must be globally unique
- `DebugConfs.Rate` must be an integer in `[0, 100]`
5. **Duplicate name detection** within `RecallConfs`, `FilterConfs`, `SortConfs`,
`AlgoConfs`
Detailed usage, exit codes, example outputs and the full rule list live in
[references/config-validation.md](references/config-validation.md).
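To make three of the summarized checks concrete (duplicate names, `SceneConfs.RecallNames` reference consistency, and the `DebugConfs.Rate` range), here is a minimal sketch. It is not the code of `scripts/validate.py`. It assumes that each `RecallConfs` entry is an object with a `Name` key and that `RecallNames` lists of strings may sit at any depth under `SceneConfs`; `check_config` and its helpers are hypothetical names.

```python
from collections import Counter


def duplicate_names(entries):
    """Return names that occur more than once in a list of {'Name': ...} entries."""
    counts = Counter(e.get("Name") for e in entries if isinstance(e, dict))
    return sorted(name for name, n in counts.items() if name is not None and n > 1)


def collect_values(node, key):
    """Recursively gather every list item stored under `key` anywhere in node."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key and isinstance(v, list):
                found.extend(v)
            else:
                found.extend(collect_values(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_values(item, key))
    return found


def check_config(config):
    """Run three illustrative checks and return a list of error strings."""
    errors = []
    recalls = config.get("RecallConfs", [])
    for name in duplicate_names(recalls):
        errors.append(f"duplicate RecallConfs name: {name}")
    defined = {e.get("Name") for e in recalls if isinstance(e, dict)}
    for ref in collect_values(config.get("SceneConfs", {}), "RecallNames"):
        if ref not in defined:
            errors.append(f"SceneConfs references unknown recall: {ref}")
    rate = config.get("DebugConfs", {}).get("Rate")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    is_int = isinstance(rate, int) and not isinstance(rate, bool)
    if rate is not None and not (is_int and 0 <= rate <= 100):
        errors.append(f"DebugConfs.Rate must be an integer in [0, 100], got {rate!r}")
    return errors
```

The recursive `collect_values` walk is a deliberate simplification: it avoids hard-coding how scenes nest their recall lists, at the cost of also matching a `RecallNames` key in unexpected places.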
Step 4: Evidence-Grounded Report
Report to the user based strictly on the script's output plus any additional inspection of `ConfigValue`:
- ✅ Checks passed (Schema clean, references resolved, no rule violations)
- ⚠️ Warnings reported by the script (severity=warning) or inconsistencies observed in `ConfigValue` (e.g. naming collisions between `RankScore` variables and model output fields, env/region mismatches)
- ❌ Errors reported by the script (severity=error) or missing required fields
- Missing-evidence notes: what extra data (other config versions, model signatures, etc.) would be needed to turn a warning into a confirmed error
Do not add speculative fixes or best-practice tangents; suggestions are provided only when the user explicitly asks for them.
Success Verification Method
For detailed verification steps, see references/verification-method.md.
Quick Verification:
- For Diagnosis Workflow:
  - Service information retrieved successfully
  - Logs found containing the request_id
  - Configuration loaded correctly
  - Root cause identified
- For Validation Workflow:
  - Configuration retrieved successfully
  - All validation checks executed
  - Issues clearly reported
  - Recommendations provided (if applicable)
Cleanup
This skill performs read-only operations and does not create any resources that require cleanup.
Best Practices
- Always capture request_id: When reporting API issues, include the full response with request_id for accurate log correlation.
- Log queries (keyword only, no time range): For request-level diagnosis, pass `--keyword <request_id>` to `aliyun eas describe-service-log` and leave `--start-time` / `--end-time` unset. Combining keyword with a time range filters out business logs due to a CLI quirk (see Workflow 1, Step 3). Only use time ranges for broad non-request scans, and only with the `yyyy-MM-dd HH:mm:ss` UTC format (no `T` / no `Z`).
- Environment awareness: Always verify that configurations match the target environment (Prod vs Pre).
- Version control: When validating configurations, check multiple versions if issues persist across deployments.
- Log retention: EAS service logs are retained for limited periods; diagnose issues promptly after occurrence.
- Configuration backup: Before applying changes based on validation results, ensure current configurations are backed up.
- Cross-reference: Compare working configurations with problematic ones to identify differences.
- Service status: Check EAS service status before diagnosing; service-level issues may mask configuration problems.
- Evidence-only conclusions: Ground every statement in the diagnosis on a specific log line or config fragment. Do not speculate, do not propose fixes, and do not volunteer best-practice advice unless the user explicitly asks. If the evidence is insufficient, say what is missing rather than inferring.
- Structured analysis: Follow the systematic workflow rather than jumping to conclusions based on error messages alone.
- Document findings: Keep track of recurring issues and their resolutions for faster future diagnosis.
Reference Links
| Reference Document | Description |
|---|---|
| RAM Policies | Required RAM permissions for PAI-Rec and EAS APIs |
| Related Commands | Complete CLI command reference |
| Verification Method | Detailed verification procedures |
| CLI Installation Guide | Alibaba Cloud CLI installation instructions |
| Configuration Examples | Sample engine configurations and common patterns |
| Config Validation | |
| Troubleshooting Guide | Common issues and solutions |