Alibaba Cloud Elasticsearch Instance Diagnosis
Collect signals from Alibaba Cloud OpenAPI (control plane) and the Elasticsearch REST API (data plane), combine them with the SOP knowledge base under , and produce root-cause analysis, an evidence chain, prioritized remediation guidance, and, when multiple dimensions fire, a recency-ordered incident timeline (severity vs time in window; see Timeline and recency (MUST) in §5 Step 4).
Architecture: Alibaba Cloud Elasticsearch OpenAPI + Alibaba CloudMonitor (CMS) + Elasticsearch REST API + diagnostic SOPs
Closure: If MUST applies and
is set, finish authenticated ES API evidence before the final report (see
Feasibility order in §5).
1. Prerequisites
1.1 Aliyun CLI
Pre-check: Aliyun CLI >= 3.3.1 required (for RAM permission checks and OpenAPI CLI fallback)
Run
to verify the version is >= 3.3.1. If the CLI is missing or too old, see references/cli-installation-guide.md.
After installation, run aliyun configure set --auto-plugin-install true to enable automatic plugin installation (do not pass plaintext AccessKey pairs on this command line; see §1.2).
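The version gate above can be checked mechanically. The following is a minimal sketch, assuming the version has already been extracted as a dotted string such as 3.3.1 (the exact output format of the CLI's version command is not shown here, so parsing that output is left out). `parse_version` and `meets_minimum` are hypothetical helpers, not part of this repo.

```python
def parse_version(text: str) -> tuple:
    """Turn '3.3.1' (optionally prefixed with 'v') into (3, 3, 1)."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def meets_minimum(found: str, minimum: str = "3.3.1") -> bool:
    """Compare numerically so that '3.10.0' ranks above '3.3.1'."""
    return parse_version(found) >= parse_version(minimum)
```

Comparing integer tuples rather than raw strings avoids the common mistake of ranking 3.10.0 below 3.3.1.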
1.2 Alibaba Cloud account authentication and security (MUST)
Security rules (mandatory):
- NEVER read, echo, or print AccessKey ID or AccessKey Secret values.
- NEVER prompt or ask the user to paste plaintext AccessKeys in the conversation.
- NEVER embed AccessKeys in scripts, CLI arguments, or URLs.
- NEVER use (or similar) to pass literal AccessKey ID/Secret on the command line.
- NEVER accept AccessKeys that the user pastes into the chat, even if offered voluntarily.
- ONLY use configured CLI profiles () or environment variables such as ALIBABA_CLOUD_ACCESS_KEY_ID / ALIBABA_CLOUD_ACCESS_KEY_SECRET that the user has set in their local shell (the agent must not echo those values in the session).
⚠️ If the user provides AccessKeys in the chat (e.g. “my AK is xxx”)
- Stop immediately: do not run any Alibaba Cloud command that requires credentials.
- Decline politely and give only the names of approved configuration methods (do not repeat any secret the user may have leaked):
- Recommended: run in a local terminal and enter credentials when prompted; credentials are stored in the local profile file.
- Alternatively: set ALIBABA_CLOUD_ACCESS_KEY_ID / ALIBABA_CLOUD_ACCESS_KEY_SECRET in the local shell (the user types values only in the terminal, not in chat).
- Resume the diagnosis request only after credentials are configured correctly.
Verify credentials without exposing secrets:
bash
aliyun configure list
aliyun --profile <profile_name> sts get-caller-identity
Credential policy:
- Prefer an profile (default or ).
- If there is no valid identity ( / fails), STOP and guide the user to configure locally; do not guess or fabricate credentials.
- Never pass plaintext AccessKeys through the conversation.
1.3 Elasticsearch direct-connect credential boundary
- NEVER ask the user to paste in chat; NEVER echo, print, or log the password; NEVER copy a password from chat into commands, hooks, or repo files.
- Shell expansion for curl -u "$ES_USERNAME:$ES_PASSWORD" (or equivalent) is allowed when the variables are pre-exported in the user’s local shell; NEVER put the secret as a literal in chat, scripts checked into repos, or command output.
- If the user tries to send a password in chat: STOP as well and ask them to set only locally via (see §2.2).
2. Environment setup
2.1 Control plane OpenAPI (via Aliyun CLI)
All control-plane and CMS data collection for this skill uses the Aliyun CLI.
[MUST] / — plugin-mode shell only (avoid legacy CLI)
Whenever the agent emits executable lines (chat, reproducibility exports, or copy-paste steps), use plugin subcommands (lowercase-hyphenated) and kebab-case flags, the same shape as scripts/openapi_cli_collect.py and references/verification-method.md.
- Do not use legacy POP-style invocations: a PascalCase verb immediately after or on the same line (the old “action name = subcommand” style), or CamelCase flags like , , in new commands. Use plugin verbs only (, , …).
- Naming split: , , , etc. are OpenAPI action names (PascalCase — docs, RAM, console). The token after or in a shell must be the CLI plugin name (, , , …).
- Prefer python3 scripts/check_es_instance_health.py for the standard control-plane + CMS bundle so subprocess calls stay aligned with this repo.
- CLI references: Elasticsearch CLI Center, CloudMonitor CLI Center.
AI-Mode and plugin baseline (required): wrap every diagnosis session that runs OpenAPI/CMS commands:
bash
aliyun configure ai-mode enable
aliyun configure ai-mode set-user-agent --user-agent "AlibabaCloud-Agent-Skills/alibabacloud-elasticsearch-instance-diagnose"
aliyun plugin update
# … diagnosis: aliyun / python3 scripts/check_es_instance_health.py …
aliyun configure ai-mode disable
missing or failing: Skip the wrapper above; use
(next block). Log the CLI failure (e.g. subcommand unavailable). Whether the profile is
valid is determined only by
and
— write
valid /
validity, not
vaild.
User-Agent (required): set a User-Agent for Alibaba Cloud API calls:
bash
export ALIBABA_CLOUD_USER_AGENT="AlibabaCloud-Agent-Skills/alibabacloud-elasticsearch-instance-diagnose"
CLI hardening (recommended): when authoring raw
commands, use
§2.1 MUST plugin shape first, then add
--connect-timeout 3 --read-timeout 10
(increase
for large responses or CMS), consistent with the instance-management skill examples, to avoid indefinite hangs on network faults. If the global User-Agent is not set, add
--user-agent AlibabaCloud-Agent-Skills/alibabacloud-elasticsearch-instance-diagnose
per invocation. For
optional Elasticsearch probes inside
check_es_instance_health.py
(when
is set), the same knobs exist as
/
on that script — they map to
for engine calls only, not to the Aliyun OpenAPI client.
Run before diagnosis:
bash
aliyun version
aliyun configure list
aliyun --profile <profile_name> sts get-caller-identity
2.2 Elasticsearch API direct access ()
Have the user set connection variables in a local terminal after you confirm the Elasticsearch endpoint (VPC or public) and admin credentials—do not hardcode user-specific values in chat:
bash
export ES_ENDPOINT="http://<elasticsearch-endpoint-ip>:9200"
export ES_USERNAME="elastic"
export ES_PASSWORD="<elasticsearch-admin-password>"
Public access and vs : From
, use
/
and the reported
. When
is (typical public listener), set
to http://<publicDomain>:9200
. Using
against an
HTTP-only endpoint causes
TLS errors (e.g.
). Use
only when
is (or TLS is actually enabled on the port you use), and supply CA / fingerprint options as in
HTTPS options below.
If “does not work” — when to try : Treat
as the
source of truth for the
REST listener.
,
timeouts, or
connection refused on
usually mean
network path / allowlist / security group / wrong host or port —
not “try HTTPS next” when
is still .
Do switch to
when
is (or the console / product doc states TLS on that endpoint) and the failure on
is a
TLS or scheme symptom (e.g.
,
, immediate SSL alert while probing with the wrong scheme). If
is and only
plain TCP is advertised,
HTTPS is not a fallback for reachability.
Credential safety
- NEVER echo, print, or log ; NEVER copy credentials from chat into shell history or saved files.
- NEVER ask the user to paste the password in plaintext in chat.
- ONLY use the following checks to verify that variables are set:
bash
[[ -n "$ES_ENDPOINT" ]] && echo "ES_ENDPOINT: $ES_ENDPOINT" || echo "ES_ENDPOINT: NOT SET"
[[ -n "$ES_PASSWORD" ]] && echo "ES_PASSWORD: SET" || echo "ES_PASSWORD: NOT SET"
Network connectivity and access control
| Issue | How to check | Mitigation |
|---|---|---|
| Public network access disabled | Elasticsearch console → Network | Enable public access or use the VPC endpoint |
| Public access allowlist | Console → Security → Public access allowlist | Add the agent host’s public IP |
| VPC isolation | e.g. | VPC peering, Express Connect, or equivalent |
| Security group | Inbound rules on the ECS/security group hosting Elasticsearch | Allow TCP 9200 (or the configured port) |
Connectivity probe:
curl -sS -o /dev/null -w "%{http_code}" --connect-timeout 5 "${ES_ENDPOINT}"
— HTTP code
usually means the path is unreachable.
without is normal (auth required); if
is SET, proceed to
authenticated (§7).
with → wrong credentials.
/ refused / timeout → network, allowlist, or TLS/scheme mismatch.
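One way to keep the probe interpretation consistent is to encode it as a small decision function. The sketch below is illustrative only: `classify_probe` and its labels are hypothetical, it relies on curl's `%{http_code}` convention of printing 000 when no HTTP response was received, and it treats 401 as the auth-required status that §5 refers to.

```python
def classify_probe(http_code: str, authenticated: bool) -> str:
    """Map a curl %{http_code} result to a next step (hypothetical labels)."""
    if http_code == "000":
        # curl prints 000 when no HTTP response arrived at all.
        return "transport: check network path, allowlist, or scheme/TLS mismatch"
    if http_code == "401":
        if authenticated:
            return "auth: credentials rejected"
        # An unauthenticated 401 only proves that the listener is reachable.
        return "reachable: retry with authentication"
    if http_code.startswith("2"):
        return "ok"
    return f"unexpected status {http_code}: inspect the response body"
```

Keeping the unauthenticated 401 separate from the authenticated one mirrors the rule that a 401 on a bare probe is not a blocking reason when the password is set.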
HTTPS — prerequisites (what must be true)
- Listener: The Elasticsearch HTTP port you call (9200 unless changed) must actually speak TLS — align with () or console/network documentation.
- URL: with the same host (e.g. ) you would use for HTTP.
- Client trust of the server certificate: Your client must trust the cluster’s certificate chain (cluster / cloud CA PEM, or corporate proxy CA if TLS is intercepted). : prefer
curl --cacert /path/to/ca.crt ...
; / only for short, non-production diagnosis.
- Auth: Same / as for HTTP (Basic auth over TLS).
HTTPS — how this skill documents it
- Manual (§7 and es-api-call-failures.md): Add (or for testing) to every when using if the default trust store does not include your cluster CA.
- check_es_instance_health.py optional ES probes: They invoke with only; they do not read / / (those names are common for Python Elasticsearch clients). For HTTPS instances, use §7 with for deep checks, or extend the script later to pass from an env var.
- Python-style env vars (reference for other tooling): , , (testing only) — not wired into this repo’s optional path today.
3. RAM permission check
[MUST] RAM permission pre-check
Before running this skill, verify the principal has the required RAM permissions.
See references/ram-policies.md for the full list.
If the user reports insufficient permissions, direct them to attach the corresponding policies in the RAM console.
4. Parameter confirmation
IMPORTANT: Parameter confirmation
Confirm the following with the user before any command or API call.
Do not assume undeclared defaults or hardcode user-specific parameters.
Boundary controls (MUST)
- Region and must not be guessed or taken from unverified defaults; if they disagree with or the user’s explicit statement, reconfirm.
- Do not apply metrics, logs, or conclusions from instance A to instance B; must match the instance under diagnosis (see Pre-flight validation for Elasticsearch API below).
- This skill is read-only diagnosis: do not invoke mutating control-plane APIs (create, resize, restart, delete instance, etc.). If the user requests a change, provide recommendations only; execution belongs in the console or an approved change workflow.
| Parameter | Required | Description | Default |
|---|---|---|---|
| | Yes | Elasticsearch instance ID, e.g. . flag is (not ). | - |
| | Yes | Region ID (e.g. ). flag is (not ). | - |
| | No | Aliyun CLI profile (explicit recommended) | |
| | No | Elasticsearch endpoint (direct API access only) | - |
| | No | Elasticsearch admin password (direct API access only) | - |
| | No | check_es_instance_health.py: analysis window in minutes (default 60) | 60 |
| , | No | check_es_instance_health.py: timeouts for optional ES engine probes when is set ( → ; contributes to together with connect). Defaults 5 / 10 seconds. | 5 / 10 |
5. End-to-end diagnostic workflow
Agent hard rules (non-negotiable)
Aliyun CLI shape: For
and
, follow
§2.1 MUST (plugin mode only) in every new executable command — do not resurrect legacy
/
-as-subcommand lines or
-style flags in session exports or user-facing step lists (they drift from
and fail static checks).
OpenAPI/CMS cannot replace MUST engine APIs. For any §5 MUST table row or check_es_instance_health.py rule-engine MUST, Alibaba Cloud OpenAPI and CloudMonitor do not replace the listed Elasticsearch REST calls for engine-level root cause. When feasibility holds, run those endpoints (see §7); they are complementary layers, not interchangeable.
Feasibility is decided only by checks, not by assumption. Whether the agent may call Elasticsearch
must be determined by actually running the
Feasibility order (§5): at minimum verify
/
per §2.2, align
with
, then authenticated
.
Do not assume
is unset or the path is unreachable without performing these steps in the session.
For Elasticsearch incidents, follow these four steps; each has a distinct role.
Execution strategy (root-cause driven)
Full policy: es-api-diagnosis-strategy.md
Data-plane collection requires both:
- Feasibility: and are set and the network path works.
- Necessity: root-cause analysis needs data-plane evidence that the control plane or CMS cannot establish alone.
For endpoints
listed under a fired
MUST table row
or rule-engine MUST,
necessity for those calls is
already satisfied by the trigger—still require
feasibility (
Feasibility order). For
optional engine
not in those lists, apply
feasibility and
necessity per
es-api-diagnosis-strategy.md.
MUST triggers (if any CMS condition below holds, collect the listed Elasticsearch evidence):
| Trigger | Scenario | Required Elasticsearch evidence |
|---|---|---|
| max ≥ Yellow / Red | Cluster health | , |
| max > 80% | CPU overload | , |
| NodeHeapMemoryUtilization max > 85% | Memory pressure | , GET /_cluster/settings?include_defaults=true ( in transient / persistent ) |
| Thread pool > 0 | Performance | , |
| Inter-node resource CV > 0.3 | Load imbalance | , |
| Write failures or index read-only | Disk / watermark / blocks | , _all/_settings?filter_path=*.settings.index.blocks, |
| Intermittent Elasticsearch API timeouts + CMS CPU > 80% | Possible cascading failure | , , |
Thread-pool row: interpret search vs write / bulk using sop-query-thread-pool.md vs sop-write-performance.md (see also Write-path / bulk saturation below).
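The numeric thresholds in the table can be evaluated mechanically. The sketch below is a hedged illustration, not the rule engine in check_es_instance_health.py: the function names and arguments are hypothetical, and the coefficient of variation (CV) uses the population standard deviation, which is an assumption because the table does not say which estimator applies.

```python
from statistics import mean, pstdev


def coefficient_of_variation(per_node_values: list) -> float:
    """CV = standard deviation / mean across nodes (population stdev assumed)."""
    avg = mean(per_node_values)
    return pstdev(per_node_values) / avg if avg else 0.0


def fired_triggers(cpu_max: float, heap_max: float, per_node_cpu: list) -> list:
    """Return the table scenarios whose numeric threshold is exceeded."""
    fired = []
    if cpu_max > 80:        # CPU overload row
        fired.append("CPU overload")
    if heap_max > 85:       # NodeHeapMemoryUtilization row
        fired.append("Memory pressure")
    if coefficient_of_variation(per_node_cpu) > 0.3:  # inter-node CV row
        fired.append("Load imbalance")
    return fired
```

The comparisons are strict, matching the "> 80%", "> 85%", and "> 0.3" wording, so a value sitting exactly on a threshold does not fire.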
Rule-engine MUST: If check_es_instance_health.py prints a §5 MUST / §5–§7 callout for this run, treat it like a row above: collect that listed ES evidence when feasibility holds.
Binding rule (MUST triggers): If
any MUST-trigger row
or the
rule-engine MUST line above applies,
necessity is satisfied for that evidence set—OpenAPI/CMS cannot replace those calls for engine-level root cause (cluster-health:
+
for Yellow/Red). Confirm
feasibility per
Feasibility order below. If reachable with auth,
run the MUST-listed endpoints in Step 2 in parallel with control-plane collection. If still blocked
after authenticated , lead with
blocking reason: unset
; transport failure (
, refused, timeout);
401 with ; scheme/TLS mismatch—not
401 on an unauthenticated probe when
is SET.
Write-path / bulk saturation
If
or
pool stress matches
high-QPS bulk indexing, read and follow
references/sop-write-performance.md
— §2, subsection
“Evidence interpretation: bulk QPS → write pool” for the evidence chain,
semantics (cumulative since node start),
report ordering vs Old GC / heap (causal chain or dual P0 — write path before JVM-only headline),
per-node / numbers (reject share), per-node asymmetry, and write-only vs search.
Do not lead with a JVM-only narrative when that subsection applies. For
write-queue–style acceptance prompts, the
opening conclusion should read as
write-capacity (data-plane counters + optional CMS rule names), not
only a GC/heap headline.
Search-primary vs write (both pools show cumulative )
When
shows
≫ on the same node(s) and
ThreadPool.SearchRejected
/
query-driven overload applies,
lead the
executive summary and
P0 ordering with
(high concurrent query / terms / slow query; hot index when verified) —
not first.
may remain
P0/P1 as
parallel or
secondary (bulk, catch-up);
Old GC / CPU / node disconnect stay
co-stress or cascade. Checker
listing order is not proof of narrative order — see
acceptance-criteria.md §6.5 and
sop-query-thread-pool.md Report narrative.
Recency overrides this magnitude default when
time-resolved evidence exists:
do not rank the opening story by
vs alone — cumulative counters lack timestamps. Full rubric:
acceptance-criteria.md §6.5 (
P0 / executive order vs ≫ :
unless write dominated
by time) and
§6.6 (
Executive order,
No false recency from counters).
Binding: Timeline and recency (MUST) below (same skill).
/ change workflow stuck (cross-layer root cause)
When an instance stays in
, a change is unfinished, and
Red or unassigned shards coexist, follow
references/sop-activating-change-stuck.md
end-to-end (
MUST includes
,
before/after remediation, collection order
section 3.1, reporting
section 4).
Pre-flight validation for Elasticsearch API
[IMPORTANT] must match the diagnosed instance
Compare
/
and
from
with
.
If they differ, warn:
⚠️ ES_ENDPOINT does not match the current instance; run export ES_ENDPOINT="http://{publicDomain}:9200"
when
is , or
only when
is (adjust host/port to match the deployment).
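The comparison above can be sketched as a host and scheme check. This is an assumption-laden illustration: `endpoint_matches`, `instance_domain`, and `instance_protocol` are hypothetical names standing in for whatever the control-plane response reports for the instance under diagnosis.

```python
from urllib.parse import urlsplit


def endpoint_matches(es_endpoint: str, instance_domain: str, instance_protocol: str) -> bool:
    """Return True when ES_ENDPOINT points at the diagnosed instance.

    instance_domain and instance_protocol are hypothetical inputs taken from
    the control-plane description of the instance (e.g. "HTTP" or "HTTPS").
    """
    # Accept bare host:port values by assuming plain HTTP, as the curl samples do.
    url = es_endpoint if "://" in es_endpoint else f"http://{es_endpoint}"
    parts = urlsplit(url)
    same_host = (parts.hostname or "").lower() == instance_domain.lower()
    same_scheme = parts.scheme.lower() == instance_protocol.lower()
    return same_host and same_scheme
```

A False result is the cue to print the warning and ask the user to re-export the endpoint locally, rather than to proceed with evidence from a different instance.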
When Elasticsearch credentials are missing or connections fail
[CRITICAL] Guide the user to fix connectivity explicitly;
classify failure modes (do not default persistent timeouts to “allowlist only”).
Do not imply the agent “forgot” Elasticsearch — if the first answer is CMS/OpenAPI-heavy, give the
blocking reason per
Feasibility order below: unset
; transport errors;
401 with valid ; TLS/scheme—not
401 on a probe
without when
is SET (use authenticated
first).
Progressive playbook (read in order): references/es-api-call-failures.md (sections 1 → 4).
MUST / strategy context: references/es-api-diagnosis-strategy.md (sections 1–3 and 3.5 summary table).
Mandatory warning when MUST applies but Elasticsearch is not configured
[CRITICAL] If a MUST trigger fires but data-plane evidence is missing, put a warning at the top of the report: follow
section 4 of
references/es-api-call-failures.md (blocking reason first, then MUST list, missing evidence; if
unset, pointer to
section 2.2 of this SKILL; if vars are set, use es-api-call-failures
sections 1–2 for auth vs transport).
Step 1: Quick health scan (initial signals)
Run the lightweight rules engine (17 metric rules) to list P0 / P1 / P2 findings and steer deeper collection:
bash
python3 scripts/check_es_instance_health.py -i <InstanceId> -r <RegionId> [--window <minutes, default 60>] [--profile <profile_name>]
Feasibility order (agent)
- Run §2.2 checks (password = SET only)—do not skip; never infer feasibility without this step.
- matches / (scheme/port).
- Authenticated —do not stop at 401 on an unauthenticated probe if is SET.
- MUST scope: table rows and/or rule-engine MUST line in §5.
Step 2: Collect evidence in parallel
Based on Step 1, run collection in parallel (prioritize dimensions with signals).
If a
MUST-trigger row
or rule-engine MUST applies: run
Feasibility order, then
run that Required Elasticsearch evidence via
in the
same round (see §7). If
no MUST applies, add optional data-plane
only when
feasibility and
necessity both hold per the strategy doc.
Re-run
check_es_instance_health.py
with the same invocation pattern as Step 1; for this parallel round,
and explicit
are common.
To backfill control-plane evidence (
,
, CMS-style calls), use
patterns in
references/verification-method.md (epoch times, profiles, namespaces).
Note: data-plane access still requires
/
; the Aliyun CLI
cannot replace
to the cluster.
For MUST-trigger rows, necessity for the listed endpoints is already established; do not skip them when feasibility including reachability holds.
Outside those rows, avoid unrelated bulk
solely because
is set; use the strategy doc’s feasibility + necessity test instead.
Step 3: Read SOPs by signal
Map signals to SOPs and read for deeper reasoning. With multiple signals, process P0 → P1 → P2 for severity, then apply Timeline and recency (MUST) in Step 4 so the narrative order matches when signals mattered in the window—not only static rule-engine print order.
| Observed signal | Read |
|---|---|
| Cluster Red/Yellow, node loss, pending tasks | references/sop-cluster-health.md |
| Long , unfinished change records, Red / unassigned shards | references/sop-cluster-health.md + references/sop-activating-change-stuck.md |
| High CPU, load, imbalance | references/sop-cpu-load.md |
| Per-node load imbalance (CPU/memory/disk/shard count) | references/sop-node-load-imbalance.md |
| JVM pressure, GC, circuit breaker, OOM | references/sop-memory-gc.md |
| Disk watermark, IO, write failures (read-only) | references/sop-disk-storage.md |
| Watermark misconfiguration, index blocks, “normal” disk % but write failures | references/sop-disk-storage.md (Section 3: watermark misconfiguration) |
| Write timeouts / rejections / latency / QPS drop | references/sop-write-performance.md |
| Query timeouts / rejections / slow queries | references/sop-query-thread-pool.md |
| Nodes look down but CPU still reported; | references/sop-service-avalanche.md |
| Intermittent Elasticsearch timeouts + CMS CPU > 80% | references/sop-service-avalanche.md |
| Risky settings, Ngram issues, API anomalies | references/sop-configuration.md |
| Event code definitions | references/health-events-catalog.md |
Step 4: Synthesize and write the structured report
Acceptance-style optional checklists: references/acceptance-criteria.md §6.1–
§6.6 — Red/Yellow; read-heavy CPU +
pool (+ CMS alignment); JVM / breakers / fielddata; write-queue vs GC +
/; read-heavy
search pool vs GC-only headline (expand in
sop-query-thread-pool.md Report narrative: search pool vs GC / CPU headlines); timeline/recency.
Bulk/write: references/sop-write-performance.md §2.
Shard : references/sop-node-load-imbalance.md §1.3 (allocator / change control only).
[CRITICAL] Remediation must match the diagnosed root cause — avoid generic templates. Wrong breaker or concurrency fixes (e.g.
vs
, “split query” when concurrency is the issue) → see
and the fired signal’s SOP.
+ data-plane anomaly: include the
one-line cross-layer root cause; see
references/sop-activating-change-stuck.md
section 4.
Report skeleton (copy/fill): references/report-template.md.
Timeline and recency (MUST for synthesized reports)
Problem: check_es_instance_health.py and P0/P1/P2 bands express severity, not when a signal mattered most within the analysis window. Cumulative engine counters (, ) do not encode recency: write and search issues can both be “real” while only one path dominated the recent past (e.g. search pressure closer to window end than write pressure).
Binding rules for the agent:
- Two axes — Treat severity (P0/P1/P2) and temporal relevance (proximity to window end / “now”) as orthogonal. Do not infer recency from priority alone (e.g. “write is P0 so it must be the current headline”) when time-resolved evidence says otherwise.
- Mandatory human-facing section: When more than one major finding fires (e.g. write pool + search pool + GC/CPU), the synthesized report must include an ### Incident timeline (recency-ordered) (or equivalent) block before or immediately after the executive summary, unless the user explicitly asks for a minimal report. In that block:
- Order bullets or rows by time (earlier → later), or state which signal cluster peaked / persisted in the latter portion of .
- Call out divergence: e.g. “write-path stress earlier in window; search-path / CPU more recent” when CMS or logs support it.
- Evidence for recency (use what exists; do not invent timestamps):
- CloudMonitor: per-metric time series; note peak timestamp or sustained-high interval for , NodeHeapMemoryUtilization, GC-related metrics, if exposed as rates or non-cumulative series in the collected JSON.
- Slow logs / : correlate query vs index slow entries to minutes.
- Engine (optional): two samples at known times to show delta on / ; or / for current skew vs historical cumulative counters.
- Executive summary ordering — The opening 2–4 sentences should reflect recency-weighted user impact: if search pressure is closer to current than write pressure, lead with search/query concurrency and co-stress (GC/CPU) as appropriate, and place historical write saturation as context or second wave—without dropping P0 write findings if they remain valid for remediation backlog.
- Explicit uncertainty — If only cumulative counters exist and no time series differentiates paths, state one line: recency is undifferentiated; recommend narrower window, slow logs, or delta sampling for the next run.
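The delta-sampling idea in the evidence list can be sketched as follows. It assumes two saved responses from GET /_nodes/stats/thread_pool taken at known times, and relies on the standard response layout nodes.<node_id>.thread_pool.<pool>.rejected. `rejected_delta` is a hypothetical helper, not part of this repo.

```python
def rejected_delta(earlier: dict, later: dict, pool: str) -> dict:
    """Per-node growth of the cumulative `rejected` counter between two samples."""
    deltas = {}
    for node_id, node in later.get("nodes", {}).items():
        before_node = earlier.get("nodes", {}).get(node_id)
        if before_node is None:
            # Node absent from the first sample, so there is no baseline.
            continue
        after = node["thread_pool"][pool]["rejected"]
        before = before_node["thread_pool"][pool]["rejected"]
        # A lower later value means the counter reset (node restart); skip it.
        if after >= before:
            deltas[node_id] = after - before
    return deltas
```

A pool whose counter did not move between the two samples was not rejecting during that interval, however large its cumulative total is, which is the distinction the recency rules ask the report to make.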
6. Data collection details (CLI OpenAPI + injected input)
One-shot entry
Use the same
check_es_instance_health.py
command as
§5 Step 1 (optional
/
; default window
60 minutes if omitted).
Injected input mode (paired with CLI)
check_es_instance_health.py accepts external JSON to avoid duplicate calls:
bash
python3 scripts/check_es_instance_health.py \
-i <InstanceId> -r <RegionId> \
--data-source input \
--input-json-file /path/to/diag-input.json
Input JSON shape:
json
{
"status_info": {},
"metrics": {},
"events": [],
"logs": []
}
- : prefer injected fields; backfill gaps via Aliyun CLI.
- : ignore injection; fetch everything via CLI.
- : injection only; no OpenAPI calls.
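A minimal sketch of sanity-checking the injected file before invoking the script. Only the four top-level keys shown in the input JSON shape come from this document; the helper name `validate_diag_input` and the type checks are assumptions, and the script itself may accept or require more.

```python
import json

# Top-level keys from the documented shape, with the JSON type each one uses.
EXPECTED_TYPES = {"status_info": dict, "metrics": dict, "events": list, "logs": list}


def validate_diag_input(text: str) -> list:
    """Return a list of problems found in an injected-input JSON document."""
    data = json.loads(text)
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in data:
            problems.append(f"missing key: {key}")
        elif not isinstance(data[key], expected):
            kind = "object" if expected is dict else "array"
            problems.append(f"{key} should be a JSON {kind}")
    return problems
```

An empty list means the file at least has the documented skeleton; it says nothing about whether the nested content is what the rule engine expects.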
Manual control-plane CLI backfill
For additional OpenAPI examples, see references/verification-method.md.
7. Elasticsearch direct API access (data-plane deep dive)
When feasibility holds (including reachability), execute the REST calls required by any MUST-trigger row (§5). For endpoints not listed in a fired MUST row, call them only when feasibility and necessity both hold per the strategy doc.
may be
or a full URL. For the samples below, normalize to
http://${ES_ENDPOINT#http://}
(use
consistently when the cluster serves TLS).
Timeouts: every
must use
--connect-timeout 10 --max-time 30
.
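The normalization and timeout rules can be mirrored outside the shell. The following is a sketch, assuming the endpoint is either a bare host:port or a full URL. `normalize_endpoint` and `build_probe_args` are hypothetical helpers, and the argument list deliberately omits -u so that no credential appears in it.

```python
def normalize_endpoint(es_endpoint: str, scheme: str = "http") -> str:
    """Accept 'host:9200' or a full URL and return '<scheme>://host:9200'."""
    bare = es_endpoint.strip().rstrip("/")
    for prefix in ("http://", "https://"):
        if bare.startswith(prefix):
            bare = bare[len(prefix):]
    return f"{scheme}://{bare}"


def build_probe_args(es_endpoint: str, path: str, scheme: str = "http") -> list:
    """curl arguments carrying the timeouts this section requires (no credentials)."""
    url = normalize_endpoint(es_endpoint, scheme) + "/" + path.lstrip("/")
    return ["curl", "-sS", "--connect-timeout", "10", "--max-time", "30", url]
```

Unlike the shell expansion in the samples, which strips only a leading http://, this sketch also strips https:// so that the scheme is always chosen explicitly by the caller.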
Red / Yellow (MUST) — recommended set
Scope: The cluster-health MUST row uses
max ≥
Yellow (includes
Red). Use this set for
unassigned / misallocated shard root cause on the engine.
bash
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
"http://${ES_ENDPOINT#http://}/_cluster/health?pretty"
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
-H "Content-Type: application/json" \
-X POST "http://${ES_ENDPOINT#http://}/_cluster/allocation/explain?pretty" \
-d '{}'
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
"http://${ES_ENDPOINT#http://}/_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason&s=state"
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
"http://${ES_ENDPOINT#http://}/_cluster/pending_tasks?pretty"
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
"http://${ES_ENDPOINT#http://}/_nodes/stats/thread_pool?pretty"
Query / write performance (MUST) — recommended set
Include
when
heap / GC / breaker rules fired in Step 1 or
shows concern — read
transient and
persistent /
.
bash
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
"http://${ES_ENDPOINT#http://}/_nodes/hot_threads?threads=3"
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
"http://${ES_ENDPOINT#http://}/_nodes/stats/breaker?pretty"
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
"http://${ES_ENDPOINT#http://}/_cluster/settings?include_defaults=true&pretty"
and
GET /_nodes/stats/thread_pool
are also listed under
Red / Yellow (MUST) above—one call each per session when both sections apply. If you run
only this performance block, add those two
lines from that block.
Resource anomalies without a closed loop (SHOULD) — recommended set
bash
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
"http://${ES_ENDPOINT#http://}/_cat/nodes?v&s=cpu:desc&h=name,ip,cpu,heap.percent,ram.percent,load_1m"
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
"http://${ES_ENDPOINT#http://}/_nodes/stats/jvm?pretty"
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
"http://${ES_ENDPOINT#http://}/_cat/allocation?v&bytes=gb"
GET /_cluster/settings?include_defaults=true
also appears under
Query / write performance (MUST) above—reuse one response when both blocks apply. If you run
only this SHOULD block, add the same
line from that block.
Protocol sanity (avoid
): usually
http/https scheme mismatch on
— fix scheme/port and retry.
Scenario → endpoint index: references/es-api-catalog.md.
8. Diagnostic coverage
The knowledge base covers 48+ health-event-style rules and chained scenarios (e.g. disk pressure → allocation → Red).
Per-category counts, P0/P1/P2 mix, and event codes: references/health-events-catalog.md — scenario runbooks:
(index:
references/README.md).
9. Best practices
- Read-only: no mutating control-plane APIs; no teardown.
- Layered + evidence-bound: scan → SOP depth; every conclusion cites metrics/logs/events; if ES is unreachable, state limits (es-api-call-failures.md).
- Priority vs narrative: P0→P2 for urgency; Incident timeline when multiple dimensions differ in time (Step 4). Credentials / TLS / parameters: §1–2 and §4.
- Green is not “all clear” — watermarks, blocks, mis-set limits still matter; MUST + reachable ES: do not skip §5/§7 evidence because the cluster is Green or OpenAPI “explains” symptoms.
- Thread-pool : cumulative unless you show a delta — sop-query-thread-pool.md §1–2; write/bulk: sop-write-performance.md §2.
10. Reference links
- references/verification-method.md: Verification (how to validate diagnosis; metrics, APIs, workflows)
- references/report-template.md: Structured diagnosis report skeleton
- Language map (reference assets and runbooks; English in this repo)
- references/ram-policies.md: RAM policy list
- references/acceptance-criteria.md: Correct/incorrect patterns and acceptance (includes credential and safety anti-patterns)
- references/cli-installation-guide.md: Aliyun CLI installation
- references/es-api-catalog.md: Elasticsearch REST API catalog
- references/health-events-catalog.md: Health event catalog
- Scenario SOPs (e.g. sop-activating-change-stuck.md for / change stuck, cross-layer root cause)
- references/es-api-diagnosis-strategy.md: Elasticsearch API diagnosis strategy