ai-nlp-analytics

AI NLP Analytics

Use When

Text analytics using LLM APIs — sentiment analysis, customer feedback classification, document entity extraction, multi-language support (English/Luganda/Swahili), feedback aggregation, and NLP feature implementation for PHP/Android/iOS. Sources...
The task needs reusable judgment, domain constraints, or a proven workflow rather than ad hoc advice.

Do Not Use When

The task is unrelated to `ai-nlp-analytics` or would be better handled by a more specific companion skill.
The request only needs a trivial answer and none of this skill's constraints or references materially help.

Required Inputs

Gather relevant project context, constraints, and the concrete problem to solve.
Confirm the desired deliverable: design, code, review, migration plan, audit, or documentation.

Workflow

Read this `SKILL.md` first, then load only the referenced deep-dive files that are necessary for the task.
Apply the ordered guidance, checklists, and decision rules in this skill instead of cherry-picking isolated snippets.
Produce the deliverable with assumptions, risks, and follow-up work made explicit when they matter.

Quality Standards

Keep outputs execution-oriented, concise, and aligned with the repository's baseline engineering standards.
Preserve compatibility with existing project conventions unless the skill explicitly requires a stronger standard.
Prefer deterministic, reviewable steps over vague advice or tool-specific magic.

Anti-Patterns

Treating examples as copy-paste truth without checking fit, constraints, or failure modes.
Loading every reference file by default instead of using progressive disclosure.

Outputs

A concrete result that fits the task: implementation guidance, review findings, architecture decisions, templates, or generated artifacts.
Clear assumptions, tradeoffs, or unresolved gaps when the task cannot be completed from available context alone.
References used, companion skills, or follow-up actions when they materially improve execution.

Evidence Produced

| Category | Artifact | Format | Example |
|---|---|---|---|
| Correctness | NLP analytics evaluation | Markdown doc covering sentiment, classification, and entity-extraction accuracy on a fixed eval set | `docs/ai/nlp-eval-2026-04-16.md` |

References

Use the links and companion skills already referenced in this file when deeper context is needed.

What NLP Analytics Does

Natural Language Processing (NLP) analytics transforms unstructured text — feedback, comments, messages, documents, forms — into structured, actionable insights. Using LLM APIs, you can perform sophisticated NLP without training custom models.

Use cases for SaaS products:

Analyse parent/patient/customer feedback automatically.
Classify support tickets or complaints by type and urgency.
Extract key entities from uploaded documents (invoices, receipts, forms).
Summarise free-text notes into structured records.
Detect sentiment in survey responses across thousands of users.

Feature 1: Sentiment Analysis

Classify the emotional tone of text as Positive, Neutral, or Negative. Apply to: feedback forms, app reviews, survey responses, support messages.

Prompt Template

You are a sentiment analysis engine for a business management system.
Classify the sentiment of each piece of text.

Input: array of { id, text, source, language }
Output — strict JSON array:
[
  {
    "id": <string>,
    "sentiment": "positive|neutral|negative",
    "intensity": "strong|moderate|mild",
    "key_phrase": "<the phrase that most drives the sentiment, max 8 words>",
    "language_detected": "<ISO 639-1 code>"
  }
]

Rules:
- Detect language automatically; do not require English input.
- Do not infer sentiment from punctuation alone — read meaning.
- If text is too short to judge (< 3 words), return sentiment: "neutral", intensity: "mild".
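
The template asks for strict JSON, but a model reply can still be malformed or use labels outside the allowed sets. A minimal Python sketch of a validation pass follows; the function name `parse_sentiment_output` and the choice to drop invalid rows (rather than retry the call) are assumptions for illustration, not part of this skill.

```python
import json

ALLOWED_SENTIMENT = {"positive", "neutral", "negative"}
ALLOWED_INTENSITY = {"strong", "moderate", "mild"}


def parse_sentiment_output(raw: str) -> list[dict]:
    """Parse the model's reply and keep only rows that match the template."""
    rows = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    if not isinstance(rows, list):
        raise ValueError("expected a JSON array")
    valid = []
    for row in rows:
        if not isinstance(row, dict):
            continue
        if row.get("sentiment") not in ALLOWED_SENTIMENT:
            continue
        if row.get("intensity") not in ALLOWED_INTENSITY:
            continue
        # Enforce the 8-word cap on key_phrase rather than trusting the model.
        phrase = str(row.get("key_phrase", ""))
        row["key_phrase"] = " ".join(phrase.split()[:8])
        valid.append(row)
    return valid
```

Rows that fail validation are silently skipped here; a production pipeline would more likely log them or queue them for a retry.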

Aggregation Query (PHP/Laravel)

php

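// Aggregate sentiment results by tenant for the dashboard.
// SUM(COUNT(*)) OVER () is a window function, so this needs MySQL 8.0 or later.
$summary = DB::table('nlp_results')
    ->where('tenant_id', $tenantId)
    ->where('nlp_task', 'sentiment')  // other tasks leave sentiment NULL
    ->where('period', $period)
    ->selectRaw('
        sentiment,
        COUNT(*) as count,
        ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 1) as pct
    ')
    ->groupBy('sentiment')
    ->get();

// Store individual results. The nlp_results schema has no intensity or
// key_phrase columns, so those values are kept inside result_json.
NLPResult::create([
    'tenant_id'   => $tenantId,
    'source_type' => 'parent_feedback',
    'source_id'   => $feedbackId,
    'nlp_task'    => 'sentiment',
    'result_json' => json_encode($result),  // pass $result directly if the model casts this column to array
    'sentiment'   => $result['sentiment'],
    'language'    => $result['language_detected'] ?? null,
    'period'      => now()->format('Y-m'),
]);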

Dashboard Display

Feedback Sentiment — This Term
Positive  ████████████░░░░  74%  (148 responses)
Neutral   ███░░░░░░░░░░░░░  18%  (36 responses)
Negative  ██░░░░░░░░░░░░░░   8%  (16 responses)

Top Negative Themes:
- "Fees too high" (6 mentions)
- "Poor communication from teachers" (4 mentions)
- "Long waiting times at the clinic" (3 mentions)
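
The bar rows in the mock-up above can be derived from raw sentiment counts. The Python sketch below is one way to do it; the function name `render_sentiment_bars`, the 16-character bar width, and rounding the filled portion up (inferred from the mock-up, where 8% still shows two blocks) are assumptions, not a specified rendering rule.

```python
def render_sentiment_bars(counts: dict[str, int], width: int = 16) -> list[str]:
    """Return one text row per sentiment label, in dashboard order."""
    total = sum(counts.values())
    lines = []
    for label in ("positive", "neutral", "negative"):
        n = counts.get(label, 0)
        pct = round(n * 100 / total) if total else 0
        # Integer ceiling division, so small non-zero shares stay visible.
        filled = -(-n * width // total) if total else 0
        bar = "█" * filled + "░" * (width - filled)
        lines.append(f"{label.capitalize():<9} {bar}  {pct:>2}%  ({n} responses)")
    return lines


for line in render_sentiment_bars({"positive": 148, "neutral": 36, "negative": 16}):
    print(line)
```

With the counts from the mock-up (148, 36 and 16 out of 200) this reproduces the three rows shown above.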

Feature 2: Text Classification

Categorise incoming text into predefined business categories. Apply to: support tickets, expense descriptions, complaint types, document types.

Prompt Template

You are a text classification engine.
Classify each item into exactly one category from the provided list.

Categories: [<list from caller>]

Input: array of { id, text }
Output — strict JSON array:
[
  {
    "id": <string>,
    "category": "<one of the provided categories>",
    "confidence": "high|medium|low",
    "secondary_category": "<second best category or null>"
  }
]

If the text does not fit any category, use the category: "uncategorised".
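
Because the category list comes from the caller, the reply can be checked against it before anything is stored. The following Python sketch shows one possible check; the function name `normalise_classifications` and the policy of downgrading confidence to "low" when the model returns an unknown label are assumptions made for illustration.

```python
import json

FALLBACK = "uncategorised"
CONFIDENCE_LEVELS = {"high", "medium", "low"}


def normalise_classifications(raw: str, categories: list[str]) -> list[dict]:
    """Coerce each row of the model's reply onto the caller's category list."""
    allowed = set(categories) | {FALLBACK}
    cleaned = []
    for row in json.loads(raw):
        category = row.get("category")
        confidence = row.get("confidence")
        if category not in allowed:
            # The model invented a label: fall back and distrust the row.
            category, confidence = FALLBACK, "low"
        elif confidence not in CONFIDENCE_LEVELS:
            confidence = "low"
        secondary = row.get("secondary_category")
        if secondary not in allowed or secondary == category:
            secondary = None
        cleaned.append({
            "id": str(row.get("id")),
            "category": category,
            "confidence": confidence,
            "secondary_category": secondary,
        })
    return cleaned
```

Rows that land in "uncategorised" are a useful signal that the category list itself may need revisiting.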

Domain Category Examples

Support tickets (school):

["Fee query", "Grade query", "Attendance query", "Technical issue",
 "Complaint — teacher", "Complaint — facilities", "Admission enquiry", "Other"]

Expense classification (ERP):

["Travel", "Accommodation", "Meals", "Office supplies", "IT equipment",
 "Professional services", "Utilities", "Marketing", "Miscellaneous"]

Healthcare complaints:

["Wait time", "Staff conduct", "Treatment quality", "Billing",
 "Facility cleanliness", "Medication", "Communication", "Other"]

Bulk Classification Cost

Processing 500 support tickets per month:

Input: ~200 tokens per ticket × 500 = 100,000 tokens
Output: ~30 tokens per ticket × 500 = 15,000 tokens
With Haiku at $0.80 per 1M input tokens and $4.00 per 1M output tokens: (100K × $0.80 + 15K × $4.00) / 1M = $0.08 + $0.06 = $0.14/month
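
The same arithmetic can be wrapped in a small helper so other volumes or price points are easy to compare. This is a Python sketch; the function name `monthly_cost_usd` is hypothetical, and the prices are simply the figures quoted above, which should be checked against the provider's current price list.

```python
def monthly_cost_usd(
    items: int,
    input_tokens_per_item: int,
    output_tokens_per_item: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimate monthly spend from per-item token counts and per-1M-token prices."""
    input_cost = items * input_tokens_per_item * input_price_per_mtok / 1_000_000
    output_cost = items * output_tokens_per_item * output_price_per_mtok / 1_000_000
    return round(input_cost + output_cost, 2)


# 500 tickets, ~200 input and ~30 output tokens each, at the prices quoted above.
print(monthly_cost_usd(500, 200, 30, 0.80, 4.00))  # 0.14
```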

Feature 3: Named Entity Extraction

Pull structured data from free-form documents. Apply to: uploaded invoices, receipts, ID documents, lab reports, application forms.

Prompt Template — Invoice Extraction

You are a document intelligence engine.
Extract structured data from the provided invoice or receipt text.

Output — strict JSON:
{
  "vendor_name": "<string or null>",
  "vendor_tin": "<string or null>",
  "invoice_number": "<string or null>",
  "invoice_date": "<YYYY-MM-DD or null>",
  "due_date": "<YYYY-MM-DD or null>",
  "currency": "<ISO 4217 code>",
  "subtotal": <float or null>,
  "tax_amount": <float or null>,
  "total_amount": <float or null>,
  "line_items": [
    { "description": "<string>", "quantity": <float>, "unit_price": <float>, "amount": <float> }
  ],
  "extraction_confidence": "high|medium|low",
  "flags": ["<any field that could not be reliably extracted>"]
}

If a field is not present in the document, return null.
Do not invent or guess values — only extract what is explicitly stated.
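
Extracted invoices lend themselves to cheap arithmetic cross-checks before they reach accounting records. The Python sketch below illustrates the idea; the function name `check_invoice`, the 0.01 tolerance, and the assumption that `subtotal` excludes tax are illustrative choices, not rules from this skill.

```python
import json
from datetime import date


def check_invoice(raw: str, tolerance: float = 0.01) -> tuple[dict, list[str]]:
    """Parse an extraction result and list any internal inconsistencies."""
    invoice = json.loads(raw)
    problems = []
    for field in ("invoice_date", "due_date"):
        value = invoice.get(field)
        if value is not None:
            try:
                date.fromisoformat(value)
            except (TypeError, ValueError):
                problems.append(f"{field}: not YYYY-MM-DD")
    items = invoice.get("line_items") or []
    subtotal = invoice.get("subtotal")
    tax = invoice.get("tax_amount")
    total = invoice.get("total_amount")
    if items and subtotal is not None:
        items_sum = sum((item.get("amount") or 0) for item in items)
        if abs(items_sum - subtotal) > tolerance:
            problems.append("line_items do not sum to subtotal")
    if subtotal is not None and tax is not None and total is not None:
        if abs(subtotal + tax - total) > tolerance:
            problems.append("subtotal + tax_amount != total_amount")
    return invoice, problems
```

A non-empty problem list is a reasonable trigger for routing the document to manual review instead of storing it as-is.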

Photo-to-Text Pipeline (Android/iOS)

kotlin

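import android.util.Log
import com.google.mlkit.vision.common.InputImage
import com.google.mlkit.vision.text.TextRecognition
import com.google.mlkit.vision.text.latin.TextRecognizerOptions

// Android: OCR via ML Kit, then send the text to the AI Service
val recognizer = TextRecognition.getClient(TextRecognizerOptions.DEFAULT_OPTIONS)
recognizer.process(InputImage.fromBitmap(bitmap, 0))  // 0 = rotation in degrees
    .addOnSuccessListener { visionText ->
        viewModel.extractInvoiceData(visionText.text)  // calls AI Service
    }
    .addOnFailureListener { e ->
        Log.e("InvoiceOcr", "Text recognition failed", e)  // do not drop OCR errors silently
    }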

Feature 4: Feedback Aggregation and Theme Detection

Identify recurring themes across large volumes of free-text feedback. Useful for end-of-term parent surveys, patient satisfaction, customer reviews.

Prompt Template

You are a qualitative research analyst.
Read the following responses and identify the top themes expressed.

Responses: [<array of text responses>]

Output — strict JSON:
{
  "total_responses_analysed": <int>,
  "themes": [
    {
      "theme": "<short label, max 5 words>",
      "description": "<one sentence explaining the theme>",
      "frequency": "<approximate number of responses mentioning this>",
      "sentiment": "positive|negative|mixed",
      "representative_quotes": ["<verbatim quote 1>", "<verbatim quote 2>"]
    }
  ],
  "overall_summary": "<2–3 sentence executive summary>",
  "top_recommended_action": "<one sentence — most impactful thing to address>"
}

Identify 3–7 distinct themes. Do not overlap themes.

Batch size guidance: Process 30–50 responses per API call. For 500 responses, run 10–17 calls nightly.
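
The batch guidance above amounts to slicing the response list into fixed-size chunks. A short Python sketch follows; the generator name `batches` and the sample data are hypothetical.

```python
from collections.abc import Iterator


def batches(responses: list[str], batch_size: int = 50) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size responses."""
    for start in range(0, len(responses), batch_size):
        yield responses[start:start + batch_size]


responses = [f"response {i}" for i in range(500)]
print(len(list(batches(responses, 50))))  # 10
print(len(list(batches(responses, 30))))  # 17
```

Each batch yields its own theme list, so a nightly job would still need a merge step across batches; that step is not sketched here.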

Feature 5: Multi-Language Support

East African clients write in English, Luganda, Swahili, and mixed code-switching. LLMs handle this natively — no translation step needed.

In every NLP prompt, add:

Language handling:
- Accept input in any language including Luganda, Swahili, and East African English varieties.
- Output must always be in [target_language — default English].
- Do not transliterate names or places.

Detected language handling (PHP):

php

$languageDetected = $nlpResult['language_detected']; // 'lg' = Luganda, 'sw' = Swahili

// Store for analytics — track which languages clients use
NLPResult::create([
    'language' => $languageDetected,
    // ...
]);

// Show language breakdown on admin dashboard
// "Feedback received: 62% English | 24% Luganda | 14% Swahili"
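
The dashboard line in the last comment can be produced from the stored language codes. The Python sketch below shows one way; the function name `language_breakdown` and the `LANGUAGE_NAMES` mapping are assumptions for illustration.

```python
from collections import Counter

# ISO 639-1 codes mapped to display names; extend as new languages appear.
LANGUAGE_NAMES = {"en": "English", "lg": "Luganda", "sw": "Swahili"}


def language_breakdown(codes: list[str]) -> str:
    """Summarise detected-language codes as a percentage line, largest first."""
    counts = Counter(codes)
    total = sum(counts.values())
    parts = []
    for code, n in counts.most_common():
        name = LANGUAGE_NAMES.get(code, code)  # unknown codes are shown as-is
        parts.append(f"{round(n * 100 / total)}% {name}")
    return "Feedback received: " + " | ".join(parts)


print(language_breakdown(["en"] * 62 + ["lg"] * 24 + ["sw"] * 14))
```

Because each share is rounded independently, the displayed percentages may not always add up to exactly 100.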

NLP Analytics Storage Schema

sql

CREATE TABLE nlp_results (
    id              BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    tenant_id       BIGINT UNSIGNED NOT NULL,
    source_type     VARCHAR(64) NOT NULL,  -- 'feedback', 'ticket', 'invoice', 'survey'
    source_id       BIGINT UNSIGNED NOT NULL,
    nlp_task        VARCHAR(32) NOT NULL,  -- 'sentiment', 'classification', 'extraction', 'themes'
    result_json     JSON NOT NULL,
    sentiment       ENUM('positive','neutral','negative') NULL,
    category        VARCHAR(128) NULL,
    confidence      ENUM('high','medium','low') NULL,
    language        CHAR(5) NULL,
    period          CHAR(7) NOT NULL,
    created_at      DATETIME DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_tenant_period  (tenant_id, period),
    INDEX idx_source         (source_type, source_id),
    INDEX idx_sentiment      (tenant_id, sentiment, period)
);

Anti-Patterns

Never run NLP on personal health data without DPPA-compliant scrubbing first.
Never show verbatim quotes in theme reports without confirming the user has permission to see that feedback (RBAC check).
Never classify into too many categories (> 10) — accuracy degrades.
Never skip the validation step: parse the JSON output before storing it.
Never run entity extraction on an image without OCR first — the model needs text input, not an image file, unless using a vision-capable model.

See also:

`ai-feature-spec`: Prompt design standards and output validation
`ai-security`: PII scrubbing before NLP on personal data
`ai-predictive-analytics`: Structured data prediction (classification, regression)
`ai-analytics-dashboards`: Displaying sentiment and theme analytics
`ai-cost-modeling`: Token cost for batch NLP processing
