syncfusion-dotnet-smart-data-extraction

Smart Data Extractor — Syncfusion

Overview

Extracts complete document structures from PDF and image files using the Syncfusion SmartDataExtractor library. This skill supports one operational mode: generating C# code for the user's project.

Key Capabilities

  • Document structure extraction: Identify text elements, images, headers, footers, and tables (including regions, header rows, columns, cell boundaries, and merged cells).
  • File format support: Works with PDF documents and common image formats such as JPEG and PNG.
  • Table extraction: Specialized capability to extract tabular data.
  • Form recognition: Detects and processes structured form data.
  • Page-level control: Extract data from specific pages or defined page ranges.
  • Confidence threshold: Results are filtered based on a configurable confidence score (0.0–1.0).

Prerequisites

  • Install required runtime and library packages from NuGet before running extraction.
  • Syncfusion license: provide a LICENSE.txt file or set the SYNCFUSION_LICENSE_KEY environment variable.
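
The license prerequisite can be sketched as a small lookup helper. This is only an illustration: LicenseKeyResolver is a hypothetical name, not part of any Syncfusion API, and the lookup order (LICENSE.txt first, then the environment variable) is an assumption, since no precedence is stated here. The actual registration call should be taken from the Register License section in references/nuget-packages.md and is deliberately not shown.

```csharp
#nullable enable
using System;
using System.IO;

// Hypothetical helper (not a Syncfusion API): finds the license key in one of
// the two places named in the prerequisites.
public static class LicenseKeyResolver
{
    public const string LicenseFileName = "LICENSE.txt";
    public const string EnvVarName = "SYNCFUSION_LICENSE_KEY";

    // Returns the trimmed key, or null when neither source provides one.
    // Assumption: the file is checked before the environment variable.
    public static string? Resolve(string workspaceRoot)
    {
        string licensePath = Path.Combine(workspaceRoot, LicenseFileName);
        if (File.Exists(licensePath))
        {
            string fromFile = File.ReadAllText(licensePath).Trim();
            if (fromFile.Length > 0)
            {
                return fromFile;
            }
        }

        string? fromEnv = Environment.GetEnvironmentVariable(EnvVarName);
        return string.IsNullOrWhiteSpace(fromEnv) ? null : fromEnv.Trim();
    }
}
```

A caller would pass the workspace root, for example LicenseKeyResolver.Resolve(Directory.GetCurrentDirectory()), and stop with a clear message when the result is null.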

Quick Start Examples

Example: Generate Code

User: "Write Program.cs code to extract the data from pdf and save as JSON."
Result: C# code snippet displayed (no files created).

Mode

Mode 1: Generate C# Code for the User's Project (default)

Use this mode when the user wants to view, write, review, refactor, or modify C# code related to Smart Data Extractor processing.
Trigger keywords: "show me how", "how to", "how can I", "how do I", "provide code", "provide an example", "give an example", "demonstrate", "code snippet", "sample code", "example", "sample", "give me", "show me", "Program.cs", "example code", "generate code for", "codesnippet".
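
As a rough illustration of how the trigger keywords could be checked, the sketch below does a case-insensitive substring match. ModeRouter and IsCodeGenerationRequest are hypothetical names introduced for this example; because this mode is the default, a request that matches no keyword would still end up here.

```csharp
using System;
using System.Linq;

// Hypothetical helper: reports whether a request contains any trigger keyword.
public static class ModeRouter
{
    private static readonly string[] TriggerKeywords =
    {
        "show me how", "how to", "how can I", "how do I", "provide code",
        "provide an example", "give an example", "demonstrate", "code snippet",
        "sample code", "example", "sample", "give me", "show me", "Program.cs",
        "example code", "generate code for", "codesnippet"
    };

    // Case-insensitive substring match; IndexOf is used so the code also
    // compiles on .NET Framework, which lacks Contains(string, StringComparison).
    public static bool IsCodeGenerationRequest(string userRequest)
    {
        if (string.IsNullOrWhiteSpace(userRequest))
        {
            return false;
        }

        return TriggerKeywords.Any(keyword =>
            userRequest.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0);
    }
}
```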
Workflow:

Step 1 — Detect Application Type and Suggest Required NuGet Packages

  • Inspect the workspace project files (.csproj, web.config, App.config, Startup.cs, Program.cs, etc.) and use the detection signals table in references/nuget-packages.md to determine the application type.
  • Based on the detected application type, identify the correct NuGet package(s) from references/nuget-packages.md and instruct the user to install them before generating any code. Use ONLY the package IDs and versions listed in references/nuget-packages.md; do not suggest, look up, or infer package names from external sources or common naming conventions.
  • Note: If the user's request is explicitly table-only (asks only to extract table data), recommend only the Table Extractor package listed in references/nuget-packages.md and review the ExtractTable section for the detected application type. Do not recommend or add the broader SmartDataExtractor package unless the user requests non-table extraction or JSON conversion features.
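
The project-file inspection might look something like the sketch below. The signals it reads (the MSBuild properties UseWPF, UseWindowsForms, TargetFrameworkVersion, and TargetFramework) are assumptions chosen for illustration; the detection signals table in references/nuget-packages.md remains the authority. AppPlatformGroup and AppTypeDetector are hypothetical names, and the sketch only sorts a project into the two groups that Step 2 distinguishes.

```csharp
#nullable enable
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical classification matching the two snippet groups used in Step 2.
public enum AppPlatformGroup { WindowsSpecific, CrossPlatform }

public static class AppTypeDetector
{
    // Classifies the XML text of a .csproj file using assumed signals.
    public static AppPlatformGroup Classify(string csprojXml)
    {
        XDocument project = XDocument.Parse(csprojXml);

        // Matching on LocalName covers both SDK-style projects and legacy
        // projects that declare the MSBuild 2003 XML namespace.
        string? Property(string name) => project.Descendants()
            .FirstOrDefault(e => e.Name.LocalName == name)?.Value.Trim();

        bool IsTrue(string name) =>
            string.Equals(Property(name), "true", StringComparison.OrdinalIgnoreCase);

        // WPF and WinForms projects opt in through these properties.
        if (IsTrue("UseWPF") || IsTrue("UseWindowsForms"))
        {
            return AppPlatformGroup.WindowsSpecific;
        }

        // Legacy .NET Framework projects use TargetFrameworkVersion (e.g. v4.7.2).
        if (Property("TargetFrameworkVersion") != null)
        {
            return AppPlatformGroup.WindowsSpecific;
        }

        // SDK-style .NET Framework projects use monikers such as net48 or net472.
        string targetFramework = Property("TargetFramework") ?? string.Empty;
        if (targetFramework.StartsWith("net4", StringComparison.OrdinalIgnoreCase))
        {
            return AppPlatformGroup.WindowsSpecific;
        }

        return AppPlatformGroup.CrossPlatform;
    }
}
```

Calling AppTypeDetector.Classify(File.ReadAllText(pathToCsproj)) would give a first guess to confirm against the reference table before any package is recommended.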

Step 2 — Generate Code from Reference Files Only

Do NOT invent, guess, or suggest any API, method, property, class, or namespace not explicitly present in the reference files.
  • Read the relevant references/*.md file(s) for the requested feature.
  • Build C# code strictly from the APIs and snippets found in those files.
  • Select the correct snippet variant based on the app type detected in Step 1:
    • Windows-specific apps (WinForms, WPF, .NET Framework Console) → use Windows-specific snippets
    • Cross-platform apps (ASP.NET Core, .NET Core/.NET 5+ Console, Blazor, MAUI) → use cross-platform / .Net.Core snippets
  • After the using / namespace lines at the top of the generated code, always insert the license registration block from the Register License section in references/nuget-packages.md.
  • Do not create or run any .csx script.


Code References

All templates and snippets are in the references/ folder:

File                     Contents
document-structure.md    Quick extractor setup and usage snippets
extract-data.md          Examples: ExtractDataAsJson, ExtractDataAsPdfStream, ExtractDataAsPdfDocument, async variants
extract-table.md         Table extraction examples (ExtractTableAsJson)
recognize-forms.md       Form field recognition examples: FormRecognizeOptions, RecognizeFormAsPdfDocument, RecognizeFormAsPdfStream, RecognizeFormAsJson, async variants
data-options.md          Explanation of TableExtractionOptions, FormRecognizeOptions, ConfidenceThreshold, PageRange

Rules

  • Output files go in the ./output/ directory.
  • Use the license key from LICENSE.txt at the workspace root.
  • Do not use any API that is not in the reference files.
  • Only use NuGet package IDs and versions defined in references/nuget-packages.md when recommending or adding packages.
  • For table-only extraction requests, recommend/install only the table extractor package from references/nuget-packages.md for the detected application type.
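
The output-directory rule can be sketched with standard .NET file APIs only. OutputPaths is a hypothetical helper name, and treating ./output/ as relative to a caller-supplied base directory is an assumption.

```csharp
using System.IO;

// Hypothetical helper: keeps generated files inside the ./output/ directory.
public static class OutputPaths
{
    public const string OutputDirectory = "output";

    // Creates <baseDirectory>/output if needed and returns the full path for fileName.
    public static string Resolve(string baseDirectory, string fileName)
    {
        string directory = Path.Combine(baseDirectory, OutputDirectory);

        // CreateDirectory does nothing when the directory already exists.
        Directory.CreateDirectory(directory);

        // GetFileName drops any directory part so the result stays inside output.
        return Path.Combine(directory, Path.GetFileName(fileName));
    }
}
```

Generated code could then write a result with File.WriteAllText(OutputPaths.Resolve(Directory.GetCurrentDirectory(), "result.json"), json), where json would come from an extraction call documented in the reference files.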