conference-speaker-scraper

Conference Speaker Scraper

Extract speaker names, titles, companies, and bios from conference website /speakers pages. Supports direct HTML scraping with multiple extraction strategies, plus Apify fallback for JS-heavy sites.

Quick Start

The only dependency is `requests` (install with `pip install requests`). No API key is needed for direct scraping mode.

```bash
# Scrape speakers from a conference page
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py \
  --url "https://example.com/speakers"

# Use Apify for JS-heavy sites
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py \
  --url "https://example.com/speakers" --mode apify

# Custom conference name (otherwise inferred from URL)
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py \
  --url "https://example.com/speakers" --conference "Sage Future 2026"

# Output formats
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py --url URL --output json  # default
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py --url URL --output csv
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py --url URL --output summary
```

How It Works

Direct Mode (default)

Fetches the page HTML and tries multiple extraction strategies in order, using whichever returns the most results:

Strategy A -- CSS class hints: Looks for speaker cards with class names containing "speaker", "presenter", "faculty", "panelist", "team-member"
Strategy B -- Heading + paragraph patterns: Looks for repeated `<h2>`/`<h3>` + `<p>` structures
Strategy C -- JSON-LD structured data: Checks for `<script type="application/ld+json">` with speaker data
Strategy D -- Platform embeds: Detects Sched.com/Sessionize patterns used by many conferences
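
The "try each strategy, keep the largest result" idea can be sketched as follows. This is a minimal illustration, not the script's actual code: the function names, the regular expressions, and the choice of `Person`/`jobTitle` as JSON-LD fields are all assumptions, and only Strategies B and C are shown.

```python
import json
import re


def extract_json_ld(html: str) -> list:
    """Strategy C sketch: collect schema.org Person entries from JSON-LD blocks."""
    pattern = re.compile(
        r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
        re.DOTALL | re.IGNORECASE,
    )
    speakers = []
    for block in pattern.findall(html):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole page
        items = data if isinstance(data, list) else [data]
        for item in items:
            # Assumption: speakers are marked up as schema.org Person objects.
            if isinstance(item, dict) and item.get("@type") == "Person":
                speakers.append(
                    {"name": item.get("name", ""), "title": item.get("jobTitle", "")}
                )
    return speakers


def extract_heading_paragraph(html: str) -> list:
    """Strategy B sketch: an <h2>/<h3> directly followed by a <p> is name + title.

    Headings that contain nested tags are skipped to keep the pattern simple.
    """
    pattern = re.compile(
        r"<h[23][^>]*>([^<]+)</h[23]>\s*<p\b[^>]*>(.*?)</p>",
        re.DOTALL | re.IGNORECASE,
    )
    return [
        {"name": name.strip(), "title": re.sub(r"<[^>]+>", "", text).strip()}
        for name, text in pattern.findall(html)
    ]


def extract_speakers(html: str) -> list:
    """Run every strategy in order and keep the one that found the most speakers."""
    strategies = [extract_json_ld, extract_heading_paragraph]
    results = [strategy(html) for strategy in strategies]
    # max() returns the first maximal item, so earlier strategies win ties.
    return max(results, key=len)
```

Regular expressions keep the sketch dependency-free; a real implementation would more likely use an HTML parser, since regex matching breaks easily on nested or malformed markup.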

Apify Mode

Uses the `apify/cheerio-scraper` actor with a custom page function that targets common speaker card selectors. It follows the standard POST/poll/GET dataset pattern: start a run, poll until it finishes, then fetch the dataset items.
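
The POST/poll/GET pattern might look roughly like the sketch below. It is not taken from the script: the endpoint paths and status names reflect my understanding of the Apify API v2 and should be checked against the official reference, and `run_actor`, `http_json`, and the injectable `request`/`sleep` parameters are hypothetical names introduced for illustration.

```python
import json
import time
import urllib.request
from typing import Optional

API = "https://api.apify.com/v2"
TERMINAL = ("SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT")


def http_json(method: str, url: str, payload: Optional[dict] = None):
    """Send a JSON request with the standard library and decode the JSON reply."""
    body = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(
        url, data=body, method=method, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_actor(actor_input: dict, token: str, timeout: int = 300,
              poll_interval: float = 5.0, request=http_json, sleep=time.sleep):
    """Start a cheerio-scraper run, wait for it to finish, return dataset items."""
    # 1. POST: start the actor run with the given input.
    run = request(
        "POST", f"{API}/acts/apify~cheerio-scraper/runs?token={token}", actor_input
    )["data"]

    # 2. Poll: re-fetch the run until it reaches a terminal status or we give up.
    deadline = time.monotonic() + timeout
    while run["status"] not in TERMINAL:
        if time.monotonic() > deadline:
            raise TimeoutError(f"Apify run {run['id']} exceeded {timeout}s")
        sleep(poll_interval)
        run = request("GET", f"{API}/actor-runs/{run['id']}?token={token}")["data"]

    if run["status"] != "SUCCEEDED":
        raise RuntimeError(f"Apify run ended with status {run['status']}")

    # 3. GET: download the items the run wrote to its default dataset.
    return request(
        "GET", f"{API}/datasets/{run['defaultDatasetId']}/items?token={token}"
    )
```

Passing `request` and `sleep` as parameters is a design choice for this sketch only: it lets the polling logic be exercised with stubs, without network access or real waiting.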

CLI Reference

| Flag | Default | Description |
| --- | --- | --- |
| `--url` | required | Conference speakers page URL |
| `--conference` | inferred | Conference name (otherwise inferred from URL domain) |
| `--mode` | direct | `direct` (HTML scraping) or `apify` (Apify cheerio scraper) |
| `--output` | json | Output format: `json`, `csv`, or `summary` |
| `--token` | env var | Apify token (only needed for apify mode) |
| `--timeout` | 300 | Max seconds for Apify run |
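
For readers who want to see how these flags could fit together, here is a minimal `argparse` sketch. The flag names and defaults come from the table above; everything else is an assumption, in particular the environment variable name `APIFY_API_TOKEN` and the `infer_conference` heuristic (first label of the domain), neither of which is confirmed by the source.

```python
import argparse
import os
from urllib.parse import urlparse


def infer_conference(url: str) -> str:
    """Guess a conference name from the URL's domain (assumed heuristic)."""
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host.split(".")[0]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Scrape conference speakers")
    parser.add_argument("--url", required=True, help="Conference speakers page URL")
    parser.add_argument("--conference", default=None,
                        help="Conference name (inferred from the URL if omitted)")
    parser.add_argument("--mode", choices=["direct", "apify"], default="direct")
    parser.add_argument("--output", choices=["json", "csv", "summary"], default="json")
    # Hypothetical variable name; the source only says the default is an env var.
    parser.add_argument("--token", default=os.environ.get("APIFY_API_TOKEN"))
    parser.add_argument("--timeout", type=int, default=300,
                        help="Max seconds for Apify run")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.conference is None:
        args.conference = infer_conference(args.url)
    if args.mode == "apify" and not args.token:
        parser.error("--token (or the token environment variable) is required for apify mode")
    return args
```

Calling `parse_args(["--url", "https://example.com/speakers"])` yields the defaults from the table, with the conference name filled in from the domain.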

Output Schema

```json
{
  "name": "Jane Smith",
  "title": "VP of Finance",
  "company": "Acme Corp",
  "bio": "Jane leads the finance transformation at...",
  "linkedin_url": "https://linkedin.com/in/janesmith",
  "image_url": "https://...",
  "conference": "Sage Future 2026",
  "source_url": "https://sagefuture2026.com/speakers"
}
```
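
As an illustration of how records in this schema could be rendered in the three `--output` formats, consider the sketch below. The field order follows the schema above; the `format_speakers` function and the exact layout of the `summary` format are assumptions, since the source does not show what the script emits.

```python
import csv
import io
import json

# Column order taken from the output schema.
FIELDS = ["name", "title", "company", "bio",
          "linkedin_url", "image_url", "conference", "source_url"]


def format_speakers(speakers: list, output: str = "json") -> str:
    """Render speaker records as JSON, CSV, or a short text summary."""
    if output == "json":
        return json.dumps(speakers, indent=2, ensure_ascii=False)
    if output == "csv":
        buf = io.StringIO()
        # Missing fields become empty cells; unknown keys are ignored.
        writer = csv.DictWriter(buf, fieldnames=FIELDS,
                                extrasaction="ignore", lineterminator="\n")
        writer.writeheader()
        writer.writerows(speakers)
        return buf.getvalue()
    if output == "summary":
        # Assumed layout: a count line followed by one line per speaker.
        lines = [f"{len(speakers)} speaker(s)"]
        lines += [
            f"- {s.get('name', '')}, {s.get('title', '')} at {s.get('company', '')}"
            for s in speakers
        ]
        return "\n".join(lines)
    raise ValueError(f"unknown output format: {output}")
```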

Cost

Direct mode: Free (no API, no tokens)
Apify mode: Uses `apify/cheerio-scraper` -- minimal Apify credits

Testing Notes

HTML scraping is inherently fragile across conference sites. The multi-strategy approach maximizes coverage, but JS-heavy sites will require Apify mode. When direct scraping returns 0 results, try `--mode apify`.
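
If you drive the scraper from your own code, the same advice can be automated. The sketch below is a hypothetical wrapper, not part of the script: `scrape_with_fallback` and its two callable parameters are illustrative names, standing in for whatever functions perform the direct and Apify scrapes.

```python
def scrape_with_fallback(url, direct, apify):
    """Try direct scraping first; use Apify only when it finds nothing.

    `direct` and `apify` are callables that take a URL and return a list of
    speaker dicts. Returns the speakers plus the name of the mode that was used.
    """
    speakers = direct(url)
    if speakers:
        return speakers, "direct"
    # Direct mode found nothing, which often indicates a JS-rendered page.
    return apify(url), "apify"
```

Trying the free mode first means Apify credits are spent only on pages that actually need them.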
