conference-speaker-scraper
Conference Speaker Scraper
Extract speaker names, titles, companies, and bios from conference website /speakers pages. Supports direct HTML scraping with multiple extraction strategies, plus Apify fallback for JS-heavy sites.
Quick Start
The only dependency is `requests` (install it with `pip install requests`). No API key is needed for direct scraping mode.
```bash
# Scrape speakers from a conference page
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py \
  --url "https://example.com/speakers"

# Use Apify for JS-heavy sites
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py \
  --url "https://example.com/speakers" --mode apify

# Custom conference name (otherwise inferred from URL)
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py \
  --url "https://example.com/speakers" --conference "Sage Future 2026"

# Output formats
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py --url URL --output json # default
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py --url URL --output csv
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py --url URL --output summary
```
How It Works
Direct Mode (default)
Fetches the page HTML and tries multiple extraction strategies in order, using whichever returns the most results:
- Strategy A -- CSS class hints: Looks for speaker cards with class names containing "speaker", "presenter", "faculty", "panelist", "team-member"
- Strategy B -- Heading + paragraph patterns: Looks for repeated `<h2>`/`<h3>` + `<p>` structures
- Strategy C -- JSON-LD structured data: Checks for `<script type="application/ld+json">` with speaker data
- Strategy D -- Platform embeds: Detects Sched.com/Sessionize patterns used by many conferences
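The "try every strategy, keep the largest result set" idea can be sketched as follows. This is a minimal illustration using only the standard library, not the code in `scrape_speakers.py`: the function names, the two simplified strategies, and the assumption that JSON-LD speakers appear as schema.org `performer` entries are all hypothetical. One known limitation of this sketch is that a wrapper element whose class also contains a hint (for example `speakers-grid`) would be read as a single card.

```python
import json
import re
from html.parser import HTMLParser

CLASS_HINTS = ("speaker", "presenter", "faculty", "panelist", "team-member")
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link", "source", "wbr"}


class _CardParser(HTMLParser):
    """Collects the text fragments of elements whose class contains a hint."""

    def __init__(self):
        super().__init__()
        self.cards = []   # one list of text fragments per matched card
        self._depth = 0   # nesting depth inside the current card (0 = outside)
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:  # void tags never get a closing tag
            return
        classes = (dict(attrs).get("class") or "").lower()
        if self._depth:
            self._depth += 1
        elif any(hint in classes for hint in CLASS_HINTS):
            self._depth, self._parts = 1, []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.cards.append(self._parts)

    def handle_data(self, data):
        if self._depth and data.strip():
            self._parts.append(data.strip())


def extract_by_class(html):
    """Strategy A (simplified): first text in a card is the name, second the title."""
    parser = _CardParser()
    parser.feed(html)
    return [{"name": p[0], "title": p[1] if len(p) > 1 else ""}
            for p in parser.cards if p]


LD_JSON_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.IGNORECASE | re.DOTALL,
)


def extract_json_ld(html):
    """Strategy C (simplified): read schema.org `performer` entries from JSON-LD."""
    speakers = []
    for block in LD_JSON_RE.findall(html):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole page
        for item in data if isinstance(data, list) else [data]:
            if not isinstance(item, dict):
                continue
            performers = item.get("performer") or []
            if isinstance(performers, dict):
                performers = [performers]
            for person in performers:
                if isinstance(person, dict) and person.get("name"):
                    speakers.append({"name": person["name"],
                                     "title": person.get("jobTitle", "")})
    return speakers


def extract_speakers(html, strategies=(extract_by_class, extract_json_ld)):
    """Run every strategy in order and keep whichever returns the most results."""
    best = []
    for strategy in strategies:
        found = strategy(html)
        if len(found) > len(best):  # on a tie, the earlier strategy wins
            best = found
    return best
```

Comparing result counts rather than stopping at the first non-empty strategy means a page with one stray "speaker" class does not mask a richer JSON-LD block.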
Apify Mode
Uses the `apify/cheerio-scraper` actor with a custom page function that targets common speaker card selectors. It follows the standard Apify pattern: POST to start the run, poll the run status until it finishes, then GET the dataset items.
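The POST/poll/GET pattern can be sketched like this. It is an illustration under assumptions rather than the script's actual code: the helper names (`http_json`, `run_actor`), the 5-second poll interval, and passing the token as a query parameter are choices made for this example, while the endpoint paths follow the public Apify API v2. The `http` and `sleep` parameters exist only so the flow can be exercised without network access.

```python
import json
import time
import urllib.request

API = "https://api.apify.com/v2"
ACTOR = "apify~cheerio-scraper"  # API form of apify/cheerio-scraper ("/" becomes "~")


def http_json(method, url, payload=None):
    """Minimal JSON-over-HTTP helper built on the standard library."""
    body = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(url, data=body, method=method,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_actor(actor_input, token, timeout=300, interval=5,
              http=http_json, sleep=time.sleep):
    """Start an actor run, wait for it to finish, and return its dataset items."""
    # 1. POST: start the run with the actor input (start URLs, page function, ...).
    run = http("POST", f"{API}/acts/{ACTOR}/runs?token={token}", actor_input)["data"]

    # 2. Poll: re-read the run until it leaves READY/RUNNING or time runs out.
    waited = 0
    while run["status"] in ("READY", "RUNNING"):
        if waited >= timeout:
            raise TimeoutError(
                f"Apify run {run['id']} still {run['status']} after {timeout}s")
        sleep(interval)
        waited += interval
        run = http("GET", f"{API}/actor-runs/{run['id']}?token={token}")["data"]

    # Any status other than SUCCEEDED is treated as a failure in this sketch.
    if run["status"] != "SUCCEEDED":
        raise RuntimeError(f"Apify run ended with status {run['status']}")

    # 3. GET: download the items the run wrote to its default dataset.
    return http("GET", f"{API}/datasets/{run['defaultDatasetId']}/items?token={token}")
```

The default `timeout=300` mirrors the 300-second maximum listed in the CLI reference.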
CLI Reference
| Flag | Default | Description |
|---|---|---|
| `--url` | required | Conference speakers page URL |
| `--conference` | inferred | Conference name (otherwise inferred from URL domain) |
| `--mode` | direct | `direct` or `apify` |
| `--output` | json | Output format: `json`, `csv`, `summary` |
| | env var | Apify token (only needed for apify mode) |
| | 300 | Max seconds for Apify run |
Output Schema
```json
{
  "name": "Jane Smith",
  "title": "VP of Finance",
  "company": "Acme Corp",
  "bio": "Jane leads the finance transformation at...",
  "linkedin_url": "https://linkedin.com/in/janesmith",
  "image_url": "https://...",
  "conference": "Sage Future 2026",
  "source_url": "https://sagefuture2026.com/speakers"
}
```
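As a sketch of how records in this shape might be consumed downstream, the snippet below flattens them to CSV and counts speakers per company. It assumes the JSON output is a list of such objects and that missing fields should become empty cells. The helper names are hypothetical, and this is not necessarily the column layout that `--output csv` produces.

```python
import csv
import io
from collections import Counter

# Field order taken from the schema above.
FIELDS = ["name", "title", "company", "bio", "linkedin_url",
          "image_url", "conference", "source_url"]


def to_csv(records):
    """Render speaker records as CSV text, one row per speaker."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for rec in records:
        # Missing or null fields become empty cells rather than raising.
        writer.writerow({field: rec.get(field) or "" for field in FIELDS})
    return buf.getvalue()


def speakers_per_company(records):
    """Count speakers by company, ignoring records with no company."""
    return Counter(rec["company"] for rec in records if rec.get("company"))
```

Building each row from `FIELDS` keeps the CSV header stable even when a site yields only a name and title for some speakers.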
Cost
- Direct mode: Free (no API, no tokens)
- Apify mode: Uses `apify/cheerio-scraper` -- minimal Apify credits
Testing Notes
HTML scraping is inherently fragile across conference sites. The multi-strategy approach maximizes coverage, but JS-heavy sites will require Apify mode. When direct scraping returns 0 results, try `--mode apify`.
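The "direct first, Apify on zero results" advice can be automated with a small wrapper. This sketch makes several assumptions that the text above does not confirm: that `--output json` prints a JSON list to stdout, that `--mode direct` may be passed explicitly, and that the Apify token is already available through the environment. `run_cli` and `scrape_with_fallback` are hypothetical names.

```python
import json
import subprocess
import sys

SCRIPT = "skills/conference-speaker-scraper/scripts/scrape_speakers.py"


def run_cli(url, mode):
    """Invoke the scraper CLI and parse its output (assumed: a JSON list on stdout)."""
    cmd = [sys.executable, SCRIPT, "--url", url, "--mode", mode, "--output", "json"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


def scrape_with_fallback(url, run=run_cli):
    """Try the free direct mode first; fall back to Apify only on zero results."""
    speakers = run(url, "direct")
    if speakers:
        return speakers, "direct"
    return run(url, "apify"), "apify"
```

Returning the mode alongside the results makes it easy to log how often the paid fallback is actually needed.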