conference-speaker-scraper

Conference Speaker Scraper

Extract speaker names, titles, companies, and bios from conference website /speakers pages. Supports direct HTML scraping with multiple extraction strategies, plus Apify fallback for JS-heavy sites.

Quick Start

The only dependency is requests (pip install requests). No API key is needed for direct scraping mode.

Scrape speakers from a conference page

python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py \
  --url "https://example.com/speakers"

Use Apify for JS-heavy sites

python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py \
  --url "https://example.com/speakers" --mode apify

Custom conference name (otherwise inferred from URL)

python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py \
  --url "https://example.com/speakers" --conference "Sage Future 2026"

Output formats

python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py --url URL --output json     # default
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py --url URL --output csv
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py --url URL --output summary

How It Works

Direct Mode (default)

Fetches the page HTML and tries multiple extraction strategies in order, using whichever returns the most results:
  1. Strategy A -- CSS class hints: Looks for speaker cards with class names containing "speaker", "presenter", "faculty", "panelist", "team-member"
  2. Strategy B -- Heading + paragraph patterns: Looks for repeated <h2>/<h3> + <p> structures
  3. Strategy C -- JSON-LD structured data: Checks for <script type="application/ld+json"> with speaker data
  4. Strategy D -- Platform embeds: Detects Sched.com/Sessionize patterns used by many conferences
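
The "run every strategy, keep the largest result set" idea can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names, the regex patterns, and the choice to cover only Strategies B and C are all assumptions, and a real implementation would likely use a proper HTML parser.

```python
import json
import re


def strip_tags(fragment):
    """Remove inline markup and surrounding whitespace from an HTML fragment."""
    return re.sub(r"<[^>]+>", "", fragment).strip()


def extract_json_ld(html):
    """Strategy C sketch: collect schema.org Person objects from JSON-LD."""
    speakers = []
    pattern = r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>'
    for raw in re.findall(pattern, html, flags=re.S | re.I):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole page
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "Person":
                speakers.append({"name": item.get("name", ""),
                                 "title": item.get("jobTitle", "")})
    return speakers


def extract_heading_pairs(html):
    """Strategy B sketch: pair each <h2>/<h3> with the <p> that follows it."""
    pattern = r"<h[23][^>]*>(.*?)</h[23]>\s*<p[^>]*>(.*?)</p>"
    return [{"name": strip_tags(name), "title": strip_tags(title)}
            for name, title in re.findall(pattern, html, flags=re.S | re.I)]


def extract_speakers(html, strategies=(extract_json_ld, extract_heading_pairs)):
    """Run every strategy and keep whichever returned the most results."""
    results = [strategy(html) for strategy in strategies]
    return max(results, key=len, default=[])


# Hypothetical page fragment: two heading cards, one JSON-LD Person.
sample = """
<h3>Jane Smith</h3><p>VP of Finance, Acme Corp</p>
<h3>Raj Patel</h3><p>CTO, Example Inc</p>
<script type="application/ld+json">{"@type": "Person", "name": "Jane Smith"}</script>
"""
speakers = extract_speakers(sample)
```

On this sample the heading strategy finds two speakers and the JSON-LD strategy finds one, so the heading results are returned.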

Apify Mode

Uses the apify/cheerio-scraper actor with a custom page function that targets common speaker card selectors. It follows the standard Apify pattern: POST to start the run, poll until the run finishes, then GET the items from the run's dataset.
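
A rough sketch of the POST/poll/GET flow against the Apify REST API is shown below. It is an illustration under assumptions, not the script's code: the function names, the 5-second poll interval, and the injectable request and sleep parameters are hypothetical, and the page function is passed in as a string because its contents are not given here.

```python
import json
import time
import urllib.request

API = "https://api.apify.com/v2"
TERMINAL = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}


def http_json(method, url, payload=None):
    """Send a JSON request with urllib and decode the JSON response."""
    body = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(url, data=body, method=method,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_cheerio_scraper(page_url, token, page_function, timeout=300,
                        poll_interval=5, request=http_json, sleep=time.sleep):
    # 1. POST: start the actor run with the page URL and the page function.
    start_url = f"{API}/acts/apify~cheerio-scraper/runs?token={token}"
    actor_input = {"startUrls": [{"url": page_url}], "pageFunction": page_function}
    run = request("POST", start_url, actor_input)["data"]

    # 2. Poll: re-read the run until it reaches a terminal status or times out.
    deadline = time.monotonic() + timeout
    while run["status"] not in TERMINAL:
        if time.monotonic() > deadline:
            raise TimeoutError(f"run {run['id']} not finished after {timeout}s")
        sleep(poll_interval)
        run = request("GET", f"{API}/actor-runs/{run['id']}?token={token}")["data"]
    if run["status"] != "SUCCEEDED":
        raise RuntimeError(f"run {run['id']} ended with status {run['status']}")

    # 3. GET: download the items the page function stored in the default dataset.
    items_url = f"{API}/datasets/{run['defaultDatasetId']}/items?format=json&token={token}"
    return request("GET", items_url)
```

Passing request and sleep as parameters is only a convenience for exercising the control flow without network access; a real script could call the HTTP helper directly.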

CLI Reference

| Flag | Default | Description |
| --- | --- | --- |
| --url | required | Conference speakers page URL |
| --conference | inferred | Conference name (otherwise inferred from URL domain) |
| --mode | direct | direct (HTML scraping) or apify (Apify cheerio scraper) |
| --output | json | Output format: json, csv, or summary |
| --token | env var | Apify token (only needed for apify mode) |
| --timeout | 300 | Max seconds for Apify run |
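
For orientation, the flags above could be declared with argparse along these lines. This is a sketch, not the script's actual parser: the APIFY_TOKEN variable name is an assumption (the table only says "env var"), and infer_conference is a hypothetical stand-in that simply returns the URL's host, since the real inference rule is not described here.

```python
import argparse
import os
from urllib.parse import urlparse


def infer_conference(url):
    """Hypothetical fallback: use the URL's host without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def build_parser():
    parser = argparse.ArgumentParser(description="Scrape conference speakers")
    parser.add_argument("--url", required=True, help="Conference speakers page URL")
    parser.add_argument("--conference", help="Conference name (inferred if omitted)")
    parser.add_argument("--mode", choices=["direct", "apify"], default="direct")
    parser.add_argument("--output", choices=["json", "csv", "summary"], default="json")
    # APIFY_TOKEN is an assumed variable name, used here only for illustration.
    parser.add_argument("--token", default=os.environ.get("APIFY_TOKEN"))
    parser.add_argument("--timeout", type=int, default=300,
                        help="Max seconds for Apify run")
    return parser


def parse_args(argv=None):
    args = build_parser().parse_args(argv)
    if args.conference is None:
        args.conference = infer_conference(args.url)
    return args


args = parse_args(["--url", "https://www.example.com/speakers"])
```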

Output Schema

json
{
  "name": "Jane Smith",
  "title": "VP of Finance",
  "company": "Acme Corp",
  "bio": "Jane leads the finance transformation at...",
  "linkedin_url": "https://linkedin.com/in/janesmith",
  "image_url": "https://...",
  "conference": "Sage Future 2026",
  "source_url": "https://sagefuture2026.com/speakers"
}
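
To show how records with this schema might map onto the three --output formats, here is a small sketch. The render function and its names are hypothetical, and the layout of the summary format in particular is an assumption, since it is not specified here.

```python
import csv
import io
import json

# Column order follows the schema above.
FIELDS = ["name", "title", "company", "bio", "linkedin_url",
          "image_url", "conference", "source_url"]


def render(speakers, output="json"):
    """Serialize a list of speaker dicts as JSON, CSV, or a short text summary."""
    if output == "json":
        return json.dumps(speakers, indent=2, ensure_ascii=False)
    if output == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        for speaker in speakers:
            # Missing fields become empty cells rather than raising.
            writer.writerow({field: speaker.get(field, "") for field in FIELDS})
        return buf.getvalue()
    if output == "summary":
        # Assumed layout: a count line followed by one line per speaker.
        lines = [f"{len(speakers)} speaker(s)"]
        lines += [f"- {s.get('name', '')}, {s.get('title', '')} at {s.get('company', '')}"
                  for s in speakers]
        return "\n".join(lines)
    raise ValueError(f"unknown output format: {output}")


jane = {"name": "Jane Smith", "title": "VP of Finance", "company": "Acme Corp",
        "conference": "Sage Future 2026"}
print(render([jane], "summary"))
```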

Cost

  • Direct mode: Free (no API, no tokens)
  • Apify mode: Uses apify/cheerio-scraper -- minimal Apify credits

Testing Notes

HTML scraping is inherently fragile across conference sites. The multi-strategy approach maximizes coverage, but JS-heavy sites will require Apify mode. When direct scraping returns 0 results, try --mode apify.
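
The "try direct first, fall back to Apify on zero results" advice can be automated with a small wrapper like the one below. This is a hypothetical helper, not part of the script: direct and apify stand for any two callables that take a URL and return a list of speaker records.

```python
def scrape_with_fallback(url, direct, apify):
    """Return (speakers, mode_used), falling back to Apify when direct finds nothing."""
    speakers = direct(url)
    if speakers:
        return speakers, "direct"
    # Zero results usually means the page is rendered by JavaScript.
    return apify(url), "apify"


# Stand-in scrapers for illustration only.
speakers, mode = scrape_with_fallback(
    "https://example.com/speakers",
    direct=lambda url: [],
    apify=lambda url: [{"name": "Jane Smith"}],
)
```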