Data journalism methodology

Systematic approaches for finding, analyzing and presenting data in journalism.
Story structure for data journalism
Data journalism framework

The framework for data journalism was established by Philip Meyer, a journalist for Knight-Ridder, Harvard Nieman Fellow and professor at UNC-Chapel Hill. In his book *The New Precision Journalism*, which outlines his ideas, Meyer encourages journalists to treat journalism "as if it were a science" by adopting the scientific method:

- Making observation(s) / formulating a question
- Researching the question / collecting, storing and retrieving data
- Formulating a hypothesis
- Testing the hypothesis, using both qualitative (interviews, documents etc.) and quantitative (data analysis etc.) methods
- Analyzing the results and reducing them to the most important findings
- Presenting them to the audience

This process should be thought of as iterative, rather than sequential.

The data story arc
1. The hook (nut graf)
- What's the key finding(s)?
- Why should readers care?
- What's the human impact?

2. The evidence
- Show the data
- Explain the methodology
- Acknowledge limitations

3. The context
- How does this compare to the past?
- How does this compare to elsewhere?
- What's the trend?

4. The human element
- Individual examples that illustrate the data
- Expert interpretation
- Affected voices

5. The implications
- What does this mean going forward?
- What questions remain?
- What actions could result?

6. The methodology box
- Where did the data come from?
- How was it analyzed?
- What are the limitations?
- How can readers explore further?
Methodology documentation template

```markdown
How we did this analysis

Data sources
[List all data sources with links and access dates]

Time period
[Specify exactly what time period is covered]

Definitions
[Define key terms and how you operationalized them]

Analysis steps
- [First step of analysis]
- [Second step]
- [Continue...]

Limitations
- [Limitation 1]
- [Limitation 2]

What we excluded and why
- [Excluded category]: [Reason]

Verification
[How findings were verified/checked]

Code and data availability
[Link to GitHub repo if sharing code/data]

Contact
[How readers can reach you with questions]
```
Data acquisition

Public data sources

```markdown
Federal data sources

General
- Data.gov - Federal open data portal
- Census Bureau (census.gov) - Demographics, economic data
- BLS (bls.gov) - Employment, inflation, wages
- BEA (bea.gov) - GDP, economic accounts
- Federal Reserve (federalreserve.gov) - Financial data
- SEC EDGAR - Corporate filings

Specific domains
- EPA (epa.gov/data) - Environmental data
- FDA (fda.gov/data) - Drug approvals, recalls, adverse events
- CDC WONDER - Health statistics
- NHTSA - Vehicle safety data
- DOT - Transportation statistics
- FEC - Campaign finance
- USASpending.gov - Federal contracts and grants

State and local
- State open data portals (search: "[state] open data")
- Socrata-powered sites (many cities/states)
- OpenStreets, municipal GIS portals
- State comptroller/auditor reports
```
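Many of these portals expose direct CSV download or API export endpoints, so the data can often be loaded straight into pandas. A minimal sketch; the URL is a placeholder, not a real endpoint:

```python
import pandas as pd

# Placeholder URL: most open data portals (Data.gov, Socrata sites, state portals)
# offer a "download CSV" or export endpoint you can read directly
url = "https://example.gov/api/views/dataset-id/rows.csv"

df = pd.read_csv(url)

# Record when and where you pulled the data for the methodology box
print(f"Downloaded {len(df):,} rows, {len(df.columns)} columns")
```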
Data request strategies

```markdown
Getting data that isn't public

Public records request (i.e. FOIA) for datasets
- Request databases, not just documents
- Ask for data dictionary/schema
- Request in native format (CSV, SQL dump)
- Specify field-level needs

Building your own dataset
- Scraping public information
- Crowdsourcing from readers
- Systematic document review
- Surveys (with proper methodology)

Commercial data sources (for newsrooms)
- LexisNexis
- Refinitiv
- Bloomberg
- Industry-specific databases
```
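For the scraping route, a minimal sketch with requests and BeautifulSoup; the URL, table structure, and column names are hypothetical, and you should check a site's terms of service and robots.txt before scraping:

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Hypothetical page listing inspection records in an HTML table
url = "https://example.gov/inspections"
response = requests.get(url, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
rows = []
for tr in soup.select("table tr")[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

# Column names are assumptions for illustration
df = pd.DataFrame(rows, columns=["facility", "date", "result"])
df.to_csv("data/raw/inspections.csv", index=False)
```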
Data cleaning and preparation

Common data problems

```python
from typing import Any
from itertools import combinations

import pandas as pd
import numpy as np
from rapidfuzz import fuzz

# Inflation adjustment (used in the comparisons section below)
import cpi
import wbdata


def standardize_name(name: Any) -> str | None:
    """Standardize name format to 'First Last'."""
    if pd.isna(name):
        return None
    name = str(name).strip().upper()
    # Handle "LAST, FIRST" format
    if ',' in name:
        parts = name.split(',')
        name = f"{parts[1].strip()} {parts[0].strip()}"
    return name


def parse_date(date_str: Any) -> pd.Timestamp | None:
    """Parse dates in various formats."""
    if pd.isna(date_str):
        return None
    formats = [
        '%m/%d/%Y', '%Y-%m-%d', '%B %d, %Y',
        '%d-%b-%y', '%m-%d-%Y', '%Y/%m/%d'
    ]
    for fmt in formats:
        try:
            return pd.to_datetime(date_str, format=fmt)
        except (ValueError, TypeError):
            continue
    # Fall back to the pandas parser
    try:
        return pd.to_datetime(date_str)
    except (ValueError, TypeError):
        return None


def handle_missing(df: pd.DataFrame, thresh: int | None = None,
                   per_thresh: float | None = None,
                   required_col: str | None = None) -> pd.DataFrame:
    """Drop incomplete rows when a DataFrame has too many missing values, as defined by the user."""
    subset = [required_col] if required_col else None
    total_missing = df.isna().sum().sum()
    if thresh and total_missing >= thresh:
        return df.dropna(subset=subset).reset_index(drop=True).copy()
    elif per_thresh and (total_missing / df.size * 100) >= per_thresh:
        return df.dropna(subset=subset).reset_index(drop=True).copy()
    else:
        return df


def handle_duplicates(df: pd.DataFrame, thresh: int | None = None) -> pd.DataFrame:
    """Drop duplicate rows once their count reaches a user-defined threshold."""
    if thresh and df.duplicated().sum() >= thresh:
        return df.drop_duplicates().reset_index(drop=True).copy()
    else:
        return df


def flag_similar_names(df: pd.DataFrame, name_col: str, threshold: int = 85) -> pd.DataFrame:
    """Flag rows that have potential duplicate names using vectorized comparison."""
    names = df[name_col].dropna().unique()
    # Use combinations() to avoid a nested loop and duplicate comparisons
    dup_names: set[Any] = {
        name
        for name1, name2 in combinations(names, 2)
        if fuzz.ratio(str(name1).lower(), str(name2).lower()) >= threshold
        for name in (name1, name2)
    }
    df['has_similar_name'] = df[name_col].isin(dup_names)
    return df


def flag_outliers(series: pd.Series, method: str = 'iqr', threshold: float = 1.5) -> pd.Series:
    """Flag statistical outliers."""
    if method == 'iqr':
        Q1 = series.quantile(0.25)
        Q3 = series.quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - threshold * IQR
        upper = Q3 + threshold * IQR
        return (series < lower) | (series > upper)
    elif method == 'zscore':
        z_scores = np.abs((series - series.mean()) / series.std())
        return z_scores > threshold
    raise ValueError(f"Unknown method: {method}")


# Use descriptive variable names and chain methods
data_clean = (pd
    # Load messy data — raw_data is a placeholder
    # Be sure to use the right reader for the filetype
    .read_csv('../data/raw/raw_data.csv')
    # DATA TYPE CORRECTIONS
    # Ensure proper types for analysis
    .assign(
        # Convert to numeric (handling errors)
        amount=lambda x: pd.to_numeric(x['amount'], errors='coerce'),
        # Convert to categorical (saves memory, enables ordering)
        status=lambda x: pd.Categorical(x['status']))
    .assign(
        # INCONSISTENT FORMATTING
        # Problem: names in different formats,
        # e.g. "SMITH, JOHN" vs "John Smith" vs "smith john"
        name_clean=lambda x: x['name'].apply(standardize_name),
        # DATE INCONSISTENCIES
        # Problem: dates in multiple formats,
        # e.g. "01/15/2024", "2024-01-15", "January 15, 2024", "15-Jan-24"
        date_clean=lambda x: x['date'].apply(parse_date),
        # OUTLIERS
        # Identify potential data entry errors
        amount_outlier=lambda x: flag_outliers(x['amount']),
    )
    # Fuzzy duplicates (similar but not identical)
    # Use record linkage or manual review
    .pipe(flag_similar_names, name_col='name_clean', threshold=85)
    # MISSING VALUES
    # Strategy depends on context
    # First check missing value patterns
    .pipe(handle_missing, thresh=None, per_thresh=None)
    # DUPLICATES — Find and handle duplicates
    .pipe(handle_duplicates, thresh=None)
    .reset_index(drop=True)
    .copy())
```

Data validation checklist
```markdown
Pre-analysis data validation

Structural checks
- Row count matches expected
- Column count and names correct
- Data types appropriate
- No unexpected null columns

Content checks
- Date ranges make sense
- Numeric values within expected bounds
- Categorical values match expected options
- Geographic data resolves correctly
- IDs are unique where expected

Consistency checks
- Totals add up to expected values
- Cross-tabulations balance
- Related fields are consistent
- Time series is continuous

Source verification
- Can trace back to original source
- Methodology documented
- Known limitations noted
- Update frequency understood
```
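Several of these checks translate directly into quick pandas assertions that fail loudly before analysis begins. A minimal sketch, assuming the column names, expected ranges, and the `data_clean` DataFrame from the cleaning example above:

```python
import pandas as pd

def validate_before_analysis(df: pd.DataFrame) -> None:
    """Run basic pre-analysis checks; raise loudly instead of failing silently."""
    # Structural checks (expected columns are assumptions for illustration)
    expected_cols = {'name_clean', 'date_clean', 'amount', 'status'}
    missing = expected_cols - set(df.columns)
    assert not missing, f"Missing expected columns: {missing}"
    assert len(df) > 0, "DataFrame is empty"
    assert not df.isna().all().any(), "At least one column is entirely null"

    # Content checks (date and value bounds are assumptions)
    assert df['date_clean'].min() >= pd.Timestamp('2000-01-01'), "Dates earlier than expected"
    assert df['date_clean'].max() <= pd.Timestamp.now(), "Dates in the future"
    assert (df['amount'].dropna() >= 0).all(), "Negative amounts found"

    print("All validation checks passed")

validate_before_analysis(data_clean)
```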
Statistical analysis for journalism

Basic statistics with context

```python
# Essential statistics for any dataset
def describe_for_journalism(df: pd.DataFrame, col: str) -> pd.Series:
    """Generate journalist-friendly statistics."""
    stats = df[col].describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.99])
    # Add skewness to the describe() output
    stats['skewness'] = df[col].skew()
    return stats


# Example interpretation
stats = describe_for_journalism(salaries, 'salary')
print(f"""ANALYSIS
We analyzed {stats['count']:,.0f} salary records.
The median salary is ${stats['50%']:,.0f}, meaning half of workers
earn more and half earn less.
The average salary is ${stats['mean']:,.0f}, which is
{'higher' if stats['mean'] > stats['50%'] else 'lower'} than the median,
indicating the distribution is {'right-skewed (pulled up by high earners)'
if stats['skewness'] > 0 else 'left-skewed'}.
The top 10% of earners make at least ${stats['90%']:,.0f}.
The top 1% make at least ${stats['99%']:,.0f}.
""")
```
Comparisons and context

```python
# Calculate change metrics for a column
def calculate_change(df: pd.DataFrame, col: str, periods: int = 1) -> pd.DataFrame:
    """Add change metrics to a DataFrame using built-in pandas methods.

    Args:
        df: Input DataFrame
        col: Column to calculate changes for
        periods: Number of rows to look back (1=previous row, 12=year-over-year for monthly)
    """
    return df.assign(
        absolute_change=df[col].diff(periods),
        percent_change=df[col].pct_change(periods) * 100,
        direction=np.sign(df[col].diff(periods)).map({1: 'increased', -1: 'decreased', 0: 'unchanged'})
    )


# Usage:
changes = data_clean.pipe(calculate_change, 'revenue', periods=12)  # Year-over-year for monthly data


# Per capita calculations (essential for fair comparisons)
def per_capita(value: float, population: float, multiplier: int = 100000) -> float:
    """Calculate per capita rate."""
    return (value / population) * multiplier  # Per 100,000 is standard


# Example: Crime rates
city_a = {'crimes': 5000, 'population': 100000}
city_b = {'crimes': 8000, 'population': 500000}
rate_a = per_capita(city_a['crimes'], city_a['population'])
rate_b = per_capita(city_b['crimes'], city_b['population'])
print(f"City A: {rate_a:.1f} crimes per 100,000 residents")
print(f"City B: {rate_b:.1f} crimes per 100,000 residents")
# City A actually has a higher crime rate despite fewer total crimes!


def adjust_for_inflation(
    amount: float | pd.Series,
    from_year: int | pd.Series,
    to_year: int,
    country: str = 'US'
) -> float | pd.Series:
    """Adjust dollar amounts for inflation. Works with scalars or Series for .assign().

    Args:
        amount: Value(s) to adjust
        from_year: Original year(s) of the amount
        to_year: Target year to adjust to
        country: ISO 2-letter country code (default 'US'). US uses BLS data via the cpi package,
            others use World Bank CPI data (FP.CPI.TOTL indicator)
    """
    if country == 'US':
        # Use the cpi package for US (more accurate, from BLS)
        if isinstance(from_year, pd.Series):
            return pd.Series([cpi.inflate(amt, yr, to=to_year)
                              for amt, yr in zip(amount, from_year)], index=amount.index)
        return cpi.inflate(amount, from_year, to=to_year)
    else:
        # Use World Bank data for other countries
        cpi_data = wbdata.get_dataframe(
            {'FP.CPI.TOTL': 'cpi'},
            country=country
        )['cpi'].to_dict()
        from_cpi = pd.Series(from_year).map(cpi_data) if isinstance(from_year, pd.Series) else cpi_data[from_year]
        to_cpi = cpi_data[to_year]
        return amount * (to_cpi / from_cpi)


# Usage:
adjust_for_inflation(100, 2020, 2024)  # US by default
adjust_for_inflation(100, 2020, 2024, country='GB')  # UK
df.assign(inf_adjust24=lambda x: adjust_for_inflation(x['amount'], x['year'], 2024, country='DE'))

# Always adjust when comparing dollars across years!
```
Correlation vs causation

```markdown
Reporting correlations responsibly

What you CAN say
- "X and Y are correlated"
- "As X increases, Y tends to increase"
- "Areas with higher X also tend to have higher Y"
- "X is associated with Y"

What you CANNOT say (without more evidence)
- "X causes Y"
- "X leads to Y"
- "Y happens because of X"

Questions to ask before implying causation
- Is there a plausible mechanism?
- Does the timing make sense (cause before effect)?
- Is there a dose-response relationship?
- Has the finding been replicated?
- Have confounding variables been controlled?
- Are there alternative explanations?

Red flags for spurious correlations
- Extremely high correlation (r > 0.95) with unrelated things
- No logical connection between variables
- Third variable could explain both
- Small sample size with high variance
```
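Before writing about a relationship, quantify it. A minimal sketch using scipy's pearsonr; the columns and values here are illustrative, not real data:

```python
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical per-county poverty rate and asthma hospitalization rate
df = pd.DataFrame({
    'poverty_rate': [8.1, 12.4, 15.0, 9.7, 21.3],
    'asthma_rate': [45.2, 58.1, 63.4, 49.9, 77.0],
})

r, p_value = pearsonr(df['poverty_rate'], df['asthma_rate'])
print(f"Correlation: r = {r:.2f}, p = {p_value:.3f}")

# Safe phrasing: "Counties with higher poverty rates also tend to have
# higher asthma hospitalization rates." Not: "Poverty causes asthma."
```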
Data visualization

Chart selection guide

```markdown
Choosing the right chart

Comparison
- Bar chart: Compare categories
- Grouped bar: Compare categories across groups
- Bullet chart: Actual vs target

Change over time
- Line chart: Trends over time
- Area chart: Cumulative totals over time
- Slope chart: Change between two points

Distribution
- Histogram: Distribution of one variable
- Box plot: Compare distributions across groups
- Violin plot: Detailed distribution shape

Relationship
- Scatter plot: Relationship between two variables
- Bubble chart: Three variables (x, y, size)
- Connected scatter: Change in relationship over time

Composition
- Pie chart: Parts of a whole (almost never use; max 5 slices; prefer donut charts)
- Donut chart: Parts of a whole
- Stacked bar: Parts of a whole across categories
- Treemap: Hierarchical composition

Geographic
- Choropleth: Values by region (use normalized data!)
- Dot map: Individual locations
- Proportional symbol: Magnitude at locations
```
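Most of these chart types map one-to-one onto Plotly Express functions, covered in more detail in the next section. A quick sketch with a placeholder DataFrame:

```python
import pandas as pd
import plotly.express as px

df = pd.DataFrame({'year': [2021, 2022, 2023, 2024], 'value': [10, 14, 13, 19]})

px.bar(df, x='year', y='value')       # Comparison across categories
px.line(df, x='year', y='value')      # Change over time
px.histogram(df, x='value')           # Distribution of one variable
px.scatter(df, x='year', y='value')   # Relationship between two variables
```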
Exploratory interactive visualizations with Plotly Express

```python
import plotly.express as px
import plotly.graph_objects as go

# Set default template for all charts
px.defaults.template = 'simple_white'


def create_bar_chart(
    data: pd.DataFrame,
    title: str,
    source: str,
    x_val: str,
    y_val: str,
    desc: str = '',
    x_lab: str | None = None,
    y_lab: str | None = None
) -> go.Figure:
    """Create a bar chart."""
    fig = px.bar(
        data,
        x=x_val,
        y=y_val,
        # Show the description as a subtitle under the main title
        title=f"{title}<br><sup>{desc}</sup>" if desc else title,
        labels={x_val: (x_lab if x_lab else x_val), y_val: (y_lab if y_lab else y_val)}
    )
    # Add source attribution as a footnote
    fig.add_annotation(
        text=f"Source: {source}",
        xref='paper', yref='paper',
        x=0, y=-0.15, showarrow=False,
        font=dict(size=10, color='grey')
    )
    return fig


# Example (data is a DataFrame with 'year' and 'widgets_prod' columns)
fig = create_bar_chart(
    data,
    title='Annual Widget Production',
    source='Department of Widgets, 2024',
    desc='The widget department increased its production dramatically starting in 2014.',
    x_val='year',
    y_val='widgets_prod',
    x_lab='Year',
    y_lab='Units produced'
)
fig.show()  # Interactive display
```
Publication-ready automated data visualizations with Datawrapper

```python
import pandas as pd
import datawrapper as dw

# Authentication: set the DATAWRAPPER_ACCESS_TOKEN environment variable,
# or read from a file and pass to create()
with open('datawrapper_api_key.txt', 'r') as f:
    api_key = f.read().strip()

# Read in your data
data = pd.read_csv('../data/raw/data.csv')

# Create a bar chart using the new OOP API
chart = dw.BarChart(
    title='My Bar Chart Title',
    intro='Subtitle or description text',
    data=data,
    # Formatting options
    value_label_format=dw.NumberFormat.ONE_DECIMAL,
    show_value_labels=True,
    value_label_alignment='left',
    sort_bars=True,  # sort by value
    reverse_order=False,
    # Source attribution
    source_name='Your Data Source',
    source_url='https://example.com',
    byline='Your Name',
    # Optional: custom base color
    base_color='#1d81a2')

# Create and publish (uses DATAWRAPPER_ACCESS_TOKEN env var, or pass token)
chart.create(access_token=api_key)
chart.publish()

# Get chart URL and embed code
print(f"Chart ID: {chart.chart_id}")
print(f"Chart URL: https://datawrapper.dwcdn.net/{chart.chart_id}")
iframe_code = chart.get_iframe_code(responsive=True)

# Update an existing chart with new data (for live-updating charts)
existing_chart = dw.get_chart('YOUR_CHART_ID')  # retrieve by ID
existing_chart.data = new_df                    # assign new DataFrame
existing_chart.title = 'Updated Title'          # modify properties
existing_chart.update()                         # push changes to Datawrapper
existing_chart.publish()                        # republish to make live

# Optional — Export chart as image
chart.export(filepath='chart.png', width=800, height=600)

# View chart
chart
```
Avoiding misleading visualizations

```markdown
Chart integrity checklist

Axes
- Y-axis starts at zero (for bar charts)
- Axis labels are clear
- Scale is appropriate (not truncated to exaggerate)
- Both axes labeled with units

Data representation
- All data points visible
- Colors are distinguishable (including for colorblind readers)
- Proportions are accurate
- 3D effects not distorting perception

Context
- Title describes what's shown, not a conclusion
- Time period clearly stated
- Source cited
- Sample size/methodology noted if relevant
- Uncertainty shown where appropriate

Honesty
- Cherry-picking dates avoided
- Outliers explained, not hidden
- Dual axes justified (usually avoid)
- Annotations don't mislead
```
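A few of the axis checks can be enforced in code rather than left to memory. A minimal Plotly sketch; the DataFrame stands in for the widget-production example above:

```python
import pandas as pd
import plotly.express as px

# Placeholder data standing in for the earlier widget-production example
data = pd.DataFrame({'year': [2021, 2022, 2023, 2024], 'widgets_prod': [120, 135, 128, 160]})

fig = px.bar(data, x='year', y='widgets_prod')

# Force the y-axis to start at zero so bar lengths stay proportional
fig.update_yaxes(rangemode='tozero', title='Units produced')

# Put the time period in the title and keep the headline claim out of it
fig.update_layout(title='Widget production, 2021-2024')
fig.show()
```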
Working with geospatial data

Geocoding data

U.S. Census Geocoder

**Best for:** U.S. addresses only. Returns Census geography (tract, block, FIPS codes) along with coordinates—essential for joining with Census demographic data.

**Pros:** Completely free with no API key required. Returns Census geographies (state/county FIPS, tract, block) that let you join with ACS/decennial Census data. Good match rates for standard U.S. addresses.

**Cons:** Limited to 10,000 addresses per batch. U.S. addresses only. Slower than commercial alternatives. Lower match rates for non-standard addresses (PO boxes, rural routes, new construction).

**Use when:** You need to geocode nicely formatted U.S. addresses or you don't have budget for a paid service.

```python
# pip install censusbatchgeocoder
import censusbatchgeocoder
import pandas as pd


def census_geocode(
    df: pd.DataFrame,
    id_col: str = 'id',
    address_col: str = 'address',
    city_col: str = 'city',
    state_col: str = 'state',
    zipcode_col: str = 'zipcode',
    chunk_size: int = 9999
) -> pd.DataFrame:
    """
    Geocode a DataFrame using the U.S. Census batch geocoder.

    The DataFrame must have columns: id, address, city, state, zipcode
    (state and zipcode are optional but improve match rates).
    Automatically handles datasets larger than 10,000 rows by chunking.

    Returns a DataFrame with: latitude, longitude, state_fips, county_fips,
    tract, block, is_match, is_exact, returned_address, geocoded_address
    """
    # Rename columns to the expected format
    col_map = {id_col: 'id', address_col: 'address', city_col: 'city'}
    if state_col and state_col in df.columns:
        col_map[state_col] = 'state'
    if zipcode_col and zipcode_col in df.columns:
        col_map[zipcode_col] = 'zipcode'
    renamed_df = df.rename(columns=col_map)
    records = renamed_df.to_dict('records')

    # Small dataset: geocode directly
    if len(records) <= chunk_size:
        results = censusbatchgeocoder.geocode(records)
        return pd.DataFrame(results)

    # Large dataset: process in chunks to stay under the 10,000-row limit
    all_results = []
    for i in range(0, len(records), chunk_size):
        chunk = records[i:i + chunk_size]
        print(f"Geocoding rows {i:,} to {i + len(chunk):,} of {len(records):,}...")
        try:
            results = censusbatchgeocoder.geocode(chunk)
            all_results.extend(results)
        except Exception as e:
            print(f"Error on chunk starting at {i}: {e}")
            for record in chunk:
                all_results.append({**record, 'is_match': 'No_Match', 'latitude': None, 'longitude': None})
    return pd.DataFrame(all_results)


# Usage:
geocoded = (pd
    .read_csv('../data/raw/addresses.csv')
    .assign(id=lambda x: x.index)
    .pipe(census_geocode,
          id_col='id',
          address_col='street',
          city_col='city',
          state_col='state',
          zipcode_col='zip'))
```
Google Maps Geocoder

**Best for:** International addresses, high match rates, and messy/non-standard address formats.

**Pros:** Excellent match rates even for poorly formatted addresses. Works worldwide. Fast and reliable. Returns rich metadata (place types, address components, place IDs).

**Cons:** Costs money ($5 per 1,000 requests after the free tier). Requires an API key and billing account. Does not return Census geography—you'd need to do a separate spatial join.

**Use when:** You need to geocode international addresses, have messy address data that the Census geocoder can't match, or need the highest possible match rate and have budget for it.

```python
import googlemaps
import pandas as pd
from typing import Optional


def geocode_address_google(address: str, api_key: str) -> Optional[dict]:
    """
    Geocode an address using the Google Maps API.
    Requires an API key with the Geocoding API enabled.
    """
    gmaps = googlemaps.Client(key=api_key)
    result = gmaps.geocode(address)
    if result:
        location = result[0]['geometry']['location']
        return {
            'formatted_address': result[0]['formatted_address'],
            'lat': location['lat'],
            'lon': location['lng'],
            'place_id': result[0]['place_id']
        }
    return None


# Batch geocode a DataFrame
def batch_geocode(df: pd.DataFrame, address_col: str, api_key: str) -> pd.DataFrame:
    gmaps = googlemaps.Client(key=api_key)
    results = []
    for address in df[address_col]:
        try:
            result = gmaps.geocode(address)
            if result:
                loc = result[0]['geometry']['location']
                results.append({'lat': loc['lat'], 'lon': loc['lng']})
            else:
                results.append({'lat': None, 'lon': None})
        except Exception:
            results.append({'lat': None, 'lon': None})
    return pd.concat([df, pd.DataFrame(results)], axis=1)
```

GeoPandas
```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Read data from various formats
gdf = gpd.read_file('data.geojson')                      # GeoJSON
gdf = gpd.read_file('data.shp')                          # Shapefile
gdf = gpd.read_file('https://example.com/data.geojson')  # From URL
gdf = gpd.read_parquet('data.parquet')                   # GeoParquet (fast!)

# Transform a DataFrame with lat/lon into a GeoDataFrame
df = pd.read_csv('locations.csv')
geometry = [Point(xy) for xy in zip(df['longitude'], df['latitude'])]
gdf = gpd.GeoDataFrame(df, geometry=geometry)

# Set CRS (Coordinate Reference System)
# EPSG:4326 = WGS84 (standard latitude, longitude)
gdf = gdf.set_crs('EPSG:4326')

# Transform to a different CRS (for area/distance calculations, use a projected CRS)
gdf_projected = gdf.to_crs('EPSG:3857')  # Web Mercator, for distance in meters

# Basic spatial operations
# Find the area of a shape
gdf['area'] = gdf_projected.geometry.area
# Find the center of a shape
gdf['centroid'] = gdf.geometry.centroid
# Draw a 1km boundary around a point
gdf['buffer_1km'] = gdf_projected.geometry.buffer(1000)  # when set to CRS 3857

# Spatial join: find points within polygons
points = gpd.read_file('points.geojson')
polygons = gpd.read_file('boundaries.geojson')
joined = gpd.sjoin(points, polygons, predicate='within')

# Dissolve: merge geometries by attribute
dissolved = gdf.dissolve(by='state', aggfunc='sum')

# Export to various formats
gdf.to_parquet('output.parquet')                 # GeoParquet (recommended)
gdf.to_file('output.geojson', driver='GeoJSON')  # for tools that don't support GeoParquet
```

Geo-visualization with .explore(), lonboard and Datawrapper
.explore()

**Best for:** Quick exploration and prototyping during data analysis.

**Pros:** Built into GeoPandas—the method is available on any GeoDataFrame. Great for exploratory data analysis—checking that your data looks right, exploring spatial patterns, and iterating quickly on map designs.

**Cons:** Becomes slow with large datasets (>100k features). Limited customization compared to dedicated mapping libraries. Requires extra dependencies to be installed.

**Use when:** You're in the middle of analysis and want to quickly visualize your GeoDataFrame without switching tools.

Required dependencies:

```bash
pip install folium mapclassify matplotlib
```

- `folium` - Required for `.explore()` to work at all (renders the interactive map)
- `mapclassify` - Required when using the `scheme=` parameter for classification (e.g., 'naturalbreaks', 'quantiles', 'equalinterval')
- `matplotlib` - Required for colormap (`cmap=`) support

```python
import geopandas as gpd

# folium, mapclassify, and matplotlib must be installed but don't need to be imported;
# geopandas imports them automatically when you call .explore()

# Basic interactive map (uses folium under the hood)
gdf.explore()

# Choropleth map with customization
# (requires mapclassify for the scheme parameter)
gdf.explore(
    column='population',             # Column for color scale
    cmap='YlOrRd',                   # Matplotlib colormap
    scheme='naturalbreaks',          # Classification scheme (needs mapclassify)
    k=5,                             # Number of bins
    legend=True,
    tooltip=['name', 'population'],  # Columns to show on hover
    popup=True,                      # Show all columns on click
    tiles='CartoDB positron',        # Background tiles
    style_kwds={'color': 'black', 'weight': 0.5}  # Border style
)
```
lonboard

**Best for:** Large datasets and high-performance visualization in Jupyter notebooks.

**Pros:** GPU-accelerated rendering via deck.gl can handle millions of points smoothly. Excellent interactivity—pan, zoom, and hover work fluidly even with massive datasets. Native support for the GeoArrow format for efficient data transfer.

**Cons:** Requires separate installation (`pip install lonboard`). Styling options are more technical (RGBA arrays, deck.gl conventions).

**Use when:** You have large point datasets (crime incidents, sensor readings, business locations) or need smooth interactivity with 100k+ features.

```python
import geopandas as gpd
from lonboard import viz, Map, ScatterplotLayer, PolygonLayer

# Quick visualization (auto-detects geometry type)
viz(gdf)

# Custom ScatterplotLayer for points
layer = ScatterplotLayer.from_geopandas(
    gdf,
    get_radius=100,
    get_fill_color=[255, 0, 0, 200],  # RGBA
    pickable=True
)
m = Map(layer)
m

# PolygonLayer with color based on a column
from lonboard.colormap import apply_continuous_cmap
import matplotlib.pyplot as plt

colors = apply_continuous_cmap(gdf['value'], plt.cm.viridis)
layer = PolygonLayer.from_geopandas(
    gdf,
    get_fill_color=colors,
    get_line_color=[0, 0, 0, 100],
    pickable=True
)
Map(layer)
```
Datawrapper

**Best for:** Publication-ready choropleth and proportional symbol maps for articles and reports.

**Pros:** Beautiful, professional defaults out of the box. Generates embeddable, responsive iframes that work in any CMS. Readers can interact (hover, click) without running any code. Accessible and mobile-friendly. Easy to update programmatically when the underlying data changes.

**Cons:** Requires a Datawrapper account (free tier available). Limited to Datawrapper's supported boundary files—you can't bring arbitrary geometries. Less flexibility for custom visualizations.

**Use when:** You need a polished map for publication. Ideal for choropleth maps showing statistics by region (unemployment by state, COVID cases by county, election results by district). Your audience will view the map in a browser, not a notebook.

Unlike `.explore()` or lonboard, you don't pass raw geometry—instead you match your data to Datawrapper's built-in boundary files using standard codes (FIPS, ISO, etc.).

```python
import datawrapper as dw
import pandas as pd

# Read API key
with open('datawrapper_api_key.txt', 'r') as f:
    api_key = f.read().strip()

# Prepare data with location codes that match Datawrapper's boundaries
# For US states: use 2-letter abbreviations or FIPS codes
# For countries: use ISO 3166-1 alpha-2 codes
df = pd.DataFrame({
    'state': ['AL', 'AK', 'AZ', 'AR', 'CA'],  # State abbreviations
    'unemployment_rate': [4.9, 3.2, 7.1, 4.2, 5.8]
})

# Create a choropleth map
chart = dw.ChoroplethMap(
    title='Unemployment Rate by State',
    intro='Percentage of labor force unemployed, 2024',
    data=df,
    # Map configuration
    basemap='us-states',     # Built-in US states boundaries
    basemap_key='state',     # Column in your data with location codes
    value_column='unemployment_rate',
    # Styling
    color_palette='YlOrRd',  # Color scheme
    legend_title='Unemployment %',
    # Attribution
    source_name='Bureau of Labor Statistics',
    source_url='https://www.bls.gov/',
    byline='Your Name')

# Create and publish
chart.create(access_token=api_key)
chart.publish()

# Get embed code for your article
iframe = chart.get_iframe_code(responsive=True)
print(f"Chart URL: https://datawrapper.dwcdn.net/{chart.chart_id}")

# Update with new data (for live-updating maps)
new_df = pd.DataFrame({...})  # Updated data
existing_chart = dw.get_chart('YOUR_CHART_ID')
existing_chart.data = new_df
existing_chart.title = 'Updated Title'  # modify properties
existing_chart.update()
existing_chart.publish()
```

**Available Datawrapper basemaps include:**
- `us-states`, `us-counties`, `us-congressional-districts`
- `world`, `europe`, `africa`, `asia`
- Country-specific maps (e.g., `germany-states`, `uk-constituencies`)

Learning resources
- NICAR (Investigative Reporters & Editors)
- Knight Center for Journalism in the Americas
- Data Journalism Handbook (datajournalism.com)
- Flowing Data (flowingdata.com)
- The Pudding (pudding.cool) - examples
- Sigma Awards (https://www.sigmaawards.org/) - examples