data-journalism


Data journalism methodology


Systematic approaches for finding, analyzing and presenting data in journalism.

Story structure for data journalism


Data journalism framework



The framework for data journalism was established by Philip Meyer, a journalist for Knight-Ridder, Harvard Nieman Fellow and professor at UNC-Chapel Hill. In his book *The New Precision Journalism*, which outlines his ideas, Meyer encourages journalists to treat journalism "as if it were a science" by adopting the scientific method:
- Making observations / formulating a question
- Researching the question / collecting, storing and retrieving data
- Formulating a hypothesis
- Testing the hypothesis, using both qualitative (interviews, documents etc.) and quantitative (data analysis etc.) methods
- Analyzing the results and reducing them to the most important findings
- Presenting them to the audience

This process should be thought of as iterative, rather than sequential.

The data story arc


1. The hook (nut graf)


  • What's the key finding(s)?
  • Why should readers care?
  • What's the human impact?

2. The evidence


  • Show the data
  • Explain the methodology
  • Acknowledge limitations

3. The context


  • How does this compare to the past?
  • How does this compare to elsewhere?
  • What's the trend?

4. The human element


  • Individual examples that illustrate the data
  • Expert interpretation
  • Affected voices

5. The implications


  • What does this mean going forward?
  • What questions remain?
  • What actions could result?

6. The methodology box


  • Where did the data come from?
  • How was it analyzed?
  • What are the limitations?
  • How can readers explore further?

Methodology documentation template



How we did this analysis


Data sources


[List all data sources with links and access dates]

Time period


[Specify exactly what time period is covered]

Definitions


[Define key terms and how you operationalized them]

Analysis steps


  1. [First step of analysis]
  2. [Second step]
  3. [Continue...]

Limitations


  • [Limitation 1]
  • [Limitation 2]

What we excluded and why


  • [Excluded category]: [Reason]

Verification


[How findings were verified/checked]

Code and data availability


[Link to GitHub repo if sharing code/data]

Contact


[How readers can reach you with questions]

Data acquisition


Public data sources



Federal data sources


General


  • Data.gov - Federal open data portal
  • Census Bureau (census.gov) - Demographics, economic data
  • BLS (bls.gov) - Employment, inflation, wages
  • BEA (bea.gov) - GDP, economic accounts
  • Federal Reserve (federalreserve.gov) - Financial data
  • SEC EDGAR - Corporate filings
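
Several of these portals expose simple JSON APIs. As a minimal sketch, pulling state-level figures from the Census Bureau API might look like this (the 2022 ACS 5-year endpoint and the total-population variable B01003_001E are assumptions; check api.census.gov for current vintages and variable codes):

python
import requests
import pandas as pd

# Total population (B01003_001E) for every state from the ACS 5-year estimates
# Endpoint and variable code are assumptions; verify at api.census.gov
url = 'https://api.census.gov/data/2022/acs/acs5'
params = {'get': 'NAME,B01003_001E', 'for': 'state:*'}

resp = requests.get(url, params=params, timeout=30)
resp.raise_for_status()

rows = resp.json()  # first row is the header
population = (pd
    .DataFrame(rows[1:], columns=rows[0])
    .rename(columns={'B01003_001E': 'population'})
    .assign(population=lambda x: pd.to_numeric(x['population'])))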

Specific domains


  • EPA (epa.gov/data) - Environmental data
  • FDA (fda.gov/data) - Drug approvals, recalls, adverse events
  • CDC WONDER - Health statistics
  • NHTSA - Vehicle safety data
  • DOT - Transportation statistics
  • FEC - Campaign finance
  • USASpending.gov - Federal contracts and grants

State and local


  • State open data portals (search: "[state] open data")
  • Socrata-powered sites (many cities/states)
  • OpenStreets, municipal GIS portals
  • State comptroller/auditor reports

Data request strategies



Getting data that isn't public


Public records request (e.g., FOIA) for datasets


  • Request databases, not just documents
  • Ask for data dictionary/schema
  • Request in native format (CSV, SQL dump)
  • Specify field-level needs

Building your own dataset


  • Scraping public information
  • Crowdsourcing from readers
  • Systematic document review
  • Surveys (with proper methodology)
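
For the scraping route, a minimal sketch with requests and BeautifulSoup (the URL, table structure and CSS selectors below are placeholders, not a real site):

python
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical example: turn a public meeting schedule into a DataFrame
url = 'https://example.gov/meetings'  # placeholder URL
soup = BeautifulSoup(requests.get(url, timeout=30).text, 'html.parser')

meetings = pd.DataFrame([
    {'date': row.select_one('.date').get_text(strip=True),
     'topic': row.select_one('.topic').get_text(strip=True)}
    for row in soup.select('table.meetings tr')[1:]  # skip the header row
])

Always check the site's terms of service and robots.txt, and rate-limit your requests.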

Commercial data sources (for newsrooms)


  • LexisNexis
  • Refinitiv
  • Bloomberg
  • Industry-specific databases

Data cleaning and preparation


Common data problems


python
from typing import Any

import pandas as pd
import numpy as np
from rapidfuzz import fuzz
from itertools import combinations

# Inflation adjustment
import cpi
import wbdata


def standardize_name(name: Any) -> str | None:
    """Standardize name format to 'First Last'."""
    if pd.isna(name):
        return None
    name = str(name).strip().upper()
    # Handle "LAST, FIRST" format
    if ',' in name:
        parts = name.split(',')
        name = f"{parts[1].strip()} {parts[0].strip()}"
    return name


def parse_date(date_str: Any) -> pd.Timestamp | None:
    """Parse dates in various formats."""
    if pd.isna(date_str):
        return None

    formats = [
        '%m/%d/%Y', '%Y-%m-%d', '%B %d, %Y',
        '%d-%b-%y', '%m-%d-%Y', '%Y/%m/%d'
    ]

    for fmt in formats:
        try:
            return pd.to_datetime(date_str, format=fmt)
        except (ValueError, TypeError):
            continue

    # Fall back to pandas parser
    try:
        return pd.to_datetime(date_str)
    except (ValueError, TypeError):
        return None


def handle_missing(df: pd.DataFrame, thresh: int | None = None, per_thresh: float | None = None,
                   required_col: str | None = None) -> pd.DataFrame:
    """Handle DataFrames with too many missing values, as defined by the user."""
    total_missing = df.isna().sum().sum()
    if thresh and total_missing >= thresh:
        return df.dropna(subset=[required_col]).reset_index(drop=True).copy()
    elif per_thresh and (total_missing / df.size * 100) >= per_thresh:
        return df.dropna(subset=[required_col]).reset_index(drop=True).copy()
    else:
        return df


def handle_duplicates(df: pd.DataFrame, thresh: int | None = None) -> pd.DataFrame:
    """Handle duplicate rows of data."""
    if thresh and df.duplicated().sum() >= thresh:
        return df.drop_duplicates().reset_index(drop=True).copy()
    else:
        return df


def flag_similar_names(df: pd.DataFrame, name_col: str, threshold: int = 85) -> pd.DataFrame:
    """Flag rows that have potential duplicate names using vectorized comparison."""
    names = df[name_col].dropna().unique()

    # Use combinations() to avoid nested loop and duplicate comparisons
    dup_names: set[Any] = {
        name
        for name1, name2 in combinations(names, 2)
        if fuzz.ratio(str(name1).lower(), str(name2).lower()) >= threshold
        for name in (name1, name2)
    }

    df['has_similar_name'] = df[name_col].isin(dup_names)
    return df


def flag_outliers(series: pd.Series, method: str = 'iqr', threshold: float = 1.5) -> pd.Series:
    """Flag statistical outliers."""
    if method == 'iqr':
        Q1 = series.quantile(0.25)
        Q3 = series.quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - threshold * IQR
        upper = Q3 + threshold * IQR
        return (series < lower) | (series > upper)
    elif method == 'zscore':
        z_scores = np.abs((series - series.mean()) / series.std())
        return z_scores > threshold
    raise ValueError(f"Unknown method: {method}")

# Use descriptive variable names and chain methods
data_clean = (pd
    # Load messy data — raw_data is a placeholder
    # Be sure to use the right reader for the filetype
    .read_csv('../data/raw/raw_data.csv')

    # DATA TYPE CORRECTIONS
    # Ensure proper types for analysis
    .assign(
        # Convert to numeric (handling errors)
        amount=lambda x: pd.to_numeric(x['amount'], errors='coerce'),

        # Convert to categorical (saves memory, enables ordering)
        status=lambda x: pd.Categorical(x['status']))

    .assign(
        # INCONSISTENT FORMATTING
        # Problem: Names in different formats
        # e.g. "SMITH, JOHN" vs "John Smith" vs "smith john"
        name_clean=lambda x: x['name'].apply(standardize_name),

        # DATE INCONSISTENCIES
        # Problem: Dates in multiple formats
        # e.g. "01/15/2024", "2024-01-15", "January 15, 2024", "15-Jan-24"
        date_clean=lambda x: x['date'].apply(parse_date),

        # OUTLIERS
        # Identify potential data entry errors
        amount_outlier=lambda x: flag_outliers(x['amount']),
    )

    # Fuzzy duplicates (similar but not identical)
    # Use record linkage or manual review
    .pipe(flag_similar_names, name_col='name_clean', threshold=85)

    # MISSING VALUES
    # Strategy depends on context
    # First check missing value patterns
    .pipe(handle_missing, thresh=None, per_thresh=None)

    # DUPLICATES — Find and handle duplicates
    .pipe(handle_duplicates, thresh=None)

    .reset_index(drop=True)
    .copy())

Data validation checklist



Pre-analysis data validation


Structural checks


  • Row count matches expected
  • Column count and names correct
  • Data types appropriate
  • No unexpected null columns

Content checks


  • Date ranges make sense
  • Numeric values within expected bounds
  • Categorical values match expected options
  • Geographic data resolves correctly
  • IDs are unique where expected

Consistency checks


  • Totals add up to expected values
  • Cross-tabulations balance
  • Related fields are consistent
  • Time series is continuous

Source verification


  • Can trace back to original source
  • Methodology documented
  • Known limitations noted
  • Update frequency understood
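
A minimal sketch of automating a few of these checks with plain assertions (the column names and bounds are hypothetical and should be adapted to your dataset):

python
import pandas as pd

def validate_before_analysis(df: pd.DataFrame) -> None:
    """Raise an AssertionError if basic expectations about the data are violated."""
    # Structural checks
    expected_cols = {'employee_id', 'name', 'salary', 'hire_date'}  # hypothetical schema
    assert expected_cols.issubset(df.columns), f"Missing columns: {expected_cols - set(df.columns)}"
    assert len(df) > 0, "No rows loaded"

    # Content checks
    assert df['salary'].between(0, 1_000_000).all(), "Salary outside expected bounds"
    assert df['hire_date'].between('1950-01-01', pd.Timestamp.today()).all(), "Implausible hire dates"

    # Consistency checks
    assert not df['employee_id'].duplicated().any(), "Employee IDs are not unique"

validate_before_analysis(data_clean)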

Statistical analysis for journalism


Basic statistics with context


python

# Essential statistics for any dataset
def describe_for_journalism(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Generate journalist-friendly statistics."""
    stats = df[col].describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.99])

    # Add skewness to the describe() output
    stats['skewness'] = df[col].skew()

    return stats.to_frame(name=col)

# Example interpretation
# describe() labels the percentiles '50%', '90%', '99%'; grab the column back as a Series
stats = describe_for_journalism(salaries, 'salary')['salary']

print(f"""ANALYSIS

We analyzed {stats['count']:,.0f} salary records.
The median salary is ${stats['50%']:,.0f}, meaning half of workers earn more and half earn less.
The average salary is ${stats['mean']:,.0f}, which is {'higher' if stats['mean'] > stats['50%'] else 'lower'} than the median, indicating the distribution is {'right-skewed (pulled up by high earners)' if stats['skewness'] > 0 else 'left-skewed'}.
The top 10% of earners make at least ${stats['90%']:,.0f}. The top 1% make at least ${stats['99%']:,.0f}.
""")

Comparisons and context


python

# Calculate change metrics for a column
def calculate_change(df: pd.DataFrame, col: str, periods: int = 1) -> pd.DataFrame:
    """Add change metrics to a DataFrame using built-in pandas methods.

    Args:
        df: Input DataFrame
        col: Column to calculate changes for
        periods: Number of rows to look back (1=previous row, 12=year-over-year for monthly)
    """
    return df.assign(
        absolute_change=df[col].diff(periods),
        percent_change=df[col].pct_change(periods) * 100,
        direction=np.sign(df[col].diff(periods)).map({1: 'increased', -1: 'decreased', 0: 'unchanged'})
    )

# Usage:
changes = data_clean.pipe(calculate_change, 'revenue', periods=12)  # Year-over-year for monthly data

# Per capita calculations (essential for fair comparisons)
def per_capita(value: float, population: float, multiplier: int = 100000) -> float:
    """Calculate per capita rate."""
    return (value / population) * multiplier  # Per 100,000 is standard

# Example: Crime rates
city_a = {'crimes': 5000, 'population': 100000}
city_b = {'crimes': 8000, 'population': 500000}

rate_a = per_capita(city_a['crimes'], city_a['population'])
rate_b = per_capita(city_b['crimes'], city_b['population'])

print(f"City A: {rate_a:.1f} crimes per 100,000 residents")
print(f"City B: {rate_b:.1f} crimes per 100,000 residents")

# City A actually has a higher crime rate despite fewer total crimes!

def adjust_for_inflation(
    amount: float | pd.Series,
    from_year: int | pd.Series,
    to_year: int,
    country: str = 'US'
) -> float | pd.Series:
    """Adjust dollar amounts for inflation. Works with scalars or Series for .assign().

    Args:
        amount: Value(s) to adjust
        from_year: Original year(s) of the amount
        to_year: Target year to adjust to
        country: ISO 2-letter country code (default 'US'). US uses BLS data via cpi package,
                 others use World Bank CPI data (FP.CPI.TOTL indicator)
    """
    if country == 'US':
        # Use cpi package for US (more accurate, from BLS)
        if isinstance(from_year, pd.Series):
            return pd.Series([cpi.inflate(amt, yr, to=to_year)
                              for amt, yr in zip(amount, from_year)], index=amount.index)
        return cpi.inflate(amount, from_year, to=to_year)
    else:
        # Use World Bank data for other countries
        cpi_data = wbdata.get_dataframe(
            {'FP.CPI.TOTL': 'cpi'},
            country=country
        )['cpi'].to_dict()

        from_cpi = pd.Series(from_year).map(cpi_data) if isinstance(from_year, pd.Series) else cpi_data[from_year]
        to_cpi = cpi_data[to_year]
        return amount * (to_cpi / from_cpi)

# Usage:
adjust_for_inflation(100, 2020, 2024)                 # US by default
adjust_for_inflation(100, 2020, 2024, country='GB')   # UK
df.assign(inf_adjust24=lambda x: adjust_for_inflation(x['amount'], x['year'], 2024, country='DE'))

# Always adjust when comparing dollars across years!

Correlation vs causation



Reporting correlations responsibly


What you CAN say


  • "X and Y are correlated"
  • "As X increases, Y tends to increase"
  • "Areas with higher X also tend to have higher Y"
  • "X is associated with Y"

What you CANNOT say (without more evidence)


  • "X causes Y"
  • "X leads to Y"
  • "Y happens because of X"

Questions to ask before implying causation


  1. Is there a plausible mechanism?
  2. Does the timing make sense (cause before effect)?
  3. Is there a dose-response relationship?
  4. Has the finding been replicated?
  5. Have confounding variables been controlled?
  6. Are there alternative explanations?

Red flags for spurious correlations


  • Extremely high correlation (r > 0.95) with unrelated things
  • No logical connection between variables
  • Third variable could explain both
  • Small sample size with high variance
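
As a minimal sketch of putting this language into practice (the DataFrame and column names are hypothetical; scipy is assumed to be available):

python
from scipy.stats import pearsonr

# Hypothetical columns: per-pupil funding and graduation rate by school district
r, p_value = pearsonr(districts['funding_per_pupil'], districts['graduation_rate'])

# Report the association without implying causation
print(f"Districts with higher per-pupil funding also tend to have higher graduation rates "
      f"(r = {r:.2f}, p = {p_value:.3f}). The data alone cannot show that funding causes "
      f"higher graduation rates.")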

Data visualization


Chart selection guide



Choosing the right chart


Comparison


  • Bar chart: Compare categories
  • Grouped bar: Compare categories across groups
  • Bullet chart: Actual vs target

Change over time


  • Line chart: Trends over time
  • Area chart: Cumulative totals over time
  • Slope chart: Change between two points

Distribution


  • Histogram: Distribution of one variable
  • Box plot: Compare distributions across groups
  • Violin plot: Detailed distribution shape

Relationship


  • Scatter plot: Relationship between two variables
  • Bubble chart: Three variables (x, y, size)
  • Connected scatter: Change in relationship over time

Composition


  • Pie chart: Parts of a whole (almost never use, max 5 slices, prefer donut charts)
  • Donut chart: Parts of a whole
  • Stacked bar: Parts of whole across categories
  • Treemap: Hierarchical composition

Geographic


  • Choropleth: Values by region (use normalized data!)
  • Dot map: Individual locations
  • Proportional symbol: Magnitude at locations

Exploratory interactive visualizations with Plotly Express


python
import plotly.express as px

# Set default template for all charts
px.defaults.template = 'simple_white'


def create_bar_chart(
    data: pd.DataFrame,
    title: str,
    source: str,
    x_val: str,
    y_val: str,
    desc: str = '',
    x_lab: str | None = None,
    y_lab: str | None = None,
):
    """Create a bar chart."""
    fig = px.bar(
        data,
        x=x_val,
        y=y_val,
        # Description and source shown as a subtitle under the main title
        title=f"{title}<br><sup>{desc} Source: {source}</sup>",
        labels={x_val: x_lab if x_lab else x_val, y_val: y_lab if y_lab else y_val}
    )

    return fig

# Example
fig = create_bar_chart(
    data,
    title='Annual Widget Production',
    source='Department of Widgets, 2024',
    desc='The widget department increased its production dramatically starting in 2014.',
    x_val='year',
    y_val='widgets_prod',
    x_lab='Year',
    y_lab='Units produced'
)
fig.show()  # Interactive display

Publication-ready automated data visualizations with Datawrapper


python
import pandas as pd
import datawrapper as dw

# Authentication: Set DATAWRAPPER_ACCESS_TOKEN environment variable,
# or read from file and pass to create()
with open('datawrapper_api_key.txt', 'r') as f:
    api_key = f.read().strip()

# Read in your data
data = pd.read_csv('../data/raw/data.csv')

# Create a bar chart using the new OOP API
chart = dw.BarChart(
    title='My Bar Chart Title',
    intro='Subtitle or description text',
    data=data,

    # Formatting options
    value_label_format=dw.NumberFormat.ONE_DECIMAL,
    show_value_labels=True,
    value_label_alignment='left',
    sort_bars=True,      # sort by value
    reverse_order=False,

    # Source attribution
    source_name='Your Data Source',
    source_url='https://example.com',
    byline='Your Name',

    # Optional: custom base color
    base_color='#1d81a2'
)

# Create and publish (uses DATAWRAPPER_ACCESS_TOKEN env var, or pass token)
chart.create(access_token=api_key)
chart.publish()

# Get chart URL and embed code
print(f"Chart ID: {chart.chart_id}")
print(f"Chart URL: https://datawrapper.dwcdn.net/{chart.chart_id}")
iframe_code = chart.get_iframe_code(responsive=True)

# Update existing chart with new data (for live-updating charts)
existing_chart = dw.get_chart('YOUR_CHART_ID')   # retrieve by ID
existing_chart.data = new_df                     # assign new DataFrame
existing_chart.title = 'Updated Title'           # modify properties
existing_chart.update()                          # push changes to Datawrapper
existing_chart.publish()                         # republish to make live

# Optional — Export chart as image
chart.export(filepath='chart.png', width=800, height=600)

# View chart
chart

Avoiding misleading visualizations



Chart integrity checklist


Axes


  • Y-axis starts at zero (for bar charts)
  • Axis labels are clear
  • Scale is appropriate (not truncated to exaggerate)
  • Both axes labeled with units

Data representation


  • All data points visible
  • Colors are distinguishable (including colorblind)
  • Proportions are accurate
  • 3D effects not distorting perception

Context


  • Title describes what's shown, not conclusion
  • Time period clearly stated
  • Source cited
  • Sample size/methodology noted if relevant
  • Uncertainty shown where appropriate

Honesty


  • Cherry-picking dates avoided
  • Outliers explained, not hidden
  • Dual axes justified (usually avoid)
  • Annotations don't mislead
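
A minimal sketch of enforcing a couple of these rules in Plotly Express (reusing the widget-production example from above; the source text is an assumption):

python
import plotly.express as px

fig = px.bar(
    data,
    x='year',
    y='widgets_prod',
    title='Annual widget production',
    labels={'year': 'Year', 'widgets_prod': 'Units produced'},
)
fig.update_yaxes(rangemode='tozero')   # bar charts should start at zero
fig.add_annotation(                    # cite the source on the chart itself
    text='Source: Department of Widgets, 2024',
    xref='paper', yref='paper', x=0, y=-0.2, showarrow=False,
)
fig.show()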

Working with geospatial data


Geocoding data


U.S. Census Geocoder


**Best for:** U.S. addresses only. Returns Census geography (tract, block, FIPS codes) along with coordinates—essential for joining with Census demographic data.
**Pros:** Completely free with no API key required. Returns Census geographies (state/county FIPS, tract, block) that let you join with ACS/decennial Census data. Good match rates for standard U.S. addresses.
**Cons:** Limited to 10,000 addresses per batch. U.S. addresses only. Slower than commercial alternatives. Lower match rates for non-standard addresses (PO boxes, rural routes, new construction).
**Use when:** You need to geocode nicely formatted U.S. addresses or you don't have budget for a paid service.
python

# pip install censusbatchgeocoder
import censusbatchgeocoder
import pandas as pd

# DataFrame must have columns: id, address, city, state, zipcode
# (state and zipcode are optional but improve match rates)

def census_geocode(
    df: pd.DataFrame,
    id_col: str = 'id',
    address_col: str = 'address',
    city_col: str = 'city',
    state_col: str = 'state',
    zipcode_col: str = 'zipcode',
    chunk_size: int = 9999
) -> pd.DataFrame:
    """
    Geocode a DataFrame using the U.S. Census batch geocoder.
    Automatically handles datasets larger than 10,000 rows by chunking.

    Returns DataFrame with: latitude, longitude, state_fips, county_fips,
    tract, block, is_match, is_exact, returned_address, geocoded_address
    """
    # Rename columns to expected format
    col_map = {id_col: 'id', address_col: 'address', city_col: 'city'}
    if state_col and state_col in df.columns:
        col_map[state_col] = 'state'
    if zipcode_col and zipcode_col in df.columns:
        col_map[zipcode_col] = 'zipcode'

    renamed_df = df.rename(columns=col_map)
    records = renamed_df.to_dict('records')

    # Small dataset: geocode directly
    if len(records) <= chunk_size:
        results = censusbatchgeocoder.geocode(records)
        return pd.DataFrame(results)

    # Large dataset: process in chunks to stay under the 10,000 limit
    all_results = []
    for i in range(0, len(records), chunk_size):
        chunk = records[i:i + chunk_size]
        print(f"Geocoding rows {i:,} to {i + len(chunk):,} of {len(records):,}...")

        try:
            results = censusbatchgeocoder.geocode(chunk)
            all_results.extend(results)
        except Exception as e:
            print(f"Error on chunk starting at {i}: {e}")
            for record in chunk:
                all_results.append({**record, 'is_match': 'No_Match', 'latitude': None, 'longitude': None})

    return pd.DataFrame(all_results)

# Usage:
geocoded = (pd
    .read_csv('../data/raw/addresses.csv')
    .assign(id=lambda x: x.index)
    .pipe(census_geocode, id_col='id', address_col='street', city_col='city',
          state_col='state', zipcode_col='zip'))

Google Maps Geocoder


**Best for:** International addresses, high match rates, and messy/non-standard address formats.
**Pros:** Excellent match rates even for poorly formatted addresses. Works worldwide. Fast and reliable. Returns rich metadata (place types, address components, place IDs).
**Cons:** Costs money ($5 per 1,000 requests after free tier). Requires API key and billing account. Does not return Census geography—you'd need to do a separate spatial join.
**Use when:** You need to geocode international addresses, have messy address data that the Census geocoder can't match, or need the highest possible match rate and have budget for it.
python
import googlemaps
from typing import Optional

def geocode_address_google(address: str, api_key: str) -> Optional[dict]:
    """
    Geocode address using Google Maps API.
    Requires API key with Geocoding API enabled.
    """
    gmaps = googlemaps.Client(key=api_key)
    result = gmaps.geocode(address)
    
    if result:
        location = result[0]['geometry']['location']
        return {
            'formatted_address': result[0]['formatted_address'],
            'lat': location['lat'],
            'lon': location['lng'],
            'place_id': result[0]['place_id']
        }
    return None

# Batch geocode a DataFrame
def batch_geocode(df: pd.DataFrame, address_col: str, api_key: str) -> pd.DataFrame:
    gmaps = googlemaps.Client(key=api_key)

    results = []
    for address in df[address_col]:
        try:
            result = gmaps.geocode(address)
            if result:
                loc = result[0]['geometry']['location']
                results.append({'lat': loc['lat'], 'lon': loc['lng']})
            else:
                results.append({'lat': None, 'lon': None})
        except Exception:
            results.append({'lat': None, 'lon': None})

    return pd.concat([df, pd.DataFrame(results)], axis=1)

GeoPandas


python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Read data from various formats
gdf = gpd.read_file('data.geojson')                       # GeoJSON
gdf = gpd.read_file('data.shp')                           # Shapefile
gdf = gpd.read_file('https://example.com/data.geojson')   # From URL
gdf = gpd.read_parquet('data.parquet')                    # GeoParquet (fast!)

# Transform DataFrame with lat/lon to GeoDataFrame
df = pd.read_csv('locations.csv')
geometry = [Point(xy) for xy in zip(df['longitude'], df['latitude'])]
gdf = gpd.GeoDataFrame(df, geometry=geometry)

# Set CRS (Coordinate Reference System)
# EPSG:4326 = WGS84 (standard latitude, longitude)
gdf = gdf.set_crs('EPSG:4326')

# Transform to a different CRS (for area/distance calculations, use a projected CRS)
gdf_projected = gdf.to_crs('EPSG:3857')  # Web Mercator, for distance in meters

# Basic spatial operations
# Find the area of a shape
gdf['area'] = gdf_projected.geometry.area

# Find the center of a shape
gdf['centroid'] = gdf.geometry.centroid

# Draw a 1km boundary around a point (when set to CRS 3857)
gdf['buffer_1km'] = gdf_projected.geometry.buffer(1000)

# Spatial join: find points within polygons
points = gpd.read_file('points.geojson')
polygons = gpd.read_file('boundaries.geojson')
joined = gpd.sjoin(points, polygons, predicate='within')

# Dissolve: merge geometries by attribute
dissolved = gdf.dissolve(by='state', aggfunc='sum')

# Export to various formats
gdf.to_parquet('output.parquet')                  # GeoParquet (recommended)
gdf.to_file('output.geojson', driver='GeoJSON')   # for tools that don't support GeoParquet

Geo-visualization with .explore(), lonboard and Datawrapper

.explore()

**Best for:** Quick exploration and prototyping during data analysis.
**Pros:** Built into GeoPandas—the method is available on any GeoDataFrame. Great for exploratory data analysis—checking that your data looks right, exploring spatial patterns, and iterating quickly on map designs.
**Cons:** Becomes slow with large datasets (>100k features). Limited customization compared to dedicated mapping libraries. Requires extra dependencies to be installed.
**Use when:** You're in the middle of analysis and want to quickly visualize your GeoDataFrame without switching tools.
Required dependencies:
bash
pip install folium mapclassify matplotlib
  • folium - Required for .explore() to work at all (renders the interactive map)
  • mapclassify - Required when using the scheme= parameter for classification (e.g., 'naturalbreaks', 'quantiles', 'equalinterval')
  • matplotlib - Required for colormap (cmap=) support
python
import geopandas as gpd

# folium, mapclassify, and matplotlib must be installed but don't need to be imported;
# geopandas imports them automatically when you call .explore()

# Basic interactive map (uses folium under the hood)
gdf.explore()

# Choropleth map with customization
# (requires mapclassify for the scheme parameter)
gdf.explore(
    column='population',               # Column for color scale
    cmap='YlOrRd',                     # Matplotlib colormap
    scheme='naturalbreaks',            # Classification scheme (needs mapclassify)
    k=5,                               # Number of bins
    legend=True,
    tooltip=['name', 'population'],    # Columns to show on hover
    popup=True,                        # Show all columns on click
    tiles='CartoDB positron',          # Background tiles
    style_kwds={'color': 'black', 'weight': 0.5}   # Border style
)

lonboard

**Best for:** Large datasets and high-performance visualization in Jupyter notebooks.
**Pros:** GPU-accelerated rendering via deck.gl can handle millions of points smoothly. Excellent interactivity—pan, zoom, and hover work fluidly even with massive datasets. Native support for the GeoArrow format for efficient data transfer.
**Cons:** Requires separate installation (pip install lonboard). Styling options are more technical (RGBA arrays, deck.gl conventions).
**Use when:** You have large point datasets (crime incidents, sensor readings, business locations) or need smooth interactivity with 100k+ features.
python
import geopandas as gpd
from lonboard import viz, Map, ScatterplotLayer, PolygonLayer

# Quick visualization (auto-detects geometry type)
viz(gdf)

# Custom ScatterplotLayer for points
layer = ScatterplotLayer.from_geopandas(
    gdf,
    get_radius=100,
    get_fill_color=[255, 0, 0, 200],  # RGBA
    pickable=True
)
m = Map(layer)
m

# PolygonLayer with color based on a column
from lonboard.colormap import apply_continuous_cmap
import matplotlib.pyplot as plt

colors = apply_continuous_cmap(gdf['value'], plt.cm.viridis)
layer = PolygonLayer.from_geopandas(
    gdf,
    get_fill_color=colors,
    get_line_color=[0, 0, 0, 100],
    pickable=True
)
Map(layer)

Datawrapper

**Best for:** Publication-ready choropleth and proportional symbol maps for articles and reports.
**Pros:** Beautiful, professional defaults out of the box. Generates embeddable, responsive iframes that work in any CMS. Readers can interact (hover, click) without running any code. Accessible and mobile-friendly. Easy to update programmatically for live-updating maps.
**Cons:** Requires a Datawrapper account (free tier available). Limited to Datawrapper's supported boundary files—you can't bring arbitrary geometries. Less flexibility for custom visualizations.
**Use when:** You need a polished map for publication. Ideal for choropleth maps showing statistics by region (unemployment by state, COVID cases by county, election results by district). Your audience will view the map in a browser, not a notebook.
Unlike .explore() or lonboard, you don't pass raw geometry—instead you match your data to Datawrapper's built-in boundary files using standard codes (FIPS, ISO, etc.).
python
import datawrapper as dw
import pandas as pd

# Read API key
with open('datawrapper_api_key.txt', 'r') as f:
    api_key = f.read().strip()

# Prepare data with location codes that match Datawrapper's boundaries
# For US states: use 2-letter abbreviations or FIPS codes
# For countries: use ISO 3166-1 alpha-2 codes
df = pd.DataFrame({
    'state': ['AL', 'AK', 'AZ', 'AR', 'CA'],   # State abbreviations
    'unemployment_rate': [4.9, 3.2, 7.1, 4.2, 5.8]
})

# Create a choropleth map
chart = dw.ChoroplethMap(
    title='Unemployment Rate by State',
    intro='Percentage of labor force unemployed, 2024',
    data=df,

    # Map configuration
    basemap='us-states',             # Built-in US states boundaries
    basemap_key='state',             # Column in your data with location codes
    value_column='unemployment_rate',

    # Styling
    color_palette='YlOrRd',          # Color scheme
    legend_title='Unemployment %',

    # Attribution
    source_name='Bureau of Labor Statistics',
    source_url='https://www.bls.gov/',
    byline='Your Name'
)

# Create and publish
chart.create(access_token=api_key)
chart.publish()

# Get embed code for your article
iframe = chart.get_iframe_code(responsive=True)
print(f"Chart URL: https://datawrapper.dwcdn.net/{chart.chart_id}")

# Update with new data (for live-updating maps)
new_df = pd.DataFrame({...})  # Updated data
existing_chart = dw.get_chart('YOUR_CHART_ID')
existing_chart.data = new_df
existing_chart.title = 'Updated Title'  # modify properties
existing_chart.update()
existing_chart.publish()

**Available Datawrapper basemaps include:**
- `us-states`, `us-counties`, `us-congressional-districts`
- `world`, `europe`, `africa`, `asia`
- Country-specific maps (e.g., `germany-states`, `uk-constituencies`)

Learning resources


  • NICAR (Investigative Reporters & Editors)
  • Knight Center for Journalism in the Americas
  • Data Journalism Handbook (datajournalism.com)
  • Flowing Data (flowingdata.com)
  • The Pudding (pudding.cool) - examples
  • Sigma Awards (https://www.sigmaawards.org/) - examples