data-journalism


Data journalism methodology


Systematic approaches for finding, analyzing and presenting data in journalism.

Story structure for data journalism


Data journalism framework



The framework for data journalism was established by Philip Meyer, a journalist for Knight-Ridder, Harvard Nieman Fellow and professor at UNC-Chapel Hill. In his book *The New Precision Journalism*, which outlines his ideas, Meyer encourages journalists to treat journalism "as if it were a science" by adopting the scientific method:
- Making observations / formulating a question
- Researching the question / collecting, storing and retrieving data
- Formulating a hypothesis
- Testing the hypothesis, using both qualitative (interviews, documents etc.) and quantitative (data analysis etc.) methods
- Analyzing the results and reducing them to the most important findings
- Presenting them to the audience

This process should be thought of as iterative, rather than sequential.

The data story arc


1. The hook (nut graf)


  • What's the key finding(s)?
  • Why should readers care?
  • What's the human impact?

2. The evidence


  • Show the data
  • Explain the methodology
  • Acknowledge limitations

3. The context


  • How does this compare to the past?
  • How does this compare to elsewhere?
  • What's the trend?

4. The human element


  • Individual examples that illustrate the data
  • Expert interpretation
  • Affected voices

5. The implications


  • What does this mean going forward?
  • What questions remain?
  • What actions could result?

6. The methodology box


  • Where did the data come from?
  • How was it analyzed?
  • What are the limitations?
  • How can readers explore further?

Methodology documentation template



How we did this analysis


Data sources


[List all data sources with links and access dates]

Time period


[Specify exactly what time period is covered]

Definitions


[Define key terms and how you operationalized them]

Analysis steps


  1. [First step of analysis]
  2. [Second step]
  3. [Continue...]

Limitations


  • [Limitation 1]
  • [Limitation 2]

What we excluded and why


  • [Excluded category]: [Reason]

Verification


[How findings were verified/checked]

Code and data availability


[Link to GitHub repo if sharing code/data]

Contact


[How readers can reach you with questions]

Data acquisition


Public data sources



Federal data sources


General


  • Data.gov - Federal open data portal
  • Census Bureau (census.gov) - Demographics, economic data
  • BLS (bls.gov) - Employment, inflation, wages
  • BEA (bea.gov) - GDP, economic accounts
  • Federal Reserve (federalreserve.gov) - Financial data
  • SEC EDGAR - Corporate filings
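
Several of these portals expose simple JSON APIs. As a minimal sketch, pulling state-level figures from the Census Bureau API might look like this (the 2022 ACS 5-year endpoint and the total-population variable B01003_001E are assumptions; check api.census.gov for current vintages and variable codes):

python
import requests
import pandas as pd

# Total population (B01003_001E) for every state from the ACS 5-year estimates
# Endpoint and variable code are assumptions; verify at api.census.gov
url = 'https://api.census.gov/data/2022/acs/acs5'
params = {'get': 'NAME,B01003_001E', 'for': 'state:*'}

resp = requests.get(url, params=params, timeout=30)
resp.raise_for_status()

rows = resp.json()  # first row is the header
population = (pd
    .DataFrame(rows[1:], columns=rows[0])
    .rename(columns={'B01003_001E': 'population'})
    .assign(population=lambda x: pd.to_numeric(x['population'])))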

Specific domains


  • EPA (epa.gov/data) - Environmental data
  • FDA (fda.gov/data) - Drug approvals, recalls, adverse events
  • CDC WONDER - Health statistics
  • NHTSA - Vehicle safety data
  • DOT - Transportation statistics
  • FEC - Campaign finance
  • USASpending.gov - Federal contracts and grants

State and local


  • State open data portals (search: "[state] open data")
  • Socrata-powered sites (many cities/states)
  • OpenStreets, municipal GIS portals
  • State comptroller/auditor reports

Data request strategies



Getting data that isn't public


Public records request (e.g., FOIA) for datasets


  • Request databases, not just documents
  • Ask for data dictionary/schema
  • Request in native format (CSV, SQL dump)
  • Specify field-level needs

Building your own dataset


  • Scraping public information
  • Crowdsourcing from readers
  • Systematic document review
  • Surveys (with proper methodology)
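
For the scraping route, a minimal sketch with requests and BeautifulSoup (the URL, table structure and CSS selectors below are placeholders, not a real site):

python
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical example: turn a public meeting schedule into a DataFrame
url = 'https://example.gov/meetings'  # placeholder URL
soup = BeautifulSoup(requests.get(url, timeout=30).text, 'html.parser')

meetings = pd.DataFrame([
    {'date': row.select_one('.date').get_text(strip=True),
     'topic': row.select_one('.topic').get_text(strip=True)}
    for row in soup.select('table.meetings tr')[1:]  # skip the header row
])

Always check the site's terms of service and robots.txt, and rate-limit your requests.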

Commercial data sources (for newsrooms)


  • LexisNexis
  • Refinitiv
  • Bloomberg
  • Industry-specific databases

Data cleaning and preparation


Common data problems


python
from typing import Any

import pandas as pd
import numpy as np
from rapidfuzz import fuzz
from itertools import combinations

# Inflation adjustment
import cpi
import wbdata


def standardize_name(name: Any) -> str | None:
    """Standardize name format to 'First Last'."""
    if pd.isna(name):
        return None
    name = str(name).strip().upper()
    # Handle "LAST, FIRST" format
    if ',' in name:
        parts = name.split(',')
        name = f"{parts[1].strip()} {parts[0].strip()}"
    return name


def parse_date(date_str: Any) -> pd.Timestamp | None:
    """Parse dates in various formats."""
    if pd.isna(date_str):
        return None

    formats = [
        '%m/%d/%Y', '%Y-%m-%d', '%B %d, %Y',
        '%d-%b-%y', '%m-%d-%Y', '%Y/%m/%d'
    ]

    for fmt in formats:
        try:
            return pd.to_datetime(date_str, format=fmt)
        except (ValueError, TypeError):
            continue

    # Fall back to pandas parser
    try:
        return pd.to_datetime(date_str)
    except (ValueError, TypeError):
        return None


def handle_missing(df: pd.DataFrame, thresh: int | None = None, per_thresh: float | None = None,
                   required_col: str | None = None) -> pd.DataFrame:
    """Handle DataFrames with too many missing values, as defined by the user."""
    total_missing = df.isna().sum().sum()
    if thresh and total_missing >= thresh:
        return df.dropna(subset=[required_col]).reset_index(drop=True).copy()
    elif per_thresh and (total_missing / df.size * 100) >= per_thresh:
        return df.dropna(subset=[required_col]).reset_index(drop=True).copy()
    else:
        return df


def handle_duplicates(df: pd.DataFrame, thresh: int | None = None) -> pd.DataFrame:
    """Handle duplicate rows of data."""
    if thresh and df.duplicated().sum() >= thresh:
        return df.drop_duplicates().reset_index(drop=True).copy()
    else:
        return df


def flag_similar_names(df: pd.DataFrame, name_col: str, threshold: int = 85) -> pd.DataFrame:
    """Flag rows that have potential duplicate names using vectorized comparison."""
    names = df[name_col].dropna().unique()

    # Use combinations() to avoid nested loop and duplicate comparisons
    dup_names: set[Any] = {
        name
        for name1, name2 in combinations(names, 2)
        if fuzz.ratio(str(name1).lower(), str(name2).lower()) >= threshold
        for name in (name1, name2)
    }

    df['has_similar_name'] = df[name_col].isin(dup_names)
    return df


def flag_outliers(series: pd.Series, method: str = 'iqr', threshold: float = 1.5) -> pd.Series:
    """Flag statistical outliers."""
    if method == 'iqr':
        Q1 = series.quantile(0.25)
        Q3 = series.quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - threshold * IQR
        upper = Q3 + threshold * IQR
        return (series < lower) | (series > upper)
    elif method == 'zscore':
        z_scores = np.abs((series - series.mean()) / series.std())
        return z_scores > threshold
    raise ValueError(f"Unknown method: {method}")

# Use descriptive variable names and chain methods
data_clean = (pd
    # Load messy data — raw_data is a placeholder
    # Be sure to use the right reader for the filetype
    .read_csv('../data/raw/raw_data.csv')

    # DATA TYPE CORRECTIONS
    # Ensure proper types for analysis
    .assign(
        # Convert to numeric (handling errors)
        amount=lambda x: pd.to_numeric(x['amount'], errors='coerce'),

        # Convert to categorical (saves memory, enables ordering)
        status=lambda x: pd.Categorical(x['status']))

    .assign(
        # INCONSISTENT FORMATTING
        # Problem: Names in different formats
        # e.g. "SMITH, JOHN" vs "John Smith" vs "smith john"
        name_clean=lambda x: x['name'].apply(standardize_name),

        # DATE INCONSISTENCIES
        # Problem: Dates in multiple formats
        # e.g. "01/15/2024", "2024-01-15", "January 15, 2024", "15-Jan-24"
        date_clean=lambda x: x['date'].apply(parse_date),

        # OUTLIERS
        # Identify potential data entry errors
        amount_outlier=lambda x: flag_outliers(x['amount']),
    )

    # Fuzzy duplicates (similar but not identical)
    # Use record linkage or manual review
    .pipe(flag_similar_names, name_col='name_clean', threshold=85)

    # MISSING VALUES
    # Strategy depends on context
    # First check missing value patterns
    .pipe(handle_missing, thresh=None, per_thresh=None)

    # DUPLICATES — Find and handle duplicates
    .pipe(handle_duplicates, thresh=None)

    .reset_index(drop=True)
    .copy())

Data validation checklist



Pre-analysis data validation


Structural checks


  • Row count matches expected
  • Column count and names correct
  • Data types appropriate
  • No unexpected null columns

Content checks


  • Date ranges make sense
  • Numeric values within expected bounds
  • Categorical values match expected options
  • Geographic data resolves correctly
  • IDs are unique where expected

Consistency checks


  • Totals add up to expected values
  • Cross-tabulations balance
  • Related fields are consistent
  • Time series is continuous

Source verification


  • Can trace back to original source
  • Methodology documented
  • Known limitations noted
  • Update frequency understood
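
A minimal sketch of automating a few of these checks with plain assertions (the column names and bounds are hypothetical and should be adapted to your dataset):

python
import pandas as pd

def validate_before_analysis(df: pd.DataFrame) -> None:
    """Raise an AssertionError if basic expectations about the data are violated."""
    # Structural checks
    expected_cols = {'employee_id', 'name', 'salary', 'hire_date'}  # hypothetical schema
    assert expected_cols.issubset(df.columns), f"Missing columns: {expected_cols - set(df.columns)}"
    assert len(df) > 0, "No rows loaded"

    # Content checks
    assert df['salary'].between(0, 1_000_000).all(), "Salary outside expected bounds"
    assert df['hire_date'].between('1950-01-01', pd.Timestamp.today()).all(), "Implausible hire dates"

    # Consistency checks
    assert not df['employee_id'].duplicated().any(), "Employee IDs are not unique"

validate_before_analysis(data_clean)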

Statistical analysis for journalism


Basic statistics with context


python

# Essential statistics for any dataset
def describe_for_journalism(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Generate journalist-friendly statistics."""
    stats = df[col].describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.99])

    # Add skewness to the describe() output
    stats['skewness'] = df[col].skew()

    return stats.to_frame(name=col)

# Example interpretation
# describe() labels the percentiles '50%', '90%', '99%'; grab the column back as a Series
stats = describe_for_journalism(salaries, 'salary')['salary']

print(f"""ANALYSIS

We analyzed {stats['count']:,.0f} salary records.
The median salary is ${stats['50%']:,.0f}, meaning half of workers earn more and half earn less.
The average salary is ${stats['mean']:,.0f}, which is {'higher' if stats['mean'] > stats['50%'] else 'lower'} than the median, indicating the distribution is {'right-skewed (pulled up by high earners)' if stats['skewness'] > 0 else 'left-skewed'}.
The top 10% of earners make at least ${stats['90%']:,.0f}. The top 1% make at least ${stats['99%']:,.0f}.
""")

Comparisons and context


python

# Calculate change metrics for a column
def calculate_change(df: pd.DataFrame, col: str, periods: int = 1) -> pd.DataFrame:
    """Add change metrics to a DataFrame using built-in pandas methods.

    Args:
        df: Input DataFrame
        col: Column to calculate changes for
        periods: Number of rows to look back (1=previous row, 12=year-over-year for monthly)
    """
    return df.assign(
        absolute_change=df[col].diff(periods),
        percent_change=df[col].pct_change(periods) * 100,
        direction=np.sign(df[col].diff(periods)).map({1: 'increased', -1: 'decreased', 0: 'unchanged'})
    )

# Usage:
changes = data_clean.pipe(calculate_change, 'revenue', periods=12)  # Year-over-year for monthly data

# Per capita calculations (essential for fair comparisons)
def per_capita(value: float, population: float, multiplier: int = 100000) -> float:
    """Calculate per capita rate."""
    return (value / population) * multiplier  # Per 100,000 is standard

# Example: Crime rates
city_a = {'crimes': 5000, 'population': 100000}
city_b = {'crimes': 8000, 'population': 500000}

rate_a = per_capita(city_a['crimes'], city_a['population'])
rate_b = per_capita(city_b['crimes'], city_b['population'])

print(f"City A: {rate_a:.1f} crimes per 100,000 residents")
print(f"City B: {rate_b:.1f} crimes per 100,000 residents")

# City A actually has a higher crime rate despite fewer total crimes!

def adjust_for_inflation(
    amount: float | pd.Series,
    from_year: int | pd.Series,
    to_year: int,
    country: str = 'US'
) -> float | pd.Series:
    """Adjust dollar amounts for inflation. Works with scalars or Series for .assign().

    Args:
        amount: Value(s) to adjust
        from_year: Original year(s) of the amount
        to_year: Target year to adjust to
        country: ISO 2-letter country code (default 'US'). US uses BLS data via cpi package,
                 others use World Bank CPI data (FP.CPI.TOTL indicator)
    """
    if country == 'US':
        # Use cpi package for US (more accurate, from BLS)
        if isinstance(from_year, pd.Series):
            return pd.Series([cpi.inflate(amt, yr, to=to_year)
                              for amt, yr in zip(amount, from_year)], index=amount.index)
        return cpi.inflate(amount, from_year, to=to_year)
    else:
        # Use World Bank data for other countries
        cpi_data = wbdata.get_dataframe(
            {'FP.CPI.TOTL': 'cpi'},
            country=country
        )['cpi'].to_dict()

        from_cpi = pd.Series(from_year).map(cpi_data) if isinstance(from_year, pd.Series) else cpi_data[from_year]
        to_cpi = cpi_data[to_year]
        return amount * (to_cpi / from_cpi)

# Usage:
adjust_for_inflation(100, 2020, 2024)                 # US by default
adjust_for_inflation(100, 2020, 2024, country='GB')   # UK
df.assign(inf_adjust24=lambda x: adjust_for_inflation(x['amount'], x['year'], 2024, country='DE'))

# Always adjust when comparing dollars across years!

Correlation vs causation



Reporting correlations responsibly


What you CAN say


  • "X and Y are correlated"
  • "As X increases, Y tends to increase"
  • "Areas with higher X also tend to have higher Y"
  • "X is associated with Y"

What you CANNOT say (without more evidence)


  • "X causes Y"
  • "X leads to Y"
  • "Y happens because of X"

Questions to ask before implying causation


  1. Is there a plausible mechanism?
  2. Does the timing make sense (cause before effect)?
  3. Is there a dose-response relationship?
  4. Has the finding been replicated?
  5. Have confounding variables been controlled?
  6. Are there alternative explanations?

Red flags for spurious correlations


  • Extremely high correlation (r > 0.95) with unrelated things
  • No logical connection between variables
  • Third variable could explain both
  • Small sample size with high variance
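
As a minimal sketch of putting this language into practice (the DataFrame and column names are hypothetical; scipy is assumed to be available):

python
from scipy.stats import pearsonr

# Hypothetical columns: per-pupil funding and graduation rate by school district
r, p_value = pearsonr(districts['funding_per_pupil'], districts['graduation_rate'])

# Report the association without implying causation
print(f"Districts with higher per-pupil funding also tend to have higher graduation rates "
      f"(r = {r:.2f}, p = {p_value:.3f}). The data alone cannot show that funding causes "
      f"higher graduation rates.")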

Data visualization


Chart selection guide



Choosing the right chart


Comparison


  • Bar chart: Compare categories
  • Grouped bar: Compare categories across groups
  • Bullet chart: Actual vs target

Change over time


  • Line chart: Trends over time
  • Area chart: Cumulative totals over time
  • Slope chart: Change between two points

Distribution


  • Histogram: Distribution of one variable
  • Box plot: Compare distributions across groups
  • Violin plot: Detailed distribution shape

Relationship


  • Scatter plot: Relationship between two variables
  • Bubble chart: Three variables (x, y, size)
  • Connected scatter: Change in relationship over time

Composition


  • Pie chart: Parts of a whole (almost never use, max 5 slices, prefer donut charts)
  • Donut chart: Parts of a whole
  • Stacked bar: Parts of whole across categories
  • Treemap: Hierarchical composition

Geographic


  • Choropleth: Values by region (use normalized data!)
  • Dot map: Individual locations
  • Proportional symbol: Magnitude at locations

Exploratory interactive visualizations with Plotly Express


python
import plotly.express as px

# Set default template for all charts
px.defaults.template = 'simple_white'


def create_bar_chart(
    data: pd.DataFrame,
    title: str,
    source: str,
    x_val: str,
    y_val: str,
    desc: str = '',
    x_lab: str | None = None,
    y_lab: str | None = None,
):
    """Create a bar chart."""
    fig = px.bar(
        data,
        x=x_val,
        y=y_val,
        # Description and source shown as a subtitle under the main title
        title=f"{title}<br><sup>{desc} Source: {source}</sup>",
        labels={x_val: x_lab if x_lab else x_val, y_val: y_lab if y_lab else y_val}
    )

    return fig

# Example
fig = create_bar_chart(
    data,
    title='Annual Widget Production',
    source='Department of Widgets, 2024',
    desc='The widget department increased its production dramatically starting in 2014.',
    x_val='year',
    y_val='widgets_prod',
    x_lab='Year',
    y_lab='Units produced'
)
fig.show()  # Interactive display

Publication-ready automated data visualizations with Datawrapper


python
import pandas as pd
import datawrapper as dw

# Authentication: Set DATAWRAPPER_ACCESS_TOKEN environment variable,
# or read from file and pass to create()
with open('datawrapper_api_key.txt', 'r') as f:
    api_key = f.read().strip()

# Read in your data
data = pd.read_csv('../data/raw/data.csv')

# Create a bar chart using the new OOP API
chart = dw.BarChart(
    title='My Bar Chart Title',
    intro='Subtitle or description text',
    data=data,

    # Formatting options
    value_label_format=dw.NumberFormat.ONE_DECIMAL,
    show_value_labels=True,
    value_label_alignment='left',
    sort_bars=True,      # sort by value
    reverse_order=False,

    # Source attribution
    source_name='Your Data Source',
    source_url='https://example.com',
    byline='Your Name',

    # Optional: custom base color
    base_color='#1d81a2'
)

# Create and publish (uses DATAWRAPPER_ACCESS_TOKEN env var, or pass token)
chart.create(access_token=api_key)
chart.publish()

# Get chart URL and embed code
print(f"Chart ID: {chart.chart_id}")
print(f"Chart URL: https://datawrapper.dwcdn.net/{chart.chart_id}")
iframe_code = chart.get_iframe_code(responsive=True)

# Update existing chart with new data (for live-updating charts)
existing_chart = dw.get_chart('YOUR_CHART_ID')   # retrieve by ID
existing_chart.data = new_df                     # assign new DataFrame
existing_chart.title = 'Updated Title'           # modify properties
existing_chart.update()                          # push changes to Datawrapper
existing_chart.publish()                         # republish to make live

# Optional — Export chart as image
chart.export(filepath='chart.png', width=800, height=600)

# View chart
chart

Avoiding misleading visualizations



Chart integrity checklist


Axes


  • Y-axis starts at zero (for bar charts)
  • Axis labels are clear
  • Scale is appropriate (not truncated to exaggerate)
  • Both axes labeled with units

Data representation


  • All data points visible
  • Colors are distinguishable (including colorblind)
  • Proportions are accurate
  • 3D effects not distorting perception

Context


  • Title describes what's shown, not conclusion
  • Time period clearly stated
  • Source cited
  • Sample size/methodology noted if relevant
  • Uncertainty shown where appropriate

Honesty


  • Cherry-picking dates avoided
  • Outliers explained, not hidden
  • Dual axes justified (usually avoid)
  • Annotations don't mislead
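
A minimal sketch of enforcing a couple of these rules in Plotly Express (reusing the widget-production example from above; the source text is an assumption):

python
import plotly.express as px

fig = px.bar(
    data,
    x='year',
    y='widgets_prod',
    title='Annual widget production',
    labels={'year': 'Year', 'widgets_prod': 'Units produced'},
)
fig.update_yaxes(rangemode='tozero')   # bar charts should start at zero
fig.add_annotation(                    # cite the source on the chart itself
    text='Source: Department of Widgets, 2024',
    xref='paper', yref='paper', x=0, y=-0.2, showarrow=False,
)
fig.show()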

Working with geospatial data


Geocoding data


U.S. Census Geocoder


**Best for:** U.S. addresses only. Returns Census geography (tract, block, FIPS codes) along with coordinates—essential for joining with Census demographic data.
**Pros:** Completely free with no API key required. Returns Census geographies (state/county FIPS, tract, block) that let you join with ACS/decennial Census data. Good match rates for standard U.S. addresses.
**Cons:** Limited to 10,000 addresses per batch. U.S. addresses only. Slower than commercial alternatives. Lower match rates for non-standard addresses (PO boxes, rural routes, new construction).
**Use when:** You need to geocode nicely formatted U.S. addresses or you don't have budget for a paid service.
python

# pip install censusbatchgeocoder
import censusbatchgeocoder
import pandas as pd

# DataFrame must have columns: id, address, city, state, zipcode
# (state and zipcode are optional but improve match rates)

def census_geocode(
    df: pd.DataFrame,
    id_col: str = 'id',
    address_col: str = 'address',
    city_col: str = 'city',
    state_col: str = 'state',
    zipcode_col: str = 'zipcode',
    chunk_size: int = 9999
) -> pd.DataFrame:
    """
    Geocode a DataFrame using the U.S. Census batch geocoder.
    Automatically handles datasets larger than 10,000 rows by chunking.

    Returns DataFrame with: latitude, longitude, state_fips, county_fips,
    tract, block, is_match, is_exact, returned_address, geocoded_address
    """
    # Rename columns to expected format
    col_map = {id_col: 'id', address_col: 'address', city_col: 'city'}
    if state_col and state_col in df.columns:
        col_map[state_col] = 'state'
    if zipcode_col and zipcode_col in df.columns:
        col_map[zipcode_col] = 'zipcode'

    renamed_df = df.rename(columns=col_map)
    records = renamed_df.to_dict('records')

    # Small dataset: geocode directly
    if len(records) <= chunk_size:
        results = censusbatchgeocoder.geocode(records)
        return pd.DataFrame(results)

    # Large dataset: process in chunks to stay under the 10,000 limit
    all_results = []
    for i in range(0, len(records), chunk_size):
        chunk = records[i:i + chunk_size]
        print(f"Geocoding rows {i:,} to {i + len(chunk):,} of {len(records):,}...")

        try:
            results = censusbatchgeocoder.geocode(chunk)
            all_results.extend(results)
        except Exception as e:
            print(f"Error on chunk starting at {i}: {e}")
            for record in chunk:
                all_results.append({**record, 'is_match': 'No_Match', 'latitude': None, 'longitude': None})

    return pd.DataFrame(all_results)

# Usage:
geocoded = (pd
    .read_csv('../data/raw/addresses.csv')
    .assign(id=lambda x: x.index)
    .pipe(census_geocode, id_col='id', address_col='street', city_col='city',
          state_col='state', zipcode_col='zip'))

Google Maps Geocoder


**Best for:** International addresses, high match rates, and messy/non-standard address formats.
**Pros:** Excellent match rates even for poorly formatted addresses. Works worldwide. Fast and reliable. Returns rich metadata (place types, address components, place IDs).
**Cons:** Costs money ($5 per 1,000 requests after free tier). Requires API key and billing account. Does not return Census geography—you'd need to do a separate spatial join.
**Use when:** You need to geocode international addresses, have messy address data that the Census geocoder can't match, or need the highest possible match rate and have budget for it.
python
import googlemaps
from typing import Optional

def geocode_address_google(address: str, api_key: str) -> Optional[dict]:
    """
    Geocode address using Google Maps API.
    Requires API key with Geocoding API enabled.
    """
    gmaps = googlemaps.Client(key=api_key)
    result = gmaps.geocode(address)
    
    if result:
        location = result[0]['geometry']['location']
        return {
            'formatted_address': result[0]['formatted_address'],
            'lat': location['lat'],
            'lon': location['lng'],
            'place_id': result[0]['place_id']
        }
    return None

# Batch geocode a DataFrame
def batch_geocode(df: pd.DataFrame, address_col: str, api_key: str) -> pd.DataFrame:
    gmaps = googlemaps.Client(key=api_key)

    results = []
    for address in df[address_col]:
        try:
            result = gmaps.geocode(address)
            if result:
                loc = result[0]['geometry']['location']
                results.append({'lat': loc['lat'], 'lon': loc['lng']})
            else:
                results.append({'lat': None, 'lon': None})
        except Exception:
            results.append({'lat': None, 'lon': None})

    return pd.concat([df, pd.DataFrame(results)], axis=1)

GeoPandas


python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Read data from various formats
gdf = gpd.read_file('data.geojson')                       # GeoJSON
gdf = gpd.read_file('data.shp')                           # Shapefile
gdf = gpd.read_file('https://example.com/data.geojson')   # From URL
gdf = gpd.read_parquet('data.parquet')                    # GeoParquet (fast!)

# Transform DataFrame with lat/lon to GeoDataFrame
df = pd.read_csv('locations.csv')
geometry = [Point(xy) for xy in zip(df['longitude'], df['latitude'])]
gdf = gpd.GeoDataFrame(df, geometry=geometry)

# Set CRS (Coordinate Reference System)
# EPSG:4326 = WGS84 (standard latitude, longitude)
gdf = gdf.set_crs('EPSG:4326')

# Transform to a different CRS (for area/distance calculations, use a projected CRS)
gdf_projected = gdf.to_crs('EPSG:3857')  # Web Mercator, for distance in meters

# Basic spatial operations
# Find the area of a shape
gdf['area'] = gdf_projected.geometry.area

# Find the center of a shape
gdf['centroid'] = gdf.geometry.centroid

# Draw a 1km boundary around a point (when set to CRS 3857)
gdf['buffer_1km'] = gdf_projected.geometry.buffer(1000)

# Spatial join: find points within polygons
points = gpd.read_file('points.geojson')
polygons = gpd.read_file('boundaries.geojson')
joined = gpd.sjoin(points, polygons, predicate='within')

# Dissolve: merge geometries by attribute
dissolved = gdf.dissolve(by='state', aggfunc='sum')

# Export to various formats
gdf.to_parquet('output.parquet')                  # GeoParquet (recommended)
gdf.to_file('output.geojson', driver='GeoJSON')   # for tools that don't support GeoParquet

Geo-visualization with .explore(), lonboard and Datawrapper

.explore()

**Best for:** Quick exploration and prototyping during data analysis.
**Pros:** Built into GeoPandas—the method is available on any GeoDataFrame. Great for exploratory data analysis—checking that your data looks right, exploring spatial patterns, and iterating quickly on map designs.
**Cons:** Becomes slow with large datasets (>100k features). Limited customization compared to dedicated mapping libraries. Requires extra dependencies to be installed.
**Use when:** You're in the middle of analysis and want to quickly visualize your GeoDataFrame without switching tools.
Required dependencies:
bash
pip install folium mapclassify matplotlib
  • folium - Required for .explore() to work at all (renders the interactive map)
  • mapclassify - Required when using the scheme= parameter for classification (e.g., 'naturalbreaks', 'quantiles', 'equalinterval')
  • matplotlib - Required for colormap (cmap=) support
python
import geopandas as gpd

# folium, mapclassify, and matplotlib must be installed but don't need to be imported;
# geopandas imports them automatically when you call .explore()

# Basic interactive map (uses folium under the hood)
gdf.explore()

# Choropleth map with customization
# (requires mapclassify for the scheme parameter)
gdf.explore(
    column='population',               # Column for color scale
    cmap='YlOrRd',                     # Matplotlib colormap
    scheme='naturalbreaks',            # Classification scheme (needs mapclassify)
    k=5,                               # Number of bins
    legend=True,
    tooltip=['name', 'population'],    # Columns to show on hover
    popup=True,                        # Show all columns on click
    tiles='CartoDB positron',          # Background tiles
    style_kwds={'color': 'black', 'weight': 0.5}   # Border style
)

lonboard

**Best for:** Large datasets and high-performance visualization in Jupyter notebooks.
**Pros:** GPU-accelerated rendering via deck.gl can handle millions of points smoothly. Excellent interactivity—pan, zoom, and hover work fluidly even with massive datasets. Native support for the GeoArrow format for efficient data transfer.
**Cons:** Requires separate installation (pip install lonboard). Styling options are more technical (RGBA arrays, deck.gl conventions).
**Use when:** You have large point datasets (crime incidents, sensor readings, business locations) or need smooth interactivity with 100k+ features.
python
import geopandas as gpd
from lonboard import viz, Map, ScatterplotLayer, PolygonLayer

# Quick visualization (auto-detects geometry type)
viz(gdf)

# Custom ScatterplotLayer for points
layer = ScatterplotLayer.from_geopandas(
    gdf,
    get_radius=100,
    get_fill_color=[255, 0, 0, 200],  # RGBA
    pickable=True
)
m = Map(layer)
m

# PolygonLayer with color based on a column
from lonboard.colormap import apply_continuous_cmap
import matplotlib.pyplot as plt

colors = apply_continuous_cmap(gdf['value'], plt.cm.viridis)
layer = PolygonLayer.from_geopandas(
    gdf,
    get_fill_color=colors,
    get_line_color=[0, 0, 0, 100],
    pickable=True
)
Map(layer)

Datawrapper

**Best for:** Publication-ready choropleth and proportional symbol maps for articles and reports.
**Pros:** Beautiful, professional defaults out of the box. Generates embeddable, responsive iframes that work in any CMS. Readers can interact (hover, click) without running any code. Accessible and mobile-friendly. Easy to update programmatically for live-updating maps.
**Cons:** Requires a Datawrapper account (free tier available). Limited to Datawrapper's supported boundary files—you can't bring arbitrary geometries. Less flexibility for custom visualizations.
**Use when:** You need a polished map for publication. Ideal for choropleth maps showing statistics by region (unemployment by state, COVID cases by county, election results by district). Your audience will view the map in a browser, not a notebook.
Unlike .explore() or lonboard, you don't pass raw geometry—instead you match your data to Datawrapper's built-in boundary files using standard codes (FIPS, ISO, etc.).
python
import datawrapper as dw
import pandas as pd

# Read API key
with open('datawrapper_api_key.txt', 'r') as f:
    api_key = f.read().strip()

# Prepare data with location codes that match Datawrapper's boundaries
# For US states: use 2-letter abbreviations or FIPS codes
# For countries: use ISO 3166-1 alpha-2 codes
df = pd.DataFrame({
    'state': ['AL', 'AK', 'AZ', 'AR', 'CA'],   # State abbreviations
    'unemployment_rate': [4.9, 3.2, 7.1, 4.2, 5.8]
})

# Create a choropleth map
chart = dw.ChoroplethMap(
    title='Unemployment Rate by State',
    intro='Percentage of labor force unemployed, 2024',
    data=df,

    # Map configuration
    basemap='us-states',             # Built-in US states boundaries
    basemap_key='state',             # Column in your data with location codes
    value_column='unemployment_rate',

    # Styling
    color_palette='YlOrRd',          # Color scheme
    legend_title='Unemployment %',

    # Attribution
    source_name='Bureau of Labor Statistics',
    source_url='https://www.bls.gov/',
    byline='Your Name'
)

# Create and publish
chart.create(access_token=api_key)
chart.publish()

# Get embed code for your article
iframe = chart.get_iframe_code(responsive=True)
print(f"Chart URL: https://datawrapper.dwcdn.net/{chart.chart_id}")

# Update with new data (for live-updating maps)
new_df = pd.DataFrame({...})  # Updated data
existing_chart = dw.get_chart('YOUR_CHART_ID')
existing_chart.data = new_df
existing_chart.title = 'Updated Title'  # modify properties
existing_chart.update()
existing_chart.publish()

**Available Datawrapper basemaps include:**
- `us-states`, `us-counties`, `us-congressional-districts`
- `world`, `europe`, `africa`, `asia`
- Country-specific maps (e.g., `germany-states`, `uk-constituencies`)

Learning resources


  • NICAR (Investigative Reporters & Editors)
  • Knight Center for Journalism in the Americas
  • Data Journalism Handbook (datajournalism.com)
  • Flowing Data (flowingdata.com)
  • The Pudding (pudding.cool) - examples
  • Sigma Awards (https://www.sigmaawards.org/) - examples