financial-data-collector

Financial Data Collector

Collect and validate real financial data for US public companies using free data sources. Output is a standardized JSON file ready for consumption by other financial skills.

Critical Constraints

NO FALLBACK values. If a field cannot be retrieved, set it to null with _source: "missing". Never substitute defaults (e.g., beta or 1.0). The downstream skill decides how to handle missing data.
Data source attribution is mandatory. Every data section must have a _source field.
CapEx sign convention: yfinance returns CapEx as negative (cash outflow). Preserve the original sign. Document the convention in output metadata. Do NOT flip signs.
yfinance FCF ≠ Investment bank FCF. yfinance FCF = Operating CF + CapEx (no SBC deduction). Flag this in output metadata so downstream DCF skills don't overstate FCF.
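
The first two constraints can be sketched as a small helper. This is a minimal illustration rather than the collector's actual code: `clean_value` and `make_section` are hypothetical names, and treating a section whose fields are all unavailable as `"missing"` is an assumption. Only the null / `_source: "missing"` convention comes from the rules above.

```python
import json
import math
from typing import Any, Optional


def clean_value(raw: Any) -> Optional[float]:
    """Return a float, or None when the value is absent or NaN. Never a default."""
    if raw is None:
        return None
    value = float(raw)
    return None if math.isnan(value) else value


def make_section(fields: dict, source: str) -> dict:
    """Build one numeric data section with a mandatory _source (hypothetical helper)."""
    section = {name: clean_value(raw) for name, raw in fields.items()}
    # Assumption: if nothing could be retrieved, the whole section is marked missing.
    all_missing = all(value is None for value in section.values())
    section["_source"] = "missing" if all_missing else source
    return section


wacc_inputs = make_section({"beta": None}, source="yfinance")
print(json.dumps(wacc_inputs))  # {"beta": null, "_source": "missing"}
```

Because `None` serializes to JSON `null`, the downstream skill sees the gap explicitly instead of a silently substituted number.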

Workflow

Step 1: Collect Data

Run the collection script:
bash
python scripts/collect_data.py TICKER [--years 5] [--output path/to/output.json]
The script collects in this priority:
  1. yfinance — market data, historical financials, beta, analyst estimates
  2. yfinance ^TNX — 10Y Treasury yield as risk-free rate proxy
  3. User supplement — for years where yfinance returns NaN (report to user, do not guess)
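
For orientation, the command-line interface shown above could be parsed roughly as follows. The internals of `scripts/collect_data.py` are not shown in this document, so this is only a sketch: `build_parser` and `resolve_output_path` are hypothetical names, the default of 5 years is inferred from the usage line, and the default file name follows the delivery rule in Step 3.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring: collect_data.py TICKER [--years 5] [--output path]."""
    parser = argparse.ArgumentParser(description="Collect financial data for one ticker.")
    parser.add_argument("ticker", metavar="TICKER", help="US ticker symbol, e.g. META")
    parser.add_argument("--years", type=int, default=5,
                        help="number of fiscal years to collect (assumed default: 5)")
    parser.add_argument("--output", default=None,
                        help="output path (assumed default: TICKER_financial_data.json)")
    return parser


def resolve_output_path(args: argparse.Namespace) -> str:
    """Use the explicit --output path if given, else the single-file naming rule."""
    if args.output is not None:
        return args.output
    return f"{args.ticker.upper()}_financial_data.json"


args = build_parser().parse_args(["meta", "--years", "3"])
print(args.years, resolve_output_path(args))  # 3 META_financial_data.json
```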

Step 2: Validate Data

bash
python scripts/validate_data.py path/to/output.json
Checks: field completeness, cross-field consistency (Market Cap = Price × Shares), range sanity (WACC 5-20%, beta 0.3-3.0), sign conventions.
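
Two of these checks can be sketched as follows. This is an illustration under stated assumptions, not the contents of `scripts/validate_data.py`: the function names are hypothetical, the 2% tolerance on the market-cap identity is an assumed threshold, and the input numbers are made up.

```python
from typing import Optional


def check_market_cap(price, shares_millions, market_cap_millions,
                     tolerance=0.02) -> Optional[str]:
    """Cross-field check: Market Cap should be close to Price x Shares."""
    if price is None or shares_millions is None or market_cap_millions is None:
        return None  # missing inputs are reported by the completeness check
    expected = price * shares_millions
    if expected == 0:
        return "price x shares is zero; cannot check market cap"
    deviation = abs(market_cap_millions - expected) / expected
    if deviation > tolerance:
        return f"market cap deviates {deviation:.1%} from price x shares"
    return None


def check_range(name, value, low, high) -> Optional[str]:
    """Range sanity check; None values are left to the completeness check."""
    if value is None:
        return None
    if not (low <= value <= high):
        return f"{name}={value} outside [{low}, {high}]"
    return None


results = (
    check_market_cap(100.0, 500.0, 50_000.0),  # consistent: 100 x 500 = 50,000
    check_range("beta", 3.4, 0.3, 3.0),        # out of range
    check_range("wacc", 0.09, 0.05, 0.20),     # within 5-20%
)
issues = [msg for msg in results if msg is not None]
print(issues)  # ['beta=3.4 outside [0.3, 3.0]']
```

Returning a message instead of raising lets a validator collect every problem in one pass and report them together.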

Step 3: Deliver JSON

Single file: {TICKER}_financial_data.json. Schema in references/output-schema.md.
Do NOT create: README, CSV, summary reports, or any auxiliary files.

Output Schema (Summary)

json
{
  "ticker": "META",
  "company_name": "Meta Platforms, Inc.",
  "data_date": "2026-03-02",
  "currency": "USD",
  "unit": "millions_usd",
  "data_sources": { "market_data": "...", "2022_to_2024": "..." },
  "market_data": { "current_price": 648.18, "shares_outstanding_millions": 2187, "market_cap_millions": 1639607, "beta_5y_monthly": 1.284 },
  "income_statement": { "2024": { "revenue": 164501, "ebit": 69380, "tax_expense": ..., "net_income": ..., "_source": "yfinance" } },
  "cash_flow": { "2024": { "operating_cash_flow": ..., "capex": -37256, "depreciation_amortization": 15498, "free_cash_flow": ..., "change_in_nwc": ..., "_source": "yfinance" } },
  "balance_sheet": { "2024": { "total_debt": 30768, "cash_and_equivalents": 77815, "net_debt": -47047, "current_assets": ..., "current_liabilities": ..., "_source": "yfinance" } },
  "wacc_inputs": { "risk_free_rate": 0.0396, "beta": 1.284, "credit_rating": null, "_source": "yfinance + ^TNX" },
  "analyst_estimates": { "revenue_next_fy": 251113, "revenue_fy_after": 295558, "eps_next_fy": 29.59, "_source": "yfinance" },
  "metadata": { "_capex_convention": "negative = cash outflow", "_fcf_note": "yfinance FCF = OperatingCF + CapEx. Does NOT deduct SBC." }
}
Full schema with all field definitions: references/output-schema.md
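
In the summary above, the balance-sheet figures are consistent with net debt being total debt minus cash and equivalents. A minimal sketch of that derivation, assuming this is the intended definition (the full schema file is the authority) and using a hypothetical helper name:

```python
from typing import Optional


def net_debt(total_debt: Optional[float], cash: Optional[float]) -> Optional[float]:
    """Total debt minus cash; None if either input is missing (no fallback values)."""
    if total_debt is None or cash is None:
        return None
    return total_debt - cash


print(net_debt(30768, 77815))  # -47047
print(net_debt(30768, None))   # None
```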
<correct_patterns>

Handling Missing Years

python
if pd.isna(revenue):
    result[year] = {"revenue": None, "_source": "yfinance returned NaN — supplement from 10-K"}

Report missing years to the user. Do NOT skip or fill with estimates.
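
One way to produce that report is to scan the collected statement for null fields after collection. The helper name `missing_years` and the sample rows below are hypothetical; the 2022 revenue figure is a made-up placeholder.

```python
def missing_years(statement: dict, field: str) -> list:
    """Return the sorted years in which `field` is None or absent."""
    return sorted(year for year, row in statement.items() if row.get(field) is None)


income_statement = {
    "2021": {"revenue": None, "_source": "yfinance returned NaN; supplement from 10-K"},
    "2022": {"revenue": 100.0, "_source": "yfinance"},  # placeholder value
}

gaps = missing_years(income_statement, "revenue")
if gaps:
    # Surface the gap to the user rather than skipping the year or estimating it.
    print(f"Revenue missing for {', '.join(gaps)}; please supply values from the 10-K.")
```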

CapEx Sign Preservation

python
capex = cash_flow.loc["Capital Expenditure", year_col]  # -37256.0
result["capex"] = float(capex)  # Preserve negative

Datetime Column Indexing

python
year_col = [c for c in financials.columns if c.year == target_year][0]
revenue = financials.loc["Total Revenue", year_col]
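
Note that the `[0]` lookup raises `IndexError` when no column falls in the target year. A guarded variant might look like the sketch below; `find_year_column` is a hypothetical helper, and plain `datetime.date` objects stand in for the pandas Timestamp column labels, which expose the same `.year` attribute.

```python
import datetime as dt
from typing import Any, Iterable, Optional


def find_year_column(columns: Iterable[Any], target_year: int) -> Optional[Any]:
    """Return the first column label whose .year matches, or None if there is none."""
    return next((c for c in columns if getattr(c, "year", None) == target_year), None)


# Stand-in for financials.columns
columns = [dt.date(2024, 12, 31), dt.date(2023, 12, 31)]

print(find_year_column(columns, 2023))  # 2023-12-31
print(find_year_column(columns, 2019))  # None
```

A `None` result can then be recorded as a missing year instead of crashing the collection run.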

Field Name Guards

python
if "Total Revenue" in financials.index:
    revenue = financials.loc["Total Revenue", year_col]
elif "Revenue" in financials.index:
    revenue = financials.loc["Revenue", year_col]
else:
    revenue = None
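
When several fields need the same guard, the if/elif chain can be factored into a lookup over candidate labels. The sketch below uses a hypothetical helper, `first_present`, with a plain list standing in for `financials.index`. The candidate labels are the two shown above.

```python
from typing import Container, Optional, Sequence


def first_present(index: Container[str], candidates: Sequence[str]) -> Optional[str]:
    """Return the first candidate label found in the index, or None."""
    for label in candidates:
        if label in index:
            return label
    return None


index = ["Revenue", "Net Income"]  # stand-in for financials.index
label = first_present(index, ["Total Revenue", "Revenue"])
print(label)  # Revenue

# With a real DataFrame the value lookup would then be:
# revenue = financials.loc[label, year_col] if label is not None else None
```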
</correct_patterns>
<common_mistakes>

Mistake 1: Default Values for Missing Data

python
# ❌ WRONG
beta = info.get("beta", 1.0)
growth = data.get("growth") or 0.02

# ✅ RIGHT
beta = info.get("beta")  # May be None; that's OK

Mistake 2: Assuming All Years Have Data

python
# ❌ WRONG: 2020-2021 may be NaN
revenue = float(financials.loc["Total Revenue", year_col])

# ✅ RIGHT
value = financials.loc["Total Revenue", year_col]
revenue = float(value) if pd.notna(value) else None

Mistake 3: Using yfinance FCF in DCF Models Directly

yfinance FCF does NOT deduct SBC. For mega-caps like META, SBC can be $20-30B/yr, making yfinance FCF ~30% higher than investment-bank FCF. Always flag this in output.
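
The gap can be illustrated with simple arithmetic. The figures below are hypothetical round numbers in millions of USD, not META's actuals, and the function names are illustrative only. The SBC-adjusted definition is one common convention, so check it against the downstream model's own.

```python
def yfinance_style_fcf(operating_cf: float, capex: float) -> float:
    """Operating CF + CapEx, with CapEx negative (cash outflow). No SBC deduction."""
    return operating_cf + capex


def sbc_adjusted_fcf(operating_cf: float, capex: float, sbc: float) -> float:
    """Same as above, minus stock-based compensation (SBC passed as a positive amount)."""
    return operating_cf + capex - sbc


ocf, capex, sbc = 100_000, -40_000, 15_000  # hypothetical, millions USD

raw = yfinance_style_fcf(ocf, capex)
adjusted = sbc_adjusted_fcf(ocf, capex, sbc)
print(raw, adjusted, f"{raw / adjusted - 1:.0%}")  # 60000 45000 33%
```

With these inputs the unadjusted figure is about a third higher, which is why the `_fcf_note` entry in the output metadata matters to any DCF that consumes `free_cash_flow`.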

Mistake 4: Flipping CapEx Sign

python
# ❌ WRONG: double-negation risk downstream
capex = abs(cash_flow.loc["Capital Expenditure", year_col])

# ✅ RIGHT: preserve original, document convention
capex = float(cash_flow.loc["Capital Expenditure", year_col])  # -37256.0

</common_mistakes>

Known yfinance Pitfalls

See references/yfinance-pitfalls.md for detailed field mapping and workarounds.