llm-public-opinion-analytics
LLM-Based Public Opinion Analytics Assistant
Overview
This project is an intelligent public opinion analysis assistant that integrates real-time data from 15 mainstream platforms across 26 ranking lists with large language model (LLM) analysis capabilities. It provides conversational hot search queries, topic-specific searches, topic clustering, and sentiment analysis. The system supports:
- Real-time web scraping from platforms like Weibo, Bilibili, Douyin, Baidu, etc.
- LLM-powered content analysis (including video content extraction)
- Multi-channel push notifications (WeChat, Enterprise WeChat, Telegram, Email)
- Keyboard shortcuts for crawler control
- Quick data lookup and direct jumps to the source platform
Installation
Prerequisites
- Python Environment: Python 3.8+
- MySQL Database: MySQL 5.7+ or 8.0+
- Browser Driver: ChromeDriver or EdgeDriver
Step 1: Browser Driver Setup
Download the driver matching your browser version:
- Chrome: ChromeDriver Downloads
- Edge: EdgeDriver Downloads
Add the driver to your system PATH:
```bash
# macOS/Linux
export PATH=$PATH:/path/to/driver/directory
# Windows: Add to System Environment Variables
```
Verify installation:
```bash
chromedriver --version
# or
msedgedriver --version
```
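The prerequisite checks above can also be scripted. The following is a minimal sketch using only the standard library; the function name `check_prerequisites` is hypothetical and not part of the project.

```python
import shutil
import sys


def check_prerequisites(drivers=("chromedriver", "msedgedriver"), min_python=(3, 8)):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if sys.version_info < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ is required, "
            f"found {sys.version.split()[0]}"
        )
    # shutil.which searches PATH the same way the shell does.
    if not any(shutil.which(name) for name in drivers):
        problems.append(
            "No browser driver found on PATH (looked for: " + ", ".join(drivers) + ")"
        )
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

This only confirms that a driver executable is reachable; it does not verify that the driver version matches the installed browser.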
Step 2: Clone and Install Dependencies
```bash
git clone https://github.com/hmmnxkl/LLM-Based-Intelligent-Public-Opinion-Analytics-Assistant.git
cd LLM-Based-Intelligent-Public-Opinion-Analytics-Assistant

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```
Step 3: Database Setup
Create MySQL database and tables:
```python
# Reference init.py for schema
import os

import mysql.connector

conn = mysql.connector.connect(
    host=os.getenv('MYSQL_HOST', 'localhost'),
    user=os.getenv('MYSQL_USER'),
    password=os.getenv('MYSQL_PASSWORD')
)
cursor = conn.cursor()
cursor.execute("CREATE DATABASE IF NOT EXISTS hotsearch_db CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci")
cursor.execute("USE hotsearch_db")

# Create tables (see init.py for full schema)
cursor.execute("""
    CREATE TABLE IF NOT EXISTS hot_search_items (
        id INT AUTO_INCREMENT PRIMARY KEY,
        platform VARCHAR(50),
        title VARCHAR(500),
        url TEXT,
        rank_index INT,
        heat_value VARCHAR(100),
        collected_at DATETIME,
        content TEXT,
        sentiment VARCHAR(20),
        INDEX idx_platform (platform),
        INDEX idx_collected (collected_at)
    )
""")
conn.commit()
```
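To illustrate how rows in `hot_search_items` might be written and read back, here is a small sketch that uses an in-memory SQLite database as a stand-in, so it runs without a MySQL server. The column names follow the schema above; the helper names `insert_item` and `top_items` and the sample rows are hypothetical.

```python
import sqlite3
from datetime import datetime

# In-memory SQLite stand-in for the MySQL table; columns follow the schema above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hot_search_items (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        platform TEXT, title TEXT, url TEXT, rank_index INTEGER,
        heat_value TEXT, collected_at TEXT, content TEXT, sentiment TEXT
    )
""")


def insert_item(conn, platform, title, rank_index, heat_value, collected_at):
    # Parameterized query: values are never interpolated into the SQL string.
    conn.execute(
        "INSERT INTO hot_search_items "
        "(platform, title, rank_index, heat_value, collected_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (platform, title, rank_index, heat_value, collected_at.isoformat(sep=" ")),
    )


def top_items(conn, platform, limit=10):
    """Return (rank_index, title) pairs for one platform, best rank first."""
    rows = conn.execute(
        "SELECT rank_index, title FROM hot_search_items "
        "WHERE platform = ? ORDER BY rank_index LIMIT ?",
        (platform, limit),
    )
    return rows.fetchall()


now = datetime(2026, 5, 20, 9, 0, 0)
insert_item(conn, "weibo", "Example topic B", 2, "98000", now)
insert_item(conn, "weibo", "Example topic A", 1, "120000", now)
insert_item(conn, "zhihu", "Example topic C", 1, "45000", now)
```

With `mysql.connector` the placeholder is `%s` rather than `?`; the rest of the pattern carries over.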
Step 4: Environment Configuration
Create a .env file in the project root:
```bash
# MySQL Configuration
MYSQL_HOST=localhost
MYSQL_PORT=3306
MYSQL_USER=your_mysql_user
MYSQL_PASSWORD=your_mysql_password
MYSQL_DATABASE=hotsearch_db

# LLM Configuration (OpenAI-compatible API)
OPENAI_API_KEY=your_api_key
OPENAI_API_BASE=https://api.openai.com/v1
MODEL_NAME=gpt-4

# Or use Huawei Pangu Model (local deployment)
PANGU_MODEL_PATH=/path/to/pangu/model
PANGU_API_URL=http://localhost:8080

# Push Notification Channels
# WeChat Work Bot
WECHAT_WORK_BOT_WEBHOOK=your_webhook_url

# WeChat Work App
WECHAT_WORK_CORP_ID=your_corp_id
WECHAT_WORK_AGENT_ID=your_agent_id
WECHAT_WORK_SECRET=your_secret

# Telegram
TELEGRAM_BOT_TOKEN=your_bot_token
TELEGRAM_CHAT_ID=your_chat_id

# Email (SMTP)
SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_USER=your_email@gmail.com
SMTP_PASSWORD=your_app_password
SMTP_RECIPIENTS=recipient1@example.com,recipient2@example.com
```
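The source does not say how the project loads this file (a library such as python-dotenv is a common choice). As a rough illustration of what loading involves, here is a standard-library sketch; `parse_env` and `load_env` are hypothetical names, and the parser handles only plain `KEY=VALUE` lines.

```python
import os


def parse_env(text):
    """Parse KEY=VALUE lines; blank lines and lines starting with '#' are skipped."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values such as URLs may contain '='.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env(path=".env", environ=os.environ):
    """Copy values from the file into environ without overriding existing variables."""
    with open(path, encoding="utf-8") as f:
        for key, value in parse_env(f.read()).items():
            environ.setdefault(key, value)


sample = """
# MySQL Configuration
MYSQL_HOST=localhost
MYSQL_PORT=3306
SMTP_RECIPIENTS=recipient1@example.com,recipient2@example.com
"""
config = parse_env(sample)
port = int(config["MYSQL_PORT"])                    # values arrive as strings
recipients = config["SMTP_RECIPIENTS"].split(",")   # comma-separated list
```

Every value is read as a string, which is why the settings shown later wrap `MYSQL_PORT` in `int(...)`.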
Core Components
1. Web Scraping System (hotsearchcrawler/)
The crawler cluster supports 15 platforms with 26 ranking lists:
```bash
# Run all spiders
python run_spiders.py

# Test specific spider
python runspider-test.py weibo  # Test Weibo scraper
```
Crawler Configuration
Edit hotsearchcrawler/settings.py:
```python
import os

# MySQL settings
MYSQL_HOST = os.getenv('MYSQL_HOST', 'localhost')
MYSQL_PORT = int(os.getenv('MYSQL_PORT', 3306))
MYSQL_USER = os.getenv('MYSQL_USER')
MYSQL_PASSWORD = os.getenv('MYSQL_PASSWORD')
MYSQL_DATABASE = os.getenv('MYSQL_DATABASE', 'hotsearch_db')

# Optional: Platform-specific cookies
COOKIES = {
    'weibo': 'your_weibo_cookies',
    'bilibili': 'your_bilibili_cookies'
}

# Crawler settings
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 1
RANDOMIZE_DOWNLOAD_DELAY = True
```
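With `RANDOMIZE_DOWNLOAD_DELAY` enabled, Scrapy's documentation describes the wait between requests to the same site as a random value between 0.5 and 1.5 times `DOWNLOAD_DELAY`. The sketch below mirrors that documented range to show what the settings above imply; it is not Scrapy's internal code, and the function names are hypothetical.

```python
import random


def delay_bounds(download_delay):
    """Lower and upper wait, in seconds, under the documented 0.5x-1.5x rule."""
    return (0.5 * download_delay, 1.5 * download_delay)


def randomized_delay(download_delay, rng=random):
    """Draw one wait time uniformly from the documented range."""
    low, high = delay_bounds(download_delay)
    return rng.uniform(low, high)


# DOWNLOAD_DELAY = 1 gives waits between 0.5 s and 1.5 s.
low, high = delay_bounds(1)
```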
Available Platforms
- Social Media: Weibo, Douyin, Kuaishou
- Video: Bilibili, Tencent Video
- News: Baidu, Toutiao, Zhihu
- E-commerce: Taobao, JD.com
- Gaming: Steam, Tap Tap
- Others: Tieba, Douban, etc.
2. Analysis System (hotsearch_analysis_agent/)
LLM-powered analysis engine for topic clustering, sentiment analysis, and report generation.
```python
import os

from hotsearch_analysis_agent.analyzer import HotSearchAnalyzer

# Initialize analyzer
analyzer = HotSearchAnalyzer(
    api_key=os.getenv('OPENAI_API_KEY'),
    api_base=os.getenv('OPENAI_API_BASE'),
    model_name=os.getenv('MODEL_NAME', 'gpt-4')
)

# Analyze topics
topics = analyzer.fetch_topics(
    platform='weibo',
    start_date='2026-05-01',
    end_date='2026-05-20'
)

# Topic clustering
clusters = analyzer.cluster_topics(topics, n_clusters=5)

# Sentiment analysis
for topic in topics:
    sentiment = analyzer.analyze_sentiment(topic['title'], topic['content'])
    print(f"{topic['title']}: {sentiment}")

# Generate report
report = analyzer.generate_report(
    query="人工智能与前沿科技",  # "AI and frontier technology"
    platforms=['weibo', 'bilibili', 'zhihu'],
    days=7
)
print(report)
```
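The source does not show how `cluster_topics` works internally. As a stand-in to convey the general idea of grouping similar titles, the sketch below uses greedy clustering over character bigrams with Jaccard similarity. This is an illustrative substitute, not the project's method; all names and the 0.3 threshold are assumptions.

```python
def bigrams(text):
    """Character bigrams; usable for Chinese titles without a word segmenter."""
    cleaned = "".join(text.split())
    return {cleaned[i:i + 2] for i in range(len(cleaned) - 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def greedy_cluster(titles, threshold=0.3):
    """Put each title in the first cluster whose seed title is similar enough."""
    clusters = []  # list of (seed_bigrams, member_titles)
    for title in titles:
        grams = bigrams(title)
        for seed, members in clusters:
            if jaccard(seed, grams) >= threshold:
                members.append(title)
                break
        else:
            # No existing cluster matched, so this title seeds a new one.
            clusters.append((grams, [title]))
    return [members for _, members in clusters]


titles = [
    "AI chip export rules",
    "New AI chip export rules announced",
    "Box office record this weekend",
]
groups = greedy_cluster(titles)
```

Unlike `cluster_topics(topics, n_clusters=5)`, this approach does not take a fixed cluster count; the number of groups falls out of the threshold.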
Custom LLM Integration
```python
import os

# Using Huawei Pangu Model (local deployment)
from hotsearch_analysis_agent.llm import PanguLLM

pangu = PanguLLM(
    model_path=os.getenv('PANGU_MODEL_PATH'),
    api_url=os.getenv('PANGU_API_URL')
)
# Prompt text: "Analyze the sentiment of the following news:".
# {news_content} is a placeholder to fill in before calling generate().
response = pangu.generate(
    prompt="分析以下新闻的情感倾向:\n{news_content}",
    max_tokens=500
)
```
3. Web Application (app.py)
FastAPI-based web interface for interactive queries and control.
```bash
# Start the web application
python app.py
# Runs on http://localhost:8000 by default
```
API Endpoints
```python
from fastapi import FastAPI
from hotsearch_analysis_agent.api import router

app = FastAPI()
app.include_router(router)

# Example API calls
import httpx

# Query hot searches
response = httpx.get('http://localhost:8000/api/hot-search', params={
    'platform': 'weibo',
    'limit': 20
})

# Search by keyword
response = httpx.post('http://localhost:8000/api/search', json={
    'keyword': '人工智能',  # "artificial intelligence"
    'platforms': ['weibo', 'zhihu'],
    'days': 7
})

# Start crawler
response = httpx.post('http://localhost:8000/api/crawler/start', json={
    'platforms': ['weibo', 'bilibili']
})

# Stop crawler
response = httpx.post('http://localhost:8000/api/crawler/stop')
```
Push Notification System
Configure and test multi-channel alerts:
```python
# test_push_task.py
from hotsearch_analysis_agent.push import PushManager

manager = PushManager()

# Configure push task
task = {
    'name': 'AI Tech Monitor',
    'query': '人工智能',  # "artificial intelligence"
    'platforms': ['weibo', 'zhihu', 'bilibili'],
    'schedule': '0 9,18 * * *',  # Cron format: 9 AM and 6 PM daily
    'channels': ['wechat_work', 'telegram', 'email'],
    'min_heat': 100000  # Minimum heat value threshold
}
manager.create_task(task)

# Test push manually
report = """
AI Technology Hot Topics - 2026-05-20
Key Findings
- GPT-6 context window leaked: 2M tokens
- DeepSeek V4 uses Huawei Ascend chips
- Chinese LLM API calls lead globally for 5 weeks
[Full report content...]
"""

# Send to WeChat Work
manager.send_wechat_work(report)

# Send to Telegram
manager.send_telegram(report)

# Send email
manager.send_email(
    subject="AI Technology Hot Topics - 2026-05-20",
    content=report
)
```
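The `schedule` field above uses five-field cron syntax (minute, hour, day of month, month, day of week). To make `'0 9,18 * * *'` concrete, here is a small sketch that checks whether a given moment matches such an expression. It supports only `*` and comma-separated integers, and it requires every field to match, whereas full cron implementations treat the two day fields specially. The function names are hypothetical, and the scheduler the project actually uses is not stated in the source.

```python
from datetime import datetime


def _field_matches(field, value):
    """Support '*' and comma-separated integers, enough for '0 9,18 * * *'."""
    if field == "*":
        return True
    return value in {int(part) for part in field.split(",")}


def cron_matches(expr, moment):
    """Check a five-field cron expression against a datetime."""
    minute, hour, day, month, weekday = expr.split()
    # Cron counts Sunday as 0; datetime.weekday() counts Monday as 0.
    cron_weekday = (moment.weekday() + 1) % 7
    return (
        _field_matches(minute, moment.minute)
        and _field_matches(hour, moment.hour)
        and _field_matches(day, moment.day)
        and _field_matches(month, moment.month)
        and _field_matches(weekday, cron_weekday)
    )


schedule = "0 9,18 * * *"
morning = cron_matches(schedule, datetime(2026, 5, 20, 9, 0))   # 09:00 matches
noon = cron_matches(schedule, datetime(2026, 5, 20, 12, 0))     # 12:00 does not
```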
Push Channel Configuration
```python
import os

# WeChat Work Bot (Group Webhook)
import requests

def send_wechat_work_bot(content):
    webhook = os.getenv('WECHAT_WORK_BOT_WEBHOOK')
    data = {
        "msgtype": "markdown",
        "markdown": {
            "content": content
        }
    }
    requests.post(webhook, json=data)

# Telegram Bot
# Note: this synchronous call matches python-telegram-bot v13; from v20 on,
# Bot.send_message is a coroutine and must be awaited.
from telegram import Bot

def send_telegram(content):
    bot = Bot(token=os.getenv('TELEGRAM_BOT_TOKEN'))
    chat_id = os.getenv('TELEGRAM_CHAT_ID')
    bot.send_message(chat_id=chat_id, text=content, parse_mode='Markdown')

# Email via SMTP
import smtplib
from email.mime.text import MIMEText

def send_email(subject, content):
    msg = MIMEText(content, 'html', 'utf-8')
    msg['Subject'] = subject
    msg['From'] = os.getenv('SMTP_USER')
    msg['To'] = os.getenv('SMTP_RECIPIENTS')
    with smtplib.SMTP(os.getenv('SMTP_HOST'), int(os.getenv('SMTP_PORT'))) as server:
        server.starttls()
        server.login(os.getenv('SMTP_USER'), os.getenv('SMTP_PASSWORD'))
        server.send_message(msg)
```
Common Usage Patterns
Pattern 1: Daily Hot Topic Monitoring
```python
from datetime import datetime, timedelta

from hotsearch_analysis_agent.analyzer import HotSearchAnalyzer
from hotsearch_analysis_agent.push import PushManager

analyzer = HotSearchAnalyzer()
push_manager = PushManager()

# Get yesterday's hot topics
yesterday = datetime.now() - timedelta(days=1)
topics = analyzer.fetch_topics(
    platforms=['weibo', 'zhihu', 'bilibili'],
    start_date=yesterday.strftime('%Y-%m-%d'),
    heat_threshold=50000
)

# Cluster and analyze
clusters = analyzer.cluster_topics(topics, n_clusters=5)

# Generate report
report = analyzer.generate_report_from_clusters(clusters)

# Push to all channels
push_manager.broadcast(report, channels=['wechat_work', 'telegram', 'email'])
```
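When pushing to several channels, one unreachable channel should ideally not block the others. Whether `PushManager.broadcast` behaves this way is not stated in the source; the sketch below shows the pattern with a hypothetical `broadcast_safely` helper and fake sender functions.

```python
def broadcast_safely(report, senders):
    """Call every sender; record failures instead of stopping at the first one."""
    results = {}
    for name, send in senders.items():
        try:
            send(report)
            results[name] = "ok"
        except Exception as exc:  # keep going so one bad channel does not block the rest
            results[name] = f"failed: {exc}"
    return results


sent = []


def fake_telegram(report):
    # Stand-in for a real sender such as send_telegram.
    sent.append(("telegram", report))


def fake_email(report):
    # Simulates a channel that is currently down.
    raise ConnectionError("SMTP server unreachable")


results = broadcast_safely(
    "daily report",
    {"telegram": fake_telegram, "email": fake_email},
)
```

The returned dictionary can be logged or inspected to decide whether a failed channel should be retried.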
Pattern 2: Keyword Alert System
```python
# Monitor specific keywords and send immediate alerts
from hotsearch_analysis_agent.monitor import KeywordMonitor
from hotsearch_analysis_agent.push import PushManager

push_manager = PushManager()

monitor = KeywordMonitor(
    keywords=['芯片', 'AI', '大模型', '华为'],  # chip, AI, large model, Huawei
    platforms=['weibo', 'toutiao', 'zhihu'],
    check_interval=300  # Check every 5 minutes
)

def on_match(topic):
    """Callback when keyword is matched"""
    alert = f"""
🔔 Keyword Alert: {topic['title']}
Platform: {topic['platform']}
Heat: {topic['heat_value']}
URL: {topic['url']}
"""
    push_manager.send_telegram(alert)

monitor.start(callback=on_match)
```
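Because the monitor polls every few minutes, the same trending topic can appear in many consecutive polls. The source does not describe how `KeywordMonitor` handles repeats; the sketch below shows one possible matching step that remembers what has already been alerted. The name `find_new_matches` and the sample topics are hypothetical.

```python
def find_new_matches(topics, keywords, seen):
    """Return topics whose title contains a keyword and was not alerted before."""
    matches = []
    for topic in topics:
        key = (topic["platform"], topic["title"])
        if key in seen:
            continue  # already alerted in an earlier poll
        # Case-sensitive substring match, so 'AI' does not match 'rain'.
        if any(keyword in topic["title"] for keyword in keywords):
            seen.add(key)
            matches.append(topic)
    return matches


keywords = ["芯片", "AI", "大模型", "华为"]
poll = [
    {"platform": "weibo", "title": "国产芯片发布"},
    {"platform": "zhihu", "title": "Weekend rain forecast"},
]
seen = set()
first = find_new_matches(poll, keywords, seen)   # one new match
second = find_new_matches(poll, keywords, seen)  # nothing new on the next poll
```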
Pattern 3: Deep Content Analysis
```python
# Analyze news detail pages (including video content)
from hotsearch_analysis_agent.analyzer import HotSearchAnalyzer
from hotsearch_analysis_agent.content_extractor import ContentExtractor

analyzer = HotSearchAnalyzer()
extractor = ContentExtractor()

# Get detailed content from URL
url = 'https://www.bilibili.com/video/BV13pSoBBEvX/'
content = extractor.extract(url)
print(f"Title: {content['title']}")
print(f"Type: {content['type']}")  # 'video' or 'article'
print(f"Content: {content['text'][:500]}...")  # Extracted transcript/text

# Analyze sentiment
sentiment = analyzer.analyze_sentiment(content['title'], content['text'])
print(f"Sentiment: {sentiment}")

# Extract entities
entities = analyzer.extract_entities(content['text'])
print(f"Entities: {entities}")
```
Pattern 4: Custom Report Generation
```python
# Generate custom analytical report
from datetime import datetime

from hotsearch_analysis_agent.analyzer import HotSearchAnalyzer

analyzer = HotSearchAnalyzer()

report_config = {
    'title': '科技行业周报',  # "Tech Industry Weekly"
    'query': '人工智能 OR 芯片 OR 量子计算',  # AI OR chips OR quantum computing
    'platforms': ['all'],
    'date_range': 7,
    'sections': [
        'core_findings',   # Key discoveries
        'news_details',    # Detailed news list
        'trend_analysis',  # Trend analysis
        'entity_network'   # Entity relationship graph
    ],
    'output_format': 'markdown'
}
report = analyzer.generate_custom_report(**report_config)

# Save to file
with open(f"report_{datetime.now().strftime('%Y%m%d')}.md", 'w', encoding='utf-8') as f:
    f.write(report)
```
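The `query` value above joins terms with `OR`. How the analyzer interprets that string is not documented in the source; assuming it means "match any of these terms", a minimal sketch of the matching logic could look like this. Both function names are hypothetical.

```python
def parse_or_query(query):
    """Split 'A OR B OR C' into its terms; a query without OR yields one term."""
    return [term.strip() for term in query.split(" OR ") if term.strip()]


def matches_query(text, query):
    """True when the text contains at least one of the query terms."""
    return any(term in text for term in parse_or_query(query))


query = "人工智能 OR 芯片 OR 量子计算"
terms = parse_or_query(query)
hit = matches_query("国产芯片发布", query)   # contains the term 芯片
miss = matches_query("周末电影票房", query)  # contains none of the terms
```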
Troubleshooting
Issue 1: Browser Driver Errors
```
selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH
```
Solution: Ensure ChromeDriver/EdgeDriver is in the system PATH and matches the browser version.
```bash
# Check driver version
chromedriver --version

# Check Chrome version
google-chrome --version  # Linux
# or open chrome://version in browser

# Download matching version from https://chromedriver.chromium.org/
```
Issue 2: Database Connection Failures
```
mysql.connector.errors.ProgrammingError: Access denied for user
```
Solution: Verify the MySQL credentials in .env and ensure the user has proper permissions.
```sql
-- Grant permissions
GRANT ALL PRIVILEGES ON hotsearch_db.* TO 'your_user'@'localhost';
FLUSH PRIVILEGES;
```
Issue 3: LLM API Rate Limits
```
openai.error.RateLimitError: Rate limit exceeded
```
Solution: Implement request throttling or switch to a local model:
```python
import time
from functools import wraps

def rate_limit(calls_per_minute=10):
    min_interval = 60.0 / calls_per_minute
    last_called = [0.0]  # mutable cell shared by all calls to the wrapped function

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            elapsed = time.time() - last_called[0]
            wait_time = min_interval - elapsed
            if wait_time > 0:
                time.sleep(wait_time)
            result = func(*args, **kwargs)
            last_called[0] = time.time()
            return result
        return wrapper
    return decorator

@rate_limit(calls_per_minute=10)
def call_llm(prompt):
    return analyzer.generate(prompt)
```
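Throttling reduces how often rate limits are hit, but a request can still be rejected. A common complement is to retry with exponential backoff. The sketch below is a generic, standard-library version; `retry_with_backoff` is a hypothetical helper, and in practice `retry_on` would be set to the rate-limit exception class of whichever client library is in use.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call func(); on a retryable error wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Small random jitter keeps parallel clients from retrying in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1 * base_delay)
            sleep(delay)


# Demonstration with a fake call that fails twice before succeeding.
calls = {"n": 0}
delays = []


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"


outcome = retry_with_backoff(flaky, sleep=delays.append)  # records delays instead of sleeping
```

Passing `sleep=delays.append` is only a testing convenience; by default the helper calls `time.sleep`.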
Issue 4: Crawler Being Blocked
Solution: Rotate user agents and add delays:
```python
# In hotsearchcrawler/settings.py
# RandomUserAgentMiddleware comes from the third-party scrapy-user-agents package.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True
CONCURRENT_REQUESTS_PER_DOMAIN = 2
```
Issue 5: Encoding Issues with Chinese Text
Solution: Ensure UTF-8 encoding throughout:
```python
import os

# Database connection
import mysql.connector

conn = mysql.connector.connect(
    host=os.getenv('MYSQL_HOST'),
    user=os.getenv('MYSQL_USER'),
    password=os.getenv('MYSQL_PASSWORD'),
    database=os.getenv('MYSQL_DATABASE'),
    charset='utf8mb4',
    collation='utf8mb4_unicode_ci'
)

# File operations
with open('report.md', 'w', encoding='utf-8') as f:
    f.write(report)
```
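The reason for `utf8mb4` rather than MySQL's older `utf8` character set is byte width: `utf8` stores at most three bytes per character, while emoji and some rare characters need four bytes in UTF-8. The short sketch below makes that visible; the helper names are hypothetical.

```python
def utf8_length(ch):
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def needs_utf8mb4(text):
    """True when the text contains any character that takes four bytes in UTF-8."""
    return any(utf8_length(ch) == 4 for ch in text)


ascii_len = utf8_length("A")     # 1 byte
chinese_len = utf8_length("中")  # 3 bytes
emoji_len = utf8_length("🔔")    # 4 bytes

alert_needs_mb4 = needs_utf8mb4("🔔 Keyword Alert")  # the alert text above uses an emoji
plain_needs_mb4 = needs_utf8mb4("人工智能")
```

Common Chinese characters fit in three bytes, so the problem usually shows up only once emoji appear in titles or alert text.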
Advanced Configuration
Using Huawei Pangu Model (Local Deployment)
Download and deploy the model:
```bash
# Start model service
python -m hotsearch_analysis_agent.llm.pangu_server --model_path /path/to/model --port 8080
```
Configure in code:
```python
from hotsearch_analysis_agent.analyzer import HotSearchAnalyzer
from hotsearch_analysis_agent.llm import PanguLLM

analyzer = HotSearchAnalyzer(
    llm=PanguLLM(api_url='http://localhost:8080')
)
```
Distributed Crawling
Scale up with multiple crawler instances:
```bash
# Instance 1: Weibo, Zhihu
python run_spiders.py --platforms weibo,zhihu

# Instance 2: Bilibili, Douyin
python run_spiders.py --platforms bilibili,douyin

# Instance 3: News platforms
python run_spiders.py --platforms baidu,toutiao
```
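When the platform list grows, the per-instance commands can be generated rather than written by hand. The sketch below deals platforms round-robin across a chosen number of instances and builds the matching `--platforms` arguments. The helper names are hypothetical, and the grouping it produces differs from the hand-picked grouping shown above.

```python
def partition_platforms(platforms, n_instances):
    """Deal platforms round-robin into n_instances groups, dropping empty groups."""
    groups = [[] for _ in range(n_instances)]
    for index, platform in enumerate(platforms):
        groups[index % n_instances].append(platform)
    return [group for group in groups if group]


def build_commands(platforms, n_instances):
    """One run_spiders.py command line per crawler instance."""
    return [
        f"python run_spiders.py --platforms {','.join(group)}"
        for group in partition_platforms(platforms, n_instances)
    ]


platforms = ["weibo", "zhihu", "bilibili", "douyin", "baidu", "toutiao"]
commands = build_commands(platforms, 3)
```

Round-robin balances the number of platforms per instance but not their load; heavier platforms may deserve an instance of their own.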
Project Structure Reference
```
.
├── app.py                      # Web application entry
├── run_spiders.py              # Crawler launcher
├── runspider-test.py           # Crawler testing
├── test_push_task.py           # Push notification testing
├── init.py                     # Database initialization
├── requirements.txt            # Python dependencies
├── .env                        # Environment configuration
├── hotsearchcrawler/           # Crawler cluster
│   ├── spiders/                # Platform-specific spiders
│   ├── settings.py             # Crawler settings
│   └── pipelines.py            # Data pipelines
└── hotsearch_analysis_agent/   # Analysis system
    ├── analyzer.py             # Core analysis engine
    ├── llm/                    # LLM integrations
    ├── push/                   # Push notification modules
    ├── api/                    # Web API endpoints
    └── content_extractor.py    # Content extraction utilities
```