llm-public-opinion-analytics

LLM-Based Public Opinion Analytics Assistant

Skill by ara.so — Data Skills collection.

Overview

This project is an intelligent public opinion analysis assistant that integrates real-time data from 15 mainstream platforms across 26 ranking lists with large language model (LLM) analysis capabilities. It provides conversational hot search queries, topic-specific searches, topic clustering, and sentiment analysis. The system supports:

Real-time web scraping from platforms like Weibo, Bilibili, Douyin, Baidu, etc.
LLM-powered content analysis (including video content extraction)
Multi-channel push notifications (WeChat, Enterprise WeChat, Telegram, Email)
Keyboard shortcuts for crawler control
Quick data lookup and platform jumping

Installation

Prerequisites

Python Environment: Python 3.8+
MySQL Database: MySQL 5.7+ or 8.0+
Browser Driver: ChromeDriver or EdgeDriver

Step 1: Browser Driver Setup

Download the driver matching your browser version:

Chrome: ChromeDriver Downloads
Edge: EdgeDriver Downloads

Add the driver to your system PATH:

bash

# macOS/Linux
export PATH=$PATH:/path/to/driver/directory

# Windows: Add to System Environment Variables

Verify installation:

```bash
chromedriver --version
# or
msedgedriver --version
```
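
The PATH check above can also be done from Python before any crawler starts. This is a minimal sketch using only the standard library; the helper name `find_drivers` is hypothetical and not part of the project.

```python
import shutil

def find_drivers(names=("chromedriver", "msedgedriver")):
    """Return {driver_name: resolved_path_or_None} for each driver name."""
    # shutil.which searches the directories listed in PATH, like the shell does.
    return {name: shutil.which(name) for name in names}

if __name__ == "__main__":
    for name, path in find_drivers().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

Only one of the two drivers needs to resolve, depending on which browser you use.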

Step 2: Clone and Install Dependencies

bash

git clone https://github.com/hmmnxkl/LLM-Based-Intelligent-Public-Opinion-Analytics-Assistant.git
cd LLM-Based-Intelligent-Public-Opinion-Analytics-Assistant

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Step 3: Database Setup

Create MySQL database and tables:

python

# Reference init.py for schema
import os

import mysql.connector

conn = mysql.connector.connect(
    host=os.getenv('MYSQL_HOST', 'localhost'),
    user=os.getenv('MYSQL_USER'),
    password=os.getenv('MYSQL_PASSWORD')
)

cursor = conn.cursor()
cursor.execute("CREATE DATABASE IF NOT EXISTS hotsearch_db CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci")
cursor.execute("USE hotsearch_db")

# Create tables (see init.py for full schema)
cursor.execute("""
    CREATE TABLE IF NOT EXISTS hot_search_items (
        id INT AUTO_INCREMENT PRIMARY KEY,
        platform VARCHAR(50),
        title VARCHAR(500),
        url TEXT,
        rank_index INT,
        heat_value VARCHAR(100),
        collected_at DATETIME,
        content TEXT,
        sentiment VARCHAR(20),
        INDEX idx_platform (platform),
        INDEX idx_collected (collected_at)
    )
""")

conn.commit()

Step 4: Environment Configuration

Create a .env file in the project root:

bash

# MySQL Configuration
MYSQL_HOST=localhost
MYSQL_PORT=3306
MYSQL_USER=your_mysql_user
MYSQL_PASSWORD=your_mysql_password
MYSQL_DATABASE=hotsearch_db

# LLM Configuration (OpenAI-compatible API)
OPENAI_API_KEY=your_api_key
OPENAI_API_BASE=https://api.openai.com/v1
MODEL_NAME=gpt-4

# Or use Huawei Pangu Model (local deployment)
# PANGU_MODEL_PATH=/path/to/pangu/model
# PANGU_API_URL=http://localhost:8080

# Push Notification Channels
# WeChat Work Bot
WECHAT_WORK_BOT_WEBHOOK=your_webhook_url

# WeChat Work App
WECHAT_WORK_CORP_ID=your_corp_id
WECHAT_WORK_AGENT_ID=your_agent_id
WECHAT_WORK_SECRET=your_secret

# Telegram
TELEGRAM_BOT_TOKEN=your_bot_token
TELEGRAM_CHAT_ID=your_chat_id

# Email (SMTP)
SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_USER=your_email@gmail.com
SMTP_PASSWORD=your_app_password
SMTP_RECIPIENTS=recipient1@example.com,recipient2@example.com
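
A missing or empty variable in .env tends to surface later as an opaque connection or authentication error, so it can help to validate the file up front. The sketch below parses KEY=VALUE text and reports gaps. The variable names come from the example above, but which of them count as required is an assumption, and `parse_env_text` and `missing_vars` are hypothetical helpers rather than project code.

```python
# Assumption: these three are the minimum needed for MySQL plus an OpenAI-compatible LLM.
REQUIRED_VARS = ("MYSQL_USER", "MYSQL_PASSWORD", "OPENAI_API_KEY")

def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values

def missing_vars(values, required=REQUIRED_VARS):
    """Return the required names that are absent or set to an empty value."""
    return [name for name in required if not values.get(name)]

sample = """
# MySQL Configuration
MYSQL_HOST=localhost
MYSQL_USER=your_mysql_user
MYSQL_PASSWORD=
"""
config = parse_env_text(sample)
print(missing_vars(config))
```

In a real run the text would come from reading the .env file rather than from the inline `sample` string.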

Core Components

1. Web Scraping System (hotsearchcrawler/)

The crawler cluster supports 15 platforms with 26 ranking lists:

bash

# Run all spiders
python run_spiders.py

# Test specific spider
python runspider-test.py weibo  # Test Weibo scraper

Crawler Configuration

Edit hotsearchcrawler/settings.py:

python

import os

# MySQL settings
MYSQL_HOST = os.getenv('MYSQL_HOST', 'localhost')
MYSQL_PORT = int(os.getenv('MYSQL_PORT', 3306))
MYSQL_USER = os.getenv('MYSQL_USER')
MYSQL_PASSWORD = os.getenv('MYSQL_PASSWORD')
MYSQL_DATABASE = os.getenv('MYSQL_DATABASE', 'hotsearch_db')

# Optional: Platform-specific cookies
COOKIES = {
    'weibo': 'your_weibo_cookies',
    'bilibili': 'your_bilibili_cookies'
}

# Crawler settings
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 1
RANDOMIZE_DOWNLOAD_DELAY = True
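
To make the effect of the two delay settings concrete: per Scrapy's documented behaviour, enabling RANDOMIZE_DOWNLOAD_DELAY makes the wait between requests to the same site a random value between 0.5 and 1.5 times DOWNLOAD_DELAY. The sketch below only reproduces that arithmetic; `delay_bounds` and `sample_delay` are illustrative helpers, not Scrapy APIs.

```python
import random

def delay_bounds(download_delay, randomize=True):
    """Return the (min, max) wait in seconds between requests to one site."""
    if randomize:
        return (0.5 * download_delay, 1.5 * download_delay)
    return (download_delay, download_delay)

def sample_delay(download_delay, randomize=True, rng=random):
    """Draw one wait time from the range, as a stand-in for what the crawler would do."""
    low, high = delay_bounds(download_delay, randomize)
    return rng.uniform(low, high)

print(delay_bounds(1))  # DOWNLOAD_DELAY = 1 with randomization on
print(delay_bounds(3))  # DOWNLOAD_DELAY = 3 with randomization on
```

With DOWNLOAD_DELAY = 1 the crawler therefore waits between half a second and one and a half seconds per request to a given site.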

Available Platforms

Social Media: Weibo, Douyin, Kuaishou
Video: Bilibili, Tencent Video
News: Baidu, Toutiao, Zhihu
E-commerce: Taobao, JD.com
Gaming: Steam, TapTap
Others: Tieba, Douban, etc.

2. Analysis System (hotsearch_analysis_agent/)

LLM-powered analysis engine for topic clustering, sentiment analysis, and report generation.

python

import os

from hotsearch_analysis_agent.analyzer import HotSearchAnalyzer

# Initialize analyzer
analyzer = HotSearchAnalyzer(
    api_key=os.getenv('OPENAI_API_KEY'),
    api_base=os.getenv('OPENAI_API_BASE'),
    model_name=os.getenv('MODEL_NAME', 'gpt-4')
)

# Analyze topics
topics = analyzer.fetch_topics(
    platform='weibo',
    start_date='2026-05-01',
    end_date='2026-05-20'
)

# Topic clustering
clusters = analyzer.cluster_topics(topics, n_clusters=5)

# Sentiment analysis
for topic in topics:
    sentiment = analyzer.analyze_sentiment(topic['title'], topic['content'])
    print(f"{topic['title']}: {sentiment}")

# Generate report
report = analyzer.generate_report(
    query="人工智能与前沿科技",  # "AI and frontier technology"
    platforms=['weibo', 'bilibili', 'zhihu'],
    days=7
)
print(report)
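
For intuition about what grouping related hot-search titles involves, here is a deliberately simple stand-in that needs no LLM: greedy clustering on character-bigram overlap (Jaccard similarity). It is not how `cluster_topics` is implemented, which the source does not describe; the function names and the 0.3 threshold are assumptions made for illustration.

```python
def bigrams(text):
    """Character bigrams; usable for Chinese titles without a word tokenizer."""
    text = "".join(text.split())
    return {text[i:i + 2] for i in range(len(text) - 1)}

def jaccard(a, b):
    """Size of the intersection divided by size of the union; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def greedy_cluster(titles, threshold=0.3):
    """Put each title into the first cluster whose seed title is similar enough."""
    clusters = []  # list of (seed_bigrams, member_titles)
    for title in titles:
        grams = bigrams(title)
        for seed, members in clusters:
            if jaccard(seed, grams) >= threshold:
                members.append(title)
                break
        else:
            clusters.append((grams, [title]))
    return [members for _, members in clusters]

# Two Huawei chip headlines and one unrelated weather headline.
titles = ["华为发布新芯片", "华为新芯片发布会", "周末天气预报"]
print(greedy_cluster(titles))
```

A lexical approach like this misses paraphrases that share no characters, which is the gap an embedding or LLM-based method is meant to close.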

Custom LLM Integration

python

import os

# Using Huawei Pangu Model (local deployment)
from hotsearch_analysis_agent.llm import PanguLLM

pangu = PanguLLM(
    model_path=os.getenv('PANGU_MODEL_PATH'),
    api_url=os.getenv('PANGU_API_URL')
)

news_content = "..."  # placeholder: the article text to analyze

response = pangu.generate(
    # Prompt: "Analyze the sentiment of the following news:"
    prompt="分析以下新闻的情感倾向:\n{news_content}".format(news_content=news_content),
    max_tokens=500
)

3. Web Application (app.py)

FastAPI-based web interface for interactive queries and control.

bash

# Start the web application
python app.py

# Default runs on http://localhost:8000

API Endpoints

python

from fastapi import FastAPI
from hotsearch_analysis_agent.api import router

app = FastAPI()
app.include_router(router)

# Example API calls
import httpx

# Query hot searches
response = httpx.get('http://localhost:8000/api/hot-search', params={
    'platform': 'weibo',
    'limit': 20
})

# Search by keyword
response = httpx.post('http://localhost:8000/api/search', json={
    'keyword': '人工智能',  # "artificial intelligence"
    'platforms': ['weibo', 'zhihu'],
    'days': 7
})

# Start crawler
response = httpx.post('http://localhost:8000/api/crawler/start', json={
    'platforms': ['weibo', 'bilibili']
})

# Stop crawler
response = httpx.post('http://localhost:8000/api/crawler/stop')

Push Notification System

Configure and test multi-channel alerts:

python

# test_push_task.py
from hotsearch_analysis_agent.push import PushManager

manager = PushManager()

# Configure push task
task = {
    'name': 'AI Tech Monitor',
    'query': '人工智能',  # "artificial intelligence"
    'platforms': ['weibo', 'zhihu', 'bilibili'],
    'schedule': '0 9,18 * * *',  # Cron format: 9 AM and 6 PM daily
    'channels': ['wechat_work', 'telegram', 'email'],
    'min_heat': 100000  # Minimum heat value threshold
}

manager.create_task(task)

# Test push manually
report = """
# AI Technology Hot Topics - 2026-05-20

## Key Findings

- GPT-6 context window leaked: 2M tokens
- DeepSeek V4 uses Huawei Ascend chips
- Chinese LLM API calls lead globally for 5 weeks

[Full report content...]
"""

# Send to WeChat Work
manager.send_wechat_work(report)

# Send to Telegram
manager.send_telegram(report)

# Send email
manager.send_email(
    subject="AI Technology Hot Topics - 2026-05-20",
    content=report
)
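
The `schedule` value in the task above is a five-field cron expression (minute, hour, day of month, month, day of week). To show how `0 9,18 * * *` resolves to two daily send times, here is a small sketch that expands only the minute and hour fields. It handles `*`, comma lists, ranges and `*/n` steps, ignores the three date fields, and is not the scheduler the project uses; both function names are hypothetical.

```python
def expand_field(field, low, high):
    """Expand one cron field supporting '*', 'a,b', 'a-b' and '*/n'."""
    values = set()
    for part in field.split(","):
        step = 1
        if "/" in part:
            part, step_text = part.split("/")
            step = int(step_text)
        if part == "*":
            start, end = low, high
        elif "-" in part:
            start_text, end_text = part.split("-")
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        values.update(range(start, end + 1, step))
    return sorted(values)

def daily_times(expr):
    """Return 'HH:MM' strings from the minute and hour fields of a cron expression."""
    minute, hour, _day_of_month, _month, _day_of_week = expr.split()
    return [f"{h:02d}:{m:02d}"
            for h in expand_field(hour, 0, 23)
            for m in expand_field(minute, 0, 59)]

print(daily_times("0 9,18 * * *"))  # ['09:00', '18:00']
```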

Push Channel Configuration

python

import os
import smtplib
from email.mime.text import MIMEText

import requests
from telegram import Bot

# WeChat Work Bot (Group Webhook)
def send_wechat_work_bot(content):
    webhook = os.getenv('WECHAT_WORK_BOT_WEBHOOK')
    data = {
        "msgtype": "markdown",
        "markdown": {"content": content}
    }
    requests.post(webhook, json=data)

# Telegram Bot
# Note: this synchronous call matches python-telegram-bot 13.x;
# from version 20 on, send_message is a coroutine and must be awaited.
def send_telegram(content):
    bot = Bot(token=os.getenv('TELEGRAM_BOT_TOKEN'))
    chat_id = os.getenv('TELEGRAM_CHAT_ID')
    bot.send_message(chat_id=chat_id, text=content, parse_mode='Markdown')

# Email via SMTP
def send_email(subject, content):
    msg = MIMEText(content, 'html', 'utf-8')
    msg['Subject'] = subject
    msg['From'] = os.getenv('SMTP_USER')
    msg['To'] = os.getenv('SMTP_RECIPIENTS')

    with smtplib.SMTP(os.getenv('SMTP_HOST'), int(os.getenv('SMTP_PORT'))) as server:
        server.starttls()
        server.login(os.getenv('SMTP_USER'), os.getenv('SMTP_PASSWORD'))
        server.send_message(msg)
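
Long reports can exceed what a chat channel accepts in one message; Telegram's Bot API, for example, caps a text message at 4096 characters. The sketch below splits a report into chunks at line boundaries where possible. `split_message` is a hypothetical helper, and whether the project's PushManager already does something similar is not stated in the source.

```python
def split_message(text, limit=4096):
    """Split text into chunks of at most `limit` characters, preferring line breaks."""
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        # A single line longer than the limit has to be cut mid-line.
        while len(line) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:limit])
            line = line[limit:]
        # Start a new chunk when adding this line would overflow the current one.
        if len(current) + len(line) > limit:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks

report = "## Key Findings\n" + "item line\n" * 1000
parts = split_message(report)
print(len(parts), [len(p) for p in parts])
```

Each chunk can then be passed to the channel's send function in order. Note that a split can separate a Markdown heading from its body, so formatting-sensitive channels may need smarter boundaries.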

Common Usage Patterns

Pattern 1: Daily Hot Topic Monitoring

python

from datetime import datetime, timedelta
from hotsearch_analysis_agent.analyzer import HotSearchAnalyzer
from hotsearch_analysis_agent.push import PushManager

analyzer = HotSearchAnalyzer()
push_manager = PushManager()

# Get yesterday's hot topics
yesterday = datetime.now() - timedelta(days=1)
topics = analyzer.fetch_topics(
    platforms=['weibo', 'zhihu', 'bilibili'],
    start_date=yesterday.strftime('%Y-%m-%d'),
    heat_threshold=50000
)

# Cluster and analyze
clusters = analyzer.cluster_topics(topics, n_clusters=5)

# Generate report
report = analyzer.generate_report_from_clusters(clusters)

# Push to all channels
push_manager.broadcast(report, channels=['wechat_work', 'telegram', 'email'])
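
One detail worth keeping in mind with `heat_threshold`: the schema in Step 3 stores `heat_value` as VARCHAR(100), so a numeric comparison needs the string converted first. The sketch below shows one way to do that. It assumes heat strings may carry thousands separators or the Chinese unit suffixes 万 (ten thousand) and 亿 (hundred million); the actual formats each platform returns are not documented in the source, and both helper names are hypothetical.

```python
import re

# Assumption: these unit suffixes may appear in scraped heat values.
UNIT_MULTIPLIERS = {"万": 10_000, "亿": 100_000_000}

def parse_heat(raw):
    """Convert '123456', '1,234' or '12.5万' to an int; return None if no number is found."""
    if raw is None:
        return None
    match = re.search(r"(\d+(?:\.\d+)?)\s*([万亿]?)", str(raw).replace(",", ""))
    if not match:
        return None
    number, unit = match.groups()
    return int(round(float(number) * UNIT_MULTIPLIERS.get(unit, 1)))

def above_threshold(items, threshold):
    """Keep items whose parsed heat is known and at least `threshold`."""
    kept = []
    for item in items:
        heat = parse_heat(item.get("heat_value"))
        if heat is not None and heat >= threshold:
            kept.append(item)
    return kept

items = [
    {"title": "A", "heat_value": "12.5万"},
    {"title": "B", "heat_value": "4200"},
]
print([item["title"] for item in above_threshold(items, 50000)])
```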

Pattern 2: Keyword Alert System

python

# Monitor specific keywords and send immediate alerts
from hotsearch_analysis_agent.monitor import KeywordMonitor
from hotsearch_analysis_agent.push import PushManager

push_manager = PushManager()

monitor = KeywordMonitor(
    keywords=['芯片', 'AI', '大模型', '华为'],  # chips, AI, large models, Huawei
    platforms=['weibo', 'toutiao', 'zhihu'],
    check_interval=300  # Check every 5 minutes
)

def on_match(topic):
    """Callback when keyword is matched"""
    alert = f"""
🔔 Keyword Alert: {topic['title']}
Platform: {topic['platform']}
Heat: {topic['heat_value']}
URL: {topic['url']}
"""
    push_manager.send_telegram(alert)

monitor.start(callback=on_match)
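
Because the monitor polls every few minutes, a topic that stays on a ranking list would match on every pass. The sketch below shows the two pieces a callback like `on_match` typically needs: a keyword test and a memory of what has already been alerted. It is an illustration under assumptions, not the behaviour of `KeywordMonitor`, which the source does not describe; `match_keywords` and `AlertDeduplicator` are hypothetical names.

```python
def match_keywords(title, keywords):
    """Return the keywords that occur in the title (case-insensitive substring match)."""
    lowered = title.lower()
    return [kw for kw in keywords if kw.lower() in lowered]

class AlertDeduplicator:
    """Remember (platform, title) pairs so each topic triggers at most one alert."""

    def __init__(self):
        self._seen = set()

    def should_alert(self, topic):
        key = (topic["platform"], topic["title"])
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

dedup = AlertDeduplicator()
topic = {"platform": "weibo", "title": "华为发布新芯片"}
if match_keywords(topic["title"], ["芯片", "AI"]) and dedup.should_alert(topic):
    print("alert:", topic["title"])
```

Substring matching is crude for short Latin keywords: `AI` also matches inside words such as "said", so word-boundary matching may be preferable for English titles.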

Pattern 3: Deep Content Analysis

python

# Analyze news detail pages (including video content)
from hotsearch_analysis_agent.content_extractor import ContentExtractor

extractor = ContentExtractor()

# Get detailed content from URL
url = 'https://www.bilibili.com/video/BV13pSoBBEvX/'
content = extractor.extract(url)

print(f"Title: {content['title']}")
print(f"Type: {content['type']}")  # 'video' or 'article'
print(f"Content: {content['text'][:500]}...")  # Extracted transcript/text

# Analyze sentiment (analyzer is the HotSearchAnalyzer instance from Pattern 1)
sentiment = analyzer.analyze_sentiment(content['title'], content['text'])
print(f"Sentiment: {sentiment}")

# Extract entities
entities = analyzer.extract_entities(content['text'])
print(f"Entities: {entities}")

Pattern 4: Custom Report Generation

python

from datetime import datetime

# Generate custom analytical report
report_config = {
    'title': '科技行业周报',  # "Tech Industry Weekly Report"
    'query': '人工智能 OR 芯片 OR 量子计算',  # AI OR chips OR quantum computing
    'platforms': ['all'],
    'date_range': 7,
    'sections': [
        'core_findings',   # Key discoveries
        'news_details',    # Detailed news list
        'trend_analysis',  # Trend analysis
        'entity_network'   # Entity relationship graph
    ],
    'output_format': 'markdown'
}

# analyzer is the HotSearchAnalyzer instance from Pattern 1
report = analyzer.generate_custom_report(**report_config)

# Save to file
with open(f"report_{datetime.now().strftime('%Y%m%d')}.md", 'w', encoding='utf-8') as f:
    f.write(report)

Troubleshooting

Issue 1: Browser Driver Errors

selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH

Solution: Ensure ChromeDriver/EdgeDriver is in the system PATH and matches the browser version.

bash

# Check driver version
chromedriver --version

# Check Chrome version
google-chrome --version  # Linux
# or open chrome://version in browser

# Download matching version from https://chromedriver.chromium.org/

Issue 2: Database Connection Failures

mysql.connector.errors.ProgrammingError: Access denied for user

Solution: Verify the MySQL credentials in .env and ensure the user has proper permissions.

sql

-- Grant permissions
GRANT ALL PRIVILEGES ON hotsearch_db.* TO 'your_user'@'localhost';
FLUSH PRIVILEGES;

Issue 3: LLM API Rate Limits

openai.error.RateLimitError: Rate limit exceeded

Solution: Implement request throttling or switch to local model:

python

import time
from functools import wraps

def rate_limit(calls_per_minute=10):
    min_interval = 60.0 / calls_per_minute
    last_called = [0.0]
    
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            elapsed = time.time() - last_called[0]
            wait_time = min_interval - elapsed
            if wait_time > 0:
                time.sleep(wait_time)
            result = func(*args, **kwargs)
            last_called[0] = time.time()
            return result
        return wrapper
    return decorator

@rate_limit(calls_per_minute=10)
def call_llm(prompt):
    return analyzer.generate(prompt)
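
Throttling lowers the chance of hitting the limit but does not handle the error when it still occurs. A common complement is retrying with exponential backoff. The sketch below is generic: `retry_with_backoff` is a hypothetical helper, and the exception types to retry on must be chosen to match the client library in use (it defaults to any `Exception` here only to stay self-contained).

```python
import random
import time

def retry_with_backoff(func, max_attempts=5, base_delay=1.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call func(); on a retryable error wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)

# Demonstration with a function that fails twice before succeeding.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

waits = []
print(retry_with_backoff(flaky, sleep=waits.append), len(waits))
```

Passing `sleep=waits.append` records the delays instead of actually waiting, which keeps the demonstration fast; in real use the default `time.sleep` applies.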

Issue 4: Crawler Being Blocked

Solution: Rotate user agents and add delays:

python

# In hotsearchcrawler/settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}

DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True
CONCURRENT_REQUESTS_PER_DOMAIN = 2

Issue 5: Encoding Issues with Chinese Text

Solution: Ensure UTF-8 encoding throughout:

python

import os

# Database connection
import mysql.connector

conn = mysql.connector.connect(
    host=os.getenv('MYSQL_HOST'),
    user=os.getenv('MYSQL_USER'),
    password=os.getenv('MYSQL_PASSWORD'),
    database=os.getenv('MYSQL_DATABASE'),
    charset='utf8mb4',
    collation='utf8mb4_unicode_ci'
)

# File operations (report is the report text generated earlier)
with open('report.md', 'w', encoding='utf-8') as f:
    f.write(report)

Advanced Configuration

Using Huawei Pangu Model (Local Deployment)

Download and deploy the model:

bash

# Download from https://ai.gitcode.com/ascend-tribe/openpangu-embedded-7b-model

# Start model service
python -m hotsearch_analysis_agent.llm.pangu_server --model_path /path/to/model --port 8080

Configure in code:

```python
from hotsearch_analysis_agent.analyzer import HotSearchAnalyzer
from hotsearch_analysis_agent.llm import PanguLLM

analyzer = HotSearchAnalyzer(
    llm=PanguLLM(api_url='http://localhost:8080')
)
```

Distributed Crawling

Scale up with multiple crawler instances:

bash

# Instance 1: Weibo, Zhihu
python run_spiders.py --platforms weibo,zhihu

# Instance 2: Bilibili, Douyin
python run_spiders.py --platforms bilibili,douyin

# Instance 3: News platforms
python run_spiders.py --platforms baidu,toutiao
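
When the platform list grows, the per-instance commands can be generated instead of written by hand. The sketch below distributes platforms round-robin and builds one command line per instance. It relies only on the `--platforms` flag shown above; `partition` and `build_commands` are hypothetical helpers, and round-robin is just one possible assignment (grouping by site type, as in the manual example, is another).

```python
def partition(platforms, instances):
    """Distribute platforms round-robin across `instances` groups, dropping empty groups."""
    groups = [[] for _ in range(instances)]
    for index, platform in enumerate(platforms):
        groups[index % instances].append(platform)
    return [group for group in groups if group]

def build_commands(platforms, instances):
    """Return one run_spiders.py command line per non-empty group."""
    return [
        f"python run_spiders.py --platforms {','.join(group)}"
        for group in partition(platforms, instances)
    ]

platforms = ["weibo", "zhihu", "bilibili", "douyin", "baidu", "toutiao"]
for command in build_commands(platforms, 3):
    print(command)
```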

Project Structure Reference

.
├── app.py                          # Web application entry
├── run_spiders.py                  # Crawler launcher
├── runspider-test.py               # Crawler testing
├── test_push_task.py               # Push notification testing
├── init.py                         # Database initialization
├── requirements.txt                # Python dependencies
├── .env                            # Environment configuration
├── hotsearchcrawler/               # Crawler cluster
│   ├── spiders/                    # Platform-specific spiders
│   ├── settings.py                 # Crawler settings
│   └── pipelines.py                # Data pipelines
└── hotsearch_analysis_agent/       # Analysis system
    ├── analyzer.py                 # Core analysis engine
    ├── llm/                        # LLM integrations
    ├── push/                       # Push notification modules
    ├── api/                        # Web API endpoints
    └── content_extractor.py        # Content extraction utilities
