# Scrapy Web Scraping

Expert guidance for building web scrapers and crawlers with the Scrapy Python framework, covering best practices for spider development, data extraction, and pipeline management.
## Installation

```bash
npx skill4agent add mindrally/skills scrapy-web-scraping
```

## Project Structure

A standard Scrapy project follows this layout:

```
myproject/
    scrapy.cfg
    myproject/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            myspider.py
```
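The loader example later in this guide imports a `ProductItem` from `items.py`. A minimal sketch of what that declaration might look like, assuming the three fields the loader fills; the processors shown are illustrative choices, not part of the original guide (the `itemloaders` package ships as a Scrapy dependency):

```python
# myproject/items.py -- minimal sketch of the ProductItem assumed
# by the ItemLoader example below; processors are illustrative.
import scrapy
from itemloaders.processors import MapCompose, TakeFirst


class ProductItem(scrapy.Item):
    # TakeFirst() collapses the extracted list to a single value;
    # str.strip trims surrounding whitespace from each raw value.
    name = scrapy.Field(input_processor=MapCompose(str.strip),
                        output_processor=TakeFirst())
    price = scrapy.Field(input_processor=MapCompose(str.strip),
                         output_processor=TakeFirst())
    description = scrapy.Field(output_processor=TakeFirst())
```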
## Spider Development

Keep each spider focused on one site or one data source. Set `allowed_domains` so the crawl cannot wander off-site, override `start_requests()` when the initial requests need custom headers, cookies, or seeding logic, and keep `parse()` small, delegating detail pages to dedicated callbacks.
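A minimal sketch of such a spider; the domain, start URL, and selectors are placeholder assumptions:

```python
# Minimal spider skeleton -- domain, URLs, and selectors are
# placeholder assumptions for illustration.
import scrapy


class ProductSpider(scrapy.Spider):
    name = "products"
    allowed_domains = ["example.com"]  # keeps the crawl on-site

    def start_requests(self):
        # start_requests() lets you attach custom headers or cookies
        # to the very first requests.
        yield scrapy.Request(
            "https://example.com/catalog",
            headers={"Accept-Language": "en"},
            callback=self.parse,
        )

    def parse(self, response):
        # Keep parse() small: follow detail links and delegate
        # extraction to a dedicated callback.
        for href in response.css("a.product::attr(href)").getall():
            yield response.follow(href, callback=self.parse_product)

    def parse_product(self, response):
        yield {"url": response.url,
               "name": response.css("h1.product-title::text").get()}
```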
## Data Extraction

Prefer `ItemLoader` over ad-hoc dictionary building: it centralizes cleaning in input/output processors and keeps callbacks declarative. Use CSS selectors with the `::text` and `::attr()` pseudo-elements for simple fields, and fall back to XPath where the markup demands it.

```python
# Good practice: Using ItemLoader
from scrapy.loader import ItemLoader
from myproject.items import ProductItem

def parse_product(self, response):
    loader = ItemLoader(item=ProductItem(), response=response)
    loader.add_css('name', 'h1.product-title::text')
    loader.add_css('price', 'span.price::text')
    loader.add_xpath('description', '//div[@class="desc"]/text()')
    yield loader.load_item()
```

## Throttling and Politeness

Respect the sites you crawl: set a non-zero `DOWNLOAD_DELAY`, enable `AUTOTHROTTLE` so Scrapy adapts its pace to observed server latency, and cap `CONCURRENT_REQUESTS_PER_DOMAIN` so no single host is hammered. If a site blocks the default user agent, the third-party `scrapy-fake-useragent` package rotates user-agent strings via a downloader middleware, as sketched below.
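A sketch of the corresponding settings, assuming `scrapy-fake-useragent`'s documented middleware path; verify it against the version you install:

```python
# settings.py -- illustrative throttling + user-agent rotation setup.
DOWNLOAD_DELAY = 0.5                  # base delay between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 4    # cap pressure on any one host

AUTOTHROTTLE_ENABLED = True           # adapt pace to server latency

DOWNLOADER_MIDDLEWARES = {
    # Disable Scrapy's built-in user-agent handling...
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # ...and let scrapy-fake-useragent pick a random one per request.
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}
```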
## Item Pipelines

Do validation and cleanup in pipelines rather than in spider callbacks, and drop incomplete records early with `DropItem`. Remember to activate each pipeline in the `ITEM_PIPELINES` setting.

```python
from scrapy.exceptions import DropItem

class ValidationPipeline:
    def process_item(self, item, spider):
        if not item.get('name'):
            raise DropItem("Missing name field")
        return item
```

## Error Handling and Caching

Attach an `errback` to requests so DNS failures, timeouts, and non-2xx responses are logged and handled instead of silently lost (see the sketch below). During development, enable `HTTPCACHE_ENABLED` so repeated runs replay cached responses rather than re-hitting the site. For long-running crawls, the `scrapy.extensions.memusage` extension (the `MEMUSAGE_*` settings) can warn or shut the spider down before it exhausts memory.
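A sketch of errback wiring, following the pattern from Scrapy's documentation; the handler name and URL are placeholders. `failure.check()` distinguishes HTTP errors from lower-level network failures:

```python
# Illustrative errback wiring -- handler name and URL are assumptions.
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TimeoutError


class RobustSpider(scrapy.Spider):
    name = "robust"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/",
            callback=self.parse,
            errback=self.handle_error,  # called on any failure
        )

    def parse(self, response):
        self.logger.info("got %s", response.url)

    def handle_error(self, failure):
        # failure.check() tells us which kind of error occurred.
        if failure.check(HttpError):
            self.logger.error("HTTP error on %s", failure.value.response.url)
        elif failure.check(DNSLookupError, TimeoutError):
            self.logger.error("network error: %r", failure.request.url)
        else:
            self.logger.error("unexpected failure: %r", failure)
```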
## Production Settings

```python
# Recommended production settings
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 1
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
ROBOTSTXT_OBEY = True
HTTPCACHE_ENABLED = True
LOG_LEVEL = 'INFO'
```

## Testing

Use `scrapy.contracts` to attach lightweight tests directly to spider callbacks: contract directives in a callback's docstring declare the expected output, and `scrapy check` verifies them against a live response without running a full crawl.
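A sketch using the built-in contract directives; the URL and field names are placeholders:

```python
def parse_product(self, response):
    """Extract a single product page.

    @url https://example.com/product/1
    @returns items 1 1
    @returns requests 0 0
    @scrapes name price
    """
    # ... extraction logic as in the ItemLoader example above ...
```

Running `scrapy check <spider>` fetches each `@url`, invokes the callback, and asserts the `@returns` and `@scrapes` conditions.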