An intelligent multi-agent system that generates Python Scrapy scrapers based on user inputs. The system uses AI agents to analyze websites, generate navigation code, and create robust scrapers.
The system is built around three core components (see the sketch after this list):
- Navigator Agent: Analyzes website structure using Selenium and ChromeDriver
- Generator Agent: Creates Scrapy spiders based on analysis
- Debugger Agent: Tests and improves scrapers through feedback loops
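To make the division of labor concrete, the three agents can be pictured as sharing a common interface. This is a minimal, hypothetical sketch; the class and method names are assumptions, not the actual contents of `agents/`:

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseAgent(ABC):
    """Hypothetical shared interface for the three agents (illustrative only)."""

    @abstractmethod
    def run(self, context: dict[str, Any]) -> dict[str, Any]:
        """Consume the shared context and return it with new results attached."""


class NavigatorAgent(BaseAgent):
    """Illustrative only: analyzes a site and records an analysis report."""

    def run(self, context: dict[str, Any]) -> dict[str, Any]:
        # The real agent would drive Selenium/ChromeDriver here.
        context["analysis"] = {"url": context["url"], "pages": []}
        return context
```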
The system follows a modular architecture with clear separation of concerns:
- Agents: Specialized components for specific tasks
- Utils: Reusable utility functions
- Models: Data structures for system communication
- Prompts: Templates for AI interactions
- Scrapy Zyte API: For reliable web scraping
- Selenium/ChromeDriver: For website navigation and analysis
- OpenRouter API: For AI-powered code generation
- Pydantic: For structured data validation (see the example after this list)
- Loguru: For comprehensive logging
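As an example of the Pydantic-based validation, the data passed between agents could be modeled along these lines. This is a sketch; the model and field names are assumptions, not the actual contents of `models.py`:

```python
from pydantic import BaseModel, HttpUrl


class TargetField(BaseModel):
    """One field the generated spider should extract."""

    name: str                       # e.g. "price"
    type: str                       # e.g. "str", "number", "boolean"
    description: str | None = None


class ScrapeRequest(BaseModel):
    """Input passed from the user to the agents."""

    url: HttpUrl
    fields: list[TargetField]


request = ScrapeRequest(
    url="https://example.com",
    fields=[TargetField(name="product_name", type="str")],
)
```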
```
autoscraper/
├── agents/                  # AI agents for different tasks
│   ├── navigator.py         # Website navigation agent
│   ├── generator.py         # Scraper code generation agent
│   └── debugger.py          # Testing and debugging agent
├── base_spider/             # Base Scrapy project template
│   ├── items.py
│   ├── pipelines.py
│   └── settings.py
├── config.py                # Configuration settings
├── models.py                # Data models
├── utils/                   # Utility functions
│   ├── chrome_driver.py     # Browser automation
│   ├── html_parser.py       # HTML parsing
│   ├── openrouter.py        # OpenRouter API integration
│   ├── file_manager.py      # File operations
│   └── spider_runner.py     # Scrapy spider runner
├── prompts/                 # Prompt templates for AI agents
│   └── templates/
│       ├── page_analysis.jinja2
│       ├── website_analysis.jinja2
│       └── ...
├── requirements.txt         # Dependencies
├── autoscraper.py           # Main AutoScraper class
└── example.py               # Example usage
```
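To illustrate how the prompt templates shown above might be used, here is a minimal sketch that renders `page_analysis.jinja2` with Jinja2; the variables passed to the template are assumptions:

```python
from jinja2 import Environment, FileSystemLoader

# Load templates from the directory shown in the project layout above.
env = Environment(loader=FileSystemLoader("autoscraper/prompts/templates"))
template = env.get_template("page_analysis.jinja2")

# The variables passed here are illustrative; the real template may expect others.
prompt = template.render(url="https://example.com", html_snippet="<html>...</html>")
print(prompt)
```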
The system follows a structured workflow to ensure quality and iterative refinement (a sketch of the loop follows these steps):
1. **Website Analysis**
   - Navigates the website using Selenium
   - Identifies data sources and extraction methods
   - Generates a detailed analysis report
2. **Scraper Generation**
   - Creates a Scrapy spider based on the analysis
   - Implements navigation and extraction logic
   - Sets up pipelines for data processing
3. **Testing and Debugging**
   - Executes the spider with a timeout for testing
   - Analyzes output for errors and missing data
   - Iteratively improves the spider based on feedback
4. **Spider Execution**
   - Runs the generated spider without a timeout
   - Handles logging and error reporting
   - Saves output to the designated location
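As referenced above, the four steps can be pictured as a simple orchestration loop. This is a hedged sketch: the agent method names (`analyze`, `generate`, `test`, `refine`) and the `run_spider` callable are assumptions, not the project's actual API.

```python
from typing import Any, Callable


def build_and_run(
    navigator: Any,
    generator: Any,
    debugger: Any,
    run_spider: Callable[..., None],
    url: str,
    fields: dict[str, str],
    max_debug_iterations: int = 3,
) -> None:
    """Orchestrate the four workflow steps; all method names are assumptions."""
    analysis = navigator.analyze(url)                # 1. Website Analysis
    spider = generator.generate(analysis, fields)    # 2. Scraper Generation

    for _ in range(max_debug_iterations):            # 3. Testing and Debugging
        report = debugger.test(spider, timeout=120)  # run with a timeout
        if report.ok:
            break
        spider = debugger.refine(spider, report)     # improve based on feedback

    run_spider(spider, timeout=None)                 # 4. Spider Execution (no timeout)
```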
The system uses different AI models for specific tasks:
| Agent | Primary Model | Fallback Models |
|---|---|---|
| Navigator | gemini-2-flash | gpt-4o-mini, gemini-flash-1-5 |
| Generator | gpt-4o | gemini-exp, claude-3.5-sonnet |
| Debugger | gemini-2-flash-thinking | o1-mini, qwq |
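A configuration for this routing might look like the following. This is only a sketch of what `config.py` could contain; the variable name and structure are assumptions, while the model identifiers mirror the table above:

```python
# Hypothetical model routing; structure and variable name are assumptions.
MODEL_CONFIG = {
    "navigator": {
        "primary": "gemini-2-flash",
        "fallbacks": ["gpt-4o-mini", "gemini-flash-1-5"],
    },
    "generator": {
        "primary": "gpt-4o",
        "fallbacks": ["gemini-exp", "claude-3.5-sonnet"],
    },
    "debugger": {
        "primary": "gemini-2-flash-thinking",
        "fallbacks": ["o1-mini", "qwq"],
    },
}
```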
Development guidelines:

- **Code Structure**
  - Keep files under 300 lines
  - Maintain a clear separation of concerns
  - Use type hints for all function definitions
- **Environment Management**
  - Use Poetry for dependency management
  - Configure environment variables in `.env` (see the sketch after this list)
  - Use Docker for containerization
- **Code Quality**
  - Use Ruff for linting and formatting
  - Follow the PEP 8 style guide
  - Maintain comprehensive logging
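For example, secrets such as the OpenRouter API key could be loaded from `.env` like this. This is a sketch assuming `python-dotenv` and an `OPENROUTER_API_KEY` variable; neither name is confirmed by the project:

```python
import os

from dotenv import load_dotenv

load_dotenv()  # read variables from a local .env file into the environment

# Assumed variable name; adjust to whatever the project's config expects.
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
```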
Example usage:

```python
from autoscraper import AutoScraper

scraper = AutoScraper()

# Analyze the target website's structure
scraper.analyze_website("https://example.com")

# Declare the fields the generated spider should extract
scraper.set_target_fields({
    "product_name": "str",
    "price": "number",
    "availability": "boolean",
})

# Generate the Scrapy spider, then run it
scraper.generate()
scraper.run()
```
MIT License