This project is an AI-powered web scraper that uses OpenAI's GPT model to dynamically analyze and optimize CSS selectors for reliable web scraping.
- Dynamic CSS selector optimization using AI
- Visual feedback with highlighted elements in the browser
- Automatic screenshot capture for AI analysis
- Simplified DOM tree structure analysis
- Configurable scraping goals
- Node.js (v14 or later recommended)
- An OpenAI API key
- Clone the repository:
    git clone https://github.com/yourusername/ai-powered-web-scraper.git
    cd ai-powered-web-scraper
- Install dependencies:
    npm install
- Create a config.js file in the root directory with your OpenAI API key:
    module.exports = {
      OPENAI_API_KEY: 'your-api-key-here',
      MODEL: 'gpt-4o-mini'
    };
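A forgotten placeholder key is an easy mistake to make at this step. The following is a minimal sketch of a check that could run before any API call. It assumes only the two fields shown in config.js; the `validateConfig` helper itself is hypothetical and not part of the project.

```javascript
// Hypothetical helper: returns a list of human-readable problems found in
// a config object shaped like the config.js example (empty list = looks fine).
function validateConfig(config) {
  const problems = [];

  if (typeof config.OPENAI_API_KEY !== 'string' || config.OPENAI_API_KEY.trim() === '') {
    problems.push('OPENAI_API_KEY is missing');
  } else if (config.OPENAI_API_KEY === 'your-api-key-here') {
    // The placeholder from the setup instructions was never replaced.
    problems.push('OPENAI_API_KEY still holds the placeholder value');
  }

  if (typeof config.MODEL !== 'string' || config.MODEL.trim() === '') {
    problems.push('MODEL is missing');
  }

  return problems;
}
```

One way to use it would be to call `validateConfig(require('./config'))` at the top of the main script and exit early if the returned list is not empty.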
To start the web scraper, run:
node crawler.js
You can modify the scrapingGoal and target URL in the crawler.js file to customize the scraping task.
- The scraper starts with an initial CSS selector and loads the target webpage.
- It captures a screenshot and analyzes the DOM structure.
- The AI model analyzes the current selector, screenshot, and DOM structure to suggest optimizations.
- The process repeats until the AI determines the selector is optimal or no further improvements can be made.
- Finally, the scraper extracts the desired information using the optimized selector.
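The steps above can be sketched as a small refinement loop. This is an illustration under stated assumptions, not the code in crawler.js: `capture` stands in for the browser step (screenshot plus DOM structure), `analyze` stands in for the AI call, and the `isOptimal` / `suggestedSelector` result fields and the `maxRounds` cap are hypothetical names chosen for the example.

```javascript
// Sketch of the selector refinement loop. `capture` and `analyze` are
// injected, hypothetical functions so the loop can run without a browser
// or an API key.
async function optimizeSelector(initialSelector, { capture, analyze, maxRounds = 5 }) {
  let selector = initialSelector;

  for (let round = 1; round <= maxRounds; round++) {
    // Capture a screenshot and the DOM structure for the current selector.
    const { screenshot, domTree } = await capture(selector);

    // Ask the model for a verdict and, optionally, a better selector.
    const { isOptimal, suggestedSelector } = await analyze({ selector, screenshot, domTree });

    // Stop when the model is satisfied or offers nothing new.
    if (isOptimal || !suggestedSelector || suggestedSelector === selector) {
      return { selector, rounds: round };
    }
    selector = suggestedSelector;
  }

  // Safety cap (an assumption of this sketch) to avoid looping forever.
  return { selector, rounds: maxRounds };
}
```

Injecting `capture` and `analyze` keeps the loop independent of any particular browser automation library or API client, which also makes it easy to exercise with stubs.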
- crawler.js: Main script that controls the web scraping process.
- openai.js: Handles interactions with the OpenAI API for selector analysis.
- config.js: Contains configuration settings (API key, model name).
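To make the role of openai.js more concrete, here is a rough sketch of how a selector-analysis request might be assembled. It only builds the request body in the Chat Completions message format (a text part plus an image part as a base64 data URL) and sends nothing. The `buildAnalysisRequest` name, the prompt wording, and the requested reply shape are assumptions for illustration; the actual module may be structured differently.

```javascript
// Hypothetical helper: builds a Chat Completions request body that combines
// the scraping goal, current selector, simplified DOM, and a screenshot.
function buildAnalysisRequest({ model, goal, selector, domTree, screenshotBase64 }) {
  // Prompt wording and reply format are assumptions of this sketch.
  const prompt = [
    `Scraping goal: ${goal}`,
    `Current selector: ${selector}`,
    `Simplified DOM: ${JSON.stringify(domTree)}`,
    'Reply with JSON: {"isOptimal": boolean, "suggestedSelector": string}',
  ].join('\n');

  return {
    model,
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: prompt },
          // Screenshots are passed inline as a base64-encoded PNG data URL.
          { type: 'image_url', image_url: { url: `data:image/png;base64,${screenshotBase64}` } },
        ],
      },
    ],
  };
}
```

Keeping payload construction separate from the network call makes the prompt easy to inspect and adjust without spending API credits.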
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License.