Welcome to the AI Scraper project repository! This project uses Python, Selenium, BeautifulSoup, and the Ollama language model to scrape, parse, and extract information from web pages.
The AI Scraper is designed to handle complex web scraping tasks, including captcha solving, HTML parsing, and structured data extraction with a language model run through Ollama.
Before you begin, ensure you have the following installed:
- Python 3.8 or higher
- pip (Python package installer)
- Clone the Repository:
    git clone https://github.com/umutkayash/AI-Scraper.git
    cd AI-Scraper
- Set Up a Virtual Environment (recommended):
    python -m venv venv
    source venv/bin/activate  # On Windows use `venv\Scripts\activate`
- Install Dependencies:
    pip install -r requirements.txt
- Set the Selenium WebDriver URL as an environment variable. You can do this by creating a .env file in the project root with the following content:
    WEBDRIVER_URL="your_webdriver_url_here"
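To illustrate how a script might pick up this setting, here is a minimal sketch that reads a .env file using only the standard library. The helper names `load_env_file` and `get_webdriver_url` are hypothetical and not taken from this repository; the project itself may rely on a package such as python-dotenv instead.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Read KEY=VALUE lines from a .env file into os.environ (hypothetical helper)."""
    env_path = Path(path)
    if not env_path.exists():
        return {}
    loaded = {}
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without a KEY=VALUE shape.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        # Strip surrounding whitespace and quote characters from the value.
        value = value.strip().strip("\"'")
        loaded[key] = value
        # Variables already set in the real environment take precedence.
        os.environ.setdefault(key, value)
    return loaded


def get_webdriver_url():
    """Return WEBDRIVER_URL, or fail early with a clear message (hypothetical helper)."""
    url = os.environ.get("WEBDRIVER_URL")
    if not url:
        raise RuntimeError("WEBDRIVER_URL is not set; add it to .env")
    return url
```

Calling `load_env_file()` once at startup and then `get_webdriver_url()` wherever the driver is created keeps the missing-variable error close to its cause.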
To run the scraper:
- Activate your virtual environment if not already activated:
    source venv/bin/activate  # On Windows use `venv\Scripts\activate`
- Run the Scraper:
    python main.py
  Replace main.py with the script you wish to run.
The AI Scraper performs the following steps:
- Connects to a web page using Selenium.
- Handles any captchas using configured settings.
- Extracts HTML content and parses it using BeautifulSoup.
- Segments the HTML content if necessary.
- Uses the Ollama model to extract specific information based on user-defined criteria.
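The segmentation and extraction steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's actual code: the function names, the 6000-character default, and the prompt wording are all hypothetical, and `ask_model` stands in for whatever call sends a prompt to the Ollama model and returns its text reply.

```python
def split_into_chunks(text, max_chars=6000):
    """Split text into consecutive pieces of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def build_prompt(chunk, criteria):
    """Combine the user's criteria with one chunk of page content (assumed wording)."""
    return (
        "Extract only the information that matches this description: "
        f"{criteria}\n"
        "Return an empty string if nothing matches.\n\n"
        f"Content:\n{chunk}"
    )


def extract(text, criteria, ask_model, max_chars=6000):
    """Run the model over each chunk and join the non-empty answers.

    ask_model is any callable that takes a prompt string and returns the
    model's reply as a string (an assumption made for this sketch).
    """
    results = []
    for chunk in split_into_chunks(text, max_chars):
        answer = ask_model(build_prompt(chunk, criteria)).strip()
        if answer:
            results.append(answer)
    return "\n".join(results)
```

Passing the model call in as a parameter keeps the chunking logic testable without a running Ollama server. Splitting on a fixed character count is the simplest option; it can cut an element in half, so a real implementation might prefer to split on tag or paragraph boundaries.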
Contributions are welcome! Please fork the repository and create a pull request with your changes.
This project is licensed under the MIT License - see the LICENSE file for details.