Skip to content

noobAIcoder/web-scraper

Repository files navigation

web-scraper

Project Summary: Web Scraping Tool with GUI

Objective: The objective of this project is to develop a web scraping tool with a graphical user interface (GUI) that allows users to extract data from websites that require login and have a search functionality. The tool will automate the process of logging in, performing a search, and scraping desired data from the website. The scraped data will be stored in a SQLite database for further analysis and processing.

Key Features:

  1. GUI Interface: The tool will provide a GUI built with PyQt5, allowing users to input website URL, login credentials, search query, and CSS selectors for data extraction and navigation.

  2. Login Automation: The tool will automate the login process by filling in the login form with user-provided credentials and checking for successful login based on the presence of a specific element on the page.

  3. Search Functionality: After successful login, the tool will allow users to enter a search query and perform a search on the website. The search results will be loaded and ready for data extraction.

  4. Data Extraction: Users will provide CSS selectors to identify the desired data elements on the page. The tool will extract the specified data from the search results and store it in a raw format in an SQLite database.

  5. Pagination Handling: If the search results are paginated, the tool will automatically navigate through the pages by clicking on the "next" button until all pages are processed.

  6. Progress Tracking: The GUI will display the progress of the scraping process, including the current page being scraped, the number of items extracted, and any errors encountered.

  7. Database Storage: The scraped data will be stored in a SQLite database for persistent storage and future analysis. The database will be created and managed by the tool. Columns should be following: URL, date, div contents

  8. Error Handling: The tool will incorporate robust error handling mechanisms to gracefully handle and report errors related to login, search, data extraction, and database operations.

  9. Logging: Detailed logging will be implemented to capture important events, errors, and debug information during the scraping process. Logs will be stored in a file for later review and troubleshooting.

  10. Configuration Management: The tool will store user-provided settings such as website URL, login credentials, and CSS selectors in a configuration file. These settings will be loaded automatically when the tool is launched, allowing users to quickly resume their scraping tasks.

  11. Multithreading: The scraping process will be executed in a separate thread to ensure a responsive GUI and prevent blocking of user interactions.

  12. Modularity and Reusability: The codebase will be organized into modular components, separating the GUI, scraping, database, and utility functions. This modular architecture will enhance code reusability and maintainability.

Technologies and Libraries:

  • Python: The primary programming language for the project.
  • PyQt5: The library used for building the GUI.
  • Selenium: A web automation library used for interacting with web pages, handling login, and performing search.
  • BeautifulSoup: A library for parsing HTML and extracting desired data from web pages.
  • SQLite: A lightweight and embedded database engine for storing scraped data.
  • Requests: A library for making HTTP requests to web pages.
  • ChromeDriver: A web driver used by Selenium to automate Google Chrome browser.

Project Structure: The project will be organized into the following files and directories:

  • main.py: The entry point of the application, responsible for creating the GUI and starting the application.
  • main_window.py: Defines the main window of the GUI and handles user interactions.
  • scraper.py: Defines the core scraping functionality, including login, search, and data extraction.
  • database.py: Defines the database operations for storing scraped data.
  • config.py: Handles configuration management for storing and loading user settings.
  • logger.py: Implements logging functionality for capturing important events and errors.
  • config.ini: A configuration file for storing user-provided settings.
  • requirements.txt: A file listing the project dependencies.
  • README.md: A readme file providing an overview of the project and installation instructions.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published