ใ€One star = One happy developer doing a little dance ๐Ÿ’ƒโญ๏ธใ€‘A robust Python scraper for collecting and analyzing movie reviews from Douban.com, featuring comprehensive data processing and analysis capabilities.

ChanMeng666/douban-review-scraper

🎬 Douban Movie Reviews Scraper

A powerful tool for collecting and analyzing Douban movie reviews


🚀 Features

  • 🔄 Robust scraping with rate limiting and retry mechanisms
  • 🧹 Advanced data cleaning and normalization
  • 📊 Sentiment analysis categorization
  • 💾 Efficient CSV export functionality
  • 🔍 Comprehensive error handling and logging
  • 🛡️ Built-in protection against API rate limits
  • 📝 Detailed comment metadata extraction
  • 🎯 Configurable scraping parameters
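
As a rough illustration of the retry mechanism listed above, here is a minimal sketch of retrying with exponential backoff. It is not the project's actual implementation: `fetch_with_retry`, `base_delay`, and the injectable `sleep` parameter are hypothetical names chosen for this example, and `fetch` stands in for whatever function performs the HTTP request.

```python
import random
import time


def fetch_with_retry(fetch, retry_times=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or retry_times attempts are used up.

    Hypothetical helper for illustration; the real scraper may differ.
    """
    if retry_times < 1:
        raise ValueError("retry_times must be at least 1")

    last_error = None
    for attempt in range(retry_times):
        try:
            return fetch()
        except Exception as exc:  # a real scraper would catch narrower errors
            last_error = exc
            if attempt < retry_times - 1:
                # Exponential backoff plus a little jitter: about 1s, 2s, 4s, ...
                delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
                sleep(delay)
    # All attempts failed: surface the last error to the caller.
    raise last_error
```

Passing `sleep` as a parameter keeps the helper easy to test, since a test can record the delays instead of actually waiting.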

🛠️ Requirements

  • Python 3.8+
  • Required packages:
    beautifulsoup4==4.12.3
    numpy==2.1.3
    pandas==2.2.3
    python-dateutil==2.9.0.post0
    pytz==2024.2
    requests~=2.32.3
    six==1.16.0
    soupsieve==2.6
    tzdata==2024.2
    

📦 Installation

  1. Clone the repository:
git clone https://github.com/ChanMeng666/douban-review-scraper.git
  2. Navigate to the project directory:
cd douban-review-scraper
  3. Install dependencies:
pip install -r requirements.txt

⚙️ Configuration

Edit config.py to customize your scraping parameters:

MOVIE_ID = 'your_movie_id'  # Douban movie ID
MAX_PAGES = 50              # Maximum pages to scrape
REQUEST_TIMEOUT = 30        # Request timeout in seconds
RETRY_TIMES = 3             # Number of retry attempts
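
To show how `MOVIE_ID` and `MAX_PAGES` might translate into requests, the sketch below builds one URL per page. The URL pattern, the `start`/`limit` query parameters, the `PAGE_SIZE` of 20, and the `build_page_urls` helper are all assumptions made for this example, not details confirmed by the project; check the actual scraper code before relying on them.

```python
from urllib.parse import urlencode

MOVIE_ID = "1234567"  # made-up placeholder, not a real movie ID
MAX_PAGES = 3
PAGE_SIZE = 20        # assumed number of comments per page


def build_page_urls(movie_id, max_pages, page_size=PAGE_SIZE):
    """Return one comments-page URL per page, using an assumed URL pattern."""
    base = f"https://movie.douban.com/subject/{movie_id}/comments"
    urls = []
    for page in range(max_pages):
        # Each page starts page_size comments after the previous one.
        query = urlencode({"start": page * page_size, "limit": page_size})
        urls.append(f"{base}?{query}")
    return urls


if __name__ == "__main__":
    for url in build_page_urls(MOVIE_ID, MAX_PAGES):
        print(url)
```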

🚀 Usage

  1. Configure your target movie ID in config.py
  2. Run the scraper:
python main.py

📊 Output Format

The scraper generates CSV files containing:

  • timestamp: Comment timestamp
  • content: Cleaned comment text
  • rating: User rating (1-5)
  • user_id: Douban user ID
  • category: Comment category (positive/negative/neutral)
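
The columns above could be written with Python's standard `csv` module, as in the sketch below. The rating thresholds in `categorize` (4 and above positive, 2 and below negative) are an assumption for illustration; the project may derive `category` differently. The sample row is invented.

```python
import csv
import io

FIELDNAMES = ["timestamp", "content", "rating", "user_id", "category"]


def categorize(rating):
    """Map a 1-5 rating to a category (assumed thresholds, not the project's)."""
    if rating >= 4:
        return "positive"
    if rating <= 2:
        return "negative"
    return "neutral"


def write_reviews(rows, stream):
    """Write review dicts to a file-like object as CSV with a header row."""
    writer = csv.DictWriter(stream, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in rows:
        # Add the derived category column to each row before writing.
        writer.writerow({**row, "category": categorize(row["rating"])})


if __name__ == "__main__":
    sample = [  # invented example data
        {"timestamp": "2024-01-01 12:00:00", "content": "Great film",
         "rating": 5, "user_id": "user_1"},
    ]
    buffer = io.StringIO()
    write_reviews(sample, buffer)
    print(buffer.getvalue())
```

When writing to a real file, open it with `newline=""` as the `csv` documentation recommends, so line endings are not doubled on Windows.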

⚠️ Important Notes

  • Respect Douban's robots.txt and API limitations
  • Update cookies periodically for reliable operation
  • Consider using proxies for large-scale scraping
  • Check Douban's terms of service before use
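
One way to act on the robots.txt note is to check a URL against the site's rules before requesting it. The sketch below uses the standard library's `urllib.robotparser`; the `is_allowed` helper and the sample rules are invented for illustration and do not reflect Douban's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Invented rules for demonstration only.
    sample_rules = "User-agent: *\nDisallow: /private/\n"
    print(is_allowed(sample_rules, "https://example.com/private/page"))
    print(is_allowed(sample_rules, "https://example.com/public"))
```

In practice the rules would be downloaded from the target site first, for example with `RobotFileParser.set_url()` followed by `read()`.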

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. Here's how you can contribute:

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

👥 Author

Chan Meng

🌟 Show your support

Give a ⭐️ if this project helped you!


Made with ❤️ by Chan Meng
