- Robust scraping with rate limiting and retry mechanisms
- Advanced data cleaning and normalization
- Sentiment analysis categorization
- Efficient CSV export functionality
- Comprehensive error handling and logging
- Built-in protection against API rate limits
- Detailed comment metadata extraction
- Configurable scraping parameters
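The retry-with-rate-limiting behavior listed above can be sketched as follows. This is a minimal illustration, not the project's actual implementation; the `fetch_with_retry` helper and its parameters are hypothetical:

```python
import random
import time

def fetch_with_retry(fetch, max_retries=3, base_delay=1.0):
    """Call fetch(), retrying with exponential backoff on failure.

    Backoff plus random jitter spreads out retries so repeated
    failures do not hammer the server at a fixed interval.
    """
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the last error
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```

In the real scraper the `fetch` callable would wrap an HTTP request; here it is kept abstract so the retry logic stands on its own.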
- Python 3.8+
- Required packages (see `requirements.txt`):

```
beautifulsoup4==4.12.3
numpy==2.1.3
pandas==2.2.3
python-dateutil==2.9.0.post0
pytz==2024.2
requests~=2.32.3
six==1.16.0
soupsieve==2.6
tzdata==2024.2
```
- Clone the repository:

```bash
git clone https://github.com/ChanMeng666/douban-review-scraper.git
```

- Navigate to the project directory:

```bash
cd douban-review-scraper
```

- Install dependencies:

```bash
pip install -r requirements.txt
```
Edit `config.py` to customize your scraping parameters:

```python
MOVIE_ID = 'your_movie_id'  # Douban movie ID
MAX_PAGES = 50              # Maximum pages to scrape
REQUEST_TIMEOUT = 30        # Request timeout in seconds
RETRY_TIMES = 3             # Number of retry attempts
```
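To show how `MOVIE_ID` and `MAX_PAGES` might drive pagination, here is a hedged sketch. The URL pattern and the `COMMENTS_PER_PAGE` constant are assumptions based on Douban's public comments pages, not values taken from this project:

```python
COMMENTS_PER_PAGE = 20  # assumed page size for Douban short comments

def build_comments_url(movie_id, page):
    """Build the URL for one page of a movie's short comments.

    Douban paginates comments via a `start` offset (assumed pattern);
    page 0 starts at offset 0, page 1 at offset 20, and so on.
    """
    start = page * COMMENTS_PER_PAGE
    return (f"https://movie.douban.com/subject/{movie_id}/comments"
            f"?start={start}&limit={COMMENTS_PER_PAGE}&status=P")

# A scraper loop would then iterate: for page in range(MAX_PAGES): ...
```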
- Configure your target movie ID in `config.py`
- Run the scraper:

```bash
python main.py
```
The scraper generates CSV files containing:

- `timestamp`: Comment timestamp
- `content`: Cleaned comment text
- `rating`: User rating (1-5)
- `user_id`: Douban user ID
- `category`: Comment category (positive/negative/neutral)
- Respect Douban's robots.txt and API limitations
- Update cookies periodically for reliable operation
- Consider using proxies for large-scale scraping
- Check Douban's terms of service before use
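For the cookie-refresh advice above, one convenient pattern is pasting the `Cookie` header from your browser's developer tools and parsing it into a dict. This helper is a hypothetical sketch, not part of the project:

```python
def parse_cookie_header(raw):
    """Parse a browser-copied Cookie header ("a=1; b=2") into a dict.

    The resulting dict can be passed to requests, e.g.
    requests.get(url, cookies=parse_cookie_header(raw)).
    """
    jar = {}
    for pair in raw.split(";"):
        if "=" in pair:
            name, _, value = pair.strip().partition("=")
            jar[name] = value
    return jar
```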
Contributions are welcome! Please feel free to submit a Pull Request. Here's how you can contribute:
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Chan Meng
- LinkedIn: chanmeng666
- GitHub: ChanMeng666
Give a ⭐️ if this project helped you!