- **Smart Scraping**: Intelligently scrapes elite posts while respecting Douban's access patterns and rate limits.
- **Image Downloads**: Downloads and organizes all images associated with each post, preserving the integrity of the original content.
- **Markdown Export**: Converts posts into well-structured Markdown files, suitable for offline reading and archival.
- **Error Handling**: Comprehensive error management for network issues, missing content, and file system operations.
- **Rate Limiting**: Built-in delays and smart request handling to avoid overwhelming Douban's servers.
- **Metadata Preservation**: Retains important post information, including author details and source URLs.
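The error handling and rate limiting described above can be combined in a single fetch helper. This is a minimal sketch, not the project's actual implementation; the function name `fetch_with_retry` and the injectable `opener` parameter are illustrative assumptions:

```python
import time
import urllib.request
import urllib.error

def fetch_with_retry(url, retries=3, delay=2, opener=urllib.request.urlopen):
    """Hypothetical helper: fetch a URL, sleeping `delay` seconds
    between attempts so the server is not hammered on failures."""
    last_error = None
    for attempt in range(retries):
        try:
            with opener(url) as resp:
                return resp.read()
        except urllib.error.URLError as exc:
            last_error = exc
            time.sleep(delay)  # back off before the next attempt
    raise last_error  # all retries exhausted
```

Passing the opener as a parameter keeps the retry logic testable without real network access.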
- Clone the repository:

  ```bash
  git clone https://github.com/ChanMeng666/douban-elite-scraper.git
  cd douban-elite-scraper
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Run the scraper:

  ```bash
  python main.py
  ```

- Configure target groups by editing `main.py`:

  ```python
  # Skip specific posts by title
  skip_titles = ["够用就好2"]

  # Target group URL
  base_url = "https://www.douban.com/group/662976/?type=elite#topics"
  ```
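The `skip_titles` setting above might be applied with a simple substring check. This is a sketch; the actual filtering logic lives in `scraper.py` and may differ, and `should_skip` is a hypothetical name:

```python
skip_titles = ["够用就好2"]

def should_skip(title, skip_titles):
    """Return True when a post title matches a configured skip entry."""
    # Substring match, so partial titles also filter the post
    return any(skip in title for skip in skip_titles)
```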
```
douban-elite-scraper/
├── main.py           # Main script and entry point
├── scraper.py        # Core scraping functionality
└── requirements.txt  # Project dependencies
```
Each scraped post creates its own directory:

```
Post_Title_123abc/
├── post.md
├── image_1.jpg
├── image_2.jpg
└── image_3.jpg
```
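Directory names like the one in the example above (title plus a short suffix) could be produced by sanitizing the title and appending a hash of the post URL. This is an assumption about the naming scheme, and `post_dir_name` is a hypothetical helper:

```python
import hashlib
import re

def post_dir_name(title, url):
    """Hypothetical sketch of the Post_Title_123abc naming scheme."""
    # Replace characters that are unsafe in file names with underscores
    safe = re.sub(r"[^\w\-]+", "_", title).strip("_")
    # A short hash of the URL keeps directories unique when titles repeat
    suffix = hashlib.md5(url.encode("utf-8")).hexdigest()[:6]
    return f"{safe}_{suffix}"
```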
The `post.md` file contains:
- Post title
- Author information
- Original URL
- Post content
- Image references
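Assembling those fields into `post.md` could look like the sketch below. The exact layout the scraper emits is an assumption, and `build_post_markdown` is a hypothetical name:

```python
def build_post_markdown(title, author, url, content, images):
    """Assemble a post.md body from the fields listed above."""
    lines = [
        f"# {title}",
        "",
        f"**Author:** {author}",
        f"**Source:** {url}",
        "",
        content,
        "",
    ]
    # Reference the downloaded images by their local filenames
    lines += [f"![image]({name})" for name in images]
    return "\n".join(lines)
```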
The scraper includes several configurable options in the `DoubanScraper` class:
- User-Agent headers
- File naming patterns
- Rate limiting delays
- Output formatting
The scraper implements a 2-second delay between requests by default. Adjust it in `main.py`:

```python
time.sleep(2)  # Adjust delay as needed
```
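A fixed delay sends requests at perfectly regular intervals. One gentler variation is to add random jitter; `polite_delay` below is a hypothetical helper, not part of the project:

```python
import random

def polite_delay(base=2.0, jitter=1.0):
    # Base delay plus up to `jitter` extra seconds, so requests
    # don't land at exactly regular intervals
    return base + random.uniform(0, jitter)
```

Between requests, `time.sleep(polite_delay())` could then replace the fixed `time.sleep(2)`.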
Contributions are welcome! Please feel free to submit a Pull Request.
This tool is for educational purposes only. Please ensure compliance with Douban's terms of service and implement appropriate rate limiting. The user is responsible for how they use this tool.
This project is licensed under the MIT License - see the LICENSE file for details.
## 🔧 Advanced Configuration
The `DoubanScraper` class provides additional configuration options:

```python
scraper = DoubanScraper(
    headers={'User-Agent': 'your-custom-user-agent'},
    delay=3,                    # Custom delay between requests
    output_format='markdown'    # Output format
)
```
See `scraper.py` for more configuration options.
Created and maintained by Chan Meng.