- **Smart Scraping**: Intelligently scrapes elite posts while respecting Douban's access patterns and rate limits.
- **Image Downloads**: Downloads and organizes all images associated with each post, preserving the integrity of the original content.
- **Markdown Export**: Converts posts into well-structured Markdown files, suitable for offline reading and archival.
- **Error Handling**: Comprehensive error management for network issues, missing content, and file system operations.
- **Rate Limiting**: Built-in delays and smart request handling to avoid overwhelming Douban's servers.
- **Metadata Preservation**: Retains important post information, including author details and source URLs.
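The error handling and rate limiting described above can be combined in a single fetch helper. This is a minimal sketch, not the project's actual implementation; the function name `fetch_with_retry` and the injectable `opener` parameter are illustrative assumptions:

```python
import time
import urllib.request
import urllib.error

def fetch_with_retry(url, retries=3, delay=2, opener=urllib.request.urlopen):
    """Hypothetical helper: fetch a URL, sleeping `delay` seconds
    between attempts so the server is not hammered on failures."""
    last_error = None
    for attempt in range(retries):
        try:
            with opener(url) as resp:
                return resp.read()
        except urllib.error.URLError as exc:
            last_error = exc
            time.sleep(delay)  # back off before the next attempt
    raise last_error  # all retries exhausted
```

Passing the opener as a parameter keeps the retry logic testable without real network access.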
- Clone the repository:

  ```bash
  git clone https://github.com/ChanMeng666/douban-elite-scraper.git
  cd douban-elite-scraper
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Run the scraper:

  ```bash
  python main.py
  ```

- Configure target groups by editing `main.py`:

  ```python
  # Skip specific posts by title
  skip_titles = ["够用就好2"]

  # Target group URL
  base_url = "https://www.douban.com/group/662976/?type=elite#topics"
  ```
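The `skip_titles` setting above might be applied with a simple substring check. This is a sketch; the actual filtering logic lives in `scraper.py` and may differ, and `should_skip` is a hypothetical name:

```python
skip_titles = ["够用就好2"]

def should_skip(title, skip_titles):
    """Return True when a post title matches a configured skip entry."""
    # Substring match, so partial titles also filter the post
    return any(skip in title for skip in skip_titles)
```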
```
douban-elite-scraper/
├── main.py           # Main script and entry point
├── scraper.py        # Core scraping functionality
└── requirements.txt  # Project dependencies
```
Each scraped post creates its own directory:

```
Post_Title_123abc/
├── post.md
├── image_1.jpg
├── image_2.jpg
└── image_3.jpg
```
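Directory names like the one in the example above (title plus a short suffix) could be produced by sanitizing the title and appending a hash of the post URL. This is an assumption about the naming scheme, and `post_dir_name` is a hypothetical helper:

```python
import hashlib
import re

def post_dir_name(title, url):
    """Hypothetical sketch of the Post_Title_123abc naming scheme."""
    # Replace characters that are unsafe in file names with underscores
    safe = re.sub(r"[^\w\-]+", "_", title).strip("_")
    # A short hash of the URL keeps directories unique when titles repeat
    suffix = hashlib.md5(url.encode("utf-8")).hexdigest()[:6]
    return f"{safe}_{suffix}"
```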
The `post.md` file contains:
- Post title
- Author information
- Original URL
- Post content
- Image references
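Assembling those fields into `post.md` could look like the sketch below. The exact layout the scraper emits is an assumption, and `build_post_markdown` is a hypothetical name:

```python
def build_post_markdown(title, author, url, content, images):
    """Assemble a post.md body from the fields listed above."""
    lines = [
        f"# {title}",
        "",
        f"**Author:** {author}",
        f"**Source:** {url}",
        "",
        content,
        "",
    ]
    # Reference the downloaded images by their local filenames
    lines += [f"![image]({name})" for name in images]
    return "\n".join(lines)
```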
The scraper includes several configurable options in the `DoubanScraper` class:
- User-Agent headers
- File naming patterns
- Rate limiting delays
- Output formatting
The scraper implements a 2-second delay between requests by default. Adjust it in `main.py`:

```python
time.sleep(2)  # Adjust delay as needed
```
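A fixed delay sends requests at perfectly regular intervals. One gentler variation is to add random jitter; `polite_delay` below is a hypothetical helper, not part of the project:

```python
import random

def polite_delay(base=2.0, jitter=1.0):
    # Base delay plus up to `jitter` extra seconds, so requests
    # don't land at exactly regular intervals
    return base + random.uniform(0, jitter)
```

Between requests, `time.sleep(polite_delay())` could then replace the fixed `time.sleep(2)`.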
Contributions are welcome! Please feel free to submit a Pull Request.
This tool is for educational purposes only. Please ensure compliance with Douban's terms of service and implement appropriate rate limiting. The user is responsible for how they use this tool.
This project is licensed under the MIT License - see the LICENSE file for details.
## 🔧 Advanced Configuration
The `DoubanScraper` class provides additional configuration options:

```python
scraper = DoubanScraper(
    headers={'User-Agent': 'your-custom-user-agent'},
    delay=3,                    # Custom delay between requests
    output_format='markdown'    # Output format
)
```
See `scraper.py` for more configuration options.
Created and maintained by Chan Meng.