【Stars are like virtual high-fives - come on, don't leave us hanging!⭐️】A streamlined Python scraper for archiving elite posts from Douban groups into well-structured Markdown files with images, designed for efficient content preservation and offline reading.

ChanMeng666/douban-elite-scraper

🔍 Douban Elite Scraper

Archive elite posts from Douban groups with style

✨ Features

🎯 Smart Content Extraction

Intelligently scrapes elite posts while respecting Douban's access patterns and rate limits.

📸 Complete Media Preservation

Downloads and organizes all images associated with each post, maintaining the original content integrity.

📝 Clean Markdown Generation

Converts posts into well-structured Markdown files, perfect for offline reading and archival.

🔒 Robust Error Handling

Comprehensive error management for network issues, missing content, and file system operations.

🚦 Rate Limiting Protection

Built-in delays and smart request handling to avoid overwhelming Douban's servers.

📊 Metadata Preservation

Retains important post information including author details and source URLs.
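As a rough illustration of the error handling mentioned in the features above, the sketch below retries a failing operation a few times before giving up. The helper name `with_retries` and its parameters are hypothetical and are not taken from scraper.py; it only shows one common way to tolerate transient network or file system errors.

```python
import time


def with_retries(action, attempts=3, delay=2.0, sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out.

    Hypothetical helper: only OSError is retried, which covers most
    network and file system failures in the standard library.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError as error:
            last_error = error
            # Wait before the next try, but not after the final failure.
            if attempt < attempts:
                sleep(delay)
    raise last_error
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.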

🚀 Installation

  1. Clone the repository:
git clone https://github.com/ChanMeng666/douban-elite-scraper.git
cd douban-elite-scraper
  2. Install required dependencies:
pip install -r requirements.txt

💻 Usage

  1. Run the scraper:
python main.py
  2. To target a different group or skip specific posts, edit these values in main.py:
# Skip specific posts by title
skip_titles = ["够用就好2"]

# Target group URL
base_url = "https://www.douban.com/group/662976/?type=elite#topics"
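A minimal sketch of how a `skip_titles` list like the one above could be applied. It assumes each scraped post is a dict with a `"title"` key and that titles are compared by exact match; both are assumptions for illustration, and the real logic in main.py may differ.

```python
def filter_posts(posts, skip_titles):
    """Return the posts whose title is not listed in skip_titles.

    Assumption: each post is a dict with a "title" key, and titles are
    compared exactly after trimming surrounding whitespace.
    """
    skipped = {title.strip() for title in skip_titles}
    return [post for post in posts if post["title"].strip() not in skipped]


if __name__ == "__main__":
    # Hypothetical sample data, not real scraped posts.
    sample_posts = [
        {"title": "够用就好2", "url": "https://example.com/topic/1"},
        {"title": "Another post", "url": "https://example.com/topic/2"},
    ]
    for post in filter_posts(sample_posts, ["够用就好2"]):
        print(post["title"])
```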

📁 Project Structure

douban-elite-scraper/
├── main.py          # Main script and entry point
├── scraper.py       # Core scraping functionality
└── requirements.txt # Project dependencies

📦 Output Format

Each scraped post creates:

Post_Title_123abc/
├── post.md
├── image_1.jpg
├── image_2.jpg
└── image_3.jpg

The post.md file contains:

  • Post title
  • Author information
  • Original URL
  • Post content
  • Image references
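The layout above can be sketched roughly as follows. The function names, the exact Markdown layout, and the idea that the folder suffix is a short hash of the post URL are all assumptions made for illustration; the actual naming scheme in scraper.py is not documented here.

```python
import hashlib
import re
from pathlib import Path


def folder_name(title, url):
    """Build a folder name such as 'Post_Title_123abc'."""
    # Collapse each run of non-word characters into a single underscore.
    safe_title = re.sub(r"\W+", "_", title).strip("_") or "untitled"
    # Assumption: the suffix is a short hash of the URL, for uniqueness.
    suffix = hashlib.sha256(url.encode("utf-8")).hexdigest()[:6]
    return f"{safe_title}_{suffix}"


def render_post(title, author, url, content, image_files):
    """Render the fields listed above as Markdown text."""
    lines = [f"# {title}", "", f"Author: {author}", f"Source: {url}", "", content, ""]
    # Images are referenced by relative path, next to post.md.
    lines += [f"![{name}]({name})" for name in image_files]
    return "\n".join(lines) + "\n"


def write_post(root, title, author, url, content, image_files):
    """Create the post folder under root and write post.md into it."""
    post_dir = Path(root) / folder_name(title, url)
    post_dir.mkdir(parents=True, exist_ok=True)
    markdown = render_post(title, author, url, content, image_files)
    (post_dir / "post.md").write_text(markdown, encoding="utf-8")
    return post_dir
```

Because `\W` is Unicode-aware in Python 3, Chinese titles are kept as-is and only punctuation and whitespace turn into underscores.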

⚙️ Configuration

The scraper includes several configurable options in the DoubanScraper class:

  • User-Agent headers
  • File naming patterns
  • Rate limiting delays
  • Output formatting

🛡️ Rate Limiting

By default, the scraper waits 2 seconds between requests. To change the delay, edit this call in main.py:

time.sleep(2)  # Adjust delay as needed
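For context, a fixed delay like this usually sits inside the loop that visits each post. The sketch below shows that pattern; `fetch_all` and its `fetch` callback are hypothetical names, not functions from this project.

```python
import time


def fetch_all(urls, fetch, delay=2.0, sleep=time.sleep):
    """Fetch each URL in order, pausing `delay` seconds between requests.

    `fetch` is any callable that takes a URL and returns a result; it is
    a stand-in for the project's real request code.
    """
    results = []
    for index, url in enumerate(urls):
        # No pause is needed before the very first request.
        if index > 0:
            sleep(delay)
        results.append(fetch(url))
    return results
```

A longer delay makes a full run slower but puts less load on the server.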

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

⚠️ Legal Notice

This tool is for educational purposes only. Please ensure compliance with Douban's terms of service and implement appropriate rate limiting. The user is responsible for how they use this tool.

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

🔧 Advanced Configuration

The DoubanScraper class provides additional configuration options:

from scraper import DoubanScraper  # assumes the class is defined in scraper.py

scraper = DoubanScraper(
    headers={'User-Agent': 'your-custom-user-agent'},
    delay=3,                  # Custom delay between requests
    output_format='markdown'  # Output format
)

See scraper.py for more configuration options.
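To show what a custom `User-Agent` header does at the HTTP level, here is a small standard-library sketch. This README does not say which HTTP client scraper.py uses, so `urllib.request` is an assumption chosen only to keep the example self-contained, and `build_request` is a hypothetical helper.

```python
import urllib.request


def build_request(url, user_agent):
    """Create a request object that carries a custom User-Agent header.

    Nothing is sent over the network here; the request is only built.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    request = build_request(
        "https://www.douban.com/group/662976/?type=elite#topics",
        "your-custom-user-agent",
    )
    # urllib stores header names in capitalized form ("User-agent").
    print(request.get_header("User-agent"))
```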

🙋‍♀ Author

Created and maintained by Chan Meng.
