# devRant Exhaustive Crawler

Author: retoor <retoor@molodetz.nl>

An asynchronous crawler for comprehensive data collection from the devRant platform. It implements a producer-consumer architecture with multiple discovery strategies to maximize content coverage.
## SSL Note

The devRant API's SSL certificate is expired, so this crawler disables SSL verification to maintain connectivity. The API client handles this automatically.
## Architecture

The crawler employs four concurrent producers feeding into worker pools:

| Producer | Strategy | Interval |
|----------|----------|----------|
| Recent | Paginate through recent rants | 2s |
| Top | Paginate through top-rated rants | 5s |
| Algo | Paginate through algorithm-sorted rants | 5s |
| Search | Cycle through 48 programming-related search terms | 30s |
Worker pools process discovered content:

- 10 rant consumers fetch rant details and extract comments
- 5 user consumers fetch profiles and discover associated rants

Discovery graph: rants reveal users, and users reveal more rants (from their profiles, upvoted rants, and favorites).
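
A minimal sketch of this producer-consumer layout, using asyncio queues with stubbed fetchers; the names here (`fetch_recent`, `recent_producer`, `rant_consumer`) are illustrative and not taken from `crawler.py`:

```python
import asyncio

# Stub fetchers standing in for the devranta API calls (illustrative only).
async def fetch_recent(skip: int) -> list[dict]:
    await asyncio.sleep(0.1)
    return [{"id": skip + i, "user_id": 1000 + skip + i} for i in range(3)]

async def fetch_rant(rant_id: int) -> dict:
    await asyncio.sleep(0.1)
    return {"id": rant_id, "comments": []}

async def recent_producer(rant_q: asyncio.Queue, interval: float = 2.0) -> None:
    """Pages through 'recent' rants and enqueues them for the consumers."""
    skip = 0
    for _ in range(2):  # bounded for the example; the real producer loops forever
        for rant in await fetch_recent(skip):
            await rant_q.put(rant)
        skip += 20
        await asyncio.sleep(interval)

async def rant_consumer(rant_q: asyncio.Queue, user_q: asyncio.Queue) -> None:
    """Fetches rant details and feeds discovered users back into the graph."""
    while True:
        rant = await rant_q.get()
        await fetch_rant(rant["id"])          # fetch details and comments
        await user_q.put(rant["user_id"])     # discovery: rant -> user -> more rants
        rant_q.task_done()

async def main() -> None:
    rant_q, user_q = asyncio.Queue(), asyncio.Queue()
    producer = asyncio.create_task(recent_producer(rant_q))
    consumers = [asyncio.create_task(rant_consumer(rant_q, user_q)) for _ in range(10)]
    await producer
    await rant_q.join()                       # wait for consumers to drain the queue
    for c in consumers:
        c.cancel()

asyncio.run(main())
```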
## Data Storage

Uses SQLite via the dataset library with:

- Batched writes (flushed every 100 items or 5 seconds)
- Automatic upserts for deduplication
- Indexes on user_id, created_time, and rant_id
- State persistence for resume capability
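
A minimal sketch of the batching pattern, assuming the dataset library; the `BatchedWriter` class and its layout are illustrative rather than the actual `database.py` implementation:

```python
import time
import dataset

class BatchedWriter:
    """Buffers rows per table and flushes once a size or time threshold is hit."""

    def __init__(self, url: str = "sqlite:///devrant.sqlite",
                 batch_size: int = 100, flush_interval: float = 5.0):
        self.db = dataset.connect(url)
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self.buffers: dict[str, list[dict]] = {}
        self.last_flush = time.monotonic()

    def add(self, table: str, row: dict) -> None:
        self.buffers.setdefault(table, []).append(row)
        overdue = time.monotonic() - self.last_flush >= self.flush_interval
        if len(self.buffers[table]) >= self.batch_size or overdue:
            self.flush()

    def flush(self) -> None:
        for table, rows in self.buffers.items():
            for row in rows:
                # Upserting on the primary id keeps re-crawled items deduplicated.
                self.db[table].upsert(row, ["id"])
        self.buffers.clear()
        self.last_flush = time.monotonic()

writer = BatchedWriter()
writer.add("rants", {"id": 1, "user_id": 42, "text": "example", "score": 3})
writer.flush()
```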
### Schema

**rants**: id, user_id, text, score, created_time, num_comments, attached_image_url, tags, link, vote_state, user_username, user_score

**comments**: id, rant_id, user_id, body, score, created_time, vote_state, user_username, user_score

**users**: id, username, score, about, location, created_time, skills, github, website

**crawler_state**: persists producer positions (skip values, search term index)
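
Once collected, the data can be read back with the same dataset library. A small example query against the tables above (the rant id is a placeholder):

```python
import dataset

db = dataset.connect("sqlite:///devrant.sqlite")

# Five highest-scoring rants collected so far.
for rant in db["rants"].find(order_by="-score", _limit=5):
    print(rant["score"], rant["user_username"], rant["text"][:60])

# All comments on one rant, oldest first (12345 is a placeholder id).
comments = list(db["comments"].find(rant_id=12345, order_by="created_time"))
print(len(comments), "comments")
```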
## Usage

### Quick Start

```bash
make
```

This creates a virtual environment, installs dependencies, and starts the crawler.
### Manual Setup

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -e ../../.
pip install -r requirements.txt
python main.py
```
### Stopping

Press `Ctrl+C` for a graceful shutdown. The crawler will:

1. Save the current state to the database
2. Wait up to 30 seconds for queues to drain
3. Flush remaining batched writes
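
A minimal sketch of that shutdown sequence with asyncio; `save_state` and `flush` stand in for the crawler's real persistence calls and are illustrative only:

```python
import asyncio

async def shutdown(rant_q: asyncio.Queue, user_q: asyncio.Queue,
                   save_state, flush, timeout: float = 30.0) -> None:
    """Persist state, give the queues a bounded chance to drain, then flush writes."""
    save_state()                                          # 1. save producer positions
    try:
        # 2. wait up to `timeout` seconds for both queues to drain
        await asyncio.wait_for(asyncio.gather(rant_q.join(), user_q.join()), timeout)
    except asyncio.TimeoutError:
        pass                                              # give up after the timeout
    flush()                                               # 3. write out buffered rows

# Wiring with no-op callbacks just to show the call shape:
async def demo() -> None:
    await shutdown(asyncio.Queue(), asyncio.Queue(),
                   save_state=lambda: None, flush=lambda: None)

asyncio.run(demo())
```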
### Resuming

Simply run the crawler again; it loads the saved state and continues from where it stopped.
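
Conceptually, resuming only needs the producer positions stored in `crawler_state`. A sketch of that round trip with the dataset library; the exact row layout here is an assumption, not necessarily what `database.py` uses:

```python
import dataset

db = dataset.connect("sqlite:///devrant.sqlite")
state = db["crawler_state"]

# On shutdown: persist each producer's position, keyed by producer name.
state.upsert({"producer": "recent", "skip": 1240}, ["producer"])
state.upsert({"producer": "search", "term_index": 17}, ["producer"])

# On startup: load positions back, defaulting to zero for a fresh database.
row = state.find_one(producer="recent")
recent_skip = row["skip"] if row else 0
print("resuming recent producer at skip =", recent_skip)
```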
## Configuration

Edit `main.py` to adjust:

```python
DB_FILE = "devrant.sqlite"
CONCURRENT_RANT_CONSUMERS = 10
CONCURRENT_USER_CONSUMERS = 5
BATCH_SIZE = 100
FLUSH_INTERVAL = 5.0
```
## Output

The crawler logs statistics every 15 seconds:

```
[STATS] Rants Q'd/Proc: 1250/1200 | Users Q'd/Proc: 450/400 | Comments DB: 5600 | Queues (R/U): 50/50 | API Errors: 0
```
## Cleanup

```bash
make clean
```

Removes the virtual environment; the database file (`devrant.sqlite`) is preserved.
## Requirements

- Python 3.10+
- dataset
- aiohttp (via the parent devranta package)
## File Structure

```
crawler/
├── main.py           # Entry point, configuration
├── crawler.py        # Producer-consumer implementation
├── database.py       # Dataset wrapper with batch queue
├── requirements.txt  # Dependencies
├── Makefile          # Build automation
├── .venv/            # Virtual environment (created on first run)
└── devrant.sqlite    # SQLite database (created on first run)
```