# devRant Exhaustive Crawler

Author: retoor

An asynchronous crawler for comprehensive data collection from the devRant platform. It implements a producer-consumer architecture with multiple discovery strategies to maximize content coverage.

## SSL Note

The devRant API SSL certificate is expired. This crawler disables SSL verification to maintain connectivity; the API client handles this automatically.

## Architecture

The crawler employs four concurrent producers feeding into worker pools:

| Producer | Strategy | Interval |
|----------|----------|----------|
| Recent | Paginate through recent rants | 2s |
| Top | Paginate through top-rated rants | 5s |
| Algo | Paginate through algorithm-sorted rants | 5s |
| Search | Cycle through 48 programming-related search terms | 30s |

Worker pools process the discovered content:

- 10 rant consumers fetch rant details and extract comments
- 5 user consumers fetch profiles and discover associated rants

Discovery graph: rants reveal users, and users reveal more rants (from their profile, upvoted rants, and favorites). A minimal sketch of this producer-consumer layout appears under Implementation Sketches at the end of this README.

## Data Storage

Uses SQLite via the dataset library with:

- Batched writes (100 items or a 5s interval)
- Automatic upserts for deduplication
- Indexes on user_id, created_time, and rant_id
- State persistence for resume capability

### Schema

**rants**: id, user_id, text, score, created_time, num_comments, attached_image_url, tags, link, vote_state, user_username, user_score

**comments**: id, rant_id, user_id, body, score, created_time, vote_state, user_username, user_score

**users**: id, username, score, about, location, created_time, skills, github, website

**crawler_state**: persists producer positions (skip values, search term index)

## Usage

### Quick Start

```bash
make
```

This creates a virtual environment, installs dependencies, and starts the crawler.

### Manual Setup

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -e ../../.
pip install -r requirements.txt
python main.py
```

### Stopping

Press `Ctrl+C` for a graceful shutdown. The crawler will:

1. Save the current state to the database
2. Wait up to 30 seconds for the queues to drain
3. Flush remaining batched writes

### Resuming

Run the crawler again. It loads the saved state and continues from where it stopped.

## Configuration

Edit `main.py` to adjust:

```python
DB_FILE = "devrant.sqlite"
CONCURRENT_RANT_CONSUMERS = 10
CONCURRENT_USER_CONSUMERS = 5
BATCH_SIZE = 100
FLUSH_INTERVAL = 5.0
```

## Output

The crawler logs statistics every 15 seconds:

```
[STATS] Rants Q'd/Proc: 1250/1200 | Users Q'd/Proc: 450/400 | Comments DB: 5600 | Queues (R/U): 50/50 | API Errors: 0
```

## Cleanup

```bash
make clean
```

Removes the virtual environment. The database file (`devrant.sqlite`) is preserved.

## Requirements

- Python 3.10+
- dataset
- aiohttp (via the parent devranta package)

## File Structure

```
crawler/
├── main.py           # Entry point, configuration
├── crawler.py        # Producer-consumer implementation
├── database.py       # Dataset wrapper with batch queue
├── requirements.txt  # Dependencies
├── Makefile          # Build automation
├── .venv/            # Virtual environment (created on first run)
└── devrant.sqlite    # SQLite database (created on first run)
```
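
## Implementation Sketches

The snippets below are minimal, self-contained sketches of the techniques described above. Names such as `fetch_recent_page` and `process_rant` are illustrative only and do not necessarily match the identifiers used in `crawler.py` or `database.py`.

### Producer-consumer sketch

One producer pushes discovered rant IDs onto an `asyncio.Queue` on a fixed interval while a pool of consumers drains it, mirroring the Recent/Top/Algo/Search producers and the rant/user worker pools described under Architecture. This is a simplified single-producer version for illustration.

```python
import asyncio

# Hypothetical stand-ins for the real API calls in crawler.py.
async def fetch_recent_page(skip: int) -> list[int]:
    await asyncio.sleep(0.1)              # simulate network latency
    return list(range(skip, skip + 20))   # pretend these are rant IDs

async def process_rant(rant_id: int) -> None:
    await asyncio.sleep(0.05)             # simulate fetching details + comments

async def recent_producer(queue: asyncio.Queue, interval: float = 2.0) -> None:
    skip = 0
    while True:
        for rant_id in await fetch_recent_page(skip):
            await queue.put(rant_id)      # hand off work to the consumer pool
        skip += 20
        await asyncio.sleep(interval)     # "Recent" producer runs every 2s

async def rant_consumer(queue: asyncio.Queue) -> None:
    while True:
        rant_id = await queue.get()
        try:
            await process_rant(rant_id)
        finally:
            queue.task_done()             # lets queue.join() drain cleanly

async def main() -> None:
    queue = asyncio.Queue(maxsize=1000)
    workers = [asyncio.create_task(rant_consumer(queue)) for _ in range(10)]
    producer = asyncio.create_task(recent_producer(queue))
    await asyncio.sleep(10)               # run briefly for demonstration
    producer.cancel()                     # stop discovering new work
    await queue.join()                    # wait for queued items to finish
    for worker in workers:
        worker.cancel()

if __name__ == "__main__":
    asyncio.run(main())
```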
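
### Batched upsert sketch

Writes are buffered and flushed either when the buffer reaches `BATCH_SIZE` items or after `FLUSH_INTERVAL` seconds, and upserts on `id` deduplicate rows seen more than once. The `BatchWriter` class below is one possible way to express that with the dataset library, not the actual `database.py` implementation.

```python
import time
import dataset

class BatchWriter:
    """Buffers rows per table and upserts them in batches."""

    def __init__(self, db_url: str = "sqlite:///devrant.sqlite",
                 batch_size: int = 100, flush_interval: float = 5.0):
        self.db = dataset.connect(db_url)
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self.buffer: dict[str, list[dict]] = {}   # table name -> pending rows
        self.last_flush = time.monotonic()

    def add(self, table: str, row: dict) -> None:
        self.buffer.setdefault(table, []).append(row)
        pending = sum(len(rows) for rows in self.buffer.values())
        interval_due = time.monotonic() - self.last_flush >= self.flush_interval
        if pending >= self.batch_size or interval_due:
            self.flush()

    def flush(self) -> None:
        for table, rows in self.buffer.items():
            for row in rows:
                # Upserting on "id" deduplicates rows already stored.
                self.db[table].upsert(row, ["id"])
        self.buffer.clear()
        self.last_flush = time.monotonic()

writer = BatchWriter()
writer.add("rants", {"id": 1, "user_id": 42, "text": "example rant", "score": 3})
writer.flush()
```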
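
### State persistence sketch

Producer positions (skip values and the search-term index) are written to the `crawler_state` table so a restart can resume where the previous run stopped. The column names below (`producer`, `skip`) are assumptions for illustration; the real schema may differ.

```python
import dataset

db = dataset.connect("sqlite:///devrant.sqlite")
state = db["crawler_state"]

def save_position(producer: str, skip: int) -> None:
    # Upsert keyed on the producer name keeps one row per producer.
    state.upsert({"producer": producer, "skip": skip}, ["producer"])

def load_position(producer: str) -> int:
    row = state.find_one(producer=producer)
    return row["skip"] if row else 0

save_position("recent", 200)
print(load_position("recent"))   # -> 200, so the producer resumes at skip=200
```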