devRant Exhaustive Crawler
Author: retoor <retoor@molodetz.nl>
An asynchronous crawler for comprehensive data collection from the devRant platform. Implements a producer-consumer architecture with multiple discovery strategies to maximize content coverage.
SSL Note
The devRant API SSL certificate is expired. This crawler disables SSL verification to maintain connectivity. This is handled automatically by the API client.
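For reference, this is roughly how verification can be disabled with aiohttp. The helper below is a hypothetical sketch, not the devranta client's actual code, which handles this internally:

```python
import aiohttp

async def make_session() -> aiohttp.ClientSession:
    # Hypothetical helper: the real devranta client configures this internally.
    # ssl=False tells aiohttp to skip certificate verification entirely.
    connector = aiohttp.TCPConnector(ssl=False)
    return aiohttp.ClientSession(connector=connector)
```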
Architecture
The crawler employs four concurrent producers feeding into worker pools:
| Producer | Strategy | Interval |
|---|---|---|
| Recent | Paginate through recent rants | 2s |
| Top | Paginate through top-rated rants | 5s |
| Algo | Paginate through algorithm-sorted rants | 5s |
| Search | Cycle through 48 programming-related search terms | 30s |
Worker pools process discovered content:
- 10 rant consumers fetch rant details and extract comments
- 5 user consumers fetch profiles and discover associated rants
Discovery graph: rants reveal users, and users reveal more rants (from their profile, their upvotes, and their favorites).
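A stripped-down sketch of this producer-consumer pattern with asyncio queues. The client method names (get_rants, get_rant) are illustrative assumptions; the actual implementation lives in crawler.py:

```python
import asyncio

async def recent_producer(api, rant_queue: asyncio.Queue, interval: float = 2.0) -> None:
    # Sketch of the "Recent" producer: paginate recent rants and enqueue their ids.
    skip = 0
    while True:
        rants = await api.get_rants(sort="recent", skip=skip)  # hypothetical client call
        for rant in rants:
            await rant_queue.put(rant["id"])
        skip += len(rants)
        await asyncio.sleep(interval)

async def rant_consumer(api, rant_queue: asyncio.Queue, user_queue: asyncio.Queue) -> None:
    # Sketch of a rant consumer: fetch details, then enqueue discovered users.
    while True:
        rant_id = await rant_queue.get()
        rant, comments = await api.get_rant(rant_id)  # hypothetical client call
        for comment in comments:
            await user_queue.put(comment["user_id"])
        rant_queue.task_done()
```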
Data Storage
Uses SQLite via the dataset library with:
- Batched writes (100 items or 5s interval)
- Automatic upsert for deduplication
- Indexes on user_id, created_time, rant_id
- State persistence for resume capability
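A rough sketch of the batched-write behaviour described above, using the dataset library and the BATCH_SIZE/FLUSH_INTERVAL values shown in the Configuration section below; the real buffering logic is in database.py:

```python
import time
import dataset

db = dataset.connect("sqlite:///devrant.sqlite")
BATCH_SIZE, FLUSH_INTERVAL = 100, 5.0
_buffer: list[dict] = []
_last_flush = time.monotonic()

def queue_row(row: dict) -> None:
    # Buffer the row; flush once the batch is full or the interval has elapsed.
    global _last_flush
    _buffer.append(row)
    if len(_buffer) >= BATCH_SIZE or time.monotonic() - _last_flush >= FLUSH_INTERVAL:
        with db as tx:  # one transaction per flush
            for item in _buffer:
                tx["rants"].upsert(item, ["id"])  # upsert on id deduplicates re-crawled rants
        _buffer.clear()
        _last_flush = time.monotonic()
```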
Schema
- rants: id, user_id, text, score, created_time, num_comments, attached_image_url, tags, link, vote_state, user_username, user_score
- comments: id, rant_id, user_id, body, score, created_time, vote_state, user_username, user_score
- users: id, username, score, about, location, created_time, skills, github, website
- crawler_state: persists producer positions (skip values, search term index); see the sketch below
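State persistence could look roughly like this; the column names are assumptions for illustration:

```python
import dataset

db = dataset.connect("sqlite:///devrant.sqlite")

def save_position(name: str, value: int) -> None:
    # Keep one row per producer, e.g. the "recent" skip value or the search term index.
    db["crawler_state"].upsert({"name": name, "value": value}, ["name"])

def load_position(name: str, default: int = 0) -> int:
    row = db["crawler_state"].find_one(name=name)
    return row["value"] if row else default
```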
Usage
Quick Start
make
This creates a virtual environment, installs dependencies, and starts the crawler.
Manual Setup
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -e ../../.
pip install -r requirements.txt
python main.py
```
Stopping
Press Ctrl+C for graceful shutdown. The crawler will:
- Save current state to database
- Wait up to 30 seconds for queues to drain
- Flush remaining batched writes
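A minimal sketch of such a shutdown sequence; the function and parameter names are hypothetical, and the actual handling is in crawler.py:

```python
import asyncio
import contextlib

async def shutdown(rant_queue: asyncio.Queue, user_queue: asyncio.Queue,
                   save_state, flush_writes) -> None:
    # Illustrative shutdown mirroring the three steps above.
    save_state()                                        # persist producer positions
    with contextlib.suppress(asyncio.TimeoutError):     # give queues up to 30 s to drain
        await asyncio.wait_for(
            asyncio.gather(rant_queue.join(), user_queue.join()),
            timeout=30,
        )
    flush_writes()                                      # write out any remaining batch
```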
Resuming
Simply run again. The crawler loads saved state and continues from where it stopped.
Configuration
Edit main.py to adjust:
```python
DB_FILE = "devrant.sqlite"
CONCURRENT_RANT_CONSUMERS = 10
CONCURRENT_USER_CONSUMERS = 5
BATCH_SIZE = 100
FLUSH_INTERVAL = 5.0
```
Output
The crawler logs statistics every 15 seconds:
[STATS] Rants Q'd/Proc: 1250/1200 | Users Q'd/Proc: 450/400 | Comments DB: 5600 | Queues (R/U): 50/50 | API Errors: 0
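A periodic reporter along these lines would produce that output; the counter keys are assumptions, not the crawler's actual variable names:

```python
import asyncio
import logging

async def report_stats(stats: dict, interval: float = 15.0) -> None:
    # Illustrative stats loop matching the log format above.
    while True:
        await asyncio.sleep(interval)
        logging.info(
            "[STATS] Rants Q'd/Proc: %(rants_queued)d/%(rants_processed)d | "
            "Users Q'd/Proc: %(users_queued)d/%(users_processed)d | "
            "Comments DB: %(comments_db)d",
            stats,
        )
```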
Cleanup
make clean
Removes the virtual environment. The database file (devrant.sqlite) is preserved.
Requirements
- Python 3.10+
- dataset
- aiohttp (via parent devranta package)
File Structure
```text
crawler/
├── main.py           # Entry point, configuration
├── crawler.py        # Producer-consumer implementation
├── database.py       # Dataset wrapper with batch queue
├── requirements.txt  # Dependencies
├── Makefile          # Build automation
├── .venv/            # Virtual environment (created on first run)
└── devrant.sqlite    # SQLite database (created on first run)
```