devRant Exhaustive Crawler
Author: retoor <retoor@molodetz.nl>
An asynchronous crawler for comprehensive data collection from the devRant platform. Implements a producer-consumer architecture with multiple discovery strategies to maximize content coverage.
SSL Note
The devRant API SSL certificate is expired. This crawler disables SSL verification to maintain connectivity. This is handled automatically by the API client.
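For reference, this is roughly how verification can be disabled with aiohttp. The helper below is a hypothetical sketch, not the devranta client's actual code, which handles this internally:

```python
import aiohttp

async def make_session() -> aiohttp.ClientSession:
    # Hypothetical helper: the real devranta client configures this internally.
    # ssl=False tells aiohttp to skip certificate verification entirely.
    connector = aiohttp.TCPConnector(ssl=False)
    return aiohttp.ClientSession(connector=connector)
```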
Architecture
The crawler employs four concurrent producers feeding into worker pools:
| Producer | Strategy | Interval |
|---|---|---|
| Recent | Paginate through recent rants | 2s |
| Top | Paginate through top-rated rants | 5s |
| Algo | Paginate through algorithm-sorted rants | 5s |
| Search | Cycle through 48 programming-related search terms | 30s |
Worker pools process discovered content:
- 10 rant consumers fetch rant details and extract comments
- 5 user consumers fetch profiles and discover associated rants
Discovery graph: rants reveal users, and users reveal more rants (from their profile, their upvotes, and their favorites).
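A stripped-down sketch of this producer-consumer pattern with asyncio queues. The client method names (get_rants, get_rant) are illustrative assumptions; the actual implementation lives in crawler.py:

```python
import asyncio

async def recent_producer(api, rant_queue: asyncio.Queue, interval: float = 2.0) -> None:
    # Sketch of the "Recent" producer: paginate recent rants and enqueue their ids.
    skip = 0
    while True:
        rants = await api.get_rants(sort="recent", skip=skip)  # hypothetical client call
        for rant in rants:
            await rant_queue.put(rant["id"])
        skip += len(rants)
        await asyncio.sleep(interval)

async def rant_consumer(api, rant_queue: asyncio.Queue, user_queue: asyncio.Queue) -> None:
    # Sketch of a rant consumer: fetch details, then enqueue discovered users.
    while True:
        rant_id = await rant_queue.get()
        rant, comments = await api.get_rant(rant_id)  # hypothetical client call
        for comment in comments:
            await user_queue.put(comment["user_id"])
        rant_queue.task_done()
```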
Data Storage
Uses SQLite via the dataset library with:
- Batched writes (100 items or 5s interval)
- Automatic upsert for deduplication
- Indexes on user_id, created_time, rant_id
- State persistence for resume capability
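A rough sketch of the batched-write behaviour described above, using the dataset library and the BATCH_SIZE/FLUSH_INTERVAL values shown in the Configuration section below; the real buffering logic is in database.py:

```python
import time
import dataset

db = dataset.connect("sqlite:///devrant.sqlite")
BATCH_SIZE, FLUSH_INTERVAL = 100, 5.0
_buffer: list[dict] = []
_last_flush = time.monotonic()

def queue_row(row: dict) -> None:
    # Buffer the row; flush once the batch is full or the interval has elapsed.
    global _last_flush
    _buffer.append(row)
    if len(_buffer) >= BATCH_SIZE or time.monotonic() - _last_flush >= FLUSH_INTERVAL:
        with db as tx:  # one transaction per flush
            for item in _buffer:
                tx["rants"].upsert(item, ["id"])  # upsert on id deduplicates re-crawled rants
        _buffer.clear()
        _last_flush = time.monotonic()
```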
Schema
- rants: id, user_id, text, score, created_time, num_comments, attached_image_url, tags, link, vote_state, user_username, user_score
- comments: id, rant_id, user_id, body, score, created_time, vote_state, user_username, user_score
- users: id, username, score, about, location, created_time, skills, github, website
- crawler_state: persists producer positions (skip values, search term index); see the sketch below
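State persistence could look roughly like this; the column names are assumptions for illustration:

```python
import dataset

db = dataset.connect("sqlite:///devrant.sqlite")

def save_position(name: str, value: int) -> None:
    # Keep one row per producer, e.g. the "recent" skip value or the search term index.
    db["crawler_state"].upsert({"name": name, "value": value}, ["name"])

def load_position(name: str, default: int = 0) -> int:
    row = db["crawler_state"].find_one(name=name)
    return row["value"] if row else default
```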
Usage
Quick Start
make
This creates a virtual environment, installs dependencies, and starts the crawler.
Manual Setup
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -e ../../.
pip install -r requirements.txt
python main.py
```
Stopping
Press Ctrl+C for graceful shutdown. The crawler will:
- Save current state to database
- Wait up to 30 seconds for queues to drain
- Flush remaining batched writes
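A minimal sketch of such a shutdown sequence; the function and parameter names are hypothetical, and the actual handling is in crawler.py:

```python
import asyncio
import contextlib

async def shutdown(rant_queue: asyncio.Queue, user_queue: asyncio.Queue,
                   save_state, flush_writes) -> None:
    # Illustrative shutdown mirroring the three steps above.
    save_state()                                        # persist producer positions
    with contextlib.suppress(asyncio.TimeoutError):     # give queues up to 30 s to drain
        await asyncio.wait_for(
            asyncio.gather(rant_queue.join(), user_queue.join()),
            timeout=30,
        )
    flush_writes()                                      # write out any remaining batch
```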
Resuming
Simply run again. The crawler loads saved state and continues from where it stopped.
Configuration
Edit main.py to adjust:
```python
DB_FILE = "devrant.sqlite"
CONCURRENT_RANT_CONSUMERS = 10
CONCURRENT_USER_CONSUMERS = 5
BATCH_SIZE = 100
FLUSH_INTERVAL = 5.0
```
Output
The crawler logs statistics every 15 seconds:
[STATS] Rants Q'd/Proc: 1250/1200 | Users Q'd/Proc: 450/400 | Comments DB: 5600 | Queues (R/U): 50/50 | API Errors: 0
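A periodic reporter along these lines would produce that output; the counter keys are assumptions, not the crawler's actual variable names:

```python
import asyncio
import logging

async def report_stats(stats: dict, interval: float = 15.0) -> None:
    # Illustrative stats loop matching the log format above.
    while True:
        await asyncio.sleep(interval)
        logging.info(
            "[STATS] Rants Q'd/Proc: %(rants_queued)d/%(rants_processed)d | "
            "Users Q'd/Proc: %(users_queued)d/%(users_processed)d | "
            "Comments DB: %(comments_db)d",
            stats,
        )
```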
Cleanup
make clean
Removes the virtual environment. The database file (devrant.sqlite) is preserved.
Requirements
- Python 3.10+
- dataset
- aiohttp (via parent devranta package)
File Structure
```text
crawler/
├── main.py           # Entry point, configuration
├── crawler.py        # Producer-consumer implementation
├── database.py       # Dataset wrapper with batch queue
├── requirements.txt  # Dependencies
├── Makefile          # Build automation
├── .venv/            # Virtual environment (created on first run)
└── devrant.sqlite    # SQLite database (created on first run)
```