devRant Exhaustive Crawler

Author: retoor <retoor@molodetz.nl>

An asynchronous crawler for comprehensive data collection from the devRant platform. Implements a producer-consumer architecture with multiple discovery strategies to maximize content coverage.

SSL Note

The devRant API's SSL certificate has expired, so this crawler disables SSL verification to maintain connectivity. The API client handles this automatically.
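
For reference, disabling verification in aiohttp (the underlying HTTP library) typically looks like the following minimal sketch; this is illustrative, not the client's actual code:

import aiohttp

async def make_session() -> aiohttp.ClientSession:
    # ssl=False disables certificate verification for all requests made
    # through this connector; tolerable here only because the devRant
    # certificate is known to be expired, not as a general practice.
    return aiohttp.ClientSession(connector=aiohttp.TCPConnector(ssl=False))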

Architecture

The crawler employs four concurrent producers feeding into worker pools:

Producer   Strategy                                             Interval
Recent     Paginate through recent rants                        2s
Top        Paginate through top-rated rants                     5s
Algo       Paginate through algorithm-sorted rants              5s
Search     Cycle through 48 programming-related search terms    30s

Worker pools process discovered content:

  • 10 rant consumers fetch rant details and extract comments
  • 5 user consumers fetch profiles and discover associated rants

Discovery graph: rants reveal users, and users reveal more rants (from their profile, upvoted rants, and favorites).
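
A minimal sketch of this producer-consumer pattern with asyncio queues; fetch_recent_rant_ids and fetch_rant are hypothetical stand-ins for the devranta client calls, not the crawler's actual API:

import asyncio

async def recent_producer(rant_queue: asyncio.Queue, interval: float = 2.0) -> None:
    skip = 0
    while True:
        # Hypothetical helper; the real crawler calls the devranta client here.
        rant_ids = await fetch_recent_rant_ids(skip)
        for rant_id in rant_ids:
            await rant_queue.put(rant_id)
        skip += len(rant_ids)
        await asyncio.sleep(interval)

async def rant_consumer(rant_queue: asyncio.Queue, user_queue: asyncio.Queue) -> None:
    while True:
        rant_id = await rant_queue.get()
        rant = await fetch_rant(rant_id)       # hypothetical helper
        await user_queue.put(rant["user_id"])  # rants reveal users
        rant_queue.task_done()

async def main() -> None:
    rant_q: asyncio.Queue = asyncio.Queue()
    user_q: asyncio.Queue = asyncio.Queue()
    consumers = [rant_consumer(rant_q, user_q) for _ in range(10)]
    await asyncio.gather(recent_producer(rant_q), *consumers)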

Data Storage

Uses SQLite via the dataset library with the following features (see the sketch after this list):

  • Batched writes (100 items or 5s interval)
  • Automatic upsert for deduplication
  • Indexes on user_id, created_time, rant_id
  • State persistence for resume capability
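
A minimal sketch of the batching pattern, assuming the dataset library's upsert_many; the class and method names are illustrative, not the crawler's actual internals:

import time
import dataset

db = dataset.connect("sqlite:///devrant.sqlite")

class BatchWriter:
    """Buffers rows and upserts them in batches (sketch, not the real class)."""

    def __init__(self, table_name: str, batch_size: int = 100,
                 flush_interval: float = 5.0):
        self.table = db[table_name]
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self.buffer: list[dict] = []
        self.last_flush = time.monotonic()

    def add(self, row: dict) -> None:
        self.buffer.append(row)
        if (len(self.buffer) >= self.batch_size
                or time.monotonic() - self.last_flush >= self.flush_interval):
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            # upsert keyed on "id" deduplicates records seen more than once
            self.table.upsert_many(self.buffer, ["id"])
            self.buffer.clear()
        self.last_flush = time.monotonic()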

Schema

rants: id, user_id, text, score, created_time, num_comments, attached_image_url, tags, link, vote_state, user_username, user_score

comments: id, rant_id, user_id, body, score, created_time, vote_state, user_username, user_score

users: id, username, score, about, location, created_time, skills, github, website

crawler_state: Persists producer positions (skip values, search term index)
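
Because the output is a plain SQLite file, the schema can be queried directly. For example, the ten highest-scoring rants with their authors:

import sqlite3

con = sqlite3.connect("devrant.sqlite")
for username, score, excerpt in con.execute(
    "SELECT user_username, score, substr(text, 1, 60) "
    "FROM rants ORDER BY score DESC LIMIT 10"
):
    print(f"{score:>5}  {username}: {excerpt}")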

Usage

Quick Start

make

This creates a virtual environment, installs dependencies, and starts the crawler.

Manual Setup

python3 -m venv .venv
source .venv/bin/activate
pip install -e ../../.            # install the parent devranta package
pip install -r requirements.txt   # install crawler dependencies (dataset)
python main.py

Stopping

Press Ctrl+C for graceful shutdown (sketched after this list). The crawler will:

  1. Save current state to database
  2. Wait up to 30 seconds for queues to drain
  3. Flush remaining batched writes
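
A minimal sketch of such a shutdown path; save_state and writer are hypothetical stand-ins for the crawler's internals:

import asyncio
import signal

def install_sigint_handler(loop: asyncio.AbstractEventLoop,
                           stop_event: asyncio.Event) -> None:
    # Ctrl+C sets an event instead of killing the process outright
    # (Unix only; add_signal_handler is unavailable on Windows).
    loop.add_signal_handler(signal.SIGINT, stop_event.set)

async def shutdown(rant_q: asyncio.Queue, user_q: asyncio.Queue, writer) -> None:
    save_state()        # 1. hypothetical helper: persist producer positions
    try:
        # 2. give consumers up to 30 seconds to drain both queues
        await asyncio.wait_for(
            asyncio.gather(rant_q.join(), user_q.join()), timeout=30
        )
    except asyncio.TimeoutError:
        pass            # shut down anyway if the queues did not drain in time
    writer.flush()      # 3. flush remaining batched writes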

Resuming

Run the crawler again; it loads the saved state and continues from where it stopped.
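
State persistence with the dataset library might look like the following sketch; the column layout beyond the crawler_state table name is an assumption:

import dataset

db = dataset.connect("sqlite:///devrant.sqlite")
state_table = db["crawler_state"]

def load_state() -> dict:
    # One row per producer; defaults apply on a fresh database.
    row = state_table.find_one(name="recent") or {}
    return {"skip": row.get("skip", 0)}

def save_state(skip: int) -> None:
    # Upsert keyed on the producer name keeps a single row per producer.
    state_table.upsert({"name": "recent", "skip": skip}, ["name"])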

Configuration

Edit main.py to adjust:

DB_FILE = "devrant.sqlite"        # SQLite output file
CONCURRENT_RANT_CONSUMERS = 10    # size of the rant worker pool
CONCURRENT_USER_CONSUMERS = 5     # size of the user worker pool
BATCH_SIZE = 100                  # rows buffered before a batched write
FLUSH_INTERVAL = 5.0              # seconds between forced flushes

Output

The crawler logs statistics every 15 seconds:

[STATS] Rants Q'd/Proc: 1250/1200 | Users Q'd/Proc: 450/400 | Comments DB: 5600 | Queues (R/U): 50/50 | API Errors: 0

Q'd/Proc compares items queued for processing against items fully processed; Queues (R/U) shows the current depth of the rant and user queues.

Cleanup

make clean

Removes the virtual environment. The database file (devrant.sqlite) is preserved.

Requirements

  • Python 3.10+
  • dataset
  • aiohttp (via parent devranta package)

File Structure

crawler/
├── main.py           # Entry point, configuration
├── crawler.py        # Producer-consumer implementation
├── database.py       # Dataset wrapper with batch queue
├── requirements.txt  # Dependencies
├── Makefile          # Build automation
├── .venv/            # Virtual environment (created on first run)
└── devrant.sqlite    # SQLite database (created on first run)