# devRant Exhaustive Crawler
Author: retoor <retoor@molodetz.nl>
An asynchronous crawler for comprehensive data collection from the devRant platform. It implements a producer-consumer architecture with multiple discovery strategies to maximize content coverage.
## SSL Note
The devRant API's SSL certificate has expired, so this crawler disables SSL verification to maintain connectivity. The API client handles this automatically.
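For reference, disabling certificate verification with `aiohttp` typically looks like the sketch below. This is illustrative only; the API client already does the equivalent internally, so no changes are needed here.

```python
import aiohttp

async def make_session() -> aiohttp.ClientSession:
    # Illustrative only: ssl=False skips certificate verification,
    # which is effectively what the API client does for the expired cert.
    connector = aiohttp.TCPConnector(ssl=False)
    return aiohttp.ClientSession(connector=connector)
```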
## Architecture
The crawler employs four concurrent producers feeding into worker pools:
| Producer | Strategy | Interval |
|----------|----------|----------|
| Recent | Paginate through recent rants | 2s |
| Top | Paginate through top-rated rants | 5s |
| Algo | Paginate through algorithm-sorted rants | 5s |
| Search | Cycle through 48 programming-related search terms | 30s |
Worker pools process discovered content:
- 10 rant consumers fetch rant details and extract comments
- 5 user consumers fetch profiles and discover associated rants
Discovery graph: rants reveal users, and users reveal more rants (from their profile, upvoted rants, and favorites).
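The sketch below illustrates one producer and one consumer in this pattern. The `api.get_rants` and `api.get_rant` calls are placeholders for the devranta client methods; the real implementation lives in `crawler.py`.

```python
import asyncio

async def recent_producer(api, rant_queue: asyncio.Queue, interval: float = 2.0):
    """Paginate through recent rants and queue their IDs for the rant consumers."""
    skip = 0
    while True:
        rants = await api.get_rants(sort="recent", skip=skip)  # placeholder call
        for rant in rants:
            await rant_queue.put(rant["id"])
        skip += len(rants)
        await asyncio.sleep(interval)

async def rant_consumer(api, rant_queue: asyncio.Queue, user_queue: asyncio.Queue):
    """Fetch rant details, extract comments, and feed discovered users back in."""
    while True:
        rant_id = await rant_queue.get()
        rant = await api.get_rant(rant_id)  # placeholder call
        for comment in rant.get("comments", []):
            await user_queue.put(comment["user_id"])
        rant_queue.task_done()
```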
## Data Storage
Uses SQLite via the dataset library with:
- Batched writes (100 items or a 5-second flush interval), as sketched below
- Automatic upsert for deduplication
- Indexes on `user_id`, `created_time`, and `rant_id`
- State persistence for resume capability
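A minimal sketch of the batching approach using the `dataset` library's `upsert_many`; the buffer and function names are illustrative, not the actual `database.py` API.

```python
import time
import dataset

db = dataset.connect("sqlite:///devrant.sqlite")
pending: list[dict] = []
last_flush = time.monotonic()

def queue_write(row: dict) -> None:
    """Buffer a row and flush once the batch size or interval threshold is reached."""
    pending.append(row)
    if len(pending) >= 100 or time.monotonic() - last_flush >= 5.0:
        flush()

def flush() -> None:
    """Upsert buffered rows on their primary key, which also deduplicates them."""
    global last_flush
    if pending:
        db["rants"].upsert_many(pending, ["id"])
        pending.clear()
    last_flush = time.monotonic()
```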
### Schema
**rants**: id, user_id, text, score, created_time, num_comments, attached_image_url, tags, link, vote_state, user_username, user_score
**comments**: id, rant_id, user_id, body, score, created_time, vote_state, user_username, user_score
**users**: id, username, score, about, location, created_time, skills, github, website
**crawler_state**: Persists producer positions (skip values, search term index)
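A sketch of how producer positions can be persisted and restored with one `crawler_state` row per producer; the column names here are assumptions, not the actual schema.

```python
import dataset

db = dataset.connect("sqlite:///devrant.sqlite")

def save_state(producer: str, skip: int, term_index: int = 0) -> None:
    # One row per producer: pagination offset plus search term index (assumed columns).
    db["crawler_state"].upsert(
        {"producer": producer, "skip": skip, "term_index": term_index},
        ["producer"],
    )

def load_state(producer: str) -> dict:
    # Fall back to a fresh start when no saved state exists.
    return db["crawler_state"].find_one(producer=producer) or {"skip": 0, "term_index": 0}
```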
## Usage
### Quick Start
```bash
make
```
This creates a virtual environment, installs dependencies, and starts the crawler.
### Manual Setup
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -e ../../.
pip install -r requirements.txt
python main.py
```
### Stopping
Press `Ctrl+C` for graceful shutdown. The crawler will:
1. Save current state to database
2. Wait up to 30 seconds for queues to drain
3. Flush remaining batched writes (see the sketch below)
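A sketch of how this shutdown sequence can be expressed with asyncio; `save_state` and `flush` stand in for the actual database helpers.

```python
import asyncio
import contextlib

async def shutdown(rant_queue: asyncio.Queue, user_queue: asyncio.Queue, db) -> None:
    """Persist state, give the queues up to 30 seconds to drain, then flush writes."""
    db.save_state()  # placeholder for the real state-saving helper
    with contextlib.suppress(asyncio.TimeoutError):
        await asyncio.wait_for(
            asyncio.gather(rant_queue.join(), user_queue.join()),
            timeout=30,
        )
    db.flush()  # placeholder for flushing remaining batched writes
```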
### Resuming
Run the crawler again. It loads the saved state and continues from where it stopped.
## Configuration
Edit `main.py` to adjust:
```python
DB_FILE = "devrant.sqlite"
CONCURRENT_RANT_CONSUMERS = 10
CONCURRENT_USER_CONSUMERS = 5
BATCH_SIZE = 100
FLUSH_INTERVAL = 5.0
```
## Output
The crawler logs statistics every 15 seconds:
```
[STATS] Rants Q'd/Proc: 1250/1200 | Users Q'd/Proc: 450/400 | Comments DB: 5600 | Queues (R/U): 50/50 | API Errors: 0
```
## Cleanup
```bash
make clean
```
Removes the virtual environment. The database file (`devrant.sqlite`) is preserved.
## Requirements
- Python 3.10+
- dataset
- aiohttp (via parent devranta package)
## File Structure
```
crawler/
├── main.py # Entry point, configuration
├── crawler.py # Producer-consumer implementation
├── database.py # Dataset wrapper with batch queue
├── requirements.txt # Dependencies
├── Makefile # Build automation
├── .venv/ # Virtual environment (created on first run)
└── devrant.sqlite # SQLite database (created on first run)
```