# devRant Exhaustive Crawler

Author: retoor

An asynchronous crawler for comprehensive data collection from the devRant platform. It implements a producer-consumer architecture with multiple discovery strategies to maximize content coverage.

## SSL Note

The devRant API SSL certificate is expired. This crawler disables SSL verification to maintain connectivity; the API client handles this automatically.

## Architecture

The crawler employs four concurrent producers feeding into worker pools:

| Producer | Strategy | Interval |
|----------|----------|----------|
| Recent | Paginate through recent rants | 2s |
| Top | Paginate through top-rated rants | 5s |
| Algo | Paginate through algorithm-sorted rants | 5s |
| Search | Cycle through 48 programming-related search terms | 30s |

Worker pools process the discovered content:

- 10 rant consumers fetch rant details and extract comments
- 5 user consumers fetch profiles and discover associated rants

Discovery graph: rants reveal users, and users reveal more rants (from their profile, upvoted rants, and favorites). A minimal sketch of this producer-consumer layout appears under Implementation Sketches at the end of this README.

## Data Storage

Uses SQLite via the dataset library with:

- Batched writes (100 items or a 5s interval)
- Automatic upserts for deduplication
- Indexes on user_id, created_time, and rant_id
- State persistence for resume capability

### Schema

**rants**: id, user_id, text, score, created_time, num_comments, attached_image_url, tags, link, vote_state, user_username, user_score

**comments**: id, rant_id, user_id, body, score, created_time, vote_state, user_username, user_score

**users**: id, username, score, about, location, created_time, skills, github, website

**crawler_state**: persists producer positions (skip values, search term index)

## Usage

### Quick Start

```bash
make
```

This creates a virtual environment, installs dependencies, and starts the crawler.

### Manual Setup

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -e ../../.
pip install -r requirements.txt
python main.py
```

### Stopping

Press `Ctrl+C` for a graceful shutdown. The crawler will:

1. Save the current state to the database
2. Wait up to 30 seconds for the queues to drain
3. Flush remaining batched writes

### Resuming

Run the crawler again. It loads the saved state and continues from where it stopped.

## Configuration

Edit `main.py` to adjust:

```python
DB_FILE = "devrant.sqlite"
CONCURRENT_RANT_CONSUMERS = 10
CONCURRENT_USER_CONSUMERS = 5
BATCH_SIZE = 100
FLUSH_INTERVAL = 5.0
```

## Output

The crawler logs statistics every 15 seconds:

```
[STATS] Rants Q'd/Proc: 1250/1200 | Users Q'd/Proc: 450/400 | Comments DB: 5600 | Queues (R/U): 50/50 | API Errors: 0
```

## Cleanup

```bash
make clean
```

Removes the virtual environment. The database file (`devrant.sqlite`) is preserved.

## Requirements

- Python 3.10+
- dataset
- aiohttp (via the parent devranta package)

## File Structure

```
crawler/
├── main.py           # Entry point, configuration
├── crawler.py        # Producer-consumer implementation
├── database.py       # Dataset wrapper with batch queue
├── requirements.txt  # Dependencies
├── Makefile          # Build automation
├── .venv/            # Virtual environment (created on first run)
└── devrant.sqlite    # SQLite database (created on first run)
```
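
## Implementation Sketches

The snippets below are minimal, self-contained sketches of the techniques described above. Names such as `fetch_recent_page` and `process_rant` are illustrative only and do not necessarily match the identifiers used in `crawler.py` or `database.py`.

### Producer-consumer sketch

One producer pushes discovered rant IDs onto an `asyncio.Queue` on a fixed interval while a pool of consumers drains it, mirroring the Recent/Top/Algo/Search producers and the rant/user worker pools described under Architecture. This is a simplified single-producer version for illustration.

```python
import asyncio

# Hypothetical stand-ins for the real API calls in crawler.py.
async def fetch_recent_page(skip: int) -> list[int]:
    await asyncio.sleep(0.1)              # simulate network latency
    return list(range(skip, skip + 20))   # pretend these are rant IDs

async def process_rant(rant_id: int) -> None:
    await asyncio.sleep(0.05)             # simulate fetching details + comments

async def recent_producer(queue: asyncio.Queue, interval: float = 2.0) -> None:
    skip = 0
    while True:
        for rant_id in await fetch_recent_page(skip):
            await queue.put(rant_id)      # hand off work to the consumer pool
        skip += 20
        await asyncio.sleep(interval)     # "Recent" producer runs every 2s

async def rant_consumer(queue: asyncio.Queue) -> None:
    while True:
        rant_id = await queue.get()
        try:
            await process_rant(rant_id)
        finally:
            queue.task_done()             # lets queue.join() drain cleanly

async def main() -> None:
    queue = asyncio.Queue(maxsize=1000)
    workers = [asyncio.create_task(rant_consumer(queue)) for _ in range(10)]
    producer = asyncio.create_task(recent_producer(queue))
    await asyncio.sleep(10)               # run briefly for demonstration
    producer.cancel()                     # stop discovering new work
    await queue.join()                    # wait for queued items to finish
    for worker in workers:
        worker.cancel()

if __name__ == "__main__":
    asyncio.run(main())
```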
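
### Batched upsert sketch

Writes are buffered and flushed either when the buffer reaches `BATCH_SIZE` items or after `FLUSH_INTERVAL` seconds, and upserts on `id` deduplicate rows seen more than once. The `BatchWriter` class below is one possible way to express that with the dataset library, not the actual `database.py` implementation.

```python
import time
import dataset

class BatchWriter:
    """Buffers rows per table and upserts them in batches."""

    def __init__(self, db_url: str = "sqlite:///devrant.sqlite",
                 batch_size: int = 100, flush_interval: float = 5.0):
        self.db = dataset.connect(db_url)
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self.buffer: dict[str, list[dict]] = {}   # table name -> pending rows
        self.last_flush = time.monotonic()

    def add(self, table: str, row: dict) -> None:
        self.buffer.setdefault(table, []).append(row)
        pending = sum(len(rows) for rows in self.buffer.values())
        interval_due = time.monotonic() - self.last_flush >= self.flush_interval
        if pending >= self.batch_size or interval_due:
            self.flush()

    def flush(self) -> None:
        for table, rows in self.buffer.items():
            for row in rows:
                # Upserting on "id" deduplicates rows already stored.
                self.db[table].upsert(row, ["id"])
        self.buffer.clear()
        self.last_flush = time.monotonic()

writer = BatchWriter()
writer.add("rants", {"id": 1, "user_id": 42, "text": "example rant", "score": 3})
writer.flush()
```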
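
### State persistence sketch

Producer positions (skip values and the search-term index) are written to the `crawler_state` table so a restart can resume where the previous run stopped. The column names below (`producer`, `skip`) are assumptions for illustration; the real schema may differ.

```python
import dataset

db = dataset.connect("sqlite:///devrant.sqlite")
state = db["crawler_state"]

def save_position(producer: str, skip: int) -> None:
    # Upsert keyed on the producer name keeps one row per producer.
    state.upsert({"producer": producer, "skip": skip}, ["producer"])

def load_position(producer: str) -> int:
    row = state.find_one(producer=producer)
    return row["skip"] if row else 0

save_position("recent", 200)
print(load_position("recent"))   # -> 200, so the producer resumes at skip=200
```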