# devRant Exhaustive Crawler

Author: retoor <retoor@molodetz.nl>

An asynchronous crawler for comprehensive data collection from the devRant platform. It implements a producer-consumer architecture with multiple discovery strategies to maximize content coverage.
## SSL Note

The devRant API's SSL certificate is expired, so this crawler disables SSL verification to maintain connectivity. The API client handles this automatically.
## Architecture

The crawler employs four concurrent producers feeding into worker pools:

| Producer | Strategy | Interval |
|----------|----------|----------|
| Recent | Paginate through recent rants | 2s |
| Top | Paginate through top-rated rants | 5s |
| Algo | Paginate through algorithm-sorted rants | 5s |
| Search | Cycle through 48 programming-related search terms | 30s |
Worker pools process discovered content:

- 10 rant consumers fetch rant details and extract comments
- 5 user consumers fetch profiles and discover associated rants

Discovery graph: rants reveal users, and users reveal more rants (from their profiles, upvoted rants, and favorites).
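
A minimal sketch of this producer-consumer layout, using asyncio queues with stubbed fetchers; the names here (`fetch_recent`, `recent_producer`, `rant_consumer`) are illustrative and not taken from `crawler.py`:

```python
import asyncio

# Stub fetchers standing in for the devranta API calls (illustrative only).
async def fetch_recent(skip: int) -> list[dict]:
    await asyncio.sleep(0.1)
    return [{"id": skip + i, "user_id": 1000 + skip + i} for i in range(3)]

async def fetch_rant(rant_id: int) -> dict:
    await asyncio.sleep(0.1)
    return {"id": rant_id, "comments": []}

async def recent_producer(rant_q: asyncio.Queue, interval: float = 2.0) -> None:
    """Pages through 'recent' rants and enqueues them for the consumers."""
    skip = 0
    for _ in range(2):  # bounded for the example; the real producer loops forever
        for rant in await fetch_recent(skip):
            await rant_q.put(rant)
        skip += 20
        await asyncio.sleep(interval)

async def rant_consumer(rant_q: asyncio.Queue, user_q: asyncio.Queue) -> None:
    """Fetches rant details and feeds discovered users back into the graph."""
    while True:
        rant = await rant_q.get()
        await fetch_rant(rant["id"])          # fetch details and comments
        await user_q.put(rant["user_id"])     # discovery: rant -> user -> more rants
        rant_q.task_done()

async def main() -> None:
    rant_q, user_q = asyncio.Queue(), asyncio.Queue()
    producer = asyncio.create_task(recent_producer(rant_q))
    consumers = [asyncio.create_task(rant_consumer(rant_q, user_q)) for _ in range(10)]
    await producer
    await rant_q.join()                       # wait for consumers to drain the queue
    for c in consumers:
        c.cancel()

asyncio.run(main())
```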
## Data Storage

Uses SQLite via the dataset library with:

- Batched writes (flushed every 100 items or 5 seconds)
- Automatic upserts for deduplication
- Indexes on user_id, created_time, and rant_id
- State persistence for resume capability
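
A minimal sketch of the batching pattern, assuming the dataset library; the `BatchedWriter` class and its layout are illustrative rather than the actual `database.py` implementation:

```python
import time
import dataset

class BatchedWriter:
    """Buffers rows per table and flushes once a size or time threshold is hit."""

    def __init__(self, url: str = "sqlite:///devrant.sqlite",
                 batch_size: int = 100, flush_interval: float = 5.0):
        self.db = dataset.connect(url)
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self.buffers: dict[str, list[dict]] = {}
        self.last_flush = time.monotonic()

    def add(self, table: str, row: dict) -> None:
        self.buffers.setdefault(table, []).append(row)
        overdue = time.monotonic() - self.last_flush >= self.flush_interval
        if len(self.buffers[table]) >= self.batch_size or overdue:
            self.flush()

    def flush(self) -> None:
        for table, rows in self.buffers.items():
            for row in rows:
                # Upserting on the primary id keeps re-crawled items deduplicated.
                self.db[table].upsert(row, ["id"])
        self.buffers.clear()
        self.last_flush = time.monotonic()

writer = BatchedWriter()
writer.add("rants", {"id": 1, "user_id": 42, "text": "example", "score": 3})
writer.flush()
```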
### Schema

**rants**: id, user_id, text, score, created_time, num_comments, attached_image_url, tags, link, vote_state, user_username, user_score

**comments**: id, rant_id, user_id, body, score, created_time, vote_state, user_username, user_score

**users**: id, username, score, about, location, created_time, skills, github, website

**crawler_state**: persists producer positions (skip values, search term index)
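
Once collected, the data can be read back with the same dataset library. A small example query against the tables above (the rant id is a placeholder):

```python
import dataset

db = dataset.connect("sqlite:///devrant.sqlite")

# Five highest-scoring rants collected so far.
for rant in db["rants"].find(order_by="-score", _limit=5):
    print(rant["score"], rant["user_username"], rant["text"][:60])

# All comments on one rant, oldest first (12345 is a placeholder id).
comments = list(db["comments"].find(rant_id=12345, order_by="created_time"))
print(len(comments), "comments")
```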
## Usage

### Quick Start

```bash
make
```

This creates a virtual environment, installs dependencies, and starts the crawler.
### Manual Setup

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -e ../../.
pip install -r requirements.txt
python main.py
```
### Stopping

Press `Ctrl+C` for a graceful shutdown. The crawler will:

1. Save the current state to the database
2. Wait up to 30 seconds for queues to drain
3. Flush remaining batched writes
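
A minimal sketch of that shutdown sequence with asyncio; `save_state` and `flush` stand in for the crawler's real persistence calls and are illustrative only:

```python
import asyncio

async def shutdown(rant_q: asyncio.Queue, user_q: asyncio.Queue,
                   save_state, flush, timeout: float = 30.0) -> None:
    """Persist state, give the queues a bounded chance to drain, then flush writes."""
    save_state()                                          # 1. save producer positions
    try:
        # 2. wait up to `timeout` seconds for both queues to drain
        await asyncio.wait_for(asyncio.gather(rant_q.join(), user_q.join()), timeout)
    except asyncio.TimeoutError:
        pass                                              # give up after the timeout
    flush()                                               # 3. write out buffered rows

# Wiring with no-op callbacks just to show the call shape:
async def demo() -> None:
    await shutdown(asyncio.Queue(), asyncio.Queue(),
                   save_state=lambda: None, flush=lambda: None)

asyncio.run(demo())
```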
### Resuming

Simply run the crawler again; it loads the saved state and continues from where it stopped.
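
Conceptually, resuming only needs the producer positions stored in `crawler_state`. A sketch of that round trip with the dataset library; the exact row layout here is an assumption, not necessarily what `database.py` uses:

```python
import dataset

db = dataset.connect("sqlite:///devrant.sqlite")
state = db["crawler_state"]

# On shutdown: persist each producer's position, keyed by producer name.
state.upsert({"producer": "recent", "skip": 1240}, ["producer"])
state.upsert({"producer": "search", "term_index": 17}, ["producer"])

# On startup: load positions back, defaulting to zero for a fresh database.
row = state.find_one(producer="recent")
recent_skip = row["skip"] if row else 0
print("resuming recent producer at skip =", recent_skip)
```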
## Configuration

Edit `main.py` to adjust:

```python
DB_FILE = "devrant.sqlite"
CONCURRENT_RANT_CONSUMERS = 10
CONCURRENT_USER_CONSUMERS = 5
BATCH_SIZE = 100
FLUSH_INTERVAL = 5.0
```
## Output

The crawler logs statistics every 15 seconds:

```
[STATS] Rants Q'd/Proc: 1250/1200 | Users Q'd/Proc: 450/400 | Comments DB: 5600 | Queues (R/U): 50/50 | API Errors: 0
```
## Cleanup

```bash
make clean
```

Removes the virtual environment; the database file (`devrant.sqlite`) is preserved.
## Requirements

- Python 3.10+
- dataset
- aiohttp (via the parent devranta package)
## File Structure

```
crawler/
├── main.py           # Entry point, configuration
├── crawler.py        # Producer-consumer implementation
├── database.py       # Dataset wrapper with batch queue
├── requirements.txt  # Dependencies
├── Makefile          # Build automation
├── .venv/            # Virtual environment (created on first run)
└── devrant.sqlite    # SQLite database (created on first run)
```