This application was vibe-coded from spam data. I had a fight with a 'hacker', and since then they have been bombarding my repository server with fake accounts. I exported that account data to an AI to look for patterns, and it found some. I translated those patterns into code with a safety margin, then added tests and calculated the accuracy. This is good enough for now.
## Overview
This module provides a lightweight, heuristic-based function that calculates a spam risk score (0-100) for a username and email pair. The score helps identify likely spammer registrations during user signup or validation, and it prioritizes low false positives so that legitimate users see minimal disruption. A higher score indicates a greater likelihood of spam (e.g., flag at ≥50, block at >70).
## Key Features
- **Domain Analysis:** Detects temporary/disposable email domains found on common blocklists (e.g., 10minutemail.com, yopmail.com) and adds a +50 penalty. Applies a -30 adjustment for trusted providers (e.g., Gmail, ProtonMail) to protect real users.
- **Pattern Matching:** Scans for suspicious elements like excessive digits (+15), spam keywords (e.g., _temp, _fake; +25), short non-standard emails (+10), and username anomalies (e.g., numbers +10, keywords like _buy +20, extreme lengths +5, bot terms +15).
- **Bonus Checks:** Adds +10 for combined high-risk signals (e.g., temp domain with random patterns).
- **Efficiency:** Uses only standard Python (re module); no external dependencies. Fast for real-time use.
- **Leniency Design:** Weights favor caution: common username numbers (like john123) add only minor points, which kept false positives at 0 on the legit test samples.
- **Example:** A suspicious pair such as "bot123" @ "tempr.email" scores 80+ (flagged as spam).
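
The weights listed above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the function name `scam_email_score` is inferred from the test name, the domain and keyword lists are tiny placeholders, and the cutoffs for "excessive digits", "short" emails, and "extreme" username lengths are assumptions.

```python
import re

# Placeholder lists for illustration; the real module's lists are not shown here.
TEMP_DOMAINS = {"10minutemail.com", "yopmail.com", "tempr.email"}
TRUSTED_DOMAINS = {"gmail.com", "protonmail.com"}
EMAIL_KEYWORDS = ("_temp", "_fake")
USERNAME_KEYWORDS = ("_buy",)
BOT_TERMS = ("bot",)


def scam_email_score(username: str, email: str) -> int:
    """Return a 0-100 spam risk score (sketch with assumed cutoffs)."""
    username = username.lower()
    local, _, domain = email.lower().rpartition("@")
    score = 0

    # Domain analysis: +50 for disposable domains, -30 for trusted providers.
    is_temp = domain in TEMP_DOMAINS
    if is_temp:
        score += 50
    elif domain in TRUSTED_DOMAINS:
        score -= 30

    # Email local-part patterns.
    if len(re.findall(r"\d", local)) >= 5:  # assumed cutoff for "excessive digits"
        score += 15
    if any(keyword in local for keyword in EMAIL_KEYWORDS):
        score += 25
    if len(local) < 4 and domain not in TRUSTED_DOMAINS:  # assumed "short non-standard"
        score += 10

    # Username anomalies.
    has_digits = bool(re.search(r"\d", username))
    if has_digits:
        score += 10
    if any(keyword in username for keyword in USERNAME_KEYWORDS):
        score += 20
    if len(username) < 3 or len(username) > 20:  # assumed "extreme lengths"
        score += 5
    if any(term in username for term in BOT_TERMS):
        score += 15

    # Bonus for combined high-risk signals (here: temp domain plus digits).
    if is_temp and has_digits:
        score += 10

    # Clamp to the documented 0-100 range.
    return max(0, min(100, score))


print(scam_email_score("bot123", "bot123@tempr.email"))  # 85
print(scam_email_score("john123", "john@gmail.com"))     # 0
```

The full address `bot123@tempr.email` is assumed for the example. Under these placeholder lists it collects the temp-domain penalty, the digit and bot-term penalties, and the combined-signal bonus, while the trusted-provider deduction pushes `john123` below zero before clamping.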
## Testing and Performance
The included `test_scam_email_score()` runs on 51 samples (45 presumed spam, 6 legit):
- 33 score ≥50 (flagged)
- 18 score <50 (passed)
- False positives: 0 (no legit flagged)
- False negatives: 11 (missed spam, often due to unlisted domains)
- Accuracy: ~78% (correct classifications / total; tunable via thresholds)
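
As a worked check of the accuracy figure: with 0 false positives and 11 false negatives out of 51 samples, (51 - 0 - 11) / 51 ≈ 78.4%. The helper below is a hypothetical sketch of how such counts could be derived from labeled scores; the name `evaluate` and the sample data are invented for illustration and are not part of the module.

```python
from typing import Sequence, Tuple


def evaluate(samples: Sequence[Tuple[int, bool]], threshold: int = 50) -> dict:
    """Summarize (score, is_spam) pairs at a given flagging threshold."""
    if not samples:
        raise ValueError("samples must not be empty")
    flagged = sum(1 for score, _ in samples if score >= threshold)
    # Legit samples that were flagged.
    false_positives = sum(
        1 for score, is_spam in samples if score >= threshold and not is_spam
    )
    # Spam samples that slipped under the threshold.
    false_negatives = sum(
        1 for score, is_spam in samples if score < threshold and is_spam
    )
    total = len(samples)
    return {
        "flagged": flagged,
        "passed": total - flagged,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        "accuracy": (total - false_positives - false_negatives) / total,
    }


# Made-up data: three spam samples (one missed) and two legit samples.
demo = [(85, True), (60, True), (40, True), (0, False), (20, False)]
print(evaluate(demo))
```

Re-running a helper like this at several thresholds is one way to see how the trade-off between false positives and false negatives moves.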
Expand the domain lists periodically for better coverage. For production, pair mid-range scores (30-60) with a CAPTCHA or log them for review. This heuristic works well in low-overhead environments and can be paired with ML models for deeper analysis.
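
One possible way to turn a score into a signup decision is sketched below. The function name `route_signup` and the exact band edges are assumptions: the 30-60 CAPTCHA band and the block-above-70 cutoff come from the text, while sending 61-70 to manual review is a guess at how the remaining gap might be handled.

```python
def route_signup(score: int) -> str:
    """Map a 0-100 spam score to an action (hypothetical bands)."""
    if score > 70:
        return "block"    # high-confidence spam
    if score > 60:
        return "review"   # flagged, but not enough to block outright (assumed)
    if score >= 30:
        return "captcha"  # mid-range score: ask for a CAPTCHA
    return "allow"        # low score: let the signup through


for s in (10, 45, 65, 85):
    print(s, route_signup(s))
```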