New document.
parent
683604e12b
commit
e92f7e0fd0
172
spam_detection_heuristic.md
Normal file
@ -0,0 +1,172 @@
# Email Scam Detection Heuristic

This application was vibe coded from real spam data. After a dispute with a 'hacker', my repository server has been bombarded with fake account registrations. I exported that registration data and had an AI look for patterns, and it found some. I translated those patterns into code with a safety margin, then added tests and calculated the accuracy. This is good enough for now.

## Overview

This module provides a lightweight, heuristic-based function that calculates a spam risk score (0-100) for a username and email pair. The score helps identify likely spammer registrations during user signup or validation. It prioritizes a low false-positive rate so that legitimate users see minimal disruption. A higher score indicates a greater likelihood of spam (e.g., flag above 50, block above 70).

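As a rough illustration of how the suggested thresholds could be applied at signup, the sketch below maps a score to an action. The function name `triage` and the action labels are hypothetical and not part of this module; the cutoffs of 50 and 70 are the example values mentioned above.

```python
def triage(score, flag_above=50, block_above=70):
    """Map a 0-100 spam score to an action (hypothetical helper)."""
    if score > block_above:
        return "block"
    if score > flag_above:
        return "flag"
    return "allow"


print(triage(10))  # allow
print(triage(60))  # flag
print(triage(95))  # block
```

Keeping the cutoffs as parameters makes it easy to tune them without touching the scoring function.
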
## Key Features

- **Domain Analysis:** Adds +50 for temporary/disposable email domains (e.g., entries from common blocklists such as 10minutemail.com and yopmail.com). Subtracts 30 for trusted providers (e.g., Gmail, ProtonMail) to protect real users.

- **Pattern Matching:** Scans the email for runs of three or more digits (+15), spam keywords (e.g., _temp, _fake; +25), and short addresses outside .com/.net/.org (under 20 characters; +10). Scans the username for runs of two or more digits (+10), keywords like _buy (+20), extreme lengths (under 5 or over 25 characters; +5), and bot terms such as admin or test (+15).

- **Bonus Checks:** Adds +10 for combined high-risk signals (in the code, a temp domain on top of a running score above 20).

- **Efficiency:** Uses only standard Python (re module); no external dependencies. Fast enough for real-time use.

- **Leniency Design:** Weights favor caution: common username numbers (like john123) add only minor points, which kept false positives at 0 on the legitimate test samples.

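To make the additive scoring concrete, here is a small worked sketch of how the weights listed above combine for one input. The `WEIGHTS` table, the signal names, and the `total` helper are hypothetical labels invented for this illustration. The point values are the ones stated in the list, and the set of triggered signals assumes the username `bot123` with the address `bot123@tempr.email`.

```python
# Hypothetical names for the signals; point values come from the list above.
WEIGHTS = {
    "temp_domain": 50,
    "trusted_domain": -30,
    "email_digits": 15,
    "email_keyword": 25,
    "short_nonstandard_email": 10,
    "username_digits": 10,
    "username_keyword": 20,
    "username_length": 5,
    "username_bot_term": 15,
    "combined_bonus": 10,
}


def total(signals):
    """Sum the weights of the triggered signals and clamp to 0-100."""
    raw = sum(WEIGHTS[name] for name in signals)
    return max(0, min(100, raw))


# Signals assumed to fire for username "bot123", email "bot123@tempr.email".
triggered = [
    "temp_domain",              # tempr.email is on the temp list
    "email_digits",             # "123" is a run of three digits
    "short_nonstandard_email",  # 18 characters, domain not .com/.net/.org
    "username_digits",          # "123" in the username
    "combined_bonus",           # temp domain with running score above 20
]
print(total(triggered))  # 95
```
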
## Usage

Import and call the function:

```python
from your_module import email_scam_score

score = email_scam_score("username", "user@example.com")
# Returns int (0-100); e.g., 0 for clean legit, 80+ for obvious spam
```

### Example Scores:

- Legit: username "john_doe" with a gmail.com address → ~0-10 (passes easily)

- Suspicious: username "bot123" with a tempr.email address → 80+ (flags as spam)

## Testing and Performance

The included `test_scam_email_score()` runs on 51 samples (45 presumed spam, 6 legit):

- 33 score ≥50 (flagged)

- 18 score <50 (passed)

- False positives: 0 (no legit flagged)

- False negatives: 11 (missed spams, often due to unlisted domains)

- Accuracy: ~78% (correct classifications / total; tunable via thresholds)

Expand the domain lists periodically for better coverage. In production, pair mid-range scores (30-60) with a CAPTCHA or log them for review. This heuristic suits low-overhead environments and pairs well with ML models for deeper analysis.

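Metrics like the ones above can be derived from per-sample scores and labels roughly as follows. This is a sketch rather than the module's actual test code: `summarize`, the `(score, is_spam)` input shape, and the sample data are assumptions made for illustration, and the cutoff of 50 matches the flagging threshold used in the list above.

```python
def summarize(samples, threshold=50):
    """Count outcomes for (score, is_spam) pairs; flagged means score >= threshold."""
    tp = fp = tn = fn = 0
    for score, is_spam in samples:
        flagged = score >= threshold
        if flagged and is_spam:
            tp += 1   # spam correctly flagged
        elif flagged:
            fp += 1   # legit user wrongly flagged
        elif is_spam:
            fn += 1   # spam that slipped through
        else:
            tn += 1   # legit user correctly passed
    accuracy = (tp + tn) / len(samples) if samples else 0.0
    return {
        "flagged": tp + fp,
        "passed": tn + fn,
        "false_positives": fp,
        "false_negatives": fn,
        "accuracy": accuracy,
    }


# Made-up scores for illustration only, not the module's real results.
demo = [(60, True), (95, True), (10, True), (0, False)]
print(summarize(demo))
```

Re-running such a summary with different `threshold` values is one way to see how the false-positive and false-negative counts trade off.
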
## Source Code
```python
import re


def email_scam_score(username, email):
    """Return a heuristic spam risk score from 0 (clean) to 100 (likely spam)."""
    score = 0
    max_score = 100
    min_score = 0

    # Everything after the last "@" is the domain; no domain means invalid input.
    domain_match = re.search(r'@([^@]+)$', email)
    if domain_match:
        domain = domain_match.group(1).lower()
    else:
        return max_score

    temp_domains = [
        'emailus.click', 'wetransfer.click', 'shenika.top', 'johana.top', 'jospeh.top',
        'astroaxis.site', 'tempr.email', 'hidebox.org', 'mailmenot.io',
        'discard.email', '33mail.com', 'dropmail.me', 'mailinator.com', 'yopmail.com', 'guerrillamail.com',
        'mailnesia.com', 'temp-mail.org', 'maildrop.cc', 'getnada.com', '10minutemail.com', 'mailcatch.com',
        'throwawaymail.com', 'spamgourmet.com', 'dispostable.com', 'tempemail.net', 'mytemp.email', 'trashmail.com'
    ]

    legit_domains = [
        'gmail.com', 'yahoo.com', 'outlook.com', 'hotmail.com', 'proton.me', 'icloud.com', 'example.com',
        'zoho.com', 'aol.com', 'mail.com', 'tutanota.com', 'fastmail.com', 'mailfence.com', 'mailbox.org',
        'tuta.io', 'posteo.de', 'thexyz.com', 'privatemail.com', 'neo.com', 'atomicmail.io'
    ]

    def in_list(candidates):
        # Match the domain itself or any subdomain of it. A plain endswith()
        # would treat temp domains such as 'trashmail.com' as 'mail.com'.
        return any(domain == d or domain.endswith('.' + d) for d in candidates)

    is_temp = in_list(temp_domains)

    # Domain signals
    if is_temp:
        score += 50

    if in_list(legit_domains):
        score -= 30

    # Email signals
    if re.search(r'\d{3,}', email):
        score += 15
    if re.search(r'_(temp|fake|test|bot|spam|junk|disposable)', email, re.IGNORECASE):
        score += 25
    if len(email) < 20 and not re.search(r'\.(com|net|org)$', domain):
        score += 10

    # Username signals
    username_lower = username.lower()
    if re.search(r'\d{2,}', username):
        score += 10
    if re.search(r'[-_](best|buy|machine|windows|zakup)', username_lower):
        score += 20
    if len(username) < 5 or len(username) > 25:
        score += 5
    if re.search(r'\b(?:admin|root|test|fake)\b', username_lower):
        score += 15

    # Bonus for a temp domain combined with an already elevated score
    if score > 20 and is_temp:
        score += 10

    # Clamp to the 0-100 range
    return max(min_score, min(max_score, score))


def test_scam_email_score():
    user_data = [
        ("tilly218881528", "tilly.morton71@questions.emailus.click"),
        ("tinashumaker89", "tina-shumaker5@wetransfer.click"),
        ("toneypalladino", "toney-palladino98@emailus.click"),
        ("tracyfagan0426", "daryn@b.echocosmos.online"),
        ("tracyrayford20", "dustinhorseman@hidebox.org"),
        ("tristanparedes", "tristan_paredes8@questions.emailus.click"),
        ("tysonswafford0", "alidia@e.cosmicbridge.site"),
        ("ukycarlota9372", "carlota.doran78@assist.wetransfer.click"),
        ("ulrike00a3347", "wilhelminabloomfield2005@discard.email"),
        ("uoy0166", "jkrowling321@gmail.com"),
        ("uqtdong777253", "dong.delittle47@feedback.emailus.click"),
        ("used-couches-for-sale2303", "jody-bolivar@8k0.jospeh.top"),
        ("ushmoises27913", "moises.goldhar32@emailus.click"),
        ("uugashleigh52", "ashleigh_oreilly22@feedback.emailus.click"),
        ("uusladonna6970", "ladonna_rosario@care.trytrip.click"),
        ("vallieullathor", "orelee@a.cosmiccluster.store"),
        ("vanessahowitt", "vanessa.howitt@assist.wetransfer.click"),
        ("vanv7048981847", "ameliawilson33688@a.seoautomationpro.com"),
        ("vedaackerman1", "reaganhacker5767@spambog.ru"),
        ("veratiemann936", "cybil@i.cosmicbridge.site"),
        ("verncrumpton50", "isabel@astroaxis.site"),
        ("veroniqueoquen", "veronique.oquendo96@questions.emailus.click"),
        ("vinceh28371073", "vince-mayne@quickreply.trytrip.click"),
        ("vsvbill0213043", "dorotheawalck5013@tempr.email"),
        ("wadearnett223", "wade.arnett@business.trytrip.click"),
        ("wallypinner76", "shawnlarkin4576@hidebox.org"),
        ("wernerrepin110", "effieconde@0815.ru"),
        ("wesleypickles1", "wesley_pickles@feedback.emailus.click"),
        ("which-tassimo-machine-is-best5526", "caridad_browning@l4q.shenika.top"),
        ("whitneybassler", "kathrynpolglaze@tempr.email"),
        ("wilfred33r9270", "wilfred.witcher2@mailus.wetransfer.click"),
        ("wilheminadevit", "wilhemina_devito45@contactus.wetransfer.click"),
        ("windows-and-doors-upvc8966", "pansy_tindale88@w9my.johana.top"),
        ("wolfgangschard", "jannelle@b.astroaxis.site"),
        ("wrekirby75774", "kirby.ampt@feedback.emailus.click"),
        ("xolalex246551", "codie@b.astroaxis.site"),
        ("yongmcdonald2", "yong.mcdonald9@general.emailus.click"),
        ("wappie", "wappiewap@proton.me"),
        ("yylheike873735", "heike_kohn@general.emailus.click"),
        ("zakup-prawa-jazdy-b5862", "foster-whitis13@sh97.jospeh.top"),
        ("zararemington", "lynnrene@hidebox.org"),
        ("zellaaxt970366", "zella_langton56@questions.emailus.click"),
        ("zellaz6393879", "stacyrandolph2256@mailmenot.io"),
        ("zulmagoldman36", "zulma.goldman65@general.emailus.click"),
        ("zuxnoemi586821", "noemi-hakala@feedback.emailus.click"),
        ("john_doe123", "john.doe123@gmail.com"),
        ("alice.wonderland", "alice.w@example.com"),
        ("bob_smith", "bob.smith@yahoo.com"),
        ("charlie_brown", "charlie.brown@outlook.com"),
        ("diana_prince", "diana.prince@icloud.com"),
        ("evan_rogers", "evan.rogers@proton.me")
    ]

    results = {}
    for username, email in user_data:
        score = email_scam_score(username, email)
        results[email] = score

    above_50 = sum(