diff --git a/README.md b/README.md index 768e5d794..e13b07a50 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,5 @@ # Jarelllama's Scam Blocklist + Blocklist for scam site domains automatically retrieved daily from Google Search and public databases. Automated retrieval is done daily at 00:00 UTC. | Format | Syntax | | --- | --- | @@ -9,9 +10,11 @@ Blocklist for scam site domains automatically retrieved daily from Google Search | [Wildcard Domains](https://raw.githubusercontent.com/jarelllama/Scam-Blocklist/main/lists/wildcard_domains/scams.txt) | scam.com | ## Statistics + [![Build and deploy](https://github.com/jarelllama/Scam-Blocklist/actions/workflows/build_deploy.yml/badge.svg)](https://github.com/jarelllama/Scam-Blocklist/actions/workflows/build_deploy.yml) [![Test functions](https://github.com/jarelllama/Scam-Blocklist/actions/workflows/test_functions.yml/badge.svg)](https://github.com/jarelllama/Scam-Blocklist/actions/workflows/test_functions.yml) -``` + +``` text Total domains: 31426 Statistics for each source: @@ -33,9 +36,11 @@ Today | Yesterday | Excluded | Source *Only active sources are shown. See the full list of sources in SOURCES.md. ``` + All data retrieved are publicly available and can be viewed from their respective [sources](https://github.com/jarelllama/Scam-Blocklist/blob/main/SOURCES.md). ## Light version + Targeted at list maintainers, a light version of the blocklist is available in the [lists](https://github.com/jarelllama/Scam-Blocklist/tree/main/lists) directory.
@@ -55,26 +60,31 @@ Total domains: 2015 ## Sources ### Google Search API + Google provides a [Search API](https://developers.google.com/custom-search/v1/overview) to retrieve JSON-formatted results from Google Search. The script uses a list of search terms almost exclusively used in scam sites to retrieve domains. See the list of search terms here: [search_terms.csv](https://github.com/jarelllama/Scam-Blocklist/blob/main/config/search_terms.csv) #### Effectiveness + Scam sites often do not have long lifespans; malicious domains may be replaced before they can be manually reported. By programmatically searching Google using paragraphs from real-world scam sites, new domains can be added as soon as Google crawls the site. This requires no manual reporting. The list of search terms is proactively updated and is mostly sourced from investigating new scam site templates seen on [r/Scams](https://www.reddit.com/r/Scams/). #### Limitations + The Google Custom Search JSON API only provides ~100 daily free search queries per API key (which is why this project uses two API keys). To optimize the number of search queries made, each search term is frequently benchmarked on its number of new domains and false positives. Underperforming search terms are flagged and disabled. #### Statistics for Google Search source -``` + +``` text Active search terms: 17 Queries made today: 0 Domains retrieved today: 0 ``` ### openSquat + [openSquat](https://github.com/atenreiro/opensquat) is an open-source tool for detecting cybersquatting domains. The detection algorithm takes a list of keywords as input and checks for common cybersquatting techniques like [Typosquatting](https://en.wikipedia.org/wiki/Typosquatting), [Doppelganger Domains](https://en.wikipedia.org/wiki/Doppelganger_domain), and [IDN Homograph Attacks](https://en.wikipedia.org/wiki/IDN_homograph_attack). The keywords are handpicked and include common targets of phishing campaigns such as Google, Amazon, USPS, etc. 
while also taking into consideration potential false positives. The list of keywords can be viewed here: [opensquat_keywords.txt](https://github.com/jarelllama/Scam-Blocklist/blob/main/config/opensquat_keywords.txt) @@ -82,18 +92,21 @@ The keywords are handpicked and include common targets of phishing campaigns suc To further minimize false positives, the automated detection is intentionally limited to Newly Registered Domains (NRD) comprising domains registered within the last 10 days. #### Effectiveness + New phishing domains are created daily, and unlike other sources that depend on manual reporting, openSquat can effectively detect new phishing domains within days of their registration date. This is aided by an actively updated NRD feed for openSquat to process. The NRD feed can be viewed here: [nrd.txt](https://github.com/jarelllama/Scam-Blocklist/blob/main/lists/wildcard_domains/nrd.txt) #### Limitations + Because the detection is fully automated, false positives may slip through despite the intensive effort put into handpicking and testing the list of keywords. For this reason, the openSquat source is not included in the light version of the blocklist. #### Statistics for openSquat source -``` + +``` text Active keywords: 85 Domains retrieved today: 0 -Domains in NRD feed: +Domains in NRD feed: ``` ### Other sources @@ -103,6 +116,7 @@ All sources used presently or in the past are recorded here: [SOURCES.md](https: The domain retrieval process for all sources can be viewed in the repository's code. ## Filtering process + - The domains collated from all sources are filtered against a whitelist (scam reporting sites, forums, vetted stores, etc.) 
- The domains are checked against the [Tranco Top Sites Ranking](https://tranco-list.eu/) for potential false positives which are then vetted manually - Common subdomains like 'www' are removed to make use of wildcard matching for all other subdomains @@ -112,18 +126,23 @@ The domain retrieval process for all sources can be viewed in the repository's c The full filtering process can be viewed in the repository's code. ## Dead domains + Dead domains are removed daily using AdGuard's [Dead Domains Linter](https://github.com/AdguardTeam/DeadDomainsLinter). Note that domains acting as wildcards are excluded from this process. Dead domains that are resolving again are included back into the blocklist. ## Parked domains + From initial testing, [9%](https://github.com/jarelllama/Scam-Blocklist/commit/84e682fea95866670dd99f5c98f350bc7377011a) of the blocklist consisted of [parked domains](https://www.godaddy.com/resources/ae/skills/parked-domain) that inflate the number of entries. Because these domains pose no real threat (besides the obnoxious advertising), they are removed from the blocklist daily. A list of common parked domain messages is used to detect these domains and can be viewed here: [parked_terms.txt](https://github.com/jarelllama/Scam-Blocklist/blob/main/config/parked_terms.txt) If these parked sites no longer contain any of the parked messages, they are assumed to be unparked and are added back into the blocklist. + ## Why the Hosts format is not supported + Malicious domains often have [wildcard DNS records](https://developers.cloudflare.com/dns/manage-dns-records/reference/wildcard-dns-records/) that allow scammers to create large amounts of subdomain records, such as 'random-subdomain.scam.com'. Each subdomain can point to a separate scam site and collating them all would inflate the blocklist size. Therefore, only formats supporting wildcard matching are built. 
## Resources + - [AdGuard's Dead Domains Linter](https://github.com/AdguardTeam/DeadDomainsLinter): tool for checking Adblock rules for dead domains - [Google's Shell Style Guide](https://google.github.io/styleguide/shellguide.html): Shell script style guide - [Legality of web scraping](https://www.quinnemanuel.com/the-firm/publications/the-legal-landscape-of-web-scraping/): the law firm of Quinn Emanuel Urquhart & Sullivan's memoranda on web scraping @@ -131,13 +150,16 @@ Malicious domains often have [wildcard DNS records](https://developers.cloudflar - [who.is](https://who.is/): WHOIS and DNS lookup tool ## See also + - [Durablenapkin's Scam Blocklist](https://github.com/durablenapkin/scamblocklist) - [Elliotwutingfeng's Global Anti-Scam Organization Blocklist](https://github.com/elliotwutingfeng/GlobalAntiScamOrg-blocklist) - [Elliotwutingfeng's Inversion DNSBL Blocklist](https://github.com/elliotwutingfeng/Inversion-DNSBL-Blocklists) - [Hagezi's DNS Blocklists](https://github.com/hagezi/dns-blocklists) (uses this blocklist as a source) ## Appreciation + Thanks to the following people for the help, inspiration, and support! + - [@bongochong](https://github.com/bongochong) - [@hagezi](https://github.com/hagezi) - [@iam-py-test](https://github.com/iam-py-test) diff --git a/SOURCES.md b/SOURCES.md index 304835fb9..865a088ef 100644 --- a/SOURCES.md +++ b/SOURCES.md @@ -1,25 +1,26 @@ -## Sources +# Sources + All data retrieved from the following sources are publicly available. Sources marked as inactive are not being automatically employed to retrieve domains. 
-Source | Type | Inactive | Excluded from light
-:--- |:--- |:--- |:---
-[ANFRAS](https://anfras.com/fakeshops/) | Fake | Yes | -
-[Artists Against 419](https://db.aa419.org/fakebankslist.php) | Advance-fee | |
-[DFPI's Crypto Scam Tracker](https://dfpi.ca.gov/crypto-scams/) | Crypto | Yes | -
-[Google's Custom Search JSON API](https://developers.google.com/custom-search/v1/introduction) | Fake | |
-[GunTab](https://www.guntab.com/scam-websites) | Firearm | | Yes
-[Hagezi's NRD List](https://github.com/hagezi/dns-blocklists?tab=readme-ov-file#nrd) | NRD | - | -
-[PetScams.com](https://petscams.com/) | Pet | |
-[Scam Directory](https://scam.directory/) | Any | |
-[Scam.Delivery](https://scam.delivery/) | Non-delivery | Yes | -
-[ScamAdvisor](https://www.scamadviser.com/) | Any | |
-[Shreshta's NRD List](https://github.com/shreshta-labs/newly-registered-domains) | NRD | - | -
-[Stop 419 Scams and Scammers](https://www.stop419scams.com/) | Any | Yes | -
-[StopGunScams.com](https://stopgunscams.com/) | Firearm | |
-[Tranco List](https://tranco-list.eu/) | Toplist | - | -
-[dnstwist](https://github.com/elceef/dnstwist) | Phishing | | Yes
-[openSquat](https://github.com/atenreiro/opensquat) | Phishing | | Yes
-[r/CryptoScamBlacklist](https://www.reddit.com/r/CryptoScamBlacklist/) | Crypto | Yes | -
-[r/Scams](https://www.reddit.com/r/Scams/) | Any | Yes | -
+| Source | Type | Inactive | Excluded from light |
+|:--- |:--- |:--- |:--- |
+| [ANFRAS](https://anfras.com/fakeshops/) | Fake | Yes | - |
+| [Artists Against 419](https://db.aa419.org/fakebankslist.php) | Advance-fee | | |
+| [DFPI's Crypto Scam Tracker](https://dfpi.ca.gov/crypto-scams/) | Crypto | Yes | - |
+| [Google's Custom Search JSON API](https://developers.google.com/custom-search/v1/introduction) | Fake | | |
+| [GunTab](https://www.guntab.com/scam-websites) | Firearm | | Yes |
+| [Hagezi's NRD List](https://github.com/hagezi/dns-blocklists?tab=readme-ov-file#nrd) | NRD | - | - |
+| [PetScams.com](https://petscams.com/) | Pet | | |
+| [Scam Directory](https://scam.directory/) | Any | | |
+| [Scam.Delivery](https://scam.delivery/) | Non-delivery | Yes | - |
+| [ScamAdvisor](https://www.scamadviser.com/) | Any | | |
+| [Shreshta's NRD List](https://github.com/shreshta-labs/newly-registered-domains) | NRD | - | - |
+| [Stop 419 Scams and Scammers](https://www.stop419scams.com/) | Any | Yes | - |
+| [StopGunScams.com](https://stopgunscams.com/) | Firearm | | |
+| [Tranco List](https://tranco-list.eu/) | Toplist | - | - |
+| [dnstwist](https://github.com/elceef/dnstwist) | Phishing | | Yes |
+| [openSquat](https://github.com/atenreiro/opensquat) | Phishing | | Yes |
+| [r/CryptoScamBlacklist](https://www.reddit.com/r/CryptoScamBlacklist/) | Crypto | Yes | - |
+| [r/Scams](https://www.reddit.com/r/Scams/) | Any | Yes | - |
diff --git a/functions/update_readme.sh b/functions/update_readme.sh
index 86090ab07..be8938354 100644
--- a/functions/update_readme.sh
+++ b/functions/update_readme.sh
@@ -14,7 +14,9 @@ readonly YESTERDAY
 update_readme() {
     cat << EOF > README.md
 # Jarelllama's Scam Blocklist
-Blocklist for scam site domains automatically retrieved daily from Google Search and public databases. Automated retrieval is done daily at 00:00 UTC.
+
+Blocklist for scam site domains automatically retrieved daily from Google Search and public sources. Automated retrieval is done daily at 00:00 UTC.
+ | Format | Syntax | | --- | --- | | [Adblock Plus](https://raw.githubusercontent.com/jarelllama/Scam-Blocklist/main/lists/adblock/scams.txt) | \|\|scam.com^ | @@ -24,9 +26,11 @@ Blocklist for scam site domains automatically retrieved daily from Google Search | [Wildcard Domains](https://raw.githubusercontent.com/jarelllama/Scam-Blocklist/main/lists/wildcard_domains/scams.txt) | scam.com | ## Statistics + [![Build and deploy](https://github.com/jarelllama/Scam-Blocklist/actions/workflows/build_deploy.yml/badge.svg)](https://github.com/jarelllama/Scam-Blocklist/actions/workflows/build_deploy.yml) [![Test functions](https://github.com/jarelllama/Scam-Blocklist/actions/workflows/test_functions.yml/badge.svg)](https://github.com/jarelllama/Scam-Blocklist/actions/workflows/test_functions.yml) -\`\`\` + +\`\`\` text Total domains: $(wc -l < "$RAW") Statistics for each source: @@ -48,9 +52,11 @@ $(print_stats) *Only active sources are shown. See the full list of sources in SOURCES.md. \`\`\` + All data retrieved are publicly available and can be viewed from their respective [sources](https://github.com/jarelllama/Scam-Blocklist/blob/main/SOURCES.md). ## Light version + Targeted at list maintainers, a light version of the blocklist is available in the [lists](https://github.com/jarelllama/Scam-Blocklist/tree/main/lists) directory.
@@ -70,26 +76,31 @@ Total domains: $(wc -l < "$RAW_LIGHT") ## Sources ### Google Search API + Google provides a [Search API](https://developers.google.com/custom-search/v1/overview) to retrieve JSON-formatted results from Google Search. The script uses a list of search terms almost exclusively used in scam sites to retrieve domains. See the list of search terms here: [search_terms.csv](https://github.com/jarelllama/Scam-Blocklist/blob/main/config/search_terms.csv) #### Effectiveness + Scam sites often do not have long lifespans; malicious domains may be replaced before they can be manually reported. By programmatically searching Google using paragraphs from real-world scam sites, new domains can be added as soon as Google crawls the site. This requires no manual reporting. The list of search terms is proactively updated and is mostly sourced from investigating new scam site templates seen on [r/Scams](https://www.reddit.com/r/Scams/). #### Limitations + The Google Custom Search JSON API only provides ~100 daily free search queries per API key (which is why this project uses two API keys). To optimize the number of search queries made, each search term is frequently benchmarked on its number of new domains and false positives. Underperforming search terms are flagged and disabled. #### Statistics for Google Search source -\`\`\` + +\`\`\` text Active search terms: $(csvgrep -c 2 -m 'y' -i "$SEARCH_TERMS" | tail -n +2 | wc -l) Queries made today: $(csvgrep -c 1 -m "$TODAY" "$SOURCE_LOG" | csvgrep -c 2 -m 'Google Search' | csvcut -c 12 | awk '{sum += $1} END {print sum}') Domains retrieved today: $(sum "$TODAY" 'Google Search') \`\`\` ### openSquat + [openSquat](https://github.com/atenreiro/opensquat) is an open-source tool for detecting cybersquatting domains. 
The detection algorithm takes a list of keywords as input and checks for common cybersquatting techniques like [Typosquatting](https://en.wikipedia.org/wiki/Typosquatting), [Doppelganger Domains](https://en.wikipedia.org/wiki/Doppelganger_domain), and [IDN Homograph Attacks](https://en.wikipedia.org/wiki/IDN_homograph_attack). The keywords are handpicked and include common targets of phishing campaigns such as Google, Amazon, USPS, etc. while also taking into consideration potential false positives. The list of keywords can be viewed here: [opensquat_keywords.txt](https://github.com/jarelllama/Scam-Blocklist/blob/main/config/opensquat_keywords.txt) @@ -97,15 +108,18 @@ The keywords are handpicked and include common targets of phishing campaigns suc To further minimize false positives, the automated detection is intentionally limited to Newly Registered Domains (NRD) comprising domains registered within the last 10 days. #### Effectiveness + New phishing domains are created daily, and unlike other sources that depend on manual reporting, openSquat can effectively detect new phishing domains within days of their registration date. This is aided by an actively updated NRD feed for openSquat to process. The NRD feed can be viewed here: [nrd.txt](https://github.com/jarelllama/Scam-Blocklist/blob/main/lists/wildcard_domains/nrd.txt) #### Limitations + Because the detection is fully automated, false positives may slip through despite the intensive effort put into handpicking and testing the list of keywords. For this reason, the openSquat source is not included in the light version of the blocklist. 
#### Statistics for openSquat source -\`\`\` + +\`\`\` text Active keywords: $(wc -l < config/opensquat_keywords.txt) Domains retrieved today: $(sum "$TODAY" 'openSquat') Domains in NRD feed: $(wc -l < lists/wildcard_domains/nrd.txt | rev | sed 's/\(...\)/\1,/g' | sed 's/,$//' | rev) @@ -118,6 +132,7 @@ All sources used presently or in the past are recorded here: [SOURCES.md](https: The domain retrieval process for all sources can be viewed in the repository's code. ## Filtering process + - The domains collated from all sources are filtered against a whitelist (scam reporting sites, forums, vetted stores, etc.) - The domains are checked against the [Tranco Top Sites Ranking](https://tranco-list.eu/) for potential false positives which are then vetted manually - Common subdomains like 'www' are removed to make use of wildcard matching for all other subdomains @@ -127,18 +142,23 @@ The domain retrieval process for all sources can be viewed in the repository's c The full filtering process can be viewed in the repository's code. ## Dead domains + Dead domains are removed daily using AdGuard's [Dead Domains Linter](https://github.com/AdguardTeam/DeadDomainsLinter). Note that domains acting as wildcards are excluded from this process. Dead domains that are resolving again are included back into the blocklist. ## Parked domains + From initial testing, [9%](https://github.com/jarelllama/Scam-Blocklist/commit/84e682fea95866670dd99f5c98f350bc7377011a) of the blocklist consisted of [parked domains](https://www.godaddy.com/resources/ae/skills/parked-domain) that inflate the number of entries. Because these domains pose no real threat (besides the obnoxious advertising), they are removed from the blocklist daily. 
A list of common parked domain messages is used to detect these domains and can be viewed here: [parked_terms.txt](https://github.com/jarelllama/Scam-Blocklist/blob/main/config/parked_terms.txt) If these parked sites no longer contain any of the parked messages, they are assumed to be unparked and are added back into the blocklist. + ## Why the Hosts format is not supported + Malicious domains often have [wildcard DNS records](https://developers.cloudflare.com/dns/manage-dns-records/reference/wildcard-dns-records/) that allow scammers to create large amounts of subdomain records, such as 'random-subdomain.scam.com'. Each subdomain can point to a separate scam site and collating them all would inflate the blocklist size. Therefore, only formats supporting wildcard matching are built. ## Resources + - [AdGuard's Dead Domains Linter](https://github.com/AdguardTeam/DeadDomainsLinter): tool for checking Adblock rules for dead domains - [Google's Shell Style Guide](https://google.github.io/styleguide/shellguide.html): Shell script style guide - [Legality of web scraping](https://www.quinnemanuel.com/the-firm/publications/the-legal-landscape-of-web-scraping/): the law firm of Quinn Emanuel Urquhart & Sullivan's memoranda on web scraping @@ -146,13 +166,16 @@ Malicious domains often have [wildcard DNS records](https://developers.cloudflar - [who.is](https://who.is/): WHOIS and DNS lookup tool ## See also + - [Durablenapkin's Scam Blocklist](https://github.com/durablenapkin/scamblocklist) - [Elliotwutingfeng's Global Anti-Scam Organization Blocklist](https://github.com/elliotwutingfeng/GlobalAntiScamOrg-blocklist) - [Elliotwutingfeng's Inversion DNSBL Blocklist](https://github.com/elliotwutingfeng/Inversion-DNSBL-Blocklists) - [Hagezi's DNS Blocklists](https://github.com/hagezi/dns-blocklists) (uses this blocklist as a source) ## Appreciation + Thanks to the following people for the help, inspiration, and support! 
+ - [@bongochong](https://github.com/bongochong) - [@hagezi](https://github.com/hagezi) - [@iam-py-test](https://github.com/iam-py-test)
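
Note on the `Domains in NRD feed` line in `functions/update_readme.sh`: the count is formatted with a `rev | sed | sed | rev` pipeline that inserts thousands separators. The sketch below isolates that pipeline so it can be tried on its own. The function name `format_thousands` is hypothetical and not part of the repository, and the sketch assumes `rev` (util-linux) and `sed` are available.

```shell
#!/bin/bash

# Hypothetical helper (not in the repository): wraps the separator
# pipeline used for the "Domains in NRD feed" statistic.
format_thousands() {
    # 1. Reverse the digits so grouping starts from the least significant digit.
    # 2. Append a comma after every complete group of three characters.
    # 3. Drop the trailing comma left when the length is a multiple of three.
    # 4. Reverse again to restore the original digit order.
    printf '%s\n' "$1" | rev | sed 's/\(...\)/\1,/g' | sed 's/,$//' | rev
}

format_thousands 1234567  # prints 1,234,567
```

Reversing first lets a plain left-to-right `sed` substitution group digits from the right, which avoids relying on locale-dependent `printf "%'d"` support.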