Commit

Fix markdown
jarelllama authored Apr 5, 2024
1 parent 0515fa8 commit a110afb
Showing 3 changed files with 75 additions and 29 deletions.
30 changes: 26 additions & 4 deletions README.md
@@ -1,4 +1,5 @@
# Jarelllama's Scam Blocklist

Blocklist for scam site domains automatically retrieved daily from Google Search and public databases. Automated retrieval is done daily at 00:00 UTC.
| Format | Syntax |
| --- | --- |
@@ -9,9 +10,11 @@ Blocklist for scam site domains automatically retrieved daily from Google Search
| [Wildcard Domains](https://raw.githubusercontent.com/jarelllama/Scam-Blocklist/main/lists/wildcard_domains/scams.txt) | scam.com |

## Statistics

[![Build and deploy](https://github.com/jarelllama/Scam-Blocklist/actions/workflows/build_deploy.yml/badge.svg)](https://github.com/jarelllama/Scam-Blocklist/actions/workflows/build_deploy.yml)
[![Test functions](https://github.com/jarelllama/Scam-Blocklist/actions/workflows/test_functions.yml/badge.svg)](https://github.com/jarelllama/Scam-Blocklist/actions/workflows/test_functions.yml)
```

``` text
Total domains: 31426
Statistics for each source:
@@ -33,9 +36,11 @@ Today | Yesterday | Excluded | Source
*Only active sources are shown. See the full list of
sources in SOURCES.md.
```

All data retrieved are publicly available and can be viewed from their respective [sources](https://github.com/jarelllama/Scam-Blocklist/blob/main/SOURCES.md).

## Light version

Targeted at list maintainers, a light version of the blocklist is available in the [lists](https://github.com/jarelllama/Scam-Blocklist/tree/main/lists) directory.

<details>
@@ -55,45 +60,53 @@ Total domains: 2015
## Sources

### Google Search API

Google provides a [Search API](https://developers.google.com/custom-search/v1/overview) to retrieve JSON-formatted results from Google Search. The script uses a list of search terms almost exclusively used in scam sites to retrieve domains. See the list of search terms here: [search_terms.csv](https://github.com/jarelllama/Scam-Blocklist/blob/main/config/search_terms.csv)

#### Effectiveness

Scam sites often do not have long lifespans; malicious domains may be replaced before they can be manually reported. By programmatically searching Google using paragraphs from real-world scam sites, new domains can be added as soon as Google crawls the site. This requires no manual reporting.

The list of search terms is proactively updated and is mostly sourced from investigating new scam site templates seen on [r/Scams](https://www.reddit.com/r/Scams/).

#### Limitations

The Google Custom Search JSON API only provides ~100 daily free search queries per API key (which is why this project uses two API keys).

To optimize the number of search queries made, each search term is frequently benchmarked on its number of new domains and false positives. Underperforming search terms are flagged and disabled.
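
The two-key arrangement described above can be pictured with a small sketch. This is not taken from the repository: the function name `pick_key` is hypothetical, and the quota of exactly 100 queries per key is an assumption based on the approximate figure quoted above.

```shell
#!/bin/sh
# Hypothetical sketch: decide which of two API keys the next query should use.
# Assumes a free quota of 100 queries per key per day (the README says "~100").
DAILY_QUOTA=100

# pick_key QUERIES_MADE_TODAY
# Prints 1 or 2 for the key to use; returns non-zero once both quotas are spent.
pick_key() {
    queries_made=$1
    if [ "$queries_made" -lt "$DAILY_QUOTA" ]; then
        echo 1
    elif [ "$queries_made" -lt $((DAILY_QUOTA * 2)) ]; then
        echo 2
    else
        return 1
    fi
}
```

With these assumptions the daily budget is 200 queries in total, which is why disabling underperforming search terms matters: every query spent on a weak term is one fewer for a productive one.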

#### Statistics for Google Search source
```

``` text
Active search terms: 17
Queries made today: 0
Domains retrieved today: 0
```

### openSquat

[openSquat](https://github.com/atenreiro/opensquat) is an open-source tool for detecting cybersquatting domains. The detection algorithm takes a list of keywords as input and checks for common cybersquatting techniques like [Typosquatting](https://en.wikipedia.org/wiki/Typosquatting), [Doppelganger Domains](https://en.wikipedia.org/wiki/Doppelganger_domain), and [IDN Homograph Attacks](https://en.wikipedia.org/wiki/IDN_homograph_attack).

The keywords are handpicked and include common targets of phishing campaigns such as Google, Amazon, USPS, etc. while also taking into consideration potential false positives. The list of keywords can be viewed here: [opensquat_keywords.txt](https://github.com/jarelllama/Scam-Blocklist/blob/main/config/opensquat_keywords.txt)

To further minimize false positives, the automated detection is intentionally limited to Newly Registered Domains (NRD) comprising domains registered within the last 10 days.

#### Effectiveness

New phishing domains are created daily, and unlike other sources that depend on manual reporting, openSquat can effectively detect new phishing domains within days of their registration date. This is aided by an actively updated NRD feed for openSquat to process. The NRD feed can be viewed here: [nrd.txt](https://github.com/jarelllama/Scam-Blocklist/blob/main/lists/wildcard_domains/nrd.txt)

#### Limitations

Because the detection is fully automated, false positives may slip through despite the intensive effort put into handpicking and testing the list of keywords.

For this reason, the openSquat source is not included in the light version of the blocklist.

#### Statistics for openSquat source
```

``` text
Active keywords: 85
Domains retrieved today: 0
Domains in NRD feed:
Domains in NRD feed:
```
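
The `Domains in NRD feed` figure is printed with thousands separators. The `functions/update_readme.sh` diff in this commit does this with a `rev` and `sed` pipeline; the sketch below wraps that same pipeline in a function (the name `add_thousands_separators` is hypothetical) to show how it works. It assumes the input is a bare integer with no surrounding whitespace and that `rev` is available.

```shell
#!/bin/sh
# Hypothetical wrapper around the rev/sed pipeline used in update_readme.sh.
# Reads an integer on stdin and prints it with comma thousands separators.
add_thousands_separators() {
    # 1. Reverse the digits so grouping starts from the least significant end
    # 2. Append a comma after every complete group of three digits
    # 3. Drop the trailing comma left when the digit count is a multiple of three
    # 4. Reverse again to restore the original digit order
    rev | sed 's/\(...\)/\1,/g' | sed 's/,$//' | rev
}

echo 1234567 | add_thousands_separators
```

Reversing first is what lets a plain left-to-right `sed` substitution group digits from the right.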

### Other sources
@@ -103,6 +116,7 @@ All sources used presently or in the past are recorded here: [SOURCES.md](https:
The domain retrieval process for all sources can be viewed in the repository's code.

## Filtering process

- The domains collated from all sources are filtered against a whitelist (scam reporting sites, forums, vetted stores, etc.)
- The domains are checked against the [Tranco Top Sites Ranking](https://tranco-list.eu/) for potential false positives which are then vetted manually
- Common subdomains like 'www' are removed to make use of wildcard matching for all other subdomains
@@ -112,32 +126,40 @@ The domain retrieval process for all sources can be viewed in the repository's c
The full filtering process can be viewed in the repository's code.
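
As an illustration only, two of the filtering steps listed above (removing the `www` subdomain and applying a whitelist) could be sketched as follows. The function name `filter_domains`, the file arguments, and the choice of exact whole-line whitelist matching are all assumptions; the repository's actual logic may differ.

```shell
#!/bin/sh
# Hypothetical sketch of two filtering steps; not the repository's code.
# filter_domains DOMAINS_FILE WHITELIST_FILE
# Both files are assumed to hold one domain per line.
filter_domains() {
    domains_file=$1
    whitelist_file=$2
    # Strip a leading "www." so the parent domain can act as a wildcard entry
    sed 's/^www\.//' "$domains_file" |
        # Drop domains that exactly match a whitelist line (fixed strings)
        grep -vxFf "$whitelist_file" |
        # Remove the duplicates created by stripping "www."
        sort -u
}
```

For example, given `www.scam.com`, `scam.com`, and `reddit.com` as input with `reddit.com` whitelisted, the function prints a single `scam.com` line.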

## Dead domains

Dead domains are removed daily using AdGuard's [Dead Domains Linter](https://github.com/AdguardTeam/DeadDomainsLinter). Note that domains acting as wildcards are excluded from this process.

Dead domains that are resolving again are included back into the blocklist.

## Parked domains

From initial testing, [9%](https://github.com/jarelllama/Scam-Blocklist/commit/84e682fea95866670dd99f5c98f350bc7377011a) of the blocklist consisted of [parked domains](https://www.godaddy.com/resources/ae/skills/parked-domain) that inflate the number of entries. Because these domains pose no real threat (besides the obnoxious advertising), they are removed from the blocklist daily. A list of common parked domain messages is used to detect these domains and can be viewed here: [parked_terms.txt](https://github.com/jarelllama/Scam-Blocklist/blob/main/config/parked_terms.txt)

If these parked sites no longer contain any of the parked messages, they are assumed to be unparked and are added back into the blocklist.
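
A minimal sketch of the parked-message check described above, assuming the terms file holds one non-empty message per line and that matching is case-insensitive. The function name `is_parked` is hypothetical, and how the page is fetched is left out of the function.

```shell
#!/bin/sh
# Hypothetical sketch; not the repository's code.
# is_parked TERMS_FILE
# Reads a page's HTML on stdin and succeeds if any parked message appears in it.
# Note: an empty line in TERMS_FILE would match every page, so the file
# is assumed to contain no blank lines.
is_parked() {
    grep -qiFf "$1"
}
```

A caller might pipe a fetched page into it, for instance `curl -sL --max-time 10 "https://$domain/" | is_parked config/parked_terms.txt`. The same check covers the unparking case: a domain that previously matched and no longer does is treated as unparked.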

## Why the Hosts format is not supported

Malicious domains often have [wildcard DNS records](https://developers.cloudflare.com/dns/manage-dns-records/reference/wildcard-dns-records/) that allow scammers to create large amounts of subdomain records, such as 'random-subdomain.scam.com'. Each subdomain can point to a separate scam site and collating them all would inflate the blocklist size. Therefore, only formats supporting wildcard matching are built.
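
To make the wildcard point concrete, the sketch below shows how one blocklist entry can cover every subdomain, which a Hosts file cannot express without listing each subdomain separately. The function name `is_blocked` is hypothetical and the matching rule is a simplification of what real DNS blockers do.

```shell
#!/bin/sh
# Hypothetical sketch of wildcard matching; not the repository's code.
# is_blocked HOSTNAME ENTRY
# Succeeds if HOSTNAME equals ENTRY or is any subdomain of ENTRY.
is_blocked() {
    case "$1" in
        "$2" | *."$2") return 0 ;;
        *) return 1 ;;
    esac
}

is_blocked random-subdomain.scam.com scam.com && echo "blocked"
```

The leading dot in the second pattern keeps unrelated domains such as `notscam.com` from matching the `scam.com` entry.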

## Resources

- [AdGuard's Dead Domains Linter](https://github.com/AdguardTeam/DeadDomainsLinter): tool for checking Adblock rules for dead domains
- [Google's Shell Style Guide](https://google.github.io/styleguide/shellguide.html): Shell script style guide
- [Legality of web scraping](https://www.quinnemanuel.com/the-firm/publications/the-legal-landscape-of-web-scraping/): the law firm of Quinn Emanuel Urquhart & Sullivan's memoranda on web scraping
- [ShellCheck](https://github.com/koalaman/shellcheck): shell script static analysis tool
- [who.is](https://who.is/): WHOIS and DNS lookup tool

## See also

- [Durablenapkin's Scam Blocklist](https://github.com/durablenapkin/scamblocklist)
- [Elliotwutingfeng's Global Anti-Scam Organization Blocklist](https://github.com/elliotwutingfeng/GlobalAntiScamOrg-blocklist)
- [Elliotwutingfeng's Inversion DNSBL Blocklist](https://github.com/elliotwutingfeng/Inversion-DNSBL-Blocklists)
- [Hagezi's DNS Blocklists](https://github.com/hagezi/dns-blocklists) (uses this blocklist as a source)

## Appreciation

Thanks to the following people for the help, inspiration, and support!

- [@bongochong](https://github.com/bongochong)
- [@hagezi](https://github.com/hagezi)
- [@iam-py-test](https://github.com/iam-py-test)
43 changes: 22 additions & 21 deletions SOURCES.md
@@ -1,25 +1,26 @@
## Sources
# Sources

All data retrieved from the following sources are publicly available.

Sources marked as inactive are not being automatically employed to retrieve domains.

Source | Type | Inactive | Excluded from light
:--- |:--- |:--- |:---
[ANFRAS](https://anfras.com/fakeshops/) | Fake | Yes | -
[Artists Against 419](https://db.aa419.org/fakebankslist.php) | Advance-fee | |
[DFPI's Crypto Scam Tracker](https://dfpi.ca.gov/crypto-scams/) | Crypto | Yes | -
[Google's Custom Search JSON API](https://developers.google.com/custom-search/v1/introduction) | Fake | |
[GunTab](https://www.guntab.com/scam-websites) | Firearm | | Yes
[Hagezi's NRD List](https://github.com/hagezi/dns-blocklists?tab=readme-ov-file#nrd) | NRD | - | -
[PetScams.com](https://petscams.com/) | Pet | |
[Scam Directory](https://scam.directory/) | Any | |
[Scam.Delivery](https://scam.delivery/) | Non-delivery | Yes | -
[ScamAdvisor](https://www.scamadviser.com/) | Any | |
[Shreshta's NRD List](https://github.com/shreshta-labs/newly-registered-domains) | NRD | - | -
[Stop 419 Scams and Scammers](https://www.stop419scams.com/) | Any | Yes | -
[StopGunScams.com](https://stopgunscams.com/) | Firearm | |
[Tranco List](https://tranco-list.eu/) | Toplist | - | -
[dnstwist](https://github.com/elceef/dnstwist) | Phishing | | Yes
[openSquat](https://github.com/atenreiro/opensquat) | Phishing | | Yes
[r/CryptoScamBlacklist](https://www.reddit.com/r/CryptoScamBlacklist/) | Crypto | Yes | -
[r/Scams](https://www.reddit.com/r/Scams/) | Any | Yes | -
| Source | Type | Inactive | Excluded from light |
|:--- |:--- |:--- |:--- |
| [ANFRAS](https://anfras.com/fakeshops/) | Fake | Yes | - |
| [Artists Against 419](https://db.aa419.org/fakebankslist.php) | Advance-fee | | |
| [DFPI's Crypto Scam Tracker](https://dfpi.ca.gov/crypto-scams/) | Crypto | Yes | - |
| [Google's Custom Search JSON API](https://developers.google.com/custom-search/v1/introduction) | Fake | | |
| [GunTab](https://www.guntab.com/scam-websites) | Firearm | | Yes |
| [Hagezi's NRD List](https://github.com/hagezi/dns-blocklists?tab=readme-ov-file#nrd) | NRD | - | - |
| [PetScams.com](https://petscams.com/) | Pet | | |
| [Scam Directory](https://scam.directory/) | Any | | |
| [Scam.Delivery](https://scam.delivery/) | Non-delivery | Yes | - |
| [ScamAdvisor](https://www.scamadviser.com/) | Any | | |
| [Shreshta's NRD List](https://github.com/shreshta-labs/newly-registered-domains) | NRD | - | - |
| [Stop 419 Scams and Scammers](https://www.stop419scams.com/) | Any | Yes | - |
| [StopGunScams.com](https://stopgunscams.com/) | Firearm | | |
| [Tranco List](https://tranco-list.eu/) | Toplist | - | - |
| [dnstwist](https://github.com/elceef/dnstwist) | Phishing | | Yes |
| [openSquat](https://github.com/atenreiro/opensquat) | Phishing | | Yes |
| [r/CryptoScamBlacklist](https://www.reddit.com/r/CryptoScamBlacklist/) | Crypto | Yes | - |
| [r/Scams](https://www.reddit.com/r/Scams/) | Any | Yes | - |
31 changes: 27 additions & 4 deletions functions/update_readme.sh
@@ -14,7 +14,9 @@ readonly YESTERDAY
update_readme() {
cat << EOF > README.md
# Jarelllama's Scam Blocklist
Blocklist for scam site domains automatically retrieved daily from Google Search and public databases. Automated retrieval is done daily at 00:00 UTC.
Blocklist for scam site domains automatically retrieved daily from Google Search and public sources. Automated retrieval is done daily at 00:00 UTC.
| Format | Syntax |
| --- | --- |
| [Adblock Plus](https://raw.githubusercontent.com/jarelllama/Scam-Blocklist/main/lists/adblock/scams.txt) | \|\|scam.com^ |
@@ -24,9 +26,11 @@ Blocklist for scam site domains automatically retrieved daily from Google Search
| [Wildcard Domains](https://raw.githubusercontent.com/jarelllama/Scam-Blocklist/main/lists/wildcard_domains/scams.txt) | scam.com |
## Statistics
[![Build and deploy](https://github.com/jarelllama/Scam-Blocklist/actions/workflows/build_deploy.yml/badge.svg)](https://github.com/jarelllama/Scam-Blocklist/actions/workflows/build_deploy.yml)
[![Test functions](https://github.com/jarelllama/Scam-Blocklist/actions/workflows/test_functions.yml/badge.svg)](https://github.com/jarelllama/Scam-Blocklist/actions/workflows/test_functions.yml)
\`\`\`
\`\`\` text
Total domains: $(wc -l < "$RAW")
Statistics for each source:
@@ -48,9 +52,11 @@ $(print_stats)
*Only active sources are shown. See the full list of
sources in SOURCES.md.
\`\`\`
All data retrieved are publicly available and can be viewed from their respective [sources](https://github.com/jarelllama/Scam-Blocklist/blob/main/SOURCES.md).
## Light version
Targeted at list maintainers, a light version of the blocklist is available in the [lists](https://github.com/jarelllama/Scam-Blocklist/tree/main/lists) directory.
<details>
@@ -70,42 +76,50 @@ Total domains: $(wc -l < "$RAW_LIGHT")
## Sources
### Google Search API
Google provides a [Search API](https://developers.google.com/custom-search/v1/overview) to retrieve JSON-formatted results from Google Search. The script uses a list of search terms almost exclusively used in scam sites to retrieve domains. See the list of search terms here: [search_terms.csv](https://github.com/jarelllama/Scam-Blocklist/blob/main/config/search_terms.csv)
#### Effectiveness
Scam sites often do not have long lifespans; malicious domains may be replaced before they can be manually reported. By programmatically searching Google using paragraphs from real-world scam sites, new domains can be added as soon as Google crawls the site. This requires no manual reporting.
The list of search terms is proactively updated and is mostly sourced from investigating new scam site templates seen on [r/Scams](https://www.reddit.com/r/Scams/).
#### Limitations
The Google Custom Search JSON API only provides ~100 daily free search queries per API key (which is why this project uses two API keys).
To optimize the number of search queries made, each search term is frequently benchmarked on its number of new domains and false positives. Underperforming search terms are flagged and disabled.
#### Statistics for Google Search source
\`\`\`
\`\`\` text
Active search terms: $(csvgrep -c 2 -m 'y' -i "$SEARCH_TERMS" | tail -n +2 | wc -l)
Queries made today: $(csvgrep -c 1 -m "$TODAY" "$SOURCE_LOG" | csvgrep -c 2 -m 'Google Search' | csvcut -c 12 | awk '{sum += $1} END {print sum}')
Domains retrieved today: $(sum "$TODAY" 'Google Search')
\`\`\`
### openSquat
[openSquat](https://github.com/atenreiro/opensquat) is an open-source tool for detecting cybersquatting domains. The detection algorithm takes a list of keywords as input and checks for common cybersquatting techniques like [Typosquatting](https://en.wikipedia.org/wiki/Typosquatting), [Doppelganger Domains](https://en.wikipedia.org/wiki/Doppelganger_domain), and [IDN Homograph Attacks](https://en.wikipedia.org/wiki/IDN_homograph_attack).
The keywords are handpicked and include common targets of phishing campaigns such as Google, Amazon, USPS, etc. while also taking into consideration potential false positives. The list of keywords can be viewed here: [opensquat_keywords.txt](https://github.com/jarelllama/Scam-Blocklist/blob/main/config/opensquat_keywords.txt)
To further minimize false positives, the automated detection is intentionally limited to Newly Registered Domains (NRD) comprising domains registered within the last 10 days.
#### Effectiveness
New phishing domains are created daily, and unlike other sources that depend on manual reporting, openSquat can effectively detect new phishing domains within days of their registration date. This is aided by an actively updated NRD feed for openSquat to process. The NRD feed can be viewed here: [nrd.txt](https://github.com/jarelllama/Scam-Blocklist/blob/main/lists/wildcard_domains/nrd.txt)
#### Limitations
Because the detection is fully automated, false positives may slip through despite the intensive effort put into handpicking and testing the list of keywords.
For this reason, the openSquat source is not included in the light version of the blocklist.
#### Statistics for openSquat source
\`\`\`
\`\`\` text
Active keywords: $(wc -l < config/opensquat_keywords.txt)
Domains retrieved today: $(sum "$TODAY" 'openSquat')
Domains in NRD feed: $(wc -l < lists/wildcard_domains/nrd.txt | rev | sed 's/\(...\)/\1,/g' | sed 's/,$//' | rev)
@@ -118,6 +132,7 @@ All sources used presently or in the past are recorded here: [SOURCES.md](https:
The domain retrieval process for all sources can be viewed in the repository's code.
## Filtering process
- The domains collated from all sources are filtered against a whitelist (scam reporting sites, forums, vetted stores, etc.)
- The domains are checked against the [Tranco Top Sites Ranking](https://tranco-list.eu/) for potential false positives which are then vetted manually
- Common subdomains like 'www' are removed to make use of wildcard matching for all other subdomains
@@ -127,32 +142,40 @@ The domain retrieval process for all sources can be viewed in the repository's c
The full filtering process can be viewed in the repository's code.
## Dead domains
Dead domains are removed daily using AdGuard's [Dead Domains Linter](https://github.com/AdguardTeam/DeadDomainsLinter). Note that domains acting as wildcards are excluded from this process.
Dead domains that are resolving again are included back into the blocklist.
## Parked domains
From initial testing, [9%](https://github.com/jarelllama/Scam-Blocklist/commit/84e682fea95866670dd99f5c98f350bc7377011a) of the blocklist consisted of [parked domains](https://www.godaddy.com/resources/ae/skills/parked-domain) that inflate the number of entries. Because these domains pose no real threat (besides the obnoxious advertising), they are removed from the blocklist daily. A list of common parked domain messages is used to detect these domains and can be viewed here: [parked_terms.txt](https://github.com/jarelllama/Scam-Blocklist/blob/main/config/parked_terms.txt)
If these parked sites no longer contain any of the parked messages, they are assumed to be unparked and are added back into the blocklist.
## Why the Hosts format is not supported
Malicious domains often have [wildcard DNS records](https://developers.cloudflare.com/dns/manage-dns-records/reference/wildcard-dns-records/) that allow scammers to create large amounts of subdomain records, such as 'random-subdomain.scam.com'. Each subdomain can point to a separate scam site and collating them all would inflate the blocklist size. Therefore, only formats supporting wildcard matching are built.
## Resources
- [AdGuard's Dead Domains Linter](https://github.com/AdguardTeam/DeadDomainsLinter): tool for checking Adblock rules for dead domains
- [Google's Shell Style Guide](https://google.github.io/styleguide/shellguide.html): Shell script style guide
- [Legality of web scraping](https://www.quinnemanuel.com/the-firm/publications/the-legal-landscape-of-web-scraping/): the law firm of Quinn Emanuel Urquhart & Sullivan's memoranda on web scraping
- [ShellCheck](https://github.com/koalaman/shellcheck): shell script static analysis tool
- [who.is](https://who.is/): WHOIS and DNS lookup tool
## See also
- [Durablenapkin's Scam Blocklist](https://github.com/durablenapkin/scamblocklist)
- [Elliotwutingfeng's Global Anti-Scam Organization Blocklist](https://github.com/elliotwutingfeng/GlobalAntiScamOrg-blocklist)
- [Elliotwutingfeng's Inversion DNSBL Blocklist](https://github.com/elliotwutingfeng/Inversion-DNSBL-Blocklists)
- [Hagezi's DNS Blocklists](https://github.com/hagezi/dns-blocklists) (uses this blocklist as a source)
## Appreciation
Thanks to the following people for the help, inspiration, and support!
- [@bongochong](https://github.com/bongochong)
- [@hagezi](https://github.com/hagezi)
- [@iam-py-test](https://github.com/iam-py-test)