Update readme
jarelllama authored Apr 3, 2024
1 parent 8aec928 commit 2658707
Showing 3 changed files with 64 additions and 15 deletions.
40 changes: 32 additions & 8 deletions README.md
@@ -43,16 +43,18 @@ Targeted at list maintainers, a light version of the blocklist is available in t
<ul>
<li>Intended for collated blocklists cautious about size</li>
<li>Does not use sources whose domains cannot be filtered by date added</li>
<li>Only retrieves domains added in the last month by their respective sources (this is not the same as the domain registration date), whereas the full blocklist includes domains added from 2 months back and onwards</li>
<li>! Dead and parked domains that become resolving/unparked are not added back to the blocklist (due to limitations in the way these domains are recorded)</li>
<li>Does not use sources that have an above average false positive rate</li>
<li>Note that dead and parked domains that become alive/unparked are not added back into the blocklist (due to limitations in the way these domains are recorded)</li>
</ul>
Sources excluded from the light version are marked in SOURCES.md.
<br>
<br>
Total domains: 1970
</details>

## Retrieving scam domains from Google Search
## Sources

### Google Search API
Google provides a [Search API](https://developers.google.com/custom-search/v1/overview) to retrieve JSON-formatted results from Google Search. The script uses a list of search terms almost exclusively used in scam sites to retrieve domains. See the list of search terms here: [search_terms.csv](https://github.com/jarelllama/Scam-Blocklist/blob/main/config/search_terms.csv)
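
As a rough illustration of the last step, result links can be reduced to bare domains before they are collated. In the Custom Search JSON response the links sit in the `items[].link` field. The function name `links_to_domains` is hypothetical, and this is a minimal sketch under those assumptions, not necessarily how the repository's script does it.

```shell
#!/bin/bash
# Hypothetical helper: reduce result URLs (one per line on stdin) to bare domains.
links_to_domains() {
    # 1. drop the scheme, 2. drop everything from the first '/', '?' or ':',
    # 3. drop a leading "www.", then de-duplicate.
    sed -E -e 's#^[a-zA-Z]+://##' -e 's#[/?:].*$##' -e 's#^www\.##' | sort -u
}
```

For example, `printf '%s\n' 'https://www.scam-site.example/path' | links_to_domains` prints `scam-site.example`.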

#### Rationale
@@ -72,7 +74,32 @@ Queries made today: 317
Domains retrieved today: 51
```

#### Regarding other sources
### openSquat
[openSquat](https://github.com/atenreiro/opensquat) is an open-source tool for detecting malicious domains. The tool takes a list of keywords as input for its detection algorithm which checks for common cybersquatting techniques like [Typosquatting](https://en.wikipedia.org/wiki/Typosquatting), [Doppelganger Domains](https://en.wikipedia.org/wiki/Doppelganger_domain), and [IDN Homograph Attacks](https://en.wikipedia.org/wiki/IDN_homograph_attack).

The keywords are handpicked and include common targets of phishing campaigns such as Google, WhatsApp, USPS, etc. while also taking into consideration potential false positives. The list of keywords can be viewed here: [opensquat_keywords.txt](https://github.com/jarelllama/Scam-Blocklist/blob/main/config/opensquat_keywords.txt)

To further minimize false positives, the automated detection is intentionally limited to Newly Registered Domains (NRD) comprising domains registered within the last 10 days.
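
The idea of restricting detection to an NRD feed can be sketched with a plain keyword match. This is only an illustration: openSquat's own algorithm also checks for typosquatting and homograph variants, which a substring match does not. The function name `match_keywords` and the one-entry-per-line file layout are assumptions.

```shell
#!/bin/bash
# Hypothetical sketch: keep only NRD feed entries that contain a keyword.
# A plain fixed-string match, NOT openSquat's detection algorithm.
match_keywords() {
    local keywords_file="$1"  # assumed: one keyword per line, no blank lines
    local nrd_file="$2"       # assumed: one domain per line
    # -F fixed strings, -i case-insensitive, -f read patterns from a file
    grep -iFf "$keywords_file" "$nrd_file" | sort -u
}
```

A blank line in the keywords file would match every domain, so the keyword list should be cleaned of empty lines first.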

#### Rationale
New phishing domains are created daily, and unlike other sources that require manual reporting, openSquat can effectively detect new phishing domains within days of the registration date. This is aided by an actively updated NRD feed provided here: [nrd.txt](https://github.com/jarelllama/Scam-Blocklist/blob/main/lists/wildcard_domains/nrd.txt)

#### Limitations
Because the detection is fully automated, false positives may slip through despite the intensive effort put into handpicking and testing the list of keywords.

For this reason, the openSquat source is not included in the light version of the blocklist.

#### Statistics for openSquat source
```
Active keywords:
Domains retrieved today:
Domains in NRD feed:
```

### Other sources

The other sources, active or inactive, can be viewed here: [SOURCES.md](https://github.com/jarelllama/Scam-Blocklist/blob/main/SOURCES.md).

The full domain retrieval process for all sources can be viewed in the repository's code.

## Filtering process
@@ -96,12 +123,9 @@ If these parked sites no longer contain any of the parked messages, they are ass
## Why the Hosts format is not supported
Malicious domains often have [wildcard DNS records](https://developers.cloudflare.com/dns/manage-dns-records/reference/wildcard-dns-records/) that allow scammers to create large amounts of subdomain records, such as 'random-subdomain.scam.com'. Each subdomain can point to a separate scam site and collating them all would inflate the blocklist size. Therefore, only formats supporting wildcard matching are built.
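
To make the difference concrete: a Hosts entry such as `0.0.0.0 scam.com` covers only that exact hostname, while an Adblock-style rule `||scam.com^` also covers its subdomains. The sketch below converts a domain list into such rules. The function name `to_adblock` is hypothetical, and Adblock syntax is used here only as one example of a wildcard-capable format; this section does not state which formats the repository builds.

```shell
#!/bin/bash
# Hypothetical sketch: turn domains (one per line on stdin) into
# Adblock-style rules, which match the domain and all of its subdomains.
to_adblock() {
    # Drop empty lines, then wrap each domain as ||domain^
    sed -e '/^$/d' -e 's/^/||/' -e 's/$/^/'
}
```

With this, a single rule for `scam.com` stands in for every `random-subdomain.scam.com` entry that a Hosts file would need to list individually.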

## Sources
Moved to [SOURCES.md](https://github.com/jarelllama/Scam-Blocklist/blob/main/SOURCES.md).

## Resources
- [AdGuard's Dead Domains Linter](https://github.com/AdguardTeam/DeadDomainsLinter): tool for checking Adblock rules for dead domains
- [Google's Shell Style Guide](https://google.github.io/styleguide/shellguide.html): shell script style guide
- [Google's Shell Style Guide](https://google.github.io/styleguide/shellguide.html): Shell script style guide
- [Legality of web scraping](https://www.quinnemanuel.com/the-firm/publications/the-legal-landscape-of-web-scraping/): the law firm of Quinn Emanuel Urquhart & Sullivan's memoranda on web scraping
- [ShellCheck](https://github.com/koalaman/shellcheck): shell script static analysis tool
- [who.is](https://who.is/): WHOIS and DNS lookup tool
4 changes: 2 additions & 2 deletions functions/test_functions.sh
@@ -179,7 +179,7 @@ TEST_DEAD_CHECK() {

cp "$RAW" "$RAW_LIGHT"
# Expected output for light version
# (resurrected domains are not added back to light)
# (resurrected domains are not added back into light)
grep -vxF 'google.com' out_raw.txt > out_raw_light.txt

# Run script and check exit status
@@ -223,7 +223,7 @@ TEST_PARKED_CHECK() {

cp "$RAW" "$RAW_LIGHT"
# Expected output for light version
# (Unparked domains are not added back to light)
# (Unparked domains are not added back into light)
grep -vxF 'google.com' out_raw.txt > out_raw_light.txt

# Run script and check exit status
35 changes: 30 additions & 5 deletions functions/update_readme.sh
@@ -58,16 +58,16 @@ Targeted at list maintainers, a light version of the blocklist is available in t
<ul>
<li>Intended for collated blocklists cautious about size</li>
<li>Does not use sources whose domains cannot be filtered by date added</li>
<li>Only retrieves domains added in the last month by their respective sources (this is not the same as the domain registration date), whereas the full blocklist includes domains added from 2 months back and onwards</li>
<li>! Dead and parked domains that become resolving/unparked are not added back to the blocklist (due to limitations in the way these domains are recorded)</li>
<li>Does not use sources that have an above average false positive rate</li>
<li>Note that dead and parked domains that become alive/unparked are not added back into the blocklist (due to limitations in the way these domains are recorded)</li>
</ul>
Sources excluded from the light version are marked in SOURCES.md.
<br>
<br>
Total domains: $(wc -l < "$RAW_LIGHT")
</details>
## Retrieving scam domains from Google Search
### Google Search API
Google provides a [Search API](https://developers.google.com/custom-search/v1/overview) to retrieve JSON-formatted results from Google Search. The script uses a list of search terms almost exclusively used in scam sites to retrieve domains. See the list of search terms here: [search_terms.csv](https://github.com/jarelllama/Scam-Blocklist/blob/main/config/search_terms.csv)
#### Rationale
@@ -87,7 +87,32 @@ Queries made today: $(csvgrep -c 1 -m "$TODAY" "$SOURCE_LOG" | csvgrep -c 2 -m '
Domains retrieved today: $(sum "$TODAY" 'Google Search')
\`\`\`
#### Regarding other sources
### openSquat
[openSquat](https://github.com/atenreiro/opensquat) is an open-source tool for detecting malicious domains. The tool takes a list of keywords as input for its detection algorithm which checks for common cybersquatting techniques like [Typosquatting](https://en.wikipedia.org/wiki/Typosquatting), [Doppelganger Domains](https://en.wikipedia.org/wiki/Doppelganger_domain), and [IDN Homograph Attacks](https://en.wikipedia.org/wiki/IDN_homograph_attack).
The keywords are handpicked and include common targets of phishing campaigns such as Google, WhatsApp, USPS, etc. while also taking into consideration potential false positives. The list of keywords can be viewed here: [opensquat_keywords.txt](https://github.com/jarelllama/Scam-Blocklist/blob/main/config/opensquat_keywords.txt)
To further minimize false positives, the automated detection is intentionally limited to Newly Registered Domains (NRD) comprising domains registered within the last 10 days.
#### Rationale
New phishing domains are created daily, and unlike other sources that require manual reporting, openSquat can effectively detect new phishing domains within days of the registration date. This is aided by an actively updated NRD feed provided here: [nrd.txt](https://github.com/jarelllama/Scam-Blocklist/blob/main/lists/wildcard_domains/nrd.txt)
#### Limitations
Because the detection is fully automated, false positives may slip through despite the intensive effort put into handpicking and testing the list of keywords.
For this reason, the openSquat source is not included in the light version of the blocklist.
#### Statistics for openSquat source
\`\`\`
Active keywords: $(wc -l < config/opensquat_keywords.txt)
Domains retrieved today: $(sum "$TODAY" 'openSquat')
Domains in NRD feed: $(wc -l < lists/wildcard_domains/nrd.txt)
\`\`\`
### Other sources
The other sources, active or inactive, can be viewed here: [SOURCES.md](https://github.com/jarelllama/Scam-Blocklist/blob/main/SOURCES.md).
The full domain retrieval process for all sources can be viewed in the repository's code.
## Filtering process
@@ -116,7 +141,7 @@ Moved to [SOURCES.md](https://github.com/jarelllama/Scam-Blocklist/blob/main/SOU
## Resources
- [AdGuard's Dead Domains Linter](https://github.com/AdguardTeam/DeadDomainsLinter): tool for checking Adblock rules for dead domains
- [Google's Shell Style Guide](https://google.github.io/styleguide/shellguide.html): shell script style guide
- [Google's Shell Style Guide](https://google.github.io/styleguide/shellguide.html): Shell script style guide
- [Legality of web scraping](https://www.quinnemanuel.com/the-firm/publications/the-legal-landscape-of-web-scraping/): the law firm of Quinn Emanuel Urquhart & Sullivan's memoranda on web scraping
- [ShellCheck](https://github.com/koalaman/shellcheck): shell script static analysis tool
- [who.is](https://who.is/): WHOIS and DNS lookup tool