Blocklist size growth #4830

Closed
friendly-bits opened this issue Jan 10, 2025 · 22 comments
Labels: in progress (A solution is being worked on), question (Further information is requested)

Comments

@friendly-bits

Hi Hagezi,

First of all: thanks for the great work!

Second: I am a contributor to the adblock-lean project (implementing adblocking on OpenWrt). adblock-lean recommends your blocklists to users and currently includes 4 pre-defined presets, each one intended for devices with certain memory capacity (64MiB/128MiB/256MiB/512MiB+).

A few months ago when we came up with these presets, we were able to more or less perfectly balance them based on a selection of your blocklists.

However, it seems that in the past few months the domains count in some of the blocklists grew significantly. So for example, the combination of Pro and tif.mini (which is included in our preset intended for devices with 128MiB of memory) grew from ~250k domains a few months ago to ~311k domains now (numbers after deduplication). This is already borderline too much for these devices. Since adblock-lean (like many other adblockers) implements and by default enables automatic blocklist updates, we may soon be getting into territory where users start getting OOMs and dnsmasq crashes.

So I feel that this needs to be addressed.

I am not entirely sure, but it seems to me that the main contributors to the domain count growth are the TIF lists. My current idea for dealing with the situation on the 128MiB routers is to downgrade them from Pro to Pro.mini. However, this is not ideal because it effectively trades ad blocking for blocking of harmful domains, which is not necessarily the best tradeoff.

So we would like to ask:

  1. Is it possible to keep the domains count in blocklists more or less constant?
  2. Is it possible to have another TIF blocklist, smaller than current 'tif.mini'?

Thank you again! Your work is very highly appreciated by many people, including us and our users.

@friendly-bits added the question label Jan 10, 2025
@hagezi
Owner

hagezi commented Jan 10, 2025

Hi @friendly-bits,

I have made some optimisations to the lists, and some malicious NRDs are now also blocked (Normal to Ultimate). Furthermore, I can barely keep up with active malicious domains at the moment. They are springing up like mushrooms.

For adblockers that have problems with the list size, the mini versions are available, e.g. Pro mini + TIF mini

Alternatively, use the Light, Normal or Pro alone. A useful combination would be to use the Pro alone and also use a Secure DNS such as Quad9 as an upstream.

It is impossible to keep the lists compatible with all adblockers. I don't want the effectiveness of the normal lists to suffer, especially since they don't contain dead domains or domains that will never be queried anyway. That's why there is the mini option. With the mini version, the size remains more or less constant because it only contains domains from current Top 1M lists. The exception is the TIF mini, which has become significantly larger due to the growing volume of malware, scam, phishing and malicious NRDs.

Somewhere you have to make compromises if the “AdBlocker” cannot cope with large lists. It is impossible to achieve the effectiveness of the Pro with half of the domains.
The Mini versions are designed to cover almost all popular ads and tracking - especially since all Mini versions include Light. They are less effective with popup ads and block almost no TIF domains.

What is the target value for desired combinations?
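
As an illustration of the mini approach described above (keeping only entries that appear in current Top 1M lists), here is a minimal Python sketch. It is not the actual build process: the function name `build_mini` is hypothetical, and matching a blocklist entry against its parent domains is an assumption about how "contained in a Top 1M list" might be decided.

```python
def build_mini(blocklist, top_domains):
    """Keep blocklist entries that are in, or are subdomains of, a top-list domain.

    Hypothetical helper: the parent-domain matching rule is an assumption,
    not something stated in this thread.
    """
    # Normalize the top list once so lookups are cheap set membership tests.
    top = {d.strip().lower().rstrip(".") for d in top_domains if d.strip()}
    kept = []
    for entry in blocklist:
        domain = entry.strip().lower().rstrip(".")
        if not domain:
            continue
        labels = domain.split(".")
        # Test the domain itself and each parent suffix,
        # e.g. ads.example.com, then example.com, then com.
        if any(".".join(labels[i:]) in top for i in range(len(labels))):
            kept.append(domain)
    return kept
```

Because the output can never contain more entries than the top lists allow matches for, the size of such a list is bounded by the popularity data rather than by how many new domains the source feeds deliver.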

@hagezi
Owner

hagezi commented Jan 10, 2025

@friendly-bits For the TIF.mini I could deactivate the NRDs, some of which are already included in the Pro (only from phishing/scam feeds). I'll test that out.

@hagezi
Owner

hagezi commented Jan 10, 2025

I'll have another look at what I can do with the Pro without limiting its effectiveness.

@friendly-bits
Author

friendly-bits commented Jan 10, 2025

Thank you for the super prompt reply. In order to not make you wait, I'll write a short reply now and follow up with a bit more details later, if needed.

The target maximum domains count for the 128MiB routers is <300k entries total (that includes the pro list and the tif.mini list, deduplicated - as adblock-lean by default deduplicates the blocklists). Currently we have ~311k. Of course, we prefer the actual count to be a bit smaller than 300k, in order to have some headroom. So let's say 250k total.

I also want to better articulate our request. We are not asking you to make blocklists which fit adblock-lean's specific needs. We can and we do pick a combination of available blocklists which fits the various target devices of our users. The fundamental issue is that the size is growing over time, so a user who picked the then-optimal preset (which even had enough headroom for size fluctuations) 5 months ago may suddenly get a misbehaving router after adblock-lean pulls the updated blocklist. So our request is to keep the blocklist sizes constant. I'm not sure what the best strategy for this would be. Maybe just having a small selection of blocklists with fixed sizes (+/- some margin). Maybe when the largest of these lists overflows, add a new blocklist with the next size gradation. There is no need to make too many lists available because standard memory sizes double from one tier to the next. So we are talking about 4 or 5 size gradations which would cover everything from 64MiB to 1024MiB+ devices.

I am pretty sure that this will help not only adblock-lean but virtually every adblocker running on devices with limited memory capacity.
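
To make the numbers above concrete, here is a rough Python sketch of counting the deduplicated size of a list combination and comparing it with the 128MiB budget (under 300k entries, ideally around 250k). This is not adblock-lean's actual processing. The function names are hypothetical, and the input format (one domain per line, `#` comments) is an assumption.

```python
def parse_domains(lines):
    """Yield normalized domains, skipping blank lines and '#' comments.

    Assumes a plain one-domain-per-line format.
    """
    for line in lines:
        domain = line.split("#", 1)[0].strip().lower().rstrip(".")
        if domain:
            yield domain


def combined_count(*lists):
    """Return the number of unique domains across all given lists."""
    unique = set()
    for lines in lists:
        unique.update(parse_domains(lines))
    return len(unique)


def check_budget(count, hard_limit=300_000, target=250_000):
    """Classify a deduplicated count against the limits quoted in this thread."""
    if count >= hard_limit:
        return "over limit"
    if count > target:
        return "within limit, low headroom"
    return "ok"
```

With the figures from this thread, `check_budget(311_000)` reports "over limit", which matches the situation that prompted this issue. A check like this could run after each blocklist update so that growth is caught before the router runs out of memory.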

@hagezi
Owner

hagezi commented Jan 10, 2025

@friendly-bits Thank you, I understand that. I can try to optimize the lists, but it is almost impossible to keep TIF lists somehow constant. Large jumps are almost normal in these lists - depending on what the feeds deliver. For a small amount of RAM, I recommend not using TIF lists in order to have enough headroom.

@friendly-bits
Author

We may have to give up on TIF lists for the smaller memory capacities then. If there is no choice then we will do it. That said, this would be unfortunate. If it's a technical issue then I think it might be solvable with a bit of automation. I could contribute a shell+awk script which would automate that if this helps.

@hagezi
Owner

hagezi commented Jan 10, 2025

@friendly-bits I've made a few changes, let's see where we end up with the next release in a few hours. I am currently also struggling with the size of the TIF full and am trying to get this under control.

@friendly-bits
Author

Thank you! And I am serious about the offer to help. So please feel free to ping me if/when you are interested. I helped implement blocklist processing in adblock-lean, which (AFAIK) currently has the fastest and most memory efficient processing among the available adblockers for that platform.

@yokoffing
Contributor

> @hagezi For the TIF.mini I could deactivate the NRDs

I think that makes sense.

@hagezi
Owner

hagezi commented Jan 10, 2025

@friendly-bits

Pro       196275
TIF mini  84032

Let's see how the size changes over the next few releases.

@jarelllama
Contributor

@hagezi would it be worth removing parked domains from the mini list? I understand some parked domains may still host malicious content in their subfolders, but removing parked domains from your latest TIF mini would remove ~4000 domains.

I did a test run here: https://github.com/jarelllama/Parked-Checker/actions/runs/12720680472. With the resulting parked domains here: https://github.com/jarelllama/Parked-Checker/blob/main/data/parked_domains.txt

@hagezi
Owner

hagezi commented Jan 11, 2025

Many thanks @jarelllama, I hadn't even seen that you have a checker for it, great. That would definitely be something for the other TIF versions too.

@jarelllama
Contributor

jarelllama commented Jan 11, 2025

> Many thanks @jarelllama, I hadn't even seen that you have a checker for it, great. That would definitely be something for the other TIF versions too.

That repo is more of a proof of concept I came up with after reading this issue. It's using the same code as in my scam blocklist, which I admit is a little janky, and I have not tested it with anything larger than my own blocklist. So I am unsure if it could apply practically to your TIF lists.

If you are interested, we can discuss this further. The code can be found here: https://github.com/jarelllama/Scam-Blocklist/blob/main/scripts/check_parked.sh
Currently my workflow for my own blocklist checks for unparked domains daily and adds them back, while checking for parked domains weekly to remove them.
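
The daily/weekly schedule described above can be sketched roughly as follows in Python. This is not the code from `check_parked.sh`. The function `run_cycle` and the injected `is_parked` predicate are hypothetical names used only for illustration, and how a domain is actually detected as parked is left out entirely.

```python
def run_cycle(active, parked, is_parked, weekly):
    """Run one maintenance cycle and return the updated (active, parked) sets.

    active:    domains currently in the blocklist
    parked:    domains previously removed because they were parked
    is_parked: hypothetical callable, domain -> bool
    weekly:    True when the weekly parked check is due
    """
    active, parked = set(active), set(parked)

    # Daily step: domains that no longer look parked go back into the list.
    unparked = {d for d in parked if not is_parked(d)}
    active |= unparked
    parked -= unparked

    # Weekly step: domains that now look parked are set aside.
    if weekly:
        newly_parked = {d for d in active if is_parked(d)}
        active -= newly_parked
        parked |= newly_parked

    return active, parked
```

Keeping the removed domains in a separate pool, rather than discarding them, is what makes the cheaper daily re-check possible: only the parked pool has to be probed each day, not the whole list.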

@hagezi
Owner

hagezi commented Jan 11, 2025

Thanks @jarelllama, an implementation like your scam lists would be interesting, at least for the TIF.medium and Ultimate. Preferably also for the TIF full. This will then run for days, but that's not a problem, it doesn't have to be checked daily. What do you think?

If you have the time and interest to implement this here, or to expand your parked checker, please feel free. If you want to implement it here in the repo, I can give you the appropriate rights to the repo.

@friendly-bits
Author

To update on the current situation with our preset targeting the 128MiB routers: it's back into reasonable territory now with 255k domains after deduplication.

So the immediate problem has been solved, for now. Thank you @hagezi and @jarelllama. I'm still not sure whether we can and should count on list sizes staying more or less fixed in the long term. This is important in the context of embedded devices which people typically set up once and then they just run with those settings for a long time.

@hagezi added the in progress label Jan 11, 2025
@jarelllama
Contributor

> Thanks @jarelllama, an implementation like your scam lists would be interesting, at least for the TIF.medium and Ultimate. Preferably also for the TIF full. This will then run for days, but that's not a problem, it doesn't have to be checked daily. What do you think?
>
> If you have the time and interest to implement this here, or to expand your parked checker, please feel free. If you want to implement it here in the repo, I can give you the appropriate rights to the repo.

I'll give it some thought. I would not recommend removing parked domains from the main TIF, since some parked domains still host malicious content in their subfolders.

I also recommend checking for unparked domains as some of these domains do get unparked after some time. This would be similar to checking for dead domains that have become resolving again. Perhaps if I have insight into how you currently handle dead and resurrected domains in your build process, I can think of how to integrate the parked domain check.

@hagezi
Owner

hagezi commented Jan 11, 2025

Yes @jarelllama, thank you, you're right. I'll have to give it some thought. The domains identified as dead are also regularly checked for domains that are active again. I also count SERVFAIL as dead, as they often reappear as NOERROR.
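
For readers following along, the dead-domain handling described above might look roughly like this in Python. This is only a sketch, not the actual build code. Treating NXDOMAIN as dead is an assumption (only SERVFAIL is mentioned in this thread), and `lookup_rcode` is a hypothetical callable that returns the DNS response code for a domain as a string.

```python
# Response codes treated as "dead". SERVFAIL is from this thread;
# NXDOMAIN is an assumption added for the sketch.
DEAD_RCODES = {"NXDOMAIN", "SERVFAIL"}


def is_dead(rcode):
    """Return True if a DNS response code marks the domain as dead."""
    return rcode.upper() in DEAD_RCODES


def recheck_dead(dead_domains, lookup_rcode):
    """Split previously dead domains into (revived, still_dead).

    A domain counts as revived only when a fresh lookup returns NOERROR.
    lookup_rcode is a hypothetical callable: domain -> rcode string.
    """
    revived, still_dead = [], []
    for domain in dead_domains:
        if lookup_rcode(domain).upper() == "NOERROR":
            revived.append(domain)
        else:
            still_dead.append(domain)
    return revived, still_dead
```

Since SERVFAIL results often turn back into NOERROR later, the regular re-check is what keeps such domains from being dropped permanently.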

@friendly-bits
Author

friendly-bits commented Jan 12, 2025

@jarelllama I took a peek at your script. First [edited out]. Second, I think some optimizations could make the script faster. Don't know if by much or not. Would you be interested in some PRs from me?

@jarelllama
Contributor

jarelllama commented Jan 12, 2025

@friendly-bits indeed some malicious sites can fake being parked like that. I guess it matters less to my scam blocklist since most scam/phishing sites are meant to look genuine. This is also why I think checking for parked domains is more suited for TIF mini, and not the more aggressive lists; not all parked domains are benign.

Feel free to make PRs! The parked check script is something I had to come up with myself since there wasn't much reference material online, hence the jank.

Do note that in the script I had to split the different functions into two parts to get around the GitHub job time limit.

@hagezi
Owner

hagezi commented Jan 14, 2025

I have tested the removal of “parked domains”. It is too risky to remove them for the TIF.

@hagezi
Owner

hagezi commented Jan 14, 2025

The size seems to be stable, but I don't know what it will look like in a few weeks. I can't hold the size fixed; for now I can't do more than remove inactive domains.
If there is insufficient RAM, the use of a TIF version will have to be abandoned.

@hagezi closed this as completed Jan 14, 2025
@jarelllama
Contributor

> I have tested the removal of “parked domains”. It is too risky to remove them for the TIF.

All good 👍 If I have any more ideas on how to reduce blocklist size I'll let you know.
