Reliable WordPress detection #8
Comments
I saw this tweet - https://twitter.com/hdjirdeh/status/1092875246309265408 - happy to help. Since WordPress 4.4, there should be a link element in the header with a rel attribute equal to https://api.w.org/ and an href attribute equal to the site's URL. That's pretty good for recent versions. There are some endpoints that have to exist: wp-admin.php, wp-cron.php, etc. I wonder if checking for those is enough? If site.com/wp-admin.php returns a status code for unauthorized access, it's probably a WordPress site -- or a security feature blocked it. |
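A minimal sketch of the header side of this check, assuming the `Link` response header value is already available as a string (for example from recorded response headers). The name `hasWpApiLink` is hypothetical, and the parsing is deliberately simplified; it is not a complete Web Linking parser.

```javascript
// Hypothetical helper: returns true if a Link header value advertises
// the WordPress REST API relation (rel="https://api.w.org/").
function hasWpApiLink(linkHeader) {
  if (typeof linkHeader !== 'string') return false;
  // Simplified split: assumes commas only separate link-values
  // (no commas inside the URLs themselves).
  return linkHeader.split(',').some((part) => {
    const match = /;\s*rel\s*=\s*"?([^";]+)"?/i.exec(part);
    if (!match) return false;
    // rel may hold several space-separated relation types.
    return match[1].trim().split(/\s+/).includes('https://api.w.org/');
  });
}
```

A site that removes or filters this header would simply return false here, so this is one signal among several rather than a definitive test.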
Not a WordPress expert by any means, but here's my two cents from experience:
These two points are what Wappalyzer does; not sure why this is considered overkill. But please, pretty please, don't poke around URLs like |
PS: the Definitely go with |
That's a great option. Thanks.
Making new network requests is out of scope for us, but good thinkin'.
We are considering all network requests, so if the page makes requests to
On my WP sites, in addition to the |
@machour https://developer.wordpress.org/rest-api/using-the-rest-api/frequently-asked-questions/#can-i-disable-the-rest-api recommends against disabling and doesn't even document how to anyway.. But have you seen sites that do disable it for security reasons or something? |
@paulirish here are some insights. The Link header is generated by this function:
function rest_output_link_header() {
if ( headers_sent() ) {
return;
}
$api_root = get_rest_url(); // <--
if ( empty( $api_root ) ) {
return;
}
header( 'Link: <' . esc_url_raw( $api_root ) . '>; rel="https://api.w.org/"', false );
}
So any developer could write a filter that simply returns |
@paulirish ->
Yes, some plugins do this for "security" reasons. That's a) a bad idea with little upside b) probably going to be combined with security through slight obscurity measures like changing the location of wp-login.php or wp-content dir. c) Not super common. |
@machour and @Shelob9 very useful, thank you. For anyone else visiting this thread.. I'm still interested in ideas regarding detection with JS. For example:
// passes if any favicons, stylesheets are provided by the theme
!!document.querySelector('link[href*="wp-content"]')
// same but including scripts, too..
!!document.querySelectorAll('link[href*="wp-content"], script[src*="wp-content"]').length
It's probably possible for a WP site to not trigger the second detect, though IMO it'd be very uncommon. |
Note that both may fail for WordPress sites that are using the AMP plugin, since AMP disallows external stylesheets and custom scripts. If they haven't set a favicon, then there won't be any such links for icons. On the other hand, the majority of WordPress sites keep the generator meta tag intact:
<meta name="generator" content="WordPress 5.0.3">
So the very first thing to check for is whether it exists in the page:
!!document.querySelector('meta[name=generator][content^="WordPress"]') |
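The generator check could be wrapped so that it also reports the version, roughly as sketched here. The helper name `detectWordPressGenerator` and its return shape are assumptions for illustration; it takes a document-like object so it can be exercised outside a browser.

```javascript
// Hypothetical helper: inspects a document-like object for the generator meta tag.
// Returns { detected, version }; version is null when the tag carries no version number.
function detectWordPressGenerator(doc) {
  const meta = doc.querySelector('meta[name=generator][content^="WordPress"]');
  if (!meta) return { detected: false, version: null };
  const content = meta.getAttribute('content') || '';
  // Capture a dotted version such as "5.0.3" when it follows the word "WordPress".
  const match = /^WordPress\s+(\d+(?:\.\d+)*)/.exec(content);
  return { detected: true, version: match ? match[1] : null };
}
```

In a page this would be called as `detectWordPressGenerator(document)`. Sites that strip the generator tag would report `detected: false`, so a fallback signal is still needed.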
@paulirish images uploaded through WordPress are usually available under "/wp-content/uploads/", so you may want to extend the selector to check |
wp-content is the default; it can be changed by setting a constant in wp-config. wp-includes (effectively) can't be changed. |
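Given that, a path-based check over the URLs a page requested might look at both directories, as in this sketch. It assumes the list of request URLs is already collected as strings; `hasWordPressPath` is a hypothetical name, not part of any existing tool.

```javascript
// Hypothetical helper: true if any requested URL path contains a
// WordPress directory (/wp-content/ or /wp-includes/).
function hasWordPressPath(urls) {
  return urls.some((url) => {
    let pathname;
    try {
      pathname = new URL(url).pathname;
    } catch (err) {
      return false; // skip values that are not absolute URLs
    }
    // Only the path is tested, so query strings cannot trigger a match.
    return /\/wp-(?:content|includes)\//.test(pathname);
  });
}
```

Matching on the path only keeps this within the "no response parsing" constraint, since it needs nothing beyond request metadata.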
So it looks like we can do something like this:
I think that should work, especially since |
A PR with this in place: https://github.com/johnmichel/Library-Detector-for-Chrome/pull/131/files |
I think going for file requests is the only (halfway) reliable way here. Maybe the way web-policies work could be an option here. |
@Lefaux I agree that current logic might miss some sites, but we're not shooting for 100%. If you have a prod policy that strips any and all identifiers, a tool auditing the public site may not detect it as such, and that's WAI. However, LH can also be run locally and against development environments; the prod policy can be configured to provide the identifier to your IP or based on logged in state, etc. Which is to say, it's possible to configure the environment to emit the necessary signals, if you're motivated to do so.. The whitelist solution won't scale to a distributed & self-hosted ecosystem, and it opens yet another can of worms about out of band requests to a 3P. |
Yeah, I see the problem with scaling, too. Since the whitelisting stuff is not really an option what's your thought on having a |
Yeah, why not check whether |
How would one protect such an endpoint? As in, exposing such a mechanism would work against the very reason why you were stripping the platform+version information. Further, I don't think we can or should rely on new endpoints..
That requires an additional out of band request which, while not impossible, is something we'd like to avoid. Stepping back, I'll come back to what I said earlier: if your site is designed to hide all platform information, then the fact that LH is not able to detect it is not a bug, it's WAI. Despite that, developers that want to see stack specific advice can still get access to it.. by, for example, configuring the environment to expose those signals under certain conditions. Alternatively, one could also imagine a UI where you can manually pick which stack pack strings LH shows, even if LH is not able to detect that platform itself. Does that seem reasonable? :) |
One requirement for adding a stack pack is reliably detecting that the stack/library/platform is being used by the page. We want this detection to be as reliable and bulletproof as possible.
Wappalyzer uses a few approaches which seem overkill and not something we can reuse. We'd like something much more lightweight.
Primary question: Can we detect WordPress via client-side JS running in the page? (Naturally, it has full access to window and the DOM.)
Secondary question: Is there another reliable detect based on the network request metadata? We'd like to avoid parsing the response of any network resources (so no looking for patterns in HTML, JS or CSS files). But considering response headers or paths in URLs (like wp-content, etc.) is fine.
Could some WordPress experts chime in?