Malformed document page when URL query contains slash #2214

Open
aplhk opened this issue Sep 13, 2021 · 8 comments
Labels
bug Something that does not look or behave correctly

Comments

@aplhk

aplhk commented Sep 13, 2021

I came across a few links from Google search and found that the presence of a slash (/) in the URL query string leads to a malformed, unresponsive document page.

Example of malformed page: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html?example.com/a

I believe the root cause is in the TOC fetching script:

var url = location.href.replace(/[^\/]+$/, 'toc.html');
var toc = $.get(url, {}, function(data) {
  right_col.append(data);
  init_toc(LangStrings);
  utils.open_current(location.pathname);
}).always(function() {
  init_headers(right_col, LangStrings);
});

In this case, location.href is https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html?example.com/a. The regex matches everything after the last slash, which here sits inside the query string, so the replacement yields https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html?example.com/toc.html. Fetching and appending that URL causes an infinite loop and an unresponsive page.
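For illustration, here is a minimal sketch of one possible fix, not the change the maintainers made. It assumes the TOC lives at toc.html next to the current page and builds the URL from the path only, so slashes in the query string or fragment are ignored. The names tocUrl and originalTocUrl are hypothetical and exist only for this example.

```javascript
// Hypothetical helper (not from the docs codebase): derive the TOC URL
// from the path alone, dropping the query string and fragment.
function tocUrl(href) {
  const url = new URL(href);
  // Replace the last path segment (e.g. "indices-create-index.html").
  url.pathname = url.pathname.replace(/[^\/]*$/, 'toc.html');
  url.search = '';
  url.hash = '';
  return url.href;
}

// The expression from the current script, applied to a plain string
// instead of location.href so the two can be compared side by side.
function originalTocUrl(href) {
  return href.replace(/[^\/]+$/, 'toc.html');
}

const href =
  'https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html?example.com/a';

console.log(originalTocUrl(href)); // still points at the document page
console.log(tocUrl(href));         // points at toc.html in the same directory
```

In the page script this would be called as tocUrl(location.href). An alternative with the same effect would be to run the existing regex against location.pathname instead of location.href.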

@gtback
Member

gtback commented Sep 13, 2021

Thanks, @aplhk! 🙇🏻 I can reproduce the error you're seeing.

Are you able to share the Google search and/or pages that directed you to that URL? They seem malformed in the first place, so in addition to fixing the behavior, I'd like to fix the URLs at the source if that's something we have control of.

gtback added the bug label Sep 13, 2021
gtback self-assigned this Sep 13, 2021
@aplhk
Author

aplhk commented Sep 13, 2021

I think this Google dork covers some of the URLs:
site:www.elastic.co/guide inurl:ref

https://www.google.com/search?q=site%3Awww.elastic.co%2Fguide+inurl%3Aref

@gtback
Member

gtback commented Sep 13, 2021

Thanks again, @aplhk!

@AnneB-SEO Do you know where these URLs might be coming from? I don't think we use the ?ref= query parameter anywhere within the docs. Are we able to tell Google not to index these sorts of URLs? I can work on the underlying code that's causing the infinite loop.

@AnneB-SEO

AnneB-SEO commented Sep 14, 2021

> Do you know where these URLs might be coming from?

I'll need to look into it, but at a quick glance it looks like the links could be coming from 3rd-party sites, like hackermoon.co and driverlayer.com.

> I don't think we use the ?ref= query parameter anywhere within the docs.

Likely not.

> Are we able to tell Google not to index these sorts of URLs?

Yes, but only when we are the ones adding the parameters. If they are coming from a 3rd party, then we can't instruct Google to ignore them.

Let me look into it and also loop in @brianjolly for good measure : )

@brianjolly
Contributor

It looks like Google's URL Parameters tool might be able to help.

https://support.google.com/webmasters/answer/6080548

It says the requirements for using the tool are:

  • Your site has more than 1,000 pages, AND
  • In your logs, you see a significant number of duplicate pages being indexed by Googlebot, in which all duplicate pages vary only by URL parameters (for example: example.com?product=green_dress and example.com?type=dress&color=green).

Would you say this issue falls in that category?

@gtback
Member

gtback commented Sep 14, 2021

Thanks, @brianjolly, that looks promising. I'd want to first confirm that the equivalent pages are getting indexed without the ?ref parameter, but if so, I think we can tell it to ignore any pages with a ref query param.

@AnneB-SEO

@brianjolly & @gtback - The parameter exclusion only applies to pages we create, not to pages created by others. Even so, I added an exclusion for the ref parameter on 9/14:

[Image: Ref-parameter-exclusion-added-09-14-2021]

@AnneB-SEO

This problem is more extensive than first thought, and it is expanding. When this was originally raised there were ~7 URLs from 2 different sites (hackermoon.co and driverlayer.com). Today there are over 80, and more than just the docs are being targeted, including Elasticon.

We'll need to file a DMCA takedown notice with Google through Legal based on:

[Image: Ref-parameter-SERPs-hackermoon-09-22-2021]

Thanks for finding and raising this, @aplhk. Let's leave this one open until we file. Thanks all!!!

gtback removed their assignment Feb 2, 2023
4 participants