noindex/nofollow on a per-page basis #4

Open · 2 tasks
pierremanceaux opened this issue Oct 1, 2020 · 10 comments

Comments

pierremanceaux commented Oct 1, 2020

Opening an issue, following this #coderedcms thread.

Prerequisites

To better understand how noindex/nofollow work, this is a good starting point.

Goal

Ability to control on a per-page basis the presence of noindex and nofollow robot instructions.

How do I imagine it?

Off the top of my head, the most needed feature would be the ability to noindex a page, in order to have better control over which parts of a website get indexed by search engines.

The most basic approach would be to have a checkbox in the "SEO" tab, like the following:

  • Exclude this page from being indexed by search engines

or another version for more technical users:

  • "noindex" this page (using "<meta>" tag)

It would then output <meta name="robots" content="noindex"> in the head of the page. That's it.
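
For illustration, a minimal sketch of what that checkbox could look like on a Wagtail page model; the `seo_noindex` field and the mixin name are hypothetical, not an existing coderedcms API:

```python
from django.db import models

from wagtail.admin.edit_handlers import FieldPanel  # wagtail.admin.panels in newer versions
from wagtail.core.models import Page


class NoindexPageMixin(models.Model):
    """Hypothetical mixin adding a per-page noindex checkbox to the Promote/SEO tab."""

    seo_noindex = models.BooleanField(
        default=False,
        verbose_name="Exclude this page from being indexed by search engines",
    )

    promote_panels = Page.promote_panels + [FieldPanel("seo_noindex")]

    class Meta:
        abstract = True
```

and in the `<head>` of the base page template:

```
{% if page.seo_noindex %}<meta name="robots" content="noindex">{% endif %}
```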

Edit: Exclude no-indexed pages from the sitemap. See this comment.

Suggestions to make it better

  • Let the user choose whether the noindex directive should be applied via a <meta> tag in the <head>, or via a response header (X-Robots-Tag). Could be a global (site-wide) setting, or local to each page; not sure what would be best. (See the sketch after this list.)
  • Add a similar checkbox to enable nofollow
  • Apply noindex or nofollow to the current page, and all child pages too. Useful when, for example, you have a pure SEA marketing group of pages that you want to keep isolated. For instance www.company.com/lp/, where every child page could have noindex preset, inheriting the setting from /lp/.
  • A way to control which bots to target (tricky, and quite advanced I guess).
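
A rough sketch of the header-based variant, combined with the inheritance idea: a mixin that walks the page's ancestors and emits an `X-Robots-Tag` response header. Again, `seo_noindex`/`seo_nofollow` are hypothetical field names, and this is only one possible shape for the feature:

```python
class XRobotsTagMixin:
    """Sketch: apply noindex/nofollow via the X-Robots-Tag response header.

    Use as e.g. `class LandingPage(XRobotsTagMixin, Page)`.
    """

    def robots_directives(self):
        # Inherit the flags from any ancestor, so a whole subtree such as
        # /lp/ can be excluded by setting the flag once on its root page.
        pages = self.get_ancestors(inclusive=True).specific()
        directives = []
        if any(getattr(p, "seo_noindex", False) for p in pages):
            directives.append("noindex")
        if any(getattr(p, "seo_nofollow", False) for p in pages):
            directives.append("nofollow")
        return directives

    def serve(self, request, *args, **kwargs):
        response = super().serve(request, *args, **kwargs)
        directives = self.robots_directives()
        if directives:
            response["X-Robots-Tag"] = ", ".join(directives)
        return response
```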
vsalvino (Contributor) commented Oct 1, 2020

Would it make sense to also add the nofollow URLs into the robots.txt?

pierremanceaux (Author) commented Oct 1, 2020

@vsalvino I honestly don't know. I don't know what the SEO best practices look like in 2020. I personally like the modularity of the meta tags in the head; it also makes things simple to debug, since you see everything in the page source. I don't imagine myself checking a robots.txt file to know what's excluded. An SEO expert could help you on this, I'm not the right person :)

moojen commented Oct 7, 2020

> Would it make sense to also add the nofollow URLs into the robots.txt?

Yes it would. Robots.txt is still used by Google and others to see what is allowed to be indexed. So I would definitely include this, either as part of the same feature, or as a separate feature.
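
For reference, a robots.txt along those lines might look like the following (the paths are only examples); worth noting that a robots.txt `Disallow` blocks crawling, which is related to but not quite the same thing as a `noindex` directive:

```
User-agent: *
Disallow: /admin/
Disallow: /lp/
```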

pierremanceaux (Author) commented

One more point I forgot to raise: the sitemap. Today we have a custom solution in place (far from perfect) to generate the sitemap based on pages that don't have the noindex flag. It is apparently bad practice to have pages in the sitemap that should not be indexed.

It means that we cannot use the Wagtail implementation to generate our sitemap (see https://docs.wagtail.io/en/v2.1.1/reference/contrib/sitemaps.html#basic-configuration). So I believe it would be important to take that into consideration when working on this issue.
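
One possible way to keep the stock `wagtail.contrib.sitemaps` view usable despite this: the view builds its URL set by calling `get_sitemap_urls()` on each page, so a mixin that returns an empty list for no-indexed pages would exclude them. A minimal sketch, again assuming a hypothetical `seo_noindex` flag:

```python
class NoindexSitemapMixin:
    """Sketch: hide no-indexed pages from wagtail.contrib.sitemaps."""

    def get_sitemap_urls(self, *args, **kwargs):
        # Returning an empty list drops this page from the generated sitemap.
        if getattr(self, "seo_noindex", False):
            return []
        return super().get_sitemap_urls(*args, **kwargs)
```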

vsalvino (Contributor) commented

It would definitely make sense for us to provide a better sitemap if Wagtail's is limiting. I'd be happy to review a PR, @pierremanceaux, if you're willing to share your implementation.

pierremanceaux (Author) commented

Hey @vsalvino, here is what we have for now. Keep in mind that this code is 4 years old and probably needs some polishing, but hopefully it helps! ;)

View

```python
from django.conf import settings
from django.core.cache import cache
from django.http import HttpResponse
from django.views.decorators.cache import never_cache


@never_cache
def sitemap_view(request):
    # The rendered XML is cached server-side; @never_cache only stops
    # browsers and proxies from keeping a stale copy.
    cache_key = 'wagtail-sitemap:' + str(request.site.id)
    sitemap_xml = cache.get(cache_key)

    if not sitemap_xml:
        sitemap_xml = Sitemap().render()
        cache.set(cache_key, sitemap_xml,
                  getattr(settings, 'WAGTAILSITEMAPS_CACHE_TIMEOUT', 6000))

    return HttpResponse(sitemap_xml, content_type="text/xml; charset=utf-8")
```

Sitemap generation

```python
from django.template.loader import render_to_string
from wagtail.core.models import Site

# JobSinglePage is a project-specific page model; import it from
# wherever it lives in your project.


class Sitemap(object):
    # Page types that should never appear in the sitemap.
    EXCLUDED_TYPES = [
        JobSinglePage,
    ]
    template = 'sitemap.xml'

    @staticmethod
    def _get_urls():
        site = Site.objects.filter(is_default_site=True).select_related("root_page").get()
        # Live, public descendants of the root page that are not flagged
        # noindex (seo_robot_meta is a project-specific field).
        pages_qs = (
            site.root_page.get_descendants(inclusive=True)
            .live()
            .public()
            .exclude(basepage__seo_robot_meta__icontains="noindex")
            .order_by('path')
            .specific()
        )

        for page in pages_qs.iterator():
            # TODO: filter in the queryset instead (e.g. with PageQuerySet.not_type())
            if type(page) in Sitemap.EXCLUDED_TYPES:
                continue
            for url in page.get_sitemap_urls():
                yield url

    def render(self):
        return render_to_string(self.template, {
            'urlset': self._get_urls(),
        })
```

Template

```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
{% spaceless %}
{% for url in urlset %}
  <url>
    <loc>{{ url.location }}</loc>
    {% if url.lastmod %}<lastmod>{{ url.lastmod|date:"Y-m-d" }}</lastmod>{% endif %}
    <changefreq>weekly</changefreq>
  </url>
{% endfor %}
{% endspaceless %}
</urlset>
```
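
For completeness, wiring the view up would look roughly like this, assuming it lives in a module such as `sitemaps.views` (the import path is illustrative):

```python
from django.urls import path

from sitemaps.views import sitemap_view

urlpatterns = [
    path("sitemap.xml", sitemap_view, name="sitemap"),
]
```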

anefta commented Nov 30, 2020

Hi,
per-page noindex and nofollow is super critical in SEO nowadays. SEOs mostly use robots.txt to block admin areas and to disallow bots, not individual pages.

I hope this feature will be added to the next release!

benlamptey-gocity commented

Has this been achieved?

gideonaa commented

Hi, checking in about 4 years later, wondering if this has been implemented? Or perhaps people are using some other solution?

vsalvino (Contributor) commented

This has not been a priority or a need for us. However, if someone is willing to implement it, including tests and docs, I would be willing to review and merge it.
