Enable search engine indexing #55
Comments
@PVautour, after our meeting with Google, did you ever look further into the SEO kit? |
I don't know about an SEO kit, but I believe the main solution we concluded would be suitable was to manually create a robots.txt that would contain URLs to every public page that we want crawled. Would that suit our needs? |
I will defer to @indiciumx for direction. |
Ok, so it's not actually the robots.txt file that is appropriate for this; it is the sitemap.xml. I am throwing together a quick function to fetch IDs from the database and generate sitemap files pointing to those pages. One sitemap is limited to 50 MB (uncompressed) and 50,000 URLs, but you can create one root sitemap that points to the others. In theory this means that listing all of our pages should not be much of an issue. For now I'm using a Lambda and will drop the sitemaps in S3. We can then use WebSub to inform search engines of any updates to our sitemap. Here is the Google documentation that lists everything we need to do: Build and Submit a Sitemap |
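The split into several sitemaps plus one root sitemap could look roughly like the sketch below. This is not the Lambda's actual code: the function names, the `sitemap-N.xml` file naming and the `sitemap_root` argument are assumptions made for illustration, and only the 50,000 URL limit is enforced (the 50 MB size limit is not checked). The `result?lang=en&id=` URL pattern is the one used elsewhere in this thread.

```python
from xml.sax.saxutils import escape

MAX_URLS_PER_SITEMAP = 50_000  # per-file URL limit quoted in the comment above
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def chunk(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_urlset(urls):
    """Render one sitemap file; escape() turns '&' into '&amp;' as XML requires."""
    entries = "".join(f"  <url><loc>{escape(u)}</loc></url>\n" for u in urls)
    return (f'<?xml version="1.0" encoding="UTF-8"?>\n'
            f'<urlset xmlns="{NS}">\n{entries}</urlset>\n')


def build_index(sitemap_urls):
    """Render the root sitemap that points at the individual sitemap files."""
    entries = "".join(
        f"  <sitemap><loc>{escape(u)}</loc></sitemap>\n" for u in sitemap_urls
    )
    return (f'<?xml version="1.0" encoding="UTF-8"?>\n'
            f'<sitemapindex xmlns="{NS}">\n{entries}</sitemapindex>\n')


def build_sitemaps(ids, base_url, sitemap_root, max_urls=MAX_URLS_PER_SITEMAP):
    """Return {filename: xml_text} for all sitemap files plus the root index.

    `base_url` is the result page URL; `sitemap_root` is where the files
    will be served from (hypothetical, e.g. the site root).
    """
    files = {}
    locations = []
    for n, batch in enumerate(chunk(list(ids), max_urls), start=1):
        name = f"sitemap-{n}.xml"  # hypothetical naming scheme
        urls = [f"{base_url}?lang=en&id={record_id}" for record_id in batch]
        files[name] = build_urlset(urls)
        locations.append(f"{sitemap_root}/{name}")
    files["sitemap.xml"] = build_index(locations)
    return files
```

The returned dictionary can then be written wherever the files are hosted (for example, one S3 object per key); that upload step is left out here because it depends on bucket names and permissions not given in the thread.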
Hello, I have a lambda generating the sitemaps for all files in a bucket here: Just have to set the correct permissions and environment variables to generate the files and store them in s3. Two questions:
Feel free to give me a call this morning or Monday afternoon. |
@indiciumx @PVautour should we put the Lambda code in this repo? |
I don't think so, no. It should probably be either:
Though we could also do a monorepo-style system, in which case the source code/CloudFormation for the lambda could be put in a folder within the monorepo. I don't think we want to deploy the resulting sitemaps in this repo. I will discuss deployment of the sitemaps with Chris this morning. I expect we would rather add the sitemaps to the root of the site without managing them in git (server redirects, automatic build process, etc.). Also, an interesting thing to note: I think this repo currently contains build output, but no actual source code. On a side note, AWS has always fought me when trying to use git to manage it. At my current level of knowledge, I expect versioning release CloudFormation templates is a realistic middle ground, but if you know of a successful way of decoupling the dev process from the AWS web UI/live environment, I would be happy to learn from you! I know CloudFormation can technically do that, but it hasn't been super practical for me IRL, unfortunately. |
@indiciumx Ok so the sitemaps are here:
there is a @jvanulde There are still a few improvements to make to the lambda before we can close the issue, but this should be suitable for our immediate needs. |
@indiciumx how are we going to test this? I suppose we can put it in a public directory and have Google index it. |
It looks like this work was done on dev/stage, which has records that only exist on dev/stage. The lambda should be run on prod. For example, here are links that work on prod (green, +) and links that don't exist on prod (red, -): <?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
+ <loc>https://app.geo.ca/result?lang=en&id=0b229ec0-da50-4b29-88da-49c85a5944e2</loc>
</url>
<url>
+ <loc>https://app.geo.ca/result?lang=en&id=0b2303be-ef05-49a8-8082-44a3eabcfa57</loc>
</url>
<url>
- <loc>https://app.geo.ca/result?lang=en&id=0b258202-2271-4ad2-a44b-f9d8c9281342</loc>
</url>
<url>
- <loc>https://app.geo.ca/result?lang=en&id=0b346aa1-3090-4223-ac84-ed7287bc78a9</loc>
</url>
<url>
- <loc>https://app.geo.ca/result?lang=en&id=0b35fc92-9e28-49c7-b1eb-607d2e608509</loc>
</url>
<url>
+ <loc>https://app.geo.ca/result?lang=en&id=0b399378-eff8-4cea-97b8-b307c9b2398a</loc>
</url>
<url>
+ <loc>https://app.geo.ca/result?lang=en&id=0b442f1b-1951-45c8-80ee-cfb8bceb1d72</loc>
</url>
<url>
+ <loc>https://app.geo.ca/result?lang=en&id=0b50b49e-aadc-24c4-ec85-148df785fe5e</loc>
</url> |
So the reason this was not run in prod is that we wanted to be able to release the sitemaps as content and not as infra/code. The plan was to generate the files in staging and then release them, as the records were expected to be the same. We could either reassess and deploy the code to prod on a timer, or realign content on staging and prod and rerun it. |
For this time, I think it is okay. Let's see if Google is able to pick up the sitemap. For the next time, let me know and I will sync staging before the XML files are generated. |
Ok cool thanks bo! |
@PVautour please close if indexing is successful. |
@indiciumx can you check if indexing is working? |
Indexing seems to be correct for resources in the Google console. I am still waiting for actual indexing, as the status is currently "discovered - not indexed", i.e. pending. The lambda that generates indexes now needs to be tweaked to replicate the manual fixes made to the root sitemap. |
All the sitemaps are currently indexed. The resources within them are all pending indexing. All seems good. |
@jvanulde I was hoping the pages would be indexed by now, but they are not. I'm not sure what we should do about this, but here are some options:
Here is what I did this morning:
Tell me if you wanna chat about this issue or have any ideas. |
Update: The sitemap generation lambda has been updated to a more permanent solution. It's now production-ready and should not require any manual intervention. Still waiting on Google to index, though. I've looked at the documentation about posting sitemaps and at the console, and unfortunately I still don't see any option other than waiting for indexing to happen. |
Currently there is no sitemap for https://geo.ca URLs and there should be. The sitemap that was submitted contains only https://app.geo.ca URLs, and those pages aren't really very search engine friendly. It might be better to focus on getting the more SEO/user-friendly URLs on https://geo.ca indexed before moving on to indexing the map results across the various subdomains. If https://geo.ca is going to be the main entry point for users then we should focus our attention there.
Before trying to get those subdomains indexed we should try to get some of the SEO fundamentals sorted out too. For example the https://app.geo.ca results are missing the
It is unlikely that you'll be able to get Google to expedite indexing. Basically, Google will crawl it when it feels like it. Google has historically had trouble crawling and indexing sites built with React or other JavaScript libraries that rely on client-side rendering of content. While this has gotten better, it could be part of the problem.
I have some suggestions that, while they may not directly impact indexing, would be helpful for SEO. I'd like to get to know more about app.geo.ca so I can make appropriate suggestions. @jvanulde @PVautour If there is a chat to be had about this issue, please include me. |
@PVautour Is it possible to exclude URLs that return "No results."? Example: https://app.geo.ca/result?lang=en&id=a8c04b4f-8c62-4d47-b41f-ab81c9865b09 These are going to return a soft 404, and if Google only crawls a limited number of URLs each time it hits the sitemap then we are prolonging the process with URLs that don't add value. |
Thanks for the input, we can certainly have a chat. I'm pretty sure we want to get the app.geo.ca stuff indexed. For sure the pages themselves can be improved, though. I can show you the search console. The pages are actually marked as queued for indexing; it just hasn't been done yet. I'd be interested in picking your brain about those issues to see what we can do. Maybe we can have a meeting tomorrow? |
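One way to keep "No results." pages out of the sitemap is to filter the candidate IDs against the IDs that actually exist in production before generating the files. The sketch below assumes such a list of production IDs can be obtained somehow; the function name and both arguments are hypothetical, and nothing here reflects how the existing lambda works.

```python
def split_by_availability(candidate_ids, production_ids):
    """Separate IDs into those safe to list in the sitemap and those to skip.

    `candidate_ids` are the IDs the generator found (e.g. on staging);
    `production_ids` are the IDs known to exist on prod (assumed available).
    Order is preserved and duplicates are dropped.
    """
    available = set(production_ids)
    seen = set()
    keep, skip = [], []
    for record_id in candidate_ids:
        if record_id in seen:
            continue  # list each URL only once
        seen.add(record_id)
        (keep if record_id in available else skip).append(record_id)
    return keep, skip
```

Logging the `skip` list on each run would also show how far staging and production have drifted apart.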
Pascal and I had a very productive meeting to hash out some details on how best to move forward with getting the app.geo.ca datasets indexed. To start, we should remove the currently submitted sitemap. It's not being crawled, it's generating a large number of indexing errors, and the results that are being indexed are of poor quality, with little chance of being found by users or generating clicks. Results that don't generate clicks will rank poorly, further reducing the chances of them being found. The errors, along with the poor-quality results generated from the sitemap, are reducing the site's crawl budget and limiting the ability for these datasets to be found and indexed. Please note that this is not to say that the datasets themselves are of poor quality, just that in Google's eyes what we are currently serving makes for poor-quality search results, or content that they have deemed not valuable for search users.
At the end of the day, Google is a third party and they move at their own pace. To get them indexing faster we need to give them what they want.
Geo.ca Sitemap
A sitemap should be generated for Geo.ca and submitted to Search Console. The Yoast SEO plugin takes care of this sitemap generation, and we will just need to apply a find and replace on the domain in the URLs to move it through the different deployment stages. This sitemap will be a lot smaller and should hopefully be processed quickly. There shouldn't be an issue getting any of the Geo.ca URLs indexed, and while those pages still need some SEO work they do have the basics and should hopefully be able to start ranking for some long-tail searches. If users can find Geo.ca then they will hopefully click on and follow links to the featured datasets on app.geo.ca and start exploring the data there.
Geo.ca links to several datasets on app.geo.ca, and as part of the crawl process Google will organically find and follow those links. While this is only a very small subset of the data available, the context provided by Geo.ca surrounding those links may help with indexing and assist Google in generating better search results for those featured datasets.
Before resubmitting a new sitemap for app.geo.ca the following items should be done
The Automatically Extracted Buildings dataset will be used as the basis for the examples below.
Results Page Titles
The Example of an updated title:
Meta Descriptions
The Example: Note that I stripped the line breaks
Permalink Structure
The current query strings are not user or search friendly and should, if possible, be converted to a pretty permalink structure. This will allow keywords to be part of the URL, which is a ranking signal, provide more human-readable URLs, and can also increase the shareability of the URLs. The ID may need to be added to the end, or a unique slug system created, to prevent page title collisions.
The pretty permalink should be served in the sitemap and should be set as the canonical version. Example:
A unique slug system would save a sanitized, URL-friendly version of the dataset title (automatically-extracted-buildings) in addition to the unique ID of each record to avoid collisions. In scenarios where two or more pages have the same title, the slugs can be kept unique by appending a number to the end of the sanitized title. The first result to have its slug generated would have no number. This would be recommended over appending the ID, for shorter, friendlier URLs, though there may be another identifier that could be used to keep these unique and this is open to suggestions. Example:
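A minimal sketch of the numbered-slug idea, under a few assumptions: records are processed in the order their slugs are first generated, the duplicate counter starts at 2, and accented characters are reduced to plain ASCII. None of these details are specified in the write-up, and the function names are made up for illustration.

```python
import re
import unicodedata


def slugify(title):
    """Lowercase a title, strip accents and join the words with hyphens."""
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")


def assign_slugs(records):
    """Map record IDs to unique slugs.

    `records` is an iterable of (record_id, title) pairs. The first record
    with a given title gets the bare slug; later ones get -2, -3, ...
    (starting at 2 is an assumption, not something the thread specifies).
    """
    used = set()
    slugs = {}
    for record_id, title in records:
        base = slugify(title) or "dataset"  # fallback for empty titles
        slug = base
        suffix = 2
        while slug in used:
            slug = f"{base}-{suffix}"
            suffix += 1
        used.add(slug)
        slugs[record_id] = slug
    return slugs
```

Because a slug must stay stable once it has been published, a real implementation would store the assigned slug alongside the record ID rather than recompute it on every run.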
Ideally the query string versions would redirect to their permalink. When linking to the results from other sites or sharing URLs, the permalink should be used.
Slug Definition: A URL slug is the last part of a URL, after the final forward slash (“/”), that identifies the specific page or post.
Canonical Tag
The canonical tag allows us to explicitly tell search engines which version of a URL we want indexed. This prevents multiple versions of the same page from being crawled, indexed or flagged as duplicate content. The pretty permalink should be used for this. This would also come into play should we start attempting to track any additional data via a query string that has no bearing on the content. An absolute URL should be used for the canonical. Example: Examples of URLs that would all be the same:
Learn More about Canonical URLs
Alternate Links for Localized Versions
Datasets are available in both English and French. To assist search engines and ultimately assist our users in finding content in their preferred language we can take advantage of Example:
Learn More about Localized Page Versions
On a side note the Another side note to consider with translation is having the root URL include the language for at least the French version. When toggling between languages this gives the user a defined URL for their preferred language that they can bookmark, link to or share. This may have an effect on the permalink structure, as language should potentially always follow the domain in the URL. Example:
Lastmod tag
The
No results and soft 404s
Datasets that return "No results." should not be included in the sitemap. These pages create soft 404 indexing errors and have a negative impact on the crawl budget. This was caused by staging and production being out of sync. Care should be taken to ensure there is a 1:1 match between the environments when generating the sitemap. Additionally, if possible, checks should be made to exclude missing datasets. Ideally a 404 HTTP status code should be returned for missing results, though this may not be possible with the current architecture.
Server Side Rendering
This would be a nice-to-have and is something to consider down the road. Server Side Rendering could help improve page performance, which in turn would help with crawl efficiency. Page performance is also used as a ranking signal.
Summary (TLDR)
The current app.geo.ca sitemap is hurting more than helping and should be removed. Some basic SEO should be done and permalinks implemented on app.geo.ca so we can resubmit a clean, search-engine-friendly sitemap. A sitemap for geo.ca should be generated and added as soon as possible to get it indexed and ranking, and to provide a path for search users to find geo.ca and in turn discover app.geo.ca.
app.geo.ca Tasks
geo.ca tasks
@PVautour would handle changes on the app.geo.ca side and I would handle changes for geo.ca. @jvanulde can you please provide your approval for this path forward? @sean-eagles please assign me to this task. Thank you. |
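To make the Canonical Tag and Alternate Links recommendations from the write-up above more concrete, here is a rough sketch of the `<link>` tags a results page could emit. Several things are assumptions rather than anything decided in this thread: the `/{lang}/{slug}` path layout, the use of `rel="alternate"` with `hreflang` (the standard mechanism described in Google's localized-versions documentation), and the French slug, which is a made-up placeholder.

```python
from html import escape

SITE = "https://app.geo.ca"


def permalink(lang, slug):
    """Absolute pretty permalink; the /{lang}/{slug} layout is hypothetical."""
    return f"{SITE}/{lang}/{slug}"


def head_links(slugs_by_lang, current_lang):
    """Return the canonical tag plus one hreflang alternate per language.

    `slugs_by_lang` maps a language code to that language's slug for the
    same dataset, e.g. {"en": "...", "fr": "..."}.
    """
    canonical = permalink(current_lang, slugs_by_lang[current_lang])
    tags = [f'<link rel="canonical" href="{escape(canonical)}">']
    for lang, slug in slugs_by_lang.items():
        href = escape(permalink(lang, slug))
        tags.append(f'<link rel="alternate" hreflang="{lang}" href="{href}">')
    return tags


if __name__ == "__main__":
    example = {
        "en": "automatically-extracted-buildings",
        "fr": "batiments-extraits-automatiquement",  # placeholder slug
    }
    print("\n".join(head_links(example, "en")))
```

Each language version lists itself and the other language as alternates, while the canonical always points at the permalink for the language currently being served, so query-string variants of the same page collapse onto one URL.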
Impressive write-up, Jared. I'm glad we got you on the team! In my opinion: what is proposed > waiting longer for my sitemap to be indexed |
Thanks Jared, very nice writeup, you have been assigned. |
Problematic sitemaps have been deleted in prod. |
We now set lastmod in sitemaps. This is in staging and should be reflected when we do redeploy the sitemaps. |
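For reference, a sitemap entry with `lastmod` could be produced along these lines. This is only a sketch: the function name is hypothetical, the timestamp in the usage example is invented, and the real lambda may format things differently. The sitemap protocol expects `lastmod` in W3C Datetime format, which the UTC timestamp below satisfies.

```python
from datetime import datetime, timezone
from xml.sax.saxutils import escape


def url_entry(loc, last_modified):
    """Render one <url> element with a UTC <lastmod> timestamp.

    `last_modified` must be timezone-aware so the conversion to UTC is
    unambiguous; naive datetimes are rejected.
    """
    if last_modified.tzinfo is None:
        raise ValueError("last_modified must be timezone-aware")
    stamp = last_modified.astimezone(timezone.utc).strftime(
        "%Y-%m-%dT%H:%M:%S+00:00"
    )
    return f"<url><loc>{escape(loc)}</loc><lastmod>{stamp}</lastmod></url>"


if __name__ == "__main__":
    # The modification time here is made up for the example.
    print(url_entry(
        "https://app.geo.ca/result?lang=en&id=0b229ec0-da50-4b29-88da-49c85a5944e2",
        datetime(2021, 10, 5, 12, 30, tzinfo=timezone.utc),
    ))
```

The value should reflect when the dataset record actually changed; stamping every URL with the generation time on each run would make the field meaningless to crawlers.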
I've set up Simply Static to use absolute URLs. The sitemap index and sitemaps are now included in the static site generation. Additionally, I've added all of the /Viewer/ links to the sitemap. Next time we redeploy the static site we can submit the sitemap to Search Console. |
The sitemap index has been submitted to search console. |
It's been almost two weeks (13 days) since I submitted the sitemap, and Google is slowly but surely indexing the links within it. Of the 73 links currently in the sitemap, we have gone from 17 -> 28 -> 34 pages now indexed. There are some indexing issues currently flagged for the sitemap URLs, but those are slowly resolving themselves as Google recrawls the pages. |
In preparation of the official launch of Geo.ca in mid-November we need to ensure that search engine indexing is enabled and functioning.