Search.gov sitemap setup for Global and Policy‐guidance site‐search
The site-search of fec.gov uses the General Services Administration's search.gov search engine in addition to the FEC API for Candidates and Committees (see fec/search/views.py).
To get access to the Search.gov dashboard, Slack or email John Carroll ([email protected]) or Pat Phongverijati ([email protected]).
The search.gov search engine indexes sitemaps for both Global site-search and Policy and other guidance search by referencing the production robots.txt. We serve a different robots.txt in other environments (dev, stage, feature) that disallows crawling so that these subdomain URLs never get indexed.
Production robots.txt (https://www.fec.gov/robots.txt):
User-agent: usasearch
Crawl-delay: 2
Allow: /
Disallow: /search/?*
Disallow: /data/legal/search/?*
Disallow: /data/search/?*
Sitemap: https://www.fec.gov/sitemap-wagtail.xml
Sitemap: https://www.fec.gov/resources/cms-content/documents/sitemap_pdf.xml
Sitemap: https://www.fec.gov/resources/cms-content/documents/sitemap_html.xml
User-agent: *
Crawl-delay: 10
Disallow: /search/?*
Disallow: /data/legal/search/?*
Disallow: /data/search/?*
etc...
Dev, stage, feature robots.txt (https://dev.fec.gov/robots.txt):
User-agent: *
Disallow: /
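The effect of the two robots.txt variants can be checked locally with Python's standard-library parser. The following is a minimal sketch, not part of the fec-cms codebase: it uses a trimmed copy of the production rules (the wildcard Disallow lines are left out because `urllib.robotparser` does simple prefix matching and may not interpret `?*` the way a crawler does), and the helper name `load_rules` is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Trimmed copy of the production rules shown above (wildcard Disallow lines omitted).
PRODUCTION_ROBOTS = """\
User-agent: usasearch
Crawl-delay: 2
Allow: /
Sitemap: https://www.fec.gov/sitemap-wagtail.xml
Sitemap: https://www.fec.gov/resources/cms-content/documents/sitemap_pdf.xml
Sitemap: https://www.fec.gov/resources/cms-content/documents/sitemap_html.xml

User-agent: *
Crawl-delay: 10
"""

# Rules served in dev, stage, and feature environments.
NON_PRODUCTION_ROBOTS = """\
User-agent: *
Disallow: /
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


prod = load_rules(PRODUCTION_ROBOTS)
dev = load_rules(NON_PRODUCTION_ROBOTS)

# The search.gov crawler may fetch production pages, but nothing on dev.
print(prod.can_fetch("usasearch", "https://www.fec.gov/legal-resources/"))
print(dev.can_fetch("usasearch", "https://dev.fec.gov/legal-resources/"))

# Sitemaps and crawl delay advertised to the crawler.
print(prod.site_maps())
print(prod.crawl_delay("usasearch"))
```

To check a deployed environment instead of the inline text, `RobotFileParser(url)` followed by `read()` fetches the live file.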
Accessed via the search box on every page of fec.gov:
The global sitemap, https://www.fec.gov/sitemap-wagtail.xml/, is auto-generated by enabling the built-in Wagtail sitemap feature. It indexes only live, public Wagtail pages. Certain page types can be excluded by overriding get_sitemap_urls() for a page model (see the banner models in models.py). Other important non-Wagtail pages, such as data tables and the calendar, are added to the Best Bets section of the search.gov dashboard and show up as suggested results at the top of search results. For more options for expanding our searchable content, see the section titled "Expanding Global site-search beyond just Wagtail pages."
Accessed via the search box on this page: https://www.fec.gov/legal-resources/policy-and-other-guidance/guidance-documents/
Search results are limited to only items in the two Policy and other guidance sitemaps. These are uploaded as documents in Wagtail and we reference these URLs in robots.txt:
- https://www.fec.gov/resources/cms-content/documents/sitemap_pdf.xml
- https://www.fec.gov/resources/cms-content/documents/sitemap_html.xml
A copy of the latest-uploaded version of each sitemap is also included in the fec-cms GitHub repo just for version control; these files are not exposed to the web through the CMS.
- https://github.com/fecgov/fec-cms/blob/develop/fec/search/management/data/sitemap_pdf.xml
- https://github.com/fecgov/fec-cms/blob/develop/fec/search/management/data/sitemap_html.xml
The separation of Policy and other guidance search results from the Global search results is achieved by putting the documents and pages in a dedicated directory (in S3 or Wagtail) and then limiting the search to those directories using the domains section of the search.gov dashboard.
Note: Currently, Global search sitemap entries are not available to Policy and other guidance search, but the Policy and other guidance sitemap entries are available to the Global search. We will likely isolate both in the future so that they are mutually exclusive, but we are still discussing this because the current arrangement has the benefit of making most FEC-form PDFs searchable in the global site-search.
Documents:
- Ask a developer to upload the document to the production S3 bucket at resources/cms-content/documents/policy-guidance
- Update the sitemap:
  - If it is a new PDF, add the path to the PDF to sitemap_pdf.xml and update the lastmod date.
  - If it is existing, just update the lastmod date.
- Replace sitemap_pdf.xml in the Wagtail documents area with the updated version.
- Create a GitHub PR to also update the sitemap in the repo with the latest version.
Webpages:
- The webpages, most of which are in /updates in Wagtail, are kept in their original location and an alias is created under the /updates/guidance-search/ parent in Wagtail.
- To add a new page, simply create the page wherever is logical in Wagtail. Create an alias of it by using the Copy option in Wagtail, clicking the Alias checkbox, and choosing /updates/guidance-search/ as its parent. (Wagtail WIKI on aliasing pages)
- To edit an existing page, simply edit the page and publish. The changes will be reflected in the existing alias.
- Update the sitemap:
  - If it is a new page, add the path to the page to sitemap_html.xml and update the lastmod date.
  - If it is existing, just update the lastmod date.
- Replace sitemap_html.xml in the Wagtail documents area with the updated version.
- Create a GitHub PR to also update the sitemap in the repo with the latest version.
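The "Update the sitemap" step for either file can be scripted rather than edited by hand. Below is a minimal sketch that adds a new entry or refreshes the lastmod date of an existing one. It assumes the sitemaps follow the standard sitemaps.org layout (`urlset` > `url` > `loc` / `lastmod`); the function name `upsert_sitemap_entry` and the example URLs are hypothetical and not taken from the repo.

```python
import xml.etree.ElementTree as ET

# Assumption: the sitemap files use the standard sitemaps.org namespace.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def upsert_sitemap_entry(sitemap_xml: str, loc: str, lastmod: str) -> str:
    """Return sitemap XML with `loc` present and its lastmod set to `lastmod`.

    If `loc` is already listed, only its lastmod is updated;
    otherwise a new <url> entry is appended.
    """
    # Serialize with a default namespace instead of an "ns0:" prefix.
    ET.register_namespace("", SITEMAP_NS)
    root = ET.fromstring(sitemap_xml)
    ns = {"sm": SITEMAP_NS}

    for url in root.findall("sm:url", ns):
        if url.findtext("sm:loc", default="", namespaces=ns).strip() == loc:
            lastmod_el = url.find("sm:lastmod", ns)
            if lastmod_el is None:
                lastmod_el = ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod")
            lastmod_el.text = lastmod
            break
    else:
        # No existing entry matched, so append a new one.
        url = ET.SubElement(root, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod

    return ET.tostring(root, encoding="unicode")


# Placeholder sitemap and URLs, for illustration only.
sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.gov/docs/a.pdf</loc><lastmod>2023-01-01</lastmod></url>"
    "</urlset>"
)
updated = upsert_sitemap_entry(sample, "https://example.gov/docs/a.pdf", "2024-02-02")
extended = upsert_sitemap_entry(updated, "https://example.gov/docs/b.pdf", "2024-03-03")
print(extended)
```

The output still needs to be uploaded to the Wagtail documents area and committed to the repo as described in the steps above; the script only replaces the manual XML edit.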
When a Wagtail page is unpublished or changed to draft or private, it is removed from the Global sitemap. For Policy and other guidance, an item must be manually removed from either of the sitemaps (PDF or HTML). However, once an item has been indexed by search.gov, removing it from a sitemap does not automatically remove it from search results. This applies to both Global and Policy and other guidance. You can send an email to search.gov support at [email protected] to request that items be removed from the index immediately. Otherwise, items no longer on the sitemaps will be removed from the index after 30 days. See search.gov's more detailed explanation below:
The difference between updating indexing vs. indexing new content
- Ingesting new content is done every two hours off of sitemaps. This picks up URLs we didn’t know about before.
- At the same time, we scan the sitemaps for any updated timestamps on URLs we already knew about and re-fetch those to get the updates.
- For URLs that do not or cannot show up on a sitemap with updated timestamps, we have a job that will recheck each URL if it’s been 30 days since we last fetched it. So, today, for example, we’re checking any URLs that were last fetched on July 17, 2023. This is to pick up updates to the pages that didn’t get a new date and to find updated response codes, like 301s or 404s, and remove those URLs from the index. Usually, URLs that are 301ing and 404ing are not included on the sitemaps, so we have to detect them separately.
Best Bets are search suggestions that you can manually add (or add in bulk by uploading a spreadsheet) which map a URL to a specific set of search keywords. For Global search, Best Bets will be returned at the top of the search results in a section titled "Suggested results". For Policy and other guidance, the Best Bets are always pushed to the top of the result list, but do not have a separate section heading.
- Best Bets for Global search: https://search.usa.gov/sites/6738/best_bets_texts
- Best Bets for Policy and other guidance search: https://search.usa.gov/sites/8042/best_bets_texts
You can test search results on the search.gov dashboard for each affiliate by going to the Preview dashboard. Keep in mind that search terms are cached. One tip suggested by search.gov for testing a term that has already been cached is to slightly change its capitalization or punctuation (e.g., "Form 3p", "form-3p"). You can see cached queries by going to Analytics > Queries in the dashboard.
To allow your local site's search boxes to return search results, grab the following environment variables from cf target -s prod and export them in your terminal window or your shell configuration (.bash_profile or your shell’s equivalent).
export SEARCHGOV_API_ACCESS_KEY=<>
export SEARCHGOV_POLICY_GUIDANCE_KEY=<>
You can query the search.gov API directly at https://api.gsa.gov/technology/searchgov/v2/results/i14y. Three parameters are required: (1) affiliate, (2) access_key, and (3) query (docs):
Example query: https://api.gsa.gov/technology/searchgov/v2/results/i14y?affiliate=betafec_api&access_key=xxxxxxxxxx&query=reporting dates
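The same query can be issued from Python. This is a minimal sketch using only the standard library: the endpoint, the three required parameters, the `betafec_api` affiliate, and the `SEARCHGOV_API_ACCESS_KEY` variable come from the text above, while the function names `build_search_url` and `search` are hypothetical. It makes no assumption about the shape of the JSON response. Note that `urlencode` escapes the space in a query like "reporting dates", which the raw example URL leaves unescaped.

```python
import json
import os
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCHGOV_ENDPOINT = "https://api.gsa.gov/technology/searchgov/v2/results/i14y"


def build_search_url(affiliate: str, access_key: str, query: str) -> str:
    """Build a search.gov results URL with the three required parameters."""
    params = urlencode(
        {"affiliate": affiliate, "access_key": access_key, "query": query}
    )
    return f"{SEARCHGOV_ENDPOINT}?{params}"


def search(affiliate: str, query: str) -> dict:
    """Run a query using the access key exported in the environment."""
    access_key = os.environ["SEARCHGOV_API_ACCESS_KEY"]
    url = build_search_url(affiliate, access_key, query)
    with urlopen(url, timeout=10) as response:
        return json.load(response)


# Building the URL needs no network access or real key.
print(build_search_url("betafec_api", "xxxxxxxxxx", "reporting dates"))
```

Calling `search("betafec_api", "reporting dates")` performs the actual request and requires the environment variable to be set. Querying the Policy and other guidance affiliate would presumably use `SEARCHGOV_POLICY_GUIDANCE_KEY` with that affiliate's name instead.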
We can dynamically generate sitemaps to index more than just Wagtail pages, using the Django sitemap framework. Two examples would be to index all of the documents in Reports about the FEC and to index all of our form PDFs. This WIP PR has an example of both. The sitemap view code is in urls.py for demo purposes, but it would ultimately be created as its own file, like sitemap-views.py, and imported into urls.py.
We always have the option to manually write additional sitemaps if necessary, although dynamically created and updated sitemaps are preferable.