A Guide to Adding a Sitemap to Your Robots.txt File

A fundamental aspect of creating content online is wanting it to show up on the search engine results page (SERP) so your audience can interact with your work. For your content to appear in the SERPs, your website needs to tell the search engine crawlers which pages should show up and rank. To do so, you need to make sure you’ve correctly added your Sitemap to your website’s Robots.txt file.

What is Robots.txt?

Robots.txt is a text file that sits in your site’s root directory. Through a series of directives, it tells the search engine robots which pages on your website they should crawl and which they should not.

Each rule group begins with a “User-agent” line naming the crawler it applies to (Googlebot, for example, or * for all crawlers), followed by “Disallow” lines listing the paths that crawler should ignore. This lets crawlers focus on the portions of your site that should appear and rank in the SERPs.
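For illustration, here is a minimal sketch of a robots.txt file that blocks all crawlers from a hypothetical /admin/ section while leaving everything else open to crawling:

User-agent: *
Disallow: /admin/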

Robots.txt files should also include the location of a crucial file for your website: the XML Sitemap. The sitemap outlines which pages on your website you would like search engines to discover and show in the results pages.

What is an XML sitemap?

An XML sitemap is an .xml file that lists all the pages on a website that you would like search engine crawlers to discover and index.

For example, if you have an ecommerce website with a blog that covers various topics in your industry, you would add the blog subfolder to the XML sitemap so crawlers can access and rank these pages in the SERPs. But you would leave the store, cart, and checkout pages out of the XML sitemap, because these are not good landing pages for potential customers to visit. Your customers will naturally pass through these pages when purchasing one of your products, but they certainly wouldn’t start their conversion journey on the checkout page, for instance.
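In practice, sites often also block crawlers from these pages in the Robots.txt file itself. A minimal sketch, assuming hypothetical /cart/ and /checkout/ paths:

User-agent: *
Disallow: /cart/
Disallow: /checkout/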

XML sitemaps also carry important information about each URL through metadata, such as when the page was last modified (lastmod), how frequently it changes (changefreq), and its relative priority (priority). This matters for SEO (search engine optimization), because this information helps search engines crawl your URLs efficiently, which supports their ability to rank against competitors in the SERPs.
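As a short sketch of what this looks like, here is a single-entry sitemap following the sitemaps.org protocol; the URL, date, and values are placeholders:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/blog/sample-post</loc>
    <lastmod>2023-06-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>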

An important note regarding subdomains

Do you use subdomains on your site? If so, you will need a Robots.txt file for each subdomain. For example, if you split your website into subdomains such as www.example.com, blog.example.com, and shop.example.com, each one needs its own Robots.txt file in its root directory.
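In that case, crawlers would expect to find a separate file at each of these locations:

https://www.example.com/robots.txt
https://blog.example.com/robots.txt
https://shop.example.com/robots.txt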

A brief history of sitemaps

In the mid-2000s, Google, Yahoo!, and Microsoft joined together to support a system that automatically checks for XML sitemaps on websites via the Robots.txt file. Known as Sitemaps Autodiscovery, this means that even if you do not submit your website’s sitemap to each search engine individually, the sitemap can still be discovered automatically from your site’s Robots.txt file.

This makes your Robots.txt file even more significant, because it can allow you to skip the step of submitting your sitemap to the major search engines. Just be sure to submit your sitemap directly to any other search engine you want to appear in beyond Google, Bing, and Yahoo!.

While adding a sitemap URL to your Robots.txt file may make submitting a sitemap directly to search engines optional, there are distinct advantages to using Google Search Console and Bing Webmaster Tools to send your sitemap directly to the search engine. The reporting Google and Microsoft include in these tools gives site owners a more granular view of which URLs have been crawled and/or indexed.

Both of these tools also provide excellent resources for managing your robots.txt file itself. Each offers a robots.txt tester that lets you verify the syntax as you build the file. This ensures web crawlers won’t ignore directives in your robots.txt file, which could otherwise lead to access problems and the need to use NOINDEX tags to remove sensitive pages from Google’s or Bing’s index.

How to add your XML sitemap to your Robots.txt file

Below are the three simple steps to adding the location of your XML sitemap to your Robots.txt file:

Step 1: How to locate the sitemap

If you worked with a third-party developer to build your site, contact them to see whether they provided your site with an XML sitemap. Otherwise, the conventional URL for a sitemap is /sitemap.xml, which means the example site www.example.com would have its sitemap at www.example.com/sitemap.xml.

Some larger sites need a sitemap for all of their sitemaps. This is known as a sitemap index, and it follows the same location convention as the standard sitemap: instead of /sitemap.xml, the sitemap index is typically found at /sitemap_index.xml.
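For reference, a sitemap index is itself an XML file that lists other sitemaps, following the sitemaps.org protocol. A minimal sketch with placeholder URLs:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap_posts.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap_blogs.xml</loc>
  </sitemap>
</sitemapindex>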

Alternatively, if Google has previously crawled and indexed your website, you can find the sitemap by using one of these search operators:

site:example.com filetype:xml

filetype:xml site:example.com inurl:sitemap

If your website needs to create a sitemap, consider using an XML sitemap generator such as xml-sitemaps.com, which provides a free sitemap for websites with fewer than 500 pages.

Step 2: How to locate the Robots.txt file

As with locating the sitemap, you can check whether your website has a robots.txt file by typing /robots.txt after the domain, such as www.example.com/robots.txt.

If you do not have a robots.txt file, you will need to create a plain text file named “robots.txt” and add it to the root directory of your web server.

The easiest way to set up a Robots.txt file that does not restrict any crawling access to your website is with these two lines:

User-agent: *
Disallow:

If you would prefer to give more specific instructions to the crawling robots about which URLs not to crawl, specify which “user-agents” the rules apply to and which URLs you would like to “disallow,” as in the sketch below.
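For example, a rule group that keeps Googlebot out of a hypothetical /drafts/ folder while leaving the rest of the site open to all crawlers might look like this:

User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: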

Step 3: How to add a sitemap to the Robots.txt file

Start by declaring the location of the sitemap with this directive:

Sitemap: https://www.example.com/sitemap.xml

With this line in place, the robots.txt file becomes:

Sitemap: https://www.example.com/sitemap.xml

User-agent: *
Disallow:

As mentioned, when adding the Sitemap directive, it is important to keep the “User-agent: *” and “Disallow:” lines in the robots.txt file to tell the robots which URLs they may crawl.

Last but not least, if your site is large or has several subdomains or subsections, you will need to create more than one sitemap and tie them together with a Sitemap Index file. This also keeps your sitemaps better organized, since URLs are categorized into large sections instead of being grouped together in one large sitemap.

There are two options for adding a sitemap index file to your robots.txt:

You can declare your sitemap index file URL within the robots.txt file. The result would look like this:

Sitemap: https://www.example.com/sitemap_index.xml

You can also declare separate sitemap files for each section of your website, such as posts and blogs:

Sitemap: https://www.example.com/sitemap_posts.xml

Sitemap: https://www.example.com/sitemap_blogs.xml
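Putting it all together, a complete robots.txt file that declares both section sitemaps and leaves crawling unrestricted might look like this sketch (all URLs are placeholders):

Sitemap: https://www.example.com/sitemap_posts.xml
Sitemap: https://www.example.com/sitemap_blogs.xml

User-agent: *
Disallow: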

Conclusion

Adding your sitemap to the robots.txt file tells search engine bots where to find the sitemap and how to use it to crawl and index your site. This improves the site’s crawlability and leads to better indexing. Plus, when you give the search engines a clear understanding of the structure and content of your site, the sitemap can help improve your overall search engine rankings.

Don’t miss out on this critical step in setting up your website’s architecture!