There are many reasons to examine an XML sitemap. For an SEO or analyst it can provide a means of quickly understanding a client’s site structure. From a technical perspective, it can be compared with a site crawl to discover excluded pages. Furthermore, mobile and desktop versions of the sitemap can be compared to illuminate discrepancies.

In any case, the first step is getting the page URLs from the sitemap into a more manageable data structure – a task made simple with Python. In this post we’ll explicitly show how this can be done for an XML sitemap. Following this we’ll discuss a tool that can categorize and visualize the resulting list of URLs.

In overview, we will show and describe Python scripts for:

  • Extracting page URLs from an XML sitemap
  • Categorizing URLs by page type
  • Plotting a sitemap graph tree

The scripts are compatible with Python 2 and 3, and instructions for installing the library dependencies can be found in the source code repository.

Extracting URLs from an XML Sitemap

We will use the www.sportchek.ca XML sitemap as our case example. As with most large e-commerce sites, its page URLs are split among multiple XML files. In general this is done either because of the size limits on a single XML sitemap (50,000 URLs or 50MB uncompressed) or as a preference, since smaller files make it easier to monitor how the site’s content is indexed and to troubleshoot any errors quickly. An XML sitemap index stores the addresses of the individual sitemaps, and this single file can be submitted to search engines.
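
To make the index format concrete, here is a minimal sketch that parses a small, made-up sitemap index using only Python’s standard library. The example.com URLs and the variable names are assumptions for illustration and are not taken from the sportchek.ca file; the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap index, trimmed to two entries for illustration.
SITEMAP_INDEX_XML = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap01.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap02.xml</loc>
  </sitemap>
</sitemapindex>
"""

# Sitemap files declare this XML namespace, so tag lookups must include it.
NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

root = ET.fromstring(SITEMAP_INDEX_XML.encode('utf-8'))

# Each <sitemap> entry holds one <loc> pointing at a child sitemap file.
child_sitemaps = [loc.text.strip()
                  for loc in root.findall('sm:sitemap/sm:loc', NS)]
print(child_sitemaps)
```

A regular sitemap file has the same shape, except that the root element is `<urlset>` and each entry is a `<url>` wrapping a `<loc>`.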

We start by requesting the XML sitemap index URL with the Requests library and then use Beautiful Soup to instantiate a “soup” object containing the page content.

import requests
from bs4 import BeautifulSoup

url = 'https://www.sportchek.ca/sitemap.xml'
page = requests.get(url)
print('Loaded page with: %s' % page)

sitemap_index = BeautifulSoup(page.content, 'html.parser')
print('Created %s object' % type(sitemap_index))

Loaded page with: <Response [200]>
Created <class 'bs4.BeautifulSoup'> object

Next we can pull links to the indexed XML sitemaps, which live within the <loc> tags.

# find_all is the current name for the older findAll alias
urls = [element.text for element in sitemap_index.find_all('loc')]
print(urls)

['https://www.sportchek.ca/sitemap01.xml',
'https://www.sportchek.ca/sitemap02.xml',
'https://www.sportchek.ca/sitemap03.xml',
'https://www.sportchek.ca/sitemap04.xml',
'https://www.sportchek.ca/sitemap05.xml',
'https://www.sportchek.ca/sitemap06.xml',
'https://www.sportchek.ca/sitemap07.xml',
'https://www.sportchek.ca/sitemap08.xml']

With some investigation of the XML format within each of the files above, we would again see that page URLs can be identified by searching for <loc> tags. These URLs can therefore be extracted in the same way as the above links were from the XML sitemap index. We loop over the XML documents, appending all page URLs to a list.

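def extract_links(url):
    ''' Open an XML sitemap and return the URLs wrapped in loc tags. '''

    page = requests.get(url)
    page.raise_for_status()  # Stop early on a 4xx/5xx response
    soup = BeautifulSoup(page.content, 'html.parser')
    # Strip stray whitespace so each entry is a clean URL
    links = [element.text.strip() for element in soup.find_all('loc')]

    return links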

sitemap_urls = []
for url in urls:
    links = extract_links(url)
    sitemap_urls += links

print('Found {:,} URLs in the sitemap'.format(len(sitemap_urls)))

Found 52,552 URLs in the sitemap

Let’s write these to a file named “sitemap_urls.dat” which contains one URL per line and can be opened in any text editor.

with open('sitemap_urls.dat', 'w') as f:
    for url in sitemap_urls:
        f.write(url + '\n')
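
As a quick sanity check, the file can be read back and de-duplicated before it is passed to other tools. The sketch below is self-contained: the example.com URLs are made up, and it writes to a temporary directory (an assumption made here so that a real “sitemap_urls.dat” is not overwritten).

```python
import os
import tempfile

# Hypothetical URLs; the duplicate entry is deliberate.
sitemap_urls = [
    'https://www.example.com/a.html',
    'https://www.example.com/b.html',
    'https://www.example.com/a.html',
]

# Write one URL per line, as in the snippet above, but to a temporary folder.
path = os.path.join(tempfile.mkdtemp(), 'sitemap_urls.dat')
with open(path, 'w') as f:
    for url in sitemap_urls:
        f.write(url + '\n')

# Read the file back, skipping any blank lines.
with open(path) as f:
    loaded = [line.strip() for line in f if line.strip()]

# A set removes repeated URLs; sorting makes the result reproducible.
unique_urls = sorted(set(loaded))
print('%d lines, %d unique URLs' % (len(loaded), len(unique_urls)))
```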

 

Automated URL Categorization

Site-specific categorization of pages containing display listings, products, blog posts, articles, campaigns, etc., can be done by applying custom filters over the URL list. Python is great for this because filters can be very detailed and chained together, plus your results can be reproduced by simply rerunning the script.
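
As an illustration of what such custom filters might look like, here is a minimal sketch. The URL patterns, labels and the `classify` helper are hypothetical; real filters would need to be written against the URL conventions of the site in question.

```python
import re
from urllib.parse import urlparse

# Hypothetical URLs standing in for the contents of sitemap_urls.dat.
sample_urls = [
    'https://www.example.com/categories/men/shoes.html',
    'https://www.example.com/categories/women/jackets.html',
    'https://www.example.com/product/trail-runner-123456.html',
    'https://www.example.com/blog/how-to-choose-skis.html',
]

# Each filter is a (label, predicate) pair; the patterns are illustrative only.
FILTERS = [
    ('product', lambda path: re.search(r'^/product/.+-\d+\.html$', path) is not None),
    ('listing', lambda path: path.startswith('/categories/')),
    ('blog', lambda path: path.startswith('/blog/')),
]

def classify(url):
    """Return the label of the first matching filter, or 'other'."""
    path = urlparse(url).path
    for label, matches in FILTERS:
        if matches(path):
            return label
    return 'other'

# Label every URL, then chain a second filter on top of the labels.
page_types = {url: classify(url) for url in sample_urls}
listing_urls = [url for url, label in page_types.items() if label == 'listing']
print(listing_urls)
```

Because the filters live in an ordered list, more specific patterns can be placed first and new page types can be added without touching the rest of the script.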

On the other hand, we could take a different approach and, instead of filtering for specific URLs, apply an automated algorithm to peel back our site’s layers and find the general structure. This algorithm quickly reveals a site’s structure, and it has been built into our categorize_urls.py Python script. The script takes the “sitemap_urls.dat” file (generated in the previous section) as input and can be run by navigating to the local file location in the terminal and executing, for example:

python categorize_urls.py --depth 3

This will create a file named “sitemap_layers.csv” that can be opened in Excel and will look something like this:

spreadsheet example

This type of categorization lets us understand the site layout and page distribution at a glance. Counts are included to indicate the number of pages containing the specified path.

The depth argument can be tuned to your liking, allowing the list of URLs to be probed with varying amounts of granularity. We’ll experiment more with this feature when creating visualizations in the next section.
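
To give a feel for how depth-based grouping can work, here is a minimal sketch that counts URLs by their leading path segments. It is not the code from categorize_urls.py; the `count_layers` function and the example.com URLs are assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlparse

def count_layers(urls, depth):
    """Group URLs by their first `depth` path segments (fewer if the path is shorter)."""
    counts = Counter()
    for url in urls:
        # Split the path on slashes and drop the empty pieces.
        segments = [s for s in urlparse(url).path.split('/') if s]
        counts[tuple(segments[:depth])] += 1
    return counts

# Hypothetical URLs standing in for the contents of sitemap_urls.dat.
sample_urls = [
    'https://www.example.com/categories/men/shoes.html',
    'https://www.example.com/categories/men/jackets.html',
    'https://www.example.com/categories/women/shoes.html',
    'https://www.example.com/product/trail-runner-123456.html',
]

shallow = count_layers(sample_urls, depth=1)
deeper = count_layers(sample_urls, depth=2)

for path, count in sorted(deeper.items()):
    print('/'.join(path), count)
```

With a depth of 1 the three category pages collapse into a single “categories” row, while a depth of 2 separates “categories/men” from “categories/women”. This is the granularity trade-off that the depth argument controls.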

Visualizing the Categorized Sitemap

Tables are an excellent way to store data, but they are not always the best way to look at it. This is especially true when sharing your data with others.

Our categorized sitemap can be nicely visualized using Graphviz, where paths are illustrated with nodes and edges. The nodes in our graphs will contain URL layers and the edges will be labelled by the number of sub-pages existing within that path.
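
To show how nodes and labelled edges can be described for Graphviz, here is a minimal sketch that turns path counts into DOT source text. It uses plain string formatting rather than any Graphviz library, and the `to_dot` function, the input dictionary and the example.com label are hypothetical. It is not the implementation used in our scripts, and it does not escape quotes inside URL segments.

```python
def to_dot(layer_counts, root_label='example.com'):
    """Render {path tuple: page count} as Graphviz DOT source text."""
    edges = {}
    for path, count in layer_counts.items():
        # Walk every prefix so that parent edges accumulate their children's counts.
        for i in range(1, len(path) + 1):
            parent = '/' + '/'.join(path[:i - 1])
            child = '/' + '/'.join(path[:i])
            edges[(parent, child)] = edges.get((parent, child), 0) + count

    lines = ['digraph sitemap {',
             '  node [shape=box];',
             '  "/" [label="%s"];' % root_label]
    for (parent, child), count in sorted(edges.items()):
        # Node IDs are full paths; the visible label is only the last segment.
        lines.append('  "%s" [label="%s"];' % (child, child.rsplit('/', 1)[-1]))
        lines.append('  "%s" -> "%s" [label="%d"];' % (parent, child, count))
    lines.append('}')
    return '\n'.join(lines)

# Hypothetical counts of pages per URL path.
layer_counts = {
    ('categories', 'men'): 2,
    ('categories', 'women'): 1,
    ('product',): 1,
}

dot_source = to_dot(layer_counts)
print(dot_source)
```

If the printed text is saved to a file such as sitemap.dot, the Graphviz command line tool can render it, for example with `dot -Tpdf sitemap.dot -o sitemap.pdf`.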

The graphing is handled by our visualize_urls.py Python script, which takes the “sitemap_layers.csv” file (generated in the previous section) as input. The script can be run by navigating to the local file location in the terminal and executing, for example:

python visualize_urls.py --depth 2 --title "Sitemap Overview" --size "10"

XML sitemap graph visualization

Setting --depth 3, we see that our graph already becomes very large! Here we set --size "35" to create a higher-resolution PDF file where the details are clearly visible.

python visualize_urls.py --depth 3 --size "35"

XML sitemap visualization

Another useful feature built into the graphing tool is the ability to limit branch size. This can let us create deep sitemap visualizations that don’t grow out of control. For example, by limiting each branch to the top three (in terms of recursive page count) we can go deeper:

python categorize_urls.py --depth 5
python visualize_urls.py --depth 5 --limit 3 --size "25"

XML sitemap graph visualization
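
The idea of limiting each branch can be sketched as a recursive pruning step over a tree of page counts. The nested-dictionary layout and the `prune` function below are assumptions made for illustration and may differ from how visualize_urls.py applies its limit.

```python
def prune(node, limit):
    """Return a copy of `node` keeping only the `limit` largest children per branch."""
    # Rank children by their recursive page count, largest first.
    ranked = sorted(node['children'].items(),
                    key=lambda item: item[1]['count'],
                    reverse=True)
    return {
        'count': node['count'],
        'children': {name: prune(child, limit) for name, child in ranked[:limit]},
    }

# Hypothetical tree: each node stores the page count beneath it and its children.
tree = {
    'count': 10,
    'children': {
        'categories': {'count': 6, 'children': {
            'men': {'count': 3, 'children': {}},
            'women': {'count': 2, 'children': {}},
            'kids': {'count': 1, 'children': {}},
        }},
        'product': {'count': 3, 'children': {}},
        'blog': {'count': 1, 'children': {}},
    },
}

pruned = prune(tree, limit=2)
print(sorted(pruned['children']))
print(sorted(pruned['children']['categories']['children']))
```

Since every branch keeps at most `limit` children, the number of nodes at a given depth is bounded by the limit raised to that depth, which is what keeps deep graphs readable.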

Summary and Source Code

In this post we have shown how Python can be used to extract, categorize and visualize an XML sitemap. The code we looked at to extract URLs has been aggregated into a Python script for your personal use. This and the other two scripts can be downloaded here:

XML sitemap visualization tool

These can be leveraged to automate URL extraction, categorization and visualization for the sitemap of your choice. For usage instructions, please make sure to check out the source code documentation.

Thanks for reading. We hope you found this tutorial useful. If you run into any problems using our automated XML sitemap retrieval scripts, we are here to help! You can reach us on Twitter @ayima.

Written By Alex Galea