Monitoring a Website’s Sitemap Using a Crawler

Sitemap monitoring is currently in alpha and available in the Professional plan and above.

What is a Sitemap?

A sitemap is an essential tool that helps search engines understand the structure of a website and crawl its content. It usually contains all URLs of a website, including those that may not be easily discoverable by search engines or visitors.

Why do you need a Sitemap crawler?

There are different formats of sitemaps, but the most common and widely supported format is the XML sitemap. While XML sitemaps are widely used, they may not be updated regularly or generated automatically. So, if you want to be notified about new links added to a website, monitoring its sitemap file alone may not reflect the site’s current state, and you may miss new or updated content.

To overcome this limitation, you can use Distill’s crawler to index all URLs on a website. A crawler navigates through a website, discovering and indexing pages just like a search engine would. This ensures that all URLs are extracted from the website, even hidden or dynamically generated pages.

How does Distill work for sitemap monitoring?

You can keep track of the URLs belonging to a particular website using sitemap monitoring. Distill achieves this by first building a list of URLs with the help of crawlers. The crawlers start at the source URL and collect all the links on that page. These links are added to a list, and the crawlers then visit each link in turn, searching for additional links on every new page. They continue this process until there are no links left to crawl.
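The crawl loop described above can be sketched roughly as follows. This is only an illustration of the general process, not Distill’s actual implementation; it assumes the third-party requests and BeautifulSoup libraries and uses a made-up start URL, and it simplifies the scoping rules to a same-host check (the full rules are covered in the next paragraph).

    import requests
    from urllib.parse import urljoin, urlparse
    from bs4 import BeautifulSoup

    def crawl(start_url):
        """Collect every URL reachable from start_url on the same host, breadth-first."""
        seen = {start_url}
        frontier = [start_url]
        start_host = urlparse(start_url).netloc

        while frontier:
            url = frontier.pop(0)
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue  # skip pages that fail to load
            for anchor in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                # resolve relative links and drop fragments
                link = urljoin(url, anchor["href"]).split("#")[0]
                if urlparse(link).netloc == start_host and link not in seen:
                    seen.add(link)        # remember the link so it is visited only once
                    frontier.append(link)
        return sorted(seen)

    # Made-up start URL for the example
    print("\n".join(crawl("https://example.com")))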

However, there are some exceptions to which links the crawlers will follow. They will not crawl links outside of the website’s domain, links that are not a subpath of the Start URL, or links that match the regular expression specified in the Exclude option when the crawler is added. By following these rules, Distill is able to create a comprehensive and accurate list of all URLs belonging to a website, which can then be monitored for changes.
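To make those rules concrete, here is a minimal sketch of a scope check that applies them. The in_scope helper, the example URLs, and the Exclude pattern are all hypothetical; this illustrates the idea and is not Distill’s code.

    import re
    from urllib.parse import urlparse

    def in_scope(link, start_url, exclude_pattern=None):
        """Return True if a link should be crawled under the rules above."""
        start, candidate = urlparse(start_url), urlparse(link)
        if candidate.netloc != start.netloc:
            return False   # outside the website's domain
        if not candidate.path.startswith(start.path):
            return False   # not a subpath of the Start URL
        if exclude_pattern and re.search(exclude_pattern, link):
            return False   # matches the Exclude regular expression
        return True

    print(in_scope("https://distill.io/help/crawler", "https://distill.io/help"))                   # True
    print(in_scope("https://forums.distill.io/t/1", "https://distill.io/help"))                     # False: other subdomain
    print(in_scope("https://distill.io/help/videos/intro", "https://distill.io/help", r"/videos/")) # False: excluded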

How to create a Sitemap monitor?

Distill monitors a sitemap of a website in two steps:

  1. Crawler - Finds all the links present on a website and creates a list of crawled URLs. You can set the crawl frequency and exclusion rules at this stage.
  2. Monitor - Adds the crawled list of URLs for tracking. You can configure the monitor’s settings, such as actions and conditions, at this step.

Here are the steps that you can follow to monitor the sitemap:

  1. Open the Watchlist from the Web app at https://monitor.distill.io

  2. Click Add Monitor -> Sitemap.

    [Screenshot: button to add a Sitemap monitor]

  3. On the source page, add the Start URL. Links are crawled only from the same subdomain and the same subpath as the Start URL, not from the full domain. So, add the URL from which you want to start the crawl. For example, if the URL is https://distill.io, all subpaths like https://distill.io/blog, https://distill.io/help, etc. will be crawled. However, it will not crawl URLs like https://forums.distill.io, as that is a separate subdomain.

  4. If you want to exclude any links from monitoring, you can use the regular expression filter described below.

  5. Click Done. This opens the Options page for the monitor’s configuration. You can configure actions and conditions on this page. Save when done.

You can use the Regular Expression filter to exclude links from crawling. This option is available on the source page when you first set up a crawler for sitemap monitoring. Alternatively, you can modify the configuration of any existing crawler from its detail page.

[Screenshot: Regex filter for exclusion]
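For illustration, here are a couple of hypothetical patterns that could go into the Exclude field, for example to skip tag archives or paginated listing URLs, together with a quick check of what they match. The URLs and patterns below are made up for the example.

    import re

    # Hypothetical patterns for the Exclude field:
    #   /tag/        -> skip tag archive pages
    #   \?page=\d+   -> skip paginated listing URLs
    exclude = re.compile(r"/tag/|\?page=\d+")

    for url in [
        "https://example.com/blog/post-1",
        "https://example.com/blog/tag/news",
        "https://example.com/blog?page=3",
    ]:
        print(url, "->", "excluded" if exclude.search(url) else "kept")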

How to change the frequency of crawling?

By default, the crawler frequency is set to 1 day. However, you have the option to modify the crawling frequency according to your requirements. You can do this by navigating to the crawler’s page as shown below:

[Screenshot: view the crawler’s detail page]

Then you can click on the “Edit Crawler” option and change the schedule.

[Screenshot: edit the crawler’s configuration]
