SiteDiff - Compare Multiple Versions of a Website

25/01/2017 · by Jigar Mehta

If you've ever edited code for a website, you know that a seemingly simple change to just one page can unexpectedly cause other pages to change. Similarly, if you've used content management systems like Drupal, WordPress or Joomla, you've likely seen platform updates cause unexpected problems. Especially for a website with thousands of pages, it is nearly impossible to visit every page to ensure that nothing broke after an update.

SiteDiff to the rescue! SiteDiff is a command-line tool which helps you compare two versions of your site—for example, a known-good version versus an updated version. SiteDiff makes it easy to see how a website changes. It is useful for performing QA on re-deployments, site upgrades, and more!


    To demonstrate the usage of SiteDiff, I'll show you what I did when upgrading the website of Evolving Web from Drupal 8.1.x to Drupal 8.2.x. I wanted to check if the upgrade caused anything to break on our site, so I needed to classify all changes to the site's HTML as either harmless or undesirable. I used the following process:

    1. Install and set up SiteDiff.
    2. Run sitediff diff to find differences between the old version of the site and the updated version.
    3. If differences are found, SiteDiff produces a report listing them:
      • Pick a difference, e.g. "every menu link has a new class in 8.2.x".
      • Determine whether the difference is harmless, or causes problems.
        • If the difference causes a problem, fix the site so that it functions the way it used to.
        • If the difference is harmless, configure SiteDiff so that it ignores the difference when you run it again.
      • Go back to step 2 to find remaining differences.
    4. If there are no differences remaining, that means I've successfully classified all changes. I'll know my updated site is in good shape!

    Installing SiteDiff

    To install SiteDiff, you can follow the SiteDiff installation instructions in the SiteDiff GitHub repository. It involves just three basic steps:

    • Installing Ruby, if it's not already installed. Refer to SiteDiff's README for Ruby version requirements.
    • Installing other dependencies.
    • Installing SiteDiff itself, with sudo gem install sitediff.


    To get started, we need to tell SiteDiff which website we wish to work with. We'll use the command sitediff init before-url after-url. To get two different versions of the site running, you can:

    • Use the live version of your site as the before version, and a development version of the modified site as the after version. The disadvantage of using the live site is that SiteDiff will crawl all your site's pages, which might lead to heavy load on your live site. This only has to happen once, though—SiteDiff can cache the results of its crawl.
    • Set up two development environments - one with a copy of the live version of the site and the other with the changed version of the site, which in my case was the Drupal 8.2 version of the site.

    I used Docker to set up 2 containers - one running the original site at http://localhost:10380/ and the other running the upgraded Drupal 8.2 version of the site at http://localhost:10480/. Once these two sites were up and running, I ran sitediff init http://localhost:10380/ http://localhost:10480/. This crawled my whole site, and automatically created the configuration file sitediff/sitediff.yaml. We'll edit this sitediff.yaml file as we go along. You can refer to the example sitediff.yaml file for more information.

    $ sitediff init http://localhost:10380 http://localhost:10480
    [sitediff] Visited http://localhost:10480, cached
    [sitediff] Visited http://localhost:10380, cached
    [sitediff] Visited http://localhost:10480/about-evolvingweb, cached
    [sitediff] Visited http://localhost:10480/blog, cached
    [sitediff] Visited http://localhost:10480/feed, cached
    [sitediff] Created /path/to/project/sitediff/sitediff.yaml
    [sitediff] You can now run 'sitediff diff'

    To see a list of all available parameters for this command, you can use sitediff help init.

    Finding differences

    This is where things get interesting. You can now issue the command sitediff diff -q --cached=all and you'll see a report of paths which have changed like this:

    $ sitediff diff -q --cached=all
    [sitediff] Reading config file: /path/to/project/sitediff/sitediff.yaml
    [sitediff] Using sites from cache: after, before
    [sitediff] FAILURE /
    [sitediff] SUCCESS /about-evolvingweb
    [sitediff] SUCCESS /blog
    [sitediff] FAILURE /contact
    [sitediff] SUCCESS /feed

    Here, I used two optional parameters to modify SiteDiff's output:

    • -q: Without this optional parameter, SiteDiff shows a diff of each page as the output of the diff command. I chose to run SiteDiff in quiet mode as I wanted to view a detailed report of these changes using sitediff serve (explained in the next step).
    • --cached=all: With this parameter, I tell SiteDiff to use the cached copies of both the before and after versions of the site, which makes the diff run faster. Without it, only the before site is read from the cache and the after site is fetched at run time.

    To see a list of all available parameters for this command, you can use sitediff help diff. To see a detailed report of the exact changes which were found per-page, we can use the command sitediff serve, and SiteDiff renders a nice HTML page with a list of all changes found. This gives you the option to view the before and after versions side-by-side, or view the textual diff of any particular page.

    SiteDiff HTML report
    sitediff serve shows a list of all pages, with changed pages highlighted in red.

    Handling acceptable differences

    In the report, some pages are highlighted in red: these are the pages which have differences. Clicking on the DIFF link for a particular page, we see the full HTML source of the page, along with the changes SiteDiff found. Initially, all of your pages might be highlighted in red! But that's nothing to be worried about. These differences are often harmless or even expected.

    In these cases, we just want to ignore the change, which we'll do using rules in our sitediff.yaml file.

    Normalizing output: Sanitization rules

    In my case, one site ran on http://localhost:10380 and the other on http://localhost:10480, which caused differences in CSS/JS link tags:

    SiteDiff report showing domain differences.

    We know these differences don't reflect a real change in the site's structure or markup, and can safely be ignored. To handle cases like this, we can define rules in the sitediff.yaml file. These rules are evaluated by SiteDiff during the diff operation, so that these unimportant differences do not appear in the report. To handle the difference above, we use the following sanitization rule in the configuration file:

    - title: Strip domain names from absolute URLs
      pattern: http:\/\/localhost:10[34]80
      substitute: http://localhost

    This rule asks SiteDiff to look for the regular expression defined in the pattern element and replace every match with the text given in the substitute element. The title simply documents what the rule is intended to do. We can add such a rule under the global sanitization key to apply it to both sites we're comparing, or put a sanitization key in the before or after section of the file to limit it to the before or after version of the site respectively.
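To see what such a rule does to the two versions of a page, here is a minimal Python sketch of the same substitution. It only illustrates the regex and is not how SiteDiff itself applies rules. The link tags and the sanitize helper are made up for the example.

```python
import re

# The pattern and substitute values from the rule above.
PATTERN = re.compile(r"http:\/\/localhost:10[34]80")
SUBSTITUTE = "http://localhost"


def sanitize(html: str) -> str:
    """Replace every match of PATTERN with SUBSTITUTE (hypothetical helper)."""
    return PATTERN.sub(SUBSTITUTE, html)


# Hypothetical link tags, as each version of the site might emit them.
before = '<link rel="stylesheet" href="http://localhost:10380/core/style.css">'
after = '<link rel="stylesheet" href="http://localhost:10480/core/style.css">'

print(before == after)                      # False: the ports differ
print(sanitize(before) == sanitize(after))  # True: the rule hides the difference
```

Once both sides are rewritten to the same host, the only differences left are ones that come from the site itself.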

    Some other cases where we can use sanitization rules are:

    • Removing randomized content added to parts of the page. For example, Drupal forms are expected to have random form IDs.
    • Handling changes that we like! For example, an update to a Drupal module might correctly add type="email" to email fields. After checking that this looks and behaves ok, we approve the change and write a rule to exclude it from the report.

    Be careful writing your regex patterns. Patterns including .+ or .* are greedy, and might eat up more characters than you intend. It's better to use more restricted patterns like [^"]+.
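Here's a quick way to see the greedy-pattern problem for yourself, sketched in Python. The markup is a simplified, hypothetical stand-in for a Drupal form's hidden input, not output copied from a real site.

```python
import re

# Hypothetical, simplified markup: a hidden input whose value is random on
# every page load, followed by an unrelated attribute we want to keep.
html = '<input name="form_build_id" value="form-Qx7" class="token">'

# Greedy: .+ runs on to the LAST double quote on the line.
greedy = re.sub(r'value=".+"', 'value="ID"', html)

# Restricted: [^"]+ stops at the first closing double quote.
restricted = re.sub(r'value="[^"]+"', 'value="ID"', html)

print(greedy)      # <input name="form_build_id" value="ID">
print(restricted)  # <input name="form_build_id" value="ID" class="token">
```

The greedy version silently swallows the class attribute, so a real change to that attribute would never show up in the report.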

    Normalizing output: DOM transformation rules

    Similarly, there are cases when we might wish to remove certain DOM elements or unwrap them to normalize certain differences between the two versions of the site. For this we can use DOM transformations, such as:

    • unwrap: This transformation removes a given HTML element, but keeps its contents without the wrapping element. For example, if one version of the site wraps articles inside an article element, while the other version does not have that additional article element, we can use the unwrap transformation to remove the article tag while keeping its contents.
    • remove: As the name suggests, we can use this to remove a given HTML element. For example, if we have a block containing random articles or the current time, we can remove the block to normalize the two versions of the site.

    You can see a full list in the SiteDiff docs. Here's what the syntax for a transformation in your sitediff.yaml looks like:

    - type: unwrap
      selector: article

    It's usually best to use DOM transformation rules instead of regex sanitization rules when possible, since they're harder to get wrong, and easier to read and understand later.
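If you're wondering what unwrap and remove actually do to a page, the following Python sketch mimics both on a tiny, well-formed fragment. It is an illustration under simplifying assumptions, not SiteDiff's implementation: it matches elements by tag name only rather than by a full selector, it relies on Python's XML parser (so the markup must be well-formed), and the function names and sample markup are invented for the example.

```python
import xml.etree.ElementTree as ET


def _append_text(parent, index, text):
    """Attach loose text just before position `index` among parent's children."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + text


def remove(parent, tag):
    """Delete every descendant element named `tag`, along with its contents."""
    index = 0
    while index < len(parent):
        child = parent[index]
        if child.tag == tag:
            del parent[index]
            _append_text(parent, index, child.tail)  # keep text that followed it
        else:
            remove(child, tag)
            index += 1


def unwrap(parent, tag):
    """Replace every descendant element named `tag` with its own contents."""
    index = 0
    while index < len(parent):
        child = parent[index]
        if child.tag == tag:
            promoted = list(child)
            del parent[index]
            _append_text(parent, index, child.text)
            for offset, grandchild in enumerate(promoted):
                parent.insert(index + offset, grandchild)
            _append_text(parent, index + len(promoted), child.tail)
            # Do not advance: the promoted children are examined next.
        else:
            unwrap(child, tag)
            index += 1


def normalize(markup):
    """Apply both transformations below the root and serialize the result."""
    root = ET.fromstring(markup)
    unwrap(root, "article")  # drop the wrapper, keep what is inside it
    remove(root, "aside")    # drop a block whose content changes on every load
    return ET.tostring(root, encoding="unicode")


# Hypothetical before/after fragments of the same page.
before = "<div><article><p>Hello</p></article><aside>12:00</aside></div>"
after = "<div><p>Hello</p><aside>12:05</aside></div>"

print(normalize(before) == normalize(after))  # True
```

Because the transformations work on the parsed tree rather than on raw text, they can't accidentally match across attribute or element boundaries the way a loose regex can.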

    Handling undesirable differences

    Not all differences are trivial or harmless, like those above! In my case, I found an unexpected change - a div element acting as a wrapper for a group of checkboxes was changed to a fieldset element due to an update in a module.

    SiteDiff report showing a div replaced by a fieldset

    The HTML still looks reasonable—but since the change was unexpected, I needed to verify that it hadn't caused any problems. I opened the changed page in my browser and found that it looked wrong! Due to different HTML structure, our existing CSS rules turned a label from black to red!

    Form item label looks different after update.
    The form item's label is all in red after the update.

    SiteDiff helped me discover this bug in my site—thanks SiteDiff! I probably wouldn't have found this if I had just updated my site and manually checked a few pages.

    For each confirmed bug, we have to fix the problem in the site's code or CSS. Sometimes we'll reproduce the exact HTML the site used to have. Other times, we'll adapt to the HTML change, and then add a sanitization rule or DOM transformation to ignore the change on future runs, since we've taken care of it.

    Moving along

    Now that you've classified a couple of differences, you might think you're done. But probably not yet! It often takes a few passes to classify all the differences, so you should now run sitediff diff again. Here's a tip to make things go quicker: As you're testing new rules or changes, don't run sitediff diff on your entire site. Instead, you can make SiteDiff look at only the paths you know are relevant. You can do this with the --paths parameter, for example, sitediff diff --paths /path/one /path/two.
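One way to come up with that list of paths is to pull the FAILURE lines out of the previous run's output. The Python sketch below does this for output in the format shown under "Finding differences". The failing_paths helper is hypothetical, and the script only prints the command rather than running SiteDiff.

```python
import re
import shlex

# Output lines in the format shown in the "Finding differences" transcript.
report = """\
[sitediff] FAILURE /
[sitediff] SUCCESS /about-evolvingweb
[sitediff] SUCCESS /blog
[sitediff] FAILURE /contact
[sitediff] SUCCESS /feed
"""


def failing_paths(output: str) -> list[str]:
    """Return the paths reported as FAILURE, in order of appearance."""
    return re.findall(r"^\[sitediff\] FAILURE (\S+)$", output, flags=re.MULTILINE)


paths = failing_paths(report)
command = ["sitediff", "diff", "--paths", *paths]
print(shlex.join(command))  # sitediff diff --paths / /contact
```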

    Eventually, you'll have handled every last difference, and SiteDiff's report will contain nothing in red. Congratulations! Now you can deploy the updated site knowing you haven't broken anything.

    Next steps

    • Read the SiteDiff documentation to learn more about it.
    • See the sitediff.yaml.example file for example rules.
    • Try out SiteDiff the next time you make updates and/or upgrades to a website.
    • Read the YAML documentation to understand the sitediff.yaml file better.