If you've ever edited code for a website, you know that a seemingly simple change to just one page can unexpectedly cause other pages to change. Similarly, if you've used content management systems like Drupal, WordPress or Joomla, you've likely seen platform updates cause unexpected problems. Especially for a website with thousands of pages, it is nearly impossible to visit every page to ensure that nothing broke after an update.
SiteDiff to the rescue! SiteDiff is a command-line tool which helps you compare two versions of your site—for example, a known-good version versus an updated version. SiteDiff makes it easy to see how a website changes. It is useful for performing QA on re-deployments, site upgrades, and more!
To demonstrate the usage of SiteDiff, I'll show you what I did when upgrading the website of Evolving Web from Drupal 8.1.x to Drupal 8.2.x. I wanted to check if the upgrade caused anything to break on our site, so I needed to classify all changes to the site's HTML as either harmless or undesirable. I used the following process:
1. Install and set up SiteDiff.
2. Run sitediff diff to find differences between the old version of the site and the updated version.
3. If differences are found, SiteDiff produces a report listing them:
   - Pick a difference, e.g., "every menu link has a new class in 8.2.x".
   - Determine whether the difference is harmless or causes problems.
   - If the difference causes a problem, fix the site so that it functions the way it used to.
   - If the difference is harmless, configure SiteDiff so that it ignores the difference when you run it again.
   - Go back to step 2 to find remaining differences.
4. If there are no differences remaining, that means I've successfully classified all changes. I'll know my updated site is in good shape!

Setting up SiteDiff involves:
- Installing Ruby, if it's not already installed. Refer to SiteDiff's README for Ruby version requirements.
- Installing other dependencies.
- Installing SiteDiff itself, with sudo gem install sitediff.
To get started, we need to tell SiteDiff which two versions of the site to compare. We'll use the command
sitediff init before-url after-url. To get two different versions of the site running, you can:
- Use the live version of your site as the before version, and a development version of the modified site as the after version. The disadvantage of using the live site is that SiteDiff will crawl all your site's pages, which might lead to heavy load on your live site. This only has to happen once, though—SiteDiff can cache the results of its crawl.
- Set up two development environments - one with a copy of the live version of the site and the other with the changed version of the site, which in my case was the Drupal 8.2 version of the site.
I used Docker to set up 2 containers - one running the original site at
http://localhost:10380/ and the other running the upgraded Drupal 8.2 version of the site at
http://localhost:10480/. Once these two sites were up and running, I ran
sitediff init http://localhost:10380/ http://localhost:10480/. This crawled my whole site, and automatically created the configuration file
sitediff/sitediff.yaml. We'll edit this
sitediff.yaml file as we go along. You can refer to the example sitediff.yaml file for more information.
    $ sitediff init http://localhost:10380 http://localhost:10480
    [sitediff] Visited http://localhost:10480, cached
    [sitediff] Visited http://localhost:10380, cached
    [sitediff] Visited http://localhost:10480/about-evolvingweb, cached
    [sitediff] Visited http://localhost:10480/blog, cached
    [sitediff] Visited http://localhost:10480/feed, cached
    ...
    [sitediff] Created /path/to/project/sitediff/sitediff.yaml
    [sitediff] You can now run 'sitediff diff'
To see a list of all available parameters for this command, you can use
sitediff help init.
This is where things get interesting. You can now issue the command
sitediff diff -q --cached=all and you'll see a report of paths which have changed like this:
    $ sitediff diff -q --cached=all
    [sitediff] Reading config file: /path/to/project/sitediff/sitediff.yaml
    [sitediff] Using sites from cache: after, before
    [sitediff] FAILURE /
    [sitediff] SUCCESS /about-evolvingweb
    [sitediff] SUCCESS /blog
    [sitediff] FAILURE /contact
    [sitediff] SUCCESS /feed
    ...
Here, I used two optional parameters to modify SiteDiff's output:
- -q: Without this optional parameter, SiteDiff prints a diff of each page as the output of the diff command. I chose to run SiteDiff in quiet mode because I wanted to view a detailed report of these changes using sitediff serve (explained in the next step).
- --cached=all: With this parameter, I tell SiteDiff to use the cached versions of both the before and after sites, which makes the diff run faster. Without it, only the before site would be read from the cache and the after site would be fetched at run time.
To see a list of all available parameters for this command, you can use
sitediff help diff. To see a detailed report of the exact changes which were found per-page, we can use the command
sitediff serve, and SiteDiff renders a nice HTML page with a list of all changes found. This gives you the option to view the before and after versions side-by-side, or view the textual diff of any particular page.
Handling acceptable differences
When you run
sitediff diff, some pages are highlighted in red—these are the pages which have differences. Clicking on the DIFF link for a particular page, we see the full HTML source of the page, along with the changes SiteDiff found. Initially, all of your pages might be highlighted in red! But that's nothing to be worried about. These differences are often harmless or even expected.
In these cases, we just want to ignore the change, which we'll do using rules in our sitediff.yaml file.
Normalizing output: Sanitization rules
In my case, one site ran on
http://localhost:10380 and the other on
http://localhost:10480, which caused differences in the URLs of CSS/JS link tags.
We know these differences don't reflect a real change in the site's structure or markup, and can safely be ignored. To handle cases like this, we can define rules in the
sitediff.yaml file. These rules are evaluated by SiteDiff during the
diff operation, so that these unimportant differences do not appear in the report. To handle the difference above, we use the following sanitization rule in the configuration file:
    sanitization:
      - title: Strip domain names from absolute URLs
        pattern: 'http:\/\/localhost:10[34]80'
        substitute: http://localhost
This rule asks SiteDiff to look for the regular expression defined in the
pattern element and replace it with the text given in the
substitute element. The title documents what the rule is intended to do. We can add such a rule under the global sanitization key to apply it to both sites we're comparing, or put a sanitization key in the before or after sections of the file to limit it to the before or after version of the site, respectively.
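
To make the effect of a sanitization rule concrete, here is a minimal Python sketch of the same idea: one regular-expression substitution applied to both versions of a page before they are compared. It is not SiteDiff's implementation. The pattern is an assumption chosen to match both ports used in this walkthrough, and the link tags are hypothetical markup.

```python
import re

# Assumed pattern: matches either local port used in this walkthrough.
PATTERN = re.compile(r"http://localhost:10[34]80")
SUBSTITUTE = "http://localhost"


def sanitize(html: str) -> str:
    """Apply one pattern/substitute rule to a page's HTML."""
    return PATTERN.sub(SUBSTITUTE, html)


# Hypothetical markup for the same page as served by each container.
before = '<link rel="stylesheet" href="http://localhost:10380/core/style.css">'
after = '<link rel="stylesheet" href="http://localhost:10480/core/style.css">'

# After sanitization the two versions are identical, so no diff is reported.
print(sanitize(before) == sanitize(after))  # prints True
```

Testing a pattern this way against a line or two of real markup is a quick check that it matches both sites before you add it to the configuration file.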
Some other cases where we can use sanitization rules are:
- Removing randomized content added to parts of the page. For example, Drupal forms are expected to have random form IDs.
- Handling changes that we like! For example, an update to a Drupal module might correctly add type="email" to email fields. After checking that this looks and behaves OK, we approve the change and write a rule to exclude it from the report.
Be careful writing your regex patterns. Patterns including .* are greedy, and might eat up more characters than you intend. It's better to use more restricted patterns, such as a character class that stops at the next delimiter.
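
The following Python sketch shows the greedy-match problem on a single line of markup. The input is hypothetical Drupal-style HTML with a random form ID, and the replacement text form-ID is just a placeholder chosen for illustration.

```python
import re

# Hypothetical markup: the form_build_id value is random on every request.
html = (
    '<input name="form_build_id" value="form-a1B2c3">'
    '<input name="email" value="me@example.com">'
)

# Greedy: .* runs on to the LAST double quote in the text,
# swallowing the second input element along the way.
greedy = re.sub(r'value="form-.*"', 'value="form-ID"', html)

# Restricted: [^"]* stops at the first closing quote.
restricted = re.sub(r'value="form-[^"]*"', 'value="form-ID"', html)

print(greedy)
# <input name="form_build_id" value="form-ID">
print(restricted)
# <input name="form_build_id" value="form-ID"><input name="email" value="me@example.com">
```

The greedy version silently deletes the email field from the comparison, which could hide a real difference between the two sites.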
Normalizing output: DOM transformation rules
Similarly, there are cases when we might wish to remove certain DOM elements or unwrap them to normalize certain differences between the two versions of the site. For this we can use DOM transformations, such as:
- unwrap: This transformation removes a given HTML element but keeps its contents. For example, if one version of the site wraps articles inside an article element while the other version does not, we can use the unwrap transformation to remove the article tag while keeping its contents.
- remove: As the name suggests, this removes a given HTML element along with its contents. For example, if we have a block containing random articles or the current time, we can remove the block to normalize the two versions of the site.
You can see a full list in the SiteDiff docs. Here's what the syntax for a transformation in your
sitediff.yaml looks like:
    dom_transform:
      - type: unwrap
        selector: article
It's usually best to use DOM transformation rules instead of regex sanitization rules when possible, since they're harder to get wrong, and easier to read and understand later.
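
To illustrate what unwrap and remove do to a page, here is a small Python sketch built on the standard library's html.parser. It is only an approximation of the idea, not SiteDiff's code: it matches elements by tag name rather than by CSS selector, and it assumes every element is explicitly closed. The Normalizer class, the normalize function, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class Normalizer(HTMLParser):
    """Re-serialize HTML, unwrapping some tags and removing others."""

    def __init__(self, unwrap=(), remove=()):
        super().__init__()
        self.unwrap = set(unwrap)
        self.remove = set(remove)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a removed element

    def handle_starttag(self, tag, attrs):
        if self.skip_depth or tag in self.remove:
            self.skip_depth += 1  # drop the tag and everything inside it
        elif tag not in self.unwrap:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1
        elif tag not in self.unwrap:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)


def normalize(html, unwrap=(), remove=()):
    parser = Normalizer(unwrap, remove)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


# Hypothetical pages: one wraps the heading in <article>, and both
# contain an <aside> showing the current time.
before = "<main><article><h1>Hi</h1></article><aside>12:00</aside></main>"
after = "<main><h1>Hi</h1><aside>12:05</aside></main>"

print(normalize(before, unwrap=["article"], remove=["aside"]))
# <main><h1>Hi</h1></main>
print(normalize(after, unwrap=["article"], remove=["aside"]))
# <main><h1>Hi</h1></main>
```

Because the transformation works on elements rather than on raw text, there is no pattern that can accidentally match too much.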
Handling undesirable differences
Not all differences are trivial or harmless, like those above! In my case, I found an unexpected change - a
div element acting as a wrapper for a group of checkboxes was changed to a
fieldset element due to an update in a module.
The HTML still looks reasonable—but since the change was unexpected, I needed to verify that it hadn't caused any problems. I opened the changed page in my browser and found that it looked wrong! Due to different HTML structure, our existing CSS rules turned a label from black to red!
SiteDiff helped me discover this bug in my site—thanks SiteDiff! I probably wouldn't have found this if I had just updated my site and manually checked a few pages.
For each confirmed bug, we have to fix the problem in the site's code or CSS. Sometimes we'll reproduce the exact HTML the site used to have. Other times, we'll adapt to the HTML change, and then add a sanitization rule or DOM transformation to ignore the change on future runs, since we've taken care of it.
Now that you've classified a couple of differences, you might think you're done. But probably not yet! It often takes a few passes to classify all the differences, so you should now run
sitediff diff again. Here's a tip to make things go quicker: As you're testing new rules or changes, don't run
sitediff diff on your entire site. Instead, you can make SiteDiff look at only the paths you know are relevant. You can do this with the
--paths parameter, for example,
sitediff diff --paths /path/one /path/two.
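
One way to build that list of paths is to pull the FAILURE lines out of the quiet-mode report. The Python sketch below assumes the output format matches the sample report shown earlier in this article (real output may differ, for example by including terminal colour codes), and the helper names failing_paths and rerun_command are hypothetical.

```python
import re

# Sample text copied from the quiet-mode report shown earlier.
report = """\
[sitediff] FAILURE /
[sitediff] SUCCESS /about-evolvingweb
[sitediff] SUCCESS /blog
[sitediff] FAILURE /contact
[sitediff] SUCCESS /feed
"""


def failing_paths(text):
    """Return the paths that the report marks as FAILURE, in order."""
    return re.findall(r"^\[sitediff\] FAILURE (\S+)$", text, flags=re.MULTILINE)


def rerun_command(paths):
    """Build the argument list for a diff limited to the given paths."""
    return ["sitediff", "diff", "--paths", *paths]


paths = failing_paths(report)
print(rerun_command(paths))
# ['sitediff', 'diff', '--paths', '/', '/contact']
```

The resulting list can be passed to subprocess.run, or joined with spaces and pasted into a shell.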
Eventually, you'll have handled every last difference, and SiteDiff's report will contain nothing in red. Congratulations! Now you can deploy the updated site knowing you haven't broken anything.