4 things to learn from Load Testing Tesco.com – the world’s biggest Grocery site

Date: 9th April 2011
Author: Deri Jones

One of the nice things in my job is that I get days out of the office with some cool eCommerce guys who manage big and interesting sites, planning some innovative website testing and monitoring.

Even more interesting is sharing a speaking engagement with people we’ve worked with, and hearing what they are willing to say in public about the web performance projects we’ve shared!

Luke Fairless at Tesco.com is one of those guys. He juggles keeping the world’s biggest grocery site running, with a continual stack of projects to add functionality for customers.

We spoke together at Internet Retailing Expo in March, and the videos of the day have just gone live: click on “Luke and Deri: Tesco and SciVisum”.

Luke raises a number of great points in his presentation on how to do website load testing. Taking just four:

Test the Live site

Nobody has a test environment that is specc’ed the same as their live environment: the bean counters will never sign off spending for the same level of kit on a test environment. So the only way to know what your site can handle is to load test the live website. It’s hard: hard to switch off certain things (Tesco allowed a percentage of the card transactions to go through to the bank system), hard to clean up databases before real customers are back on the site again, and so on. But it’s worth it.

Only then can you tell the business the limits of your site based on evidence rather than assumptions.

As Life Like as Possible

That’s Luke’s description of what we call  ‘Do What the Customer Does’.

One of the things we get most praise for from our clients is our ability to test and monitor their website with Dynamic User Journeys that do what the customer does.

For Tesco, for example, we run a User Journey with the innocuous title ‘Add to Basket Grocery Favourites’, but what that journey does is find and add to the basket 65 different grocery items! Phew. We do that because that’s what the average Tesco shopper does: they put way more items into the basket per order than shoppers on your average retail website.

Every time a Dynamic User Journey runs, it looks into the page and pulls out a random subcategory or chooses a random product from the list offered. Load testing just doesn’t emulate real-world users if it’s always putting the same product in the basket! You want to see a huge spread of products being handled; otherwise technology features like caching mean that the load on the site is way lower in testing than it would be for the same level of activity from real users.
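To make the idea concrete, here is a minimal Python sketch of random product selection. It is an illustration only, not SciVisum’s engine: the `CATALOGUE` dictionary is a made-up stand-in for the subcategories and product lists a real journey would read from the live pages, and the function names are hypothetical.

```python
import random

# Hypothetical stand-in for what a journey would scrape from the site:
# 10 subcategories, each offering 20 products.
CATALOGUE = {
    f"subcategory-{i}": [f"product-{i}-{j}" for j in range(20)]
    for i in range(10)
}


def pick_random_product(catalogue, rng):
    """Choose a random subcategory, then a random product from its list."""
    subcategory = rng.choice(sorted(catalogue))
    return rng.choice(catalogue[subcategory])


def build_basket(catalogue, basket_size, rng):
    """Simulate one journey run that adds basket_size randomly chosen items."""
    return [pick_random_product(catalogue, rng) for _ in range(basket_size)]


if __name__ == "__main__":
    rng = random.Random()
    basket = build_basket(CATALOGUE, 65, rng)
    # A scripted test that always adds the same item would report 1 distinct product.
    print(len(basket), "items added,", len(set(basket)), "distinct products")
```

Because each run draws from the whole catalogue, repeated runs spread requests over many product pages instead of hitting one page that is probably cached.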

Test it – Break it – Fix it – Test it Again

Luke’s focus when load testing his main money-making website, the Grocery site, was to break the site each time. His team do this in order to generate a stack of good log data about the bottlenecks and errors. That provides the raw material for offline error analysis during the week, ensuring everything is ready before the next big overnight load test in the early hours of the following Sunday morning.

This approach means fixes can be implemented during the week and tested on the Sunday, for quick feedback: did they help or not?

There are two diametrically opposed approaches to planning the load levels for a website load test: either build up from the bottom or push down from the top.

Using the bottom up approach, you test each User Journey in isolation and ramp up until you find the limit: the point at which the number of completed User Journeys starts to go down, even as the number of journeys you are starting continues to rise. This gives clearer and more explicit data per Journey and often saves time later, because you can see which Journeys are already performing above requirements, and so need less attention than those that are clearly struggling. At the end of the project, you then run all Journeys mixed together and take the whole system to its limits.
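The limit described here can be picked out of ramp data with a few lines of Python. This is a rough sketch under assumptions: the input format (a list of started/completed counts per ramp step) and the function name are invented for illustration, and real measurements are noisy enough that you would want to smooth them or require a sustained drop before trusting the answer.

```python
def find_journey_limit(ramp_results):
    """Return the last ramp step before completions start to fall.

    ramp_results is a list of (journeys_started, journeys_completed) tuples,
    ordered by increasing load. Returns None if completions never drop,
    meaning the limit was not reached during the ramp.
    """
    for previous, current in zip(ramp_results, ramp_results[1:]):
        prev_started, prev_completed = previous
        cur_started, cur_completed = current
        # Limit: we start more journeys, yet fewer of them complete.
        if cur_started > prev_started and cur_completed < prev_completed:
            return previous
    return None


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from any real test.
    ramp = [(100, 100), (200, 198), (300, 290), (400, 250), (500, 180)]
    print(find_journey_limit(ramp))
```

Running this per Journey gives a simple table of limits, which is what lets you see which Journeys are comfortably above requirements and which need attention.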

The top down method, the approach Tesco wanted us to take, starts by breaking the site. Instead of adding in the User Journeys one at a time, you start with all Journeys running, and at a high throughput. Don’t start low: start at the level of the busiest hour ever from history and ramp up from there.

For us that was a scary approach. We were already conscious that we were testing one of the biggest websites in the UK (certainly it has the highest number of items added to basket per hour), so we knew it was going to push our own testing infrastructure to the limits too. It would have been kinder on us to start low and ramp up, but Tesco wanted to start high and break it, and so we did. Which meant that the first night was embarrassing: our system needed more time to clean up between tests than we wanted, while maybe a dozen Tesco guys sat on the conference call, waiting for SciVisum’s kit to catch up and kick off the next test.

We learnt some unexpected things about the memory handling properties of our own engine that night – it wasn’t just Tesco that went away with some things to fix before the next weekend!

Get the Numbers Right

This was perhaps the most impressive part of the planning Tesco did. They really wanted to be sure that the mix of Journeys we had created, each run at a different percentage in the mix and each with a different number of steps, matched the profile of the real world traffic: the same traffic levels hitting each of the various site functions.

That means the same ratio of hits as real traffic across the site features: Login, Add-to-Basket, Search, CheckOut, DeliverySlot selection etc.
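A journey mix can be checked against the real traffic profile by working out what share of hits each site feature receives. The Python sketch below shows the arithmetic only; the journey definitions and weights are invented for illustration and are not the mix used for Tesco.

```python
from collections import Counter


def feature_hit_shares(journey_mix):
    """Compute each site feature's share of total hits for a journey mix.

    journey_mix is a list of (weight, steps) pairs, where weight is how often
    the journey runs relative to the others and steps is the list of site
    features it touches (a feature may appear more than once).
    """
    hits = Counter()
    for weight, steps in journey_mix:
        for feature in steps:
            hits[feature] += weight
    total = sum(hits.values())
    if total == 0:
        return {}
    return {feature: count / total for feature, count in hits.items()}


if __name__ == "__main__":
    # Hypothetical two-journey mix: 60% browse-and-add, 40% add-and-checkout.
    mix = [
        (60, ["Login", "Search", "Add-to-Basket"]),
        (40, ["Login", "Add-to-Basket", "CheckOut"]),
    ]
    for feature, share in feature_hit_shares(mix).items():
        print(f"{feature}: {share:.1%}")
```

Comparing these shares with the same figures derived from the analytics logs shows which journeys need a higher or lower percentage in the mix.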

That meant more effort in test planning, to match the planned journey mix against the complexities of the large data set Tesco had of real world analytics logs. Even that data itself was a brain cruncher to understand.

But it also meant an engineering challenge each night. We had to tweak the Journeys to ensure that the ratios stayed within the target ranges, even as errors occurred and Journeys slowed down. For example, take a Journey that starts with a Login and then does an Add to Basket. If it starts to error during the load test so that, say, 20% fail just before the crucial Add to Basket step, what you are left hitting the site with is 100% Logins but only 80% Add to Baskets, whereas you’d planned for it to be 100:100, as it is at low load.

So as the load on the site increased, we had to keep checking that we were still hitting the right ratios, and alter the ratio and mix of Journeys, and even the steps within Journeys, to get us back to the right mix despite the errors being thrown.
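The Login and Add to Basket example works out as follows in Python. This is only a sketch of the arithmetic, with hypothetical function names: it shows how far each step has drifted from plan, while how the correction is applied (extra journeys, repeated steps, a changed mix) remains a tuning decision for each test.

```python
def hits_per_step(journeys_started, reach_rates):
    """Hits each step actually receives, given the fraction of journeys reaching it."""
    return {step: journeys_started * rate for step, rate in reach_rates.items()}


def correction_factors(actual_hits, target_hits):
    """Multiplier needed on each step's hits to get back to the planned level."""
    return {
        step: target_hits[step] / actual_hits[step]
        for step in target_hits
        if actual_hits.get(step, 0) > 0  # a step with no hits needs a different fix
    }


if __name__ == "__main__":
    # 100 journeys started; 20% fail just before Add to Basket.
    actual = hits_per_step(100, {"Login": 1.0, "Add to Basket": 0.8})
    planned = {"Login": 100, "Add to Basket": 100}
    print(actual)
    print(correction_factors(actual, planned))
```

With 20% of journeys failing before the basket step, the Add to Basket hits need scaling by about 1.25 to restore the planned 100:100 ratio, and that factor has to be recomputed as the error rate changes with load.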

And given the approach taken of ‘let’s start by breaking it’, there were lots of errors to accommodate early on.

So that was lots of tweaking between tests.

So in conclusion: as a result of the in-depth load testing Tesco did with us, the Christmas rush was handled by their system.

Even the horrible nightly peaks in the exact 5 minute slots each night, which line up with the adverts in EastEnders!

Enjoy the video where Luke explains what happens if those short but spiky peaks aren’t handled by your site. That’s a warning that applies to any retailer: if traffic exceeds the system capacity, for no matter how short an interval, it can seriously hit sales.

Luke Fairless and Deri Jones speak at Internet Retailing Expo 2011
