Aftermath

The grim march in the days and weeks following launch produced impressive improvements. The CDN’s engineers redeemed themselves for their “sneak preview” error before launch. In one day, they used their edge server scripting to help shield the site from some of the worst offenders. They added a gateway page that provided three critical capabilities. First, if the requester did not handle cookies properly, the page redirected the browser to a separate page explaining how to enable cookies. Second, we could set a throttle to determine what percentage of new sessions would be allowed. If we set the throttle to 25 percent, then only 25 percent of requests for this gateway page would serve the real home page. The rest would receive a politely worded message asking the visitor to come back later. Over the next three weeks, we had an engineer watching the session counts at all times, ready to pull back on the throttle whenever the volume appeared to be getting out of hand. If the servers became completely overloaded, it took nearly an hour to get back to serving pages, so it was vital to use the throttle to keep them from getting saturated. By the third week, we were able to keep the throttle at 100 percent all day long.

The third critical capability we added was the ability to block specific IP addresses from hitting the site. Whenever we observed one of the shopbots or request floods, we would add them to the blocked list.
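Taken together, the gateway’s three checks amount to a single admission filter at the edge. Here is a minimal sketch in Python. The real logic ran in the CDN’s proprietary edge-scripting language, so the blocklist entries, page names, and function signature are all illustrative assumptions, not the actual implementation.

```python
import random

# Hypothetical blocklist of shopbot addresses (capability 3).
BLOCKED_IPS = {"203.0.113.7"}

# Share of *new* sessions admitted (capability 2). 25 means one in four.
THROTTLE_PERCENT = 25

def gateway(ip: str, cookies_ok: bool, has_session: bool) -> str:
    """Decide which page an incoming request should receive."""
    if ip in BLOCKED_IPS:
        return "blocked"                # known scraper: refuse outright
    if not cookies_ok:
        return "enable-cookies-help"    # capability 1: cookie check
    if has_session:
        return "home"                   # existing sessions always pass
    if random.uniform(0, 100) < THROTTLE_PERCENT:
        return "home"                   # admit this new session
    return "come-back-later"            # politely defer everyone else
```

Note that existing sessions bypass the throttle: turning away a visitor mid-visit would be worse than never admitting them at all.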

All those things could’ve been done as part of the application, but in the mad scramble following launch, it was easier and faster to have the CDN handle them for us. We had our own set of rapid changes to pursue.

The home page was entirely dynamically generated, from the JavaScript for the drop-down category menus to the product details, even down to the “terms of use” link at the bottom of the page. One of the application platform’s key selling points was personalization. Marketing was extremely keen on that feature but had not decided how to use it. So the home page, generated and served up five million times a day, was exactly the same every single time it got served. There wasn’t even any A/B testing. Building the page also required more than 1,000 database transactions. (Even if the data was already cached in memory, a transaction was still created because of the way the platform worked.) The drop-down menus with their nice rollover effects required traversing eighty-odd categories. Traffic analysis also showed that a significant percentage of each day’s visits hit only the main page. Most of those visits didn’t present an identification cookie, so personalization wasn’t even possible. Still, if the application server got involved in sending the home page, it would take time and create a session that would occupy memory for the next thirty minutes. So we quickly built some scripts that made a static copy of the home page and served that copy to any unidentified customer.
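The serving half of that fix can be sketched in a few lines. This is a hypothetical Python version under assumed names: the cookie name, the snapshot path, and the rendering stub are all made up for illustration, and a separate script (not shown) would refresh the snapshot periodically.

```python
from pathlib import Path

# Hypothetical location of the pre-rendered snapshot of the home page.
STATIC_HOME = Path("home_snapshot.html")

def render_dynamic_home(customer_id: str) -> str:
    # Stand-in for the application server's personalized rendering,
    # which costs a session plus over 1,000 database transactions.
    return f"<html>home page for {customer_id}</html>"

def serve_home(cookies: dict) -> str:
    # Unidentified visitors -- the vast majority -- get the static
    # copy: no session created, no database touched.
    if "customer_id" not in cookies:
        return STATIC_HOME.read_text()
    return render_dynamic_home(cookies["customer_id"])
```

The key design point is the default: absent proof that personalization is even possible, the request never reaches the application server at all.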

Have you ever looked at the legal conditions posted on most commerce sites? They say wonderful things like, “By viewing this page you have already agreed to the following conditions....” It turns out that those conditions exist for one reason: when a retailer discovers a screen scraper or shopbot, it can sic the lawyers on the offending party. We kept the legal team busy those first few days. Each time we identified another set of illicit bots hitting the site to scrape content or prices, the lawyers would send cease-and-desist notices; most of the time, the bots would stop. They never stayed away for long, though.

This particular application server’s session failover mechanism was based on serialization. The user’s session remains bound to the original server instance, so all new requests go back to the instance that already has the session in memory. After every page request, the session is serialized and sent over the wire to a “session backup server,” which keeps the sessions in memory. Should the user’s original instance go down, deliberately or otherwise, the next request gets directed to a new instance, chosen by the load manager. The new instance then attempts to load the user’s session from the session backup server. Normally a session should hold only small pieces of data, mostly keys: the user’s ID, her shopping cart ID, and maybe some information about her current search. Putting the entire shopping cart into the session in serialized form would be a bad idea, and so would the entire contents of the user’s last search result. Sadly, that’s exactly what we found in the sessions: not only the whole shopping cart but up to 500 results from the user’s last keyword search, too. We had no choice but to turn off session failover.
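To see why the bloated sessions hurt, compare the serialized size of a lean, keys-only session against one carrying the cart and search results. This Python sketch uses pickle in place of Java serialization, and every field name and record count is an assumption chosen for illustration.

```python
import pickle

# A lean session: just keys pointing at state held server-side.
lean_session = {
    "user_id": 12345,
    "cart_id": 67890,
    "search_terms": "garden hose",
}

# What we actually found: the whole cart plus up to 500 search results,
# re-serialized and shipped to the backup server on every page request.
bloated_session = {
    "user_id": 12345,
    "cart": [{"sku": i, "name": f"item {i}", "price": 9.99} for i in range(20)],
    "search_results": [{"sku": i, "title": f"result {i}"} for i in range(500)],
}

lean_bytes = len(pickle.dumps(lean_session))
bloated_bytes = len(pickle.dumps(bloated_session))
```

The lean payload is a few hundred bytes at most; the bloated one is larger by orders of magnitude, and that difference crossed the network after every single page request.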

All these rapid response actions share some common themes. First, nothing is as permanent as a temporary fix. Most of these remained in place for multiple years. (The longest of them—rolling restarts—lasted a decade and kept going through more than 100 percent turnover in the team.) Second, they all cost a tremendous amount of money, mainly in terms of lost revenue. Clearly, customers who get throttled away from the site are less likely to place an order. (At least, they are less likely to place an order at this site.) Without session failover, any user in the middle of checking out would not be able to finish when that instance went down. Instead of getting an order confirmation page, for example, they would get sent back to their shopping cart page. Most customers who got sent back to their cart page, when they’d been partway through the checkout process, just went away. Wouldn’t you? The static home page made personalization difficult, even though it’d been one of the original goals of the whole rearchitecture project. The direct cost of doubling the application server hardware is obvious, but it also brought added operational costs in labor and licenses. Finally, there was the opportunity cost of spending the next year in remediation projects instead of rolling out new, revenue-generating features.

The worst part is that none of those losses was necessary. Two years after the site launched, it could handle more than four times the load on fewer servers of the same original model. The software improved that much. If the site had originally been built the way it is now, the engineers could have joined marketing’s party and popped a few champagne corks instead of popping fuses.
