Friday, September 26, 2008

Notes from before the second Great Depression


"May you live in interesting times" is an interesting curse (or complement) and it certainly applies today. I find myself wanting to preserve this moment for posterity. It will be fun to look back after we know what happens after Americas':
  • nationalization of Fannie Mae and Freddie Mac (basically the mortgage industry)
  • bailout of American International Group, Inc. (AIG)
  • largest Bank failure in history (WaMu)
  • failure, purchase, or conversion of every investment bank on Wall Street
  • a government ban on short selling
  • Republicans arguing amongst themselves about nationalizing banks, creating new government regulations and agencies, and spending $700B of taxpayer money
  • all in the middle of fighting wars on two fronts and an election campaign

Monday, August 11, 2008

More than two ways to scale

We know from basic physics that not everything scales up simply. Just because a small model of a bridge doesn’t fall down does not mean that an exact replica of the design, made a hundred times bigger, will stand. It won’t. It's just that simple. Yet whenever we talk about how to scale a computer architecture or online service from a small number of users to a much larger number, the choices seem limited to two endlessly debated options: scale-up or scale-out.

To scale-out (or horizontally) means to put more nodes into the system. If four machines running web servers can’t handle the traffic, then add four more and balance the load across all eight. If the bridge is full of traffic, build another bridge right beside the first and split the load. The extra capacity is simple to implement (just copy exactly), which is the appeal, but it also comes at a price. Two bridges cost (almost) twice as much to build as one.
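
As a minimal sketch of the idea (the node names and the round-robin policy here are purely illustrative, not any particular load balancer's behavior):

    # Scale-out sketch: add identical nodes and spread requests across all of them.
    # Node names and the round-robin policy are illustrative only.
    from itertools import cycle

    nodes = ["web1", "web2", "web3", "web4"]       # the original four web servers
    nodes += ["web5", "web6", "web7", "web8"]      # scale out: add four more
    rotation = cycle(nodes)

    def route(request):
        """Hand each incoming request to the next node in the rotation."""
        return next(rotation), request

    for i in range(10):
        print(route(f"GET /page/{i}")[0])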

To scale-up (or vertically) means to put more resources into a single node. This happens when we throw more CPUs, memory, or server instances at a single box. It’s like building a double-decker bridge. The attraction is that the costs don’t double because so much more of the infrastructure is shared. There’s just a limit to how many decks you can reasonably build into a bridge.

The problem here is that the scale-up and scale-out approaches not only increase capacity, but also complexity and cost. Twice the CPUs (regardless of which boxes they’re in) mean more power, more heat, and more cost. Scale-up and scale-out do not help to build a bridge at half the cost, nor one that spans twice as far. To achieve those goals one has to focus on making stronger building materials and investigating alternative construction techniques. There is no consensus on what to call some of these more innovative approaches, but I like to call them scale-in and scale-down.

To scale-in means to put the nodes closer together. When two computers communicate in an interdependent way, it really matters what the latency is between them, particularly when they have to exchange messages several times in order to complete a transaction. Financial institutions co-locate their program trading engines inside the stock exchange’s datacenter just to reduce the speed-of-light propagation delay between the nodes. Same machines, same software, approximately the same cost, but a difference in performance worth millions of dollars. The proliferation of caches, data grids, and the practice of pushing content to the edges of content delivery networks are all examples of the scaling-in of data. Think about what Yahoo! would have to do to build your my.yahoo.com page if every single piece of personalized data (news, blogs, stock quotes, and weather) represented another query between their various servers instead of a high-speed lookup in the web/app servers’ local memory. Scaling in also happens regularly in the semiconductor industry. The 90 nm wafer fabrication process is replaced with the smaller 65 nm process, and then 45 nm, placing the on-chip blocks of logic closer together. Each new, smaller generation is more efficient and more powerful.
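
A rough sketch of the data side of scaling-in (the lookup function and item names are hypothetical stand-ins, not Yahoo!'s actual architecture):

    # Scale-in sketch: serve personalized data from memory that has been pushed
    # close to the web/app server instead of paying a network round trip per item.
    # fetch_remote() is a hypothetical stand-in for a query to another server.
    local_cache = {}

    def fetch_remote(key):
        return f"value-for-{key}"                  # stand-in: dominated by network latency

    def get(key):
        if key not in local_cache:
            local_cache[key] = fetch_remote(key)   # pay the round trip once
        return local_cache[key]                    # afterwards it's a local memory lookup

    page = [get(item) for item in ("news", "blogs", "quotes", "weather")]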

To scale-down means to replace the higher-level logic of a node with lower-level logic. At first this sounds like a contradiction in terms. Higher level means better and more powerful, right? Let me explain what I mean by higher and lower. Computations today happen at mind-boggling layers of abstraction. A simple A=B comparison statement can be running in a scripting language, inside a plug-in, inside an integration server, inside an EJB, inside an app server, inside a Java virtual machine, inside a guest operating system, inside a virtual machine, inside a host operating system, running on a CPU, inside a server, with remote storage (holding A) and distributed memory (holding B), across two networks, spanning data centers, running VLANs, across multiple switches, and on, and on. A=B is something that can consume an enormous amount of resources, or it can run in the simplest low-level instruction set built into a digital watch. Don’t get me wrong, I’m not saying we should all go back to programming assembler code for better efficiency. These layers of abstraction add value too, or they wouldn’t be so popular. They help scale the all-important dimensions of time-to-market, ease-of-use, and the number of available programmers. However, there are all kinds of examples of highly repetitive logic executed at the highest layers of the abstraction stack that could be done more efficiently down lower: scaled-down. For example, do we really need SOAP-based WS-Notification messages to tell us “Status OK” when plain HTTP response codes (à la REST) work fine?
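
As a tiny example of that last point, here is a health check that leans on the HTTP layer itself; the URL is just a placeholder:

    # Scale-down sketch: let the HTTP response code answer "Status OK" instead of
    # wrapping the same answer in a SOAP envelope that every layer above has to
    # build and parse. The URL is a placeholder.
    from urllib.error import HTTPError
    from urllib.request import urlopen

    def service_is_ok(url):
        try:
            return 200 <= urlopen(url).status < 300   # the 2xx code is the whole answer
        except HTTPError:
            return False

    print(service_is_ok("http://example.com/healthcheck"))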

The absolute biggest “bang for the buck” scale-down situation is when highly predictable and repeatable functions that are currently done in software, on general-purpose CPUs, can be shrunk down into the hardware layer in the form of FPGAs, ASICs, or even, in some cases, the extended instruction sets of the next generation of CPUs. Do we need to waste general-purpose CPUs running Java-based load-balancing algorithms when a perfectly good F5 box can do the same thing orders of magnitude faster? Imagine if IP routers had remained, as they started, entirely written in software. How many of the machines in Google’s datacenter would be dedicated to forwarding IP packets back and forth to the rest of the cloud? The cloud could not exist as we now know it without the tremendous advances in scaling-down software logic into the purpose-built chipsets that power the high-speed and cost-effective network equipment we all take for granted.

Friday, August 8, 2008

Why it's too soon to call Cuil uncool

It didn't take long for Dave Kellogg to trash Cuil, the new search engine making bold claims like "the world’s biggest search engine". While time may show that Cuil is not up to the hype, it really is too soon to level criticism. Let me explain why.

Evaluating search results on launch day is like drag racing a new car as you drive it off the dealer's lot, or using a new catcher's mitt in a game before you've properly broken it in.

Modern Internet search engines all use some form of relevance-ranking algorithm that takes into account not only the content crawled on the Internet, but also users' queries, click-through rates, and other usage data that simply is not present on day one. Time and usage are needed to build up a sufficient quantity of user data to affect results. The size of its user base and its historical data are some of Google's biggest advantages when people compare its search results against newcomers'.
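
To make that concrete, here is a toy ranking function; the weights and field names are invented for illustration, not how Google or Cuil actually score results:

    # Toy relevance ranking: blend a content-only score with usage signals
    # (click-through rate) that a brand-new engine simply doesn't have yet.
    # Weights and field names are invented for illustration.
    def score(doc):
        content = doc["term_match"]            # what crawling alone gives you on day one
        ctr = doc.get("click_rate", 0.0)       # accumulates only with time and users
        return 0.7 * content + 0.3 * ctr

    results = [
        {"url": "a.com", "term_match": 0.80, "click_rate": 0.9},  # incumbent, with history
        {"url": "b.com", "term_match": 0.90},                     # day-one engine, no usage data
    ]
    results.sort(key=score, reverse=True)
    print([d["url"] for d in results])         # the better content match can still lose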

Mr. Kellogg should remember the value of analyzing historical data from his days at Business Objects and cut Cuil some slack. The Cuil founders have at least openly acknowledged the challenges of the first 48 hours. Let the site run for a while and then see if the results still suck.

Thursday, August 7, 2008

My Pet Peeve "On Steroids"

I have a pet peeve about the use of the term "on steroids" when describing a product or service. Take for example the Fast Company article entitled FriendFeed - Twitter on Steroids or the BusinessWeek piece Intel's WiMax: Like Wi-Fi On Steroids. I'm pretty sure the authors of these articles aren't trying to say that WiMax and FriendFeed are somehow cheating in order to appear better than Wi-Fi and Twitter. There is no mention of the bad side effects of using these products. WiMax doesn't give you liver damage or ruin your reproductive system. FriendFeed doesn't destroy your reputation and ruin your career. The whole "on steroids" term is just a bad metaphor, clearly out of touch with the times. Can we please find a way to purge this phrase before I have to explain to my kids why "on steroids" is supposed to mean something better in every way, except in real life?

Wednesday, August 6, 2008

What does it mean to be Cloud-Class?

In the world of software and hardware infrastructure we routinely label things as personal, departmental, enterprise-class, or carrier-grade. So what does it mean to be cloud-class infrastructure? It’s clear that it’s not just a simple matter of performance. Rather than rehash a lot of weasel words like “robust”, “scalable”, and “reliable”, let’s instead look at what features are valued by cloud builders. Many of them simply do not exist (or are much less important) at the other quality-of-service levels.

Multi-tenant – much of the cost benefit of the cloud comes from religious adherence to the architectural principle of sharing common resources and never dedicating systems to specific users or groups (tenants). The simple fact that not every user is active at the same time is what makes this model work. However, multi-tenancy also means that cloud builders need to pay close attention to governance and security so that resource hogs and malicious users don’t adversely impact other clients. Tenants that don’t feel safe or well served in a shared space will simply choose to leave.
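
A bare-bones sketch of that kind of per-tenant governance (the limit and tenant IDs are made up):

    # Multi-tenant governance sketch: shared capacity with a per-tenant ceiling so a
    # resource hog can't crowd out its neighbours. The limit is an invented number.
    from collections import defaultdict

    REQUESTS_PER_MINUTE = 1000                 # same ceiling for every tenant
    usage = defaultdict(int)                   # requests seen this minute, per tenant

    def admit(tenant_id):
        """Admit a request only if the tenant is still under its share."""
        if usage[tenant_id] >= REQUESTS_PER_MINUTE:
            return False                       # throttle the hog, protect the rest
        usage[tenant_id] += 1
        return True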

Usage-based metering and accounting – making everyone pay their fair share of the usage is the model of most successful large-scale utilities (gas, electricity, water, etc.). It’s what encourages conservation and allows the little guy to avoid being unduly taxed to make up for the wasteful practices of others. All-you-can-consume, per-user subscription models lead to over-provisioning (read: overpaying) and wasteful consumption (read: overusing). Cloud builders can’t cop out on building the infrastructure required for detailed usage metering and billing without paying a price. That price comes in the form of either i) addressable market (i.e. your subscription price builds in a buffer that prices out the long tail of little guys) or ii) cost of infrastructure (i.e. you are forced to pay the cost of servicing the unprofitable power users).
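
For example, a toy metering record and bill might look like this (the rates and field names are invented, not anyone's actual price list):

    # Usage-based billing sketch: meter what each customer actually consumed and
    # charge exactly that, instead of a flat all-you-can-consume subscription.
    # Rates and record fields are invented for illustration.
    RATES = {"cpu_hours": 0.10, "gb_stored": 0.15, "gb_transferred": 0.17}

    def bill(usage_records):
        """Sum metered usage per customer into a dollar amount."""
        totals = {}
        for rec in usage_records:
            charge = sum(rec.get(metric, 0) * rate for metric, rate in RATES.items())
            totals[rec["customer"]] = totals.get(rec["customer"], 0) + charge
        return totals

    print(bill([{"customer": "little-guy", "cpu_hours": 2, "gb_stored": 1}]))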

Automated provisioning and self-healing – enterprises are teeming with service people to take your order, service your order, and comfort you when your order wasn’t handled right. There is no help desk in the cloud. No customer service representative to calm you down when your job didn’t execute. Users have to serve themselves, and the system has to provision for them automatically. Instances need to be stopped and started dynamically as demand, location, and availability dictate. This is a level of automation rarely achieved on smaller-scale systems.
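
A crude sketch of that control loop (check_health, start_instance, and stop_instance are hypothetical stand-ins, not any particular cloud's API):

    # Self-healing sketch: no help desk, just a loop that compares desired capacity
    # to healthy capacity and provisions or retires instances on its own.
    # check_health, start_instance, and stop_instance are hypothetical stand-ins.
    def reconcile(desired, instances, check_health, start_instance, stop_instance):
        healthy = [i for i in instances if check_health(i)]
        for failed in set(instances) - set(healthy):
            stop_instance(failed)                    # heal: retire failed instances
        while len(healthy) < desired:
            healthy.append(start_instance())         # provision up to demand
        while len(healthy) > desired:
            stop_instance(healthy.pop())             # shrink when demand falls
        return healthy

    # In a real system this runs continuously, with "desired" recomputed from demand.
    nodes = reconcile(3, ["i-1", "i-2"], lambda i: True, lambda: "i-new", lambda i: None)
    print(nodes)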

Loosely coupled – systems need to be engineered and developed with minimal assumptions (coupling) between the components. Any large system of dynamically moving parts will quickly grind to a halt if the parts are too dependent on one another’s location, version (format), or availability (time). Separation of application code from all underlying resources is of paramount importance to a cloud builder. The cloud must be a platform for continuous, non-disruptive upgrades and improvements.

Continuously available – 100% uptime of the entire system as a whole. Not five nines, not HA. The cloud can’t go down, ever. Just because a single component claims to be 99.999% available doesn’t mean that when you add a bunch of them into the mix you magically get 99.999% (or better) uptime. Usually it’s just the opposite: more things that can fail usually means lower uptime, unless of course you architect it right. Even very well-architected and tested systems are vulnerable to catastrophic systemic failures (like the famous AT&T SS7 crash, or the Ohio-Michigan electric grid meltdown in 2003).
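
The arithmetic behind that is easy to check. Components a request depends on in series multiply together, so a chain of "five nines" parts is less available than any single one of them (the chain depth of 20 here is an invented example):

    # Availability arithmetic: serial dependencies multiply.
    per_component = 0.99999                     # 99.999% availability each
    components = 20                             # invented depth of the dependency chain
    system = per_component ** components        # ~0.9998, i.e. roughly 99.98%

    minutes_per_year = 365 * 24 * 60
    print((1 - per_component) * minutes_per_year)   # ~5 minutes of downtime per year
    print((1 - system) * minutes_per_year)          # ~105 minutes for the whole chain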

These are some serious architectural goals to achieve, and these are just five off the top of my head. You can see why I have the utmost respect for cloud builders.

Thursday, July 31, 2008

Amazon S3 Outage

Reuven Cohen's recent blog post asks "how can we avoid the full system reboot" for cloud infrastructure, such as the one that recently happened due to a bug in Amazon's S3 gossip protocol.

Inter-machine "gossiping" is typically done with homegrown, open source, or commercial RPC or publish/subscribe message-queue middleware. Facebook uses Thrift, Google uses Protocol Buffers with a homegrown RPC, eBay uses TIBCO Rendezvous, and others try to get by with Apache ActiveMQ or something similar in their favorite language (such as Starling for the Ruby crowd). Some people prefer to allow "gossiping" at a higher level of abstraction, such as ESBs and data grids.
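
Whatever the transport, the pattern underneath looks roughly the same. A minimal sketch of gossip-style state exchange, with an in-memory dict standing in for the middleware above (illustrative only, not Amazon's actual protocol):

    # Gossip sketch: each node periodically merges its view of cluster state with a
    # randomly chosen peer. The in-memory "nodes" dict stands in for real RPC or
    # pub/sub middleware; this is illustrative, not Amazon's actual protocol.
    import random

    nodes = {n: {n: 0} for n in ("s3-a", "s3-b", "s3-c", "s3-d")}   # node -> its view

    def gossip_round():
        for name, view in nodes.items():
            view[name] += 1                                # bump own heartbeat counter
            peer = random.choice([p for p in nodes if p != name])
            merged = {k: max(view.get(k, 0), nodes[peer].get(k, 0))
                      for k in set(view) | set(nodes[peer])}
            view.update(merged)                            # both sides converge on the
            nodes[peer].update(merged)                     # freshest state they've seen

    for _ in range(5):
        gossip_round()
    print(nodes)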

So how do we avoid the full system reboot problem?

A good start would be to use a highly tested and widely deployed infrastructure rather than building it yet again from scratch. XMPP is an interesting candidate for that very reason. I believe the really large-scale, low-latency middleware and hardware in use in the financial services markets would represent an even more battle-tested "gossip" infrastructure. Amazon, however, was using a homegrown system, and there will always be unexpected corner cases that get debugged in production when the system is a work in progress.

Gossip-as-a-Service anyone? And no, you can't build it on top of Amazon SQS.