The sorry state of server utilization and the post-hypervisor era (2013)

nickpsecurity · on May 17, 2015

The article is looking at it all wrong. To solve a problem, start by looking at those that already solved it. Then, see if you can apply that. Mainframes have long had ridiculously high utilization and throughput. Secret is their I/O architecture: computing happens on compute nodes and I/O is managed by I/O processors, both of which are well-integrated. If Intel etc copy this, they'll get much higher utilization and throughput. Smart embedded engineers do the same thing albeit with microcontrollers.

https://en.wikipedia.org/wiki/I/O_channel

blincoln · on May 17, 2015

It's 2015, not 1985. Most people are not paying IBM for every CPU cycle (used or not) on a mainframe. Should IT staff try to look "good" on a "CPU utilization" report that belongs in a history book by buying lower-end hardware, or should they spend a tiny amount of extra money to ensure that customers get good performance during peak periods?

bhousel · on May 17, 2015

These aren't really startling findings. Most apps in the enterprise require separate instances for development, staging, production, and a hot standby for continuation of business. And you need each of those environments for multiple tiers (db, app server, etc). And you need the entire stack replicated to each local datacenter because of latency (so the idea of having the APAC users use the database at night and the NAM users use it during the day just doesn't work in practice). So a typical business app can easily require >10 server instances, most of which will sit idle most of the time.

parasubvert · on May 17, 2015

It also reflects a very stubborn unwillingness to actually use virtualization, ie. To collect capacity optimization metrics and to let that drive the placement of VMs with appropriate over-provisioning.

Over-provisioning of RAM is dicey, and I/O-aware placement is still a black art, but CPU is a no brainier. I routinely find places that refuse to anything but 1:1 vCPU to physical core ratios, or even to enable VMware DRS/HA. Mainly because they bought virtualization for convenience but then didn't update their capacity and ITIL processes from the 90s where assets are pegged to a physical CPU for "regulatory" reasons and capacity is still fear-driven rather than data driven. Or, also common, vendors of packaged or platform software ... and bad Dev teams ... love to blame virtualization for performance problems, rather than actually analyze and fix the problem. So over provisioning becomes a political decision made by managers rather than a technical one made by the ops staff.

I also don't see many places just allowing "shutdown/archival" of Dev/test environments that are clearly not being used by metrics, or even to just have a process that tells the ops team to press a button when project funding ceases. It's obvious and simple but politically it is "risky" because some VP's pet project having resources reclaimed makes them feel weak or something.

Then I find occasional data centers running widely over provisioned and high (60%+) utilization, and life is fine, but for some reason these surveys never make it to those places. So the laggards never rally find out that's it's "ok" to stack VMs.

Now with container clusters like Mesos/marathon, Lattice/CF, and Kubernetes, we are going to see some interesting behavior. A lot of companies are very uncomfortable with the whole "you don't really know/care which physical machine gets a container instance, it is fair share schedules as a whole". It forces them again to admit their supporting processes are antiquated.

lsc · on May 17, 2015

"A post-hypervisor world "

lol. I've been predicting a backlash against this virtualization hype since 2005, and this is the first time I've heard anyone else mention anything like it.

Of course, if you had told me in 2005 that we would be switching from hypervisors back to containers, I would have broke down crying.

Is our industry run by masochists? or just the inexperienced, who don't know any better?

parasubvert · on May 17, 2015

This was written by a VC hoping that Docker is going to be worth more than VMware. I suspect he may be disappointed.

lsc · on May 17, 2015

A lot of the value of vmware is in the sales channels. Why use VMware rather than QEMU/KVM? it used to be that VMware came with support. But now that KVM is owned by RedHat, which in my experience, gives way better than average support? yeah.

but, yeah. Docker doesn't solve the "take this ancient rack of failing servers and consolidate them down to one server... without updating the software" use case that VMware is so often used for.

parasubvert · on May 17, 2015

Why use VMware? Is indeed the existential question that's facing them. For now, it's because many IT shops can't wrap their heads around the alternatives or justify the switch. Lack of skills (lots of Windoes centric shops), deep love for DRS/Vmotion/HA, deep support for fibre channel setups, etc. Btw this is arguably why VMware announced Photon recently, to go after RedHat. Eat at their Linux monopoly.

That said, VMware did basically invent x86 virtualization as we know it today, and that's justified the many billions in wealth it has generated to date. Docker is (so far) a registry and a CLI wrapper around a Linux kernel feature. It can and will be more, but it's not clear what.

otterley · on May 17, 2015

To be fair, containers aren't a virtualization solution; they're more of a packaging mechanism.

lsc · on May 17, 2015

that is... a healthy way to look at it.

My experience has been that using containers to go multi-tenant leads only to misery and pain.

But it does seem a reasonable-ish way to handle packaging, though I have less experience with that use case. It does seem like it would work, assuming you still have a way to update everything, and assuming everything is happy with the same kernel.

noblethrasher · on May 17, 2015

Please, expound.

justincormack · on May 17, 2015

If the issue is running out of memory before running out of CPU times, then containers wont help much, apart from to the extent that memory is overallocated with static amounts to vms. The solution is either larger memory systems, which are now much more widely available since this article was written, or using less memory for applications.

vimes656 · on May 17, 2015

Is hypervisor memory ballooning widespread in major cloud providers these days? How does it compare to bare-metal kernel memory allocation?

justincormack · on May 17, 2015

No it is not widespread. Underprovisioning is a bit of a dirty word too - it breaks isolation.

The Google Borg paper says they use non production batch jobs to eat the spare, so you can kill them if necessary. Cloud providers could offer this as a service in theory, although they are not really architected that way.

mark_l_watson · on May 17, 2015

Looking at this as an environmental problem makes some sense. I used to rent cheap hosted servers and moved to virtualized systems like AWS, Azure, and AppEngine partially because of environmental impact and partially out of convenience.

We need staging servers and redundant backups so getting really high utilization is not possible but I hope to see a lot of improvement.

The big companies seem to be doing things better. At Google, I had a bit of angst running 10k processor jobs but they do use solar, set up data centers near hydro electric sources, etc. Same as Amazon, Microsoft, etc.

PaulHoule · on May 17, 2015

Well, over provisioning is good for perceived performance. Let corporate it increase efficiency in the same ham handed way and you might find low latency needs a new advocate.

LamaOfRuin · on May 17, 2015

The idea that Google was industry leading on non-batch loads in 2013 seems wrong to me. They were not selling those services then, so they did not have a positive profit motive to optimize that usage (only a motivation to cut costs, which I'm told is not nearly as effective). Amazon has had that motivation (and necessity with their non-existent margins in every other part of their business) for long enough to actually accomplish something.

parasubvert · on May 17, 2015

at Google's scale, one doesn't need a lot of incentive to improve utilization. Every IT shop has wanted the cost reduction of improved utilization since the dawn of the PC era.

The difference is in process. Google's approach to workload placement is automated by software, driven by engineering decisions and data.

Many IT shops' placement is political (new servers = new capital = power).

LamaOfRuin · on May 17, 2015

At Google's scale you need more much incentive to get anything done. This is even more true when it is something that will touch every division, product, and service.

What every IT shop wants doesn't necessarily relate in any straightforward way to what any IT shop invests resources in getting. Every IT shop prioritizes many other things above utilization (and are right to do so).

All decisions, engineering or otherwise, are political. Different environments involve different politics, but it's all still politics.

parasubvert · on May 17, 2015

All decisions are political (ie. Power interests), but not all orginzations are configured to be primarily driven by power. This is especially true for young organizations, or those that have gone through a cycle of renewal.

Google decided early on to drive towards an operational architecture that allows individuals to act at scale on their infrastructure. A developer deploys into production, it launches thousands of new containers and disposes thousands of old containers. A batch job is run, same thing. Deploying services is uniform across the board. Thus, optimizing utilization through improved container scheduling is something that the core site reliability engineering team could do independently of individual services.

Google's early adoption of data center sized computing by Hozle & team was unique, along with Amazon's CEO-diktat move to decentralized service-oriented architecture, or Netflix's rewrite and move to cloud. Which is why you have articles like this, written by a VC, that want to repackage this thinking and sell it back to old school IT.

LamaOfRuin · on May 17, 2015

> Thus, optimizing utilization through improved container scheduling is something that the core site reliability engineering team could do independently of individual services.

But is that something it is known they prioritized, or was there perhaps more interest in optimizing the efficiency of deploying thousands of containers on every deploy, across data centers, with reliable testing, without killing in flight processing, and scaling for subsecond response to bursty demand? Who sets the priorities for what is most important, and how much of one they're willing to sacrifice to improve physical utilization?

I have absolutely no doubt they had as many resources as any other company dedicated to finely tuning their data centers and related infrastructure. I question whether they had the same motivation as a company like Amazon (who was deriving direct profit from selling this resource) to prioritize the optimization of utilization.

cm2187 · on May 17, 2015

But is average utilisation the right metric? The work day is only like 8-10 hours, I would expect many corporate infrastructures to be only active during that period. Plus you don't size your infrastructure to a typical workload, you size it to be able to accommodate higher than usual peak workload otherwise you will be down at the busiest period.

dsr_ · on May 17, 2015

Which of these hypothetical situations is more realistic?

CEO: "I see that we had 99.994% uptime for the last six months, and we came in very close to the forecasted budget. Well done, engineers!"

CEO: "I see that we had 99.9% efficient usage for the last six months, and we reduced our budget. Well done, engineers!"

Neither scenario is realistic, of course. Uptime is nice and efficiency is nice and budgets are nice, but what the CEO is actually interested in is:

VP Customer Support: "Our satisfaction rate is up, call quality metrics are great. I looked over the call stats and it looks like we're no longer getting complaints about performance or unreachability."

parasubvert · on May 17, 2015

I've worked with the executives of some large banks, telecoms, and transportation companies. The CEO and board generally only has held the IT team accountable to budgetary performance and risk (uptime, intrusion, regulatory) metrics. The only IT impact on customer sat is uptime, by the traditional view.

One bank IT group I know that reports on uptime to their business partners relative to operating expense, prints the charts and graphs on plotter paper weekly and posts them in the cafeteria. Most of their bonus is directly tied to those numbers. So, "cut costs and keep me up".

Delivery IT groups are very rarely measured by customer satisfaction, they're measured by project and budget performance to baseline (on time, on budget, etc). Customer sat is the responsibility of the business partners that drive the requirements, programs, etc.

This this effective? Not really. If they recognized Lean product development principles they'd incentivize everything by end-to-end cost of delay first, and risk reduction second.

otterley · on May 17, 2015

Unfortunately it's usually some layer of middle management that is charged with making the infrastructure capacity planning decisions, and their performance is often measured by different metrics than those that the CEO cares about. Diverging incentives lead to absurd outcomes.