
woah, wait.....

It "generally works if" you "rebuild docker hosts on a daily or more frequent basis."

Perhaps I'm misunderstanding, but needing to rebuild my prod env several times a day seems pretty "not ready for prime time" to me.

That's like when we'd say that Rails ran great in production in 2005, as long as you had a cron task to bounce FastCGI processes every hour or so.

So, can you elaborate on why rebuilding the containers is good advice?



The security exec at Pivotal, where I work, has been talking about "repaving" servers as a security tactic (along with rotating keys and repairing vulnerabilities).[0]

The theory runs that attackers need time to accrue and compound their incomplete positions into a successful compromise.

But if you keep patching continuously, attackers have fewer vulnerabilities to work with. If you keep rotating keys frequently, the keys they do capture become useless in short order. And if you rebuild the servers frequently, any system they've taken control of simply vanishes and they have to start from scratch.

I'm not completely sold on the difference between repair and repave, myself. And I expect that sophisticated attackers will begin to rely more on identifying local holes and quickly encoding those in automated tools so that they can re-establish their positions after a repaving happens.

But it raises the cost for casual attackers, which is still worthy.

[0] https://medium.com/built-to-adapt/the-three-r-s-of-enterpris...
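The "rotate" leg of that argument can be sketched concretely. A minimal, hypothetical example of the scheduling logic (the key names and the 30-day policy are assumptions, not anything from the article):

```python
from datetime import datetime, timedelta

# Hypothetical sketch: decide which credentials are due for rotation,
# given a maximum allowed key age. A real system would then mint new
# keys and revoke the old ones via whatever secret store is in use.
MAX_KEY_AGE = timedelta(days=30)

def keys_due_for_rotation(keys, now, max_age=MAX_KEY_AGE):
    """keys: iterable of (key_id, issued_at) pairs; returns ids past max_age."""
    return [key_id for key_id, issued_at in keys if now - issued_at >= max_age]

keys = [
    ("prod-api", datetime(2017, 4, 1)),   # ~2 months old -> rotate
    ("staging",  datetime(2017, 5, 20)),  # ~2 weeks old  -> keep
]
```

The point of running something like this continuously is exactly the one above: any captured key has a bounded useful lifetime.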


Having everything patched as soon as patches are available (or within, say, 6 hours of availability, for "routine" patches, with better responsiveness for critical patches) is a win.

The rest: not so much.

Rebuilding continuously for security is not something I would recommend.


> Rebuilding continuously for security is not something I would recommend.

So that I understand, could you elaborate?

Particularly, do you mean "not recommend" as in "recommend against" or "not worth the bother"?


It's not worth the bother. Apart from keeping patches up to date --- which is a good idea --- it's probably not really buying you anything.

It's not crazy to periodically rotate keys, but attackers don't acquire keys by, you know, stumbling over them on the street or picking them up when you've accidentally left them on the bar. They get them because you have a vulnerability --- usually in your own code or configuration. Rebuilding will regenerate those kinds of vulnerabilities. Attackers will reinfect in seconds.


A lot of companies do lose their keys that way. www roots, gists, hardcoded in products, github history, etc.

The win to rotating them is not so much because you'll be regularly evicting attackers you didn't know had your keys, but because when you do have a fire, you won't be finding out for the first time that you can't actually rotate them.

It also forces you to design things much more reliably which helps continuity in non-security scenarios.

After redeploying and realizing that Todd has to ssh in and hand edit that one hostname and fix a symlink that was supposed to be temporary so the new version of A can talk to B, that's going to get rolled in pretty quickly. Large operations not doing this tend to quickly end up in the "nobody is allowed to touch this pile of technical debt because we don't know how to re-create it anymore" problem.
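The fix for the Todd problem is to move that hand edit into a declarative config that every rebuild applies. A toy sketch of the idea (service names and config format are made up for illustration):

```python
# Hypothetical sketch: keep "that one hostname" in a source-of-truth mapping
# that the rebuild renders every time, instead of relying on someone
# ssh-ing in to hand-edit it after each deploy.
SERVICES = {
    "A": {"talks_to": "b.internal.example.com"},
}

def render_config(service_name, services=SERVICES):
    """Render the peer-hostname config for a service from the source of truth."""
    peer = services[service_name]["talks_to"]
    return f"upstream_host = {peer}\n"
```

Once the rebuild pipeline renders this on every run, there is nothing left for Todd to remember.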


It seems like it's good to be able to rebuild everything at a moment's notice after patching against a major exploit, though. You should have a fast way to rebuild secrets and servers after the next heartbleed-scale vulnerability.


Being able to rebuild critical infrastructure from source, and know that you'll be able to reliably deploy it, is a _huge_ win for security.

After a bunch of harrowing experiences with clients, I'm pretty close to believing "using packages for critical infrastructure is a bad idea".


Being able to rebuild critical infrastructure from source, and know that you'll be able to reliably deploy it, is a _huge_ win for security.

In that case, you might be interested in bosh: http://bosh.io/docs/problems.html (the tool that enables the workflow jacques_chester was describing). It embraces the idea of reliably building from source for the exact reasons you've mentioned.


I'm confused now, earlier you recommended patches over rebuilding continuously from source, but this seems like the opposite?


What does "packages" mean here? Sorry.


My guess is that "packages" is shorthand for "binary packages", as opposed to being able to redeploy from source.


Nod.


I'm guessing they meant to write "patches".


That was my hunch too. Thanks. I'll ask more about whether I missed something on the other side of the argument.


Depends. Usually you have to be able to rebuild your prod infra within minutes, or at most hours; otherwise you are doing devops wrong. The whole point of automation is reproducible infrastructure that you can stand up quickly. With a stateless approach you can just do this. Why would you do that? Imagine an outage in one of the 3 datacenters you are running your infra in within the same region. You need to move 1/3 of the capacity to the remaining 2 datacenters. This is not too different from rebuilding it.


Aaah, the classic "you're doing <it> wrong" argument. I can come up with dozens of different environments where it is simply not feasible to rebuild an environment within two hours.

- Any infrastructure with lots of data. Data just takes time to move; backups take time to restore.

- You're on bare metal because running node on VMs isn't fast enough.

- You're in a secure environment, where the plain old bureaucracy will get in the way of a full rebuild.

- Anytime you have to change DNS. That's going to take days to get everything failed over.

- Clients (or vendors) whitelist IPs, and you have to work through with them to fix the IPs.

- Amazon gives you the dreaded "we don't have capacity to start your requested instance; give us a few hours to spin up more capacity"

> Imagine an outage in one of the 3 datacenters you are running your infra in the same region. You need to move 1/3 of the capacity to the remaining 2 datacenters.

Oh, this is very different. If your provider loses a datacenter, and your existing infrastructure can't handle it, you're already SOL - the APIs for spinning up instances and networking are going to be DDoSed to death by all of the various users.

Basic HA dictates that you provision enough spare capacity that a DC (AZ) can go down and you can still serve all of your customers.


I mostly disagree with your points, with the exception of the last one.

I used to work in the team that runs Amazon.com. All of the systems serving the site can be rebuilt within hours, and nothing can serve the site that cannot be rebuilt within a very tight SLA. However, I understand that not all companies have this requirement. This capability only matters when site downtime hurts the company too much to be allowed.

Responding to your points:

- Lots of data -> use S3 with de-normalized data, or something similar

- Running a VM has 3% overhead in 2016, scalability is much more important than a single node performance

- High security environments are usually payment processing systems, where downtime is a bit more tolerable; delaying transactions is ok

- Amazon uses DNS for everything, even for datacenter moves. It is usually done within 5 minutes

- This is a networking challenge, using something like EIP (where the public facing IP can be attached to different nodes) makes this a non-issue

- Amazon has an SLA, they extremely rarely have a full region outage, so you can juggle capacity around

Losing a DC out of 3 does not require work because you can't handle the load; the work is about restoring the same properties (the same spare capacity, for example) as before. Spinning up instances should not DDoS anything; it places a constant load on the supporting infrastructure.

The last point I agree with.


First, two important assumptions I'm making when I say this (and I feel they are reasonable). I'm not just talking about bringing a production environment back up in the same or an adjacent AZ; I'm talking about true DR, where you're moving regions. I'm also not limiting my discussion to AWS's infrastructure - not with Google, Rackspace, Cloudflare and others in the space as well.

> Lots of data -> use S3 with de-normalized data, or something similar

S3's use case does not match up with many different computing models (hadoop clusters, database tables, state overflowing memory), and moving data within S3 between regions is painful. Also, not all cloud providers have S3.

> Running a VM has 3% overhead in 2016, scalability is much more important than a single node performance

Not when you have a requirement to respond to _all_ requests in under 50ms (such as with an ad broker).

> High security environments are usually payment processing systems

Or HIPAA, or government.

> delaying transactions is ok

Not really. When I worked for Amazon, they were still valuing one second of downtime at around $13k in lost sales. I can't imagine this has gone down.

> Amazon uses DNS for everything, even for datacenter moves. It is usually done within 5 minutes

Amazon also implements their own DNS servers, with some dynamic lookup logic; they are an outlier. Fighting against TTL across the world is a real problem for DR type scenarios.

> EIP (where the public facing IP can be attached to different nodes) makes this a non-issue

EIPs are not only AWS-specific, they cannot cross regions and they rely on AWS's API being up, which historically has not always been the case.

> they extremely rarely have a full region outage, so you can juggle capacity around

Not always. Sometimes, you can. But not always. Some good examples from the past - anytime EBS had issues in us-east-1, the AWS API would be unavailable. When an AZ in us-east-1 went down, the API was overwhelmed and unresponsive for hours afterwards.

> Spinning up instances should not DDoS anything; it places a constant load on the supporting infrastructure.

See above. There's nothing constant about the load when there is an AWS outage; everyone is scrambling to use the APIs to get their sites back up. There's even advice to not depend on ASGs for DR, for the very same reason.

AWS is constantly getting better about this, but they are not the only VPS provider, nor are they themselves immune to outages and downtime which requires DR plans.


"Any infrastructure with lots of data. Data just takes time to move; backups take time to restore."

Exactly. Don't put data in Docker. Files go in an object store, databases need to go somewhere else.


> Any infrastructure with lots of data.

OP's first point is 'don't put data in docker'. Docker is not for your data. But more to the point, if you're rebuilding your data store a couple of times every day, a couple of hours downtime isn't going to be feasible.

> You're on bare metal because running node on VMs isn't fast enough

In such a situation, you should be able to image bare metal faster than 2 hours. dd a base image, run a config manager over it, and you should be done. Small shops that rarely bring up new infra wouldn't need this, but anyone running 'bare metal at scale' should.

> bureaucracy

Isn't part of the infra rebuild per se.

> Anytime you have to change DNS. That's going to take days

Depends on your DNS timeouts, but this is config, not infra. Even if it is infra, 48-hour DNS entries aren't a best-practice anymore (and if you're on AWS, most things default to a 5 min timeout)

> Clients (or vendors) whitelist IPs, and you have to work through with them to fix the IPs

I'd file this under 'bureaucracy' - it's part of your config, not part of your prod infra (which the GP was talking about).

> Amazon gives you the dreaded...

Well, yes, but this is on the same order as "what if there's a power outage at the datacentre". Every single deploy plan out there has an unknown-length outage if the 'upstream' dependencies aren't working. "What if there's a hostage event at our NOC?" blah blah.

The point is that with upstream working as normal, you should be able to cover the common SPOFs and get your prod components up in a relatively short time.


> OP's first point is 'don't put data in docker'. Docker is not for your data.

I agree, but I (and the GP, from my reading) was not speaking about only Docker infrastructure.

> Isn't part of the infra rebuild per se.

I can see your point, and perhaps these points don't belong in a discussion purely about rebuilding instances. That said, I have a very hard time focusing just on the time it takes to rebuild capacity when discussing a DC going down; there are just too many other considerations that someone in Operations must weigh.

When I have my operations hat on, I consider a DC going down to be a disaster. Even if the company has followed my advice and the customers do not notice anything, we're now at a point where any other single failure will take the site down. It's imperative to get everything that went down with that DC back up, and it's going to take more than an hour or two.


I know you probably aren't trying to address all cases, but just because you can't re-build your prod infrastructure in minutes or hours doesn't mean you aren't doing devops right.

Many larger companies can't do this; my company has 70+ datacenters with tens of thousands of servers. We can't re-build our prod infra in minutes or hours. We are still doing devops right :D

Like I said, I know you aren't talking about my situation when you made your statement... I just get frustrated when people act like there are hard and fast rules for everyone.


Well, I am not talking about a 1M-node outage. That you cannot fix with anything. I am talking about at most a datacenter-wide outage, which actually happens pretty often. Amazon has game days and Netflix has chaos monkeys for the same reason. Make sure that you can rebuild parts of your infra pretty quickly.


As long as your VMs are prebaked and your cloud provider supports ASG-esque primitives (i.e. "I need X instances running at a time, instantiate with such-and-such metadata"), anyone can rebuild their prod infra quickly. You don't need Docker or containers to do that.
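The core of that primitive is just a reconcile loop. A toy sketch of the decision logic (not any real cloud API; a real controller would feed the counts into launch/terminate calls):

```python
# Hypothetical sketch of the ASG-style primitive: "I need X instances running".
# The reconciler compares desired vs. actual counts and returns how many
# instances to launch or terminate to converge on the desired state.
def reconcile(desired, running):
    """Return (to_launch, to_terminate) given desired and running counts."""
    if running < desired:
        return desired - running, 0
    return 0, running - desired
```

Run on a timer, this is enough to rebuild a fleet after an outage: the running count drops, and the loop launches replacements from the prebaked image.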


Some of us work for those cloud providers :)


I have no doubt ;) it's why I always strive for accuracy here!


Oh, of course. Our datacenters often go offline (both planned and unplanned), and we are always ready to handle that. We are pretty much constantly re-provisioning servers... with so many physical machines, hard drive and other hardware failures are a daily occurrence.


I'm going to have to call bullshit. Where do you work, or what company do you own?

I work at MindGeek, depending on the time of the year it would be fair to say we rank within the top 25 bandwidth users in the world. We are not even close to that number of servers, and we deal with some of the largest traffic in the world. What company is running in 70+ datacenters!? The world's largest VPN provider? A security company providing all the data to the NSA?

Maybe it is just my broad assumptions, but I would hope that the major big 10 that come to mind, such as Google, Amazon, Microsoft, etc., would be able to rebuild their production regions in hours.


Not the OP, but the first thing that comes to mind as to why you'd need a lot of datacenters is CDNs like Cloudflare or Akamai. Stuff like that, you need lots of servers, lots of storage, and low latency. You'd also need a good number of configurations, because things like video streaming require different server settings than, say, protecting a site that's being DDoSed.


Ding ding ding :D

I don't work for one of those two, but I do work for a very large CDN.


> I work at MindGeek, depending on the time of the year it would be fair to say we rank within the top 25 bandwidth users in the world.

"I'd have thought I'd have heard of them..."

One Wiki search later... yup. I've heard of them.


Ha! I went to their website... Hmm... never heard of them. Went to Wikipedia... Oh, now I know who they are! Yeah. Lots of bandwidth.


>What company is running in 70+ datacenters!?

People running edge networks, who therefore need servers local to everywhere in the world to keep latencies down. Maybe it's not as much of an issue for MindGeek (the parent company for a lot of video streaming sites). I would guess you guys need a lot of throughput, but latency isn't so much of a problem. Or you simply don't need to serve some parts of the world where it might be illegal to distribute some types of content.

FWIW, Cloudflare has 86 data centers: https://www.cloudflare.com/network-map/


Or they are customers of CDNs.


[Note: MindGeek = the pornhub network]

Short version: that's a video streaming website, which is rather simple yet bandwidth-intensive.

Outsourcing the caching and video delivery means MindGeek can get by with few servers and a few locations.

Nonetheless the CDN you're outsourcing to does need a lot of servers at many edge locations.

Actually, if we think in terms of "top bandwidth users in the world", it's possible that your company is far from being on the list. It's likely dominated by content delivery / ISP / other providers, most of which are unknown to the public.


Would you say Netflix & YouTube are rather simple? Handling 100+ million users is never rather simple...


YouTube had the challenge of being one of the first streaming services, and it's operating at an unprecedented scale. I am actually wondering how many orders of magnitude more traffic YouTube has than Pornhub.

I am in distributed systems and try to work exclusively on hard problems. So when I say "simple", that is biased toward the high end of the spectrum.

If you go to pornhub.com and look at "popular keywords", you'll only find thousands or tens of thousands of videos. In a way, there is not that much content on Pornhub.

All major websites have challenges. Pornhub is a single-purpose website, and a lot of the challenge is in video delivery, which can be outsourced to a CDN nowadays.

"simple" is maybe too strong a word. I am trying to convey the idea that it has limited scope and [some of] the problems it's facing are understood by now and have [decent] solutions.

That's not to say it's easy ;)


I work for a large CDN.


I think the point of the comment you're replying to is not really what you are replying to. Having the ability to rebuild quickly and frequently is great, and something you should aim for, but actually being forced to do it regularly whether you need it or not is pretty bad.

Edit: didn't notice someone else had already said this, opened this tab like 15 minutes ago


I see your point. The force that makes you do it should not come from Docker, I agree. :) We are just lucky that we can do it when it happens.


"have to be able" is very different from "have to"

Sure, you want to be able to deploy quickly. But if there's no reason to, then don't.

And I would be very scared if Docker images had a 1 day uptime max


Can != should


We see a couple of different bugs that are best solved by simply rebuilding the container host. To Docker's credit, these tend to decrease with higher versions.

We also see them mostly in non-prod environments, where we have greater container/image churn. We use AWS autoscaling and Fleet, so containers just get moved to other hosts when we terminate them. We have actually thought about scheduling a Logan's Run-style job that kills older hosts automatically - it's in the backlog.
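The selection logic for such a job could look something like this. A hypothetical sketch (the cutoff, the minimum fleet size, and the host names are assumptions; in practice the selected IDs would be handed to the cloud API to terminate, and the scheduler would replace them):

```python
from datetime import timedelta

# Hypothetical "Logan's Run" sketch: terminate hosts older than a cutoff,
# oldest first, but never shrink the fleet below a safe minimum per run.
MAX_HOST_AGE = timedelta(days=3)

def hosts_to_kill(hosts, now, max_age=MAX_HOST_AGE, min_fleet=2):
    """hosts: iterable of (host_id, launched_at); returns ids to terminate."""
    by_age = sorted(hosts, key=lambda h: h[1])  # oldest launch time first
    expired = [host_id for host_id, launched in by_age if now - launched >= max_age]
    spare = len(list(hosts)) - min_fleet        # how many we can afford to lose
    return expired[:max(spare, 0)]
```

Keeping the kill batch bounded by `min_fleet` is what makes this safe to run unattended.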


Bugs that can be solved by a rebuild are not restricted to Docker. We had an interesting week when the build was red all the time for various reasons, and then prod started failing. Usually we deploy once a day, and not deploying for a week caused several small memory leaks to turn into big ones.


> So, can you elaborate on why rebuilding the containers is good advice?

While I sincerely hope I'm wrong, I assume it's because you reset the clock on the probability something goes very wrong.


The "have you tried turning it off and on again" of DevOps. It makes a surprising amount of sense, though, as long as your service is truly stateless, the restart can be easily orchestrated, and it results in no difference in operational costs.
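"Easily orchestrated" mostly means restarting in batches small enough that capacity never dips too far. A minimal sketch of that batching, with assumed names:

```python
# Hypothetical sketch: roll through a fleet of stateless hosts, rebuilding
# at most max_unavailable of them at a time so serving capacity is preserved.
def restart_batches(hosts, max_unavailable=1):
    """Yield lists of hosts to rebuild together."""
    for i in range(0, len(hosts), max_unavailable):
        yield hosts[i:i + max_unavailable]
```

A real orchestrator would also wait for each batch to pass health checks before moving on, but the capacity math is just this slicing.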


If it's stateless, then why does rebuilding it change anything about the frequency of bugs popping up?


Ooh, I can answer this one: ask people if their container root is writable, and get amused at the blank stares you get back.

I am currently fighting an ongoing battle at work to point out that the plans for our Mesos cluster have not factored in that our first outage will be when someone fills up the 100 GB OS SSD, because no one's given any thought to where the ephemeral container data goes.
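Even before fixing where the data lives, the missing check is cheap to add. A hypothetical sketch (the 90% threshold is an assumption; on a real host you'd point this at Docker's data root, commonly /var/lib/docker, and fire an alert instead of returning a bool):

```python
import shutil

# Hypothetical sketch: warn before ephemeral container data fills the OS disk.
def disk_nearly_full(path, threshold=0.90):
    """Return True when the filesystem holding `path` is at or above threshold."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total >= threshold
```

Cron this against the container data directory and the "first outage" above becomes a page instead of a surprise.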


I am a layman to devops. By "ephemeral container data" do you mean temporary files created by the service, temporary files created by the OS / other applications, or something else?


If both the code and the infra it's running on are stateless, then yeah.


So we're replacing somewhat not fully stable VMs running on somewhat not fully stable virtualization infrastructure with theoretically stable containers running on violently unstable container infrastructure?


Pretty much, yes.


Mutability is the root of all computing evils


And the source of all computing value.


The host isn't the container itself. They want to re-provision the host likely not because of something wrong with the application, but because Docker is in some state which is non-recoverable, or at least not recoverable by automatic means.


Because then you know you can always rebuild automatically, and that's being tested constantly while developers work - a bit like how Netflix crashes everything randomly all the time to ensure they can always automatically recover from every dependency failure.

It also naturally rewards optimizing around time-to-redeploy, probably a lot of benefits there.



