Depends. Usually you have to be able to re-build your prod infra within minutes, or a few hours at most; otherwise you are doing devops wrong. The whole point of automation is reproducible infrastructure that you can stand up quickly. With a stateless approach you can just do this. Why would you do that? Imagine an outage in one of the 3 datacenters you run your infra in within the same region. You need to move 1/3 of the capacity to the remaining 2 datacenters. That is not very different from re-building it.
Aaah, the classic "you're doing <it> wrong" argument. I can come up with dozens of different environments where it is simply not feasible to rebuild an environment within two hours.
- Any infrastructure with lots of data. Data just takes time to move; backups take time to restore.
- You're on bare metal because running node on VMs isn't fast enough.
- You're in a secure environment, where the plain old bureaucracy will get in the way of a full rebuild.
- Anytime you have to change DNS. That's going to take days to get everything failed over.
- Clients (or vendors) whitelist IPs, and you have to work through with them to fix the IPs.
- Amazon gives you the dreaded "we don't have capacity to start your requested instance; give us a few hours to spin up more capacity"
> Imagine an outage in one of the 3 datacenters you are running your infra in the same region. You need to move 1/3 of the capacity to the remaining 2 datacenters.
Oh, this is very different. If your provider loses a datacenter, and your existing infrastructure can't handle it, you're already SOL - the APIs for spinning up instances and networking are going to be DDoSed to death by all of the various users.
Basic HA dictates that you provision enough spare capacity that a DC (AZ) can go down and you can still serve all of your customers.
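As a back-of-the-envelope illustration of that N+1 rule (assuming uniform load and interchangeable AZs, which is a simplification):

```python
import math

def per_az_provisioning(total_load: int, n_azs: int) -> int:
    """Capacity each AZ must carry so that losing any one AZ
    still leaves enough headroom to serve the full load (N+1)."""
    if n_azs < 2:
        raise ValueError("need at least two AZs for N+1 redundancy")
    return math.ceil(total_load / (n_azs - 1))

# 900 units of load across 3 AZs: provision 450 per AZ (1350 total),
# so the remaining two AZs can absorb the full 900 if one goes down.
print(per_az_provisioning(900, 3))  # 450
```

Note the headroom cost shrinks as you add AZs: with 2 AZs you pay 100% extra capacity, with 3 AZs only 50%.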
I mostly disagree with your points, with the exception of the last one.
I used to work on the team that runs Amazon.com. All of the systems serving the site can be re-built within hours, and nothing can serve the site that cannot be rebuilt within a very tight SLA. That said, I understand that not all companies have this requirement. This capability only matters when site downtime hurts the company so much that it cannot be allowed.
Responding to your points:
- Lots of data -> use S3 with de-normalized data, or something similar
- Running a VM has ~3% overhead in 2016; scalability is much more important than single-node performance
- High-security environments are usually payment processing systems, where downtime can be tolerated a bit more; delaying transactions is OK
- Amazon uses DNS for everything, even for datacenter moves. It is usually done within 5 minutes
- This is a networking challenge, using something like EIP (where the public facing IP can be attached to different nodes) makes this a non-issue
- Amazon has an SLA and extremely rarely has a full region outage, so you can juggle capacity around
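The 5-minute DNS figure above follows directly from the record TTL: once you flip a record, clients can keep resolving the old address for at most one TTL. A toy illustration (the 300-second value is just the common default mentioned elsewhere in this thread, not a universal setting):

```python
def worst_case_cutover_seconds(ttl_seconds: int, change_delay_seconds: int = 0) -> int:
    """Upper bound on how long clients may keep hitting the old
    address after a DNS change: propagation delay plus one TTL."""
    return change_delay_seconds + ttl_seconds

print(worst_case_cutover_seconds(300))        # 300 seconds (~5 minutes)
print(worst_case_cutover_seconds(48 * 3600))  # 172800 seconds (2 days with a 48h TTL)
```

This is also why long TTLs are the enemy of DR: the cutover bound is set when the record is *created*, long before the outage.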
Losing a DC out of 3 does not require work because you can't handle the load; the work is required to restore the same properties (the same spare capacity, for example) you had before. Spinning up instances should not DDoS anything; it puts a constant load on the supporting infrastructure.
First, two important assumptions I'm making when I say this (and I feel they are reasonable). I'm not just talking about bringing a production environment back up in the same or an adjacent AZ; I'm talking about true DR, where you're moving regions. I'm also not limiting my discussion to AWS' infrastructure - not with Google, Rackspace, Cloudflare and others in the space as well.
> Lots of data -> use S3 with de-normalized data, or something similar
S3's use case does not match up with many computing models (Hadoop clusters, database tables, state overflowing memory), and moving data within S3 between regions is painful. Also, not all cloud providers have an S3 equivalent.
> Running a VM has 3% overhead in 2016, scalability is much more important than a single node performance
Not when you have a requirement to respond to _all_ requests in under 50ms (such as with an ad broker).
> High security environments are usually payment processing systems
Or HIPAA, or government.
> delaying transactions is ok
Not really. When I worked for Amazon, they were still valuing one second of downtime at around $13k in lost sales. I can't imagine this has gone down.
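To put that figure in context (using the ~$13k/second number above, which may well be dated), even a short outage is staggering:

```python
COST_PER_SECOND = 13_000  # USD in lost sales per second of downtime, per the figure above

def downtime_cost(hours: float) -> int:
    """Rough lost-sales estimate for a given outage duration."""
    return int(hours * 3600 * COST_PER_SECOND)

print(downtime_cost(2))  # a "couple of hours" -> $93,600,000
```

At that rate, "delaying transactions is OK" stops being true very quickly.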
> Amazon uses DNS for everything, even for datacenter moves. It is usually done within 5 minutes
Amazon also implements their own DNS servers, with some dynamic lookup logic; they are an outlier. Fighting against TTL across the world is a real problem for DR type scenarios.
> EIP (where the public facing IP can be attached to different nodes) makes this a non-issue
EIPs are not only AWS-specific, they also cannot traverse regions, and they rely on AWS' API being up - which has not historically always been the case.
> they extremely rarely have a full region outage, so you can juggle capacity around
Not always. Sometimes you can, but not always. Some good examples from the past: anytime EBS had issues in us-east-1, the AWS API would be unavailable. When an AZ in us-east-1 went down, the API was overwhelmed and unresponsive for hours afterwards.
> Spinning up instances should not DDOS anything, it is with constant load on the supporting infrastructure.
See above. There's nothing constant about the load when there is an AWS outage; everyone is scrambling to use the APIs to get their sites back up. There's even advice not to depend on ASGs for DR, for that very reason.
AWS is constantly getting better about this, but they are not the only VPS provider, nor are they themselves immune to outages and downtime which requires DR plans.
OP's first point is 'don't put data in docker'. Docker is not for your data. But more to the point, if you're rebuilding your data store a couple of times every day, a couple of hours of downtime isn't going to be feasible.
> You're on bare metal because running node on VMs isn't fast enough
In such a situation, you should be able to image bare metal in well under 2 hours. dd a base image, run a config manager over it, and you should be done. Small shops that rarely bring up new infra wouldn't need this, but anyone running 'bare metal at scale' should.
> bureaucracy
Isn't part of the infra rebuild per se.
> Anytime you have to change DNS. That's going to take days
Depends on your DNS TTLs, but this is config, not infra. Even if it is infra, 48-hour DNS records aren't a best practice anymore (and if you're on AWS, most things default to a 5-minute TTL).
> Clients (or vendors) whitelist IPs, and you have to work through with them to fix the IPs
I'd file this under 'bureaucracy' - it's part of your config, not part of your prod infra (which the GP was talking about).
> Amazon gives you the dreaded...
Well, yes, but this is on the same order as "what if there's a power outage at the datacentre". Every single deploy plan out there has an unknown-length outage if the 'upstream' dependencies aren't working. "What if there's a hostage event at our NOC?" blah blah.
The point is that with upstream working as normal, you should be able to cover the common SPOFs and get your prod components up in a relatively short time.
> OP's first point is 'don't put data in docker'. Docker is not for your data.
I agree, but I (and the GP, from my reading) wasn't speaking only about Docker infrastructure.
> Isn't part of the infra rebuild per se.
I can see your point, and perhaps these points don't belong in a discussion purely about rebuilding instances. That said, I have a very hard time focusing just on the time it takes to rebuild capacity when discussing a DC going down; there are just too many other considerations that someone in Operations must weigh.
When I have my operations hat on, I consider a DC going down to be a disaster. Even if the company has followed my advice and the customers do not notice anything, we're now at a point where any other single failure will take the site down. It's imperative to get everything that went down with that DC back up, and that's going to take more than an hour or two.
I know you probably aren't trying to address all cases, but just because you can't re-build your prod infrastructure in minutes or hours doesn't mean you aren't doing devops right.
Many larger companies can't do this; my company has 70+ datacenters with tens of thousands of servers. We can't re-build our prod infra in minutes or hours. We are still doing devops right :D
Like I said, I know you aren't talking about my situation when you made your statement... I just get frustrated when people act like there are hard and fast rules for everyone.
Well, I am not talking about a 1M-node outage. That you cannot fix with anything. I am talking about at most a datacenter-wide outage, which actually happens pretty often. Amazon has game days and Netflix has Chaos Monkey for the same reason: make sure that you can rebuild parts of your infra pretty quickly.
As long as your VMs are prebaked and your cloud provider supports ASG-esque primitives (i.e. "I need X instances running at a time; instantiate them with such and such metadata"), anyone can rebuild their prod infra quickly. You don't need Docker or containers to do that.
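The core of such a primitive is just a reconcile loop: compare desired count to running count and close the gap. A minimal sketch (names are hypothetical, not any provider's API):

```python
def reconcile(desired: int, running: int) -> tuple[int, int]:
    """Return (to_launch, to_terminate) so that the running
    instance count converges on the desired count."""
    if running < desired:
        return desired - running, 0
    return 0, running - desired

# After an AZ failure drops a fleet from 9 instances to 6, with desired=9,
# the loop asks the provider to launch 3 replacements from the prebaked image:
print(reconcile(9, 6))  # (3, 0)
```

A real controller would run this periodically and feed the result to the provider's launch/terminate API, spreading replacements across healthy AZs; that part is provider-specific.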
Oh, of course. Our datacenters often go offline (both planned and unplanned), and we are always ready to handle that. We are pretty much constantly re-provisioning servers... with so many physical machines, hard drive and other hardware failures are a daily occurrence.
I'm going to have to call bullshit. Where do you work, or what company do you own?
I work at MindGeek, depending on the time of the year it would be fair to say we rank within the top 25 bandwidth users in the world. We are not even close to that amount of servers and we deal with some of the largest traffic in the world. What company is running in 70+ datacenters!? World's largest VPN provider? Security company providing all the data to the NSA?
Maybe it is just my broad assumption, but I would hope that the big 10 that come to mind, such as Google, Amazon, Microsoft, etc., would be able to rebuild their production regions in hours.
Not the OP, but the first thing that comes to mind as to why you'd need a lot of datacenters is CDNs like Cloudflare or Akamai. Stuff like that, you need lots of servers, lots of storage, and low latency. You'd also need a good number of configurations, because things like video streaming require different server settings than, say, protecting a site that's being DDoSed.
People running edge networks need servers local to everywhere in the world to keep latencies down. Maybe it's not as much of an issue for MindGeek (the parent company of a lot of video streaming sites). I would guess you guys need a lot of throughput, but latency isn't so much of a problem. Or you simply don't need to serve some parts of the world where it might be illegal to distribute some types of content.
Short version: that's a video streaming website, which is rather simple, yet bandwidth intensive.
Outsourcing the caching and video delivery means MindGeek can get by with few servers and a few locations.
Nonetheless the CDN you're outsourcing to does need a lot of servers at many edge locations.
Actually, if we think in terms of "top bandwidth users in the world", it's possible that your company is far from being on the list. It's likely dominated by content delivery / ISP / other providers, most of which are unknown to the public.
YouTube had the challenge of being one of the first streaming services, and it's operating at an unprecedented scale. I am actually wondering how many orders of magnitude more traffic YouTube has than Pornhub.
I am in distributed systems and try to work exclusively on hard problems. So when I say "simple", that is biased on the high end of the spectrum.
If you go to pornhub.com and look at "popular keywords", you'll only find thousands or tens of thousands of videos. In a way, there is not that much content on Pornhub.
All major websites have challenges. Pornhub is a single-purpose website, and a lot of the challenge is in video delivery, which can be outsourced to a CDN nowadays.
"simple" is maybe too strong a word. I am trying to convey the idea that it has limited scope and [some of] the problems it's facing are understood by now and have [decent] solutions.
I think the point of the comment you're replying to is not really what you are replying to. Having the ability to rebuild quickly and frequently is great, and something you should aim for, but actually being forced to do it regularly whether you need it or not is pretty bad.
Edit: didn't notice someone else had already said this, opened this tab like 15 minutes ago
I see your point. The force that makes you do it should not come from Docker, I agree. :) We are just lucky that we can do it when it happens.