I think it can be unfair to characterise single-zone failures as a failure to adequately deploy or architect.
There are many opportunities for failure even if only a single zone goes away; most (if not nearly all) database solutions elect leaders, for example, and "brown-outs" (partial rather than total failures) can lead to the leader keeping its leadership status, or at least mess with quorum.
Other situations exist where migration out of a zone leaves hardware unavailable for other people: the cloud is not magic, and if people's workloads automatically shift to the surrounding (unaffected) zones, that impacts everyone else's ability to do the same migration, because all the free hardware can get used up.
Honestly, I can think of dozens of examples where, even if you had built everything multi-zonal, you could still be down because of a single zone: some unknown subsystem turns out to be zonal (IAM?), or you use regionally available persistent disks and writes suddenly perform extremely badly because they can't sync to the unavailable datacenter.
I believe multi-zone resilience is less attainable than we would like it to be; there are many cases where you can commit no error but still be completely at the mercy of a single zone going away.
> I believe multi-zone resilience is less attainable than we would like it to be; there are many cases where you can commit no error but still be completely at the mercy of a single zone going away.
There are many understandable ways to accidentally have a single point of failure. But if your conclusion after the outage is that there was no mistake, you have made two of them, and the second is much less understandable.
If even their own operational telemetry were misrepresented, one might start to wonder about the accuracy of what they report to their customers.
So, I agree, but it's not like they really had a choice :)
It's well known that the AWS status page doesn't reflect the actual uptime. Any time a massive outage is reported here on HN, the AWS status page shows all green.
Any time this debate comes up there are two camps of thought: the AWS camp, where "measuring service degradation" is one of the most insanely complex problems that we lesser devs will never understand, and the StatusGator camp, a cheap service that seems perfectly able to tell you when one of the services you use is having trouble, including AWS.
Service should be nearly restored at this point. We apologize for the significant interruption to everyone's day. As most of you have figured out by now, the issue was a regional failure in GCP (central, which is our primary zone). We're going to be exploring what we can do to lessen the impact of a single-zone failure and, more importantly, reduce or remove any impact from the storage services being unavailable.
For folks reading this - and we'll solve this at a global scale - check your `SENTRY_DSN` setting, and if it's not using something like `oXXXXX.ingest.sentry.io`, you should consider rolling out the updated value (located in your project settings). Our master domain routing layer is a more complicated failure point, and moving to the ingest-based domains pushes you to our edge layer, which _should_ be extremely reliable and also gives you much lower global latency.
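To make that concrete, here is a minimal sketch using the Python SDK (your SDK of choice will differ, but the idea is the same); the org number, key, and project ID shown in the comments are placeholders, and the real DSN should be copied from your project settings:

```python
# Minimal sketch (Python SDK) of rolling out an ingest-based DSN.
# The org number, key, and project ID are placeholders, not real values:
# copy the exact DSN from your project settings.
import os

import sentry_sdk

# Older-style DSN routed through the main sentry.io domain:
#   https://<key>@sentry.io/<project-id>
# Ingest-based DSN routed through the edge layer:
#   https://<key>@oXXXXX.ingest.sentry.io/<project-id>
sentry_sdk.init(dsn=os.environ["SENTRY_DSN"])
```

The point of the rollout is just that the host part of the DSN changes; the SDK call itself stays the same.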
Sentry is a great product, and I love even more that you can self-host it; otherwise I would be one of the people denigrating everyone for running their entire operation on SaaS products.
As it stands, if their reliability leaves something to be desired: run your own.
These things happen, and they're more common the more complicated or huge your setup is.
Is there a good name for this theory that self-hosting something is more reliable? I mean, yeah, if you have an outage it only affects you and your customers, so I guess it's an improvement in that regard, but thinking you're better at running software than a SaaS vendor is hubris.
Yes. It’s called “understanding your constraints”.
The inverse is "appeal to a higher power" (the implication being that it's almost religion, since it's based on faith).
However: if you think that running large systems is hard, then it follows that it’s better to deploy small systems.
There are certainly times to outsource, and I'm being relatively glib, but this idea of rent vs. own is quite pervasive in the industry, and it leads to dangerous situations where entire companies grind to a halt because these trade-offs were talked down.
Not sure about the name, but there are reasons to expect more reliability. You're only handling your own traffic, not all the customers at once. You don't need to scale or distribute that far, and you can predict traffic and storage needs better. In the SaaS case someone does it for you, but for them it's a much bigger problem, and they care less about you (a single customer, id 37826) than you do.
> but thinking you're better at running software than a SaaS is hubris.
Disagreed on several points.
First off, if you're using something like Sentry, chances are that building and running software is your job so it should not be a problem. If self-hosting Sentry is a problem I would start doubting the skills of your tech team.
Second, running a service for your own use is very different to running a service that has to scale to the entire world. Your service won't need as many moving & distributed parts as the SaaS Sentry since it will only ever have to handle a fraction of the latter's traffic.
Finally, if you're in control, you can schedule risky maintenance operations for times when a potential accident (operator error, etc.) won't affect your business. You can't do that with a SaaS.
> First off, if you're using something like Sentry, chances are that building and running software is your job so it should not be a problem. If self-hosting Sentry is a problem I would start doubting the skills of your tech team
Sentry is a pretty complex piece of software with a lot of moving parts. They have good orchestration around it, either Helm charts for Kubernetes deployments or a giant bash script managing docker-compose for a single machine, and it just works. However, understanding either of those, and being able to debug them, aren't skills I expect your average developer to have; it's more for SRE/Infra/Platform folk. I don't expect the average developer to be capable of debugging Kubernetes on their own, either.
Of course, many of them absolutely can, but many more would prefer to leave that to people who do it for a living - the type of developer using Heroku and other PaaSes precisely to avoid getting their hands too deep into infrastructure.
> If self-hosting Sentry is a problem I would start doubting the skills of your
> tech team.
I can't think of any team I've worked at in the past year for whom reading the self-hosted Sentry docs and standing up an instance would be impossible.
However, my experience is by and large at early-stage startups, and so I also can't think of any instance where I'd want to have my team working on that when they could be working on adding value to our product. If I can pay Sentry to handle setup, scaling, maintenance, etc (and I do!), then that's worth it when weighed against the dollar and opportunity cost of having my team handle it.
That's not to mention maintenance or other issues.
Just so we're clear: I completely agree with that stance. My grandparent comment is quipping that the option is rarely on the table.
We as an industry build our businesses on proprietary solutions that are difficult (nearing on impossible) or extremely expensive to self-host.
I love that there are options for SaaS; I strongly dislike being locked into SaaS - and it feels like a lot of the companies I join are locked into some SaaS offering with no possible migration plan, despite dubious uptime, awful support, rising prices, and overall dissatisfaction with the product (Atlassian springs to mind immediately).
It's tech debt, of a certain type; and some tech debt is acceptable, especially in a startup.
Having an option such as Sentry's to self-host is basically the best you're going to get. GitLab is the same: pay them to host, or host it yourself; the product is the same, and the only "debt" incurred is the setup and migration, which is much lower than, say, Jira -> YouTrack.
It's probably not going to be more reliable, especially if you are trying to host an app with tens of services talking to each other. Scale and load might make a difference sometimes - I'm thinking of GitHub's recent issues here. But I wouldn't want to self-host GitHub/GitLab; that would probably become a full-time job :) If we decided GitHub was not reliable enough, I'd just make sure we weren't dependent on it for normal work, and maybe run my own git server and mirror it to GH.
I am, however, a proponent of self-hosting open-source things you could get as a managed service: databases, Elasticsearch, Redis, etc., provided you have the expertise to do that and benefit from it.
For example, I used to run a Redis setup that would fail over in <5s on crashes and with basically zero downtime for upgrades. Now we use ElastiCache, and an upgrade causes several minutes of downtime. I have never had Redis itself crash, but I have had ElastiCache and RDS instances disappear. Sometimes failover works, sometimes not, and you have little to no way to find out what is going on.
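The comment doesn't say how that setup was built; one common way to get that kind of fast, transparent failover is Redis Sentinel, roughly like the sketch below. The hostnames, ports, and the `mymaster` service name are placeholders for your own Sentinel deployment:

```python
# Rough sketch of client-side failover with Redis Sentinel (redis-py).
# Hostnames, ports, and the "mymaster" service name are placeholders.
from redis.sentinel import Sentinel

# Point the client at the Sentinel processes, not at Redis directly.
sentinel = Sentinel(
    [("sentinel-1", 26379), ("sentinel-2", 26379), ("sentinel-3", 26379)],
    socket_timeout=0.5,
)

# master_for() always resolves to whichever node Sentinel currently
# considers the master, so a failover just means reconnecting.
master = sentinel.master_for("mymaster", socket_timeout=0.5)
master.set("healthcheck", "ok")

# Reads can be spread across replicas.
replica = sentinel.slave_for("mymaster", socket_timeout=0.5)
print(replica.get("healthcheck"))
```

The key point is that clients ask Sentinel who the current master is instead of hard-coding a host, so a failover amounts to a reconnect rather than a config change.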
I am using it, but it is not fun to use. It spawns 28 Docker containers (yes, that's the actual number), including things like nginx, postgres, memcached, redis, kafka, zookeeper, clickhouse. I've tried upgrading it once and it failed miserably, after which I simply started from scratch.
I would like to use their hosted version, or a more stable paid on-prem version. Pricing is not the issue for me, but I don't want to introduce another third-party service and have to include that in compliance documentation for our customers. That is the reason I still self-host many things to be honest.
Well, the bare minimum you have to tell your customers is your cloud/hosting provider; then you sometimes have a separate provider for object storage (S3 or Backblaze), maybe another for email delivery (Mailgun). You might have an invoicing/CRM portal (Stripe), SCM (GitHub), chat (Slack), your own email setup (Gmail). And that's before any client-side tracking, things like Google Analytics, etc. So it really makes sense to cut down the number of third parties, especially if there is any scenario in which they would be handling PII, and especially if you're selling to customers in heavily regulated industries.
Their self-hosted version works well, with the caveat that the upgrade process can be a pain and can be buggy. I used their docker-compose open-source version, which involved downtime to spin down the old stack and spin up the new one. Considering that there's no enterprise support for on-prem, it's nice that it works so well. If you're using their APM/observability you will need to keep an eye on disk space and memory/CPU after turning it on; we ran into resource issues due to the amount of data we started sending. Otherwise it's rock solid and cost-efficient: we were paying 500-ish bucks a month for infra plus 1-2 days of dev time, for something sentry.io quoted $50K/month for. I imagine the Kubernetes installer is much less hands-on in terms of upgrades. Would use again without any hesitation.
We used to run it self hosted, but the architecture is rather complex, so we ended up switching to their SaaS offering. I tried advocating [0] for a leaner architecture for simpler setups + to run integration tests against... but it wasn't met with much enthusiasm.
I imagine 99% of installations (and certainly CI pipelines) would be fine with their wsgi webserver and sqlite instead of Clickhouse, Relay, memcached, nginx, Postgres, Kafka and a host of other services. I wanted to take a stab at this myself, but given the complexity of the system, uncertainty of being able to merge it back, and, last but not least, their license, I decided against it.
This has been the experience at my company as well. The previous version of Sentry Server was okay-ish to self-host, but the newest version requires setting up several more services that our infra team was unwilling to set up and maintain, as they differ from the tech stack our devs use in the company. We ended up with the SaaS too.
I've been running their self-hosted version, and it's kind of OK.
Nowhere near as nice as, say, GitLab's Ubuntu packages which you can simply apt-get install, unattended automatic (security) updates are available and work well, and it's been mostly a set-up-and-forget type of thing. For GitLab, backups and most ops procedures are well documented.
Sentry, on the other hand, requires you to run a script to upgrade with their docker-compose thingies. You must remember to upgrade.
Self-hosted Sentry runs on docker, which on Ubuntu is a bit of a security shitshow what with docker side-stepping UFW and all that.
Recently Sentry tripled (it seems) the number of containers it spins up, and it feels much heavier than a couple of years ago. You can still get away with 8 GB of RAM, so it's not really that bad, to be honest.
Backing up Sentry was simple enough. However, a lot of questions are answered in the style "if you want to know more, buy the enterprise version". But, I've backed up and restored a Sentry system and it worked out OK, so I guess we're fine. Unless they change something, and then we're hosed without realizing it.
So, Sentry's self-hosted version is OK, but clearly not their number one priority. I have no idea what Sentry would look like in a self-hosted distributed/HA setup. Our data protection policies would absolutely prohibit us from using their cloud services, so maybe I'll have to find out, but I'm not looking forward to that.
I've been running it for almost a year, using docker-compose. It seems rather unstable for me. Every week or so, Sentry will just stop handling incoming events, and the request queues just keep growing. And if you try to upload symbols in this state, it will pause forever when trying to process them. So I've got a script[0] I can run that will unbreak it, but I don't fully understand what it actually does (I cobbled that script together based on a GitHub issue[1] that described the same problem). The Sentry architecture is complicated and not trivial to debug.
When it works, it's pretty good. I'm using it with sentry-native on the application side, which uses Crashpad to capture stack traces of native binaries (x86/ARM, Windows/Mac/Linux, whatever). It often doesn't deduplicate events properly, and the stack trace quality varies dramatically by platform. Sometimes the stack traces it provides are total nonsense, but it does allow downloading the minidump files, so I can dig into them in Visual Studio and see what's really going on. I have discovered and solved many real bugs using it, so I've put up with the frustrating stability issues.
I do; it works great with their docker-compose install - we're processing a few billion events per day here. Installation and subsequent minor/major upgrades have gone without a flaw so far, and they have extensive documentation for the self-hosted version, which is worth noting.
Upgrading is a little bit tricky: in my (limited) experience the upgrade procedure for the self-hosted version sometimes fails and requires taking the application offline to perform the upgrade (but it is very possible that I'm doing something wrong).
It works, if you can live with 28 containers (cluttering the `docker container ls` output so much that I always have to `grep -v sentry`) literally eating 2 GB+ of memory just to have it running.
Ideally, things run as a single binary with bring-your-own-database but that's not the case here.
I wish I could use my already-installed ClickHouse to save some memory, but the version constraints and the very complicated docker-compose setup on their end made it not worth trying.
But I still love the product itself, so I'll let it chew my memory.
Upgrades have been fine: as the manual says, you pull their git repo and run the installer, and it has worked without issues on all of my past several attempts.
I run it on k8s for a medium-sized company, as running via compose didn't seem to scale well for us. I even upgraded from 9.11, which ran a handful of containers, to the most recent version, which runs 15+.
You get used to the quirks of running it and then it just works, but it did take some time polishing the health checks etc to get it HA and scalable.
None of the helm charts I tried actually worked or provided HA.
It really does run a ton of different services under the hood. It's designed for their kind of scale; a small-scale deployment would probably work fine without ZooKeeper and the like.
We run the on-prem version, but looking to transition to SaaS soon. While one can point at the current outage and say "good thing we run it ourselves!" it is a chore to keep updated and running smoothly.
That said, it runs untouched for long stretches of time no problem.
The SaaS offering came later. I used to run it on-prem a few years ago before the SaaS showed up and it was very low maintenance. More of a fire and forget situation once it was set up.
Not just PII, but access tokens to 3rd party systems in case of integrations and whatnot.
A Sentry data leak could include valid access tokens to, say, all data on a customer's Docusign account. We try to make sure data like that is scrubbed before sending to Sentry… but mistakes can happen. This is simply not a risk we are willing to take.
Thus, we self-host Sentry in our own secure infrastructure (we are a SaaS provider ourselves), and accept all the maintenance burden that it entails.
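For what it's worth, the kind of scrubbing described above can also be enforced client-side before events leave your infrastructure. Here is a minimal sketch using the Python SDK's `before_send` hook; the key names are only examples, not an exhaustive list, and the DSN is a placeholder:

```python
import sentry_sdk

# Example sensitive keys to redact; extend for your own integrations.
SENSITIVE_KEYS = {"authorization", "access_token", "api_key", "cookie"}

def scrub_event(event, hint):
    """Redact sensitive request headers before the event leaves our infra."""
    headers = (event.get("request") or {}).get("headers") or {}
    for key in list(headers):
        if key.lower() in SENSITIVE_KEYS:
            headers[key] = "[Filtered]"
    return event  # returning None would drop the event entirely

sentry_sdk.init(
    dsn="https://publicKey@oXXXXX.ingest.sentry.io/0",  # placeholder DSN
    before_send=scrub_event,
    send_default_pii=False,
)
```

A `before_send` hook is a belt-and-braces measure, not a replacement for keeping tokens out of breadcrumbs and log messages in the first place.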
I looked into it at one point and found it required a lot of underlying services and seemed really complex. Primarily I was worried about the maintenance costs moving forward with something like that.
There is an open-sourced option called GlitchTip (https://glitchtip.com/) that is much simpler to self-host. I believe they forked the Sentry repo before the license change. It's not quite as feature rich but is a pretty great alternative.
I wake up to discover that my site https://remotehunt.com is super slow. First thing I do is visit HN to see what's up and I instantly see that Sentry is down.
I'm using Sentry to monitor logs and it now makes sense. Ok, so I remove Sentry from Laravel's error handler but nothing changes. And it's weird because sometimes it works, sometimes not. I tweak some things on Cloudflare (turning on Under Attack mode etc).
I still have this feeling that maybe it's related to Sentry: I'm using it and HN says it's down. So I go and remove the composer package as well, just in case. And it worked.
It took like an hour... I didn't use Sentry anyways :)
What did I learn: do not depend on external services if you don't really need them.
A few years ago I was showing someone how many calls a request to Spotify makes and I noticed they were getting 429s from Sentry. An amusing little insight into what is surely a thorn in some Spotify engineer's side.
#HugOps from an ex-Opsgenie SRE. I know the pressure you're under as mission-critical software that people depend on for their alerts, exceptions, etc. I left more than 6 months ago and am still recovering from the toll of that responsibility.
You have a great history of availability, and I'd like to say congrats.
I just dealt with this in my Laravel app, and it was preventing the application from working. I was able to resolve it by redeploying the app with `SENTRY_LARAVEL_DSN=null` in the .env file.
I’ve used them for a number of years, never so much as a blip. Looks like they’re doing a good job managing this outage. They’re also super transparent with their bugs in their change logs. I recommend and love Sentry.
It's somewhat ironic that a monitoring service has shit the bed. For some reason we're using Webpack at work and, for some reason, we are also using the Sentry Webpack plugin, which is causing our whole CI to fail.
It’s likely that the failure is around Sentry’s “signal a release and upload your source code out of band to us” feature.
In theory things shouldn't break, but if your CI process involves signaling a third party, it can fail because of that (hopefully you can temporarily disable it or something of the sort).
Yes, the CI should fail; otherwise you would be deploying something that can't match your errors to your source code (via the sourcemaps, which are presumably what gets uploaded to Sentry on the CI run).
Bear in mind that I don't know Sentry or webpack et al., but in my mind I'd like the option to continue my build process with a big, red warning.
I'd prefer 98% accurate error reporting over 100% accurate reporting that I can't push to production.
Or maybe I'm not understanding well the value Sentry offers, of course.
This is a choice, usually hidden until you stumble upon it: is Sentry being available (really, the ability to upload sourcemaps) important enough to fail a CI run, or can you just fire off a warning somewhere?
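If you want that "warning instead of a hard failure" behaviour, one way to get it - assuming your pipeline shells out to `sentry-cli` rather than relying on the webpack plugin to do the upload - is to wrap the upload step so a failure is logged loudly but doesn't fail the job. A rough sketch; the release name and paths are placeholders:

```python
import subprocess
import sys

RELEASE = "my-app@1.2.3"  # placeholder release name
DIST_DIR = "./dist"       # placeholder path to built assets and sourcemaps

try:
    # Upload sourcemaps for this release; any failure is caught below.
    subprocess.run(
        ["sentry-cli", "releases", "files", RELEASE,
         "upload-sourcemaps", DIST_DIR],
        check=True,
        timeout=120,
    )
except (subprocess.CalledProcessError,
        subprocess.TimeoutExpired,
        FileNotFoundError) as exc:
    # Big red warning, but the build stays green.
    print(f"WARNING: sourcemap upload to Sentry failed: {exc}", file=sys.stderr)
```

The trade-off from the earlier reply still applies: if the upload fails, errors from that release won't be matched back to your source until the sourcemaps are re-uploaded.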