Availability is another huge selling point of the cloud. Many companies use multi-cloud setups spanning multiple regions to prevent a single failure from taking the whole system down (as can happen with a single data center). “Cloud just makes it easier and less expensive to have redundancy to accomplish 99%+ SLAs,” claims one LinkedIn user.
However, availability ranked as only a medium-sized challenge when moving on-prem. That might simply reflect its weight relative to the other challenges, though. As product CTO Jens Günther remarks:
“Go for the cloud until your cloud comfort costs may well pay for a solid network ops team (3 network/server/storage people, 3 1st level NOC monitors, 3 Kubernetes pros, project manager, DevOps coordinator). Only then might you want to find a really solid DC operator to make sure you have a first-class DC setup.
And then you have only one DC running. You need to interconnect them, you still might need a CDN for static content (I doubt that most of us need true edge computing)” - Jens Günther on the alphalist CTO Slack
How to get high availability using on-prem infrastructure?
In addition to making sure the hardware you use is of excellent quality and well-maintained, you can station your data centers across regions connected by a strong network. Lamontcg on HackerNews claims that you can achieve most of your DC redundancy goals with DCs separated by 100 miles or less, while enjoying much lower latency, higher bandwidth, and lower costs. Lamontcg also suggests placing them on different geographic flood plains and different power companies/grids. This isn’t perfect, but it might be a good tradeoff in terms of risk.
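Why does 100 miles or less help so much with latency? A rough back-of-the-envelope sketch makes the point: light in optical fibre travels at roughly 200,000 km/s (about two thirds of c), so propagation delay scales directly with distance. The numbers below are a best-case estimate that ignores routing and switching overhead, not measurements from any specific setup.

```python
# Rough back-of-the-envelope estimate of inter-DC round-trip latency.
# Assumes signals travel at ~200,000 km/s in optical fibre (about 2/3 of c)
# and ignores routing, switching, and serialisation overhead.

FIBRE_SPEED_KM_PER_MS = 200.0  # ~200,000 km/s expressed in km per millisecond

def round_trip_ms(distance_km: float) -> float:
    """Best-case round-trip propagation delay between two sites."""
    return 2 * distance_km / FIBRE_SPEED_KM_PER_MS

# Two DCs ~160 km (100 miles) apart: under 2 ms round trip, so synchronous
# replication stays practical. At cross-continent distances it does not.
print(round_trip_ms(160))   # ~1.6 ms
print(round_trip_ms(4000))  # ~40 ms
```

This is why nearby DC pairs can replicate synchronously while far-apart ones are usually forced into asynchronous replication.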
How did Basecamp get high availability using on-prem?
Wlll, a former employee at Basecamp, wrote on HackerNews that when he was there, Basecamp had between 10 and 30 racks at its primary location, another data centre in a second region (Virginia) with a replica of what was needed to run Basecamp, and a third location (New York) with a half-rack for data replication. There was 10G fibre (rented by wavelength, not the actual fibre) connecting each location. This setup meant: “We could lose one DC and remain RW for our block data, we could lose 2 DCs and we'd have to drop down to RO. Block data were things like uploads, so DBs, search, etc. wouldn't have been affected. We could lose one of the main DCs and still be RW for everything.”
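The availability rule Wlll describes for block data can be sketched as a simple function of how many data centres are reachable: read-write with at most one DC lost, read-only with two lost. This is an illustrative sketch inferred from the quote, with made-up site names; it is not Basecamp's actual code.

```python
# Illustrative sketch of the block-data availability rules Wlll describes:
# RW with at most one DC down, RO with two down. Site names are invented
# for the example; the rule is inferred from the quote, not Basecamp's code.

ALL_DCS = {"primary", "virginia", "new_york"}

def block_storage_mode(available_dcs: set[str]) -> str:
    lost = len(ALL_DCS - available_dcs)
    if lost <= 1:
        return "read-write"  # enough replicas reachable to accept new writes
    if lost == 2:
        return "read-only"   # serve existing block data, accept no new uploads
    return "offline"

print(block_storage_mode({"primary", "virginia", "new_york"}))  # read-write
print(block_storage_mode({"primary"}))                          # read-only
```

Degrading to read-only rather than failing outright is the key design choice here: existing uploads stay served even when only one site survives.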
Why did Basecamp run both hot even though New York and Virginia aren’t that far apart? “Because it played into the rest of our plan, which was DC failover. With some pretty epic voodoo (Juniper/F5/OpenResty etc.) we could fail over the datacentres, swapping the RO and RW locations. We could also do this if one of the locations was unavailable. We could do this in 4 seconds /without losing a single in-flight request/ (we tested it),” says Wlll.
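The core of that failover is an atomic role swap: at no point may both sites be read-write (split brain) or both read-only. The real version steered traffic with Juniper/F5/OpenResty; the sketch below only models the bookkeeping of the swap, with invented names, and is not Basecamp's implementation.

```python
# Minimal sketch of the role-swap bookkeeping behind a DC failover: exchange
# which site is RW and which is RO atomically. Names are invented for the
# example; the real system steered traffic at the network/proxy layer.

import threading

class FailoverController:
    def __init__(self, rw_site: str, ro_site: str):
        self._lock = threading.Lock()
        self.rw_site = rw_site
        self.ro_site = ro_site

    def swap(self) -> None:
        # Holding a lock during the swap means no reader can observe a state
        # where both sites are RW (split brain) or both are RO.
        with self._lock:
            self.rw_site, self.ro_site = self.ro_site, self.rw_site

ctl = FailoverController(rw_site="virginia", ro_site="new_york")
ctl.swap()
print(ctl.rw_site)  # new_york
```

Achieving this in 4 seconds without dropping in-flight requests additionally requires draining or re-routing connections at the load-balancer layer, which is where the “epic voodoo” came in.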