I hate Clouds - a personal perspective on why I think Clouds suck
(loudwhisper.me)
from loudwhisper@infosec.pub to technology@lemmy.world on 07 Jul 2024 13:49
https://infosec.pub/post/14613558
from loudwhisper@infosec.pub to technology@lemmy.world on 07 Jul 2024 13:49
https://infosec.pub/post/14613558
I hope this won’t be counted as some form of self-promotion, even though I am sharing a post from my own blog.
As a tech worker who works in a Cloud shop, I wanted to elaborate the many reasons why I find working with Clouds terrible, from multiple points of view.
I tried to organize my thoughts in a (relatively long) post, in which both technical aspects and political aspects (which are very related) are covered.
I am sure many people will have different perspectives, and this could be potentially also a nice prompt for a discussion.
threaded - newest
<img alt="" src="https://ani.social/pictrs/image/eadf3a84-be97-45b6-b199-29a847f7d482.webp">
How could I miss the opportunity to use this picture!
It definitely felt like that at times.
Haha, I couldn’t resist. Great post though.
I was really hoping this was going to be a rant about clouds in the sky.
As someone who burns with the faintest touch of stellar light, I love clouds in the sky.
<img alt="" src="https://lemmy.world/pictrs/image/3eca2cf4-07e9-44ad-856d-fe0f18f5d96d.jpeg">
Haha that’s great. Took me a second to notice.
I’m sorry, but this started like a recipe article and I lost all interest. I don’t care about your life story, I clicked the link to read your opinions, and you spent the first several paragraphs avoiding them.
Nothing to be sorry for. I didn’t write for you nor for any particular individual, and it’s fair if you are not interested in it. I also added a table of content at the beginning, so you can jump directly to the relevant section (Technical Side) skipping the (in my opinion needed) introduction completely, if you wish. Cheers
Two brief paragraphs of light nonsense on a blog post, then a quick summary of what the article will cover?
Tell me you don’t read often without telling me you don’t read often:
Yes, I hate cloud too. Now tell this to my company, which received about 100k dollar credits from Azure and Google Cloud :)
What do you mean by “promotion”? A discount? Credits to get started?
Yeah credits makes more sense 👍
Oh yeah, I know that that’s a thing. It’s a practice not too different from the stereotypical drug dealer who gets you hooked on free drugs. In this case the idea is that if you start there, you get vendor-locked and you will have to pay that amount many times over. I understand the appeal from the company perspective, though.
Yes absolutely true. For example, GKE looks very nice, but when we use one of their features, it creates the need to use other features too. That’s why I warned the boss a lot. Even though they have great features, we try to use generic applications to avoid hooks.
I hope they don’t take the credit back :)
So the whole thing is well worth a read IMO, and addresses a lot of the issues I have with cloud as the solution for everything.
There are absolutely companies who need the scaling. But it’s a fucking lot of overhead if you don’t.
This is similar to all the LLM code stuff. If you don’t actually fully understand what your code does, bad stuff happens.
I was going to bring up basically this point:
But you covered it pretty well. Abstractions are great. Proprietary abstractions that are more focused on how they can bill you than real, useful, functional categories? Not so much.
So there’s a level of rent seeking behind all the software moving to subscriptions, and them wanting to lock you in just like their service providers are doing to them. But I have to think the massive costs of cloud junk also pay a role in stuff like a calendar charging double digit annual fees for something that takes very little storage and very little computation (and you of course can’t just buy software any more).
I have no words for multi-cloud. Even like a Facebook or YouTube scale site, are you really going to double (or more for some reason?) your storage costs (plus whatever intercommunication between the two), just in case the provider goes down for a couple hours (which is extremely rare, and you won’t be the only site impacted, so people won’t really blame you for.) Plus that architecture sounds like
Thanks!
Absolutely agree. I did not even think about this aspect, but I think you are absolutely spot on. Building something with huge costs is something that ultimately gets passed to the users in addition to the rent-seeking aspect.
You and me both. I have to work with it and the reality is, there is nobody who actually understands the whole thing. The level of complexity (and fragility, I might add) of it all is astonishing. And all of this to mitigate some (honestly) low risk of downtime from the cloud provider. I have lobbied a little bit against at work, but ultimately it has become a marketing tool to sell to customers, so goodbye any hope of rational evaluation…
It’s all shits and giggles until a network config takes down your cloud provider for 11 hours and you can’t even look at the logs. And multicloud is quite robust if done right, more so than a single cloud, if your setup is fragile someone is not doing their job right.
Complexity brings fragility. It’s not about doing the job right, is that “right” means having to deal with a level of complexity, a so high number of moving parts and configuration options, that the bar is set very high.
Also, I would argue that a large number of organizations don’t actually need the resilience that they pay a very high price for.
Complexity in this case should bring redundancy, not fragility. You are adding components in parallel, not in series, thus reducing fragility.
A raid 5 is more complex than a single drive, but it’s less fragile.
I wish it worked like that, but I donct think it does. Connecting clouds means introducing many complex problems. Data synchronization and avoiding split-brain scenarios, a network setup way more complex, stateful storage that needs to take into account all the quirks and peculiarities of all services across all clouds, service accounts and permissions that need to be granted and segregated for all of them, and way more. You may gain resilience in some areas, but you introduce a lot more things that can fail, be misconfigured or compromised.
Plus, a complex setup makes it harder by definition to identify SPOFs, especially considering it’s very likely nobody in the workforce is going to be an expert in all the clouds in use.
To keep using your simile of the disks, a single disk with a backup might be a better solution for many people, considering you otherwise might need a RAID controller that can fail and all the knowledge to handle and manage a RAID array properly, in addition to paying 4 or 5 times the storage. Obviously this is just to make a point, I don’t actually think that RAID 5 vs JBOD introduces comparable complexity compared to what multi-cloud architecture does to single-cloud.
Split brain are easily solved, there’s of the shelf solutions and if you have some custom code you can use plenty of well researched solutions, for instance raft. Putting bizantine fault in Google scholar yields thousands of papers,if you want something fancier.
Same for most problems you mentioned, they were an issue 10 years ago, nowadays you can federate, abstract or outsource most of it.
Making it harder to identify SPFOs doesn’t increase fragility. If you whole system a single instance it’s trivial to identify (the whole thing) but very brittle.
Of course the problem is solved, but that doesn’t mean that the solution is easy. Also, distributed protocols still need to work on top of a complicated network and with real-life constraints in terms of performances (to list a few). A bug, misconfiguration, oversight and you have a problem.
Just to make an example, I remember a Kafka cluster with 5 replicas completely shitting its pants for 6h to rebalance data during a planned maintenance where one node was brought offline. It caused one of the longest outages to date with the websites which relied on it offline. Was it our fault? Was it a misconfiguration? A bug? It doesn’t matter, it’s a complex system which was implemented and probably something was missed.
Technology is implemented by people, complexity increased the chances of mistakes, not sure this can be argued.
Making it harder to identify SPOF means you might miss your SPOF, and that means having liabilities, and having anyway scenarios where your system can crash, in addition for paying quite a lot to build a resilience that you don’t achieve.
A single instance with 2 failure scenarios (disk failure and network failure) - to make an example - is not more fragile than a distributed system with 20 failure scenarios. Failure scenarios and SPOF can have compensating controls and be mitigated successfully. A complex system where these can’t be fully identified can’t have compensating control and residual risk might be much harder. So yes, a single disk can fail more likely than 3 disks at once, but this doesn’t give the whole picture.
The only problem is that the single instance also has 20 scenarios (and keeps the 2 as well), making it more brittle.
A well design system removes points of failure, disk, power and network are obvious ones, and as long as you keep it byzantine safe, anything you added should be redundant so if one fails the system still runs. Ideally you remove all of them but if there’s one hidden it’s still better than “the whole thing is a single point of failure”.
No, it’s not true. A single system has less failure scenarios, because it doesn’t depend on external controllers or anything that makes the system distributed and that can fail causing a failure to your system (which may or may not be tolerated).
This is especially true from a security standpoint: complexity adds attack surface.
Simple example: a kubernetes cluster has more failure scenarios than a single node. With the node you have hardware failure, misconfiguration of the node, network failure. With a kubernetes cluster you have all that for each node (each with marginally less impact, potentially, because it depends for example on stateful storage, that if you mitigate you are introducing other failure scenarios as well), plus the fact that if the control plane goes in flames your node is useless, if the etcd data corrupts your node is useless, anything that happens with resources (a bug, a misuse of the API, etc.) can break your product. You have more failure scenarios because your product to run is dependent on more components to work at the same time. This is what it means that complexity brings fragility. Looking from the security side: an instance can be accessed only from SSH, if you are worried about compromise you have essentially one service to secure. Once you run on kubernetes you have the CI/CD system, the kubernetes API, the kubernetes supply-chain, etcd, and if you are in cloud you have plenty of cloud permissions that can indirectly grant you access to the control plane and to a console. Now you need to secure 5-6-7 entrypoints to a node.
Mind you, I am not advocating against the use of complex systems, sometimes they are necessary, but if the complexity is not fully managed and addressed, you have a more fragile system. Essentially complexity is a necessary evil to respond to some other necessities.
This is the reason why nobody would recommend to someone who needs to run a single static website to run it on Kubernetes, for example.
You say “a well designed system”, but designing well is harder the more complexity exists, obviously. Redundancy doesn’t always work, because redundancy needs coordination, needs processes that also depend on external components.
In any case, I agree that you can build a robust system within Cloud! The argument I am trying to make is that:
And mind you, everything you can do in Cloud you can also do on your own, if you invest on it.
You make it redundant, I thought I didn’t need to say that…
I am specifically saying that redundancy doesn’t solve everything magically. Redundancy means coordination, more things that can also fail. A redundant system needs more care, more maintenance, more skills, more cost. If a company decides to use something more sophisticated without the corresponding effort, it’s making things worse. If a company with a 10 people department thinks that using Cloud it can have a resilient system like it could with 40 people building it, they are wrong, because they now have a system way more complex that they can handle, despite the fact that storage is replicated easily by clicking in the GUI.
Redundancy should be automatic. Raid5 for instance.
Plus cloud abstracts a lot of complexity. You can have an oracle (or postgres, or mongo) DB with multi region redundancy, encryption and backups with a click. Much, much simpler for a sysadmin (or an architect) than setting the simplest mysql on a VM. Unless you’re in the business of configuring databases, your developers should focus on writing insurance risk code, or telco optimization, or whatever brings money. Same with k8s, same with Kafka, same with cdn, same with kms, same with iam, same with object storage, same with logging and monitoring…
You can build a redundant system in a day like Legos, much better security and higher availability (hell, higher SLAs even) than anything a team of 5 can build in a week self-manging everything.
Yeah it should, but something needs to implement that. I mean, when distributed systems work redundancy is automatic, but they can also fail. We are talking about redundancy implemented via software, and software has bugs, always. I am not saying that it can’t be achieved, of course it can, but it has a cost.
I know, and if you don’t understand all that complexity you can still fuckup your postgres DB in a disastrous way. That’s the whole point of this thread. Also operators can do the same for you nowadays, but again, you need to know your systems.
Of course it is. You are paying someone else for that job. Not going to argue with that. In fact, that’s what makes it boring (which I talked about in the post).
This is a modern dogma that I simply disagree with. Building an infrastructure tailored around your needs (i.e., with all you need and nothing else) and cost effective does bring money, it does by saving costs and avoiding to spend an enormous amount of resources into renting all of that, forever, scaling with your business.
This is the marketing pitch. The reality is that companies still have huge teams, still have tons of incidents, still take long to deliver projects, still have security breaches, but they are spending 3, 5, 10 times as much and nothing of those money is capitalized.
I guess we fundamentally disagree, I envy you for what positive experiences you must have had!
That’s my whole point from the beginning, boring is good. Boring is repeatable, boring is reliable.
Of course they still have huge teams. The invention of the automobile made travel easier therefore there was more travel happening.
Agreed on it all.
I think a big driver for cloud clients is bean counters - cloud is an expense, while having your own systems is capital investment.
They’d rather have the waste of leasing too much compute than have to pay taxes on systems plus the cost of staff to run it.
We won’t really see this get addressed until companies have to truly own the risks they take on (see all the hacks that happen on a daily basis because CIO won’t pay for the security that IT management is screaming to build). When fines for these breaches are meaningful, cloud will be less interesting.
addons.mozilla.org/en-US/…/cloud-2-butt-plus/
This add-on brings me joy and is related. :)
This post must be fun with that one… 150+ instances in various contexts of “cloud”.
Thank you for that… going back and reading again with this was very, very funny
It’s funny that the sheer idea or frequency of the word is distasteful enough to build this.
i have that installed on my work pc. Hasn’t bit me in the ass yet. I work in a datacenter
Thanks for sharing. Great read and points.
Yep. My first move is to ask "could this just live in an ec2 box? Do we really need any of aws’ marketed custom options?
But then I would ask, what’s the point of paying 10-20x per computing unit at that point? If you just use ec2 instance, all AWS offers you is an API to manage them, is it worth the premium? Besides, you will still need to mess with a lot of other services (VPCs, SGs, etc.) anyways.
What’s the selling point in your opinion?
Well I would have more questions, like why AWS at all.
But for some, cognito auth management is important, to align with other product goals.
But then at that point you are already vendor-locked, right? At that point, running on bare ec2 instances and taking more control in your hands (vs using even more AWS-specific services) is going to help very little, when your whole user management is now tied to a specific provider.
The concerns of product auth and isolated ec2 driven work are two separate conversations.
If there is zero contact with AWS services (and ad you say, locks) then I would keep asking questions about why AWS is a good choice at all.
Very good read. I totally agree with your sentiment that more and more, “engineering” is becoming just gluing together and managing cloud services and features.
My job as a sys admin has become the same. It’s not about actually understanding the technology at a deep level and troubleshooting problems, it’s about learning specific applets and features to click on and running down daily and weekly checklists.
Temporarily becoming.
Just like China had some social and cultural changes since being closed and till the Opium wars.
Systems are built around people and limited by what a human can conceive and make work. We don’t evolve that fast.
Also dependency on big centers has led to catastrophes in the past and will lead to those again.
It will all crash with a huge bang.
I’m confident of this, anyone who wants may call me a luddite.
Let’s hope that people will start to favor on-prem solutions and smaller independent cloud providers vs the massive trillion dollar corpo clouds that control so much now.
I feel you very much. Security work is also somewhat similar.
I think this takes a way basically the component that made it interesting, understanding what you are doing to the point that you can build stuff.
Well said.
And that’s a good thing, IMHO. As an architect I don’t want to rely on some single genius knowing secret incantations or anything like that.
Boring, tried and true services, repeatedly put together and if the organization allows the time for it, with excessive documentation.
Is that what you get with Cloud? Because there are still a million ways to shoot yourself in the foot. The main difference is that the single genius doesn’t need to implement things him/herself, but decisions still need to be taken and fragile setups can still be built.
Imagine an ec2 instance in a satellite account performing some business critical function with an instance role, whose custom IAM policy allows to do it in another account. Clouds are not giving you good engineering, they are giving you premade building blocks, you can absolutely still make a mess with those. Even more, the complexity and the immense portfolio of features can allow very creative ways to build very low-quality systems.
I think you can have good, boring, simple systems built by engineers. With or without Cloud services.
You can still make a mess, but you can’t fuck up the building blocks, so it’s a big improvement.
Using an ec2 instance is already a yellow flag, you have higher level services for most tasks.
Yeah in general you can’t mess the building blocks from the PoV of availability or internal design. That is true, since you are outsourcing it. You can still mess them up from other points of view (think about how many companies got breached due to misconfigured S3 buckets).
No one’s talking about secret incantations.
They’re talking about knowing how your applications actually work, so you’re not tied to the whims of a third party.
Hence or anything like that.
If people don’t know what your systems actually do, you’re going to have huge problems at some point.
Where did I request for “not knowing what systems do”?
That’s literally the entire chain you clicked down.
The fact that cloud provider calls aren’t based in any kind of core principles and force you to spend all your resources understanding their nonsensical structure instead of what your code actually does.
Wrong. You don’t know how it’s implemented, but you very much know what they do. Even heard about abstraction?
Abstraction is great. When it’s meaningful.
Cloud abstraction adds massive complexity that has no correlation to what your code does.
An di shouldn’t. Separation of concerns.
Straw man. I’m encountering sys admins and systems “engineers” who don’t know how to spec out a server, don’t understand how certificates work, don’t understand basic IP addressing principles, don’t understand basic networking topology.
They just know how to click a list of specific buttons in a GUI for one specific Corpo vendor.
Maybe that is fine for a Jr. Admin just starting out, but it isn’t what you want for the folks in charge of building, upgrading, and maintaining your company’s infrastructure.
There’s nothing wrong with making interfaces simpler and easier to understand. And there’s nothing wrong with building simplified abstractions on top of your systems to gain efficiency. But this should not be done at the cost of actual deep understanding and functionality.
The people you call when things go badly wrong will always be the folks that have that deep understanding and competency. It already has started hitting the developer community in the last few years. The Jr. Devs that did a 3 month boot camp where they learned nothing but how to parrot code and slap APIs together, are getting laid off and cannot find work.
The devs that went to school for Comp Sci, that have years of real world experience, and actually understand the theory and the nuts and bolts of the underlying tech, they are still largely employed and have little trouble finding work.
I think the same will happen soon in the IT world. Deep knowledge and years of dirty, greasy hands will always be desirable over a parrot that only knows how to click GUI buttons in a specific order.
That’s incompetence, and that’s a different problem.
I hate websites with low contrast text.
<img alt="" src="https://lemmy.world/pictrs/image/a3be2301-25b9-4ff6-89f3-f5e3da04208d.png">
How do you get this? Anything that tries to force a light mode?
This is how the site is supposed to look like (there is no light/dark theme selection):
<img alt="" src="https://infosec.pub/pictrs/image/985f8294-8ba7-4b70-adfc-ca3d0ce55d92.png">
I was reading the site on Android, and it looked dark, but after seeing this comment, I tried disabling Android system wide dark mode, and sure enough it became white like in the screenshot! For the record, I tried with both Firefox and a Chromium-based browser.
Thanks! I went and tried on my phone and indeed setting Firefox to light mode indeed causes that horrendous and unreadable result. I will need to figure out way, eventually, and provide an alternative light scheme.
I get the same white background on Windows, Chromium and Firefox. Checking settings, I see FF is set to “Automatic” light/dark mode. When I manually select Dark mode, I see the dark background.
I will have a look if there is something that suggests how to “make” a light theme. Thanks for the info!
Thanks for the feedback, and same to @ilmagico@lemmy.world and @jg1i@lemmy.world. I fixed the configuration of the site and now the site should be readable even in light mode.
You’re welcome! And yes, I can confirm it works in light mode as well :)
That’s an interesting gotcha.
.
there are too many points of failure for me to ever be comfortable using the cloud as a primary storage option.
i’ve always maintained this opinion when “the cloud” started being touted as being the future. and yet more corporations (including mine) are reliant on it. i mean sure, i can log in on my home computer and have some access to stuff as though i were physically at the office but that convenience ain’t worth the headache if the main storage site crashes.
If the storage “crashes” it doesn’t matter if it’s in the cloud or on-prem.
With the cloud you get two substantial advantages:
Of course all this costs big bucks, but technically it’s superior, easier and less risky.
That’s easily mitigated just following established standards. Redundancy is cheaper than anything else in the aftermath and documentation can be done easy with automation.
You don’t, you rent rack space in a location far enough away but close enough to get the data in a few hours.
It’s neither superior, easier or less risky, it’s just a shift in responsibility. And in most cases, it’s so expensive that a second or third on site engineer is payed for.
And what is simpler and faster, renting rack space in another continent (and buying, shipping, racking and initializing) or editing your terraform file?
Not OP, but they are comparable efforts, especially since it’s a relatively infrequent activity. You can rent dedicated boxes with off-the-sheld hardware almost instantly, if you don’t want to deal with the hardware procurement, and often you can do that via APIs as well. And of course both options are much, much, much cheaper than the Cloud solution.
For sure speed in general is something Cloud provide. I would say it’s a very bad metric though in this context.
Full-ACK.
My last customer (global insurance company) provisions several systems a day. Now moving to hundreds via Jenkins. Frequency is environment dependent.
If your compute needs expand that much everyday, and possibly shrink in others, than your use-case is one that can benefit from Cloud (I covered this in the post).
That said, if provisioning means recycle, then it’s obviously not a problem.
This is a very rare requirement. Most companies’ load is fairly stable and relatively predictable, which means that with a proper capacity planning, increasing compute resources is something that happens rarely too. So rarely that even a lead time for hardware is acceptable.
So if I may ask (and you can tell), what is the purpose of provisioning that many systems each day? Are they continuously expanding?
Agree to disagree. Banking, telecommunications, insurance, automotive, retail are all industries where I have seen wild load fluctuations. The only applications where I have seen constant load are simulations: weather, oil&gas, scientific. That’s where it makes sense to deploy your own hardware. For all else, server less or elastic provisioning makes economic sense.
Edit to answer the last question: to test variable loads, in the last one. Imagine a hurricane comes around and they have to recalculate a bunch of risk components. But can be as simple as running CI/CD tests.
Systems are always overspecced, obviously. Many companies in those industries are dynosaurs which run on very outdated systems (like banks) after all, and they all existed before Cloud was a thing.
I also can’t talk for other industries, but I work in fintech and banks have a very predictable load, to the point that their numbers are almost fixed (and I am talking about UK big banks, not small ones).
I imagine retail and automotive are similar, they have so much data that their average load is almost 100% precise, which allows for good capacity planning, and their audience is so wide that it’s very unlikely to have global spikes.
Industries that have variable load are those who do CPU intensive (or memory) tasks and have very variable customers: media (streaming), AI (training), etc.
I also worked in the gaming industry, and while there are huge peaks, the jobs are not so resource intensive to need anything else than a good capacity planning.
I assume however everybody has their own experiences, so I am not aiming to convince you or anything.
Banking is extremely variable. Instant transactions are periodic, I don’t know any bank that runs them globally on one machine to compensate for time zones. Batches happen at a fixed time, are idle most of the day. Sure you can pay MIPS out of the ass, but you’re much more cost effective paying more for peak and idling the rets of the day.
My experience are banks (including UK) that are modernizing, and cloud for most apps brings brutal savings if done right, or moderate savings if getting better HA/RTO.
Of course if you migrate to the cloud because the cto said so, and you lift and shift your 64 core monstrosity that does 3M operations a day, you’re going to 3nd up more expensive. But that should have been a lambda function that would cost 5 bucks a day tops. That however requires effort, which most people avoid and complain later.
Ofc they don’t run them on one machine. I know that UK banks have only DCs in UK. Also, the daily pattern is almost identical everyday. You spec to handle the peaks, and you are good. Even if you systems are at 20% half the day everyday, you are still saving tons of money.
Between banks, from customer to bank they are not. Also now most circuits are going toward instant payments, so the payments are settled more frequently between banks.
I want to see this happening. I work for one and I see how our company is literally bleeding from cloud costs.
One of the most expensive product, for high loads at least. Plus you need to sign things with HSMs etc., and you want a secure environment, perhaps. So I would say…it depends.
Obviously I agree with you, you need to design rationally and not just make a dummy translation of the architecture, but you are paying for someone else to do the work + the service, cloud is going to help to delegate some responsibilities, but it can’t be cheaper, especially in the long run since you are not capitalizing anything.
Not only it can be cheaper, it is cheaper in most cases… when designed correctly. And if you compare TCO, not hardware vs IaaS.
It can also be much more expensive of course, but that’s almost always a skill issue.
In most cases! Sorry, I simply don’t believe it. Once you operate for 5, 10, 20 years not having capitalized anything is expensive as hell, even without the skill issue (which is not a great argument, as it is the case for almost anything).
It’s almost always the case with rent vs invest.
Do you have some numbers?
I cite a couple of articles in the post, and here is a nice list of companies and orgs that run outside the Cloud (it’s a bit old!) or decided to move away. Many big companies with their own DC, which is not surprising, but also smaller (Wikipedia!).
37signals also showed a huge amount of savings (it’s one of the two links in the post) moving away from the cloud. Do you have any similar data that shows the opposite (like we saved X after going cloud)? I am genuinely curious
Edit: here is another one tech.ahrefs.com/how-ahrefs-saved-us-400m-in-3-yea… Looking solely at the compute resources, there was an order of magnitude of difference between cloud costs and hosting costs (x11). Basically a value comparable (in reality double) to the whole revenue of the company.
Why on another continent? Except maybe VDI, some direct calls to some LLM or some insane scales, there’s nothing really that needs those round trip times.
Because the customer demands it.
Also data rules / data privacy. Some things need to have the original in Europe; China & Russia also need their data separated from others.
AWS engineers’ first responsibility is to shareholders
Mike’s responsibility is to your same boss.
They are not the same.
Bonus: you can see Mike’s certs are real.
It’s not about responsibility (and only the c suite reports to the shareholders, not Mike), it’s about capability, visibility, tooling and availability.
If everything that you run is local as in the same physical location and there is no requirement for external or internet access then sure. Not everyone has that luxury. Otherwise, There are the same number of points of failure in a non-cloud configuration. You just feel more comfortable with those because you have direct hands on control.
You write “actually following best practice instead of faking it and lying” funny.
There are places that actually do that?
Can you provide a list, because I’d like to work there.
(I do not have 25 years of sysadmin angst over nobody ever doing shit right until after it’s on fire.)
Proton runs fully on their own hardware, they have some positions open!
Are you implying that the various cloud vendors lie about the way they configure their environments or admins don’t have emotional biases or something else entirely?
Having done everything from building my own servers 30 years ago to managing hundreds of servers in data centers to now managing hundreds of instances and other services in AWS, I’ll gladly stick with AWS. The hardware management alone makes it well worth the overhead.
25 or so years ago I had to troubleshoot a hardware issue in a SCSI-based server with 6 hard drives in it. A drive appeared to be failing so I replaced it and immediately another drive failed, then another, and so on. After almost a full day of troubleshooting later and we realized the power supply was actually the culprit and could no longer provide sufficient power to the full set of hard drives.
20 years ago while managing 700+ servers in a datacenter we had to manage a recall of about 400 of them thanks to the Capacitor plague that caused a handful of our servers to literally burst into flames.
Hardware failures like the above and dozens of others were mitigated in most cases thanks to redundancies in the software we wrote. But dealing with hardware failures and the resulting software recovery was a real PITA.
With AWS I may occasionally have a Linux instance lock up due to a hardware failure but it’s usually fairly easy to reboot the instance and have it migrate to new hardware. It’s also trivial to migrate a server to run on more (or less) number of CPU’s, RAM, etc. with only a couple of minutes of downtime.
The more advanced services AWS offers like object storage, queues, databases, etc. are even more resilient. We occasionally get notified that a replica for one of these services had failed or was determined to be on hardware that was failing, and it was automatically replaced with a new replica.
I’d much rather work this way than the way I did 20+ years ago.
Why not outsourcing just the hardware then? Dedicated servers and Kubernetes slapped on them. Hardware failure mitigated for the most part, and the full effort goes into making the cluster as resilient as possible, for 1/5 of the cost of AWS. If machines burn, it’s not your problem (you can have them spread over multiple sites, DCs, rooms, racks) anymore.
We did that (with Rackspace) for years before migrating to AWS. AWS is still far better from a service & flexibility perspective.
My employers website has certain times of the year where we see a huge increase in web traffic. When we had a hosted solution it took weeks of preparation to provision additional web servers to handle that load. We had to submit formal requests for additional servers, document how to wire them into our network & required firewall rules, etc. Then we had to wait an arbitrary number of days for them to do the work. And then we had to repeat that whole process when we no longer needed the additional capacity.
With AWS we just define an auto scaling group and additional web servers are spun up automatically when demand is high, and frees them up again when no longer needed. Even if we didn’t use auto scaling we could easily automate this sort of thing via terraform or other tools and spin up additional instances in minutes instead of days.
Great post, a quick nitpick if you don’t mind, introduce or use an abbreviation’s full words before using its abbreviated form
Granted that the article is geared towards sysadmins and cloud developers, others who may want to read it may have a hard time doing so. As an example, reading through the first technical point, I saw “IAMs” and “Network ACL”, I don’t understand what those abbrs mean
Thanks, that is a very good observation! I will try to sneak an edit later today where I can add some appendix about acronyms and abbreviations.
Edit:
While it might not look great, I have added at the bottom an Appendix with all (hopefully, I might have missed some) acronyms and abbreviations. Thanks for the suggestion!
The cloud is just someone else’s computer
With a lot of stuff on top!
How do u feel about cotton candy
Pink or Blue?
Pink of course
Anything that requires a fancy buzzword is usually stupid but a good way to make money for someone. The “cloud” has always existed as offsite hosting. Off-site shared servers, VPSs, whatever. It’s no different than running CPanel on an LAMP VPS in 2003.
But calling it “the cloud” gave all the business majors a hard on and then the accounts department realized they could manipulate share pricing by reducing the amount of assets a company holds. It’s the same stupid reason many companies don’t own their corporate headquarters or remote centers. They lease the, even if from themselves through another holding. It looks better on paper so the share price goes up. It’s all mind boggling stupid.
The cloud today significantly different than the 2003 cpanel LAMP server. It’s a whole new landscape. Complex, highly-available architectures that cannot be replicated in an on-prem environment are easily built from code in minutes on AWS.
Those capabilities come with a steep learning curve on how to operate them in a secure and effective manor, but that’s always going to be the case in this industry. The people that can grow and learn will.
I’m fully aware of the few buzzword and marketing pitches that cloud hosting uses. I’m forced to use both GCP and AWS for different contracts and I’m good at it.
The real truth is that most websites and internet services do not need scale. They do not need all this crap. A Pentium 3 could host all the data for most of these businesses and services. You don’t need serverless lambda functions to handle an api when an actual endpoint does the same thing to pull some info. The few companies that need such distributed computing and power, will need a big on-site or off-site implementation. It makes sense for that sometimes. But most times, it doesn’t even then. You’re just outsourcing your engineering and paying a premium.
I have seen so many startups spin up cloud accounts costing thousands of dollars a month when they’re in “private beta stealth”. Literally a $500 laptop could host all of their services just as quickly with no monthly fee. But as long as the VCs are paying, just flush that cash down.
The costs are definitely a huge consideration and need to be optimized. A few years back we ran a POC of Open Shift in AWS that seemed to idle at like $3k/mo with barely anything running at all. That was a bad experiment. I could compare that to our new VMWare bill, which more than doubled this year following the Broadcom acquisition.
The products in AWS simplify costs into an opex model unlike anything that exists on prem and eliminate costly and time consuming hardware replacements. We just put in new load balancers recently because our previous ones were going EoL. They were a special model that ran us a about a half-mil for a few HA pairs including the pro services for installation assistance. How long will it take us to hit that amount using ALBs in AWS? What is the cost of the months that it took us to select the hardware, order, wait 90 days for delivery, rack-power-connect, configure with pro services, load hundreds of certs, gather testers, and run cutover meetings? What about the time spent patching for vulnerabilities? In 5-7 years it’ll be the same thing all over again.
Now think about having to do all of the above for routers, switches, firewalls, VM infra, storage, HVAC, carrier circuits, power, fire suppression.
I’m immensely disappointed!
Not kidding: when I first saw the post title, I was fully convinced that I’ll read the post of a crazy person, rambling about (rain) clouds.
Yup, me too!
I am sorry! As an amateur landscape photographer I actually like very much those clouds. There are a few r-word posts about people hating those clouds though, but I checked and they are nowhere near as long as you would expect a proper rant to be
I used to love ‘the cloud’. Rather, a specific slice of it.
I worked almost exclusively on AppEngine, it was simple. You uploaded a zip of your code to appengine and it ran it at near infinite scale. They gave you a queue, a database, a volatile cache, and some other gizmos. It was so simple you’d struggle to fuck it up really.
It was easy, it was simple, and it worked for my clients who had 10 DAU, and my clients who had 5 million DAU. Costs scaled nearly linearly, and for my hobby projects that had 0 DAU, the costs were comparable.
Then something happened and it slowly became complicated. The rest of the GCP cloud crept in and after spending a term with a client who didn’t use “the cloud” I came back to it and had to relearn nearly everything.
Pretty much all of the companies I’ve worked for could be run on early AppEngine. Nobody has needed anything more than it, and I’m confident the only reason they had more was because tech is like water. You need to put it in a bucket or it goes everywhere.
Give me my AppEngine back. It allowed me to focus on my (or my clients) problems. Not the ones that come with the platform.