Posted on 2023-01-05

Build vs buy

I like cloud services as much as the next person. Definitely not too much but definitely, umm, much?

Anyway, I think cloud-specific services can sometimes be worth the sort of vendor lock-in that they facilitate.

Vendor lock-in sucks, but so does re-inventing the wheel. As mentioned in an earlier post, it can be pretty nice to create alerts and application metrics passively from your log files.

Knowing when to use vendor-specific services and APIs versus rolling your own is, in some ways, more art than science. Reasonable teams won't change their infrastructure provider very many times. The difference between the two choices is often less about lock-in and more about whether the problem you're solving is really the special unicorn that you want it to be.

Sometimes, the cloud vendor solution is just bad, though.

Case in point - AWS Transfer Server

On my team, we need to accept SFTP file transfers from about 20-30 vendors.

Sadly we can't avoid SFTP for this. It's an industry standard for the kind of data we're receiving and we aren't in a position to dictate otherwise.

AWS offers a suite of products which include a hosted SFTP solution, AWS Transfer.

SFTP is simple enough and I'd ultimately like the uploaded files in an S3 bucket, so it seems like a great fit, right?

Wrong.

You have to jump through a hundred different hoops to even attempt to have password-based authentication working on this service.

The services involved to make password-based authentication work here aren't in and of themselves a problem. The problem is the multiple points of failure they represent. Each service can fail on its own, the permissions between the services can fail, or the code in the Lambda itself can fail.

Debugging it was a nightmare. It ultimately worked but I wasn't sure why. I had a strange permissions error involving either the IAM role used to launch the Lambda or the role used to access the S3 bucket for uploads. I don't remember at this point. But it was really clear that if it ever broke again, then I or whatever poor soul had to work on it was going to regret the choice of technologies to make this work.

You know what is easy and simple and well understood? OpenSSH and Linux user accounts.

In the end, the AWS Transfer option proved to be just complicated and enticing enough for me to waste about a day and a half on it before I realized, in a drunken stupor, that I'd be better off creating a snowflake EC2 instance and calling it a day.

Resisting temptation

I do think I could have identified that this was a bad idea before I wasted over a day. One doesn't have to be doomed to repeat this mistake to learn the lesson here.

The way to avoid this is to interrogate the path before you start.

Empathy is a good tool here.

Picture the thing working as intended. Then imagine someone else (or future you a year from now) having to search for how to change something or track down an error.

Are they likely to find many other people using these services? (In this case, no)
Are the services involved purpose-built for the task at hand? (In this case, no, none of them)
Does your team have pre-existing expertise around the services involved? (Again, no)
What benefits does this approach yield over a traditional solution? (Upload to S3 is nice)
Are those benefits important to your use case? (Not for us)
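
If it helps to make the questions concrete, here is a minimal sketch of them as a checklist in Python. The Answer class and the red_flags function are names I made up for illustration, and the answers are the ones given above for AWS Transfer. It makes no attempt to decide for you. It just counts the answers that argue against the approach.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    question: str
    favorable: bool  # True if the answer counts in favor of the vendor service
    note: str = ""


def red_flags(answers):
    """Return the questions whose answers count against the approach."""
    return [a.question for a in answers if not a.favorable]


# Answers for AWS Transfer, taken from the list above.
aws_transfer = [
    Answer("Will a maintainer find many other people using these services?", False),
    Answer("Are the services purpose-built for the task at hand?", False),
    Answer("Does the team have pre-existing expertise with them?", False),
    Answer("Are the benefits over a traditional solution important to us?", False,
           note="Upload to S3 is nice, but not important for our use case"),
]

flags = red_flags(aws_transfer)
print(f"{len(flags)} of {len(aws_transfer)} answers count against this approach")
```

How many red flags is too many is a judgment call. In this case every answer pointed the same way.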

I didn't ask myself any of those questions. I really should have, because the answers are easy to get and they clearly indicate AWS Transfer Server is not the best answer.

Timeboxes

Another tool that could have helped here is a timebox.

A timebox is when you decide to give yourself a fixed, and often pretty short, amount of time to accomplish something or to at least get meaningful insight into a potential solution.

Sometimes I will demand that a working solution be done inside a given timebox or else the approach is abandoned. Other times I might just say that I'm going to spend X amount of time on an approach and then make a point to re-evaluate how long the complete solution will take.
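
The two flavors of timebox can be sketched as a tiny decision rule. This is only an illustration: the Policy and Timebox names are hypothetical, and the hour budgets in the usage example are made-up numbers, not a recommendation.

```python
from dataclasses import dataclass
from enum import Enum


class Policy(Enum):
    HARD_STOP = "hard stop"    # working solution inside the box, or abandon it
    CHECKPOINT = "checkpoint"  # re-estimate the full solution when time is up


@dataclass
class Timebox:
    budget_hours: float
    policy: Policy

    def next_step(self, hours_spent: float, working: bool) -> str:
        """Decide what to do given time spent and whether the thing works."""
        if working:
            return "done"
        if hours_spent < self.budget_hours:
            return "keep going"
        if self.policy is Policy.HARD_STOP:
            return "abandon"
        return "re-estimate"


# Hypothetical budgets, for illustration only.
strict = Timebox(budget_hours=4, policy=Policy.HARD_STOP)
loose = Timebox(budget_hours=4, policy=Policy.CHECKPOINT)

print(strict.next_step(hours_spent=5, working=False))  # abandon
print(loose.next_step(hours_spent=5, working=False))   # re-estimate
```

The point of writing the rule down before starting is that the decision is already made by the time the budget runs out, which is exactly when it gets hard to make.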

It's important to remember how valuable timeboxes can be. The more time we waste on things like this, the more attached we become to them. Sunk cost can get even the best of us.

For me the simplest tool for avoiding the sunk cost fallacy is to timebox risky endeavors like this one. The time inside a timebox starts out as a write-off. I never feel bad about discarding failed results inside a timebox. I should use them more aggressively.