Easy as...a Data Lake

Whether you’re a developer, solutions architect, or running a business and thinking about moving your data to the cloud, you need to know about AWS Lake Formation.

Let’s say you’ve got some data, an AWS account and some big dreams for all the insight your data is going to create. You think you’ll create a quick data lake with S3. That’s where the fairytale ends.

Every solutions architect I’ve ever known has cautioned against data lakes turning into “data swamps”. This is with good reason, especially if you have lots of data. The bigger the data lake, the harder it is to organise everything. AWS makes it seem simple: “We have S3,” they said. “Put all your data here,” they said. “It will be really easy,” they said. The thing is, they did tell you the truth; they just didn’t tell you the whole truth.

Don’t get me wrong, S3 is fantastically durable, reliable and simple; it abstracts away all sorts of problems that you’re better off not having to worry about in terms of providing you with access to your data when you need it (think AWS’s “11 nines” durability pitch). But as soon as there are people other than you who are going to access your data, that’s where the fun really begins.

Are you encrypting that data?

If not, you should be, because GDPR - and because, frankly, encryption doesn’t have to be an expensive operation anymore. “But that’s easy, we have SSE-KMS, or SSE-S3.”

But are you going to manage all those KMS key policies and IAM Roles (or Users - but really, don’t use Users) to make sure the right IAM principals have got kms:Encrypt and kms:Decrypt permissions?
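To give a feel for what that management involves, here is a minimal sketch of the IAM policy a single Role would need just to use one KMS key with S3. It assumes SSE-KMS, where writing objects generally needs kms:GenerateDataKey and reading them needs kms:Decrypt. The function name, the Sid and the key ARN are made up for illustration, and the key policy on the KMS side would still have to allow the Role as well.

```python
import json


def s3_kms_policy(key_arn: str) -> dict:
    """Build an IAM policy document that lets a role read and write
    SSE-KMS encrypted S3 objects using one specific KMS key."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "UseDataLakeKey",  # hypothetical statement id
                "Effect": "Allow",
                # GenerateDataKey covers uploads, Decrypt covers downloads.
                "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
                "Resource": key_arn,
            }
        ],
    }


# Hypothetical key ARN; substitute your own account, region and key id.
policy = s3_kms_policy("arn:aws:kms:eu-west-1:111122223333:key/EXAMPLE-KEY-ID")
print(json.dumps(policy, indent=2))
```

Now multiply that by every Role, every key and every bucket, and keep it all in sync as people join and leave teams.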

What if you have to restrict access within an S3 bucket at the object level?

Now you need object ACLs, additional accounts and cross-account Roles (or maybe a different strategy for how you organise your data across buckets). Wouldn’t it be nice to give permissions at the field level, based on a User or Role, rather than duplicating data or creating it ad hoc?

I can say from experience that it is not only nice - it is required to be able to do all of these things.

And if you’re thinking about what to do with your data, you’ll probably find that these all sound useful as well. The problem is that all of these things that are great for security, and for holding to the Principle of Least Privilege, are a huge drain on time and patience.

This is where AWS Lake Formation becomes your friend and we revisit my topic sentence. Here it is again:

Whether you’re a developer, solutions architect, or running a business and thinking about moving your data to the cloud, you need to know about AWS Lake Formation.

The simple fact is that Lake Formation helps with a huge chunk of building a data lake, but there’s one feature that I’m here to tell you about more than the rest. I’ve built data lakes before without Lake Formation, and while building a metastore can be a pain, AWS Glue makes that easy if you’re clever about using event triggers. Transformation of data can be done in tons of ways these days, either with AWS-specific tools or with ETL platforms. You name it.

The one task that I find most cumbersome is setting up permissions for who has access to what data - and finding a good way to implement that. When I say “a good way,” what I mean is a way that doesn’t make your developers’ lives terrible. I have had conversations with AWS staff about what I would need to do to replicate Lake Formation’s simple access control tools - and let me tell you, it is not a fun journey for a developer.

AWS has made setting who has access to which parts of your data lake really easy. This isn’t even just who has access to what files; it’s at the field level. Don’t want your admin staff to see street addresses? No problem. Don’t create a second version of the file and put it in a location that they have access to - just restrict the field to their Role.
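As a rough idea of what “restrict the field to their Role” can look like in practice, here is a small sketch that builds the arguments for Lake Formation’s GrantPermissions call, giving a Role SELECT on every column of a table except the ones you name. The helper function, Role ARN, database, table and column names are all invented for this example; the actual call is left commented out because it needs boto3 and AWS credentials with Lake Formation admin rights.

```python
def column_restricted_grant(role_arn, database, table, hidden_columns):
    """Build keyword arguments for a Lake Formation GrantPermissions request
    that allows SELECT on all columns of a table except `hidden_columns`."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                # Wildcard = every column, minus the excluded ones.
                "ColumnWildcard": {"ExcludedColumnNames": list(hidden_columns)},
            }
        },
        "Permissions": ["SELECT"],
    }


# Hypothetical names throughout; substitute your own Role and Glue table.
request = column_restricted_grant(
    role_arn="arn:aws:iam::111122223333:role/AdminStaff",
    database="customers_db",
    table="customers",
    hidden_columns=["street_address"],
)
print(request)

# To apply the grant for real (not run here):
# import boto3
# boto3.client("lakeformation").grant_permissions(**request)
```

One grant, one table, no second copy of the file.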

This is easy. This is revolutionary. I love it, and your developers will, too. If you are a developer and you didn’t know about this, now you do. Here’s the documentation and the CloudFormation.

Go forth and be a hero.