Organizations are realizing the importance of data, and as a result they are investing in strong data operations. Its recognition as a powerful business asset has seen the emergence of dedicated data teams comprising full-time roles for data scientists, architects, analysts, and, crucially, data engineers. Since the data infrastructure helps the organization make decisions faster, it is important to understand the data engineer's impact. I'll be talking about a few small but very important skills of a data engineer that need to be buffed - continuously...
Measurement
I would start with the obvious: you are dealing with data, you are building products based on data, you are connecting the company's parts using data... so it's pretty clear that you should measure yourself by it!
Here are some examples:
Measure the value you are creating per MB - find a way to measure every table in your DWH that you have created and see how many of them are actually being used. A simple usage ratio (unique MB used / total MB) will map your usage and reveal lots of unused jobs to scrap (see the sketch after this list).
Example: events, where you start capturing everything and end up using only a small fraction of them.
% of tests out of jobs - tests are important to your data reliability, system stability and your day-to-day. Measuring the test ratio will urge you to keep adding them.
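To make this concrete, here's a minimal Python sketch of both metrics. The table names, sizes, query log and job counts are hypothetical placeholders - in practice you would pull them from your warehouse's information schema and query history:

```python
# A minimal sketch of the usage metric above. Table sizes and the set of
# tables actually queried are hypothetical placeholders.

table_sizes_mb = {
    "orders": 1200,
    "orders_staging": 900,
    "users": 300,
    "legacy_events_raw": 5000,
}

# Unique tables that appeared in user queries over the measured period
tables_used = {"orders", "users"}

used_mb = sum(size for name, size in table_sizes_mb.items() if name in tables_used)
total_mb = sum(table_sizes_mb.values())

print(f"Usage ratio: {used_mb / total_mb:.1%}")

# Tables nobody queried, largest first - prime candidates for scrapping
unused = sorted(set(table_sizes_mb) - tables_used, key=table_sizes_mb.get, reverse=True)
print("Largest unused tables:", unused)

# The same idea applies to the test ratio: jobs covered by tests / total jobs
jobs_with_tests, total_jobs = 42, 60  # hypothetical counts
print(f"Test ratio: {jobs_with_tests / total_jobs:.1%}")
```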
Data Management
Many data warehouses or data lakes are the result of a small initiative that expanded over time and ended up in a state of entity duplication, messy metadata and no catalog at all. You should start planning your data management strategy early, ideally in parallel with any new data initiative or project - you will be more efficient, scalable and ready for the future, plus you will save lots of your business users' precious time.
Having a data management infrastructure that includes things like metadata management is critical for allowing users to perform data discovery and make optimal use of your data. Think of metadata services as a core component of your data platform that makes it widely used around the company. It may seem like a big investment while your team is still small, but if your company aims for a data-driven culture, it's a must-have.
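To illustrate the idea (not any specific catalog product), here's a toy Python sketch of a metadata registry that supports tag-based discovery; the fields and entries are invented for the example:

```python
from dataclasses import dataclass, field

# A toy metadata catalog - a real platform would back this with a database
# or a dedicated catalog service, but the idea is the same: describe each
# dataset once so users can discover it themselves.

@dataclass
class DatasetMetadata:
    name: str            # physical table name in the DWH
    owner: str           # team accountable for the data
    description: str     # what the dataset contains
    tags: list[str] = field(default_factory=list)

catalog: list[DatasetMetadata] = [
    DatasetMetadata("dwh.orders", "sales-eng", "One row per completed order", ["sales", "core"]),
    DatasetMetadata("dwh.users", "platform", "Registered users and their attributes", ["core"]),
]

def discover(tag: str) -> list[DatasetMetadata]:
    """Let business users find datasets by tag instead of asking the data team."""
    return [d for d in catalog if tag in d.tags]

for dataset in discover("core"):
    print(dataset.name, "-", dataset.description)
```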
Infrastructure automation
As a data engineer you get to work with many tools to create data pipelines and integrations, and the same tools may serve several of your projects.
Let's take the AWS platform as an example: you might use Lambda functions to validate ingestion, API Gateway as your REST interface for data ingestion, Kinesis Data Streams for real-time analysis, Kinesis Firehose to deliver the data, and S3 as a persistence layer - and there's also GCP, which I haven't even talked about... it's a lot to handle.
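As a hedged sketch of the validation step in such a pipeline, here's what a Lambda handler behind API Gateway might look like; the required fields and the Firehose delivery stream name are assumptions for illustration:

```python
import json
import boto3

firehose = boto3.client("firehose")

# Hypothetical values for the sketch - adjust to your own schema and stream.
REQUIRED_FIELDS = {"event_type", "user_id", "timestamp"}
DELIVERY_STREAM = "my-ingestion-stream"

def handler(event, context):
    """Validate an ingestion request from API Gateway, then forward it to Firehose."""
    record = json.loads(event.get("body") or "{}")

    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        return {"statusCode": 400,
                "body": json.dumps({"error": f"missing fields: {sorted(missing)}"})}

    firehose.put_record(
        DeliveryStreamName=DELIVERY_STREAM,
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )
    return {"statusCode": 202, "body": json.dumps({"status": "accepted"})}
```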
All of the above requires setup and configuration, and with similar projects all being set up manually, it becomes a time-consuming effort.
You can:
Build modular - use modules
The fastest way to provide a POC or easily establish a new pipeline is to work with modular deployment of reusable code. For example, use one module to deploy your API Gateway, another module to deploy Kinesis, an additional module to manage IAM roles, etc. You will be able to reuse your code across components with the vast majority of the tools around.
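As one hedged illustration (the post doesn't prescribe a tool), here's how the AWS CDK in Python lets you package a pipeline component as a reusable construct and instantiate it per project; all names here are invented:

```python
from aws_cdk import App, Stack
from aws_cdk import aws_apigateway as apigw
from aws_cdk import aws_kinesis as kinesis
from constructs import Construct

class IngestionModule(Construct):
    """A reusable 'module': API Gateway + Kinesis stream, deployable per project."""

    def __init__(self, scope: Construct, construct_id: str, stream_name: str) -> None:
        super().__init__(scope, construct_id)
        self.stream = kinesis.Stream(self, "Stream", stream_name=stream_name)
        self.api = apigw.RestApi(self, "IngestApi")
        events = self.api.root.add_resource("events")
        events.add_method("POST")  # no integration given, so CDK wires a mock one in this sketch

class PipelineStack(Stack):
    def __init__(self, scope: Construct, stack_id: str) -> None:
        super().__init__(scope, stack_id)
        # Instantiate the same module twice - once per data source.
        IngestionModule(self, "Clickstream", stream_name="clickstream-events")
        IngestionModule(self, "Payments", stream_name="payment-events")

app = App()
PipelineStack(app, "DataPipelines")
app.synth()
```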
Use version control
Working in a team, that's helpful! You can enable a pull-request flow to review the code before merging it to the master branch - an easy way of tracking changes and collaborating effectively with others (reuse, did I mention that?).
Use a continuous integration/continuous delivery pipeline
You can automate everything and make your job much easier by using a CI/CD pipeline. I believe many of you are already using one - but applying it across all your projects is always a good idea.
Use Terraform or CloudFormation (if you are AWS based), and then write all your infrastructure as code. The time and effort required will be worthwhile: you will have full control of your infrastructure, and it will enable you to deploy a brand-new data pipeline in minutes, by just executing your infrastructure code.
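Tying the last two points together, here's a hedged sketch of a small Python wrapper that a CI job could run to deploy a pipeline from its Terraform code; the directory layout is an assumption for the example:

```python
import subprocess

# Hypothetical layout: each data pipeline lives in its own Terraform directory.
PIPELINE_DIR = "infra/pipelines/clickstream"

def deploy(pipeline_dir: str) -> None:
    """Run the standard Terraform workflow non-interactively, as a CI step would."""
    for command in (
        ["terraform", "init", "-input=false"],
        ["terraform", "plan", "-input=false", "-out=tfplan"],
        ["terraform", "apply", "-input=false", "tfplan"],
    ):
        subprocess.run(command, cwd=pipeline_dir, check=True)

if __name__ == "__main__":
    deploy(PIPELINE_DIR)
```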
In my next posts I'll be writing about big topics from the DE day-to-day, such as architecture dilemmas, data stack decisions and data pipeline optimizations...
Stay tuned.