By: Andrew Weeks, Jeremy Brenner, Mayra Quiroga
Drawing insights from data is invaluable. But it is rare to find data that is neatly packaged and ready to be consumed. Before any analysis can begin, a few preliminary steps need to happen: finding data, sourcing it, cleaning it, rearranging it, enriching it, storing it, and finally accessing it. Yikes. Each of these steps has its own complications once you take into account the many different technologies involved; some will be out of your control, and the ones that aren't may induce decision paralysis from all the choices available. If you are an analyst, or anyone who takes an interest in drawing insights from data, you'd most likely rather skip to the part where you actually analyze the dataset. A Data Engineer (DE) helps you do just that. Their role is to provide a dataset ready for analysis; figuring out the technicalities of what happens behind the scenes of a data pipeline is part of their job.
So what does a Data Engineer do?
When a DE starts a new project, their first steps are normally to gather requirements, assess the technical landscape, architect and implement the data pipeline solution, tailor the data, and deliver it. Let's dive into each of these steps and how we approach them at Enfuse.
Gather Requirements:
The most important part of the process is understanding the needs and wants of the consumers. To gather requirements we aim to answer questions such as:
- How will the end-user leverage this data?
- What should the shape of the data and its fields look like?
- What format does the client expect? Flat files, Relational Data, API, etc.
- What is the frequency of data delivery? e.g. bulk batches or real-time
- Any SLAs, performance or freshness requirements? e.g. 10 million records in 24 hours
- Is there sensitive data? e.g. SSN
- Any relevant compliance requirements? e.g. HIPAA, SOX, GDPR, CCPA
- Is any data encrypted / masked?
- How do they expect the data to be presented? e.g. Business intelligence tool
Often, we summarize the key findings from the requirements stage in a data contract.
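A data contract can be as lightweight as a small structured record. The sketch below, in Python, is illustrative only; the field names and example values are assumptions, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class DataContract:
    """A lightweight record of the agreements reached during requirements gathering.

    All field names and example values here are illustrative.
    """
    dataset: str
    delivery_format: str              # e.g. "flat_file", "relational", "api"
    delivery_frequency: str           # e.g. "daily_bulk", "realtime"
    sla: str                          # e.g. "10M records within 24 hours"
    sensitive_fields: list = field(default_factory=list)  # e.g. ["ssn"]
    compliance: list = field(default_factory=list)        # e.g. ["HIPAA"]

# A hypothetical contract for a healthcare dataset:
contract = DataContract(
    dataset="patient_claims",
    delivery_format="flat_file",
    delivery_frequency="daily_bulk",
    sla="10M records within 24 hours",
    sensitive_fields=["ssn"],
    compliance=["HIPAA"],
)
```

Capturing these answers in one place gives both sides a shared artifact to review and revisit as requirements evolve.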
Assess the Technical Landscape:
Not only do we need to understand the data requirements, but also the environment we develop, architect, and deploy our solution within. Too many times we have seen a theoretically working solution delayed by months due to access issues. By planning and identifying risk areas and blockers up front, we are able to substantially increase our velocity later. The key areas of discovery are:
- What technologies currently store the data? e.g. relational databases, data lake, NAS, object storage, HDFS, API
- Where does the data live? e.g. on-premises, in the cloud, with a third-party vendor
- How much do we rely on third party resources? e.g. engineers, tooling, and services
- How is access granted to these systems, for both users and automated processes?
- Will the new solution be developed on existing or new infrastructure / cloud platform?
- What firewall rules need to be validated and implemented?
- Finally, one area not directly related to technology but critically important: the SDLC that the teams we partner with follow. What does work intake look like? How are stories written, assigned, and tracked? Understanding this up front is crucial to maintaining consistent velocity.
It’s important to note that each source system and technology comes with its own methods of extraction which will dictate the architecture of the solution to store the raw data for further processing.
Data Pipeline Architecture and Implementation:
In the same way that no two snowflakes are the same, no two data pipelines are the same. They might share the same structure, but each has its own intricacies. Once we've gathered the requirements and assessed the landscape, we are ready to design a solution. When designing pipelines, we know they must be:
- Optimized for performance
- Measurable and monitorable: we should be able to monitor and tune performance and be alerted about failures
In this way Data Engineering really is in part software engineering.
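One minimal way to make a pipeline step measurable and monitorable is to wrap it so that durations and failures are always logged, giving an alerting system something to watch. The sketch below is illustrative, assuming nothing beyond the Python standard library; a real pipeline would emit these signals to a metrics or alerting backend:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")

def monitored(step):
    """Illustrative decorator: time a pipeline step and log failures for alerting."""
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            result = step(*args, **kwargs)
            # Success path: record how long the step took so it can be tuned.
            logger.info("step=%s status=ok duration=%.3fs",
                        step.__name__, time.monotonic() - start)
            return result
        except Exception:
            # Failure path: log with traceback, then re-raise so the run fails loudly.
            logger.exception("step=%s status=failed", step.__name__)
            raise
    return wrapper

@monitored
def transform(rows):
    # A stand-in transformation for demonstration purposes.
    return [r.upper() for r in rows]

result = transform(["a", "b"])
```

The design choice here is that monitoring wraps the step rather than living inside it, so every step gets the same timing and failure signals without duplicating code.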
Tailor the data:
The simplest pipeline architecture involves a data source, one or more data transformations, and a data sink. Without proper tailoring, a data pipeline may fall victim to the garbage-in, garbage-out principle. Before diving into development, we generally profile the data: we study the existing schemas and relationships across the datasets to gather metadata that helps us clean, organize, and understand the scope of potentially duplicate, incomplete, or missing data.
When we talk about tailoring our data we are referring to:
- Simple transformation/enrichment of data (e.g. converting data formats, joining fields)
- Applying complex rules based on composite fields (e.g. address validation based on city/state and zip)
- Data cleansing and validation (e.g. routing rows that are missing required fields to a dead-letter queue, or DLQ)
- Normalizing and denormalizing the data as appropriate
- Joining and enriching across multiple datasets
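Several of the steps above can be combined in a single tailoring pass. The sketch below is a minimal illustration, with hypothetical field names and a made-up lookup table: it validates each row, routes incomplete rows to a dead-letter queue (here just a list), and enriches the survivors from a second dataset:

```python
# Fields every row must carry to be usable downstream (illustrative).
REQUIRED = {"id", "zip"}

# Hypothetical enrichment dataset: zip code -> city/state.
ZIP_LOOKUP = {"94103": {"city": "San Francisco", "state": "CA"}}

def tailor(rows):
    """Validate, dead-letter, and enrich rows; returns (clean, dlq)."""
    clean, dlq = [], []
    for row in rows:
        # Garbage-in stops here rather than propagating downstream.
        if not REQUIRED.issubset(row) or any(row[f] in (None, "") for f in REQUIRED):
            dlq.append(row)
            continue
        # Enrichment: join city/state onto the row based on its zip code.
        enriched = {**row, **ZIP_LOOKUP.get(row["zip"], {})}
        clean.append(enriched)
    return clean, dlq

clean, dlq = tailor([
    {"id": 1, "zip": "94103"},
    {"id": 2},                 # missing required field -> DLQ
])
```

In a production pipeline the DLQ would be a durable queue or table so that rejected rows can be inspected and replayed, not silently dropped.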
Delivering the data:
When a data pipeline has delivered the data to its final sink (for example, a storage bucket or a data lake), a data engineer's job is not complete. A DE's job is not to deliver data to a bucket, but to ensure that it is accessible and usable by the people who need it. Sometimes this means additional transformations and access controls so data analysts can work with it. Sometimes it means loading it into a Business Intelligence (BI) tool like Tableau or Looker for executive-level dashboards and summaries. Often it will be a balance of the two, and it's important that the DE be able to hand off the data in various formats or tools.
That's how we see the role and responsibility of a data engineer. They are advocates for your consumers and ensure that your data gets to the right place, at the right time, in the right shape. They are critically important because they become experts on the data and the pipelines. At Enfuse we often take it a step further: just as our Software Engineers practice "XP" (Extreme Programming), we have extended these practices to our Data Engineering team.
What makes a Data Engineer an “EXtreme” Data Engineer?
There are many practices involved in XP, and if you would like to find out more you can check out some of our blogs or read "Extreme Programming Explained" by Kent Beck. For us, EXtreme Data Engineering revolves around three key practices: Pair Programming, Unit Testing, and Continuous Integration / Continuous Deployment (CI/CD). These allow us to develop good code at a consistent, predictable velocity and to regularly and confidently deploy fixes to our users.
Pair Programming is the practice of two people working together to achieve higher-quality results: less bug-ridden code in less time. While pairing data engineers is uncommon in the industry, it is part of our secret recipe for consistently delivering quality results that exceed expectations. When we partner with clients, they share unique perspectives on historic decisions and data schemas, and we share best-in-breed tools, practices, and experience, leveling both engineers up.
The practice of testing is universally accepted in principle, yet almost every implementation has its detractors and supporters. At Enfuse we firmly believe in the importance of unit tests, which allow us to refactor and extend our code confident that no existing functionality is broken. This primarily applies to our Python pipelines. The second layer on top of unit tests is integration and regression tests, which we push for all clients who have a non-production environment to support them. Being confident that the SQL transformations and views work as expected is incredibly valuable when deciding to deploy to production.
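As a flavor of what a pipeline unit test looks like, here is a minimal sketch; the transformation under test and its field names are hypothetical, chosen only to echo the sensitive-data handling mentioned earlier:

```python
def mask_ssn(record):
    """Mask all but the last four digits of an SSN field, if present.

    Illustrative transformation; the field name "ssn" is an assumption.
    """
    if record.get("ssn"):
        record = {**record, "ssn": "***-**-" + record["ssn"][-4:]}
    return record

def test_mask_ssn_masks_all_but_last_four():
    assert mask_ssn({"ssn": "123-45-6789"})["ssn"] == "***-**-6789"

def test_mask_ssn_leaves_records_without_ssn_untouched():
    assert mask_ssn({"id": 1}) == {"id": 1}

# In practice a test runner such as pytest discovers and runs these;
# here we call them directly so the sketch is self-contained.
test_mask_ssn_masks_all_but_last_four()
test_mask_ssn_leaves_records_without_ssn_untouched()
```

Tests like these are what make refactoring safe: change the masking logic tomorrow, and the suite immediately tells you whether existing behavior survived.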
A key theme of EXtreme Data Engineering is having confidence in the code you are deploying, and CI/CD delivers exactly that. A good CI/CD pipeline allows us to quickly, smoothly, and automatically deploy our code to any environment (including production!). Knowing that the code has been validated, through automated and manual tests, in a lower prod-like environment gives us the confidence to deploy to production.
The advantage of CI/CD is that we can confidently push a tested, validated deployable unit to production knowing it has everything it needs to work (in combination with external configuration, which is kept separate).
Being Data Engineers means we worry about all the technicalities of providing data for the consumers so they can concentrate on what they do best: analyzing and drawing valuable insights. Being EXtreme Engineers means we take it a step further by building dependable pipelines, where each enhancement is meticulously tested and efficiently deployed, so insights arrive faster and with greater confidence.