Rethinking Our Data Engineering Process
When you’re starting a new team, you’re often faced with a crucial dilemma: Do you stick with your existing way of working to get up and running quickly, promising yourself to do the refactoring later? Or do you take the time to rethink your approach from the ground up?
We encountered this dilemma in April 2023 when we launched a new data science team focused on forecasting within bol’s capacity steering product team. Within the team, we often joked that “there’s nothing as permanent as a temporary solution,” because rushed implementations often lead to long-term headaches. These quick fixes tend to become permanent: fixing them later requires significant effort, and there are always more immediate issues demanding attention. This time, we were determined to do things properly from the start.
Recognising the potential pitfalls of sticking to our established way of working, we decided to rethink our approach. Initially we saw an opportunity to leverage our existing technology stack. However, it quickly became clear that our processes, architecture, and overall approach needed an overhaul.
To navigate this transition effectively, we laid a strong groundwork before diving into immediate solutions. Rather than chasing quick wins, we focused on data engineering practices that could sustainably support our data science team’s long-term goals and let us ramp up effectively. Addressing the underlying issues first gave us a more resilient and scalable infrastructure, and shifting our attention from rapid implementation to a solid foundation meant we could better leverage our technology stack and optimise our processes for the future.
We followed the mantra “fast is slow, slow is fast”: rushing into solutions without addressing underlying issues can hinder long-term progress. So we prioritised building a solid foundation for our data engineering practices, benefiting our data science workflows.
Our Journey: Rethinking and Restructuring
In the following sections, I’m going to take you along our journey of rethinking and restructuring our data engineering processes. We’ll explore how we:
- Leveraged Apache Airflow to orchestrate and manage our data workflows, simplifying complex processes and ensuring smooth operations.
- Learned from past experiences to identify and eliminate inefficiencies and redundancies that were holding us back.
- Adopted a layered approach to data engineering, which streamlined our operations and significantly enhanced our ability to iterate quickly.
- Embraced monotasking in our workflows, improving clarity, maintainability, and reusability of our processes.
- Aligned our code structure with our data structure, creating a more cohesive and efficient system that mirrored the way our data flows.
By the end of this journey, you’ll see how our commitment to doing things the right way from the start has set us up for long-term success. Whether you’re facing similar challenges or looking to refine your own data engineering practices, I hope our experiences and insights will provide valuable lessons and inspiration.
Go with the flow
We rely heavily on Apache Airflow for job orchestration. In Airflow, workflows are represented as Directed Acyclic Graphs (DAGs), with steps progressing in one direction. When explaining Airflow to non-technical stakeholders, we often use the analogy of cooking recipes.
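The recipe analogy maps naturally onto the DAG idea: each step depends on earlier steps, and a valid execution order always moves in one direction. As a minimal sketch (not actual Airflow code; the step names are illustrative), the same dependency structure can be expressed and ordered in plain Python:

```python
from graphlib import TopologicalSorter

# A hypothetical cooking recipe as a DAG: each step maps to the set of
# steps it depends on, just as an Airflow task depends on upstream tasks.
recipe = {
    "chop_vegetables": set(),
    "boil_water": set(),
    "cook_pasta": {"boil_water"},
    "make_sauce": {"chop_vegetables"},
    "combine": {"cook_pasta", "make_sauce"},
    "serve": {"combine"},
}

# static_order() yields the steps in an order that respects every
# dependency, analogous to how Airflow only schedules a task once all
# of its upstream tasks have finished.
order = list(TopologicalSorter(recipe).static_order())
print(order)
```

In Airflow itself, the same dependencies would be declared between operators (for example with the `>>` syntax), and the scheduler, rather than a topological sort in your own code, decides when each task runs.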