Photo by Murilo Viviani on Unsplash
Applying aspects of the software development lifecycle to data science, engineering, and analytics is very on trend right now, and that’s a good thing. Whether you’re talking about treating data transformation as code, adopting DataOps and Agile Data Governance practices, thinking about data-as-a-product, or contemplating a data mesh architecture (essentially applying microservice fundamentals to the data and analytics stack), the world is finally coming around to viewing data and analytics as a team sport. But if you want to win this game, you need to find ways for players to interact and collaborate, capture knowledge, and make it easier for more people to play.
There’s a mad rush to steal from the successes of the software development community, especially when it comes to facilitating better collaboration and teamwork around data science and analytics. Great artists steal and, as the guy who writes and talks a lot about Agile Data Governance, I do it too!
But not everyone is on board with the idea that we’d be saved if we all just did what the software engineers and product managers do. People point out very valid issues around data privacy, bias, validity and other problems that make the output of analytics and data science much more sensitive than that of a typical agile software project. It’s also clear that work involving data and analytics often culminates in critical organizational decisions that involve a much broader set of stakeholders than a typical software project does.
On the pro side of the argument, agile practices that are inclusive, iterative and truly involve stakeholders throughout the project have the potential to bring data producers (data engineers and stewards) and consumers (data scientists and analysts) together in really meaningful ways. This can contribute to strong data-driven cultures and vastly improved data literacy.
In my day job, we build a data catalog platform that is used by enterprises like Prologis, the world’s biggest logistics real estate company. By developing its analytics and data governance practice with a use-case driven agile approach, Prologis has been able to deploy it in a way that has been truly transformative.
“Historically, our organization had a top-down, ‘boil the ocean’ approach to data governance, which wasn’t working for us. We knew we had to completely shift our approach by focusing on where our users were experiencing the most pain and aggravation. We then allowed that real-world data to drive how our data governance program expanded. In short, focusing on deploying our data catalog in an iterative way by prioritizing data assets based on business use-case has been key to reigniting employee interest in our governance program.”
— Luke Slotwinski, VP of IT Data & Analytics, Prologis
Taking on a data cataloging initiative may not be in the cards for you yet, but what if I told you that there are two simple ideas from agile that you can implement today that will vastly improve both the quality of your data and analytics work and start promoting shared understanding and data literacy? The two ideas are Peer Review and the Definition of Done.
In software development, the idea of the Code (or Peer) Review has been around far longer than the famous GitHub “Pull Request” feature that allows engineers to comment on each other’s work before it gets published. I first read about the technique in Steve McConnell’s Code Complete, which was first published in 1993! There’s simple power and beauty in having a teammate double-check your work and make suggestions on things you might’ve missed. For particularly complicated pieces of code, the two engineers might walk through it line by line, with the author explaining how it works to the reviewer in real time. This has the added benefit of securing an important knowledge transfer: it ensures that more than one person on the team knows how that code works and avoids the dreaded “Bus Factor 1”.
Adding this style of review to your process is extremely impactful. Another benefit (especially since we’re almost all living in a remote world right now) is that the commentary captured during these reviews can massively improve your data and analytics documentation with nearly no additional effort. It’s important to note, though, that the review should consist of a peer data scientist or analyst going through how the analysis was done, either line by line in code or by stepping through how the visualizations were created. This is not stakeholder or end-user acceptance testing. (In software, that would be like having a user click through the prototype and calling it done.)
This point can’t be overemphasized: a peer reviewer will find errors in your technique or offer suggestions that a business stakeholder simply won’t (especially if the analysis happens to confirm a previously held belief or bias). Adding this step to the process will improve cross-department collaboration and overall data literacy, and it will have a tremendous impact on the overall quality of the analyses and data models delivered. I recommend that you Google “good code review practices” for a huge array of articles on how to do this well, and think about how to adapt them to the data and analytics practices within your company.
Definition of Done
Something that can really make a difference in peer review sessions is having a great “Definition of Done”. The concept comes from the Scrum methodology in the software development world, where the idea is that you should have a simple yet unambiguous checklist to go through when a User Story is completed to make sure that, as an engineer (and as the reviewer), you’ve hit the mark. Having a simple checklist that all data scientists, analysts and engineers in your organization agree to use to sign off on work can significantly raise the bar for the quality and consistency of your data work.
When it comes to analytics and data engineering work, however, you may need separate Definitions of Done for your analytics work and your data modeling work. We’ve used a couple of definitions that may be good starting points for your organization as well:
Definition of Done for Data Science and Analytics
- Hypothesis well stated
- Method reviewed for bias
- Data sources are well defined
- New data sources/modeling submitted to data engineering
- Method has been read and can be reproduced & used in a peer’s environment
- Axes, features, domains are well defined
- Conclusions and action steps are documented
- Reviewed for compliance
- Open for comment or reuse
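As a minimal sketch of how a checklist like the one above could be made explicit in a review workflow, the Python below records which items a peer has ticked off and only reports the work as done when every item is checked by someone other than the author. The `PeerReview` class, its fields, and the names "ana" and "ben" are hypothetical and purely illustrative; only the checklist items come from the list above.

```python
from dataclasses import dataclass, field

# Checklist items copied from the Definition of Done above.
ANALYTICS_DOD = [
    "Hypothesis well stated",
    "Method reviewed for bias",
    "Data sources are well defined",
    "New data sources/modeling submitted to data engineering",
    "Method has been read and can be reproduced & used in a peer's environment",
    "Axes, features, domains are well defined",
    "Conclusions and action steps are documented",
    "Reviewed for compliance",
    "Open for comment or reuse",
]


@dataclass
class PeerReview:
    """Hypothetical record of one peer review against a Definition of Done."""

    author: str
    reviewer: str
    checklist: list
    checked: set = field(default_factory=set)

    def check(self, item: str) -> None:
        """Tick off one checklist item; reject anything not on the list."""
        if item not in self.checklist:
            raise ValueError(f"Not on the checklist: {item!r}")
        self.checked.add(item)

    def outstanding(self) -> list:
        """Return the items that have not been ticked off yet, in order."""
        return [item for item in self.checklist if item not in self.checked]

    def is_done(self) -> bool:
        """Done only when a peer (not the author) has ticked every item."""
        return self.reviewer != self.author and not self.outstanding()


# Usage: one item checked, so the work is not done yet.
review = PeerReview(author="ana", reviewer="ben", checklist=ANALYTICS_DOD)
review.check("Hypothesis well stated")
print(review.is_done())
print(review.outstanding())
```

Requiring the reviewer to differ from the author is one way to encode the point that this is a peer review rather than a self-check; a team could just as reasonably enforce that rule in its project management tool instead.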
Definition of Done for Data Modeling
- Data dictionary written
- Existing business terms and dependent metrics linked
- New business terms submitted
- Transformation code reviewed
- Adherence to data architectural style (Kimball, Data Vault, etc.)
- Adherence to Don’t-Repeat-Yourself principles
- Tests written and data profile is well documented/understood
- Reviewed for compliance
- Open for comment or reuse
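To make the “tests written and data profile” item above more concrete, here is a rough sketch in plain Python of what a very small data profile and two basic tests might look like. The function names, the sample rows, and the `order_id` and `region` columns are all hypothetical; in practice a team would more likely rely on the testing features of its modeling tool than on hand-written helpers like these.

```python
from collections import Counter


def profile_column(rows, column):
    """Return a small profile of one column: row, null, and distinct counts."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "most_common": Counter(non_null).most_common(1),
    }


def check_not_null(rows, column):
    """Test passes only if no row has a missing value in the column."""
    return all(row.get(column) is not None for row in rows)


def check_unique(rows, column):
    """Test passes only if the non-null values in the column never repeat."""
    values = [row.get(column) for row in rows if row.get(column) is not None]
    return len(values) == len(set(values))


# Hypothetical sample rows, standing in for a modeled table.
rows = [
    {"order_id": 1, "region": "west"},
    {"order_id": 2, "region": "east"},
    {"order_id": 3, "region": None},
]

print(profile_column(rows, "region"))
print(check_unique(rows, "order_id"))
print(check_not_null(rows, "region"))
```

A reviewer working through the checklist could ask to see output like this alongside the transformation code, so that “well documented/understood” rests on numbers both people have looked at.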
As we adopted dbt for modeling our internal data and analytics at data.world, having these definitions of done (particularly the data modeling one) was instrumental in establishing our baseline for completing analytics engineering work. It ensured that even as we learned a new tool, we were consistent in what we were looking for as we reviewed each other’s work. This easily doubled our efficiency in getting new data models into our data warehouse as we migrated to Snowflake.
Adopting the practice
Whenever you adopt a new practice like doing peer reviews or checking work against a Definition of Done, it’s important to incorporate it into the tools you’re already using to manage your projects. You don’t need any fancy new data or analytics infrastructure to do this. Often, just adapting software process tools like JIRA to your data and analytics practice is more than sufficient. If you’re organizing your analytics work around user stories, adding a “Peer Review” step in a tool like JIRA is incredibly easy. JIRA’s agile boards are simple to use and already incorporate a Review step:
Image by author
If you’ve actually got a data catalog, adding an “In Review” flag to your data assets is also extremely helpful and can be synced to your project management tools. These flags can be used to assign reviewers or to let consumers know if the asset is ready to use yet.
Image by author
Winning in Data and Analytics with Agile
At the beginning of this post, I gave a nod to my history of stealing from agile software development principles and applying them to data and analytics work. And you know what? I will continue to steal from agile because I believe in creating collaborative, data-driven cultures where transparency and trust in data and analytics reign. Peer review and following a well-understood Definition of Done for data and analytics work are extremely valuable and simple ideas you can adopt today. No two changes are easier to implement or will have a greater positive impact on data quality and shared understanding within your organization.
I’d love to hear from anyone who’s adopted these or other Agile approaches in their data and analytics process or governance workflows!
Bio: Jon Loyens is the CPO and Co-Founder of data.world, the data catalog for the modern data stack. He’s a passionate advocate for building data-driven cultures rooted in openness and transparency. In a past life, Jon was the VP of engineering for Traveler Products at HomeAway and VP of engineering at Bazaarvoice. As a long-time technology executive in Austin, he has lived the rise in the trend of data and analytics as a democratizing force. Jon brought A/B testing and data-driven product management to Bazaarvoice and massively expanded the data programs at HomeAway.
Original. Reposted with permission.