1 Working without a Workflow

Not too long ago, when I was starting my career as a data scientist, I did not really have a workflow. Freshly graduated from a master’s in applied statistics, I entered the arena of Dutch business, employed by a small consulting firm. Neither the company I was with, nor the clients I was working for, nor I myself had an understanding of what it meant to implement a statistical model or a machine learning method in the real world. Everybody was of course interested in this “Big Data” thing, so it did not take long before I could start at clients, often without a clear idea of what I was going to do. When we finally arrived at something that looked like a project, I plunged into it. Eager to deliver results quickly, I loaded data extracts into R and started to apply all kinds of models and algorithms to them. Findings ended up in the code comments of the R scripts, scripts that often grew to hundreds or even thousands of lines. To keep some kind of overview I numbered the scripts serially, but that was about all the system I had. Soon I found myself amidst dozens of scripts and data exports of intermediate results that were no longer reproducible. The R session I kept running indefinitely was sometimes closed by mistake, or it crashed (which was bound to happen as memory usage grew). I spent hours or even days recreating the results when this happened. Deadlines were a nightmare: everything I had done so far had to be loaded, joined and compared at the last moment. More often than not, the model results appeared to differ from what my earlier notes indicated, and I had no idea whether I had been mistaken earlier, was using the wrong data now, or some other source of error had been introduced without my knowledge. Looking back at those times, I had no clue about the importance of a good workflow for doing larger data science projects. I was saved several times when the plug was pulled from a project for unrelated reasons. Had I been expected to bring the results to production, it would certainly have been a painful unmasking.

I have learned a great deal since those days, both from other people’s insights and from my own experiences. Writing an R package that was shipped to CRAN compelled me to understand the basics of software engineering. Not being able to reproduce crucial results forced me to start thinking about end-to-end research and model building, controlling all the steps along the way. Last year, for the first time, I joined a Scrum team (frontend, backend, UX designer, product owner, second data scientist) to create a machine learning model that we brought to production using Agile principles. It was an inspiring experience from which I learned a lot. My colleagues patiently explained the principles of Agile software development, and together we discovered what did and did not work for machine learning.

1.1 What this Text is About

All these experiences culminated in the workflow we now adhere to at work, which I think is worth sharing. It is heavily based on the principles of Agile software development, hence the title. We have explored which of the concepts from Agile did and did not work for data science, and we gained hands-on experience working from these principles in an R project that actually made it to production. This text is split into two parts. In the first we will look into the Agile philosophy and some of the methodologies that are closely related to it (chapters 2 and 3). Both will then be related to the machine learning context: we will see what we can take from the philosophy (chapter 4) and what an Agile machine learning workflow might look like (chapter 5). The second part is hands-on. We will explore how we can leverage the R software system to implement Agile machine learning.

1.2 Writing out Loud

Machine learning projects can differ greatly from each other. There are so many variables that make each project unique, there are many situations I have not experienced, and there are necessarily many aspects of machine learning I am not aware of. In this text I relate my own experiences to the theory and best practices of Agile software development, to come up with a general workflow. This means that if I were to write this text in isolation, I would be “overfitting” the workflow on the dozen or so machine learning projects I have done. That is why I need your help. I hope many of you will read the first drafts of this book critically and relate the content to your own experiences. If you find that parts are incomplete or plainly incorrect, please file an issue. Also, anyone who successfully completes machine learning projects must have developed an effective workflow for themselves, even if it is not grounded in a widespread theory such as Agile. I am very interested in the best practices you have developed, even when they don’t fit directly into the framework. File an issue with what you would like to add; if we can’t fit it into the text we can always add it as an appendix or a discussion. This text is meant to be a living document, with the objective of documenting a workflow that yields reproducible results and quality code. The more people share their best practices, the closer we get to this objective. Please follow along on this journey and get involved! Finally, I am not a native English speaker, so typo fixes and style improvements are greatly appreciated.

1.3 Intended Audience

The title of this text has four components: Agile, machine learning, R, and workflow. If you are interested in all four, you are obviously in the right place. This text is not for you if you hope to learn about the different algorithms and statistical techniques used to do machine learning; more knowledgeable people have written many books and articles on those topics. The workflow I present is completely separate from the algorithms you choose, as it focuses on code organisation and delivery. If you use Python rather than R, you will still find this text valuable, especially the first part, which focuses on workflow only and is tool agnostic. Finally, this text is intended for everybody building a data science product in R. Whether you are building a Shiny app or a complex statistical model, this text should be valuable for you as well. The iterative nature of Agile will make your process more effective and your stakeholders more involved. I contemplated calling it Agile Data Science… instead of Agile Machine Learning… to broaden the scope of the text, but frankly I have only built simple, internal-use Shiny dashboards and not too complex statistical models. Therefore, I will stick to what I know well and only give examples of machine learning. I hope you will have little trouble translating the concepts to your own situation if it is something other than machine learning. You are most welcome to make suggestions if you think the text would benefit from expanding to other examples as well.