Why Data Scientists Should Write Tests
The Jupyter Habitat
For a data scientist, Jupyter notebooks are a natural habitat. They’re usually the first step in any project, and the arena where thousands of lines of code are written, erased and forgotten forever in dumping grounds. Data science is exploratory by nature, so rapid iteration is essential for building projects, which is exactly why notebooks are so popular. Constant experimentation and change are part of the job.
Yet as time passes and the project grows in scope, you find yourself moving away from the notebook approach. There are many reasons why you’d switch from a notebook to a traditional repo.
Jupyter No Longer
The Project Grows in Scope
There is simply more stuff to do, more knobs and more switches. At first, you think a utils.py file will take care of it all. But utils.py quickly swells in size, resembling a Lovecraftian horror of a thousand functions, each doing some distinct operation.
It is then that you split the code into modules, delegating responsibilities and separating operations. And the crisis is averted. For now.
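For instance, a layout along these lines (the module names here are purely illustrative) gives each concern its own home:

src/
    data_loading.py     # reading raw data
    preprocessing.py    # cleaning and feature engineering
    transforms.py       # dataframe transformations
    modelling.py        # training and evaluation
tests/
    test_transforms.py  # unit tests, more on these below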
Collaboration
Notebooks are great for a single contributor to keep tinkering, but they are not the best approach when working with a team. A modular repository helps, since each person can focus on their own part of the codebase. Another reason to move away from notebooks.
Deployment
While newer tools exist to support notebooks in production (nbdev etc.), deployment to production generally requires all of your code to live in .py files. This is another reason to move to modules.
The Crisis of Rapid Iteration
Slowing Down
Except, now that the pivot is complete, you find yourself more frustrated than before. When writing a notebook, we usually ignore what are called “best practices”, since they slow us down. We keep ignoring those practices even as the work grows in scope, and problems start to pile up.
Notebooks are a high-velocity medium, which is their biggest strength. You can quickly put together a bunch of code and get it running. You can quickly erase lines or add new ones to make changes. It is a nimble, flexible craft.
When you are no longer working in a single notebook, and instead have a forest of modules feeding into your cursor, that velocity takes a hit. You find that making a single change is far more expensive than it used to be. Things that were once easy to change now break in unexpected places. To add some functionality, you end up thinking about all the modules that could break.
This slows down the pace of development, and you end up caught in the web of your own local imports, scratching your head through all the scripts.
It’s a Mess
And there’s another important reason why the pace is down: your code is too convoluted. The classes you defined have low cohesion, i.e. each of them does all sorts of different things. The modules you’ve defined are highly dependent on each other, i.e. high coupling. So when you change one module, some other module ends up breaking because of those dependencies.
Managing a project’s architecture well is an art learned through experience. If only there were simple disciplines you could follow to write better code. It turns out writing tests is one of them.
What are Tests?
A few months ago, despite being four years into my professional journey, I had little idea about tests. They seemed like a software engineering concept with no obvious application in a data science project. So if you’re like me, here’s a little primer about tests.
If, after making a change to the codebase, you run the entire workflow to make sure it still works (or see what breaks), you are already testing, just less efficiently. Running the entire workflow can be time-consuming, and you don’t want to run everything when you’ve only changed one section of the codebase. Plus, debugging can be tough. Tests are basically functions that you set up to check whether other modules are doing their job correctly.
Here’s an example of a function in the transforms.py file that does something:

import pandas as pd


def transform_data(df: pd.DataFrame) -> pd.DataFrame:
    # run some transformation on the dataframe
    return df

To create a test for this function (and others), create a file called test_transforms.py. The convention is to put all your tests inside a tests/ folder at the top of the repo. Inside this file you can write:
import pandas as pd

from src.transforms import transform_data


def get_mock_data():
    # create a small, hand-built mock dataframe
    df = pd.DataFrame({})
    return df


def test_transform_data():
    # the testing function's name should start with 'test_'
    df = get_mock_data()
    transformed_df = transform_data(df)
    # assert that the transform has been completed correctly;
    # you can check any aspect of the transformed dataframe.
    # assume here that the transformation doubles the number of columns.
    assert len(transformed_df.columns) == 2 * len(df.columns)
This test is completely isolated from the rest of the repo, so its failure can have only one cause: the function itself.
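As an aside, pytest also provides fixtures for sharing mock data between tests. A minimal sketch, assuming a couple of made-up columns and an illustrative row-count check, could look like this:

import pandas as pd
import pytest

from src.transforms import transform_data


@pytest.fixture
def mock_df():
    # a tiny, hand-built dataframe with hypothetical columns
    return pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})


def test_transform_data_keeps_rows(mock_df):
    # pytest injects the fixture through the matching argument name
    transformed_df = transform_data(mock_df)
    assert len(transformed_df) == len(mock_df)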
Now you can run this test in one of three ways:
pytest tests/
Runs all the tests in the folder.
pytest tests/test_transforms.py
Runs only the tests on transforms.
pytest tests/test_transforms.py::test_transform_data
Runs only this particular function.
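pytest can also select tests by keyword, which becomes handy once the tests/ folder grows:
pytest -k transform
Runs every test whose name matches “transform”.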
Why Write Tests?
Write Tests to Improve Speed
The single biggest reason why data scientists need to write tests is that they are monumentally helpful in increasing the speed of experiments. Fast iteration matters for a data scientist because we are constantly running experiments: trying different ways to improve our models, trying a different architecture, tinkering with the pre-processing, and so on.
When you write unit tests, making changes becomes easier, because you now have an easy way to validate whether your changes break the code. Moreover, because you write tests for each unit, you can identify exactly which part of the codebase is affected by a change. With certain parts of the codebase anchored in your mind as unbreakable, you are free to tinker with the other units without anxiety.
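As a sketch of what that looks like in practice, imagine a second, hypothetical unit with its own test file (fill_missing is an assumed helper, not something from the repo above). If an experiment changes the preprocessing, it is this test, and only this test, that should start failing:

import pandas as pd

from src.preprocessing import fill_missing  # hypothetical unit, for illustration only


def test_fill_missing_leaves_no_nans():
    df = pd.DataFrame({"x": [1.0, None, 3.0]})
    cleaned = fill_missing(df)
    # a change to fill_missing shows up here, not in the transform tests
    assert cleaned["x"].isna().sum() == 0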
Write Tests for Better Code
The second greatest reason is that writing tests helps you write better code. Generally you shouldn’t write a function that does more than one thing, but while developing, in the flow, you can end up writing such cronenberg functions. When you write unit tests, you find that you need several distinct tests for the same function, which is an indicator that it can be decomposed.
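For example, a single function that both drops missing values and scales the result (the names and behaviour here are hypothetical) forces you to test two unrelated behaviours at once; splitting it gives each piece its own small test:

import pandas as pd


def clean_and_scale(df: pd.DataFrame) -> pd.DataFrame:
    # the cronenberg version: two responsibilities stitched together
    df = df.dropna()
    return (df - df.mean()) / df.std()


# decomposed: each function does one thing and can be tested on its own
def drop_missing(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna()


def standardise(df: pd.DataFrame) -> pd.DataFrame:
    return (df - df.mean()) / df.std()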
Similarly, writing tests helps your code become orthogonal. You learn to write each unit so that it depends less on the other units, which leads to modular, understandable code that is also very iteration-friendly.
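One hypothetical illustration: instead of a unit reaching into another module for its settings, it can take what it needs as arguments. The two modules stay decoupled, and the test can supply its own values (scale_columns and its parameters are made up for this sketch):

import pandas as pd


def scale_columns(df: pd.DataFrame, columns: list[str], factor: float) -> pd.DataFrame:
    # dependencies are passed in explicitly rather than imported from a config module
    out = df.copy()
    out[columns] = out[columns] * factor
    return out


def test_scale_columns():
    df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
    out = scale_columns(df, columns=["a"], factor=2.0)
    assert out["a"].tolist() == [2.0, 4.0]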
Write Tests for Collaboration
Working in a team requires some vigilance. How do you know that a change someone else made isn’t breaking a routine specific to your task, or vice versa? You simply write a test for your task, so your teammates can verify it themselves, shortening the feedback loop. Otherwise you waste time in to-and-fro clarifications.
If you hand over your repo to someone else, like a new team or the deployment team, the tests help them onboard quickly too.
Write Tests Because they’re Easy and Cool
In Python, it’s terribly easy to write tests thanks to the pytest library. There’s barely any new syntax to learn; it’s as easy as writing a function. Writing tests is cool, that’s a fact. It also feels great when you run pytest on your codebase and all of your cases pass in a flurry of green dots.
This, in a nutshell, is why data scientists should not ignore the power of tests.