Python logo from the official python website.

Python best practices even data scientists should know

How to make your code as clean and readable as possible

Yan Gobeil
Published in Geek Culture · 8 min read · May 27, 2022


During my almost 3 years as a data scientist at Décathlon Canada, my coding skills have improved a lot. The improvement has been very gradual, however, so it can be hard for me to assess how much I have learned so far.

One moment when I realize how much I have learned is when I review code written by new interns or team members who are not used to writing Python code in industry. There are simple best practices that are almost always unknown to newcomers, and these are what I want to share in this article. They are certainly useful when you write training code in a Jupyter notebook, but even more so when you write scripts to include your models in full projects.

Docstrings

The first comment I have for people is in general about the documentation of their code. It is very important to describe what each of your functions is doing, what arguments it expects and what it returns. This can be done two ways, the first of which is docstrings.

A docstring is a string that you use to describe your function. It is helpful for anyone who reads your code, but also for people who use your code, since docstrings can be displayed easily by IDEs or Jupyter notebooks.

Example of docstring displayed in a colab notebook

Here is an example of a docstring for a very simple function.
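
As a minimal sketch of what such a docstring can look like, the function below uses the Google docstring style. The function name add_numbers and the choice of style are assumptions made for illustration; other styles carry the same information.

```python
def add_numbers(a, b):
    """Add two numbers together.

    Args:
        a: First number to add.
        b: Second number to add.

    Returns:
        The sum of a and b.
    """
    return a + b


# The docstring is stored on the function itself, which is how IDEs,
# notebooks and help() are able to display it.
print(add_numbers.__doc__)
```

Calling help(add_numbers) in a notebook or a Python shell shows the same text.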

You see that the string contains all the elements that were mentioned above and other people don’t have to dive into the code to know how to use the function. There are many styles that can be used but all of them include the same required information.

Docstrings also become very useful when you want to write full documentation for your codebase. Packages like Sphinx can generate documentation automatically from the docstrings, which can save you a lot of time.

Type hinting

A second way to document your functions is with type hinting, which should in general be combined with docstrings for optimal clarity. The idea here is to include the type of each argument when you define a function, as well as the type of the return value. Here is an example for our earlier function.
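
A minimal sketch of a simple function with type hints added. The name add_numbers is hypothetical and chosen only for illustration.

```python
def add_numbers(a: float, b: float) -> float:
    """Add two numbers together.

    Args:
        a: First number to add.
        b: Second number to add.

    Returns:
        The sum of a and b.
    """
    return a + b


print(add_numbers(1.5, 2.0))
```

The hints sit in the signature: each argument is followed by a colon and its type, and the arrow before the final colon gives the return type.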

There are various basic types, like float, int, str and bool, but it's also possible to use type hinting with more complicated types by importing them from the typing module. Specifying when your function doesn't return anything is good practice as well. Here is a more complicated example that illustrates most of the ideas of type hinting.
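
The sketch below shows what a richer set of hints might look like. The functions count_labels and print_report are hypothetical. List, Dict and Optional come from the standard typing module, and None marks a function that returns nothing.

```python
from typing import Dict, List, Optional


def count_labels(labels: List[str], ignore: Optional[str] = None) -> Dict[str, int]:
    """Count how many times each label appears, optionally skipping one label."""
    counts: Dict[str, int] = {}
    for label in labels:
        if label == ignore:
            continue
        counts[label] = counts.get(label, 0) + 1
    return counts


def print_report(counts: Dict[str, int]) -> None:
    """Print one line per label. Nothing is returned, hence the None hint."""
    for label, count in counts.items():
        print(f"{label}: {count}")


print_report(count_labels(["run", "bike", "run"], ignore="bike"))
```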

In addition to being useful for people who read your code, type hints are used by many IDEs (like PyCharm) to help you avoid type errors. Note however that your code will run fine even if you don't respect your type hints perfectly. Python is not statically typed like C, for example, so the interpreter does not enforce the hints and you only get warning messages from your IDE or from a type checker.
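
A small sketch of this behaviour, using a hypothetical add_numbers function: the call below ignores the hints, yet Python runs it without complaint.

```python
def add_numbers(a: int, b: int) -> int:
    """Add two integers, according to the hints."""
    return a + b


# The hints ask for integers, but nothing checks them at runtime:
# the + operator simply concatenates the two strings.
result = add_numbers("data", "science")
print(result)  # datascience
```

An IDE or a type checker would flag this call, which is exactly the kind of warning the hints are there to produce.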

Example of typing errors reported by pycharm

You can get more examples of type hinting here.

README

Whenever you open a pull request on GitHub to push your code to production, it should include an update of the README that explains the new features you added. Don't write too much, because that discourages people from reading it, but make sure that everything needed to use your code is explained. Make sure that the general idea of your code is clear as well. The best way to check that your README is OK is to imagine yourself as a newcomer to the project: what information would you need to start working with the code as quickly as possible?

On the flip side, when you start working on a new project, read the README. It is called that for a reason. People have come to me so many times with questions that were already answered in a README that I had written months before.

PEP8

This is the preferred style convention for Python code. It contains many rules, which can be found here. They are there to ensure that all Python code has the same style and is visually similar, no matter where it comes from.

It's best to keep the rules in mind while coding, but there are tools that can help you clean up your code as well. IDEs like PyCharm can tell you where you should change something. Tools like autopep8 can also format your code automatically, but be careful because they can sometimes reformat your code too much and make it less readable.

Example where pycharm reports a PEP8 error

Next are a few of my favorite rules to follow.

Imports

This rule seems a bit annoying and not very useful at first, but once you get used to it you realise how helpful it is. The idea is to follow a simple organization principle for your imports. Group them in 3 blocks:

  1. Standard packages that come with python
  2. Third party packages, that you install using pip
  3. Local files and modules

Then in each block you order the imports alphabetically. This way of writing your imports is useful for a few reasons, on top of the fact that it standardizes all your code. The third party section is where you will get the information needed to write your requirements.txt file, mentioned below. The local imports section helps you see the interdependence between your different files. Here is an example of what it should look like.

To know whether a package is part of the Python standard library, you can just google its documentation. If you find the information on docs.python.org, it is part of the standard library. If the documentation is hosted somewhere else, or if you can find the package on pypi.org, it is an external package.
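
The grouping rule and the standard-library check can be sketched together in a few lines. This is only an illustration under stated assumptions: build_import_block is a hypothetical helper, preprocessing stands in for a local module, and sys.stdlib_module_names is only available from Python 3.10 onward.

```python
import sys

# sys.stdlib_module_names is available on Python 3.10+. The fallback is a
# tiny hand-written subset, only there so the sketch runs on older versions.
STDLIB_MODULES = getattr(sys, "stdlib_module_names", frozenset({"json", "os", "sys"}))


def build_import_block(modules, local_modules):
    """Return import lines in three blocks: standard library, third party, local."""
    local = sorted(name for name in modules if name in local_modules)
    stdlib = sorted(
        name for name in modules
        if name in STDLIB_MODULES and name not in local_modules
    )
    third_party = sorted(
        name for name in modules
        if name not in STDLIB_MODULES and name not in local_modules
    )
    blocks = [stdlib, third_party, local]
    # One import per line, blocks separated by a blank line.
    return "\n\n".join(
        "\n".join(f"import {name}" for name in block) for block in blocks if block
    )


# "preprocessing" stands in for a hypothetical local module of the project.
print(build_import_block(
    ["pandas", "os", "preprocessing", "numpy", "json"],
    local_modules={"preprocessing"},
))
```

The printed result has json and os in the first block, numpy and pandas in the second, and preprocessing in the third, each block sorted alphabetically and separated by a blank line.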

Spaces

There are many rules regarding spaces that help make scripts more uniform. The idea is not that one method is better than the other. It’s just important to have a single rule that everyone follows for uniformity.

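  • Always put a space after commas and colons, never before them (for colons this applies to dictionaries and type hints, not to slices).
  • Always surround assignment, arithmetic and comparison operators like =, - and > with single spaces, except for = in keyword arguments and unannotated default values of a function.
  • Don't put spaces immediately inside brackets like (), {} and [].
  • Separate top-level functions and classes with two blank lines.
  • Avoid any trailing spaces and empty lines with spaces.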

Here is an example that shows many of the rules.
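
A short sketch that follows the spacing rules listed above. The function names and prices are made up for illustration.

```python
PRICES = {"shoes": 80.0, "ball": 25.0}  # one space after ":" and ","


def apply_discount(price, rate=0.1):  # no spaces around "=" in a default value
    discounted = price - price * rate  # spaces around "=", "-" and "*"
    return round(discounted, 2)


def total_price(items):  # two blank lines between top-level functions
    total = sum(PRICES[item] for item in items)  # no spaces just inside brackets
    if total > 100:
        total = apply_discount(total, rate=0.2)  # keyword argument: no spaces
    return total


print(total_price(["shoes", "ball"]))
```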

Naming conventions

Python functions and variables use snake case, which is of the form my_function. There should not be any capital letters in these names and they cannot start with a number. Global constants however should be written in full caps, like MAX_SIZE, to distinguish them from other variables. The only exception to these conventions is class names, which should use the CapWords form (also called Pascal case), like MyClass.
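
A small sketch of the PEP8 naming conventions side by side. All names are hypothetical. Note that PEP8 asks for the CapWords form for class names, with a capital first letter, as in ImageClassifier.

```python
MAX_RETRIES = 3  # module-level constant: full caps


class ImageClassifier:  # class name: CapWords
    def __init__(self, model_name):
        self.model_name = model_name  # attribute: snake case

    def describe(self):  # method: snake case
        return f"{self.model_name} (max retries: {MAX_RETRIES})"


def load_classifier(model_name):  # function: snake case
    image_classifier = ImageClassifier(model_name)  # local variable: snake case
    return image_classifier


print(load_classifier("sports_images").describe())
```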

Requirements

The first important recommendation in this section is to always use a separate virtual environment when you start a new project. This can be managed with venv , conda or any other method that you want. The idea here is to start with an empty environment for your project so you can install only the packages that you need. This avoids conflicts between different versions of a package and helps you keep track of what version was used.

In addition to virtual environments you should always include a requirements.txt file in your project. This is where you record the Python packages that must be installed to run the code, along with the versions used. If you use a virtual environment and organise your imports correctly, this list should be simple to make. Here is an example of such a file, where the list is ordered alphabetically for convenience.

There are multiple ways to specify the version that you want for a package. You can ask for the exact version with ==, but keep in mind that this could cause problems when another package tries to install a different version. For example tensorflow requires some version of numpy, which could be different from the one you require. To avoid this you should use ~= to allow for a bit of variation in the package version: ~=1.22.4, for instance, accepts 1.22.9 but not 1.23.0. Finally, you can specify a minimum version with >= (or a maximum version with <=) if a breaking change was introduced between versions of the package.
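
To make the version specifiers concrete, here is a small sketch. The requirements content, including the package list and version numbers, is an illustrative assumption. The satisfies function is a simplified stand-in for what pip does and only handles plain numeric versions with the same number of parts.

```python
# Illustrative requirements.txt content, ordered alphabetically.
# The packages and version numbers are assumptions made for this sketch.
EXAMPLE_REQUIREMENTS = """\
numpy~=1.22.4
pandas>=1.4.0
tensorflow==2.9.1
"""

OPERATORS = ("~=", "==", ">=", "<=")


def parse_version(text):
    """Turn '1.22.4' into (1, 22, 4). Plain numeric versions only."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies(installed, requirement):
    """Simplified check of an installed version against one requirement line."""
    operator = next((op for op in OPERATORS if op in requirement), None)
    if operator is None:
        raise ValueError(f"unsupported requirement: {requirement!r}")
    wanted = parse_version(requirement.split(operator)[1])
    have = parse_version(installed)
    if operator == "==":
        return have == wanted
    if operator == ">=":
        return have >= wanted
    if operator == "<=":
        return have <= wanted
    # "~=": at least the wanted version, with all but the last part unchanged.
    return have >= wanted and have[:len(wanted) - 1] == wanted[:len(wanted) - 1]


numpy_line = EXAMPLE_REQUIREMENTS.splitlines()[0]
print(satisfies("1.22.9", numpy_line))  # True
print(satisfies("1.23.0", numpy_line))  # False
```

The last two lines show the point of ~=: a later release in the 1.22 series is accepted, while a jump to 1.23 is not.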

This requirements file is essential if you want anyone else to be able to execute your code. If they use the wrong versions they may get weird errors and not be able to use your project. It is even more important when you want to deploy your code in production because the installation will be done automatically instead of manually. To install the packages using the file you just have to do

pip install -r requirements.txt

Clear names

Even if you use the correct PEP8 conventions for your variable and function names, it does not mean that your code will be easy to understand. It is also very important to give them meaningful names so people can keep track of what the variables contain or what the functions do. It can sometimes be useful to specify the variable type in the names as well. Let's look at a few examples so you can see for yourself.

Can you tell what result_func is?
Which loop is the most readable?
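
As a sketch of the difference that names make, here are two versions of the same logic. Both functions are hypothetical and behave identically, and only the names change.

```python
# Hard to follow: the names say nothing about what the values are.
def func(l, n):
    r = []
    for x in l:
        if x > n:
            r.append(x)
    return r


# Easier to follow: the names describe the contents and the intent.
def filter_prices_above(prices, min_price):
    expensive_prices = []
    for price in prices:
        if price > min_price:
            expensive_prices.append(price)
    return expensive_prices


print(filter_prices_above([10, 30, 45], min_price=20))  # [30, 45]
```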

List comprehension

This one is a bit more advanced, but it helps make your code more compact and avoids having to define many intermediate variables to do a single task. A single example of converting a loop to a list comprehension is not too impressive, but when there are many of these loops in a single function you can save a lot of space by doing the conversion. Here is a simple example to show how list comprehensions work.
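
A minimal sketch of the conversion, with made-up data: the loop and the list comprehension below build exactly the same list.

```python
numbers = [1, 2, 3, 4, 5, 6]

# Loop version: an empty list is created first, then filled step by step.
squares_of_evens_loop = []
for number in numbers:
    if number % 2 == 0:
        squares_of_evens_loop.append(number ** 2)

# List comprehension: same result in a single expression.
squares_of_evens = [number ** 2 for number in numbers if number % 2 == 0]

print(squares_of_evens)  # [4, 16, 36]
```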

It is important to be careful not to overuse list comprehensions. The end result should still be easy to read, so avoid using this method for complicated calculations. Note that comprehensions can also be used to build dictionaries.
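
A dictionary comprehension follows the same pattern, with a key and a value separated by a colon. The data here is made up for illustration.

```python
sports = ["running", "cycling", "yoga"]

# Map each sport to the length of its name.
name_lengths = {sport: len(sport) for sport in sports}

print(name_lengths)  # {'running': 7, 'cycling': 7, 'yoga': 4}
```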

There are still many habits and tricks that I like to use to improve my code but this is a good sample of best practices to keep in mind when writing your code. This should become natural and you should do it automatically while you code, not after everything is done or when you get comments in a PR.

If you have other examples of important best practices, please share them in the comments :)


Yan Gobeil
Geek Culture

I am a data scientist at Décathlon Canada working on generating intelligence from sports images. I aim to learn as much as I can about AI and programming.