Python

Page maintainer: Bouwe Andela @bouweandela

Python is the "dynamic language of choice" of the Netherlands eScience Center. We use it for data analysis and data science projects, and for many other types of projects: workflow management, visualization, natural language processing, web-based tools and much more. It is a good default choice for many kinds of projects due to its generic nature, its large and broad ecosystem of third-party modules and its compact syntax which allows for rapid prototyping. It is not the language of maximum performance, although in many cases performance critical components can be easily replaced by modules written in faster, compiled languages like C(++) or Cython.

The philosophy of Python is summarized in the Zen of Python. In Python, this text can be retrieved with the import this command.

Project setup

When starting a new Python project, consider using our Python template. This template provides a basic project structure, so you can spend less time setting up and configuring your new Python packages, and comply with the software guide right from the start.

Use Python 3, avoid 2

Python 2 and Python 3 have co-existed for a long time, but starting from 2020, development of Python 2 is officially abandoned, meaning Python 2 will no longer be improved, even in case of security issues. If you are creating a new package, use Python 3. It is possible to write Python that is both Python 2 and Python 3 compatible (e.g. using Six), but only do this when you are 100% sure that your package won't be used otherwise. If you need Python 2 because of old, incompatible Python 2 libraries, strongly consider upgrading those libraries to Python 3 or replacing them altogether. Building and/or using Python 2 is probably discouraged even more than, say, using Fortran 77, since at least Fortran 77 compilers are still being maintained.

Learning Python

  • A popular way to learn Python is by doing it the hard way at http://learnpythonthehardway.org/
  • Using pylint and yapf while learning Python is an easy way to get familiar with best practices and commonly used coding styles

Dependencies and package management

To install Python packages use pip or conda (or both, see also what is the difference between pip and conda?).

If you are planning on distributing your code at a later stage, be aware that your choice of package management may affect your packaging process. See Building and packaging for more info.

Use virtual environments

We strongly recommend creating isolated "virtual environments" for each Python project. These can be created with venv or with conda. Advantages over installing packages system-wide or in a single user folder:

  • Installs Python modules when you are not root.
  • Contains all Python dependencies so the environment keeps working after an upgrade.
  • Keeps environments clean for each project, so you don't get more than you need (and can easily reproduce that minimal working situation).
  • Lets you select the Python version per environment, so you can test code compatibility between Python versions

Pip + a virtual environment

If you don't want to use conda, create isolated Python environments with the standard library venv module. If you are still using Python 2, virtualenv and virtualenvwrapper can be used instead.
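On the command line, an environment is typically created with `python -m venv .venv`. As a minimal sketch, the same can be done from Python with the standard library venv module. The `.venv` directory name and the `create_environment` helper are assumptions made for this illustration, and a temporary directory stands in for a real project.

```python
import tempfile
import venv
from pathlib import Path


def create_environment(project_dir: Path) -> Path:
    """Create a virtual environment in ``project_dir/.venv`` and return its path."""
    env_dir = project_dir / ".venv"
    # with_pip=False keeps this sketch fast; for real use you would normally
    # pass with_pip=True so that pip is available inside the environment.
    venv.create(env_dir, with_pip=False)
    return env_dir


# Use a temporary directory as a stand-in for a project directory.
with tempfile.TemporaryDirectory() as tmp:
    env_dir = create_environment(Path(tmp))
    # Every environment gets a pyvenv.cfg file describing the base interpreter.
    has_config = (env_dir / "pyvenv.cfg").is_file()
    print(f"Environment created: {has_config}")
```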

With venv and virtualenv, pip is used to install all dependencies. An increasing number of packages are using wheel, so pip downloads and installs them as binaries. This means they have no build dependencies and are much faster to install.

If the installation of a package fails because of its non-Python extensions or system library dependencies and you are not root, you could switch to conda (see below).

Conda

Conda can be used instead of venv and pip, since it is both an environment manager and a package manager. It easily installs binary dependencies, like Python itself or system libraries. Installation of packages that are not using wheel, but have a lot of non-Python code, is much faster with Conda than with pip because Conda does not compile the package, it only downloads compiled packages. The disadvantage of Conda is that the package needs to have a Conda build recipe. Many Conda build recipes already exist, but they are less common than the setuptools configuration that generally all Python packages have.

There are two main "official" distributions of Conda: Anaconda and Miniconda (and variants of the latter like miniforge, explained below). Anaconda is large and contains a lot of common packages, like numpy and matplotlib, whereas Miniconda is very lightweight and only contains Python. If you need more, the conda command acts as a package manager for Python packages. If installation with the conda command is too slow for your purposes, it is recommended that you use mamba instead.

For environments where you do not have admin rights (e.g. DAS-6) either Anaconda or Miniconda is highly recommended since the installation is very straightforward. The installation of packages through Conda is very robust.

A possible downside of Anaconda is that it is offered by a commercial supplier, but we don't foresee any vendor lock-in issues, because all packages are open source and can still be obtained elsewhere. Do note that since 2020, Anaconda has started charging large institutes for downloading packages from its main channel (called the default channel) through conda. This does not apply to universities and most research institutes, but could apply to some government institutes that also perform research and definitely applies to large for-profit companies. Be aware of this when choosing the distribution channel for your package. An alternative, community-driven Conda distribution that avoids this problem altogether, because it only installs packages from conda-forge by default, is miniforge. Miniforge includes both the faster mamba and the traditional conda.

Building and packaging code

Making an installable package

To create an installable Python package you will have to create a pyproject.toml file. This will contain three kinds of information: metadata about your project, information on how to build and install your package, and configuration settings for any tools your project may use. Our Python template already does this for you.

Project metadata

Your project metadata will be under the [project] header, and includes such information as the name, version number, description and dependencies. The Python Packaging User Guide has more information on what else can or should be added here. For your dependencies, you should keep version constraints to a minimum; use, in order of descending preference: no constraints, lower bounds, lower + upper bounds, exact versions. Use of requirements.txt is discouraged, unless necessary for something specific, see the discussion here.

It is best to keep track of direct dependencies for your project from the start and list these in your pyproject.toml. If instead you are writing a new pyproject.toml for an existing project, a recommended way to find all direct dependencies is by running your code in a clean environment (probably by running your test suite) and installing one by one the dependencies that are missing, as reported by the ensuing errors. It is possible to find the full list of currently installed packages with pip freeze or conda list, but note that this is not ideal for listing dependencies in pyproject.toml, because it also lists all dependencies of the dependencies that you use.
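To see why a full listing of installed packages is a poor source for your dependency list, the sketch below builds a similar listing with the standard library importlib.metadata module. The `installed_packages` helper is a hypothetical name, and the output only approximates what pip freeze reports; it is meant to show that the list contains everything in the environment, not just what your code imports.

```python
from importlib import metadata


def installed_packages() -> list[str]:
    """Return 'name==version' for every installed distribution, sorted by name."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installations without a name in their metadata
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


everything = installed_packages()
# This count includes indirect dependencies and unrelated tools alike.
print(f"{len(everything)} packages installed in this environment")
```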

Build system

Besides specifying your project's own metadata, you also have to specify a build system under the [build-system] header. We currently recommend using hatchling or setuptools. Note that Python's build system landscape is still in flux, so be sure to look up current practices in the packaging guide's section on build backends and authoritative blogs like this one. One important thing to note is that use of setup.py and setup.cfg has been officially deprecated and we should migrate away from that.

Tool configuration

Finally, pyproject.toml can be used to specify the configuration for any other tools like pytest, ruff and mypy your project may use. Each of these gets their own section in your pyproject.toml instead of using their own file, saving you from having dozens of such files in your project.

Installation

When the pyproject.toml is written, your package can be installed with

pip install -e .

The -e flag will install your package in editable mode, i.e. it will create a symlink to your package in the installation location instead of copying the package. This is convenient when developing, because any changes you make to the source code will immediately be available for use in the installed version.

Set up continuous integration to test your installation setup. You can use pyroma as a linter for your installation configuration.

Packaging and distributing your package

For packaging your code, you can either use pip or conda. Neither of them is better than the other -- they are different; use the one which is more suitable for your project. pip may be more suitable for distributing pure python packages, and it provides some support for binary dependencies using wheels. conda may be more suitable when you have external dependencies which cannot be packaged in a wheel.

Build via the Python Package Index (PyPI) so that the package can be installed with pip

  • General instructions
  • We recommend configuring GitHub Actions to upload the package to PyPI automatically for each release.
    • For new repositories, it is recommended to use trusted publishing because it is more secure than using secret tokens from GitHub.
    • You can follow these instructions to set up GitHub Actions workflows with trusted publishing.
      • The verbose option for pypi workflows is useful to see why a workflow failed.
      • To avoid unnecessary workflow runs, you can follow the example in the sirup package: manually trigger pushes to pypi and investigate potential bugs during this process with a manual upload.
  • Manual uploads with twine
    • Because PyPI and Test PyPI have required two-factor authentication since January 2024, you need to mimic GitHub's trusted publishing to publish manually with twine.
    • You can follow the section on "The manual way" as described here.
  • Additional guidelines:
    • Packages should be uploaded to PyPI using your own account
    • For packages developed in a team or organization, it is recommended that you create a team or organizational account on PyPI and add that as a collaborator with the owner role. This will allow your team or organization to maintain the package even if individual contributors at some point move on to do other things. At the Netherlands eScience Center, we are a fairly small organization, so we use a single backup account (nlesc).
    • When distributing code through PyPI, non-Python files (such as requirements.txt) will not be packaged automatically; you need to add them to a MANIFEST.in file.
    • To test whether your distribution will work correctly before uploading to PyPI, you can run python -m build in the root of your repository. Then try installing your package with pip install dist/<your_package>.tar.gz.
    • python -m build will also build Python wheels, the current standard for distributing Python packages. This will work out of the box for pure Python code, without C extensions. If C extensions are used, each OS needs to have its own wheel. The manylinux Docker images can be used for building wheels compatible with multiple Linux distributions. Wheel building can be automated using GitHub Actions or another CI solution, where you can build on all three major platforms using a build matrix.
  • Make use of conda-forge whenever possible, since it provides many automated build services that save you tons of work, compared to using your own conda repository. It also has a very active community for when you need help.
  • Use BioConda or custom channels (hosted on GitHub) as alternatives if need be.

Editors and IDEs

Every major text editor supports Python, either natively or through plugins. At the Netherlands eScience Center, some popular editors or IDEs are:

  • vscode holds the middle ground between a lightweight text editor and a full-fledged language-dedicated IDE.
  • vim or emacs (don't forget to install plugins to get the most out of these two), two versatile classic powertools that can also be used through remote SSH connection when needed.
  • JetBrains PyCharm is the Python-specific IDE of choice. PyCharm Community Edition is free and open source; the source code is available in the python folder of the IntelliJ repository.

Coding style conventions

The style guide for Python code is PEP8 and for docstrings it is PEP257. We highly recommend following these conventions, as they are widely agreed upon to improve readability. To make following them significantly easier, we recommend using a linter.

Many linters exist for Python. The most popular one is currently Ruff. Although it is new (see the website for the complete function parity comparison with alternatives), it works well and has an active community. An alternative is prospector, a tool for running a suite of linters, including, among others, pycodestyle, pydocstyle, pyflakes, pylint, mccabe and pyroma. Some of these tools have seen decreasing community support recently, but it is still a good alternative, having been a defining community default for years.

Most of the above tools can be integrated in text editors and IDEs for convenience.

Autoformatting tools like yapf and black can automatically format code for optimal readability. yapf is configurable to suit your (team's) preferences, whereas black enforces the style chosen by the black authors. The isort package automatically formats and groups all imports in a standard, readable way.

Ruff can do autoformatting as well and can function as a drop-in replacement of black and isort.

Type hints

Since PEP 484, which was first implemented in Python 3.5 (released in 2015), Python has gained the ability to add type information to variables. These are not types, as in typed languages; they are hints. Naively, one could say they are a new type of documentation. However, in practice they are far more than this, because they do have their own special syntax rules and are thus parsable. In fact, some tools have started to make use of this in runtime modules as well, making them more than hints for tools like Pydantic, FastAPI and Typer (all described below). See this guide to learn more about type hints.
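As a small illustration of hints being parsable but not enforced, the sketch below annotates a hypothetical `scale` function, inspects the stored hints with `typing.get_type_hints`, and then calls the function with an argument that contradicts the hints.

```python
from typing import get_type_hints


def scale(values: list[float], factor: float = 2.0) -> list[float]:
    """Multiply every value by ``factor``."""
    return [value * factor for value in values]


# The hints are stored on the function and can be inspected by tools.
hints = get_type_hints(scale)
print(hints)

# Python itself does not enforce the hints: a string factor is accepted and
# silently produces repeated strings instead of scaled numbers.
result = scale([1, 2], factor="a")
print(result)
```

A static type checker would report the second call as an error before the code is ever run.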

Some tools to know about that make use of type hints:

  • Type checkers are static code analysis tools that check your code based on the type hints you provide. It is highly recommended that you use a type checker. Choose mypy if you are unsure which one to choose.
  • Tools to build documentation from source code have extensions that can show type hints in the generated documentation to make your code easier to understand. Popular examples are sphinx autodoc, sphinx autoapi, and mkdocstrings.
  • Pydantic is a widely used data validation library that allows you to automatically validate instances of dataclasses at runtime. This means that for this tool the type hints are no longer just hints or a form of documentation, but have actual effects. Essentially, a fully Pydantic-enriched application (in "strict mode") is like having Mypy at runtime (there is also a "tolerant" mode that lets some common types slip through without errors). In effect, it makes Python behave more like a statically typed language, with the checks performed at runtime.
  • Most editors nowadays make use of type hints for autocompletion. If the editor knows the type of your variable, for instance, it can autocomplete attributes or methods of that class.
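To make the idea of type hints with runtime effects concrete, here is a hand-written sketch that uses only the standard library. It is not how Pydantic is implemented and covers far fewer cases; the `Measurement` class and its fields are hypothetical. The dataclass compares each value with its annotated type when an instance is created.

```python
from dataclasses import dataclass, fields


@dataclass
class Measurement:
    """Hypothetical record type used only for this illustration."""

    station: str
    temperature: float

    def __post_init__(self) -> None:
        # Compare each value with its annotated type. This only handles plain
        # classes such as str and float, not generics like list[int].
        for field in fields(self):
            value = getattr(self, field.name)
            if not isinstance(value, field.type):
                raise TypeError(
                    f"{field.name} must be {field.type.__name__}, "
                    f"got {type(value).__name__}"
                )


valid = Measurement(station="De Bilt", temperature=12.5)

try:
    Measurement(station="De Bilt", temperature="warm")
except TypeError as error:
    message = str(error)
    print(message)
```

Note that this sketch is strict: an integer temperature would also be rejected, because `isinstance(12, float)` is false.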

We recommend using type hints, where possible and practical. Type hints are still being actively developed; not everything one would like to be able to express in a compact way can yet be achieved. This is why, for instance, NumPy arrays and the "tensor" types of machine learning libraries (e.g. PyTorch, TensorFlow) still (in 2024) have awkward type hinting. Crucial information that one would typically want to encode for array-type input arguments is the shape, but this is not yet possible. Other important libraries, like Matplotlib, have very complex functions that take in many possible types of arguments, leading to overly complex variable types. Such huge types clutter your code tremendously, so they are not typically encouraged.

Testing

Use pytest as the basis for your testing setup. This is preferred over the unittest module from the standard library, because it has a much more concise syntax and supports many useful features.

It has many plugins. For linting, we have found pytest-pycodestyle, pytest-pydocstyle, pytest-mypy and pytest-flake8 to be useful. Other plugins we had good experience with are pytest-cov, pytest-html, pytest-xdist and pytest-nbmake.
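To illustrate the concise syntax mentioned above, the sketch below shows two tests written in the style that pytest collects: plain functions whose names start with `test_` and that use bare assert statements. The `mean` function is a hypothetical example, and the file deliberately avoids importing pytest so that it also runs as plain Python.

```python
def mean(values: list[float]) -> float:
    """Return the arithmetic mean of a non-empty list."""
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


# pytest collects functions whose names start with "test_" and reports any
# failing assert statement; no test class or special methods are needed.
def test_mean_of_known_values():
    assert mean([1.0, 2.0, 3.0]) == 2.0


def test_mean_rejects_empty_input():
    try:
        mean([])
    except ValueError:
        pass
    else:
        raise AssertionError("expected a ValueError")
```

In a real test suite, the second test would usually be written with the `pytest.raises` context manager instead of try/except.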

Creating mocks can also be done within the pytest framework by using the mocker fixture provided by the pytest-mock plugin or by using MagicMock and patch from unittest. For a general explanation about mocking, see the standard library docs on mocking.
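A minimal sketch of both standard library tools follows. The functions under test, `describe_now` and `notify`, are hypothetical; `patch` replaces the system clock for the duration of a test, and `MagicMock` stands in for a collaborator object and records how it was called.

```python
import time
from unittest.mock import MagicMock, patch


def describe_now() -> str:
    """Hypothetical function under test that depends on the system clock."""
    return f"seconds since epoch: {int(time.time())}"


def notify(sender, text: str) -> None:
    """Hypothetical function that hands a message to a collaborator object."""
    sender.send(text.upper())


def test_describe_now_with_patched_clock():
    # patch temporarily replaces time.time, so the result is deterministic.
    with patch("time.time", return_value=42.0):
        assert describe_now() == "seconds since epoch: 42"


def test_notify_calls_sender():
    # MagicMock records how it was called, so the interaction can be checked.
    sender = MagicMock()
    notify(sender, "hello")
    sender.send.assert_called_once_with("HELLO")
```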

To run your test suite, it can be convenient to use tox. Testing with tox allows for keeping the testing environment separate from your development environment. The development environment will typically accumulate (old) packages during development that interfere with testing; this problem is avoided by testing with tox.

Code coverage

When you have tests, it is also good to see which source code is exercised by the test suite. Code coverage can be measured with the coverage Python package. The coverage package can also generate HTML reports which show which lines were covered. Most test runners have the coverage package integrated.

The code coverage reports can be published online using a code quality service or a code coverage service. Preferably, use one of the code quality services listed below that also handle code coverage. If this is not possible or does not fit, use a generic code coverage service such as Codecov or Coveralls.

Code quality analysis tools and services

Code quality services are explained in The Turing Way. There are multiple code quality services available for Python, all of which have their pros and cons. See The Turing Way for links to lists of possible services. We currently set up Sonarcloud by default in our Python template. To reproduce the Sonarcloud pipeline locally, you can use SonarLint in your IDE. If you use another editor, perhaps it is more convenient to pick another service like Codacy or Codecov.

Debugging and profiling

Debugging

Profiling

There are a number of available profiling tools that are suitable for different situations.

  • cProfile measures number of function calls and how much CPU time they take. The output can be further analyzed using the pstats module.
  • For more fine-grained, line-by-line CPU time profiling, two modules can be used:
    • line_profiler provides a function decorator that measures the time spent on each line inside the function.
    • pprofile is less intrusive; it simply times entire Python scripts line-by-line. It can give output in callgrind format, which allows you to study the statistics and call tree in kcachegrind (often used for analyzing c(++) profiles from valgrind).
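As a minimal sketch of the first option, the example below profiles a hypothetical `slow_sum` function with cProfile and uses pstats to sort and print the most expensive entries. The workload and the limit of five entries are arbitrary choices for illustration.

```python
import cProfile
import io
import pstats


def slow_sum(n: int) -> int:
    """Hypothetical workload: sum of squares computed in a Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()

# Sort by cumulative time and print the five most expensive entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

For a whole script, the same information can be collected without changing the code by running it with `python -m cProfile`.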

More realistic profiling information can usually be obtained by using statistical or sampling profilers. The profilers listed below all create nice flame graphs.

Logging

Documentation

It is recommended that you write documentation for your projects and publish it on an interactive webpage. A popular and recommended solution for hosting documentation is Read the Docs. It can automatically build documentation for projects hosted on GitHub, GitLab, and Bitbucket.

Building documentation

There are several tools for building webpages with documentation. At the eScience Center, we mostly use Sphinx (more established) and MkDocs (newer).

User guides and other text documents are typically written in Markdown or reStructuredText. Sphinx supports both formats, while MkDocs only supports Markdown. Markdown has the advantage that it's easier to read for humans so it may be easier to work with and contribute to. reStructuredText is easier to read for computers so may be more suitable for complex projects.

Python uses Docstrings for code documentation. You can read a detailed description of docstring usage in PEP 257. Both Sphinx and MkDocs can generate documentation webpages from docstrings. There are two popular Sphinx extensions for generating documentation: autoapi (newer and more lightweight) and autodoc (more established). For MkDocs the mkdocstrings package is available. We recommend using the NumPy documentation style, as that is widely used in the scientific Python ecosystem.
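For reference, a docstring in the NumPy documentation style could look like the sketch below. The `rescale` function is a hypothetical example; the section names (Parameters, Returns, Raises) and their underlines follow the NumPy style.

```python
import inspect


def rescale(values, lower=0.0, upper=1.0):
    """Linearly rescale values to a given range.

    Parameters
    ----------
    values : list of float
        Input values; must contain at least two distinct numbers.
    lower : float, optional
        Smallest value of the output range.
    upper : float, optional
        Largest value of the output range.

    Returns
    -------
    list of float
        The rescaled values, in the same order as the input.

    Raises
    ------
    ValueError
        If all input values are identical.
    """
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("values must contain at least two distinct numbers")
    span = upper - lower
    return [lower + span * (v - low) / (high - low) for v in values]


# Documentation tools read the same text that help() and inspect.getdoc() show.
summary = inspect.getdoc(rescale).splitlines()[0]
print(summary)
```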

You can also integrate entire Jupyter notebooks into your documentation with nbsphinx or mkdocs-jupyter. This way, your demo notebooks, for instance, can double as documentation. Of course, the notebooks will not be interactive in the compiled webpage, but they will include all code and output cells and you can easily link to an interactive version from the compiled documentation.

It is recommended that you routinely test any code examples in your documentation.
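One way to do this for examples embedded in docstrings is the standard library doctest module, sketched below for a hypothetical `fahrenheit_to_celsius` function. The finder collects the interactive examples from the docstring and the runner executes them and compares the output with what is documented.

```python
import doctest


def fahrenheit_to_celsius(degrees: float) -> float:
    """Convert a temperature from Fahrenheit to Celsius.

    Examples
    --------
    >>> fahrenheit_to_celsius(212.0)
    100.0
    >>> fahrenheit_to_celsius(32.0)
    0.0
    """
    return (degrees - 32.0) * 5.0 / 9.0


# Collect and run the examples embedded in the docstring.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
namespace = {"fahrenheit_to_celsius": fahrenheit_to_celsius}
for test in finder.find(
    fahrenheit_to_celsius, name="fahrenheit_to_celsius", globs=namespace
):
    runner.run(test)
results = runner.summarize(verbose=False)
print(results)
```

In a project that already uses pytest, the `--doctest-modules` option runs such examples as part of the regular test suite.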

Recommended additional packages and libraries

General scientific

  • NumPy
  • SciPy
  • Pandas data analysis toolkit
  • scikit-learn: machine learning in Python
  • Cython speed up Python code by using C types and calling C functions
  • dask larger than memory arrays and parallel execution

IPython and Jupyter notebooks (aka IPython notebooks)

IPython is an interactive Python interpreter -- very much the same as the standard Python interactive interpreter, but with some extra features (tab completion, shell commands, in-line help, etc).

Jupyter notebooks (formerly known as IPython notebooks) are browser-based interactive Python environments. They incorporate the same features as the IPython console, plus some extras like in-line plotting. Look at some examples to find out more. Within a notebook you can alternate code with Markdown comments (and even LaTeX), which is great for reproducible research. Notebook extensions add extra functionality to notebooks. JupyterLab is a web-based environment with a lot of improvements and integrated tools.

Jupyter notebooks contain data that makes it hard to nicely keep track of code changes using version control. If you are using git, you can add filters that automatically remove output cells and unneeded metadata from your notebooks. If you do choose to keep output cells in the notebooks (which can be useful to showcase your code's capabilities statically from GitHub) use ReviewNB to automatically create nice visual diffs in your GitHub pull request threads. It is good practice to restart the kernel and run the notebook from start to finish in one go before saving and committing, so you are sure that everything works as expected.
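To show what such a filter does, here is a simplified sketch that removes outputs and execution counts from a notebook stored in the version 4 JSON format. It is not the implementation of any particular tool, and existing tools handle many more cases (metadata, widgets, attachments). The `strip_outputs` helper and the tiny hand-written notebook are assumptions for illustration.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook (nbformat 4 JSON) without cell outputs."""
    cleaned = json.loads(json.dumps(notebook))  # cheap deep copy of JSON data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# A tiny hand-written notebook with one executed code cell.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Demo"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "source": ["print(1 + 1)"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
        },
    ],
}

cleaned = strip_outputs(notebook)
print(json.dumps(cleaned["cells"][1], indent=1))
```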

Visualization

  • Matplotlib has been the standard in scientific visualization. It supports quick-and-dirty plotting through the pyplot submodule. Its object oriented interface can be somewhat arcane, but is highly customizable and runs natively on many platforms, making it compatible with all major OSes and environments. It supports most sources of data, including native Python objects, Numpy and Pandas.
    • Seaborn is a Python visualisation library based on Matplotlib and aimed towards statistical analysis. It supports numpy, pandas, scipy and statsmodels.
  • Web-based:
    • Bokeh is Interactive Web Plotting for Python.
    • Plotly is another platform for interactive plotting through a web browser, including in Jupyter notebooks.
    • altair is a grammar of graphics style declarative statistical visualization library. It does not render visualizations itself, but rather outputs Vega-Lite JSON data. This can lead to a simplified workflow.
    • ggplot is a plotting library imported from R.

Parallelisation

CPython (the official and mainstream Python implementation) is not built for parallel processing due to the global interpreter lock (GIL). Note that the GIL only applies to actual Python code, so compiled modules such as numpy do not suffer from it.

Having said that, there are many ways to run Python code in parallel:
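One such way, available in the standard library, is concurrent.futures. The sketch below uses a thread pool for a hypothetical I/O-bound task, where sleeping stands in for waiting on a network or disk. Threads work well here because the GIL is released while a thread waits. The `fetch` function and the number of workers are assumptions for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(item: int) -> int:
    """Hypothetical I/O-bound task; sleeping stands in for waiting on a server."""
    time.sleep(0.1)  # the GIL is released while a thread sleeps or waits on I/O
    return item * 2


# Four worker threads wait concurrently, so eight tasks take roughly two
# rounds of waiting instead of eight.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, range(8)))

print(results)
```

For CPU-bound pure Python code, ProcessPoolExecutor from the same module offers the same interface but runs the work in separate processes, each with its own interpreter and GIL.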

Web Frameworks

There are convenient Python web frameworks available:

  • flask
  • CherryPy
  • Django
  • bottle (similar to flask, but a bit more light-weight for a JSON-REST service)
  • FastAPI: again, similar to flask in functionality, but uses modern Python features like async and type hints with runtime behavioral effects.

We have recommended flask in the past, but FastAPI has become more popular recently.

NLP/text mining

  • nltk Natural Language Toolkit
  • Pattern: web/text mining module
  • gensim: Topic modeling

Creating programs with command line arguments

  • For run-time configuration via command-line options, the built-in argparse module usually suffices.
  • A more complete solution is ConfigArgParse. This (almost) drop-in replacement for argparse allows you to not only specify configuration options via command-line options, but also via (ini or yaml) configuration files and via environment variables.
  • Other popular libraries are click and fire.
  • Typer: make a command-line application by using type hints with runtime effects. Very low on boilerplate for simple cases, but also allows for more complex cases. Uses click internally.
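As a minimal sketch of the first option, the example below defines a parser with argparse for a hypothetical `resample` tool. The program name, the positional argument and the two options are made up for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Create the parser for a hypothetical 'resample' command-line tool."""
    parser = argparse.ArgumentParser(
        prog="resample", description="Resample a data file."
    )
    parser.add_argument("input", help="path to the input file")
    parser.add_argument(
        "--factor", type=int, default=2, help="resampling factor (default: 2)"
    )
    parser.add_argument(
        "-v", "--verbose", action="store_true", help="print progress information"
    )
    return parser


# Passing a list makes the parser easy to try out and to test; calling
# parse_args() without arguments reads sys.argv instead.
args = build_parser().parse_args(["data.csv", "--factor", "4"])
print(args.input, args.factor, args.verbose)
```

Keeping the parser in its own function, separate from the code that acts on the arguments, makes both parts easier to test.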