Building modern Python API backends in 2022

Published: August 21, 2022

Intro

This guide is intended for people who are already familiar with Python 3 and looking to start a new project.

This guide comes from my experience building API backends for startups across a variety of industries. It serves as a starting point for what I currently consider best practice: how I think about organising Python projects, structuring code, testing, and the common libraries I’ve reused across projects.

I’ve taken an intentionally opinionated approach with this guide, and I have strong views on how things should work. All of this has been informed by the lessons learned from building multiple Python backends from scratch.

Default to these libraries when building an API backend

FastAPI - HTTP routing

This is my preferred library for writing request/response handlers in Python. It has first-class support for async and Python 3 type hints, and the documentation is great. If you’ve used Flask, then you’ll feel right at home with FastAPI.
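
As a minimal sketch (the route and response shape here are purely illustrative):

from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
async def health_check() -> dict:
    # Async handlers work out of the box, and FastAPI generates
    # OpenAPI docs for this route automatically.
    return {"status": "ok"}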

Pydantic - Data serialisation, de-serialisation, and validation

This library is the recommended way of serialising/de-serialising requests and responses with FastAPI. I use it to serialise database models into the correct format for sending back as a response to the client. For requests, I use its built-in validation features to make sure that clients are sending us the correct data.
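
For example, a pair of hypothetical schemas for a user endpoint could look like this (the field names are made up):

from pydantic import BaseModel, constr

class UserCreate(BaseModel):
    # Incoming request body; Pydantic rejects payloads that don't match these types.
    email: str
    display_name: constr(min_length=1, max_length=100)

class UserRead(BaseModel):
    # Outgoing response; orm_mode lets this be built straight from an ORM model.
    id: int
    email: str
    display_name: str

    class Config:
        orm_mode = True

Declare UserCreate as the request body type and UserRead as the response_model on a route, and FastAPI handles the validation and serialisation for you.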

SQLAlchemy - ORM

The best ORM in the Python ecosystem, and possibly other languages too. It’s a big library, and has a lot of “magic” features to make working with your database easier, more on that later.
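
A typical declarative model, as a rough sketch (the table and columns are made up):

from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class User(Base):
    __tablename__ = "users"

    id = Column(Integer, primary_key=True)
    email = Column(String, unique=True, nullable=False)
    display_name = Column(String, nullable=False)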

Pytest - Unit and integration testing

Currently the best testing framework for Python.
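
Tests are plain functions and assert statements. A hypothetical test against the health endpoint above, assuming your FastAPI instance lives in app/main.py:

from fastapi.testclient import TestClient

from app.main import app

client = TestClient(app)

def test_health_check():
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json() == {"status": "ok"}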

Celery - Distributed task queue

One of the best pieces of software in the Python ecosystem. I’ve used this to run millions of background jobs across multiple nodes. When you start scaling, you’ll eventually be using this somewhere in your stack.
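
A background job is just a decorated function; the broker URL and task body below are placeholders:

from celery import Celery

celery_app = Celery("worker", broker="redis://localhost:6379/0")

@celery_app.task
def send_welcome_email(user_id: int) -> None:
    # Runs on a worker process, outside the request/response cycle.
    ...

# Enqueue it from a request handler or service with:
# send_welcome_email.delay(user_id=42)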

Structlog - Structured logging

This is one of the more recent additions to my toolbelt. I prefer this over the standard library because it’s designed from the outset to support structured logging.
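
Basic usage looks something like this (the event name and fields are illustrative):

import structlog

logger = structlog.get_logger()

# Key/value pairs become structured fields rather than being baked into a string.
logger.info("user_created", user_id=42, plan="free")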

Typer - Building CLI scripts

Another creation by the author of FastAPI, it makes extensive use of Python types to build out CLI applications. I highly recommend this over the standard library’s argparse.
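
A small sketch of a Typer command (the command and options are made up):

import typer

app = typer.Typer()

@app.command()
def delete_user(user_id: int, dry_run: bool = typer.Option(False)) -> None:
    """Delete a user by id."""
    typer.echo(f"Deleting user {user_id} (dry_run={dry_run})")

if __name__ == "__main__":
    app()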

Alembic - Managing database migrations

This works hand-in-hand with SQLAlchemy to manage your database schema. If you’re using a relational database, you’ll eventually want to change the schema at some point, and Alembic makes that a breeze.

Poetry - Dependency management

If you’re still using requirements.txt you’re doing it wrong. Poetry is the best package manager for Python at the moment. It works similarly to yarn: a poetry.lock file tracks the concrete versions of the libraries installed, while pyproject.toml tracks your declared dependency versions.

Python types are your friend, use them whenever possible

Your codebase should be taking advantage of the typing introduced in Python 3. With a properly set up editor like Visual Studio Code, you get helpful linting warnings when there are mismatches in types, or when an Optional type is not checked properly.

A lot of popular Python libraries now ship with type definitions, and frameworks like FastAPI take advantage of typing information, so there’s now no excuse not to use them.

If you’re using Python 3.9+, use the built-in types, i.e. list, dict, etc., rather than their equivalents in the typing library.
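
For example:

# Python 3.9+: the built-in generics work directly in annotations.
def tags_by_user(user_ids: list[int]) -> dict[int, list[str]]:
    ...

# The older equivalent via the typing module:
# from typing import Dict, List
# def tags_by_user(user_ids: List[int]) -> Dict[int, List[str]]:
#     ...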

Use automation wherever possible to keep your codebase clean

I use black, isort, mypy, and flake8 integrations with VS Code to make sure that good code is shipped to production.

Black and isort make sure that code is formatted properly and consistently across my codebase. These tools are opinionated with their default settings, but that’s a good thing.

Mypy and flake8 catch problems like unused variables or mismatched types. Integrating these tools into your CI process, or into VS Code itself, catches problems early on while writing code.

When codebases start growing, keeping consistency in how everything “looks” is important. It makes reading code easier for existing developers, and new developers find it easier to navigate the codebase.

Your backend code will largely fall into these categories

In most backend projects, your application code will fall into these buckets:

  • database access - basically CRUD operations on the database (see the repo sketch after this list)
  • services - anything that has business logic, stuff that is specific to your app domain, or anything that interacts with an external service
  • routing - in my case the FastAPI endpoints, the code that deals with handling requests and sending back responses
  • schemas - serialisation of models to json responses, and deserialisation of requests to python classes, including validation of request data
  • tasks - asynchronous tasks that run in your job queue
  • scripts - operator tasks like deleting users, backfilling data, running migrations, etc
  • tests - unit and integration tests
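
As a rough sketch of what a database access class can look like (the model and methods here are just examples):

from typing import Optional

from sqlalchemy.orm import Session

from app.models.user import User  # hypothetical ORM model

class UserRepo:
    def __init__(self, session: Session) -> None:
        self.session = session

    def get(self, user_id: int) -> Optional[User]:
        return self.session.get(User, user_id)

    def create(self, email: str, display_name: str) -> User:
        user = User(email=email, display_name=display_name)
        self.session.add(user)
        self.session.commit()
        return user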

Keep the folder structure simple

To make navigating the codebase simple, I typically lay out the Python backend folder structure like this:

.
├── app
│   ├── models (Your ORM models)
│   ├── repos (Database CRUD classes)
│   ├── routers (FastAPI endpoints)
│   ├── schema (Pydantic schemas)
│   ├── services (Your domain code)
│   ├── tasks (Celery tasks)
│   └── lib (shared code)
├── migrations (Alembic database migrations)
├── scripts (Various CLI scripts)
└── tests (Unit and integration tests)

I’ve found this folder structure works very well for organising my Python backend codebases. It’s easy to work out where a specific piece of code is, and for anyone creating a new file, it’s clear where the file should be created.

Use the “magic” features of SQLAlchemy sparingly

Let me preface this by saying that most ORMs are great for solving 90% of the problems you hit when writing code to interact with your database. SQLAlchemy in particular is the best ORM I’ve worked with.

When you’re doing simple queries against your database, ORMs can be a timesaver. However, when you’ve got a complex data model that requires more advanced query construction, you’ll end up trying to figure out how to map the SQL query you want to write into SQLAlchemy’s way of doing it. SQL is already a powerful language, and oftentimes it’s just better to write raw SQL queries.
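
When the ORM gets in the way, you can drop down to raw SQL while still going through the SQLAlchemy session, for example (the query is illustrative, and session is assumed to be an active SQLAlchemy Session):

from sqlalchemy import text

rows = session.execute(
    text(
        """
        SELECT u.id, COUNT(o.id) AS order_count
        FROM users u
        LEFT JOIN orders o ON o.user_id = u.id
        GROUP BY u.id
        HAVING COUNT(o.id) > :min_orders
        """
    ),
    {"min_orders": 10},
).all()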

The other issue I’ve encountered is the breadth of options you have when creating SQLAlchemy models, for example the various ways of defining relationships between models. There’s an entire chapter dedicated to relationship configuration in the SQLAlchemy manual. This means having to keep a mental model of how SQLAlchemy works on top of how plain SQL works.

The rule of thumb I have when using SQLAlchemy is to stick to a small subset of the ORM’s smart features. I avoid adding/deleting models via ORM relationships, I avoid polymorphic models, and so on: basically anything that’s too far removed from SQL.

Containerise your backend

For local development, I have a single docker-compose file that will bring up the entire backend stack, including all related services. This makes the developer experience first class. Developers are able to make changes, poke API endpoints, and see whether their changes work as expected without having to wait for code to ship to a test environment.

Having a local setup that strongly mirrors what’s running on production is one of the biggest wins for improving developer productivity.

Packaging your code inside a container is now the default way of shipping code to production in 2022. Most hosting providers have some platform that allows you to run containers, whether it’s a managed k8s service, AWS ECS, Google Cloud Run, or DigitalOcean’s app platform. I run the backend using a managed Kubernetes cluster on DigitalOcean.

Use structured logging

You should be using logging in production. Logging can help you debug problems, so don’t be economical with your logging. Log storage is cheap, and the performance overhead of emitting logs is low.

With structured logging, you can embed information like IDs, timing information, and error details within your log lines. If you use a log-based metrics product like Google Cloud Logging, you can extract metrics or build alerts purely from your log output.

Application configuration

Following the convention of 12-factor apps, I store all configuration data in the environment variables of the containers I run.

To get a config variable, it’s just a matter of calling os.getenv to access the value. Alternatively, you can use Pydantic’s BaseSettings class to populate a class with values from the environment.
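
A sketch of both approaches (the setting names are made up):

import os

from pydantic import BaseSettings

# Plain environment access, with a fallback default:
database_url = os.getenv("DATABASE_URL", "postgresql://localhost/app")

class Settings(BaseSettings):
    # Each field is populated from the environment variable of the same name.
    database_url: str          # raises a validation error if DATABASE_URL is missing
    redis_url: str = "redis://localhost:6379/0"
    debug: bool = False

settings = Settings()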

As I use Kubernetes for the backend app, I use a ConfigMap to store environment variables, which is then referenced by the deployment manifest.

Streamline how you run your ops/engineering scripts

One of the unspoken things about running and maintaining a backend system is that you’re going to accumulate a collection of various scripts to perform frequent and ad-hoc engineering tasks.

For example, if you’re using alembic to manage database migrations, you’ll need to execute the alembic upgrade command against your production database. If you need to backfill some data, you'll need to run an engineering script to do this.

Make it easy for your engineers to run these engineering scripts on production.

Outro

Hopefully this guide was helpful in shaping your approach to building backend systems.

Although this guide has some Python-specific advice, on the whole it can be applied to building backends in most popular languages.