Database Migration: Middleware Pattern

I was recently given the task of migrating the persistence layers of several JSON APIs from AWS's DynamoDB to Postgres. DynamoDB was initially chosen for its flexibility as a managed NoSQL solution, but as models solidified, relations revealed themselves, and data sets grew, the need for the speed and structure of SQL became clear.

These APIs were mission critical to the business, in constant use (meaning data was always being read, written, and updated), and some were public facing, thus requiring little to no downtime and backwards compatibility. To reduce the chances of introducing new bugs to the DynamoDB pathways of the application, as well as make the removal of migration code safer, we needed to make as few changes as possible to the existing source code. Finally, we determined that a fail-safe process was essential, so that if any of our data was not being persisted to the Postgres database as expected (which indeed was the case on one occasion), no data would be lost.

Project Priorities:

  • Move 100% of the data accurately to Postgres
  • Incremental releases
  • Backwards compatibility of API requests and responses
  • Minimize source code changes
  • Fail-safe process

Given these requirements, a middleware pattern emerged that would facilitate a four-phase migration. The basic idea is to send all transactions through a piece of code that reads from and writes to different databases depending on the value of an environment variable. This approach would allow for an incremental migration to our new database, and create a data safety net to fall back on. This phased approach also meant that if things did go wrong, we could simply roll back to eliminate prolonged downtime.

4 Phase Migration

Phase | Read (DB of record) | Write              | Can Roll Back
0     | DynamoDB            | DynamoDB           | Yes
1     | DynamoDB            | DynamoDB, Postgres | Yes
2     | Postgres            | DynamoDB, Postgres | Yes
3     | Postgres            | Postgres           | No

Phase 0:

read - DynamoDB
write - DynamoDB

This phase ensures that the new router code can be released while still preserving existing functionality. It is an important step that allows for incremental development, meaning we did not need to complete all of the Postgres functionality before getting migration code released to production.

Phase 1:

read - DynamoDB
write - DynamoDB and Postgres

Phase one tests connecting and writing to Postgres while maintaining DynamoDB as the database of record. An initial migration of the existing data should be performed before entering this phase.

Phase 2:

read - Postgres
write - DynamoDB and Postgres

In phase two Postgres is now the database of record. Prior to implementing phase two, the Postgres database is truncated and a fresh data migration is performed. It is preferable to stop writes and updates to DynamoDB during this migration, as this will simplify the verification of data integrity, but at a minimum updates need to be disabled (more on this later). Because DynamoDB is still being written to in phase two, if anything goes wrong it is simple to revert to phase one and continue relying on DynamoDB as the database of record.

Phase 3:

read - Postgres
write - Postgres

This marks the point of no return. All data persistence must be verified in Postgres before switching to phase three because the application will no longer be writing to DynamoDB as a fail-safe.
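
What that verification looks like will vary, but a simple spot check can go a long way: sample records from both databases and compare them. The sketch below assumes the repository interfaces shown later in this post plus a hypothetical list_car_ids helper on the DynamoDB repository.

import random

def verify_sample(dynamo_repo, postgres_repo, sample_size=500):
    # list_car_ids is a hypothetical helper that returns every primary key in DynamoDB
    ids = dynamo_repo.list_car_ids()
    mismatches = []
    for car_id in random.sample(ids, min(sample_size, len(ids))):
        # Both repositories return the same domain object, so equality is a fair comparison
        if dynamo_repo.fetch_car(car_id) != postgres_repo.fetch_car(car_id):
            mismatches.append(car_id)
    return mismatches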

Clean Up:

It would be tempting to move on at this point, but it is important to remove the migration and DynamoDB code while this application is still fresh in your mind.

Existing Code Structure

This pattern is tailored to the structure of the code bases I was working in, where resources were accessed via services, which in turn called repositories to interface with the database. Services perform any necessary logic on data entities, while repositories simply communicate with the database to retrieve and persist data. The router pattern I am about to explain can still be applied to other application structures, but would require alterations to the implementation.

Another helpful feature of the code bases I was working in was a defined data domain. The data entities were represented as Python dataclasses, and these objects were the sole representation of the data when passed throughout the app, making it simple to convert from the new Postgres representation into the expected domain. This will be discussed in more detail below.

Middleware: Database Router

The DatabaseRouter class is the workhorse of this pattern. It handles converting a function name (as an enum) into the correct actions on DynamoDB, Postgres, or both, depending on the migration phase. Consuming this functionality is simple and therefore requires only minor updates to the existing database usage. Given the service and repository structure explained above, implementing the database router would look like this:


from utils.cars_db_router import CarsDBRouter, CarFunction
from repositories.cars import DynamoCarsRepository, PsqlCarsRepository

class CarService:

    def __init__(self):
        self.cars_router = CarsDBRouter(DynamoCarsRepository(), PsqlCarsRepository())

    def get_car(self, id):
        car = self.cars_router.call(CarFunction.FETCH_CAR, id)
        return self.format_car(car)

    ...

Now that I have established the desired implementation, I will show you the DatabaseRouter class that makes this possible. It consists of a single public function, call, that accepts an enum representing the repository function and its operation type, as well as any arguments needed, then returns the results from the "source of record" database. The private functions of DatabaseRouter handle deciding which databases to use depending on both the migration phase and whether the operation is a read or a write.

Initial class setup:

# utils/database_router.py

import os
from typing import Any, List

from repositories import DynamoRepository, PostgresRepository

DYNAMO_SLUG = "dynamo"
POSTGRES_SLUG = "postgres"

class DatabaseRouter():
    READ = "r"
    WRITE = "w"

    def __init__(self, dynamo_repository: DynamoRepository,
                 postgres_repository: PostgresRepository):
        """
        Set the migration phase and the Dynamo and Postgres repositories
        """
        self.PHASE: str = os.environ.get("MIGRATION_PHASE")
        self.dynamo_repository = dynamo_repository
        self.postgres_repository = postgres_repository

To start, a few values are established: the database slug constants are defined at the module level, the operation type constants are set in the class body, and the migration phase and database repositories are set in the class initialization.

The call method

The only public method in the class, and therefore the entry point into the DatabaseRouter, progresses like so:

  1. It accepts a function_name (enum), and any arguments passed to it
  2. The DatabaseServiceFunction (contains the mappings and operation type, more on this below) is retrieved and set to the variable service_function
  3. Database(s) that will be used are determined based on the migration phase and database operation
  4. Looping over the database list returned in step 3, the actual calls are made and stored in the results dict
  5. Finally the results from the "source of record" database, which is determined by the migration phase, are returned

# utils/database_router.py

class DatabaseRouter():

    ...

    def call(self, function_name, *args, **kwargs) -> Any:
        service_function = self._get_service_function(function_name)
        dbs_to_call = self._get_dbs(service_function.read_or_write)
        results = {}

        for db_slug in dbs_to_call:
            function_to_call = service_function[db_slug]
            result = function_to_call(*args, **kwargs)
            results[db_slug] = result

        source_db_slug = self._determine_source_db()
        return results[source_db_slug]

    ...

Helper Methods

There are two helper methods needed for the call method; one to get the desired databases to act upon, and another to determine which database should be read from (i.e. which is the database of record). As well, the DatabaseServiceFunction class is needed to enable the router functionality.

# utils/database_router.py

class DatabaseRouter():
    
    ... 

    def _get_dbs(self, read_or_write) -> List[str]:
        if read_or_write == self.READ:
            if self.PHASE in ('0', '1'):
                return [DYNAMO_SLUG]
            if self.PHASE in ('2', '3'):
                return [POSTGRES_SLUG]
        if read_or_write == self.WRITE:
            if self.PHASE == '0':
                return [DYNAMO_SLUG]
            if self.PHASE in ('1', '2'):
                return [DYNAMO_SLUG, POSTGRES_SLUG]
            if self.PHASE == '3':
                return [POSTGRES_SLUG]

    def _determine_source_db(self) -> str:
        if self.PHASE in ('0', '1'):
            return DYNAMO_SLUG
        if self.PHASE in ('2', '3'):
            return POSTGRES_SLUG

    ...

At this point our DatabaseRouter implementation is close to complete. You may have noticed that _get_service_function has not yet been implemented. It will be configured per repository in the child classes of DatabaseRouter, so for now it is best to define it and simply raise a NotImplementedError.

# utils/database_router.py

class DatabaseRouter():

    ...

    def _get_service_function(self, function_name) -> DatabaseServiceFunction:
        raise NotImplementedError

    ...

Also needed for the router functionality is this DatabaseServiceFunction class that is used to map the database repository functions to the service function and store the operation (read or write).

# utils/database_router.py
...

class DatabaseServiceFunction:

    def __init__(self, read_or_write, dynamo, postgres):
        self.read_or_write = read_or_write
        self.methods = {
            DYNAMO_SLUG: dynamo,
            POSTGRES_SLUG: postgres
        }

    def __getitem__(self, item):
        return self.methods[item]

Extending the Router for Each Repository

The DatabaseRouter is extended per database repository, each mapping to a single data entity, which for this example is Cars. The database repository functions are mapped in the router via the _get_service_function method:

# utils/cars_db_router.py

from enum import Enum

from utils.database_router import DatabaseRouter, DatabaseServiceFunction


class CarFunction(Enum):
    FETCH_CAR = 1
    UPSERT_CAR = 2

class CarsDBRouter(DatabaseRouter):

    def __init__(self, dynamo_repo, postgres_repo):
        super().__init__(dynamo_repo, postgres_repo)

    def _get_service_function(self, function_name) -> DatabaseServiceFunction:
        if not isinstance(function_name, CarFunction):
            raise TypeError(f"{function_name} must be of type CarFunction")

        if function_name == CarFunction.FETCH_CAR:
            return DatabaseServiceFunction(self.READ, self.dynamo_repository.fetch_car,
                                           self.postgres_repository.fetch_car)

        if function_name == CarFunction.UPSERT_CAR:
            return DatabaseServiceFunction(self.WRITE, 
                                            self.dynamo_repository.upsert_car, 
                                            self.postgres_repository.upsert_car)
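
Because each phase changes the routing behavior, it is worth covering the router with tests before releasing it. Here is a minimal sketch using pytest and fake in-memory repositories; both the fakes and the test layout are illustrative assumptions, not code from the migration itself.

# tests/test_cars_db_router.py

from utils.cars_db_router import CarsDBRouter, CarFunction


class FakeRepo:
    def __init__(self, name):
        self.name = name
        self.upserts = []

    def fetch_car(self, car_id):
        return f"{self.name}-car-{car_id}"

    def upsert_car(self, car):
        self.upserts.append(car)
        return car


def test_phase_one_writes_to_both_reads_from_dynamo(monkeypatch):
    # The router reads MIGRATION_PHASE at construction time
    monkeypatch.setenv("MIGRATION_PHASE", "1")
    dynamo, postgres = FakeRepo("dynamo"), FakeRepo("postgres")
    router = CarsDBRouter(dynamo, postgres)

    # Writes fan out to both databases in phase 1
    router.call(CarFunction.UPSERT_CAR, "some-car")
    assert dynamo.upserts == ["some-car"]
    assert postgres.upserts == ["some-car"]

    # Reads still come from the database of record (DynamoDB)
    assert router.call(CarFunction.FETCH_CAR, "42") == "dynamo-car-42"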

Data Domain

To successfully implement this pattern we need to be able to easily convert data entities from the DynamoDB representation to the Postgres representation. As mentioned above, I was fortunate to have a data domain layer already implemented, so I only needed to augment the existing class.

A cars domain might look like this:

# domain/car.py

from dataclasses import dataclass

@dataclass
class Car:
    make: str
    model: str
    year: str
    color: str
    is_4wd: bool

On this domain we could define a new method called to_psql_model and transform the data as necessary to fit the Postgres schema. For example, the year could be converted to an int and the make and model could be converted to enums, like so:

# domain/car.py

from enum import Enum

class Makes(Enum):
    FORD = 1
    TOYOTA = 2
    TESLA = 3


@dataclass
class Car:
    ...

    def to_psql_model(self) -> CarModel:
        car_kwargs = {
            "make": Makes[self.make],
            "model": Models[self.model],  # Models is an enum defined just like Makes
            "year": int(self.year),
            "color": self.color,
            "is_4wd": self.is_4wd,
        }
        return CarModel(**car_kwargs)
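
The CarModel referenced here is the SQLAlchemy model backing the cars table. It is not shown in this post, but a hypothetical version consistent with the conversions above might look like this (assuming a Models enum defined alongside Makes):

# data/models.py

from sqlalchemy import Boolean, Column, String, Integer, Enum as SqlEnum
from sqlalchemy.orm import declarative_base

from domain.car import Makes, Models  # assumes Models is defined alongside Makes

Base = declarative_base()

class Car(Base):
    __tablename__ = "cars"

    id = Column(String, primary_key=True)
    make = Column(SqlEnum(Makes), nullable=False)
    model = Column(SqlEnum(Models), nullable=False)
    year = Column(Integer, nullable=False)
    color = Column(String, nullable=False)
    is_4wd = Column(Boolean, nullable=False)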

It is also necessary to implement a from_psql_model method so that when retrieving data from Postgres, the objects can be transformed into what the application code expects.

# domain/car.py

@dataclass
class Car:
    ...

    @staticmethod
    def from_psql_model(car: CarModel):
        car_kwargs = {
            "id": car.id,  # assumes the full Car dataclass (elided above) includes an id field
            "make": car.make.name,
            "model": car.model.name,
            "year": str(car.year),
            "color": car.color,
            "is_4wd": car.is_4wd,
        }
        return Car(**car_kwargs)

These methods would be used in the Postgres repository to convert the data to and from the domain. Putting them into action might look like this:

# repositories/psql_cars_repository.py

import logging

from domain.car import Car as CarDomain
from data.models import Car as CarModel

LOGGER = logging.getLogger(__name__)

class PsqlCarsRepository():

    ...

    def fetch_car(self, id: str) -> CarDomain:
        try:
            car = self.session.query(CarModel).filter(CarModel.id == id).first()
            return CarDomain.from_psql_model(car)
        except Exception as e:
            LOGGER.error(f"Failed to retrieve car {id} - {e}")
            self.session.rollback()
            self.close()
            raise e

    def upsert_car(self, car: CarDomain) -> CarDomain:
        try:
            car_model = car.to_psql_model()
            self.session.add(car_model)
            self.session.commit()
        except Exception as e:
            LOGGER.error(f"Failed to save car: {car} - {e}")
            self.close()
            raise e
        return CarDomain.from_psql_model(car_model)
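
The DynamoDB repository is never shown in this post, but for completeness a boto3-based counterpart with the same interface might look roughly like this. The table name, the id partition key, and the direct dataclass mapping are all assumptions.

# repositories/dynamo_cars_repository.py

from dataclasses import asdict

import boto3

from domain.car import Car as CarDomain


class DynamoCarsRepository():

    def __init__(self, table_name: str = "cars"):
        self.table = boto3.resource("dynamodb").Table(table_name)

    def fetch_car(self, id: str) -> CarDomain:
        item = self.table.get_item(Key={"id": id}).get("Item")
        # Assumes the item's attributes line up with the Car dataclass fields
        return CarDomain(**item)

    def upsert_car(self, car: CarDomain) -> CarDomain:
        # put_item overwrites any existing record with the same key, i.e. an upsert;
        # assumes the full Car dataclass includes the id partition key
        self.table.put_item(Item=asdict(car))
        return car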

Some Tips From the Field

When to Migrate Data?

When we moved into phase one (writing to both DynamoDB and Postgres), we ran into errors when updating some records. The cause was that these records did not yet exist in Postgres. If it is possible to migrate data from DynamoDB to Postgres before moving to phase one, that is a good idea, although it may not be practical for various reasons; it may require downtime, the data set may be huge, and so on.

A complete migration should be performed before going into phase two. Before this migration begins, all write operations should be suspended and only re-enabled once the migration is complete and the migration phase has been set to two. If disabling writes is not possible for whatever reason, there may be creative ways to allow writes (but not updates) to continue, but I would advise against this path as it risks losing data and may not even be possible for some applications.
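
For reference, the one-off backfill can be as simple as scanning the DynamoDB table and pushing each item through the Postgres repository. The sketch below is illustrative only; the cars table name, the assumption that item attributes map directly onto the Car dataclass, and the lack of batching or error handling are all simplifications.

# scripts/backfill_cars.py

import boto3

from domain.car import Car as CarDomain
from repositories.psql_cars_repository import PsqlCarsRepository


def backfill_cars(postgres_repo: PsqlCarsRepository) -> int:
    # Scan the entire DynamoDB table, following the pagination cursor
    table = boto3.resource("dynamodb").Table("cars")
    scan_kwargs = {}
    migrated = 0
    while True:
        page = table.scan(**scan_kwargs)
        for item in page["Items"]:
            # Assumes the item's attributes line up with the Car dataclass fields
            postgres_repo.upsert_car(CarDomain(**item))
            migrated += 1
        if "LastEvaluatedKey" not in page:
            break
        scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
    return migrated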

As Few Changes as Possible

This point is worth mentioning again. When performing this migration it may be tempting to make improvements to the code based on an increased understanding of the business domain, such as renaming fields or otherwise updating functionality. In our experience this creates new variables in an already consequential process and should be avoided. Keep the goal, getting off of DynamoDB, in focus; take note of discoveries made and fold them into upcoming sprints.

Object Reference

A tricky bug we encountered was when an object was mutated by one of the database paths just before persisting it to the database. In our case we were storing additional data in Postgres compared to what was originally stored in DynamoDB. To keep DynamoDB backward compatible we deleted the new fields before saving a record. Initially we were deleting fields from a shared reference to the original object and ran into Null violations in the Postgres database. The simple fix was to make a deep copy of the object before deleting the fields (which is generally best practice and should have been done already), but it was a time consuming bug to track down.
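
In code, the fix boils down to copying before mutating, along the lines of this sketch (the Postgres-only field names are hypothetical):

import copy

def to_dynamo_item(car_dict: dict) -> dict:
    # Deep copy first so the Postgres path still sees the full object
    dynamo_item = copy.deepcopy(car_dict)
    # Drop hypothetical Postgres-only fields before writing back to DynamoDB
    dynamo_item.pop("created_at", None)
    dynamo_item.pop("updated_at", None)
    return dynamo_item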

Test Rollbacks

This is an easy thing to miss and caused us some heartburn in our initial attempt to rollback from phase two to phase one. The problem was that the data saved to DynamoDB was incorrectly mutated and broke logic when we switched from reading from Postgres back to reading from DynamoDB. This error resulted in downtime and a fair amount of manual cleanup. Testing the rollback in a lower environment would have saved time and spared us the firefighting.
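
One cheap way to rehearse a rollback is to drive the router through the phase change in a test, again with fake repositories standing in for the real databases (a sketch building on the test setup shown earlier):

def test_rollback_from_phase_two_to_phase_one(monkeypatch):
    dynamo, postgres = FakeRepo("dynamo"), FakeRepo("postgres")

    # Phase 2: Postgres is the database of record, but DynamoDB is still written to
    monkeypatch.setenv("MIGRATION_PHASE", "2")
    CarsDBRouter(dynamo, postgres).call(CarFunction.UPSERT_CAR, "some-car")
    assert "some-car" in dynamo.upserts

    # Rollback to phase 1: reads must come from DynamoDB again and still make sense
    monkeypatch.setenv("MIGRATION_PHASE", "1")
    rolled_back_router = CarsDBRouter(dynamo, postgres)
    assert rolled_back_router.call(CarFunction.FETCH_CAR, "42") == "dynamo-car-42"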

Conclusion

By providing a fail-safe process, a very complex and consequential migration can be accomplished with confidence and safety. This four-phase middleware pattern offers incremental steps for your team to digest and execute at a pace that makes sense for you. As well, the stability of each phase and the ability to roll back provide breathing room to verify data and application behavior in a manageable scope. I hope this experience proves valuable to you; thanks for reading, and best of luck in your next migration.
