
Replace home-grown PostgreSQL database migrations with something modern #754

Closed
pypt opened this issue Jan 7, 2021 · 6 comments

pypt commented Jan 7, 2021

Hey James!

So, are you up for fixing up some past mistakes of mine?

As you might have noticed (and as is described in the docs), we use a self-made database migration system that I came up with. To create a migration, one has to:

  1. Update apps/postgresql-server/schema/mediawords.sql which serves as our reference schema file.
  2. Add a new migration under apps/postgresql-server/schema/migrations/mediawords-XXX0-XXX1.sql that will get imported when the production database actually gets migrated (a hypothetical sketch of such a file follows this list).

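For context, one of those migration files is just plain SQL plus some version bookkeeping at the end; a hypothetical sketch (the ALTER TABLE is invented, and the exact set_database_schema_version() invocation may differ from what's actually in the repo):

-- apps/postgresql-server/schema/migrations/mediawords-XXX0-XXX1.sql (hypothetical)

-- The schema change itself (made up for illustration):
ALTER TABLE stories
    ADD COLUMN some_new_column TEXT;

-- The home-grown version bookkeeping that a new tool would replace:
SELECT set_database_schema_version();
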
This process kinda works, but there are a few issues with it:

  • Migrations don't get automatically tested at any point. When postgresql-server gets built into a Docker image, only the reference schema/mediawords.sql gets imported, which is not what we do at all when deploying. Instead, migrations get applied on postgresql-server container start: bin/postgresql_server.sh (via bin/apply_migrations.sh) starts the server on an unpublished port (IIRC 1234), tests whether the schema of the database is up to date, applies the migrations if needed, stops the server and then restarts it on a "proper" port. If a migration fails, the container never really starts (which is okay because we don't want a database instance with an out-of-date schema, but also not okay because we should somehow learn that the migration is broken before we even try deploying it).
  • Duplicate work. To add a table, one has to edit both schema/mediawords.sql and a migration file in schema/migrations/.

So, a new migration system is due. Something like this would be a tremendous improvement IMHO:

  1. We get rid of the mediawords.sql file altogether and manage the schema only via migrations.
  2. At build time, we import all the migrations sequentially to end up with a live schema.
  3. At runtime, we maybe follow a similar approach to what's currently being done: start a "private" instance of PostgreSQL, run a migration tool against it to get it up to date, and then restart it to make it public (sketched below).

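To illustrate item 3, a minimal sketch of the migrate-then-restart dance, assuming pg_ctl and the same private port as today; the migration tool invocation is a placeholder for whatever we end up picking:

# Start PostgreSQL on a private, unpublished port
pg_ctl -D /var/lib/postgresql/data -o '-p 1234' -w start

# Run the (yet to be chosen) migration tool against the private instance;
# bail out so that the container never starts with a half-migrated schema
run_chosen_migration_tool --port 1234 || exit 1

# Restart on the "proper" port to make the instance public
pg_ctl -D /var/lib/postgresql/data -w stop
pg_ctl -D /var/lib/postgresql/data -o '-p 5432' -w start
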
Or is there a better way to do migrations these days?

Vague to-do:

  1. Come up with a few migration tools that you like, describe them in this issue and pick one that you think we should go with (I mentioned Flyway during our meeting but it doesn't have to be Flyway at all, just something that makes the most sense to you).
  2. (Attempt to) rewrite current migrations to said tool.
    • I think essentially you'll have to rename existing migrations somehow, get rid of the duplicate comments at the top of each migration file, remove the set_database_schema_version() stuff, and quite possibly fix a few syntax errors here and there (let me know if you get stuck on those, I might be able to help out).
    • We have quite a few migrations, and redoing them with a new migration tool might be a bit of work (even with the help of regex find-and-replace), but trying out the new tool of choice on that many migrations might prove useful, as we'd get to find out things like: 1) how fast does it work with a hundred migrations? 2) how does it handle migrations that can't be run in a transaction? 3) how does it deal with errors? etc.
    • With that said, feel free to skip the migrations that are better off being skipped, e.g. the ones that create a table that gets dropped right away in a subsequent migration.
  3. Loosely compare schema/mediawords.sql with whatever schema got imported through those migrations and make sure that they more or less look like the same thing.
    • There might (will!) be discrepancies between what gets imported via the main mediawords.sql schema file and what ends up in the database after going through all of the migrations. If it's just column order or something like that that's different, then that's fine, but I think it's important to not miss a table or two.
    • The simplest way of comparing those would be to import schema/mediawords.sql into one database, all of the migrations into another, pg_dump both databases and review a diff between them (see the sketch after this list).
  4. Update the docs to describe the new migration process.

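For step 3, a minimal sketch of the comparison, assuming two scratch databases (names made up) and migration files that sort correctly by filename:

createdb schema_from_reference
createdb schema_from_migrations

psql -v ON_ERROR_STOP=1 -d schema_from_reference -f apps/postgresql-server/schema/mediawords.sql
for migration in apps/postgresql-server/schema/migrations/*.sql; do
    psql -v ON_ERROR_STOP=1 -d schema_from_migrations -f "$migration"
done

pg_dump --schema-only schema_from_reference > reference.sql
pg_dump --schema-only schema_from_migrations > migrated.sql
diff -u reference.sql migrated.sql | less
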
Notes, considerations and a wishlist:

  • While schema/mediawords.sql has to go, it would still be tremendously useful to have a single (auto-generated) file with the currently active schema for our own reference (i.e. something that you could look at while developing things). Maybe the container image build process could import all the migrations and then do a quick schema dump to a separate file that we could then look at? Something like:

    RUN import_migrations.sh
    
    RUN pg_dump > /mediawords.sql

    So that later one could do:

    docker run dockermediacloud/postgresql-server cat /mediawords.sql > mediawords.sql

    to extract the latest schema.

    Or is there some sort of a better way?

    • One issue that comes to mind with this approach (pg_dump-generated reference schema files) is that we'd lose the -- comments that we have in the schema. Some of these comments are not particularly useful (e.g. the feeds.type column is described as -- Feed type :)) but others are something that we'd like to retain. Maybe the most useful comments could be ported to COMMENT ON statements (see the sketch after this list)? Or would this be too much of a hassle?
  • It would be nice to retain some way of applying any given migration manually before deploying the rebuilt postgresql-server service. Sometimes migrations that work fine on an empty / testing database don't work quite as well on a production one (e.g. CREATE INDEX on a table with a billion rows), so sometimes it's useful to apply the migration manually before deploying anything. If the migration tool could print out the SQL that it would run against the live database, instead of insisting on running the SQL itself, that would be pretty great! If not, then oh well, we'll think of something.

  • If it so happens that your migration tool of choice uses Java, consider using (and rebasing postgresql-base on) java-base app which is a base container image for apps that use Java.

  • If you don't like it that migrations get applied at deployment time, that's up for discussion too - it just seemed to make sense to me so I did it that way, but maybe there's a better point in time to apply those migrations.

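To illustrate the COMMENT ON idea from the wishlist above: unlike free-standing -- comments in a schema file, comments attached with COMMENT ON do survive a pg_dump. The column comment below is the actual one from the schema; the table-level text is invented for illustration:

COMMENT ON COLUMN feeds.type IS 'Feed type';
COMMENT ON TABLE feeds IS 'Feeds that we periodically fetch';
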
Years ago, I started doing this in a flyway branch but never finished the work. What might be useful for you in this branch is the pre-migrations full schema file (the first "migration", so to speak) and a bunch of migrations with some fixed-up SQL.

While doing the task, please keep in mind that if something seems to take too much time (e.g. porting existing migrations or rewriting comments into COMMENT ON statements), we could always decide to skip the too-time-consuming parts (porting the existing migrations is just a nice-to-have, and comments on various tables could potentially be fished out some other way). Don't let yourself get scope-creeped!

As always, do let me know if you have any questions and / or need any help.

jtotoole (Contributor) commented:

Alrighty, I've reviewed a variety of options in this space. Taking into consideration the fact that we don't want to use an ORM and we want to write our migrations in plain SQL or PL/pgSQL, the top contenders in my mind are Flyway in combination with pgTAP for testing, or Sqitch (maybe also with pgTAP, but that's not strictly necessary).

You're already familiar with Flyway, so I won't go too much into its details.

Pros:

  • As you've seen, very easy to grok and set up: just make a migrations folder, number your migrations, and you're all set (a sketch follows the cons list below).
  • Easy to tie into a Dockerfile/build system
  • Widely used, people seem to like it

Cons:

  • ROLLBACK isn't available in the free version (the paid version is $3000/year for 10 production schemas), although the docs say migrations are automatically rolled back if they fail, so I'm not sure if this is a big deal or not.
  • --dry-run isn't available in the free version, so I'm not sure how we'd enable manual migration deployment
  • Free version doesn't work with Google Cloud Platform, and limits you to 100 migrations on AWS; worth noting if a move to a cloud provider ends up being something we consider more seriously.
  • If for some reason we wanted to reorder our migrations, adjusting the .sql file names would be a pain.

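For reference, the Flyway convention is version-prefixed SQL files plus a small config file; a minimal sketch (paths and connection values made up):

# sql/V1__initial_schema.sql
# sql/V2__add_some_table.sql
#
# flyway.conf:
#   flyway.url=jdbc:postgresql://localhost:5432/mediacloud
#   flyway.user=mediacloud
#   flyway.locations=filesystem:sql

flyway -configFiles=flyway.conf migrate
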
Sqitch, meanwhile, is a system for tracking changes and automatically testing them before deploy. When starting on a new migration, you run a command that creates three .sql files: the migration itself, a statement that verifies whether or not the migration succeeded, and a script that reverts the migration in case of failure. These file types are housed in three separate directories (/deploy, /verify, and /revert). The sqitch deploy command then runs each of these files in sequence for each migration, and adds a line to a .plan file that contains the migration order. Sqitch tracks changes using a git-like system that lets you step back through your migration history and revert to any particular point, or to reorder your migrations (you can also reorder them by editing the .plan file).

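A minimal sketch of the per-change workflow (the change name and target URI are made up):

# Generates deploy/add_foo_table.sql, verify/add_foo_table.sql and
# revert/add_foo_table.sql, and appends the change to sqitch.plan
sqitch add add_foo_table -n 'Add the foo table'

# Run the deploy scripts (and, depending on configuration, the verify scripts)
sqitch deploy db:pg://localhost/mediacloud

# Step back one change if something went wrong
sqitch revert --to @HEAD^
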
Pros:

  • Builds QA into the migration process; we can be more confident in deploying
  • Fully FOSS
  • Very well-documented/maintained, good community
  • Offers --dry-run option
  • Easy to tie into a Dockerfile/build system

Cons:

  • Writing migration, verify, and revert scripts, while logical, perhaps gets us away from the initial goal of simplifying the migration workflow. Some people also go beyond verify scripts and add a separate tests directory executed via pgTAP (a tiny sketch follows the cons list below). The scripts in /tests and /verify end up being almost exactly the same; the only reason to do this, I guess, would be that pgTAP is purpose-built for test logging? I'm not 100% clear on this.
  • More complicated than Flyway, steeper learning curve (though it's not wildly complex).
  • Confusing name (apparently it's pronounced "skitch")

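For the record, a tiny sketch of what such a pgTAP test would look like (the table and column names are made up):

BEGIN;
SELECT plan(2);

-- Assert the migration's observable effects
SELECT has_table('public', 'foo', 'foo table exists');
SELECT has_column('public', 'foo', 'foos_id', 'foo has its id column');

SELECT * FROM finish();
ROLLBACK;
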
Stray thoughts:

  • Yandex has a simple, Python-based tool (pgmigrate) that works very similarly to Flyway. They're a large/sophisticated company, the project is well-maintained, it supports ROLLBACK, and it's fully FOSS; so, all in all, maybe also worth considering. The most obvious downsides are that it's poorly documented, and there doesn't seem to be much of a community.
  • Although it adds an extra step, I like the idea of bringing testing to migrations and incorporating that into our CI/CD flow
  • +1 on using pg_dump to create a reference schema file
  • I'm okay with scanning through mediawords.sql and porting the inline comments to COMMENT ON statements. Tedious, but I've got some two-hour techno mixes that will get me through, plus I think it will be a useful exercise in that it will increase my familiarity with the database.


pypt commented Jan 20, 2021

Thanks, this is very thorough!

ROLLBACK isn't available in the free version (the paid version is $3000/year for 10 production schemas), although the docs say migrations are automatically rolled back if they fail, so I'm not sure if this is a big deal or not.

By that are you referring to dry runs or undos?

I would feel inclined to just downright give up on the idea of writing those "down" (revert) migrations. First of all, it's pretty hard to write a "down" migration for an "up" migration like this one:

DO $$
DECLARE
    new_table_name TEXT;
    tables CURSOR FOR
        SELECT tablename
        FROM pg_tables
        WHERE schemaname = 'public'
          AND tablename LIKE 'download_texts_p_%'
        ORDER BY tablename;
BEGIN
    FOR table_record IN tables LOOP

        SELECT REPLACE(table_record.tablename, 'download_texts_p_', 'download_texts_')
            INTO new_table_name;

        EXECUTE '
            ALTER TABLE ' || table_record.tablename || '
            RENAME TO ' || new_table_name || '
        ';
        EXECUTE '
            ALTER INDEX ' || table_record.tablename || '_pkey
            RENAME TO ' || new_table_name || '_pkey
        ';
        EXECUTE '
            ALTER INDEX ' || table_record.tablename || '_downloads_id_idx
            RENAME TO ' || new_table_name || '_downloads_id_idx
        ';
        EXECUTE '
            ALTER TABLE ' || new_table_name || '
            RENAME CONSTRAINT ' || table_record.tablename || '_downloads_id_fkey
            TO ' || new_table_name || '_downloads_id_fkey
        ';
        EXECUTE '
            ALTER TRIGGER ' || table_record.tablename || '_test_referenced_download_trigger
            ON ' || new_table_name || '
            RENAME TO ' || new_table_name || '_test_referenced_download_trigger
        ';

    END LOOP;
END
$$;

(Here we iterate over the download_texts_p_* tables and rename them to download_texts_*, together with the accompanying indexes, constraints and triggers.)

Also, I can't quite figure out the reason for having those "down" migrations in the first place:

  1. If there are syntax errors in the "up" migration, then the tool will just fail importing the migration script at some point, and then, provided that the migration is run in a transaction (which it should be in most cases), it will get reverted automagically (as the tool won't get to run COMMIT after a failed migration).
  2. However, if the migration contains logic errors (let's say the migration erroneously renames those download_texts_p_* tables to two_hour_techno_mixes_* instead of download_texts_*), then it will pass with flying colours, and reverting such a migration would involve someone coming in with tape + WD40 (i.e. logging into the database and running some SQL commands manually), so the "down" migration never gets run anyway, as we didn't know we were about to make the mistake in the first place.
  3. There might be cases when we rename table foo to bar and in a week change our minds and revert it back to foo, in which case the "down" migrations become useful. However, those cases are rare, plus we might then want to rename bar to baz rather than back to foo.

Am I missing something here? Feel free to disagree, maybe I'm downright wrong somewhere.

Although it adds an extra step, I like the idea of bringing testing to migrations and incorporating that into our CI/CD flow

I would be of the opinion that we're better off testing the application that uses the schema rather than the schema migrations themselves. I don't think there's much of a point in verifying that a table really exists right after executing CREATE TABLE, and if there's some more advanced logic in the migration (e.g. a partition table rename as in the example above), one can verify that it did whatever it was supposed to do in the migration itself. Lastly, plpgsql is quite an awful language to write stuff in (as compared to Python) :)

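For instance, a migration with nontrivial logic can assert its own postconditions inline; a minimal sketch following the partition rename example above:

DO $$
BEGIN
    -- After the rename, no tables with the old prefix should remain
    IF EXISTS (
        SELECT 1
        FROM pg_tables
        WHERE schemaname = 'public'
          AND tablename LIKE 'download_texts_p_%'
    ) THEN
        RAISE EXCEPTION 'download_texts_p_* tables still exist after the rename';
    END IF;
END
$$;
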
Again, maybe I don't see something here? What is it that those tests would test for?

Yandex has a simple, Python-based tool (pgmigrate) that works very similarly to Flyway. They're a large/sophisticated company, the project is well-maintained, it supports ROLLBACK, and it's fully FOSS; so, all in all, maybe also worth considering. The most obvious downsides are that it's poorly documented, and there doesn't seem to be much of a community.

I kinda like Yandex's KISS approach to migrations - their tool seems to be something that one of us could code together in a weekend with a few Red Bulls :), meaning that it wouldn't be hard to "migrate" away from this migration tool if need be. Yeah, the docs are a bit funny.

They get a few things right:

  • There are some migrations that can be run within a transaction and some that can't; they seem to acknowledge that.
  • Sometimes one might want to run a shell script or something like that before a migration; there seems to be some support for that.

I wonder why they insist that the migration files be ASCII though: https://github.com/yandex/pgmigrate/blob/master/doc/tutorial.md#utf-8-migrations

I'm okay with scanning through mediawords.sql and porting the inline comments to COMMENT ON statements. Tedious, but I've got some two-hour techno mixes that will get me through, plus I think it will be a useful exercise in that it will increase my familiarity with the database.

Again, don't let yourself suffer too much over these comments - feel free to skip the non-useful ones. If you find it too time-consuming (e.g. taking more than two runs of Dekmantel Podcast 013 - Linkwood) to copy those comments over, don't hesitate to cut corners. Or maybe there's a better way to preserve some of this information than COMMENT ON statements, which kinda feel cumbersome - maybe just leave the comments in the migrations themselves?

jtotoole commented:

Update for anyone else reading this issue: Linas and I discussed and actually settled on trying Yandex's tool. There don't seem to be super large differences in functionality vs. Flyway, it's fully FOSS, and it wouldn't be crazy complicated to rewrite on our own if we ever needed to do so. I'm going to take a stab at implementing it and will link to the PR here when ready.

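For the record, pgmigrate's expected layout and invocation look roughly like this (going from its README; treat the exact flags as an assumption on my part):

# migrations/V0001__initial_schema.sql
# migrations/V0002__whatever_comes_next.sql

pgmigrate -d migrations -c 'host=localhost dbname=mediacloud' -t latest migrate
pgmigrate -d migrations -c 'host=localhost dbname=mediacloud' info
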

pypt commented Jul 22, 2021

Fixed in #759

pypt closed this as completed Jul 22, 2021

pypt commented Aug 4, 2021

We forgot about one thing - when pgmigrate tries to apply migrations to the current production database (or any other existing dataset) for the first time, it will assume that the database is currently at schema version 0, i.e. it will try applying V0001__initial_schema.sql, which will fail (as all of these tables already exist).

So, we need to tell pgmigrate that the current schema is actually at version 1 already. pgmigrate seems to store its schema version in a schema_version table (which gets created here), so we can just create it ourselves:

DO $$
BEGIN
    IF NOT EXISTS (SELECT 1 FROM pg_type WHERE typname = 'schema_version_type') THEN
        CREATE TYPE public.schema_version_type AS ENUM (
            'auto',
            'manual'
        );
    END IF;
END $$;


CREATE TABLE IF NOT EXISTS public.schema_version (
    version         BIGINT NOT NULL PRIMARY KEY,
    description     TEXT NOT NULL,
    type            public.schema_version_type NOT NULL DEFAULT 'auto',
    installed_by    TEXT NOT NULL,
    installed_on    TIMESTAMP WITHOUT TIME ZONE DEFAULT now() NOT NULL
);


INSERT INTO public.schema_version (
    version,
    description,
    type,
    installed_by,
    installed_on
) VALUES (
    1,
    'initial schema',
    'auto',
    'mediacloud',
    NOW()
) ON CONFLICT (version) DO NOTHING;

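(Assuming I remember the CLI correctly, pgmigrate info run after this bootstrap should report the database as already being at version 1, and a subsequent migrate would start from V0002.)
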
I've run this on our production database and will ask the CfA folks to do the same on their end.


jtotoole commented Aug 4, 2021

excellent catch!
