Introduce timeout safeguard for migration sql queries (#3633)
The MySQL statement timeout only works for SELECTs. Additionally,
MySQL does not support transactional rollbacks of DDL statements.
Thus there is a risk of exiting in an unclean state, which might
cause CC runtime errors; moreover, a timeout that applies only to
SELECTs is of little functional value.

Add max_migration_statement_runtime_in_seconds into default config

For consistency, since other optional parameters are also
defined there explicitly.
FloThinksPi authored Apr 9, 2024
1 parent fe1cf1c commit 18a2fda
Showing 6 changed files with 34 additions and 9 deletions.
1 change: 1 addition & 0 deletions config/cloud_controller.yml
@@ -322,6 +322,7 @@ newrelic_enabled: false
max_annotations_per_resource: 200
max_labels_per_resource: 50
max_migration_duration_in_minutes: 45
max_migration_statement_runtime_in_seconds: 30
db:
log_level: 'debug'
ssl_verify_hostname: false
1 change: 1 addition & 0 deletions lib/cloud_controller/config_schemas/base/api_schema.rb
@@ -86,6 +86,7 @@ class ApiSchema < VCAP::Config
newrelic_enabled: bool,

optional(:max_migration_duration_in_minutes) => Integer,
optional(:max_migration_statement_runtime_in_seconds) => Integer,
db: {
optional(:database) => Hash, # db connection hash for sequel
max_connections: Integer, # max connections in the connection pool
1 change: 1 addition & 0 deletions lib/cloud_controller/config_schemas/base/migrate_schema.rb
@@ -7,6 +7,7 @@ class MigrateSchema < VCAP::Config
define_schema do
{
optional(:max_migration_duration_in_minutes) => Integer,
optional(:max_migration_statement_runtime_in_seconds) => Integer,

db: {
optional(:database) => Hash, # db connection hash for sequel
14 changes: 10 additions & 4 deletions lib/cloud_controller/db_migrator.rb
@@ -7,12 +7,20 @@ class DBMigrator
def self.from_config(config, db_logger)
VCAP::CloudController::Encryptor.db_encryption_key = config.get(:db_encryption_key)
db = VCAP::CloudController::DB.connect(config.get(:db), db_logger)
new(db, config.get(:max_migration_duration_in_minutes))
new(db, config.get(:max_migration_duration_in_minutes), config.get(:max_migration_statement_runtime_in_seconds))
end

def initialize(db, max_migration_duration_in_minutes=nil)
def initialize(db, max_migration_duration_in_minutes=nil, max_migration_statement_runtime_in_seconds=nil)
@db = db
@timeout_in_minutes = default_two_weeks(max_migration_duration_in_minutes)

@max_statement_runtime_in_milliseconds = if max_migration_statement_runtime_in_seconds.nil? || max_migration_statement_runtime_in_seconds <= 0
30_000
else
max_migration_statement_runtime_in_seconds * 1000
end

@db.run("SET statement_timeout TO #{@max_statement_runtime_in_milliseconds}") if @db.database_type == :postgres
end

def apply_migrations(opts={})
@@ -32,12 +40,10 @@ def wait_for_migrations!
logger = Steno.logger('cc.db.wait_until_current')

logger.info('waiting indefinitely for database schema to be current') unless db_is_current_or_newer_than_local_migrations?

timeout_message = 'ccdb.max_migration_duration_in_minutes exceeded'
Timeout.timeout(@timeout_in_minutes * 60, message: timeout_message) do
sleep(1) until db_is_current_or_newer_than_local_migrations?
end

logger.info('database schema is as new or newer than locally available migrations')
end

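The fallback logic in `DBMigrator#initialize` above can be sketched as a standalone method (a sketch only; in the actual class the result is stored in `@max_statement_runtime_in_milliseconds`):

```ruby
# Sketch of the conversion performed in DBMigrator#initialize:
# nil or non-positive config values fall back to the 30-second default,
# everything else is converted from seconds to milliseconds.
def statement_runtime_in_milliseconds(seconds)
  return 30_000 if seconds.nil? || seconds <= 0

  seconds * 1000
end
```

With the default of 30 seconds from `cloud_controller.yml`, this yields the `30000` that is passed to Postgres via `SET statement_timeout`.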
9 changes: 5 additions & 4 deletions spec/migrations/Readme.md
@@ -7,7 +7,7 @@ Sequel migrations modify the database schema. To write high-quality migrations,
At present, the Cloud Controller (CC) supports both Postgres and MySQL in various versions. Since these databases behave differently and offer different feature sets, it's necessary to reflect these differences in the migrations. Failing to do so may result in differing schemas for MySQL and Postgres after the same operation.

1. Postgres supports transactional DDL (Data Definition Language) statements but MySQL does not. This means when a failure occurs and a transaction is rolled back, only Postgres will correctly roll back the altered changes. MySQL commits every DDL statement immediately, which makes it impossible to roll back any schema change.
1. While Postgres automatically resets table locks on rollback or commit of a transaction, MySQL does not. If a migration that locks tables doesn't unlock them as part of error handling, MySQL can permanently block a table until it is manually unlocked again. Moreover, Sequel doesn't recognize how to lock tables, necessitating use of raw SQL queries, with differing syntaxes for both DBs.
1. While Postgres automatically resets table locks on rollback or commit of a transaction, MySQL does not. If a migration that locks tables doesn't unlock them as part of error handling, MySQL can permanently block a table until it is manually unlocked again. Moreover, Sequel doesn't recognize how to lock tables, necessitating use of raw SQL queries, with differing syntax for both DBs.
1. If a MySQL table is locked and sub-queries are used, those sub-queries must be named (aliased) and locked as well, even if they don't exist at the time of locking.
1. When dropping columns, Postgres automatically drops all related indices and constraints entirely while MySQL only removes the affected columns from the index/constraint.
1. When renaming columns, both MySQL and Postgres auto-rename columns in indices. However, while Postgres cascades the rename into views, MySQL doesn't and the view still references the old column name, thus breaking it. On MySQL, views must therefore generally be dropped and recreated after renaming a column.
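The need for raw, backend-specific locking SQL noted above can be illustrated with a hypothetical helper (the helper name and table are illustrative, not part of the CC codebase):

```ruby
# Hypothetical helper returning a backend-specific table-lock statement.
# Sequel has no built-in table-lock API, so migrations must issue raw SQL
# whose syntax differs between Postgres and MySQL.
def lock_table_sql(database_type, table)
  case database_type
  when :postgres
    "LOCK TABLE #{table} IN ACCESS EXCLUSIVE MODE"
  when :mysql
    "LOCK TABLES #{table} WRITE"
  else
    raise ArgumentError, "unsupported database type: #{database_type}"
  end
end
```

In a migration this would be invoked roughly as `run(lock_table_sql(database_type, 'apps'))`. Note that the MySQL form also requires a matching `UNLOCK TABLES`, while Postgres releases the lock on commit or rollback.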
@@ -24,8 +24,8 @@ To create resilient and reliable migrations, follow these guidelines:

1. Do not use any Cloud Controller code in a migration. Changes in the model might modify old migrations and change their behavior and the behavior of their tests. Use raw Sequel operations instead, as you're aware of the exact database schema at that migration time. For instance, if you use a model in a migration and a later migration changes that model's table name, the initial migration would fail because the table does not yet exist when the migration runs and uses the model from the CC's codebase.
1. Aim to write separate [up and down](https://sequel.jeremyevans.net/rdoc/files/doc/migration_rdoc.html#top) Sequel migrations. The `change do` command automatically generates the down part but occasionally behaves unpredictably and isn't always able to compute the opposite of what the migration accomplished. By defining `up do` and `down do` separately, you'll have full control over the down part.
1. Opt for a larger number of smaller transactions in migrations rather than larger transactions. If a single change fails, the entire migration must be manually rolled back. Atomic migrations are easier to roll back. Each migration runs in a database synchronisation call.
1. Write idempotent migrations. Even though a row in the database prevents a migration from running twice, you can't be sure the database remains unchanged after an error in the migration since schema changes can't always be transactionally rolled back e.g. on MySQL. Make your migrations rerunnable. This often isn't the case when using Sequel functions with default values. For instance, [drop_index](https://www.rubydoc.info/github/jeremyevans/sequel/Sequel%2FSchema%2FAlterTableGenerator:drop_index) by default throws an error if an index you want to drop has already been dropped. In a migration, you'll want to pass the optional parameter `if_exists: true` so the statement is idempotent and if a migration fails in the middle, it can be rerun. Sometimes you will also find different functions that offer idempotency. Like `create_table` (just create table), `create_table!` (drop in any case and create table) and `create_table?` (create table if not already exists) so also look out for such options. In case the needed Sequel function does not offer a parameter/function for idempotency, one must test against the schema and just run statements conditionally. E.g. check if an index already exists with `... if self.indexes(:my_table)[:my_index].nil?` before creating it.
1. Opt for a larger number of smaller transactions in migrations rather than larger transactions. If a single change fails, the entire migration must be manually rolled back. Atomic migrations are easier to roll back. Each migration runs in a database synchronization call.
1. Write idempotent migrations. Even though a row in the database prevents a migration from running twice, you can't be sure the database remains unchanged after an error in the migration since schema changes can't always be transactionally rolled back, e.g. on MySQL. Make your migrations rerunnable. This often isn't the case when using Sequel functions with default values. For instance, [drop_index](https://www.rubydoc.info/github/jeremyevans/sequel/Sequel%2FSchema%2FAlterTableGenerator:drop_index) by default throws an error if an index you want to drop has already been dropped. In a migration, you'll want to pass the optional parameter `if_exists: true` so the statement is idempotent and if a migration fails in the middle, it can be rerun. Sometimes you will also find variant functions that offer idempotency, like `create_table` (just create the table), `create_table!` (drop in any case and create the table) and `create_table?` (create the table only if it does not already exist), so look out for such options. In case the needed Sequel function does not offer a parameter/function for idempotency, one must test against the schema and run statements conditionally, e.g. check whether an index already exists with `... if self.indexes(:my_table)[:my_index].nil?` before creating it.
1. During migrations that execute DML (Data Manipulation Language) statements, consider locking the table while altering data. For instance, if you're removing duplicates and setting a unique constraint, you must prevent new inserts while the migration runs. Otherwise, setting the unique constraint later could fail. Note that Sequel doesn't offer built-in functions for table locks, so you'll need to use raw SQL queries that vary depending on the Database backend they run on.
1. Separate data-changing migrations from schema-changing DDL statements to leverage at least the database rollback function for DML Statements. Explicitly wrap data changes in a transaction that doesn't include DDL statements. Avoid including DDL in that transaction otherwise MySQL will autocommit data changes and a rollback is impossible.
1. Try to handle DML (Data Manipulation Language) statements entirely within the DB, avoiding any looping over returned data in the migration process. This helps to reduce the time a table is locked. For example, instead of finding all duplicates with a select and then iterating over the results in Ruby and deleting the rows, use a subquery to let the DB perform the heavy lifting. Since these operations should be performed with a table lock to prevent new duplicates, it is advantageous to minimize runtime as much as possible. Involving Ruby and dealing with large tables could increase runtimes to several minutes, leading to locked API requests and potential unavailability.
@@ -67,7 +67,8 @@ To create resilient and reliable migrations, follow these guidelines:
end
```
1. If you're writing a uniqueness constraint where some of the values can be null, remember that `null != null`. For instance, the values `[1, 1, null]` and `[1, 1, null]` are considered unique. Uniqueness constraints only work on columns that do not allow `NULL` as a value. If this is the case, change the column to disallow `NULL` and set the default to an empty string instead.
1. If you need to execute different operations for MySQL and Postgres, you can check the database type as follows: `... if database_type == :postgres` or `... if database_type == :mysql`.
1. If you need to execute different operations for MySQL and Postgres, you can check the database type as follows: `... if database_type == :postgres` or `... if database_type == :mysql`. If the differences are too big, consider writing separate migrations for each database type.
1. Be sure that, with real-world table sizes and load, each SQL query in your migration finishes in a reasonable time. **There is a hard limit of 30s (by default) in place to protect against outages caused by long-running migrations**; if a single statement exceeds it, the migration is aborted. If you need to run a long-running migration, consider breaking it up into smaller parts. If you make use of table locking, be sure any query finishes in under 2 seconds with real-world table sizes, to avoid issues due to table locks and waiting queries. An operator can override the 30s default by setting the `max_migration_statement_runtime_in_seconds` config property.
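The conditional-creation pattern from the idempotency guideline above can be sketched as follows (a sketch only; `indexes` stands for Sequel's `Database#indexes`, and the table and index names are illustrative):

```ruby
# An index still needs to be created only if the schema does not already
# contain it; Database#indexes returns a hash keyed by index name.
def index_missing?(indexes_for_table, index_name)
  indexes_for_table[index_name].nil?
end

# In a migration, roughly:
#   add_index(:my_table, :my_column, name: :my_index) if index_missing?(indexes(:my_table), :my_index)
```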

# Sequel Migration Tests

17 changes: 16 additions & 1 deletion spec/unit/lib/cloud_controller/db_migrator_spec.rb
@@ -1,9 +1,10 @@
require 'spec_helper'

RSpec.describe DBMigrator do
let(:db) { Sequel::Model.db }

describe '#wait_for_migrations!' do
let(:migrator) { DBMigrator.new(db, 1) }
let(:db) { double(:db) }

it 'blocks until migrations are current or newer' do
expect(Sequel::Migrator).to receive(:is_current?).with(
@@ -47,4 +48,18 @@
end.to raise_error(UncaughtThrowError)
end
end

describe 'postgresql' do
it 'sets a default statement timeout' do
skip if db.database_type != :postgres
expect(db).to receive(:run).with('SET statement_timeout TO 30000')
DBMigrator.new(db)
end

it 'sets a config provided statement timeout' do
skip if db.database_type != :postgres
expect(db).to receive(:run).with('SET statement_timeout TO 60000')
DBMigrator.new(db, nil, 60)
end
end
end
