biradarganesh25.github.io
posted on Dec 28, 2024
Formal verification can be used to analyze complex systems and ensure
their correctness. It is the process of checking that a system
satisfies specific, user-specified properties across all possible states.
There are 2 main types of properties:
“safety properties” and “liveness properties”. For the formal definitions,
please refer here.
If a property does not hold in the current state but could still be
satisfied in some future state, then it is a “liveness
property”. If, once violated, it can never be satisfied in any future state,
then it is a “safety property”.
For example, let’s imagine a bunch of servers serving a website. A property
saying that “no server can ever be offline” would be a safety property:
if any server goes offline in any state of the system, the property
is violated and can never be satisfied in any future state
after that. But a property saying that “all servers that go offline
eventually come back online” would be a “liveness property”, because
there is always a possibility of it becoming true in a future state.
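As a rough sketch of how these would look in TLA+ (the names Servers and serverState are made up for illustration; they are not from any particular spec), the two properties could be written as:

\* Illustrative only: Servers is a set of server ids and serverState maps
\* each server to "online" or "offline".

\* Safety: checked as an invariant that must hold in every reachable state.
NoServerOffline == \A s \in Servers : serverState[s] # "offline"

\* Liveness: a temporal "leads to" property; whenever a server goes
\* offline, it must eventually be online again.
ServersComeBack ==
    \A s \in Servers :
        (serverState[s] = "offline") ~> (serverState[s] = "online")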
TLA+ is a formal specification language in which we can model systems
and verify properties across all possible system states: it lets us
enumerate every state a system can be in and check that the specified
properties hold in all of them.
PlusCal is a DSL that compiles down to TLA+. It sits at a higher level of
abstraction, and it is much easier to specify complex systems in PlusCal
than in raw TLA+. The trade-off is that some kinds of systems cannot be
specified in it (i.e., it is more constrained than TLA+), but it is
sufficient for most common use cases.
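To get a feel for the difference in abstraction level, here is a tiny, illustrative PlusCal loop (unrelated to the spec below) together with, roughly, the TLA+ action the translator generates for its single label:

---- MODULE counter ----
EXTENDS Integers

(* --algorithm counter
variables x = 0;
begin
Inc:
    while x < 3 do
        x := x + 1;
    end while;
end algorithm; *)

\* Roughly, the translator expands the Inc label into a TLA+ action like:
\*   Inc == /\ pc = "Inc"
\*          /\ IF x < 3
\*                THEN /\ x' = x + 1
\*                     /\ pc' = "Inc"
\*                ELSE /\ pc' = "Done"
\*                     /\ x' = x
====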
I went through Learn TLA+ in about 2 weeks (it
was quite dense; I ended up learning some things about formal verification in order
to understand the material fully). I thought that the best way to learn it would be
to write a spec that actually does something useful, and I thought verifying a
zero-downtime database migration algorithm would be fun.
Let’s assume a service exposes 3 APIs: Upsert, which either updates an
existing object or creates a new one; Delete, which deletes the object;
and Get, which fetches the object.
We want to migrate these objects (for simplicity, let’s assume they live
in a table in a SQL database) from one database (olddb) to another (newdb).
The pseudocode for the migration algorithm is as follows.
Note: the pseudocode assumes that the database is a mapping of
primary keys to values (each object is assumed to
have a primary key and a value).
The Upsert, Delete, and Get operations:
Upsert(key, val) {
newdb[key] = val
}
Delete(key) {
// we need to put a TOMBSTONE instead of actually
// deleting it so that the background migration
// process does not accidentally insert a
// deleted value
newdb[key] = TOMBSTONE
}
Get(key) {
if key in newdb {
val = newdb[key]
if val == TOMBSTONE {
return NULL
} else {
return val
}
} else if key in olddb {
return olddb[key]
}
return NULL
}
And a background migration process moves all the objects from olddb to newdb:
migrate() {
for key, val in olddb {
// conditional insert: only write the value if the key does not
// already exist in newdb (so we never overwrite a TOMBSTONE)
newdb[key] = val if does not exist
}
}
This is the PlusCal spec for the above migration algorithm:
---- MODULE test ----
EXTENDS Integers, FiniteSets, TLC
Threads == {1,2} \* each thread represents a user
N == 1..4 \* primary keys of objects that user will use
Objects == [p \in N |-> p] \* the values that primary keys map to
CONSTANTS
NULL,
TOMBSTONE
(* --algorithm zero_downtime_migration
variables
\* db is used to simulate a system without migration.
\* olddb and newdb are used to simulate system with migration
db = [pkey \in 1..2 |-> pkey+1], \* some objects will already exist in db with different values when migration starts
olddb = [pkey \in 1..2 |-> pkey+1],
newdb = [pkey \in {} |-> NULL],
define
TypeOK ==
\/ DOMAIN db \subseteq DOMAIN Objects
\/ Cardinality(DOMAIN db) = 0
GetObjectWOMigration(pkey) ==
IF pkey \in DOMAIN db
THEN db[pkey]
ELSE NULL
GetObjectWMigration(pkey) ==
IF pkey \in DOMAIN newdb
THEN IF newdb[pkey] = TOMBSTONE
THEN NULL
ELSE newdb[pkey]
ELSE IF pkey \in DOMAIN olddb
THEN olddb[pkey]
ELSE NULL
\* the system with migration and without migration will be compared
\* this is our main invariant - we mimic a system where there is no
\* migration happening so that we can use that to check that our system
\* where the migration is happening is correct
ConsistentViews ==
/\ \A pkey \in N: GetObjectWOMigration(pkey) = GetObjectWMigration(pkey)
end define;
\* WOMigration represent operations for the system without migration
\* WMigration represents operations for system with migration
macro UpsertWOMigration(db, key, val) begin
db := (key :> val) @@ db;
end macro;
macro DeleteWOMigration(db, key) begin
if key \in DOMAIN db then
db := [p \in DOMAIN db \ {key} |-> db[p]]
end if;
end macro;
macro UpsertWMigration(newdb, key, val) begin
newdb := (key :> val) @@ newdb;
end macro;
macro DeleteWMigration(newdb, key) begin
newdb := (key :> TOMBSTONE) @@ newdb
end macro;
\* each process represents 1 user
\* each will randomly pick an object and either upsert or delete it
process thread \in Threads
variables
    i = 0,
    curPkey = NULL;
begin
Start:
    while i < 2 do
        \* pick one of the objects and nondeterministically upsert or
        \* delete it; each action is mirrored atomically in the system
        \* without migration so that ConsistentViews can compare them
        with pkey \in N do
            curPkey := pkey;
            either
                UpsertWOMigration(db, pkey, Objects[pkey]);
                UpsertWMigration(newdb, pkey, Objects[pkey]);
            or
                DeleteWOMigration(db, pkey);
                DeleteWMigration(newdb, pkey);
            end either;
        end with;
        i := i + 1;
    end while;
end process;

\* background migration process that copies every object from olddb to newdb
process migration = 0
variables
    keysToMigrate = {};
begin
Migrate:
    keysToMigrate := DOMAIN olddb;
CopyLoop:
    while keysToMigrate # {} do
        with pkey \in keysToMigrate do
            keysToMigrate := keysToMigrate \ {pkey};
            \* the existence check and the insert happen in one atomic
            \* step (a conditional insert), so a value or TOMBSTONE
            \* already present in newdb is never overwritten
            if pkey \notin DOMAIN newdb then
                newdb := (pkey :> olddb[pkey]) @@ newdb;
            end if;
        end with;
    end while;
end process;
end algorithm; *)
It’s amazing how the main part of the spec looks very similar to the pseudocode.
The trickiest part was figuring out the invariant, i.e., how to make sure that our
migration algorithm is correct. For this, I simulate 2 services in the spec: one
where the migration is happening and another where it is not. Our invariant is
that Get operations on all the objects return exactly the same value in both
services.
Let’s define the main variables involved:
Objects represents the objects that the users will work with. It is a function
that maps the primary keys of the objects to their values (each object is
assumed to have a primary key and a value). The spec uses 4 objects.
db is the database used by the service without migration, while olddb and newdb
are the databases used by the service with migration. All the database variables
are functions mapping primary keys of objects to values.
All the service APIs are written as macros.
The APIs for the service with migration end with WMigration,
and those for the service without migration end with WOMigration.
Note: the database updates are implemented as function merges in PlusCal,
with the database appearing on the right-hand side. This is because the
operator @@ in left_function @@ right_function merges the keys of both
functions and, for a key present in both, uses the value from left_function.
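For example (an illustrative expression, not part of the spec), using the :> operator from the TLC module to build single-entry functions:

\* :> builds a one-entry function and @@ merges two functions, preferring
\* the left operand for keys that appear in both.
ExampleMerge == (1 :> "new") @@ (1 :> "x" @@ 2 :> "y")
\* ExampleMerge maps 1 |-> "new" and 2 |-> "y", which is why writing
\* (key :> val) @@ db overwrites db[key] if it exists and adds it otherwise.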
The Get operation is implemented as an operator, because macros in PlusCal
cannot return values.
One thing I realized is that there is a lot of freedom in modelling complex systems in
TLA+. This lets us represent complex interactions in a very simple manner (e.g.,
a database insert operation as a function update). This freedom might seem a bit weird
when just starting with PlusCal/TLA+, but it’s very powerful once we get used to it.
Let’s talk about the processes involved in the spec now.
The spec simulates 2 users randomly picking from pre-defined objects (4 objects) and
then randomly performing either Upsert or Delete operations on them.
Each user does this 2 times. The database starts out
with some objects already present (2 objects), and the objects that the users pick and
operate on include these already present objects. The number of users and objects and
the number of repetitions might seem too small to properly test that the system works,
but since the TLC model checker builds a state graph that includes every possible state,
we can be sure that all the edge cases are covered.
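For reference, a minimal TLC model configuration for checking this spec might look like the following (assuming the module is saved as test.tla, that Spec is the formula generated by the PlusCal translator, and that NULL and TOMBSTONE are declared as model values):

\* test.cfg (assumed file name)
SPECIFICATION Spec
CONSTANTS
    NULL = NULL
    TOMBSTONE = TOMBSTONE
INVARIANTS
    TypeOK
    ConsistentViews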
All actions happening in the system with migration are mirrored atomically in
the system without migration.
There is also a background migration process running, which copies over the data from
olddb to newdb.
The final migration step involves deleting the TOMBSTONEs in newdb, but for now
the spec does not include it. In the future I might add it and revise this blog post.
The true value of a TLA+ spec shows up when the spec errors out. For example, when writing
the spec I had initially written GetObjectWMigration incorrectly:
GetObjectWMigration(pkey) ==
IF pkey \in DOMAIN newdb
THEN newdb[pkey]
ELSE IF pkey \in DOMAIN olddb
THEN olddb[pkey]
ELSE NULL
compare this to the correct version:
GetObjectWMigration(pkey) ==
IF pkey \in DOMAIN newdb
THEN IF newdb[pkey] = TOMBSTONE
THEN NULL
ELSE newdb[pkey]
ELSE IF pkey \in DOMAIN olddb
THEN olddb[pkey]
ELSE NULL
The incorrect version does not handle the case where a TOMBSTONE is present in newdb:
after a Delete, the system with migration returns the TOMBSTONE value while the system
without migration returns NULL, so ConsistentViews is violated.
Because of this the spec was failing.
This is a solid example of how TLA+ saves time: it lets us discover design flaws very early.
Imagine debugging this error after implementing the entire migration algorithm: it would
have been hard to do, and there is a very real chance it would have been missed altogether.
Here, I just had to step through the state changes to figure out the exact cause of the
spec failure.
One more way in which TLA+ helps is that it forces us to think about which steps must be
atomic. If we come across any steps that cannot be atomic, it forces us to consider the
implications of concurrent execution. For example, in the background migration process
above, before moving an item from olddb to newdb we check whether the object already
exists in newdb (it may already exist because it was deleted; if so, we should not
overwrite the TOMBSTONE). Checking whether the object exists and creating it in newdb
must be atomic (most databases support such conditional inserts, e.g. an IF DOES NOT EXIST
clause). While this may seem trivial, writing the TLA+ spec made it very clear and brought
it into focus. Important information like this, when discovered early in the design phase,
can save a lot of time during implementation (e.g., here we would need to make sure that
newdb supports conditional inserts when choosing the database).
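To see what the model checker would flag, here is a hypothetical non-atomic variant of that migration step (not part of the spec above; curKey and keyMissing are made-up names). Splitting the existence check and the insert into two labels, i.e. two steps, lets a user's Delete interleave between them, so TLC would find a trace where the migration overwrites a TOMBSTONE and ConsistentViews is violated:

\* Hypothetical non-atomic variant, for illustration only
Check:
    keyMissing := curKey \notin DOMAIN newdb;
    \* a user's Delete can run here and write a TOMBSTONE for curKey
Copy:
    if keyMissing then
        newdb := (curKey :> olddb[curKey]) @@ newdb;
    end if;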