Tuesday, November 26, 2024

Splitting the data of the monolith – Because who needs to sleep anyway…

In this article, I would like to share our twisted journey of migrating the data from our old monolith to the new “micro” databases. I would like to highlight the specific challenges we encountered during the process, present potential solutions for them, and outline our data migration strategy.

  • Background: summary and the necessity of the project
  • How to migrate the data into the new applications: the options/strategies we considered and how we actually did the migration
  • Implementation
    • Setting up a test project
    • Transforming the data: difficulties and solutions
    • Restoring the database: how to manage long-running SQL scripts from an application
    • Finalising the migration and preparing for go-live
    • DMS job hiccup
  • Going live
  • Learnings

If you find yourself knee-deep in technical jargon or it gets too long, feel free to skip to the next chapter – we won’t judge.

Background

Our goal over the last two years was to replace our old monolithic application with microservices. Its responsibility was to create customer-related financial fulfilments; it ran between 2017 and 2024, so it collected extensive information about logistical events, shop orders, customers, and VAT.

Financial fulfilment is a grouping around transactions; it connects trigger events, such as a delivery, with billing.

The data:

Why do we need the data at all?

Having the old data is crucial: it includes everything from the history of the shop orders to logistical events and VAT calculations. Without it, our new applications cannot correctly process new events for the old orders. Consider the following situation:

  1. You ordered a PS5 and it has been shipped – the old application stores the data and sends a fulfilment
  2. The new applications go live
  3. You send back the PS5, so the new apps need the previous data to be able to create a credit.

The size of the data:

Since the old application started, it has collected about 4 terabytes of data, of which we still want to handle roughly 3 TB in two different microservices (in a new format):

  • shop order, customer data and VAT: ~2 TB
  • logistical events: ~1 TB

Handling history during development:

To manage historical data during development, we created a small service that reads directly from the old app’s database and provides the information through REST endpoints. This way we can see what has already been processed by the old system.
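To give an idea of the shape of this mediator, here is a minimal sketch of such a read-only endpoint. It assumes a Spring setup; the endpoint path, table and column names are hypothetical and only for illustration – the real service is of course more involved.

```java
import java.util.List;
import java.util.Map;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class OldFulfilmentController {

    // JdbcTemplate configured against the old monolith's database (read-only)
    private final JdbcTemplate oldDatabase;

    public OldFulfilmentController(JdbcTemplate oldDatabase) {
        this.oldDatabase = oldDatabase;
    }

    // Hypothetical endpoint and table names, purely for illustration
    @GetMapping("/old-fulfilments/{orderId}")
    public List<Map<String, Object>> fulfilmentsForOrder(@PathVariable String orderId) {
        return oldDatabase.queryForList(
                "SELECT * FROM fulfilment WHERE shop_order_id = ?", orderId);
    }
}
```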

How to migrate the data into the new applications?

We had been working on the new system, and by early February we had a functional distributed system running in parallel with the old monolith. At that point, we considered three different plans:

  • Run the mediator app until the end of the Fiscal Period (2031):
    PRO: it is already done
    CON: we would have one extra “unnecessary” application to maintain.
  • Create a scheduled job to push data to the new applications:
    PRO: We can program the data migration logic in the applications and avoid the need for any unfamiliar technology.
    CON: Increased cloud costs. The exact duration required for this process is uncertain.
  • Replay ALL logistical events and test the new applications:
    PRO: We can thoroughly retest all features in the new applications.
    CON(S): Even higher cloud costs. More time-consuming. Data-related issues, including the need to manually fix past data discrepancies.

Conclusion:

Because the trade-off was too big in every case, I asked for help and opinions from the company’s development community, and after some back and forth, we set up a meeting with a couple of experts from the relevant fields.

The new plan, born from this collaboration:

Current state of the system(s): Setting the scene

Before we could go ahead, we needed a clear picture of where we stood:

  • The old application runs in the datacenter
  • The old database has already been migrated to the cloud
  • The mediator application is running to serve the old data
  • The new microservices are working in the cloud

The big plan:

After the discussion (and a few cups of strong coffee), we forged a totally new plan.

  • Use an off-the-shelf solution to migrate/copy the database: Google’s Database Migration Service (DMS)
  • Promote the new database: Once migrated, this new database would be promoted to serve our new applications.
  • Transform the data with Flyway: Utilising Flyway and a series of SQL scripts, we would transform the data into the schemas of the new applications.
  • Start the new applications: Finally, with the data in place and transformed, we’d start the new applications and process the piled-up messages.

The last point is extremely important and sensitive. Once the migration scripts are ready and we kick off the final run, we must stop the old application while the new applications keep collecting messages, so that everything is processed at least once, either by the old or by the new solution.

Difficulties – the roadblocks ahead:

Of course, no plan is without its hurdles. Here’s what we were up against:

  • Single DMS job limitation: The two database migration jobs must run sequentially
  • Time-consuming jobs:
    • Each job took around 19-23 hours to complete
    • Transformation time: the exact duration was unknown
  • Daily fulfilment obligations: Despite the migration, we had to ensure that all fulfillments were sent out daily – no exceptions.
  • Uncharted territory: To top it off, nobody in the company had ever tackled something quite like this before, making it a pioneering effort. On top of that, the team consists mainly of Java/Kotlin developers with only basic SQL skills.
  • Go-live date promised to other dependent projects in the company

Conclusion:

With our new plan in hand and the help provided by our colleagues, we could start working on the details: building up the script execution and the scripts themselves. We also created a dedicated Slack channel to keep everybody informed.

Implementation:

We needed a controlled environment to test our approach: a sandbox where we could play out our plan and develop the migration scripts themselves.

Setting up a test project

To kick things off, I forked one of the target applications and made some adjustments to fit our testing needs:

  • Disabling the tests: we disabled all existing tests except the context loading of the Spring application, which verifies the structure, the integration points and also the Flyway scripts (a minimal sketch of this test follows the list).
  • New Google project: ensuring that our test environment was separate from our production resources.
  • No communication: we disabled all inter-service communication – no messaging, no REST calls, and no BigQuery storage.
  • One instance: to avoid concurrency issues with the database migrations and transformations.
  • No alerts: we removed all alerts to spare ourselves the heart attacks.
  • Database setup: instead of creating a new database in production, we promoted a “migrated” database created by DMS.
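The one test we kept is essentially the standard Spring Boot context-load test. A minimal sketch (the class name is hypothetical) could look like this:

```java
import org.junit.jupiter.api.Test;
import org.springframework.boot.test.context.SpringBootTest;

@SpringBootTest
class MigrationSandboxApplicationTest {

    // Booting the full Spring context is enough to verify the wiring and,
    // since Flyway runs on startup, to execute the migration scripts
    // against the promoted test database.
    @Test
    void contextLoads() {
    }
}
```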

Transforming data: Learning from failures

Our journey through data transformation was anything but smooth. Each iteration of our SQL scripts brought new challenges and lessons. Here’s a closer look at how we iterated through the process, learning from each failure to eventually get it right.

Step 1: SQL stored functions

Our initial approach involved using SQL stored functions to handle the data transformation. Each stored function took two parameters – a start index and an end index. The function would process rows between these indices, transforming the data as needed.

We planned to invoke these functions through separate Flyway scripts, which would handle the migration in batches.

PROBLEM:

Managing the invocation of these stored functions via Flyway scripts turned into a chaotic mess.

Step 2: State table

We needed a method that offered more control and visibility than our Flyway scripts, so we created a state table, which stored the last processed id of the main/leading table of the transformation. This table acted as a checkpoint, allowing us to resume processing from where we left off in case of interruptions or failures.

The transformation scripts were triggered by the application in a single transaction, which also included updating the state table.
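In code, one batch boiled down to something like the following sketch, assuming Spring’s JdbcTemplate and transaction management. The migration_state table, the transform_shop_orders stored function, and all column names are hypothetical placeholders.

```java
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class TransformationBatchRunner {

    private final JdbcTemplate jdbcTemplate;

    public TransformationBatchRunner(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // One batch: read the checkpoint, run the stored function for the next id range,
    // and move the checkpoint forward - all in a single transaction.
    @Transactional
    public void runBatch(String jobName, long batchSize) {
        Long lastId = jdbcTemplate.queryForObject(
                "SELECT last_processed_id FROM migration_state WHERE job_name = ?",
                Long.class, jobName);

        // Hypothetical stored function taking a start and an end index and
        // returning the number of rows it transformed
        jdbcTemplate.queryForObject(
                "SELECT transform_shop_orders(?, ?)",
                Integer.class, lastId + 1, lastId + batchSize);

        jdbcTemplate.update(
                "UPDATE migration_state SET last_processed_id = ? WHERE job_name = ?",
                lastId + batchSize, jobName);
    }
}
```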

PROBLEM:

As we monitored our progress, we noticed a critical issue: our database CPU was being underutilised, operating at only around 4% capacity.

Step 3: Parallel processing

To solve the problem of the underutilised CPU, we introduced a “lists of jobs” concept: each list contained migration jobs that had to be executed sequentially.

Two separate lists of jobs have nothing to do with each other, so they can be executed concurrently.

By submitting these lists to a simple Java ExecutorService, we could run multiple job lists in parallel.

Keep in mind that each job calls a stored function in the database and updates a separate row in the migration state table; it is extremely important to run only one instance of the application to avoid concurrency problems with the same jobs.
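A stripped-down sketch of this setup could look like the following. The MigrationJob abstraction is an assumption: it stands for one stored-function call plus the update of its own row in the state table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelMigrationRunner {

    /** One job: calls its stored function and updates its own row in the migration state table. */
    public interface MigrationJob extends Runnable { }

    /** Each inner list runs sequentially on one thread; independent lists run in parallel. */
    public void runAll(List<List<MigrationJob>> jobLists) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(jobLists.size());
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (List<MigrationJob> jobList : jobLists) {
                futures.add(executor.submit(() -> jobList.forEach(Runnable::run)));
            }
            for (Future<?> future : futures) {
                future.get(); // wait for each list to finish and surface any failure
            }
        } finally {
            executor.shutdown();
        }
    }
}
```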

This setup increased CPU usage from the previous 4% to around 15%, a huge improvement. Interestingly, this parallel execution didn’t significantly increase the time it took to migrate individual tables. For example, a migration that initially took 6 hours (when it ran alone) now took about 7 hours when executed alongside another parallel thread – an acceptable trade-off for the overall efficiency gain.

PROBLEM(S):

One table encountered a major issue during migration, taking an unexpectedly long time—over three days—before we ultimately had to stop it without completion.

Step 4: Optimising the long-running script(s)

To make this process faster, we needed extra permissions on the database, and our database specialists stepped in to help us with the investigation.

Together we discovered that the root of the problem lay in how the script was filling a temporary table. Specifically, a sub-select in the script was effectively re-evaluated for every row being processed, inadvertently creating an O(N²) problem. With our batch size of 10,000, that is on the order of 10,000 × 10,000 = 100 million row comparisons per batch, which caused the processing time to skyrocket.
