The following are the titles of recent articles syndicated from Blog - Percona.
LJ.Rossia.org makes no claim to the content supplied through this journal account. Articles are retrieved via a public feed supplied by the site for this purpose.
2:24 pm
Security advisory: CVE-2026-9740 and CVE-2026-11933 in Percona Server for MongoDB
Fixes land in the Percona Server for MongoDB patch window starting next week. The first high-severity issue has nothing between it and your mongod process except your firewall. The second has a configuration off-switch you can flip during a maintenance window. Read on to understand why, how, and what.
CVE-2026-9740 — the one that does not need credentials
A stack overflow in the BSON validator, specifically in the BSONColumn interleaved-reference handling. The validator’s depth tracking resets on mutual recursion between validation functions, so a sufficiently nested input exhausts the thread’s stack before any explicit limit fires. The result: mongod crashes.
CVSS 8.7. High severity. The reason it lands in High instead of merely Medium is the prerequisite for exploitation – there is none.
The attacker needs network reachability to a mongod listener. No credentials, no prior session, and no application interaction. One crafted message over the wire and the process is down. The crash is trivially repeatable, so an attacker who can reach the port can keep the instance offline for as long as they keep that reachability. The urgency of this issue comes from the audience: everyone with a TCP route to your database.
Upstream tracking: SERVER-125063. Affected versions are Percona Server for MongoDB 8.0 ≤ 8.0.23-10 and PSMDB 7.0 ≤ 7.0.34-19. The vulnerable BSONColumn code path was introduced in 7.0, so 6.0 and earlier are not in scope for this one.
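The affected ranges above can be checked mechanically. The following is a minimal sketch, assuming version strings have the form `major.minor.patch-build` (for example `8.0.23-10`) and that comparing the four numbers as a tuple matches release order. The function names are hypothetical and not part of any Percona tool.

```python
import re

# Last affected build per series, taken from the advisory text above.
LAST_AFFECTED = {
    (8, 0): (8, 0, 23, 10),
    (7, 0): (7, 0, 34, 19),
}


def parse_psmdb_version(version: str) -> tuple:
    """Split a string such as '8.0.23-10' into (8, 0, 23, 10)."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)-(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def affected_by_cve_2026_9740(version: str) -> bool:
    """Return True if the version falls inside a range listed in LAST_AFFECTED."""
    parsed = parse_psmdb_version(version)
    last_affected = LAST_AFFECTED.get(parsed[:2])
    # Series without an entry (such as 6.0) are treated as out of scope.
    # This sketch makes no claim about series the advisory does not mention.
    return last_affected is not None and parsed <= last_affected
```

For example, `affected_by_cve_2026_9740("7.0.34-19")` is true, while `affected_by_cve_2026_9740("6.0.29-23")` is false because 6.0 is out of scope for this CVE.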
CVE-2026-11933 — the one that does need credentials and permissions to read
The vulnerable code path is inside MongoDB Server’s server-side JavaScript engine, specifically in the BSON-to-array conversion routine. When a BSON document is materialized as a JavaScript array for use inside a server-side script, the engine can reach a state where it accesses memory that has already been freed. An attacker who can submit input that flows into that conversion path can shape what happens at the point of access.
Server-side JavaScript is reachable from the following surfaces:
The $where query operator (deprecated in 8.0).
The $function aggregation expression (deprecated in 8.0).
The $accumulator aggregation expression (deprecated in 8.0).
MongoDB logs a warning when you run deprecated functions.
Prerequisites for exploitation:
The attacker must be authenticated to MongoDB.
The attacker must hold any role that permits running queries or aggregations against a collection. The built-in read role on a single database is sufficient.
Server-side JavaScript must be enabled on the mongod instance. This is the default; many production deployments leave it enabled even when they do not use it.
CVSS 8.8. High severity. Two demonstrated outcomes:
Information disclosure (reading other content out of the mongod process memory) and
Denial of Service (crashing it).
Upstream tracking: SERVER-128125. Affected versions: every supported and End of Life Percona Server for MongoDB major from 4.4 through 8.0.
The good news and bad news
CVE-2026-11933 has a configuration off-switch. If your application does not use server-side JavaScript — $where, $function, $accumulator, mapReduce, or stored system.js functions — you can disable server-side JavaScript on the server, removing the attack surface entirely until you patch.
How to check whether your applications use server-side JavaScript before disabling:
Enable MongoDB profiling at level 2 (all operations) on a representative mongod server for a representative time window. See details in Manage the database profiler.
Search the system.profile collection for operations that include $where, $function, $accumulator, or mapReduce.
Inspect application code paths and stored aggregation pipelines for the same operators. Check system.js in each database for stored functions.
If any usage exists, treat disabling as not viable for those deployments and rely on patching plus the defense-in-depth controls below.
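The search step can be sketched in a few lines. The example below assumes the profile entries have already been fetched as plain dictionaries (for instance with a driver of your choice) and walks each one recursively, looking for the operator names as keys. The function name and the sample documents are hypothetical, and stored system.js functions still need a separate check.

```python
# Operator and command names that imply server-side JavaScript execution.
SSJS_OPERATORS = {"$where", "$function", "$accumulator", "mapReduce"}


def find_ssjs_operators(node) -> set:
    """Return the server-side JavaScript operator names found anywhere in node."""
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key in SSJS_OPERATORS:
                found.add(key)
            found |= find_ssjs_operators(value)
    elif isinstance(node, (list, tuple)):
        for item in node:
            found |= find_ssjs_operators(item)
    return found


# Hypothetical profile entries, shaped loosely like system.profile documents.
sample_entries = [
    {"op": "query", "ns": "shop.orders",
     "command": {"find": "orders", "filter": {"$where": "this.qty > 5"}}},
    {"op": "command", "ns": "shop.orders",
     "command": {"aggregate": "orders",
                 "pipeline": [{"$match": {"status": "open"}}]}},
]

usage = [find_ssjs_operators(entry) for entry in sample_entries]
```

An empty set for every entry in a representative window suggests that disabling server-side JavaScript is viable. Any non-empty set identifies an operation to review first.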
How to disable server-side JavaScript:
Add the security.javascriptEnabled: false setting to your configuration file for mongod and mongos.
After a restart, any operation that reaches for server-side JavaScript will return an error. That is the catch: if your application does use one of those operators, this is not a viable mitigation for you, and you have to wait for the patch. If you are not sure whether your application uses them, turn on the database profiler at level 2 on a representative replica for a window long enough to be representative, then grep the profile collection for the operator names. Several teams have done this exercise in the last forty-eight hours and learned the answer is “no, we don’t actually use any of that.” The cost of disabling is then the cost of a mongod or mongos restart.
That was good news. Now the bad news: CVE-2026-9740 has no equivalent off-switch. The BSON validator is core to every client message; it cannot be disabled. Patch and network controls are the only options.
What is shipping, and when
The fixes for both CVEs will land in a single coordinated patch release for each supported major:
Percona Server for MongoDB 7.0 series — patch 7.0.37-20 released on June 23, 2026.
Percona Server for MongoDB 8.0 series — patch 8.0.26-11 released on June 25, 2026.
Percona Server for MongoDB 6.0 series — patch 6.0.29-23 released on June 24, 2026 (for CVE-2026-11933).
All dates are targets, not commitments. Plan one upgrade window covering all CVEs.
Percona is not building binary packages for the 5.x line. We’re being upfront about that — the calculus on extended support has a limit, and 5.x is past it for us. If you have a hard requirement on 5.x and the time pressure to meet it, the source is available for building. Percona customers on 5.x can open a ticket, and we’ll work on the case individually.
As usual, you can download patches from your package manager or Percona Software Downloads page.
On Kubernetes via the Percona Operator for MongoDB: same drill as usual. When the patched image is published, edit the image tag in your PerconaServerMongoDB custom resource and let the operator roll the cluster. You do not need to wait for the June operator release to apply a security fix. See details in our documentation on how to Upgrade Percona Server for MongoDB.
What to do this week
In order of urgency, for most deployments:
Confirm your mongod or mongos listeners are not reachable from any source you would not trust with a shell on the host. If you find an exposure, fix that first. CVE-2026-9740 turns any such exposure into a DoS primitive.
For deployments that do not use server-side JavaScript, disable it. Full mitigation for CVE-2026-11933 within a single mongod restart.
Plan your upgrade window for the week the relevant fixed release lands. One window. Both CVEs, plus the other fixes that scored lower.
Audit which roles in your deployment can run ad-hoc queries or aggregations. The bar for CVE-2026-11933 is the standard read role, so the population of potential attackers is larger than for most memory-safety defects.
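For the first item, a quick reachability probe run from a host you would not trust can confirm whether a listener is exposed. This is a minimal sketch that uses only the Python standard library. The function name is hypothetical, and 27017 appears only as the default mongod port, so adjust it to your deployment.

```python
import socket


def listener_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unroutable hosts.
        return False


# Example call (not executed here):
#   listener_reachable("db1.example.internal", 27017)
```

A result of True from an untrusted network segment means CVE-2026-9740 is reachable from there. A result of False only shows that this one path is closed, not that every path is.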
One closing point, because it has come up several times in customer conversations this week. For a deployment behind tight network controls, the post-authenticated bug is the more urgent one. For a deployment reachable from broader networks — public cloud, shared internal LANs, multi-tenant infrastructure — the pre-authenticated bug is. Triage by your exposure, not by their CVSS.
Questions, or a deployment you’re not sure how to triage? Find us on the Percona Forum, or, for customers, in the support portal.
Reviewed by Ivan Groenewold. Vetted for technical accuracy as of June 17, 2026.
11:31 am
Extending pt-archiver with a Partition-Aware Plug-in for Fast Retention Policy Enforcement
Managing data retention policies is one of the most common operational tasks in MySQL.
Applications continuously generate transactional, audit, logging, telemetry, and event data. Over time, these tables can grow to billions of rows, causing:
Larger backups
Longer recovery times
Reduced buffer pool efficiency
Slower index maintenance
Increased storage costs
Degraded query performance
To address these problems, organizations typically implement retention policies based on dates or timestamps. Examples include deleting events older than 90 days or purging session data older than 30 days and so forth. The deleted data can then eventually be archived somewhere else, like in another DBMS or on external files.
One of the most widely used tools for implementing these policies in MySQL ecosystems is pt-archiver, part of the Percona Toolkit.
This article reviews what pt-archiver is and how to use it, and focuses on one limitation in particular: the tool is not partition-aware, which can make the deletion phase more costly. It then shows how to extend pt-archiver with a Perl plug-in that makes it aware of partitioning.
What is pt-archiver?
pt-archiver is a command-line utility from Percona Toolkit designed to:
Archive rows from MySQL tables
Purge rows from MySQL tables
Move data between tables, either in the local database or in a remote one
Export rows into files
In a few words: implementing retention policies safely.
The tool processes rows incrementally in chunks, avoiding massive transactions and reducing impact on production systems.
Date placeholders in the filename are expanded automatically
Rows can optionally be deleted from the source table by adding --purge
This allows pt-archiver to be used both for data retention and for offline archival workflows.
The Hidden Cost of DELETE Statements
Although pt-archiver is much safer than massive DELETE operations, it still fundamentally relies on DELETE statements.
This is a critical point.
Even when proper indexes exist, rows are processed in chunks, and transactions are small, large-scale DELETE operations remain expensive.
Deleting rows is expensive in InnoDB because it involves:
Locating rows via indexes
Modifying clustered indexes
Modifying secondary indexes
Generating undo logs
Generating redo logs
Purge thread processing
Replication event generation
Page fragmentation
When deleting billions of rows, the overhead becomes enormous.
Indexes help for sure, but only partially.
Consider:
DELETE FROM events
WHERE created_at < '2024-01-01';
If created_at is indexed, MySQL can efficiently locate rows.
However, locating rows efficiently is only part of the cost. The actual delete operations still require all those things we mentioned above.
At considerable scale, this becomes expensive.
Why RANGE Partitioning is Superior for Retention Policies
For time-based retention policies, partitioning is often dramatically more efficient. In particular, RANGE partitioning is very useful for these cases.
Example:
CREATE TABLE events (
    id BIGINT NOT NULL,
    created_at DATETIME NOT NULL,
    payload JSON,
    PRIMARY KEY(id, created_at)
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p202604 VALUES LESS THAN (TO_DAYS('2026-05-01')),
    PARTITION p202605 VALUES LESS THAN (TO_DAYS('2026-06-01')),
    PARTITION p202606 VALUES LESS THAN (TO_DAYS('2026-07-01'))
);
With partitioning, dropping old data becomes:
ALTER TABLE events DROP PARTITION p202604;
This operation is dramatically faster than running a DELETE.
Dropping a partition:
Removes an entire physical partition
Avoids row-by-row DELETE
Avoids undo generation for each row
Avoids secondary index maintenance per row
Minimizes redo generation
Is nearly metadata-only
This can remove millions or billions of rows in a matter of seconds without the same large cost of DELETE.
The Problem: pt-archiver is Not Partition-Aware
Unfortunately, pt-archiver does not automatically understand partitioning strategies.
Even if the table is partitioned or the retention policy perfectly matches partition boundaries, pt-archiver still executes DELETE statements.
Internally, this still produces DELETE … instead of ALTER TABLE … DROP PARTITION …
This means organizations may lose the major operational benefits of partitioning, or they need to implement custom scripts for managing the selection of rows to copy using pt-archiver and then use DROP PARTITION separately from the tool. That is doable, and to be honest, not too complicated, but why not make pt-archiver aware of partitioning for some specific use cases?
Extending pt-archiver with Plug-ins
Fortunately, pt-archiver supports Perl plug-ins.
A plug-in can do plenty of things, such as inspecting runtime conditions, interacting with MySQL, overriding behaviors, and executing custom logic.
This gives us an opportunity to implement partition-aware retention handling.
The plug-in can:
Inspect partition definitions
Analyze the WHERE condition
Determine which partitions are fully expired
Execute ALTER TABLE DROP PARTITION
Prevent row-by-row DELETE processing
This approach combines the scheduling/orchestration power of pt-archiver with the efficiency of partition-level drops.
Plug-in Design
Our plug-in will:
Connect using the pt-archiver DB handle
Inspect INFORMATION_SCHEMA.PARTITIONS
Identify partitions older than the retention cutoff
Issue DROP PARTITION statements
Log actions
Skip DELETE processing
Assumptions:
The table is RANGE partitioned
Partitions are DATETIME based using the TO_DAYS() function to define ranges
Partition naming convention contains dates
Retention policy aligns with partition boundaries; if the plugin cannot determine a specific boundary, pt-archiver does nothing
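Before reading the Perl code, it may help to see the selection rule in isolation. The sketch below is a Python illustration of the design, not the plug-in itself. It assumes partitions arrive as (name, boundary) pairs in ordinal order, with None standing for MAXVALUE, and it emulates MySQL's TO_DAYS() by adding 365 to Python's proleptic Gregorian ordinal. The function names are hypothetical.

```python
from datetime import date


def to_days(day: date) -> int:
    """Emulate MySQL TO_DAYS() for Gregorian dates (MySQL counts from year 0)."""
    return day.toordinal() + 365


def partitions_to_drop(partitions, cutoff: date) -> list:
    """Return the partition names to drop, or [] when no boundary matches exactly.

    partitions: list of (name, boundary) in ordinal order; boundary is None
    for a MAXVALUE partition, which is never dropped.
    """
    cutoff_value = to_days(cutoff)
    bounded = [(name, b) for name, b in partitions if b is not None]
    if not any(b == cutoff_value for _, b in bounded):
        # No exact boundary match: refuse to do anything.
        return []
    # RANGE boundaries increase with position, so "position <= matched position"
    # is the same as "boundary <= cutoff value".
    return [name for name, b in bounded if b <= cutoff_value]


partitions = [("p202604", 740102), ("p202605", 740133), ("p202606", 740163)]
print(to_days(date(2026, 5, 1)))                         # 740102
print(partitions_to_drop(partitions, date(2026, 5, 1)))  # ['p202604']
print(partitions_to_drop(partitions, date(2026, 5, 15))) # []
```

The refusal on a non-matching cutoff is the important design choice: a cutoff that falls inside a partition would otherwise drop rows that are still within the retention window.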
Full Perl Plug-in for pt-archiver
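package pt_archiver_partition_drop;

use strict;
use warnings;

# Constructor: keeps the database handle, database name, and table name
# passed in by pt-archiver.
sub new {
    my ($class, %args) = @_;
    my $self = {
        dbh        => $args{dbh},
        db         => $args{db},
        tbl        => $args{tbl},
        statistics => {},
    };
    bless $self, $class;
    return $self;
}

# Accessor for the counters collected by the plug-in.
sub statistics {
    my ($self) = @_;
    return $self->{statistics};
}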
# Runs before pt-archiver starts processing rows. It drops the expired
# partitions and then exits, so no DELETE statement is ever executed.
sub before_begin {
    my ($self) = @_;
    my $dbh = $self->{dbh} or die "Missing dbh from pt-archiver\n";
    my $db = $self->{db} or die "Missing db from pt-archiver plugin args\n";
    my $tbl = $self->{tbl} or die "Missing tbl from pt-archiver plugin args\n";
    my $where = _get_cmdline_option('where');
    my $dryrun = $ENV{PT_PARTITION_DROP_DRY_RUN} ? 1 : 0;
    die "Missing --where from original command line\n" unless $where;
    print "PLUGIN before_begin called\n";
    print "DB=$db TABLE=$tbl\n";
    print "WHERE=$where\n";
    print "PLUGIN_DRY_RUN=$dryrun\n";
    my ($column, $cutoff_date) = _parse_where($where);
    my $partitions = _get_partitions($dbh, $db, $tbl);
    if (!@$partitions) {
        print "Table `$db`.`$tbl` is not partitioned. Refusing DELETE.\n";
        exit(0);
    }
    my $partition_expr = $partitions->[0]->{expression};
    die "Missing PARTITION_EXPRESSION\n"
        unless defined $partition_expr && length $partition_expr;
    print "Partition expression: $partition_expr\n";
    # Translate the cutoff date into the same unit as the partition boundaries.
    my $cutoff_value = _evaluate_cutoff(
        $dbh,
        $partition_expr,
        $column,
        $cutoff_date,
    );
    print "Cutoff date: $cutoff_date\n";
    print "Cutoff boundary value: $cutoff_value\n";
    # Find the partition whose upper boundary equals the cutoff exactly.
    my $matched;
    for my $p (@$partitions) {
        next if !defined $p->{description};
        next if uc($p->{description}) eq 'MAXVALUE';
        if ($p->{description} == $cutoff_value) {
            $matched = $p;
            last;
        }
    }
    if (!$matched) {
        print "No exact partition boundary matches cutoff $cutoff_value. Refusing DELETE.\n";
        exit(0);
    }
    print "Matched boundary partition: $matched->{name}, position $matched->{position}\n";
    # Every partition up to and including the matched one is fully expired.
    my @drop;
    for my $p (@$partitions) {
        next if !defined $p->{description};
        next if uc($p->{description}) eq 'MAXVALUE';
        if ($p->{position} <= $matched->{position}) {
            push @drop, $p->{name};
            print "Eligible for DROP: $p->{name}, boundary $p->{description}\n";
        }
    }
    if (!@drop) {
        print "No partitions eligible for DROP. Refusing DELETE.\n";
        exit(0);
    }
    my $sql = sprintf(
        "ALTER TABLE %s.%s DROP PARTITION %s",
        _quote_ident($db),
        _quote_ident($tbl),
        join(", ", map { _quote_ident($_) } @drop),
    );
    print "SQL: $sql\n";
    if ($dryrun) {
        print "PT_PARTITION_DROP_DRY_RUN enabled. Not executing DROP PARTITION.\n";
    }
    else {
        $dbh->do($sql);
        print "Dropped partitions: " . join(", ", @drop) . "\n";
    }
    $self->{statistics}->{partitions_dropped} = scalar @drop;
    # Stop here so pt-archiver never reaches its row-by-row DELETE phase.
    exit(0);
}
# Accepts only the form: column < 'YYYY-MM-DD'. Returns (column, date).
sub _parse_where {
    my ($where) = @_;
    $where =~ s/^\s+|\s+$//g;
    die "Only WHERE format supported: created_at < 'YYYY-MM-DD'\n"
        unless $where =~ /^`?([A-Za-z0-9_]+)`?\s*<\s*'(\d{4}-\d{2}-\d{2})'\s*$/;
    return ($1, $2);
}

# Substitutes the cutoff date for the column in the partition expression
# and lets MySQL evaluate it, for example SELECT to_days('2026-05-01').
sub _evaluate_cutoff {
    my ($dbh, $partition_expr, $column, $cutoff_date) = @_;
    my $expr = $partition_expr;
    $expr =~ s/`//g;
    die "Partition expression does not reference column `$column`: $partition_expr\n"
        unless $expr =~ /\b\Q$column\E\b/i;
    $expr =~ s/\b\Q$column\E\b/'$cutoff_date'/ig;
    die "Unsafe generated expression: $expr\n"
        unless $expr =~ /^[A-Za-z0-9_\s\(\)\+\-\*\/,\.'":]+$/;
    my $sql = "SELECT $expr";
    print "Boundary evaluation SQL: $sql\n";
    my ($value) = $dbh->selectrow_array($sql);
    die "Cannot evaluate cutoff expression: $sql\n"
        unless defined $value;
    return $value;
}
# Reads the partition list for the table, ordered by ordinal position.
sub _get_partitions {
    my ($dbh, $db, $tbl) = @_;
    my $sql = q{
        SELECT
            PARTITION_NAME,
            PARTITION_DESCRIPTION,
            PARTITION_EXPRESSION,
            PARTITION_ORDINAL_POSITION
        FROM INFORMATION_SCHEMA.PARTITIONS
        WHERE TABLE_SCHEMA = ?
          AND TABLE_NAME = ?
          AND PARTITION_NAME IS NOT NULL
        ORDER BY PARTITION_ORDINAL_POSITION
    };
    my $sth = $dbh->prepare($sql);
    $sth->execute($db, $tbl);
    my @partitions;
    while (my $row = $sth->fetchrow_hashref()) {
        push @partitions, {
            name        => $row->{PARTITION_NAME},
            description => $row->{PARTITION_DESCRIPTION},
            expression  => $row->{PARTITION_EXPRESSION},
            position    => $row->{PARTITION_ORDINAL_POSITION},
        };
    }
    return \@partitions;
}
# Recovers the value of a command-line option, first from @ARGV and then,
# as a fallback, from /proc/<pid>/cmdline (Linux only).
sub _get_cmdline_option {
    my ($name) = @_;
    my $opt = "--$name";
    for (my $i = 0; $i < @ARGV; $i++) {
        if ($ARGV[$i] eq $opt && defined $ARGV[$i + 1]) {
            return $ARGV[$i + 1];
        }
        if ($ARGV[$i] =~ /^\Q$opt\E=(.*)$/) {
            return $1;
        }
    }
    if (open my $fh, '<', "/proc/$$/cmdline") {
        local $/;
        my $raw = <$fh>;
        close $fh;
        my @cmd = split /\0/, $raw;
        for (my $i = 0; $i < @cmd; $i++) {
            if ($cmd[$i] eq $opt && defined $cmd[$i + 1]) {
                return $cmd[$i + 1];
            }
            if ($cmd[$i] =~ /^\Q$opt\E=(.*)$/) {
                return $1;
            }
        }
    }
    return undef;
}

# Backtick-quotes an identifier after checking it contains only safe characters.
sub _quote_ident {
    my ($ident) = @_;
    die "Invalid identifier: $ident\n"
        unless defined $ident && $ident =~ /^[A-Za-z0-9_]+$/;
    return "`$ident`";
}

1;
Save the plug-in as pt_archiver_partition_drop.pm in the /usr/local/share/perl5 directory.
Also set the PERL5LIB environment variable to tell pt-archiver where to find the Perl package:
export PERL5LIB=/usr/local/share/perl5
Example Usage
First, create the partitioned table events and insert some fake data.
Note that the Perl plug-in must be specified with the m option in the DSN string.
In practice:
pt-archiver initializes
The plug-in runs
Partitions are dropped
No DELETE statements are executed
Here is what you get from the execution of the above command:
PLUGIN before_begin called
DB=mydb TABLE=events
WHERE=created_at < '2026-05-01'
PLUGIN_DRY_RUN=0
Partition expression: to_days(`created_at`)
Boundary evaluation SQL: SELECT to_days('2026-05-01')
Cutoff date: 2026-05-01
Cutoff boundary value: 740102
Matched boundary partition: p202604, position 1
Eligible for DROP: p202604, boundary 740102
SQL: ALTER TABLE `mydb`.`events` DROP PARTITION `p202604`
Dropped partitions: p202604
You can simply verify the table has been managed correctly:
SELECT * FROM mydb.events;
SHOW CREATE TABLE mydb.events;
Now TRUNCATE the table, recreate the data, and this time specify a WHERE condition that matches a RANGE boundary other than the first one in the list.
As expected, the tool now refuses to execute anything if it doesn’t find an exact match.
Operational Benefits
This approach provides major advantages.
Dropping partitions is vastly faster than deleting rows, and it needs minimal binary logging compared to billions of row deletes. There is no massive transactional overhead for managing undo logs and purging. You also get better InnoDB Buffer Pool stability because there is less page churn.
In the end, retention jobs complete quickly, consistently, and predictably, at minimal cost.
Important Caveats
Partition Boundaries Must Match Retention Policy
If partitions contain mixed retention windows, DROP PARTITION may remove too much data. For this reason, ensure correct partition design.
Recommended:
daily partitions
weekly partitions
monthly partitions
aligned with business retention requirements.
Metadata Locks
ALTER TABLE DROP PARTITION still acquires metadata locks.
Test carefully before running this in production.
Backup Awareness
Ensure dropped partitions are no longer needed before removal or use pt-archiver to also copy the data into a remote server or dump the data into a CSV file before running the DROP PARTITION.
Possible Enhancements
The plug-in can be extended further.
Potential improvements:
Support for daily partitions
Support for UNIX timestamp partitions
Dry-run reporting
Automatic partition creation
Push Slack notifications
Export Prometheus metrics
Safety checks for replicas
GTID-aware orchestration
Integration with pt-online-schema-change workflows
These are just some ideas I had while running my tests. What you can do by implementing a Perl plug-in is limited only by your imagination and your real needs.
Conclusion
pt-archiver remains an excellent tool for implementing retention policies and archival workflows.
However, DELETE-based purging becomes increasingly expensive at scale, even with proper indexing and chunked processing.
For large time-series or historical datasets, RANGE partitioning is often a dramatically superior strategy.
The challenge is that pt-archiver does not natively leverage partition-level operations.
Fortunately, its Perl plug-in architecture allows advanced users to extend its behavior and implement partition-aware cleanup logic.
By combining:
pt-archiver orchestration
MySQL RANGE partitioning
Custom Perl plug-ins
Organizations can achieve:
Faster retention enforcement
Lower operational overhead
Smaller replication impact
Dramatically improved scalability
For large MySQL deployments, this hybrid approach can turn multi-hour purge operations into near-instant metadata operations.
The use case presented in this article is limited to a specific scenario, but you can reuse it or customize it if you have a different kind of RANGE partitioning, for example, not using TO_DAYS().
Take this as just an example of how you can extend pt-archiver. What you can do for real is driven by your needs and/or only limited by your imagination.
6:58 am
Group Replication VS Percona XtraDB Cluster: The True Cost of Consistency
Overview
When building high-availability MySQL environments, the choice between MySQL Group Replication (GR) and Percona XtraDB Cluster (PXC) often comes down to how they handle the eternal database dilemma: data consistency versus performance.
While both provide “synchronous-like” replication, they approach the problem of stale reads—reading data that has been committed on one node but not yet applied on another—in distinct ways. Understanding these differences, and the performance penalties associated with fixing them, is critical for any production environment.
Technology Overviews
MySQL Group Replication (GR)
Group Replication is the native, albeit more recent, high-availability solution built by Oracle for MySQL. It is based on a distributed state machine architecture and uses the Paxos consensus protocol.
Mechanism: When a transaction is committed, it is sent to all group members. The members must agree (consensus) on the order of transactions. Once a majority agrees, the transaction is “certified” and committed on the originator.
Replication Type: Virtually synchronous. The consensus ensures the data is received and ordered across nodes, but the actual applying of the data to the database happens asynchronously in the background.
Percona XtraDB Cluster (PXC)
PXC is an open-source enterprise solution based on Percona Server for MySQL and the Galera Replication library, which is the first and most mature virtually synchronous solution for MySQL.
Mechanism: When a node commits a transaction, it sends it to all other members of the Primary component (active group). All nodes must certify the transaction (check for conflicts). Certification runs on each node in the cluster, including the node that originates the write-set, before the originating node can finalize the commit.
Replication Type: Strictly synchronous (up to the certification level), asynchronous afterward. If the certification test fails, the node drops the write-set and the cluster rolls back the original transaction. If the test succeeds, however, the transaction commits and the write-set is applied to the rest of the cluster.
The Battle Against “Stale Reads”: Why It Matters
The most critical distinction for developers is whether a SELECT query on Node B will immediately see the INSERT just performed on Node A.
In a distributed system, there is a microsecond-to-millisecond gap between a transaction being globally ordered (everyone knows it happened) and being locally applied (the data is physically readable in the table). A read executed on a secondary during this gap results in a stale read.
Why is avoiding stale reads so critical?
While a stale read might just mean a user temporarily sees their old profile picture after updating it, in many business cases, it breaks the application’s core logic:
Financial Transactions: A user deposits $100 on the Primary node and immediately refreshes their balance page, which reads from a Replica. If the read is stale, the balance hasn’t updated. The user panics, thinking their money is lost.
E-commerce & Inventory: A customer buys the last item in stock. The next user immediately loads the product page. A stale read tells the second user the item is still available, leading to a cancelled order and a frustrated customer.
Security & Access: A user changes their password or updates a critical permission. If the next authentication request hits a node lagging by just a fraction of a second, their valid login might be rejected, or a revoked session might still be active.
To prevent these scenarios, we must tell the database to enforce strict consistency. But how do GR and PXC handle this, and what does it cost?
Consistency Controls Comparison
Both Group Replication and Percona XtraDB Cluster provide built-in mechanisms to enforce consistency and eliminate stale reads when your application demands it. However, they approach this problem using entirely different variables and distinct levels of granularity. The table below breaks down the specific controls each technology offers, highlighting exactly what it takes to force a node to serve fresh data.
Feature | MySQL Group Replication | Percona XtraDB Cluster
Default Behavior | Reads on secondaries may be stale because the applier thread might be lagging after consensus. | Reads on secondaries may be stale due to asynchronous background applying.
Stale Read Fix | Uses the group_replication_consistency variable. | Uses the wsrep-sync-wait variable.
Consistency Levels | Offers EVENTUAL, BEFORE, AFTER, and BEFORE_AND_AFTER. | Offers granular levels from 0 (default, no checks) up to 7 (checks on all READ, UPDATE, DELETE, INSERT, and REPLACE statements).
The Fix | Setting to AFTER ensures the next read is fresh. | Setting to 7 ensures we have a comparable scenario with GR. However, in PXC setting wsrep_sync_wait = 1 will be enough to avoid stale reads.
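The PXC levels in the table are easier to reason about as a bitmask. The sketch below decodes a wsrep_sync_wait value into the statement classes it covers. The bit assignments (1 for reads, 2 for UPDATE and DELETE, 4 for INSERT and REPLACE) reflect my reading of the Galera documentation and should be checked against the version you run. The function name is hypothetical.

```python
# Bit assignments assumed from the Galera documentation for wsrep_sync_wait.
SYNC_WAIT_BITS = {
    1: "READ (SELECT, BEGIN/START TRANSACTION)",
    2: "UPDATE and DELETE",
    4: "INSERT and REPLACE",
}


def decode_wsrep_sync_wait(value: int) -> list:
    """List the statement classes that wait for causality checks at this value."""
    return [label for bit, label in SYNC_WAIT_BITS.items() if value & bit]


for level in (0, 1, 7):
    print(level, decode_wsrep_sync_wait(level))
```

Read this way, a value of 1 covers only reads, which is why it is enough to avoid stale reads, while 7 combines all three classes.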
The True Cost of Being Consistent
If we know stale reads are bad, why don’t we just enforce strict consistency everywhere?
An image can help to understand:
Because in distributed databases, consistency is incredibly expensive. To test this, we used a 3-node internal lab environment to run a Sysbench-based TPC-C derivative test (50/50 read/write split, running for 600 seconds, scaling from 1 to 1024 threads).
You can find the detailed machine specifications here. The benchmarks were executed using a TPC-C derivative test based on sysbench. Finally—and crucially—you can review the configuration files used for the tests. I maintained the same baseline MySQL configuration across the board, only adjusting the parameters specific to each replication technology.
Scenario 1: Default (Relaxed) Consistency
(GR = EVENTUAL, PXC = wsrep-sync-wait 0)
As a reminder, MySQL CE and Percona Server are running Group Replication, while PXC is using Galera.
With default settings, both systems allow stale reads.
Both technologies scale well up to 128 threads:
Group Replication performs exceptionally well, handling up to 15K operations/sec before dropping off after 128 threads.
PXC (Galera) is slightly less efficient at peak but scales very nicely and predictably.
At this level, the lag between the moment of commit and the moment the server returns the answer is minimal. But we are entirely exposed to stale reads.
Scenario 2: Enforced Consistency (The Cost)
(GR = AFTER, PXC = wsrep-sync-wait 7)
When we configure the servers to prevent stale reads, the systems must wait for transactions to be fully applied before returning a read. This is where the architectural differences become glaringly apparent:
PXC (Galera): Performance drops only slightly, from a peak of ~9K ops/sec (in the previous test) to roughly 8.5K ops/sec. The hit is small, and the database remains highly functional and stable.
Group Replication: Performance catastrophically drops from ~15K ops/sec (in the previous test) to a staggering ~3.8K ops/sec.
This is the crucial takeaway
Enforcing strict consistency in Group Replication results in a massive ~75% performance penalty. The latency between the commit and the server response increases significantly compared to PXC.
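The penalty figures follow directly from the approximate peak numbers quoted above. A short calculation, using those rounded values as inputs, makes the comparison explicit. The function name is only illustrative.

```python
def throughput_drop_pct(before: float, after: float) -> float:
    """Relative throughput loss, as a percentage of the original value."""
    return (before - after) / before * 100.0


# Approximate peaks quoted in the two scenarios (ops/sec).
gr_drop = throughput_drop_pct(15_000, 3_800)
pxc_drop = throughput_drop_pct(9_000, 8_500)

print(f"Group Replication: {gr_drop:.1f}% drop")   # 74.7% drop
print(f"PXC:               {pxc_drop:.1f}% drop")  # 5.6% drop
```

With these inputs, Group Replication loses roughly three quarters of its throughput, while PXC loses under 6%.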
The intermediate way
There is another approach which is to inject the higher consistency only when it is really needed.
The Solution: Session-Level Consistency
You do not need, and should not use, full consistency at the global level for general cases. Instead, force consistency only when and where it is critical.
Group Replication does not support setting this through optimizer hints such as SELECT /*+ SET_VAR(…) */, but you can enforce it at the session level right before a critical read:
SET SESSION group_replication_consistency = 'AFTER';
-- OR for PXC:
SET SESSION wsrep_sync_wait = 7;
Note that PXC offers more flexibility, because it also lets you use hints.
By isolating these variables to specific sessions (like the immediate redirect after a password change or a checkout process), you ensure data integrity exactly where the business requires it, while allowing the rest of your application to enjoy the high-speed performance of relaxed consistency.
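On the application side, this pattern can be wrapped so that the stricter setting cannot leak into the rest of the session. The following is a minimal sketch, assuming a Python DB-API cursor whose driver uses the `%s` parameter style (as several MySQL drivers do). The helper name and the allow-list are hypothetical.

```python
from contextlib import contextmanager

# Only the two variables discussed in this article may be set by the helper.
ALLOWED_VARIABLES = {"group_replication_consistency", "wsrep_sync_wait"}


@contextmanager
def session_variable(cursor, name, strict_value, relaxed_value):
    """Set a session variable for the duration of the block, then restore it."""
    if name not in ALLOWED_VARIABLES:
        raise ValueError(f"unsupported session variable: {name}")
    cursor.execute(f"SET SESSION {name} = %s", (strict_value,))
    try:
        yield cursor
    finally:
        # Always return to the relaxed value, even if the critical read fails.
        cursor.execute(f"SET SESSION {name} = %s", (relaxed_value,))


# Example use with a real cursor (not executed here):
#   with session_variable(cur, "wsrep_sync_wait", 7, 0):
#       cur.execute("SELECT balance FROM accounts WHERE id = %s", (42,))
```

Restoring the relaxed value in a `finally` block matters when connections are pooled: a session left at the strict level would silently pay the consistency cost on every later query.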
PXC: The performance drop is minimal and the solution is able to provide a consistent delivery with nice scalability up to 256 threads.
Group Replication: The solution suffers a significant drop. It is smaller than with AFTER set at the global level, but we still see a drop of ~52%.
Comparing the two solutions we can see that PXC is able to deal with the additional requested consistency better.
Additional differences
But these are not the only differences we can immediately see. Comparing resource utilization, we can see that both solutions move the same amount of data in I/O operations, but they differ in memory usage:
Yes, for exactly the same load and traffic, Group Replication consumes 8GB more memory than PXC, which in this environment represents 26% of the total memory available.
That cost is also reflected in CPU utilization.
Conclusion: How to Survive the Cost
How impactful is enforcing strict consistency at a global level in a production environment? Massively. If you blindly enforce strict consistency globally without understanding your architecture, you will decimate your database throughput. Here is the reality of how the two solutions handle that tax:
The Group Replication Reality: By default (using EVENTUAL consistency), MySQL Group Replication behaves essentially as semi-synchronous replication paired with an automated topology manager (see The Failover Brownout: Rethinking High Availability in MySQL Group Replication). The Primary is allowed to forge ahead and serve traffic even if the Secondaries are lagging significantly behind. The moment you demand strict consistency, the Primary is violently tethered back to the rest of the cluster, and its performance drops off a cliff as it waits for the slowest node.
The PXC Advantage: Percona XtraDB Cluster (PXC) absorbs the “consistency penalty” much more gracefully. While varying consistency levels exist in PXC, adjusting them does not cause the same dramatic throughput shock seen in MGR. This is because PXC enforces a virtually synchronous, high-consistency baseline from the start. It simply does not allow the node receiving writes to deviate too far from the rest of the cluster. You pay a baseline performance tax upfront, but in exchange, you get guaranteed, ironclad High Availability out of the box.
The Final Verdict Modifying consistency values at the global server level should only be done after rigorous load testing and a complete understanding of the performance tax you are about to pay.
Ultimately, it comes down to choosing the right tool for your specific SLA:
If your architecture demands a true, virtually synchronous solution with strict High Availability out of the box, PXC is the purpose-built engine for the job.
If you are looking for a highly automated, semi-synchronous solution, Group Replication delivers excellent default performance—but tuning it to mimic PXC’s strict consistency will cost you heavily in throughput.
6:57 am
The Failover Brownout: Rethinking High Availability in MySQL Group Replication
It is time to talk again about Flow Control and Group Replication, this time with a special eye on the use of Group Replication in the Kubernetes context. In this article we will dig a bit into how it works and what the various side effects are.
The problem
Recently I was refining the calculation I use in the MySQL calculator for the Operator, because I kept running into a very serious problem with the Percona Server Operator.
The problem is that when the deployment is serving a high level of traffic, it will, no matter what, end up being OOMKilled by Kubernetes.
This happens because the pod gradually consumes more and more memory until it reaches the memory limit set in the CR specification.
Now let me clarify a few things, to get straight to the facts.
Kubernetes itself does not OOMKill a pod for hitting its memory limit. The mechanism works as described below, including how the Working Set Size (WSS) is calculated and how OOMKills are triggered; the resources section at the end links to the official documentation and source code.
1. The Reality of OOMKills vs. Kubelet Evictions
It is crucial to distinguish between what the Linux kernel does and what Kubernetes does:
OOMKilled (Exit Code 137): This is executed entirely by the Linux kernel’s OOM Killer, not Kubernetes. When we set a memory limit in our Pod spec, Kubernetes translates that into a Linux cgroup constraint (memory.limit_in_bytes for cgroups v1, or memory.max for cgroups v2). If our container attempts to allocate more memory than this hard limit, and the kernel cannot reclaim any page cache (like inactive files), the kernel directly intervenes and terminates the process.
Node-Pressure Evictions: This is where Kubernetes actively observes memory. The kubelet monitors the working_set_bytes metric to protect the node from running out of memory. If the node’s available memory drops below an eviction threshold, Kubernetes will actively evict pods to prevent the kernel from initiating a system-wide OOM kill.
2. How Working Set Size (WSS) is Calculated for the container
Kubernetes monitors container memory via cAdvisor, which is integrated directly into the kubelet. cAdvisor calculates the Working Set Size by taking the total memory usage and subtracting the inactive file cache (memory that the kernel can easily reclaim if it faces memory pressure).
Because active file caches and anonymous memory (like our application’s heap) cannot be easily evicted, this working set metric is the most accurate representation of the memory your container is forcing the system to hold.
The Calculation & cgroups Evolution The core mathematical calculation is Memory Usage – Inactive File Cache, but how cAdvisor fetches this data from the Linux kernel depends entirely on your node’s cgroup version. Modern cAdvisor relies heavily on the opencontainers/runc/libcontainer library to read these raw cgroup files:
cgroups v1: cAdvisor starts with the raw usage from memory.usage_in_bytes and subtracts the reclaimable cache found under the total_inactive_file key.
cgroups v2 (Unified): cAdvisor starts with the raw usage from memory.current and subtracts the reclaimable cache found under the inactive_file key.
The Underlying Code Logic While older versions used a static setMemoryStats function, modern Kubernetes branches handle this dynamically. The logic executes the following flow before reporting back to the kubelet:
Detects Version: It identifies whether the node runs cgroups v1 or v2 to determine the correct inactive file key name.
Fetch Usage: It pulls the raw memory usage from the container.
Subtract Cache: It looks up the inactive file value and safely subtracts it from the usage (including a safeguard to ensure the working set never drops below zero).
Report Metric: It sets this final calculated value as container_memory_working_set_bytes, which the kubelet then uses to decide if the node is under memory pressure.
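The flow above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not cAdvisor's actual code: the function names and the sample numbers are assumptions, while the key names (inactive_file, total_inactive_file) are the ones mentioned above.

```python
def parse_memory_stat(text):
    """Parse the 'key value' lines of a cgroup memory.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def working_set_bytes(usage_bytes, memory_stat_text, cgroup_v2=True):
    """Return usage minus the inactive file cache, never below zero."""
    # The key name depends on the cgroup version, as described above.
    key = "inactive_file" if cgroup_v2 else "total_inactive_file"
    inactive = parse_memory_stat(memory_stat_text).get(key, 0)
    return max(0, usage_bytes - inactive)


if __name__ == "__main__":
    # Hypothetical values: 9 GB reported usage, 1 GB of it reclaimable cache.
    sample_stat = "anon 6000000000\nfile 3000000000\ninactive_file 1000000000\n"
    print(working_set_bytes(9_000_000_000, sample_stat))
```

On a real node the usage would come from memory.current (or memory.usage_in_bytes on cgroups v1) and the text from the memory.stat file of the container's cgroup.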
Back to us
In the end, the point is that if our pod reaches the limit and we ARE NOT using the new swap feature available in Kubernetes, our pod will be brutally killed, and in 99% of cases production will suffer a lot. (Oops, spoiler!)
To clearly understand what was causing the issue about this memory consumption and having my calculator fail, I started to collect the information about the memory usage in MySQL itself.
SELECT EVENT_NAME, CURRENT_NUMBER_OF_BYTES_USED / 1024 / 1024 AS current_usage_mb
FROM performance_schema.memory_summary_global_by_event_name
WHERE EVENT_NAME LIKE 'memory/%' AND EVENT_NAME NOT LIKE 'memory/performance%'
ORDER BY current_usage_mb DESC LIMIT 25;
To simulate the load I used the sysbench-tpcc variant (a TPC-C derived test) and ran the tests with 1024 threads against a cluster of machines with 16 cores and 64GB, on volumes delivering ~3K IOPS: not gigantic, but not small either.
In MySQL, memory/group_rpl/certification_info is a Performance Schema memory instrument. It tracks the exact amount of RAM allocated to store the Certification Database (or Certification Info).
In Group Replication, nodes do not lock rows across the network while a transaction is executing. Instead, transactions execute locally and optimistically. When it is time to commit, the transaction undergoes a Certification Process to ensure no other concurrent transaction in the cluster has modified the exact same rows. The certification_info buffer is the in-memory hash map that makes this conflict detection possible.
1. What is it used for?
The certification_info structure acts as a tracking ledger for recently modified rows.
Here is how it works under the hood:
The Key-Value Pair: It is fundamentally an in-memory dictionary. The key is the hash of a modified row (extracted from the transaction’s “write set”), and the value is the Global Transaction Identifier (GTID) of the transaction that successfully modified it.
Conflict Detection: When a new transaction attempts to commit, it broadcasts its write set and the “snapshot version” of the database it saw when it started. The certifier cross-references the incoming transaction’s write set against the certification_info map.
The Decision: If the certification_info shows that a row was modified by a newer GTID that the incoming transaction did not “see” when it started, a conflict is flagged, and the transaction is aborted. If no conflict exists, the transaction is certified, and the certification_info map is updated with the new write set and GTID.
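To make the three steps above concrete, here is a deliberately simplified sketch. It assumes a single integer sequence number in place of real GTID sets and snapshot versions, and the class and method names are hypothetical, so it only shows the shape of the check, not Group Replication's implementation.

```python
class Certifier:
    """Toy certification database: row hash -> sequence number of last writer."""

    def __init__(self):
        self.certification_info = {}
        self.last_seq = 0

    def certify(self, write_set, snapshot_seq):
        """Return the new sequence number, or None if the transaction conflicts.

        write_set: iterable of row hashes modified by the transaction.
        snapshot_seq: highest sequence number the transaction had seen
        when it started executing.
        """
        for row_hash in write_set:
            # A newer writer that this transaction did not see means conflict.
            if self.certification_info.get(row_hash, 0) > snapshot_seq:
                return None
        self.last_seq += 1
        for row_hash in write_set:
            self.certification_info[row_hash] = self.last_seq
        return self.last_seq


if __name__ == "__main__":
    certifier = Certifier()
    print(certifier.certify({"orders:42"}, snapshot_seq=0))  # certified
    print(certifier.certify({"orders:42"}, snapshot_seq=0))  # stale snapshot
```

The second call is rejected because the row was modified after the snapshot the transaction started from, which is the conflict case described in "The Decision".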
The primary does not hold onto this memory out of stubbornness; it does so because purging that data too early would destroy the cluster’s consistency in the event of a failover.
In Group Replication, garbage collection for the certification_info buffer is not triggered just because a transaction commits on the primary. It is triggered by a concept called the Stable Set.
Every node in the cluster periodically broadcasts a message to the rest of the group saying, “Here are the GTIDs I have successfully applied to my disk.” The cluster then calculates a global low watermark. This watermark is the highest transaction GTID that every single member of the group has successfully applied. Garbage collection is only allowed to purge write-sets from the certification database that fall below this global watermark. Note that this purge is a synchronous operation during which writes are forbidden.
2. How the Apply Queue Stalls the Watermark
When a secondary node starts lagging, its applier queue grows. This means the secondary is receiving transactions from the network quickly, but its SQL thread is too slow to actually execute them and commit them to disk.
Because the secondary hasn’t applied these transactions, it cannot report those GTIDs back to the group as “finished.”
The lagging secondary’s local watermark stalls.
Therefore, the global low watermark for the entire cluster stalls.
Because the global watermark hasn’t moved forward, the garbage_collect function on the primary (and all other nodes) says, “I am not allowed to delete any write-sets yet.”
As the primary continues to process new writes, the certification_info memory buffer grows continuously.
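The stall described above can be sketched as follows. As a simplification, plain integers stand in for GTIDs, and the member names and function names are hypothetical; the point is only that the purge is bounded by the slowest member.

```python
def global_low_watermark(applied_by_member):
    """Highest sequence number that every member has already applied."""
    return min(applied_by_member.values())


def garbage_collect(certification_info, watermark):
    """Purge entries at or below the watermark; return how many were removed."""
    stale = [row for row, seq in certification_info.items() if seq <= watermark]
    for row in stale:
        del certification_info[row]
    return len(stale)


if __name__ == "__main__":
    # Ten certified rows, one per sequence number 1..10.
    cert_info = {f"row-{seq}": seq for seq in range(1, 11)}
    # One secondary has only applied up to sequence 3.
    applied = {"primary": 10, "secondary-1": 10, "secondary-2": 3}
    watermark = global_low_watermark(applied)
    purged = garbage_collect(cert_info, watermark)
    print(watermark, purged, len(cert_info))
```

Even though two of the three members are fully caught up, only the entries the lagging secondary has applied can be purged, so the map keeps everything above its position.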
3. Why the Primary Cannot Purge Early
We might wonder: If the transaction is already committed on the primary, why does the primary care if the secondary has applied it? Why not just drop the write-set from its own memory?
The answer comes down to Failover Safety and Distributed Conflict Detection. GR is a shared-nothing, decentralized architecture. Even if you are running in Single-Primary mode (keep this in mind, it will be important later), the underlying engine uses the exact same logic as Multi-Primary mode.
Here is why the primary is forbidden from purging that data:
The Failover Scenario: Imagine our primary node crashes right now. The lagging secondary (which still has a massive apply queue) is immediately elected as the new primary.
The Conflict Risk: As the new primary, it starts accepting new writes from your application. However, it still has thousands of old transactions in its applier queue that it hasn’t written to disk yet!
The Necessity of the Buffer: When a new write comes in, the new primary must check if that write conflicts with any of the pending transactions in its apply queue. It does this by checking the certification_info map. If the old primary had purged the global certification data early, the new primary wouldn’t have the write-sets for those pending transactions. It would blindly accept the new write, causing a massive data conflict and breaking the replication group entirely.
Fine Marco, then what is the effect of this?
Well, drum roll …
… When a secondary node is elected as the new primary during a failover, it does not immediately open the floodgates to new writes. It keeps its super_read_only variable set to ON until it has completely drained its local apply queue of all transactions that were certified prior to the election.
This is an intentional design choice to guarantee that the new primary’s state is completely consistent with the old primary before it starts accepting new data.
4. Immediate Write Rejections (No Built-in Queuing)
The most critical impact to understand is that the new primary does not queue or pause new incoming writes while it catches up. It outright rejects them.
If our application or proxy routes a COMMIT, INSERT, UPDATE, or DELETE to the new primary while it is still processing the old queue, MySQL will immediately throw an error back to the client:
ERROR 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
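Since the server does not queue these writes, any waiting has to happen on the client side. A minimal sketch of that idea follows, assuming a hypothetical DatabaseError exception that exposes the error number; a real driver has its own exception classes, so treat the names here as placeholders.

```python
import time

SUPER_READ_ONLY_ERRNO = 1290  # error code shown in the message above


class DatabaseError(Exception):
    """Hypothetical stand-in for a driver exception that carries an errno."""

    def __init__(self, errno, message):
        super().__init__(message)
        self.errno = errno


def write_with_retry(operation, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Run operation(), retrying with exponential backoff on error 1290 only."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except DatabaseError as exc:
            # Any other error, or the last attempt, is passed to the caller.
            if exc.errno != SUPER_READ_ONLY_ERRNO or attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

A retry loop like this only hides a short catch-up window. If the new primary needs minutes to drain its queue, the application will still see the failure once the attempts are exhausted.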
5. The “Brownout” Window (Write Outage)
Because of this behavior, a failover in MySQL Group Replication does not instantly restore write availability. Our cluster experiences a “brownout”, a period where reads might succeed, but writes are entirely blocked.
The duration of this write outage is directly proportional to the size of the apply queue.
If the secondary was fully caught up, write availability is restored in milliseconds.
If the secondary was lagging by 50 minutes, your application will suffer a 50 minute write outage while the node applies the backlog.
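As a rough rule of thumb, the outage can be estimated by dividing the backlog by the rate at which the node applies it. The sketch below assumes a constant apply rate, which is optimistic, and the figures in the example are illustrative only.

```python
def estimated_brownout_seconds(queue_length, apply_rate_tps):
    """Estimate the write outage as backlog size divided by apply rate."""
    if apply_rate_tps <= 0:
        raise ValueError("apply_rate_tps must be positive")
    return queue_length / apply_rate_tps


if __name__ == "__main__":
    # Illustrative figures: 1,000,000 queued transactions, 2,000 applied per second.
    seconds = estimated_brownout_seconds(1_000_000, 2_000)
    print(f"{seconds:.0f} seconds (~{seconds / 60:.1f} minutes) without writes")
```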
6. Impact on Proxies (e.g., MySQL Router or ProxySQL)
If we are using a proxy layer to route our database traffic, the apply queue dictates how the proxy behaves during the transition:
MySQL Router: It continuously monitors the cluster topology and the super_read_only flag. Even though the node has technically been elected primary, Router will not open the read-write port to it until the apply queue drains and super_read_only flips to OFF. Depending on your application timeouts, client connections will either hang waiting for a writable connection or fail completely.
ProxySQL: Similar to Router, if it is configured to check for the read_only state, it will temporarily quarantine the new primary from the write hostgroup.
HAProxy (in the Operator): It monitors both the Primary state and the read_only state, but it still exposes the Primary to writes, causing the application to fail (a bug we need to fix).
7. Read Traffic and Stale Data
During this catch-up phase, the node will accept incoming SELECT queries (since it is still a valid database). However, because it is actively churning through the old primary’s backlog, the data being read is temporarily stale.
If your application reads a row that is sitting in the apply queue but hasn’t been committed to disk yet, it will get the old version of that row.
Why Flow Control is Critical
Because a large apply queue turns a seamless failover into a severe, application-breaking write outage, Group Replication includes the Flow Control feature.
Flow Control monitors the size of the apply queues across all secondaries. If a secondary starts lagging too far behind, Flow Control should actively throttle the write throughput on the current primary to allow the lagging node to catch up. It is essentially a trade-off: we accept a slight performance hit during normal operations to guarantee that your database recovers almost instantly during a failover.
However, this is not what really happens.
1. It is Reactive, Not Proactive (The Polling Blind Spot)
Flow control does not intercept and evaluate every single transaction in real-time. Instead, it relies on a periodic polling interval governed by group_replication_flow_control_period (which defaults to 1 second).
Once a second, the cluster checks the size of the apply queues and the certifier queues.
The Vulnerability: If our application generates a massive spike of 50,000 writes in 500 milliseconds, the primary will happily accept and certify all of them. Flow control will not even notice the spike until the next 1 second polling interval hits. By the time it decides to apply a throttle, the damage is already done, and the secondary’s queue is already overflowing.
2. The PID Controller’s “Soft Brake” Math
When flow control does decide to throttle, it does not simply freeze the primary. It uses a PID (Proportional-Integral-Derivative) controller algorithm to calculate a “write quota” (the maximum number of transactions the primary is allowed to commit in the next second).
The PID controller is deliberately tuned to be gentle. It wants to gracefully degrade performance rather than cause immediate application timeouts.
When the secondary’s queue breaches the group_replication_flow_control_applier_threshold (default 25,000 transactions), the PID controller reduces the primary’s quota incrementally.
The Failure Point: If the primary’s incoming write rate is astronomically higher than the secondary’s disk IO capacity, this incremental “step down” in the quota is too slow. The primary is still allowed to write, say, 10,000 transactions per second, while the secondary is only applying 2,000. The queue continues to grow aggressively despite the throttle being “active.”
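The effect of a gentle step-down can be seen with a toy simulation. To be clear, this is not the algorithm Group Replication uses: the 10% reduction per period is an assumption chosen for illustration, and only the 1 second period, the 25,000 threshold and the 10,000 vs 2,000 rates come from the text above.

```python
def simulate_soft_brake(offered_tps, apply_tps, threshold, step_down=0.10, seconds=10):
    """Return (second, quota, queue) tuples for a once-per-second throttle check."""
    queue = 0
    quota = offered_tps
    history = []
    for second in range(1, seconds + 1):
        committed = min(offered_tps, quota)
        queue = max(0, queue + committed - apply_tps)
        history.append((second, int(quota), int(queue)))
        # The check runs once per period, after the writes have already landed.
        if queue > threshold:
            quota = quota * (1 - step_down)
    return history


if __name__ == "__main__":
    for second, quota, queue in simulate_soft_brake(10_000, 2_000, 25_000):
        print(f"t={second:2d}s quota={quota:6d} queue={queue:6d}")
```

In this model the throttle only activates after the queue has already passed the threshold, and the queue keeps growing for as long as the reduced quota stays above the secondary's apply rate.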
3. The Concurrency Mismatch (Parallel vs. Serial)
This is often the silent killer that defeats flow control. Flow control makes mathematical assumptions about how fast the secondary should be able to apply transactions based on recent history.
However, the primary node might be executing writes using hundreds of highly concurrent threads. The secondary relies on the parallel applier to keep up. If the incoming workload suddenly includes transactions that cannot be parallelized, such as writes hitting overlapping rows, cascading foreign key updates, or DDL statements, the secondary’s applier instantly drops from executing in parallel down to a single, serialized thread.
When this serialization happens, the secondary’s applier rate plummets instantly. Flow control, which only checks in once a second and adjusts gradually, cannot brake the primary fast enough to compensate for the secondary suddenly dropping to a crawl.
What can we do?
At the moment of writing there are only two things that can be done.
Make Flow control more aggressive
Increase the number of replication appliers
1. Making Flow Control More Aggressive
We can configure Flow Control to be a bit more aggressive. It will still remain a suggestion but a strong one.
How it works (The Configuration):
Lower the Threshold: By reducing group_replication_flow_control_applier_threshold (default is 25,000) to something like 1,000 or 500, we force the PID controller to kick in almost immediately when a spike occurs.
Remove the Safety Net: By keeping group_replication_flow_control_min_quota at 0 (the default), we remove the minimum write guarantee. If the secondary falls behind, Flow Control is allowed to throttle the primary’s writes down to zero, even though in practice this never happens.
Increase the Sensitivity: We can tweak the PID controller’s math (using the derivative and proportional tuning variables) to react much more aggressively to queue growth:
group_replication_flow_control_hold_percent=100
group_replication_flow_control_release_percent=5
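One way to apply the settings discussed in this list at runtime is through SET PERSIST statements. The small helper below only renders those statements from a dictionary; the helper name is hypothetical, the threshold of 1,000 is one of the example values mentioned above, and the statements should be reviewed before being run against a real cluster.

```python
AGGRESSIVE_FLOW_CONTROL = {
    "group_replication_flow_control_applier_threshold": 1000,
    "group_replication_flow_control_min_quota": 0,
    "group_replication_flow_control_hold_percent": 100,
    "group_replication_flow_control_release_percent": 5,
}


def render_set_persist(settings):
    """Render one SET PERSIST statement per variable, in insertion order."""
    return [f"SET PERSIST {name} = {int(value)};" for name, value in settings.items()]


if __name__ == "__main__":
    for statement in render_set_persist(AGGRESSIVE_FLOW_CONTROL):
        print(statement)
```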
The reality check, does it work?:
If the expectation is to have rigid control over the applier queue on the lagging secondary, then the answer is NO. At the moment, flow control is not designed to act as we are used to in PXC (Percona XtraDB Cluster), where we have rigid control of the pending queue, even at the cost of delaying writes. In Group Replication, Flow Control will never bring writes to 0, and unfortunately the mechanism is not enough to keep the queue under control.
2. Increasing Replication Appliers
To help the secondary chew through the queue faster, we can increase the number of parallel threads it uses to write to disk.
How it works: We can increase the replica_parallel_workers (formerly slave_parallel_workers) setting. GR is exceptionally smart about this. Because of the certification process we discussed earlier, GR already knows exactly which transactions modify which rows. It uses a writeset-based dependency tracker to safely hand off non-conflicting transactions to multiple worker threads simultaneously. The rule of thumb normally used to size the replication workers is 2.5 workers per available core. For example, if we have 14000m CPUs in our CR (K8s), we can assign ~35 workers, which is definitely higher than the default value of 4.
The reality check, does it work?: Yes, but only if our workload allows it.
The Catch – The Serialization Wall: Parallel appliers only work if the transactions do not conflict. If our application has 50 concurrent threads all trying to update the same “inventory count” row, or updating a highly contentious table, those transactions cannot be parallelized. The secondary’s coordinator thread will see the row-level conflicts and force those transactions to wait in line and execute sequentially. We could allocate 128 parallel workers, but 127 of them will sit idle while one thread does all the work.
The Catch – Context Switching: More threads do not magically create more disk IOPS. If we set the workers too high (e.g., beyond the physical CPU core count or disk IO capacity), the secondary’s InnoDB engine will spend more time context-switching and fighting over internal mutex locks than actually committing data. In many cases, over-allocating parallel workers actually slows down the apply rate.
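The 2.5 workers per core rule mentioned above is easy to turn into a small helper that takes the CPU value as written in the CR. The function names are hypothetical, and the result is only a starting point given the two catches just described.

```python
def cpu_millicores(cpu):
    """Convert a Kubernetes CPU quantity such as '14000m' or '14' to millicores."""
    cpu = str(cpu).strip()
    if cpu.endswith("m"):
        return int(cpu[:-1])
    return int(float(cpu) * 1000)


def suggested_parallel_workers(cpu, workers_per_core=2.5):
    """Apply the workers-per-core rule of thumb, rounding down."""
    return int(cpu_millicores(cpu) * workers_per_core / 1000)


if __name__ == "__main__":
    print(suggested_parallel_workers("14000m"))
```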
Do we have any conclusions?
1. If HA is the goal, enforce Strict Flow Control
If our absolute top priority is High Availability, specifically achieving a near-zero Recovery Time Objective (RTO), we must configure an aggressive flow control.
The Logic: Fast failovers require small apply queues. To guarantee a small apply queue, we must strictly throttle the primary the millisecond the secondary starts to lag.
The Trade-off: We are protecting the cluster’s failover readiness at the expense of application write latency. If there is a massive write spike, our application will face timeouts and connection errors, but if the primary server suddenly catches fire, our database will recover and elect a new primary almost instantly.
The problem is that Group Replication is not able to act like that today. This is something we eventually need to implement to have better HA.
2. If Performance is the goal, relax Flow Control
If our top priority is keeping the application fast and ensuring COMMIT latencies remain extremely low, we should relax flow control or rely on the generous defaults.
The Logic: By relaxing flow control, we allow the primary to run at the absolute maximum speed its local disks and CPU allow. It does not care if the secondaries fall behind. Our application users remain happy and experience zero throttling.
The Trade-off: We are accepting severe risks to our HA posture. If the primary crashes while the secondaries have a massive apply queue, we will suffer a long write outage (the brownout) while the new primary catches up. Additionally, we are accepting the risk that the certification_info memory buffer will grow significantly on the primary and eventually get the pod OOMKilled.
3. Is this not what Asynchronous replication with semi-sync offers?
1. The Similarities
If we look purely at how a single transaction flows and how a failover behaves, GR and Semi-Sync look like twins:
The Durability Guarantee: Semi-Sync: The primary waits to commit until at least one secondary confirms it has received the transaction and written it to its local Relay Log.
GR: The primary waits to commit until a majority quorum of nodes confirm they have received the transaction, certified it, and written it to their local relay logs.
The Failover Delay (The Queue): In both systems, the secondary receiving the data does not mean the secondary has applied the data to its InnoDB tables.
If a crash happens, both systems require the new primary to completely execute its pending queue (Relay Log for Semi-Sync, Apply Queue for GR) before it is safe to accept new writes.
2. The Crucial Differences
If they behave so similarly, why use GR at all? The differences lie entirely in automation, consensus, and split-brain protection. Semi-Sync is just a data transport mechanism; GR is a full state-machine cluster.
Here is what GR gives you that Semi-Sync does not:
Automatic Election and Orchestration:
Semi-Sync: If the primary dies, Semi-Sync does nothing. The cluster sits there broken. You must rely on external tools (like Orchestrator or manual DBA intervention) to detect the crash, pick the most up-to-date secondary, wait for its relay log to apply, disable read_only, and re-point the application.
GR: The cluster detects the failure natively. The remaining nodes use Paxos consensus to elect a new primary automatically, manage the queue drain natively via the super_read_only flip we discussed, and self-heal.
Split-Brain Protection (Network Partitions):
Semi-Sync: If our network splits in half, an external failover tool might accidentally promote a secondary while the old primary is still alive and accepting writes. We now have a split-brain, and our data is permanently corrupted.
GR: GR enforces strict quorum. If a network split happens, the side of the network with the minority of nodes will automatically fence itself off and refuse all writes. Split-brain is mathematically prevented.
The Certification Database:
As we established, GR requires the certification map to ensure the new primary doesn’t accept writes that conflict with its unapplied queue. Semi-Sync does not have this; it relies entirely on the external failover tool to guarantee no writes touch the new primary until the relay log is 100% applied.
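The quorum rule behind the split-brain protection described above boils down to a strict majority test. The sketch below shows only that test, with hypothetical function names; the real group membership protocol involves much more than counting nodes.

```python
def has_quorum(reachable_members, group_size):
    """A partition keeps quorum only with a strict majority of the group."""
    return reachable_members > group_size // 2


def writable_partitions(partition_sizes):
    """Return, for each side of a network split, whether it may accept writes."""
    group_size = sum(partition_sizes)
    return [has_quorum(size, group_size) for size in partition_sizes]


if __name__ == "__main__":
    print(writable_partitions([2, 1]))  # three nodes split 2 / 1
    print(writable_partitions([2, 2]))  # four nodes split evenly
```

At most one side of a split can hold a strict majority, which is why two primaries cannot accept writes at the same time. An even split leaves no side with quorum at all.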
3. Final observation
If we are using Single-Primary GR with relaxed flow control, we have essentially built a highly-automated, consensus-driven version of Semi-Sync replication.
We have the exact same apply-queue bottleneck during failover, but we have traded the need for external orchestrator tools for built-in Paxos consensus and native split-brain protection.
Conclusions (for real)
When we run MySQL on a traditional, dedicated Virtual Machine, memory limits are “soft.” If the certification_info database explodes and consumes an extra 10GB of RAM because of the applier lag, the Linux OS might start aggressively swapping inactive pages to disk, but the MySQL process usually survives. Performance degrades, but the database stays online.
In Kubernetes, memory limits are “hard.” As we discussed earlier, Kubernetes enforces pod memory limits via cgroups v2 (memory.max). The Linux kernel’s OOM Killer has no understanding of database quorum, failover states, or apply queues. It only sees math: Working Set Size > memory.max = Terminate Process (Exit Code 137).
The Chain Reaction of Relaxed Flow Control in k8s
If we prioritize “performance” by relaxing Flow Control in a Kubernetes environment, we are essentially setting a ticking time bomb. Here is the chain of events:
The Spike: Our application experiences a massive write spike.
The Queue: The secondary pod’s disk cannot keep up, and its applier queue grows to 1,000,000 transactions.
The Memory Sprawl: Because the queue is large, the global low-watermark stalls. The Primary pod is forbidden from garbage collecting the certification_info map. The in-memory hash map balloons in size.
The Execution: When memory.current reaches memory.max, the kernel starts its OOM handling. Its first action is to try to free the page cache related to the process. If that succeeds and memory.current drops below memory.max, the process survives; otherwise the kernel kills it. We can use the WSS metric to predict an OOMKill: the Primary pod’s Working Set Size (WSS) breaching its Kubernetes memory limit is a fair estimate, not an absolute value.
The Catastrophe: The Linux OOM Killer instantly assassinates the Primary MySQL process.
Because we tried to avoid a few seconds of write latency by keeping relaxed Flow Control, we inadvertently caused a hard crash of the primary database pod, with long write downtime.
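To get a feeling for how fast this chain can play out, we can estimate the time left before the working set reaches the limit. The sketch assumes a constant growth rate, and all the figures in the example are hypothetical; as noted above, WSS is an estimate of the kill condition, not the exact trigger.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def seconds_until_limit(working_set_bytes, limit_bytes, growth_bytes_per_second):
    """Estimate seconds before the working set reaches the memory limit.

    Returns None when memory is not growing.
    """
    if growth_bytes_per_second <= 0:
        return None
    headroom = max(0, limit_bytes - working_set_bytes)
    return headroom / growth_bytes_per_second


if __name__ == "__main__":
    # Hypothetical pod: 28 GiB working set, 32 GiB limit, growing 16 MiB/s.
    print(seconds_until_limit(28 * GIB, 32 * GIB, 16 * MIB))
```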
The Architectural Law
Therefore, here is my statement as architectural law for containerized environments: In Kubernetes, High Availability and Pod stability are so intrinsically linked that Flow Control must act as hard as it can to cap the apply queue.
We cannot allow unbounded memory growth in a container. The only way to bound certification_info memory is to bound the apply queue.
The only way to bound the apply queue is with strict, aggressive Flow Control.
Increasing the number of replication appliers helps but is not the conclusive answer.
In a Kubernetes environment, we must tune group_replication_flow_control_applier_threshold to a strict, low number, and accept that during massive traffic spikes, our application will experience write throttling. It is infinitely better for our application’s connection pool to wait 2 seconds for a COMMIT to succeed than for the primary database pod to be violently OOMKilled by the kernel, and have to wait for minutes or hours to recover write capabilities.
Note
Note that this is exactly how the Percona Operator with Percona XtraDB Cluster works. To be more specific, PXC, and Galera-based solutions in general, have a Flow Control mechanism that keeps the queue inside hard limits. While this more invasive control may be noticeable at the application level, it guarantees that the other nodes do not lag behind the primary, and this is why it is a stronger HA solution in the Kubernetes environment.
Managing Resources and OOMKills: Resource Management for Pods and Containers (This page details how memory limits are enforced reactively by the Linux kernel via OOM kills).
How WSS triggers Evictions: Node-pressure Eviction (This page explicitly details how the kubelet uses the memory.available signal, which is derived from node capacity minus the working set size).
10:25 am
Percona Operator for MySQL (PXC) 1.20.0: Automatic Storage Resizing, TLS Certificate Rotation, and ARM64 Support
Percona Operator for MySQL PXC 1.20.0 is out today, and it addresses three long-requested operational headaches: storage that grows on its own before it fills up, TLS certificates that rotate without cluster downtime, and images that run natively on ARM64.
Disk-full incidents on PXC clusters often arrive at 2 AM when monitoring alerts fire, and someone has to manually expand PVCs before writes grind to a halt. Certificate rotations have traditionally meant a carefully timed series of kubectl edits with real downtime risk. And ARM64 hardware has been increasingly common in dev clusters and cost-optimized cloud node pools, where x86-only images created extra friction. 1.20.0 addresses all three in a single release.
The operator is open source and runs on any CNCF-conformant Kubernetes distribution, including GKE, EKS, AKS, and OpenShift. It supports Kubernetes 1.33 through 1.36 and PXC 8.4, 8.0, and 5.7.
In this post, you’ll learn about:
Automatic PVC storage resizing with configurable thresholds and a hard cap
Zero-downtime TLS certificate rotation via a new Secret naming convention
Native ARM64 support across all operator images
PITR validation that catches misconfigured targets before restores begin
Configurable leader election for high-latency or unstable networks
Other improvements in this release
Automatic Storage Resizing
Why it matters
A full data volume is the most common cause of unplanned maintenance on a PXC cluster. Until now, avoiding it required external monitoring, manual kubectl patch pvc steps, and waiting for the storage class to honor the resize. Even with good alerting, the operator itself had no mechanism to react: it could only expand PVCs when you changed the spec by hand.
1.20.0 introduces built-in storage autoscaling. The operator polls each PVC’s actual disk usage, and when usage crosses a configured threshold, it automatically expands the claim. You set the trigger percentage, the step size per resize event, and an optional upper bound. The operator handles everything else.
How it works
The autoscaler runs inside the normal reconcile loop. It reads status.capacity.storage from each PXC PVC, compares current usage against triggerThresholdPercent, and issues a PVC resize when the threshold is crossed. It sets a percona.com/pvc-resize-in-progress annotation on the CR while an expansion is active. This annotation blocks concurrent rolling restarts or upgrades from starting, so nothing disrupts the cluster mid-resize.
You can also set enableExternalAutoscaling: true if an external tool, such as KEDA, already manages PVC sizes for your cluster. When you enable external autoscaling, the built-in loop skips its resize check entirely to avoid conflicts.
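The threshold decision described above can be sketched as a small shell function. This is an illustration only, not the operator's code: the function name, the GiB-only units, the "at or above the threshold" comparison, and the clamping of growth to maxSize are all assumptions made for the example.

```shell
# Hypothetical helper: print the PVC size (GiB) the autoscaler would request.
# Arguments: used_gib capacity_gib threshold_percent step_gib max_gib
next_pvc_size_gib() {
  local used="$1" capacity="$2" threshold="$3" step="$4" max="$5"
  local pct next
  # Integer usage percentage, rounded down.
  pct=$(( $used * 100 / $capacity ))
  # Below the trigger, or already at the cap: keep the current size.
  if [ "$pct" -lt "$threshold" ] || [ "$capacity" -ge "$max" ]; then
    echo "$capacity"
    return 0
  fi
  next=$(( $capacity + $step ))
  # Assumption: a growth step never pushes the claim past maxSize.
  if [ "$next" -gt "$max" ]; then
    next="$max"
  fi
  echo "$next"
}
```

For example, next_pvc_size_gib 8 10 80 2 100 models a 10Gi volume that is 80% full with growthStep: 2Gi and maxSize: 100Gi.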
Wiring it up
Add storageScaling to your PerconaXtraDBCluster spec:
apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBCluster
metadata:
  name: cluster1
spec:
  crVersion: 1.20.0
  storageScaling:
    enableVolumeScaling: true
    autoscaling:
      enabled: true
      triggerThresholdPercent: 80   # resize when a PVC is 80% full
      growthStep: 2Gi               # add 2Gi per resize event
      maxSize: 100Gi                # never grow beyond 100Gi per PVC
    # enableExternalAutoscaling: false
Any PVC expansion requires enableVolumeScaling: true, whether the autoscaler or a manual spec change triggers it. Setting autoscaling.enabled: true enables the threshold-based path on top of that. Leave the autoscaling block out if you only want to permit manual spec-driven resizes.
Caveats
Storage expansion requires a StorageClass with allowVolumeExpansion: true. Check before enabling:
kubectl get storageclass \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.allowVolumeExpansion}{"\n"}{end}'
Autoscaling applies only to PXC data volumes. If your storage class or CSI driver handles expansion externally, use enableExternalAutoscaling: true to prevent the two mechanisms from racing.
Automated TLS Certificate Rotation
Why it matters
Rotating TLS certificates on a live PXC cluster has always carried risk. The Galera protocol requires all nodes to trust each other’s CA simultaneously. Swap the CA on one node before the others accept it, and inter-node communication breaks. The safe approach requires a three-phase CA swap with rolling restarts between each phase: a process that is easy to get wrong under time pressure.
1.20.0 formalizes this into a first-class operator workflow. Create a Secret named <ssl-secret>-new containing the replacement credentials, and the operator runs the full three-phase rotation automatically, pausing for rolling restarts between each step.
How it works
The rotation proceeds in three steps that the operator coordinates:
Combined CA phase. The old CA and new CA are merged into a single ca.crt and pushed to all nodes. Every node now trusts both roots.
New leaf phase. The new tls.crt and tls.key are pushed node by node with a rolling restart. New leaf certs are signed by the new CA, and the combined CA means all nodes trust them.
New CA only phase. The combined ca.crt is replaced with the new CA only. The old root is removed. Another rolling restart completes the rotation.
When step 3 completes, the operator automatically deletes the -new Secret. The cluster never loses TLS connectivity between nodes during the process.
Wiring it up
Given a cluster named cluster1 using the default SSL Secret cluster1-ssl, create the replacement:
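The original snippet is not reproduced in this feed. As a minimal sketch, the replacement Secret could be created with kubectl along these lines. The helper name and the directory layout (ca.crt, tls.crt, and tls.key in one folder) are assumptions; the kubernetes.io/tls type mirrors how the operator's SSL Secrets are usually created by hand.

```shell
# Hypothetical helper: create "<ssl-secret>-new" from a directory holding the
# replacement ca.crt, tls.crt and tls.key (the file layout is an assumption).
create_rotation_secret() {
  local ssl_secret="$1" cert_dir="$2"
  kubectl create secret generic "${ssl_secret}-new" \
    --type=kubernetes.io/tls \
    --from-file=ca.crt="${cert_dir}/ca.crt" \
    --from-file=tls.crt="${cert_dir}/tls.crt" \
    --from-file=tls.key="${cert_dir}/tls.key"
}
```

For the cluster above, the call would be create_rotation_secret cluster1-ssl ./new-certs, run against the namespace that holds cluster1.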
You do not need to change the PerconaXtraDBCluster CR. The operator detects the -new Secret on the next reconcile and starts the rotation. No kubectl patch on the CR, no operator restart.
Caveats
The operator does not yet surface rotation progress in .status.conditions. Monitor the rotation by watching PXC pods restart in sequence and checking that the -new Secret is eventually gone:
kubectl get pods -w -l app.kubernetes.io/component=pxc
kubectl get secret cluster1-ssl-new # should 404 when rotation is complete
ARM64 Support
Why it matters
AWS Graviton3, Google Axion, and Azure Cobalt 100 instances deliver better price-to-performance on memory-intensive workloads like PXC. Previously, running the operator on ARM64 nodes required cross-architecture scheduling workarounds or explicit node exclusions for operator pods. All PXC operator images now publish native linux/arm64 layers alongside the existing x86 ones.
What is covered
Every image in the PXC operator stack ships multi-arch manifests in 1.20.0:
The operator manager image
The PXC xtrabackup sidecar
The log collector (Fluentbit-based)
The init container
This release also fixes a logrotate crash on ARM64 (K8SPXC-1821) caused by a missing dependency in the ARM64 container layer.
Wiring it up
You do not need any configuration change. Pull the 1.20.0 operator image and Kubernetes schedules it on whichever architecture is available. To pin PXC pods explicitly to ARM64 nodes, add a nodeSelector or node affinity in the spec.pxc block:
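The original snippet is not reproduced in this feed. One way to express the pinning, sketched under the assumption that the standard kubernetes.io/arch node label is present on your nodes, is a merge patch on spec.pxc.nodeSelector. The helper name is hypothetical.

```shell
# Hypothetical helper: pin PXC pods of one cluster to ARM64 nodes by setting
# spec.pxc.nodeSelector to the well-known kubernetes.io/arch label.
pin_pxc_to_arm64() {
  local cluster="$1"
  kubectl patch pxc "$cluster" --type=merge \
    -p '{"spec":{"pxc":{"nodeSelector":{"kubernetes.io/arch":"arm64"}}}}'
}
```

Calling pin_pxc_to_arm64 cluster1 changes the pod template, so expect the PXC pods to restart one by one as they are rescheduled.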
Other improvements in this release
PITR target validation before restore begins (K8SPXC-1318, K8SPXC-1634, K8SPXC-1635, K8SPXC-1793): The operator now validates PITR targets (type, GTID, timestamp) against available binary logs before starting a restore. It catches a misconfigured target before it pauses the cluster, rather than after.
Configurable leader election (K8SPXC-1805): Three new environment variables tune leader election timing for high-latency or flaky network environments.
SST retry limit (K8SPXC-1619): A new spec.pxc.sstRetryCount field caps the number of State Snapshot Transfer retry attempts, preventing a node that repeatedly fails SST from looping indefinitely.
Custom logrotate configuration (K8SPXC-1789): Supply a custom logrotate config via a ConfigMap reference in spec.logcollector.logRotate for fine-grained control over log rotation for PXC and utility containers.
Enhanced full cluster crash recovery (K8SPXC-1828): 1.20.0 hardens the crash recovery path to prevent potential data loss after sudden node power-offs.
Deprecation notice: PMM2 monitoring integration is deprecated in 1.20.0. Migrate to PMM 3 before version 1.22.0, when PMM2 support will be removed.
Conclusion
PXC Operator 1.20.0 turns three previously manual steps into operator-managed concerns: disk growth, certificate rotation, and ARM64 scheduling. Combined with PITR validation improvements and configurable leader election, this release reduces the operational surface area for clusters running under production pressure. If you run into edge cases with automatic storage resizing or TLS rotation, the community forum is the right place to share them.
11:44 am
Migrating from MongoDB 6.0 to 8.0: How Percona ClusterSync Handles Cross-Version Replication
Percona ClusterSync for MongoDB (PCSM) replicates data between MongoDB clusters so that migrations can run with near-zero downtime. Prior to version 0.9.0, it required the source and target to run the same major version, which ruled out the lift-and-shift move most migrations want: going from an older major like 6.0 straight onto a newer one like 8.0.
When we first pointed PCSM at a target running a newer major version of MongoDB, we quickly discovered a problem. Cross-version replication failed because a change stream was applying only a small fraction of the events it read. Because so much of what we read never made it to the target, we assumed the problem was format. We braced for a layer that would rewrite change stream events between major versions.
That assumption was wrong, and being wrong about it is the most interesting part of this story.
The problem
The same-major requirement in earlier versions stemmed from a simple assumption: matching majors maintained metadata and API compatibility. The tool never had to reason about a version gap. The catch is that getting both clusters onto the same major version first is exactly the step a low-downtime migration aims to avoid.
PCSM already knew the versions it was talking to. On startup, it asked each cluster for its build version and logged the results. It just did nothing with that information. No downgrade guardrail, no compatibility gate, no feature-compatibility-version check. Cross-version replication worked when it worked and broke when it broke, and nothing in the tool had an opinion either way.
So we ran a spike to find out what cross-version replication actually needed. Two findings reshaped the whole effort.
First, we found that for the operations PCSM cares about, MongoDB 6.0 and 8.0 produce the same change-stream events. There was no format to translate. The layer we expected to build did not need to exist.
Second, the cross-version failures were not many separate problems. They were one problem. A DDL operation would fail, the error got swallowed, and PCSM would attempt to continue applying changes to a target whose state had already diverged from the source. That single swallowed error cascaded into failed selective replication, missing indexes, and capped-collection mismatches downstream. The gap between events read and events applied was just the tool reading events it could no longer apply, because the collection state had already drifted.
The goal shrank from building a complex translation layer to maintaining better discipline about handling the operations we already understood.
The approach
We set out with a longer list of goals than we shipped, and several of them turned out to matter less than we thought going in. Here is what actually landed.
The tool now:
starts a sync when the source major is lower than the target major
blocks a downgrade with a descriptive error
keeps the change stream stable when the target is a newer major and
handles the big jump from 6.0 straight to 8.0.
Two things we had considered did not ship and were moved to the known limitations list: patch-level version enforcement and fetching the feature compatibility version at startup. More on those below.
Stop swallowing DDL errors
The replication path that applied DDL operations (creating and dropping collections, building indexes, changing collection options) used to ignore errors and keep running. We made those failures surface and stop the run instead. On its own, that fixes no compatibility logic, yet it took the cross-version suite from about half-failing to clean.
The reason is that almost none of those failures were independent. One DDL operation would fail, the error went nowhere, and PCSM kept applying later events against a target that had already drifted from the source. Every subsequent mismatch was counted as a separate failure. Once the first error stopped the run instead of vanishing, what was left was a handful of operations that genuinely behave differently between majors. Surfacing these failures made them findable. Handling those few remaining cases is what got the release as a whole to a clean run.
The version guardrail
Before doing anything else, PCSM now reads the build version from both clusters and compares them. A downgrade is a hard stop. A cross-version upgrade logs an informational line and proceeds. This check is the contract that PCSM 0.9.0 ships.
The comparison is limited to the major version. Equal majors are treated as the same version, a lower source major as a supported upgrade, and a higher source major as a refused downgrade. That is the whole gate.
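The gate described above is small enough to sketch in a few lines of shell. This is an illustration of the major-only comparison, not PCSM's actual code: the function names and the message wording are assumptions.

```shell
# Hypothetical sketch of a major-only version gate; not PCSM's implementation.
# Extract the major component from a version string such as "6.0.20".
major_of() { echo "${1%%.*}"; }

# Arguments: source_version target_version
check_version_pair() {
  local src="$1" tgt="$2" src_major tgt_major
  src_major=$(major_of "$src")
  tgt_major=$(major_of "$tgt")
  # A higher source major is a downgrade: refuse it.
  if [ "$src_major" -gt "$tgt_major" ]; then
    echo "refusing downgrade: source $src is newer than target $tgt" >&2
    return 1
  fi
  # A lower source major is a supported upgrade: log it and proceed.
  if [ "$src_major" -lt "$tgt_major" ]; then
    echo "cross-version upgrade: $src -> $tgt"
  fi
  # Equal majors are treated as the same version.
  return 0
}
```

Patch and minor components are deliberately ignored, which matches the major-only limitation listed later in this post.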
A capped-collection rounding bug
This was the one genuine cross-version-only bug. It’s a good example of the kind of thing that does not show up until you actually run the mismatch. MongoDB 6.x rounds capped-collection sizes up to the nearest 256 bytes internally. When a capped-size change flows through the change stream, the event carries the size the user asked for, say 3333 bytes, not the size 6.x actually stored, 3584. A newer target stores exactly what it is told, so applying the requested size produces a collection that does not match the source.
The fix rounds the size up to the nearest 256-byte boundary when the source is 6.x. It showed up in just three of the suite’s two hundred tests, all in the capped-size checks. That was the entire substance of what we had feared would be a sprawling 6.0-to-8.0 compatibility effort.
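The rounding itself is plain integer arithmetic. A minimal sketch, with a hypothetical function name and the source major passed in as a plain number:

```shell
# Hypothetical sketch: round a requested capped-collection size up to the next
# 256-byte boundary when the source cluster is a 6.x server.
# Arguments: requested_size_bytes source_major
round_capped_size() {
  local size="$1" src_major="$2"
  if [ "$src_major" -eq 6 ]; then
    echo $(( ($size + 255) / 256 * 256 ))
  else
    # Other majors: apply the requested size unchanged.
    echo "$size"
  fi
}
```

With the numbers from the example above, round_capped_size 3333 6 yields 3584, the size a 6.x source actually stores.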
Tolerating transient catch-up errors
While PCSM is still catching up to the source, some operations arrive referencing objects that do not exist yet or are already on their way out. Missing namespaces, missing indexes, invalid options, and mid-drop databases are all treated as non-fatal during this window, because the final state converges regardless.
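As a rough illustration of that policy, the check below classifies an error by its MongoDB error code name. The mapping from the four cases above to these specific code names is an assumption made for the example, and the function name is hypothetical; PCSM's real classification may differ.

```shell
# Hypothetical sketch: succeed if an error is one that can be ignored while the
# target is still catching up. The code names are standard MongoDB error names;
# their selection here is an assumption, not taken from PCSM.
is_tolerated_during_catchup() {
  case "$1" in
    NamespaceNotFound|IndexNotFound|InvalidOptions|DatabaseDropPending)
      return 0 ;;
    *)
      return 1 ;;
  esac
}
```

Anything outside that list would still stop the run, in line with the stricter DDL error handling described earlier.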
The TTL-index case is the one where the version difference actually shows its face. PCSM clones a TTL index with its expiry set so far in the future that the target will not start deleting documents while the initial sync is still running, then restores the real expiry once the sync is done.
A 7.0 target had no problems when restoring the real expiry. The 8.0 target we tested rejected that operation as an index-options conflict. Same replayed sequence, same options, accepted by one major and refused by the next. PCSM handles it structurally: when the restore is refused, it drops the index and recreates it from the specification it already has. That happens once per affected TTL index, at the end of the sync.
The result
PCSM 0.9.0 covers every pair in the supported matrix, on both replica-set and sharded topologies: 6.0, 7.0, and 8.0 to themselves, and the lower-to-higher pairs 6.0-to-7.0, 6.0-to-8.0, and 7.0-to-8.0. This is what our CI tests cover.
The contract has edges, and you should know them before you plan a migration around it:
The compatibility check is major-only. It does not enforce a minimum target patch, and there is no allowlist of blessed version pairs beyond what CI exercises.
The feature compatibility version (FCV) is set separately from the server binary, so a recently upgraded cluster can still be running an older FCV. Before you start, confirm the target FCV is at least as high as the source FCV. That way, you ensure the newer target accepts everything the older source sends. The FCV check also runs once, at startup. A cluster that changes underneath a long-lived server will not trigger a fresh check, as PCSM doesn’t reevaluate the FCV when you start or resume a run.
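Because the FCV comparison is left to you, a small pre-flight check can help. The sketch below only compares two FCV strings; fetching them is shown as a comment using the standard getParameter admin command. The function name and the SOURCE_URI placeholder are assumptions.

```shell
# Fetch the FCV of each cluster first, for example (URI is a placeholder):
#   mongosh "$SOURCE_URI" --quiet --eval \
#     'db.adminCommand({getParameter: 1, featureCompatibilityVersion: 1}).featureCompatibilityVersion.version'

# Hypothetical helper: succeed when the target FCV is at least the source FCV.
# Arguments: target_fcv source_fcv (strings such as "8.0" and "6.0")
fcv_at_least() {
  local target_major="${1%%.*}" source_major="${2%%.*}"
  [ "$target_major" -ge "$source_major" ]
}
```

A call such as fcv_at_least 8.0 6.0 succeeds, while fcv_at_least 6.0 7.0 fails and signals that the target FCV should be raised before the sync starts.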
That is the shape of what shipped, and it is enough to plan a migration around safely.
Try it
Percona ClusterSync for MongoDB 0.9.0 ships cross-version replication. The documentation has the setup and the supported matrix, the 0.9.0 release notes cover what landed, and the source is on GitHub. If you put it through a migration and something behaves unexpectedly, Percona Forum is the place to report it.
1:29 pm
Percona Operator for PostgreSQL 3.0.0: Hard Fork, OLM Scoping, Major Upgrades
The Percona Operator for PostgreSQL 3.0.0 is here. This is the release that completes the hard fork of the operator from the Crunchy Data PostgreSQL Operator into a fully independent project, with a dedicated upstream.pgv2.percona.com API group for the inherited CRDs, an automatic CRD-rename rollout for existing 2.x installs on upgrade, and a public roadmap that drives what comes next.
This release ships three headline changes that matter for production teams. The CRD renaming under a Percona-owned API group, which finally lets the Crunchy operator and the Percona operator coexist in the same Kubernetes cluster. Proper OLM namespace scoping for OpenShift installations. And the move to the official Percona Distribution image for major PostgreSQL version upgrades, aligning the upgrade path with the same binaries that run in your clusters.
All three land in service of the same goal: making 3.0.0 a clean, durable operational baseline for the operator’s next several years as an independent project. Future releases will be shaped by what the community asks for and contributes back. The public roadmap is the durable signal of that commitment.
In this post, you will learn about:
The hard fork and how the CRD rename unlocks coexistence with the Crunchy operator
OLM namespace-scoping improvements for OpenShift installations
The move to the official Percona Distribution image for major PostgreSQL version upgrades
Other improvements and the 2.7.0 deprecation
Supported PostgreSQL versions and platforms
Hard fork: CRDs renamed under upstream.pgv2.percona.com
The Percona Operator for PostgreSQL has, until now, been a soft fork. Custom Resources inherited from Crunchy PGO used the upstream postgres-operator.crunchydata.com API group. The two operators shared CRDs, which meant you could only run one of them in a given Kubernetes cluster. Installing both would lead to overlapping CRDs, conflicting webhooks, and finalizer collisions, so platform teams had to pick a side before they had finished evaluating.
Starting with 3.0.0, every inherited CRD is renamed into a new dedicated upstream.pgv2.percona.com API group (K8SPG-1007). Percona’s own native CRDs (such as PerconaPGCluster under pgv2.percona.com/v2) are unchanged. The change applies to the inherited resources: PostgresCluster, PGUpgrade, PGAdmin, and the rest.
Coexistence: running both operators in the same cluster
The practical effect is that the Crunchy Data PostgreSQL Operator and the Percona Operator for PostgreSQL can now run on the same Kubernetes cluster at the same time, even in the same namespaces, with no CRD or webhook conflict. That unlocks a few real workflows: evaluating both operators on the same staging cluster without spinning up a second cluster, running existing Crunchy-managed clusters in some namespaces while bringing up new Percona-managed clusters in others, or testing a new database version on the Percona side while production stays on Crunchy until you are confident. The choice between the two operators stops being all-or-nothing.
Upgrade behavior for existing 2.x installs
For an existing install, the upgrade to 3.0.0 is mechanically simple. The operator creates the new-API-group CRDs alongside the legacy ones, then runs a one-time migration that updates dependent objects (Secrets, certificates, finalizer references) to point at the new CRD instances. Existing custom resources keep working through the legacy CRDs during the transition, and once migration completes, all reconciliation moves to the new group.
Day-to-day, your PerconaPGCluster Custom Resource (the one most teams interact with directly) is unchanged. The rename mostly matters in three situations: when a kubectl filter or a GitOps repository hard-codes the old API group, when a CI pipeline references the legacy CRD by name, and when you run the Percona and Crunchy operators side by side and need them not to collide.
Note: During the CRD migration on upgrade, the release notes report brief disruptions to pgBackRest operations (typically 1 to 2 minutes) while Kubernetes propagates certificate changes. Plan the upgrade during a maintenance window if backup continuity is critical, or pause scheduled backups during the upgrade.
OpenShift users install operators through the Operator Lifecycle Manager (OLM), and OLM enforces an OperatorGroup to scope which namespaces an operator watches. In practice, 2.x had quirks: teams that selected “Single namespace” mode would sometimes see the operator reconciling CRs in other namespaces, and teams in “All namespaces” mode would sometimes see incomplete coverage when CRs were created in newly-added namespaces.
3.0.0 fixes this by aligning the operator’s namespace watch list with the OperatorGroup that OLM applies. All-namespaces installs watch all namespaces. Single-namespace installs respect the targetNamespaces set on the OperatorGroup.
Why it matters in shared infrastructure
For an OpenShift platform team running shared infrastructure, this distinction matters operationally. A typical setup has the database operator installed once in a platform namespace (such as openshift-operators) but expected to serve PerconaPGCluster resources owned by individual application teams in their own namespaces. If the operator over-reaches into namespaces it should not watch, RBAC noise multiplies. If it under-reaches, application teams file tickets about clusters that never reconcile. The 3.0.0 alignment with OperatorGroup semantics removes both failure modes.
OperatorGroup wiring
For users installing through OLM via the OpenShift web console, the install flow is unchanged. The fix is in how the operator’s reconciler interprets the OLM-supplied namespace scope after install. For users who manage OperatorGroups directly, a single-namespace install looks like this:
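The original manifest is not reproduced in this feed. A minimal sketch of a single-namespace OperatorGroup follows, written as a shell function that prints the YAML. The OperatorGroup name and the function name are placeholders chosen for the example; apiVersion, kind, and targetNamespaces are the standard OLM fields.

```shell
# Hypothetical helper: print a single-namespace OperatorGroup manifest.
# Argument: the namespace the operator should be installed in and watch.
operatorgroup_manifest() {
  local ns="$1"
  cat <<EOF
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: percona-postgresql-og
  namespace: ${ns}
spec:
  targetNamespaces:
    - ${ns}
EOF
}
```

Piping the output to kubectl apply -f - (for example, operatorgroup_manifest pg-apps | kubectl apply -f -) creates the group in that namespace.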
The empty spec: {} (or an OperatorGroup with no targetNamespaces) means “watch all namespaces” by OLM convention. The 3.0.0 operator now honors that.
Note: After you upgrade an existing 2.x install to 3.0.0, the operator may begin reconciling PerconaPGCluster resources in namespaces it had previously ignored due to the prior scoping bug. Audit existing CRs across your cluster before upgrading, especially if you have stale test clusters in unintended namespaces. The release notes call this out explicitly.
Note for community vs certified bundle users: Community OLM bundles did not support cluster-wide (all-namespaces) mode in earlier versions; 3.0.0 adds it. Certified bundles already supported cluster-wide mode, but they used a separate stable-cw channel for it. With 3.0.0 the channels are unified, so users upgrading from a certified stable-cw install need to switch their subscription channel to stable to receive the upgrade.
Major PostgreSQL version upgrades now use the official Percona Distribution image
Major-version upgrades (for example, PostgreSQL 17 to 18) require running pg_upgrade, which needs binaries for both the source and target versions in the same environment. The operator has supported major-version upgrades since 2.x, but it shipped its own dedicated upgrade image to do so. That worked, but it meant a separate, operator-specific image lived in the upgrade path, distinct from the Percona Distribution for PostgreSQL build that runs in your clusters.
Switching to the official Percona Distribution image
In 3.0.0, the operator switches to using the official Percona Distribution for PostgreSQL image for major-version upgrades: percona/percona-distribution-postgresql-upgrade (current tag: 18.4-17.10-16.14-15.18-14.23-1, which encodes the bundled major versions). The benefit is alignment: the binaries that run pg_upgrade are the same binaries that ship in the corresponding percona-distribution-postgresql image you already run in production, built from the same source, signed the same way, and patched on the same schedule. The operator orchestrates the upgrade through the PerconaPGUpgrade Custom Resource that names the source and target versions, the upgrade image, and the target component images (PostgreSQL, pgBouncer, pgBackRest).
Running an upgrade through the PerconaPGUpgrade CR
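The original manifest is not reproduced in this feed. The sketch below prints one that can be saved as upgrade.yaml. The field names follow the PerconaPGUpgrade resource as documented for 2.x and are assumed to carry over to 3.0.0; the three target component images are left as placeholders because their tags are not given here, so take them from the release notes. The function name is hypothetical.

```shell
# Hypothetical helper: print a PerconaPGUpgrade manifest.
# Arguments: cluster_name from_major to_major
pg_upgrade_manifest() {
  local cluster="$1" from="$2" to="$3"
  cat <<EOF
apiVersion: pgv2.percona.com/v2
kind: PerconaPGUpgrade
metadata:
  name: ${cluster}-${from}-to-${to}
spec:
  postgresClusterName: ${cluster}
  image: percona/percona-distribution-postgresql-upgrade:18.4-17.10-16.14-15.18-14.23-1
  fromPostgresVersion: ${from}
  toPostgresVersion: ${to}
  # Placeholders: replace with the component images for the target version.
  toPostgresImage: <postgresql-image>
  toPgBouncerImage: <pgbouncer-image>
  toPgBackRestImage: <pgbackrest-image>
EOF
}
```

For a 17-to-18 upgrade of a cluster named cluster1, pg_upgrade_manifest cluster1 17 18 > upgrade.yaml writes the file; fill in the placeholders before applying it.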
Apply it with kubectl apply -f upgrade.yaml -n <namespace>. The operator reconciles the upgrade as a controlled, observable process: it brings the cluster down for the upgrade window, runs pg_upgrade from the bundled image, brings the cluster back up on the target version, and updates pgBouncer and pgBackRest images in the same step.
Operationally, this matters for teams running on PostgreSQL’s annual major-version cadence. Every September brings a new major release; staying on a supported version means executing one major upgrade per cluster per year. Pulling the upgrade image from the same percona-distribution-postgresql registry path as the runtime image means image-signature verification, mirror-to-private-registry rules, and CVE-scanning policies you already have in place apply to the upgrade flow without any per-image exception.
Note: The pgaudit extension is not upgraded automatically. After the operator completes the major version upgrade, drop and recreate pgaudit manually in each database that uses it: DROP EXTENSION pgaudit; followed by CREATE EXTENSION pgaudit;. The release notes call this out as a required step (K8SPG-1022). It is also worth scanning for collation-dependent indexes after the upgrade and refreshing collation metadata with ALTER DATABASE <name> REFRESH COLLATION VERSION; per the upstream PostgreSQL 18 release notes.
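If many databases use pgaudit, the drop-and-recreate step can be looped. The sketch below assumes psql can already reach the upgraded cluster as a superuser through the usual PG* environment variables (for instance from inside a database pod); the function name is hypothetical.

```shell
# Hypothetical helper: drop and recreate pgaudit in every database that has it.
# Assumes psql connects via PGHOST/PGUSER/PGPASSWORD or equivalent.
recreate_pgaudit() {
  local db
  # List connectable, non-template databases, one per line.
  psql -At -c "SELECT datname FROM pg_database WHERE datallowconn AND NOT datistemplate" </dev/null |
  while IFS= read -r db; do
    # Only touch databases where the extension is installed.
    if [ "$(psql -At -d "$db" -c "SELECT 1 FROM pg_extension WHERE extname = 'pgaudit'" </dev/null)" = "1" ]; then
      psql -d "$db" -c "DROP EXTENSION pgaudit;" -c "CREATE EXTENSION pgaudit;" </dev/null
    fi
  done
}
```

Databases without pgaudit are skipped, so the loop is safe to rerun if it is interrupted.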
Operational polish landed alongside the headline changes:
Go 1.26 update (K8SPG-1019): the operator binary is now built with Go 1.26, picking up performance optimizations, tooling improvements, and the security fixes that landed in the Go runtime since the previous release.
pgaudit upgrade documentation (K8SPG-1022): the major-version upgrade docs now include an explicit pgaudit drop-and-recreate procedure, surfacing the gotcha that previously caught users mid-upgrade.
The release also defaults the cluster-upgrade documentation to PostgreSQL 18 across all examples and tutorials.
Supported software and platforms
The Percona Operator for PostgreSQL 3.0.0 is developed and tested on:
Amazon Elastic Kubernetes Service (EKS) 1.33 to 1.35
OpenShift 4.18 to 4.21
Azure Kubernetes Service (AKS) 1.33 to 1.35
Minikube 1.38.1 (Kubernetes v1.35.1) for local development
Deprecation: 2.7.0 support dropped
Support for Custom Resource Definitions from operator version 2.7.0 has been removed. If you are still on 2.7.0, upgrade to 2.8.x or 2.9.x first, then upgrade to 3.0.0. The CRD migration described above only handles 2.8.x and 2.9.x to 3.0.0 transitions cleanly.
Conclusion
3.0.0 is the release where the Percona Operator for PostgreSQL becomes a fully independent project. The CRD rename removes the last upstream coupling that mattered operationally. The OLM scoping fix removes a long-standing OpenShift quirk. The official major-version upgrade image removes one of the more painful operational gaps in earlier versions.
Beyond the technical work, 3.0.0 is also where Percona’s commitment to community-driven development moves from intent to mechanism. The public roadmap is open. The issue tracker is open. The images are freely redistributable. Future releases will be shaped by what the community asks for, files, and contributes back. If there is a feature you want to see in 3.1.0 or 3.2.0, open an issue or a PR; that is where the work happens now.
12:02 pm
Migrate from Crunchy Data PostgreSQL Operator to Percona PostgreSQL Operator: Backup-Restore and PV Reuse
A Percona PostgreSQL operator pgBackRest restore is the simplest way to move off the Crunchy Data PostgreSQL Operator: take a full Crunchy backup, point the new Percona cluster’s dataSource at the existing pgBackRest archive, and the cluster bootstraps from it before its first start. This post covers that path, plus a second option, persistent-volume reuse, for cases where you want to skip the data copy entirely.
This is part 3 of a 3-part series on running PostgreSQL on Kubernetes with a fully open-source operator. Part 1 walked through the changing open-source landscape and announced the hard fork of the Crunchy Data PostgreSQL Operator into the fully independent Percona PostgreSQL Operator v3.0.0. Part 2 covered the standby cluster method, the safest migration path when downtime budget is tight.
This post covers two simpler paths:
Backup and restore, the fastest if you can tolerate a short application-downtime window
Persistent volume reuse, when you want to skip the data copy entirely and keep the existing PGDATA
If you are landing here cold, start with Part 1 for the why, then read Part 2 for the standby method. The rest of this post assumes you have already decided to migrate and want a tested playbook.
Tested with
Component | Version
Crunchy Data PostgreSQL Kubernetes Operator | v5.8.x (tested on v5.8.7)
Percona PostgreSQL Kubernetes Operator | v3.x.x (tested on v3.0.0)
PostgreSQL | 18 (must match between source and target)
Object storage | SeaweedFS (Apache-2.0), or any S3-compatible service. Required for the backup-and-restore method, optional for PV reuse.
Tools | kubectl, helm (v3)
Different versions may have slight differences in CR fields or behavior. Always consult the official documentation for the operator and PostgreSQL version you are running.
What this post does NOT cover
Application-side connection-string changes beyond updating to the new pgBouncer service
Schema-changing upgrades, major PostgreSQL version upgrades, or extension migrations
Crunchy enterprise-only features like TDE or pgBackRest custom encryption
Operating two operators against the same namespace before the hard fork. Use Percona PostgreSQL Operator v3.0.0 or higher.
1. Migration using backup and restore
This is often the fastest and simplest path, especially when you do not need a live standby. You take a full backup of the Crunchy source cluster, then create a Percona cluster that automatically restores from that backup before its first start.
Data written between the final backup and the application cutover is lost, so the migration window is the time between those two events. For a near-zero-downtime alternative, see part 2: standby cluster method.
Overview
Before you begin
Set the namespace once. Every command in this guide reads from this variable:
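The original snippet is not reproduced in this feed. A minimal version is a single export; the value postgres-migration is inferred from the service hostname used in the later examples, so adjust it if your namespace differs.

```shell
# Namespace used by every command in this guide (value inferred from the
# SeaweedFS service hostname in the examples; change it to match your setup).
export MIGRATION_NS=postgres-migration
```

Create the namespace first if it does not exist yet.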
Skip this step if you already have an S3-compatible repository (AWS S3, GCS, Ceph). Update the endpoint and credentials in the YAML examples accordingly.
SeaweedFS provides an S3-compatible object store that runs inside Kubernetes. Both operators will use it as the shared pgBackRest WAL archive.
TLS is required. pgBackRest always connects to S3 endpoints over HTTPS, even when repo1-s3-verify-tls: "n" is set (that flag skips certificate verification; it does not fall back to HTTP). The steps below generate a self-signed certificate and pass it to SeaweedFS via Helm values.
# Copy and edit the file first to set your credentials.
kubectl apply -n $MIGRATION_NS \
-f https://raw.githubusercontent.com/percona/percona-postgresql-operator/refs/heads/migration-from-crunchy-guide/e2e-tests/tests/migration-from-crunchy-backup-restore/examples/01-pgbackrest-secrets.yaml
Both contain the same SeaweedFS credentials (pgmigration / pgmigration123). For AWS S3, replace those with your IAM access key ID and secret access key.
Step 1. Start with your existing Crunchy Data cluster
If you already have a running Crunchy cluster, ensure its pgBackRest repo1 points at the shared bucket. The repo1-path value must match the path that will be referenced in the Percona dataSource.pgbackrest.global.repo1-path field.
Optional: deploy the Crunchy operator for testing. The Helm install below is shown only as a quick way to reproduce this blog post’s example. The migration steps in the rest of this post do not depend on how you deployed the source operator.
Step 2. Trigger a full backup (the migration cutover point)
This is the backup the Percona cluster will restore from. Stop accepting writes on the application side before triggering it to ensure a consistent snapshot, or accept that data written after this backup will be lost.
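The original commands are not reproduced in this feed. With Crunchy PGO v5, a one-off backup is normally triggered by setting spec.backups.pgbackrest.manual and then annotating the cluster; the sketch below follows that pattern. The helper name is hypothetical, and the cluster name crunchy-source used in the call is the one that appears in the troubleshooting commands later in this post.

```shell
# Hypothetical helper: request a full pgBackRest backup to repo1 on a
# Crunchy-managed cluster. Arguments: cluster_name namespace
trigger_full_backup() {
  local cluster="$1" ns="$2"
  # Declare what a manual backup means: a full backup into repo1.
  kubectl patch postgrescluster "$cluster" -n "$ns" --type=merge \
    -p '{"spec":{"backups":{"pgbackrest":{"manual":{"repoName":"repo1","options":["--type=full"]}}}}}'
  # Changing the annotation value is what starts the backup job.
  kubectl annotate postgrescluster "$cluster" -n "$ns" --overwrite \
    postgres-operator.crunchydata.com/pgbackrest-backup="$(date)"
}
```

Run it as trigger_full_backup crunchy-source "$MIGRATION_NS" and wait for the backup job to complete before creating the Percona cluster.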
The key section that bootstraps the cluster from the Crunchy backup:
dataSource:
  pgbackrest:
    stanza: db
    configuration:
    - secret:
        name: percona-pgbackrest-secret
    global:
      # Must match repo1-path in the Crunchy source cluster exactly.
      repo1-path: /crunchy-to-percona/repo1
      repo1-s3-uri-style: path
      repo1-s3-verify-tls: "n"
    repo:
      name: repo1
      s3:
        bucket: pg-migration
        endpoint: seaweedfs-all-in-one.postgres-migration.svc.cluster.local:8443
        region: us-east-1
The Percona cluster’s own backup repository must use a different path from the Crunchy source:
backups:
pgbackrest:
global:
repo1-path: /percona-restored/repo1 # different from Crunchy's path
As soon as the Custom Resource is applied, the cluster is bootstrapped from the storage referenced in dataSource and then started. Once the cluster becomes ready, you can immediately create new backups; in this case, repo1 from the backups section will be used as the target repository.
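The two path rules above (the dataSource path must equal the Crunchy path, and the Percona cluster's own backup path must differ from it) can be checked before the Custom Resource is applied. The following is a minimal sketch, assuming you have already read the three values out of your manifests; the function name check_repo_paths is hypothetical and is not part of either operator.

```python
def check_repo_paths(crunchy_path: str, datasource_path: str, backup_path: str) -> list:
    """Return a list of problems; an empty list means the paths look consistent."""
    problems = []
    # The restore source must point at exactly the path the Crunchy cluster wrote to.
    if datasource_path != crunchy_path:
        problems.append(
            f"dataSource repo1-path {datasource_path!r} does not match Crunchy {crunchy_path!r}"
        )
    # The new cluster's own repository must not reuse the Crunchy path.
    if backup_path == crunchy_path:
        problems.append(
            f"backups repo1-path {backup_path!r} must differ from the Crunchy path"
        )
    return problems


# Values taken from the YAML fragments in this guide.
print(check_repo_paths(
    "/crunchy-to-percona/repo1",
    "/crunchy-to-percona/repo1",
    "/percona-restored/repo1",
))  # prints []
```

The comparison is a plain string equality on purpose: the guide asks for an exact match, so the sketch does not normalize trailing slashes or case.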
This creates a clean recovery baseline on the Percona cluster’s own repository. All future PITR restores will use this backup, independent of the Crunchy archive.
Step 7. Reconnect your application
kubectl get service -n $MIGRATION_NS \
--selector postgres-operator.crunchydata.com/cluster=percona-restored,postgres-operator.crunchydata.com/role=pgbouncer
Step 8. Clean up the Crunchy cluster
Once the migration is verified and your application is connected to the new cluster:
Until Step 8, rollback is straightforward: switch the application connection string back to the Crunchy pgBouncer service. The Crunchy primary still holds the authoritative state because no writes were directed at the Percona cluster during the cutover (you stopped writes before Step 2). Any writes the application sent to the Percona cluster after cutover will not be present on Crunchy and would need to be replayed manually.
After Step 8, rollback requires restoring the Crunchy cluster from a backup, which is feasible because the original repo1 is still in the bucket.
Troubleshooting
archive.info missing. The repo1-path in dataSource.pgbackrest.global must match the Crunchy source cluster’s repo1-path exactly:
kubectl get postgrescluster crunchy-source -n $MIGRATION_NS \
-o jsonpath='{.spec.backups.pgbackrest.global.repo1-path}'
kubectl get perconapgcluster percona-restored -n $MIGRATION_NS \
-o jsonpath='{.spec.dataSource.pgbackrest.global.repo1-path}'
Restore job fails with TLS errors. pgBackRest requires HTTPS even with repo1-s3-verify-tls: "n". Verify SeaweedFS is reachable:
Data missing after restore. The restore captures data up to the latest backup. If post-backup data is critical, re-run the backup on the Crunchy cluster after quiescing writes, then delete and recreate the Percona cluster to restore from the newer backup.
2. Migration using existing persistent volumes
This method reuses the Crunchy primary’s PGDATA persistent volume directly. It avoids a full backup-restore cycle: you retain the Crunchy primary’s PV, delete the Crunchy cluster, then create a Percona cluster whose PVC binds to that same PV. PostgreSQL starts on the existing data directory without any restore step.
It is useful when:
you want to avoid copying data
your storage is very large
you must preserve the original data directory exactly
Both operators run in the same namespace. Crunchy PGO is uninstalled during the migration once the PV is retained.
Note (Crunchy): The Helm install for Crunchy PGO below is shown only as a quick way to reproduce this blog post’s example. If you are running Crunchy PGO in production, follow the official Crunchy Data documentation for installation. The migration steps in the rest of this post do not depend on how you deployed the source operator.
Note (Percona): The kubectl apply of the Percona operator below uses the default configuration of v3.0.0 from the operator repo for reproducibility of this guide. For production deployments, follow the official Percona Operator for PostgreSQL installation documentation to ensure the cluster configuration is properly sized and configured for your workload and traffic requirements.
Stop your application from writing to the database. This is the start of the downtime window. Then identify the primary pod, its PVC, and the backing PV:
PRIMARY=$(kubectl get pod -n $MIGRATION_NS \
--selector postgres-operator.crunchydata.com/cluster=crunchy-source,postgres-operator.crunchydata.com/role=master \
-o jsonpath='{.items[0].metadata.name}')
PVC_NAME=$(kubectl get pod -n $MIGRATION_NS "${PRIMARY}" \
-o jsonpath='{.spec.volumes[?(@.name=="postgres-data")].persistentVolumeClaim.claimName}')
PV_NAME=$(kubectl get pvc -n $MIGRATION_NS "${PVC_NAME}" \
-o jsonpath='{.spec.volumeName}')
echo "Primary pod: ${PRIMARY}"
echo "PVC: ${PVC_NAME}"
echo "PV: ${PV_NAME}"
Step 4. Configure the source cluster to retain PVs
If you want to delete the Crunchy source cluster but keep the persistent volumes, the PV reclaim policy must be set to Retain. For dynamically provisioned PersistentVolumes, the default reclaim policy is Delete, which removes the data once there are no more PersistentVolumeClaims associated with the PV.
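The reclaim policy lives in the PV's spec.persistentVolumeReclaimPolicy field, so switching it to Retain is a small patch. The following is a minimal sketch that only builds the kubectl patch command as an argument list and does not run it; the helper name retain_patch_command and the PV name in the example are hypothetical.

```python
import json


def retain_patch_command(pv_name: str) -> list:
    """Build (but do not run) a kubectl command that sets the PV reclaim policy to Retain."""
    patch = {"spec": {"persistentVolumeReclaimPolicy": "Retain"}}
    # PersistentVolumes are cluster-scoped, so no namespace flag is needed.
    return ["kubectl", "patch", "pv", pv_name, "-p", json.dumps(patch)]


# "pvc-1234" is a placeholder; use the PV_NAME value identified earlier.
print(" ".join(retain_patch_command("pvc-1234")))
```

Passing the resulting list to subprocess.run would execute the patch. Keeping construction and execution separate makes it easy to review the command first.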
The Percona Operator creates a PVC with that selector. The PVC binds to the labelled PV, and PostgreSQL starts on the existing PGDATA directory with no restore needed. pgBackRest uses a local PVC-backed repository (repo1.volume), so no S3 credentials or external storage are required, but you can use S3 storage as well.
Wait for the cluster to become ready and verify the data is intact:
Expected output: f. The cluster is the primary and accepts writes.
Step 6. Scale up replicas
The cluster started with a single replica to reuse the migrated PV. Once the primary is healthy, drop the PVC selector and scale out so the operator can provision fresh replica volumes from the storage class:
Removing the selector here is important: leaving it in place would cause the new replica PVCs to fail provisioning because no other PV carries the migration label.
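The matching rule behind this behavior is simple: a matchLabels selector is satisfied only when every key/value pair in it is present on the candidate PV. The following is a minimal sketch of that rule, not the Kubernetes implementation; the label key and value used in the example are hypothetical stand-ins for whatever migration label you applied.

```python
def selector_matches(match_labels: dict, pv_labels: dict) -> bool:
    """True when every selector label is present on the PV with the same value."""
    return all(pv_labels.get(key) == value for key, value in match_labels.items())


# Hypothetical migration label; substitute the one used in your manifests.
selector = {"migration-source": "crunchy"}

migrated_pv = {"migration-source": "crunchy"}
fresh_pv = {}  # a newly provisioned volume carries no migration label

print(selector_matches(selector, migrated_pv))  # prints True
print(selector_matches(selector, fresh_pv))     # prints False
print(selector_matches({}, fresh_pv))           # prints True: no selector, nothing to satisfy
```

The last call shows why dropping the selector unblocks the new replicas: with nothing to match, a fresh volume is acceptable.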
This creates the first backup on the Percona cluster’s local pgBackRest repository, establishing a baseline for future PITR restores.
Step 8. Reconnect your application
kubectl get service -n $MIGRATION_NS \
--selector postgres-operator.crunchydata.com/cluster=percona-migrated,postgres-operator.crunchydata.com/role=pgbouncer
Step 9. Cleanup
After the migration is verified, remove the migration label from the PV (Step 6 already removed the PVC selector that depended on it):
PV migration is the least rollback-friendly of the three methods. Once the Percona cluster has started writing to the PGDATA directory, the original Crunchy timeline is gone. If you need a way back, take a Crunchy-side pgBackRest backup before Step 4 and treat that backup as your rollback point. Recovery is then a fresh Crunchy cluster restored from that backup.
Troubleshooting
PVC stays in Pending state. The PVC selector did not match the labelled PV. Verify the label and PV phase:
kubectl get pv "${PV_NAME}" --show-labels
kubectl get pv "${PV_NAME}" -o jsonpath='{.status.phase}'
PostgreSQL fails to start (data directory errors). Check the database container logs:
If the Crunchy cluster was shut down uncleanly, there may be incomplete WAL. Patroni will attempt crash recovery automatically; check the logs for progress.
PV was deleted before setting Retain. If the PV was deleted along with the PVC (default Delete policy), the data is gone and PV migration is no longer possible. Use the backup-and-restore migration above, restoring from the most recent pgBackRest backup.
Conclusion
This post covered two more migration paths from the Crunchy Data PostgreSQL Operator to the fully open-source Percona PostgreSQL Operator. Combined with Part 2, the series gives you three production-tested options:
Standby cluster (part 2): near-zero downtime via streaming replication and pgBackRest standby
Backup and restore (this post): the simplest path, restoring directly from a Crunchy pgBackRest backup
Persistent volume reuse (this post): when you want to keep storage and skip the data copy
All three approaches are safe, predictable, and reversible, with the rollback caveats noted in each section. Because Percona’s operator, images, and tooling are 100 percent open source, you keep full control: you can always migrate back to the Crunchy operator, or out to another open-source operator (Zalando, StackGres, CloudNativePG) using the same patterns. That last journey is a topic for a future post.
This post covers basic deployment patterns and simplified configuration examples. If your environment uses custom images, Crunchy enterprise features, or otherwise needs tailored migration steps, contact the Percona team and we will help you plan and execute the move.
10:51 am
Running TidesDB as a MySQL 9.7 storage engine
tidesdb-mysql is an experimental build developed to verify how TidesDB, the LSM-tree key/value engine, can work with MySQL 9.7 as a storage engine. The current build is v0.2.5, and it’s an experiment, not a finished product. You can use it in your tests if you want to try TidesDB with MySQL and compare it with MariaDB.
Why we made it
There was already a way to use TidesDB from SQL. It’s TideSQL, which loads the engine into MariaDB as ha_tidesdb, and it works fine. But it doesn’t work with MySQL. So we wanted TidesDB to work with MySQL 9.7.
MariaDB and MySQL share a lot of history, but they are not the same. We couldn’t just recompile the MariaDB plugin against MySQL headers and call it done. The one thing that stayed put through all of it was TidesDB itself, doing exactly what it does anywhere else. Only the server wrapped around it changed. As a result, we have our own implementation, so if you’re on MySQL, you no longer have to switch to MariaDB to give TidesDB a try.
What it actually is
tidesdb-mysql is a loadable plugin, ha_tidesdb.so. The engine gets built on its own and loaded into the server at runtime, the same shape as the MariaDB version. It speaks the MySQL handler API and wires MySQL tables and indexes onto TidesDB column families. After it loads, TidesDB sits right next to InnoDB in SHOW ENGINES and you choose it per table.
Getting started
All you need is Docker. Pull the image and start it:
The plugin is baked into this image and loaded on boot, so there’s no INSTALL PLUGIN step to remember. Confirm the engine is live:
docker exec tidesdb mysql -uroot -psecret \
-e "SELECT engine, support FROM information_schema.engines WHERE engine='TidesDB';"
# TidesDB | YES
Now make a table and treat it like any other:
CREATE DATABASE shop;
USE shop;
CREATE TABLE products (
id INT PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(64) NOT NULL,
price DECIMAL(10,2) NOT NULL,
KEY idx_price (price)
) ENGINE=TIDESDB;
INSERT INTO products (name, price) VALUES ('Widget', 9.99), ('Gadget', 24.50);
SELECT * FROM products WHERE price < 20;
Transactions, secondary indexes, the usual SQL, it all behaves:
START TRANSACTION;
UPDATE products SET price = price + 1 WHERE name = 'Widget';
COMMIT;
Per-table TidesDB options ride along in MySQL’s ENGINE_ATTRIBUTE JSON field. MySQL doesn’t have MariaDB’s COMPRESSION=… grammar, so the options are identical but you write them differently:
CREATE TABLE events (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
msg TEXT
) ENGINE=TIDESDB
ENGINE_ATTRIBUTE='{"compression":"ZSTD","bloom_filter":true}';
Compression accepts NONE, SNAPPY, LZ4, ZSTD, or LZ4_FAST. Server-wide knobs live in system variables such as tidesdb_default_compression, tidesdb_block_cache_size, tidesdb_compaction_threads, and tidesdb_flush_threads. The full list is in docs/build-and-load.md.
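If you generate DDL from application code, it can help to validate the compression name before it reaches the server. The following is a minimal sketch that builds the ENGINE_ATTRIBUTE clause from the two keys shown above and the compression values listed in this section; the helper name engine_attribute is hypothetical, and upper-casing the name is a choice made by this sketch, not documented engine behavior.

```python
import json

# Compression values listed in this post.
ALLOWED_COMPRESSION = {"NONE", "SNAPPY", "LZ4", "ZSTD", "LZ4_FAST"}


def engine_attribute(compression: str, bloom_filter: bool) -> str:
    """Return an ENGINE_ATTRIBUTE clause, rejecting unknown compression names."""
    name = compression.upper()
    if name not in ALLOWED_COMPRESSION:
        raise ValueError(f"unsupported compression: {compression!r}")
    # Compact separators reproduce the JSON layout used in the examples above.
    payload = json.dumps(
        {"compression": name, "bloom_filter": bloom_filter},
        separators=(",", ":"),
    )
    return f"ENGINE_ATTRIBUTE='{payload}'"


print(engine_attribute("zstd", True))
# prints ENGINE_ATTRIBUTE='{"compression":"ZSTD","bloom_filter":true}'
```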
Prove the crash recovery
Write a handful of rows, kill the server with no clean shutdown, bring it back, and count what’s left:
# 1. Write rows inside a transaction and COMMIT.
docker exec -i tidesdb mysql -uroot -psecret <<'SQL'
CREATE DATABASE IF NOT EXISTS t;
CREATE TABLE IF NOT EXISTS t.kv (k INT PRIMARY KEY, v VARCHAR(32)) ENGINE=TIDESDB;
BEGIN;
INSERT INTO t.kv VALUES (1,'a'),(2,'b'),(3,'c'),(4,'d'),(5,'e');
COMMIT;
SELECT COUNT(*) AS before_crash FROM t.kv; -- 5
SQL
# 2. Hard-kill the server (no graceful shutdown) and restart it.
docker kill -s KILL tidesdb
docker start tidesdb
until docker exec tidesdb mysql -uroot -psecret -e 'SELECT 1' >/dev/null 2>&1; do sleep 2; done
# 3. The committed rows are still there.
docker exec tidesdb mysql -uroot -psecret \
-e "SELECT COUNT(*) AS after_crash FROM t.kv;" # expect 5
after_crash should come back equal to before_crash.
A few more things to try
Compression is the one people ask about first, so here’s a table that leans on it. We generate a couple thousand rows of repetitive text, which is exactly the shape ZSTD likes:
CREATE TABLE logs (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
level VARCHAR(8) NOT NULL,
body TEXT,
KEY idx_level (level)
) ENGINE=TIDESDB
ENGINE_ATTRIBUTE='{"compression":"ZSTD","bloom_filter":true}';
INSERT INTO logs (level, body)
SELECT IF(RAND() < 0.2, 'warn', 'info'),
REPEAT('the quick brown fox jumps over the lazy dog ', 40)
FROM information_schema.columns
LIMIT 2000;
SELECT level, COUNT(*) AS row_count FROM logs GROUP BY level;
SELECT id, LEFT(body, 30) AS preview FROM logs WHERE id = 1000;
The rows go in compressed and come back out as the original text, so queries don’t change at all. If you want to confirm the option actually landed on the table rather than being silently dropped, ask the server what it stored:
SHOW CREATE TABLE logs\G
-- ENGINE=TIDESDB ... ENGINE_ATTRIBUTE='{"compression":"ZSTD","bloom_filter":true}'
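To get a feel for why the repetitive log bodies are a friendly case for a compressor, you can run the same text through one outside the database. The following is a minimal sketch that uses zlib from the Python standard library as a stand-in; it is not ZSTD and not TidesDB's code path, so the exact sizes will differ from what the engine stores.

```python
import zlib

# Same repetitive text the INSERT ... SELECT above generates for each row.
row_body = ("the quick brown fox jumps over the lazy dog " * 40).encode("utf-8")

compressed = zlib.compress(row_body)
restored = zlib.decompress(compressed)

print("raw bytes:       ", len(row_body))
print("compressed bytes:", len(compressed))
print("round trip ok:   ", restored == row_body)
```

The round trip is lossless, which is the same property that lets queries read compressed rows back unchanged.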
The bloom filter from that same attribute is what keeps point lookups cheap once the data has compacted down into several on-disk files:
SELECT id, level FROM logs WHERE id = 1500;
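For readers who have not met the data structure before, the following is a minimal sketch of a Bloom filter. It is a generic textbook version, not TidesDB's implementation: the bit-array size, the number of hashes, and the use of SHA-256 are all assumptions chosen for readability. The useful property is that a negative answer is definitive, so a lookup can skip any file whose filter says the key is absent.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key: str):
        # Derive several bit positions from one key by salting the hash input.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


bf = BloomFilter()
for row_id in range(1, 101):
    bf.add(str(row_id))

print(bf.might_contain("50"))             # prints True: added keys are always found
print(BloomFilter().might_contain("50"))  # prints False: an empty filter rules everything out
```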
A JSON column behaves the way you’d expect, including the ->> extraction operator:
CREATE TABLE kv (k VARCHAR(64) PRIMARY KEY, v JSON) ENGINE=TIDESDB;
INSERT INTO kv VALUES
('en', JSON_OBJECT('lang','English', 'msg','hello')),
('es', JSON_OBJECT('lang','Spanish', 'msg','hola')),
('fr', JSON_OBJECT('lang','French', 'msg','bonjour'));
SELECT k, v->>'$.lang' AS language, v->>'$.msg' AS greeting
FROM kv
ORDER BY k;
And the secondary index on products from earlier is a real index, not decoration. A range query uses it, and EXPLAIN will show idx_price in the key column:
SELECT name, price FROM products WHERE price BETWEEN 5 AND 20 ORDER BY price;
EXPLAIN SELECT name, price FROM products WHERE price BETWEEN 5 AND 20;
What works, and what doesn’t yet
Quite a bit works. The common column types are all there, primary keys single and composite, AUTO_INCREMENT, secondary indexes with index-condition pushdown, COMMIT/ROLLBACK, REPLACE and INSERT … ON DUPLICATE KEY UPDATE, online add/drop index, instant add column, full-text search, spatial indexes, per-row TTL, per-table compression and bloom filters, at-rest encryption, and mixed-engine transactions where a TidesDB table and an InnoDB table share one BEGIN … COMMIT. The functional test suite, which we lifted from TideSQL and then extended, passes 58 of 58 executed tests.
A few things you should know about before you lean on it:
Native partitioning and the MySQL 9 vector column type aren’t implemented. Those two test cases are skipped deliberately.
Atomic, crash-safe DDL (the data-dictionary integration) is wired up but we haven’t driven it end-to-end yet. Your data writes are crash-safe; schema changes during a crash are next on the list.
Replication, foreign keys, and nested savepoints aren’t in scope at the moment.
Treat v0.2.5 as a serious experiment. It’s solid enough that committed data rides through a crash, and it’s not something we’d point production traffic at yet.
Try it, then tell us
docker pull perconalab/tidesdb-mysql:0.2.5
That’s the whole setup. Spin up a table with ENGINE=TIDESDB, run the crash demo, and point your own SQL at it. The source, the build scripts, and the engine patches all live in the tidesdb-mysql repository, and the durability fixes are written up in KNOWN-ISSUES.md. This is a tool made by users for users, so if you give it a spin, we’d genuinely like to hear what held up and what fell over.
7:49 am
Migrate from Crunchy Data PostgreSQL Operator to Percona PostgreSQL Operator: Standby Cluster Method
A Crunchy to Percona PostgreSQL migration is more straightforward than most cross-operator moves on Kubernetes, because the Percona PostgreSQL Operator is a hard fork of the Crunchy Data PostgreSQL Operator. Same Patroni HA, same pgBackRest backups, same overall CRD shape. This post walks through the safest of the three migration paths: a standby cluster method with near-zero downtime.
This is part 2 of a 3-part series on running PostgreSQL on Kubernetes with a fully open-source operator. Part 1 walked through the changing open-source landscape and announced the hard fork of the Crunchy Data PostgreSQL Operator into the fully independent Percona PostgreSQL Operator v3.0.0.
This post is the first practical playbook of the series. It covers the standby cluster method, the safest migration path when the downtime budget is tight. Part 3 will cover two simpler paths: backup-and-restore and persistent-volume reuse.
If you are landing here without context on why you might want to migrate at all, start with part 1. The rest of this post assumes you have already decided to move and want a tested playbook.
Migration approach in one paragraph
The Percona PostgreSQL Kubernetes Operator is a hard fork of the Crunchy Data PostgreSQL Kubernetes Operator, which simplifies the migration paths considerably: the same underlying tools (Patroni, pgBackRest, PgBouncer) and the same overall design are used in both operators. All three migration paths in this series are reversible: because Percona’s operator is fully open source and remains compatible with the same backup format, the move back to Crunchy is also possible if your team decides to make it.
A note on the storage layer
All examples in this guide use an in-cluster SeaweedFS instance as the pgBackRest S3 repository. SeaweedFS is Apache-2.0 licensed, actively maintained, and a clean drop-in replacement for the role MinIO used to fill in this stack. Any other S3-compatible storage works just as well: AWS S3, Google Cloud Storage (via HMAC keys), Ceph RadosGW, Cloudflare R2, and so on. For non-SeaweedFS endpoints, remove repo1-s3-uri-style: path and repo1-s3-verify-tls: "n" from the pgBackRest configuration and replace the endpoint with your provider’s URL.
What this series does NOT cover
To keep scope honest:
Application-side connection-string changes beyond updating to the new pgBouncer service. If your app uses connection-pool tuning, custom auth, or a service mesh, that work stays with you.
Schema-changing upgrades, major PostgreSQL version upgrades, or extension migrations. The PostgreSQL major version must match between the source and the target.
Crunchy enterprise-only features like TDE, Crunchy Postgres for Kubernetes-specific operators, or pgBackRest custom encryption. If your environment uses these, contact the Percona team for a tailored plan.
Operating two operators against the same namespace before the PGO hard fork. Use Percona PostgreSQL Operator v3.0.0 or higher.
SeaweedFS (Apache-2.0), or any other S3-compatible service accessible from all cluster pods
Tools
kubectl, helm (v3), yq
Different versions may differ slightly in CR fields or behavior. Always consult the official documentation for the operator and PostgreSQL version you are running.
Migration using a standby cluster
This is the safest method when the downtime budget is tight. The Percona cluster is brought up as a standby of the Crunchy primary, catches up via pgBackRest plus streaming replication, and is promoted at cutover. The only downtime is the cutover step itself.
You can wire the standby in two ways, and combining both gives you maximum safety:
pgBackRest repo-based standby seeds the standby from the latest base backup and replays archived WAL
Streaming replication keeps the standby in sync with the live primary
Overview
Before you begin
Set the target namespace once. Every command in this guide reads from this variable, so you can change it in a single place:
Skip this step if you already have an S3-compatible repository (AWS S3, GCS, Ceph). Update the endpoint and credentials in the YAML examples accordingly.
SeaweedFS provides an S3-compatible object store that runs inside Kubernetes. Both operators will use it as the shared pgBackRest WAL archive.
TLS is required. pgBackRest always connects to S3 endpoints over HTTPS, even when repo1-s3-verify-tls: "n" is set (that flag only skips certificate verification; it does not fall back to HTTP). The steps below generate a self-signed certificate and pass it to SeaweedFS via Helm values.
The Helm values file in the repo creates the pg-migration bucket on first start, so no separate aws s3 mb step is needed.
Step 0. Create pgBackRest secrets
Both operators need credentials to read and write the shared SeaweedFS bucket. Apply the secrets from examples/01-pgbackrest-secret.yaml after filling in your access key and secret key:
# Copy and edit the file first to set your credentials.
kubectl apply -n $MIGRATION_NS \
-f https://raw.githubusercontent.com/percona/percona-postgresql-operator/refs/heads/migration-from-crunchy-guide/e2e-tests/tests/migration-from-crunchy-standby/examples/01-pgbackrest-secret.yaml
Both secrets contain the same SeaweedFS credentials (pgmigration / pgmigration123). For AWS S3, replace those with your IAM access key ID and secret access key.
Step 1. Start with your existing Crunchy Data cluster
If you already have a running Crunchy cluster, ensure its pgBackRest repo1 points at the shared bucket and path. The repo1-path value must be identical in both cluster specs. Mismatched paths will prevent the Percona standby from finding the WAL archive.
The Helm install below is shown only as a quick way to reproduce this blog post’s example. The migration steps in the rest of this post do not depend on how you deployed the source operator.
Optional: deploy a Crunchy operator to test the migration end to end:
Take a full backup before creating the Percona standby. This gives the standby a recent base to restore from, so it only needs to replay a small amount of WAL to catch up. This matches the realistic production migration pattern.
If the Percona cluster is in a different namespace from the Crunchy cluster, copy the Crunchy TLS secrets to the Percona namespace. These allow mutual TLS authentication during streaming replication:
# CRUNCHY_NS is the namespace of the Crunchy source cluster; set it before running.
for secret in crunchy-source-cluster-cert crunchy-source-replication-cert; do
  kubectl get secret "${secret}" -n "${CRUNCHY_NS}" -o json | \
    yq '{"apiVersion": .apiVersion, "kind": .kind, "data": .data,
         "metadata": {"name": .metadata.name}, "type": .type}' -o yaml | \
    kubectl -n "${MIGRATION_NS}" apply -f -
done
If both clusters are in the same namespace, skip this step. The secrets are already accessible.
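The yq filter in the loop above exists to drop namespace-bound metadata (namespace, uid, resourceVersion, owner references) so the secret can be re-created elsewhere. The following is a minimal sketch of the same projection in Python, assuming the secret has already been loaded as a dictionary, for example from kubectl get secret -o json. The function name portable_secret and the sample values are hypothetical.

```python
def portable_secret(secret: dict) -> dict:
    """Keep only the fields needed to re-create the secret in another namespace."""
    return {
        "apiVersion": secret["apiVersion"],
        "kind": secret["kind"],
        "data": secret["data"],
        # Only the name survives; namespace, uid, and owner references are dropped.
        "metadata": {"name": secret["metadata"]["name"]},
        "type": secret["type"],
    }


# Hypothetical input resembling `kubectl get secret ... -o json` output.
source = {
    "apiVersion": "v1",
    "kind": "Secret",
    "type": "Opaque",
    "data": {"tls.crt": "PGJhc2U2ND4="},
    "metadata": {
        "name": "crunchy-source-cluster-cert",
        "namespace": "crunchy-ns",
        "uid": "0000-1111",
        "resourceVersion": "42",
    },
}

print(portable_secret(source)["metadata"])
# prints {'name': 'crunchy-source-cluster-cert'}
```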
Step 4. Deploy the Percona PG Operator
The Crunchy PGO operator can stay in the same or a different namespace.
Step 5. Create the Percona cluster in standby mode
Note: The kubectl apply below pulls the CR manifest from the migration-from-crunchy-guide branch of the operator repo, which is the source for this guide’s examples. For production deployments, follow the official Percona Operator for PostgreSQL installation documentation and pin to a released version tag rather than a feature branch.
The key settings that wire the Percona cluster to the Crunchy source:
standby:
enabled: true
repoName: repo1 # restore initial base backup from this repo
host: crunchy-source-ha.postgres-migration.svc.cluster.local
port: 5432
secrets:
customTLSSecret:
name: crunchy-source-cluster-cert # Crunchy CA for mutual TLS
customReplicationTLSSecret:
name: crunchy-source-replication-cert # cert for _crunchyreplication user
The Percona operator will:
Restore the base backup from the SeaweedFS bucket.
Replay WAL from SeaweedFS until it catches up with the live Crunchy cluster.
Switch to streaming replication from crunchy-source-ha.
Expected output: t (in recovery) and a non-null LSN.
Step 6. Verify replication lag before cutover
Query the Crunchy primary to confirm the Percona standby has caught up:
CRUNCHY_PRIMARY=$(kubectl get pod \
-l postgres-operator.crunchydata.com/cluster=crunchy-source,postgres-operator.crunchydata.com/role=master \
-n $MIGRATION_NS \
-o jsonpath='{.items[0].metadata.name}')
kubectl -n $MIGRATION_NS exec "${CRUNCHY_PRIMARY}" -c database -- \
psql -c "
SELECT
client_addr,
state,
pg_wal_lsn_diff(sent_lsn, replay_lsn) AS byte_lag,
write_lag,
flush_lag,
replay_lag
FROM pg_stat_replication;
"
Proceed to the next step only when write_lag and replay_lag are NULL or under a few seconds.
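If you script the cutover, the gate above can be written down explicitly. The following is a minimal sketch under two assumptions: "a few seconds" is interpreted as a 3-second threshold, and the lag columns arrive as None or datetime.timedelta values (how you fetch them is up to your client). The helper names are hypothetical. The LSN arithmetic mirrors what pg_wal_lsn_diff computes: an LSN written as HI/LO is a 64-bit position whose high and low 32 bits are given in hexadecimal.

```python
from datetime import timedelta
from typing import Optional


def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '0/3000060' to its 64-bit integer position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def byte_lag(sent_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL sent to the standby but not yet replayed there."""
    return lsn_to_int(sent_lsn) - lsn_to_int(replay_lsn)


def ready_for_cutover(
    write_lag: Optional[timedelta],
    replay_lag: Optional[timedelta],
    max_lag: timedelta = timedelta(seconds=3),  # assumed meaning of "a few seconds"
) -> bool:
    """True when both lags are NULL (None) or at most max_lag."""
    return all(lag is None or lag <= max_lag for lag in (write_lag, replay_lag))


print(byte_lag("0/3000060", "0/3000000"))                               # prints 96
print(ready_for_cutover(None, None))                                    # prints True
print(ready_for_cutover(timedelta(seconds=1), timedelta(seconds=10)))   # prints False
```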
Step 7. Cutover the Crunchy cluster
This is the only step that causes downtime. Stop accepting writes on the application side, then patch the Crunchy cluster into standby mode. Patroni steps down and archives the final WAL.
This creates a clean recovery point on the new timeline. All future PITR restores will use this backup as their starting point, independent of the old Crunchy WAL archive.
Reconnecting your application
Update your application’s connection string to point at the Percona cluster’s pgBouncer service:
kubectl get service -n $MIGRATION_NS \
-l postgres-operator.crunchydata.com/cluster=percona-standby,postgres-operator.crunchydata.com/role=pgbouncer
This migration path works almost entirely out of the box. For users coming from the Crunchy Data PostgreSQL Operator, this method feels familiar because it leverages the same standby/replica mechanisms used for HA and disaster recovery. The key difference is that you can now use this familiar mechanism to migrate safely to the Percona PostgreSQL Operator, a fully open-source alternative running on a fully open-source storage layer.
Rollback
The standby method is the most rollback-friendly of the three. Until you take the post-migration backup, the Crunchy cluster still holds the original timeline. To roll back:
Stop writes on the Percona side and patch the Percona cluster back into standby mode (spec.standby.enabled: true).
Patch the Crunchy cluster out of standby mode and let Patroni promote it.
Verify with pg_is_in_recovery() on both sides.
Switch the application connection string back to the Crunchy pgBouncer service.
After Step 11 (post-migration backup), the timelines have diverged. From that point, the rollback story is the same as a fresh restore, and you should treat the Crunchy cluster as a historical reference, not a live target.
Troubleshooting
Percona standby not connecting to the Crunchy primary. Verify the crunchy-source-ha service resolves from within the Percona pod:
Replication authentication errors. The Percona standby authenticates as the _crunchyreplication PostgreSQL user using the certificate in crunchy-source-replication-cert. Verify the secret exists and matches what the Crunchy operator generated:
kubectl get secret crunchy-source-replication-cert -n $MIGRATION_NS
pgBackRest restore fails. Confirm both secrets contain identical credentials and that repo1-path is the same in both cluster specs (/crunchy-to-percona/repo1 in this guide). Mismatched paths cause an archive.info missing error. Verify the bucket is reachable:
Timeline history file (00000002.history) missing after promotion. This is a known issue with Crunchy PGO’s async archive mode. After promotion, push the history file synchronously:
This was the safest migration path. Part 3 will cover two simpler options:
Backup and restore. The simplest path. You take a Crunchy pgBackRest backup and the Percona cluster bootstraps from it. Cutover is the time between the final backup and pointing the application at the new cluster.
Persistent volume reuse. For when you want to skip the data copy entirely. The Percona cluster takes over the existing PGDATA volume, no restore step required.
Pick the method that fits your downtime budget, data size, and storage layout.
This post covers basic deployment patterns and simplified configuration examples. If your environment is more complex, uses custom images, includes Crunchy enterprise features like TDE, or otherwise needs tailored migration steps, contact the Percona team and we will help you plan and execute the move.
7:33 am
MySQL 9.7.0 PGO Benchmark Analysis
Overview
Servers Tested:
MySQL 9.7.0 (PGO-enabled build released by Oracle)
MySQL 9.7.0 Non-PGO (built without Profile-Guided Optimization — see BUILD.md)
MySQL 9.7.0 with Profile-Guided Optimization (PGO) demonstrates measurable performance improvements over the non-PGO build:
Overall Performance Summary:
Average improvement: 6.5% across all configurations
Peak improvement: 14.3% (Tier 32G, 1 thread), gradually tapering to 10.3% at 512 threads as concurrency increases
Performance gains range from 0.5% to 14.3% in most scenarios
Minor regression (-3.1% at Tier 12G, 128 threads)
Performance by Buffer Pool Size:
Tier 2G (2GB buffer pool): Average improvement of 3.0%
– Best gains at 4 threads (5.5% improvement)
– Gains range from 0.5% to 5.5% across all thread counts
– Modest improvements with no regressions
Tier 12G (12GB buffer pool): Average improvement of 4.1%
– Best gains at 4 threads (8.6% improvement)
– Strong gains at low concurrency (1-4 threads: 7.3%-8.6%)
– Minor regression at 128 threads (-3.1%), neutral at 512 threads (-0.0%)
Tier 32G (32GB buffer pool): Average improvement of 12.2%
– Consistently strong gains across all thread counts (10.3% to 14.3%)
– Peak performance at lowest concurrency (1 thread: 14.3%)
– Maintains 11-12% improvement even at highest concurrency (128-512 threads)
Key Observations:
PGO provides the most significant benefits with larger buffer pools (32GB tier shows 12.2% average improvement)
Largest buffer pool configuration benefits from PGO across all concurrency levels with no regressions
Low to moderate concurrency (1-32 threads) shows best PGO gains across all tiers
Smaller buffer pools (2GB, 12GB) show more modest improvements and occasional regressions at very high thread counts
The performance improvements demonstrate PGO’s effectiveness in optimizing hot code paths, particularly when memory resources are abundant
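The post does not spell out how the percentages were derived. The following is a minimal sketch, assuming each figure is the relative change in throughput of the PGO build over the non-PGO build at the same tier and thread count. The transactions-per-second values are hypothetical numbers chosen for illustration, not measurements from this benchmark.

```python
def pct_improvement(baseline_tps: float, pgo_tps: float) -> float:
    """Relative throughput change of the PGO build versus the non-PGO baseline, in percent."""
    return (pgo_tps - baseline_tps) / baseline_tps * 100.0


# Hypothetical (baseline, PGO) throughput pairs; not data from the benchmark.
samples = {
    "gain": (1000.0, 1143.0),
    "regression": (1000.0, 969.0),
}

for label, (baseline, pgo) in samples.items():
    print(f"{label}: {pct_improvement(baseline, pgo):+.1f}%")
```

A negative result means the PGO build was slower in that configuration, which is how a regression would show up under this assumption.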
InnoDB Metrics Analysis
Deep analysis of InnoDB metrics reveals the source of PGO’s performance improvements:
Root Cause: CPU-Level Optimizations
PGO improvements are NOT from I/O optimization, caching, or lock reduction
Buffer pool hit ratios remain virtually identical between PGO and non-PGO builds
Lock contention is minimal in both builds
All I/O metrics scale proportionally with increased throughput
What PGO Actually Optimizes:
✓ Better instruction cache utilization
✓ Improved branch prediction in hot code paths
✓ Optimized function inlining
✓ More efficient CPU instruction ordering
The metrics confirm that PGO’s 6.5% average improvement comes entirely from making the CPU more efficient at executing MySQL’s hot code paths, allowing it to process more transactions per second with the same hardware resources.
What is PGO?
Profile-Guided Optimization (PGO) is a compiler optimization technique that uses runtime profiling data to guide code optimization. The compiler first instruments the code, collects execution profiles during typical workload runs, and then recompiles the code with optimizations targeted at the most frequently executed code paths.
Benefits of PGO:
Improved branch prediction
Better instruction cache utilization
Optimized function inlining
Reduced code bloat
Better register allocation
Benchmark Methodology
Workload
Tool: Sysbench OLTP Read/Write benchmark
Tables: 20 tables
Table Size: 5,000,000 rows per table
Thread Counts: 1, 4, 16, 32, 64, 128, 256, 512
Configuration
Warmup:
– Read-only: 180 seconds
– Read-write: 600 seconds
Measurement Duration: 900 seconds (15 minutes) per thread count
3:13 pm
CVE-2026-8053: “We don’t use time-series” is not a mitigation
TL;DR: A bug in MongoDB’s time-series collection code allows a user with the standard readWrite role to corrupt memory within the mongod process. Best case: your database crashes, and you spend the night writing a postmortem. Worst case: an attacker is running their code as mongod, with the same access to your data that the database process itself has: every collection on that node, every index, every secret stored in it. The patch for Percona Server for MongoDB 7.0 is already available; 8.0 will be available tomorrow, and 6.0 will be available early next week.
Every time a bug like this lands, the same conversation plays out in incident channels across the industry. Are we affected? We don’t even use time-series collections! Heads nod. Everyone moves on.
That’s the mistake.
CVE-2026-8053 is an out-of-bounds memory write in MongoDB’s time-series collection — specifically in the internal mapping between measurement field names and column indexes. Under the right input, the mapping drifts out of sync with the underlying buffer and mongod writes off the end of an allocation. From there, under the right conditions, you can execute arbitrary code as the database process.
Upstream tracking lives at SERVER-126021. CVSS v3.1 puts it at 8.8. CVSS v4.0 puts it at 8.7. The labels say “High.” How that “High” translates into your week depends on a couple of assumptions worth questioning.
Read literally, the prerequisite is “an authenticated user with database write privileges.” Read operationally, that bar is lower than most teams treat it as.
The mitigation you think you have doesn’t exist
Modern stacks have dozens of service accounts, with secrets scattered across config files, pipelines, and laptops you’ve long forgotten about. Others end up in log files on bad days. And every user with write access to your cluster sits one step away from the vulnerable code path. In a world like that, “the attacker would need credentials first” isn’t a speed bump — it’s a shrug.
So the real question was never authenticated vs. unauthenticated. It’s what authentication unlocks. Here, it unlocks Remote Code Execution (RCE), which is exactly what the CVSS score is trying to tell you — even if the industry’s reaction hasn’t quite caught up. Attackers don’t need your time-series collection to already exist – they just need someone’s credentials in the wrong hands, and there are more ways for that to happen than most teams want to admit.
I’m not raising this to be smug. I’m raising it because too many incident channels keep stalling on the wrong question. It isn’t: “Does our app use time-series?” It’s: “What can a user holding our readWrite role actually do this week?”
Until you patch, the answer is more than you think.
Percona Server for MongoDB 7.0.34-19: May 20, 2026
Percona Server for MongoDB 8.0.23-10: May 21, 2026
Percona Server for MongoDB 6.0.28-22: May 25, 2026
6.0 is on the End-Of-Life (EOL) track. The easy call would be to point at the lifecycle page, note that the upgrade conversation is overdue, and stop there. We’re shipping the fix anyway. Customers running 6.0 in production have real reasons they haven’t migrated yet — frozen application stacks, certification cycles, dependencies that don’t move on quarterly cadences — and none of those reasons are worth exploiting while a migration plan gets approved.
Percona is not building binary packages for the 5.x line. We’re being upfront about that — the calculus on extended support has a limit, and 5.x is past it for us. But the fix itself is already in our public release branch: release-5.0.33-26. If you have a hard requirement on 5.x and the time pressure to meet it, the source is available for building. Percona customers on 5.x can open a ticket, and we’ll work on the case individually.
What to do this week?
Patch! Specifically:
If you’re on 7.0, upgrade to 7.0.34-19 from May 20 onward.
If you’re on 8.0, upgrade to 8.0.23-10 from May 21 onward.
If you’re on 6.0, upgrade to 6.0.28-22 from May 25 onward.
If you’re on 5.0 and you can’t move, build from release-5.0.33-26. Customers — open a ticket and we’ll help.
As usual, you can get the patched packages through your package manager or from the Percona Software Downloads page.
If you’re running PSMDB on Kubernetes via the Percona Operator for MongoDB, edit the image tag in your PerconaServerMongoDB custom resource and let the operator roll the cluster. Don’t wait for the June operator release to do it for you. See details in our documentation on how to Upgrade Percona Server for MongoDB.
While you’re in there, audit your custom roles. Anything granting createCollection on a production database is, today, an RCE primitive in waiting. Decide whether the service accounts that hold it actually need it. Decide whether your application users need full readWrite or whether a narrower role would do the same job. Treat the answer as part of your security posture, not as a quarterly cleanup task you’ll get to.
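A role audit like this can be partly automated. The sketch below assumes you have already exported role documents in the shape returned by the rolesInfo command with showPrivileges enabled (role, db, privileges, and optionally inheritedPrivileges); the function name and the sample roles are hypothetical and only illustrate the check.

```python
def roles_granting(roles, action="createCollection"):
    """Return (role, role_db, resource_db, resource_collection) tuples
    for every privilege that includes the given action.

    `roles` is a list of role documents shaped like rolesInfo output.
    inheritedPrivileges is preferred when present because it also covers
    privileges the role receives from the roles it inherits.
    """
    hits = []
    for role in roles:
        privileges = role.get("inheritedPrivileges") or role.get("privileges", [])
        for privilege in privileges:
            if action in privilege.get("actions", []):
                resource = privilege.get("resource", {})
                hits.append((
                    role["role"],
                    role["db"],
                    resource.get("db", ""),
                    resource.get("collection", ""),
                ))
    return hits


# Hypothetical sample data for illustration only.
sample_roles = [
    {
        "role": "appWriter", "db": "shop",
        "privileges": [{
            "resource": {"db": "shop", "collection": ""},
            "actions": ["find", "insert", "createCollection"],
        }],
    },
    {
        "role": "reporter", "db": "shop",
        "privileges": [{
            "resource": {"db": "shop", "collection": "orders"},
            "actions": ["find"],
        }],
    },
]

if __name__ == "__main__":
    for hit in roles_granting(sample_roles):
        print(hit)
```

An empty collection name in the resource means the privilege applies to every collection in that database, which is the case worth looking at first.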
2:45 am
Manually Migrate Hash Slots in a Valkey/Redis Cluster
This article explains how to manually migrate hash slots in Valkey/Redis clusters to expand your deployment with minimal disruption to availability.
Note: Valkey 9.0 introduces the Atomic Slot Migration (ASM) feature, which significantly improves migration speed (up to 9.52 times faster) and reliability while reducing migration complexity. If you are running Valkey 9.0 or later, use ASM instead. You can read more about ASM on the Valkey community blog.
Refresher on hash slots
Valkey and Redis clusters partition their keyspace into 16,384 hash slots. Each key is assigned to a slot based on the CRC16 hash of its name (or hashtag), ensuring consistent and deterministic routing across cluster nodes:
CRC16(<key>) % 16384
So, for example, the command SET hello world will store the key hello in slot 866.
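The slot calculation is easy to reproduce outside the server. The following is a minimal sketch, assuming Python's binascii.crc_hqx with an initial value of 0, which is the CRC16-CCITT (XMODEM) variant the cluster specification describes; the helper names are illustrative.

```python
import binascii

SLOT_COUNT = 16384


def hash_tag(key: bytes) -> bytes:
    """Return the part of the key that is actually hashed.

    If the key contains a non-empty {...} section, only the text between
    the first '{' and the next '}' is hashed; otherwise the whole key is.
    """
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            return key[start + 1:end]
    return key


def key_slot(key: str) -> int:
    """Compute the cluster hash slot for a key name."""
    crc = binascii.crc_hqx(hash_tag(key.encode()), 0)
    return crc % SLOT_COUNT


if __name__ == "__main__":
    print(key_slot("hello"))
    # Keys that share a hashtag land in the same slot.
    print(key_slot("{123}1") == key_slot("123"))
```

Because of the hashtag rule, the keys 123 and {123}1 map to the same slot, which is why both appear under slot 5970 in the examples later in this article.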
Accessing keys in different slots
Because hash slots in Valkey/Redis can be located on different processes, a single command that accesses multiple slots would require the cluster to coordinate between nodes. This can impact cluster performance and data consistency, as nodes can fail while processing a command. So, to keep things simple and fast, Valkey/Redis rejects any single command whose keys hash to different slots.
You can read more about distributing data in Valkey/Redis cluster in my colleague Agustin’s blog post.
Why do you need to migrate hash slots?
As your dataset grows, your Valkey/Redis cluster may not have enough memory to hold all the data; or the resource utilization between nodes is not evenly distributed. To address these issues, you might need to add more nodes to the cluster, or move the hash slots around so that data doesn’t get evicted, and no nodes are under/overutilized.
How the slot migration process works
The slot migration process can be roughly divided into 3 steps:
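1. Update the hash slot’s state so that clients can know where to get the keys during the migration stage:
CLUSTER SETSLOT <slot> IMPORTING <source ID>
CLUSTER SETSLOT <slot> MIGRATING <target ID>
2. Migrate the keys in the slot from the source node to the target node:
CLUSTER GETKEYSINSLOT <slot> <count>
MIGRATE <target host> <target port> "" 0 <timeout> KEYS <key> [<key> ...]
3. Update the hash slot’s metadata to reflect the new ownership:
CLUSTER SETSLOT <slot> NODE <target ID>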
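To make the ordering explicit, the phases can be sketched as an ordered command list. This is an illustration rather than a migration tool: the function name, the tuple layout, and the batch and timeout defaults are assumptions, and the key-copy phase is shown once even though it repeats until the slot is empty.

```python
def migration_plan(slot, source_id, target_id, target_host, target_port,
                   batch=10, timeout_ms=5000):
    """Return the migration phases as (node to run on, command) tuples."""
    return [
        # Phase 1: open the slot on both sides, target first.
        ("target", f"CLUSTER SETSLOT {slot} IMPORTING {source_id}"),
        ("source", f"CLUSTER SETSLOT {slot} MIGRATING {target_id}"),
        # Phase 2: repeat these two until GETKEYSINSLOT returns no keys.
        ("source", f"CLUSTER GETKEYSINSLOT {slot} {batch}"),
        ("source", f'MIGRATE {target_host} {target_port} "" 0 {timeout_ms} KEYS <keys...>'),
        # Phase 3: record the new owner on both nodes.
        ("target", f"CLUSTER SETSLOT {slot} NODE {target_id}"),
        ("source", f"CLUSTER SETSLOT {slot} NODE {target_id}"),
    ]


if __name__ == "__main__":
    # Placeholder node IDs; real IDs come from CLUSTER MYID or CLUSTER NODES.
    for node, command in migration_plan(5970, "<source-id>", "<target-id>",
                                        "127.0.0.1", 30001):
        print(f"{node:>6}: {command}")
```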
What happens when the hash slot is accessed during migration
During hash slot migration, if a command accesses keys stored in the migrating slot, the instance will first check its local hash table. If the keys are not found locally, the client will receive an ASK redirection to the target node. Unlike MOVED (which also tells the client to retry requests at another node), an ASK redirection is meant to be temporary; it will not update the client’s slot cache and affects only a single request. So, when you re-run the same request, even if you are on the correct node, you will still get redirected:
127.0.0.1:30002> GET 123
-> Redirected to slot [5970] located at 127.0.0.1:30001
"123"
127.0.0.1:30001> GET 123
-> Redirected to slot [5970] located at 127.0.0.1:30002
-> Redirected to slot [5970] located at 127.0.0.1:30001
"123"
Commands executed during migration may fail if they access keys that are distributed across different nodes. In this case, the command hits the same limitation that produces a CROSSSLOT error: all keys involved in the operation must reside on the same instance.
If the target node is the one receiving commands, the flow is much simpler. It just checks whether you are following an ASK redirection (that is, whether the client sent ASKING first). If so, the command is processed normally. If not, the node responds with -MOVED and directs you to the original owner.
So, during the slot migration, your request will be redirected at most twice, first to the source node (since it is still registered as the owner of the hash slot), then to the target node if the keys are not found on the source:
# connect to the target node, and GET a non-existent key
valkey-cli -c -p 30001
127.0.0.1:30001> GET {123}1
-> Redirected to slot [5970] located at 127.0.0.1:30002
-> Redirected to slot [5970] located at 127.0.0.1:30001
(nil)
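The redirection flow during migration can be modelled in a few lines. This is a simplified simulation, not server code: it handles only single-key reads in one open slot, the Node class and function names are hypothetical, and the target's reply to a request without ASKING is modelled as a MOVED redirect back to the slot owner, which is how the cluster specification describes it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str
    keys: set = field(default_factory=set)
    migrating_to: Optional["Node"] = None    # set on the source (slot owner)
    importing_from: Optional["Node"] = None  # set on the target


def handle(node, key, asking=False):
    """Return (reply, next_node) for a read of a key in the open slot."""
    if node.migrating_to is not None:
        # Source node: serve the key if it is still here, otherwise ASK.
        if key in node.keys:
            return ("OK", None)
        return ("ASK", node.migrating_to)
    if node.importing_from is not None:
        # Target node: only serve requests that follow an ASK redirection.
        if asking:
            return ("OK", None)
        return ("MOVED", node.importing_from)
    raise ValueError("slot is not open on this node")


def route(entry, key):
    """Follow redirections like a cluster-aware client; return the hops."""
    hops, node, asking = [], entry, False
    while True:
        reply, next_node = handle(node, key, asking)
        hops.append((node.name, reply))
        if next_node is None:
            return hops
        asking = reply == "ASK"  # ASKING applies to the next request only
        node = next_node


source = Node("source", keys={"still-here"})
target = Node("target")
source.migrating_to = target
target.importing_from = source

if __name__ == "__main__":
    print(route(target, "missing-key"))
```

Starting at the target with a key that is no longer on the source produces exactly two redirections, matching the console output above.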
Why using valkey-cli --cluster rebalance won’t work for all cases
Depending on your application’s key pattern, the number of keys each slot holds can vary, and a slot might hold a disproportionate number of keys (the hot-slot problem). The --cluster rebalance command, however, only attempts to even out the number of slots each node has:
/* Calculate the slots balance for each node. It's the number of
 * slots the node should lose (if positive) or gain (if negative)
 * in order to be balanced. */
int threshold_reached = 0, total_balance = 0;
float threshold = config.cluster_manager_command.threshold;
i = 0;
listRewind(involved, &li);
while ((ln = listNext(&li)) != NULL) {
    clusterManagerNode *n = ln->value;
    weightedNodes[i++] = n;
    int expected = (int)(((float)CLUSTER_MANAGER_SLOTS / total_weight) * n->weight);
    n->balance = n->slots_count - expected;
    total_balance += n->balance;
    /* Compute the percentage of difference between the
     * expected number of slots and the real one, to see
     * if it's over the threshold specified by the user. */
    int over_threshold = 0;
    if (threshold > 0) {
        if (n->slots_count > 0) {
            float err_perc = fabs((100 - (100.0 * expected / n->slots_count)));
            if (err_perc > threshold) over_threshold = 1;
        } else if (expected > 1) {
            over_threshold = 1;
        }
    }
    if (over_threshold) threshold_reached = 1;
}
So, using --cluster rebalance could migrate all hot slots to the same node, further exacerbating the issue.
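The effect is easier to see with numbers. The sketch below re-implements only the balance arithmetic from the excerpt above in Python; the node names and key counts are made up, and everything else the rebalance command does is ignored.

```python
CLUSTER_MANAGER_SLOTS = 16384


def slot_balance(nodes):
    """Compute per-node balance the way the excerpt does.

    `nodes` maps a node name to {"slots": int, "weight": float, "keys": int}.
    A positive balance means the node should lose slots, a negative one
    means it should gain slots. The "keys" field is never consulted.
    """
    total_weight = sum(n["weight"] for n in nodes.values())
    balance = {}
    for name, n in nodes.items():
        expected = int(CLUSTER_MANAGER_SLOTS / total_weight * n["weight"])
        balance[name] = n["slots"] - expected
    return balance


# Hypothetical cluster: node A holds almost all of the keys.
cluster = {
    "A": {"slots": 5461, "weight": 1.0, "keys": 9_000_000},
    "B": {"slots": 5462, "weight": 1.0, "keys": 10},
    "C": {"slots": 5461, "weight": 1.0, "keys": 10},
}

if __name__ == "__main__":
    print(slot_balance(cluster))
```

Node A holds nearly all of the data, yet its balance is 0, so from the point of view of this calculation the cluster is already balanced.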
How to manually migrate a hash slot to a different node
1. Gather usage statistics on the slot
To find the list of big hash slots, we can use CLUSTER SLOT-STATS, which provides details for each slot (number of keys, CPU time, and network I/O).
Note: to display stats other than key-count, the config cluster-slot-stats-enabled needs to be set. The config can be modified during runtime, but remember to set it for all nodes in the cluster:
To find which node the hash slot belongs to, we can use the CLUSTER SLOTS command. In the example below, node 127.0.0.1:30001 holds slots 0-5460, 127.0.0.1:30003 holds slots 10923-16383:
After collecting statistics for each hash slot, we should have a clear understanding of the data size and usage patterns for each slot. Based on this information, we can decide which underutilized node a slot should be migrated to. This helps ensure that resource utilization is evenly balanced across all nodes in the cluster.
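Once the per-slot numbers are collected, choosing a target can be as simple as summing keys per node. This is a rough sketch under stated assumptions: it uses key counts only (CPU and network metrics are ignored), and the input dictionaries and function names are hypothetical rather than the output format of any command.

```python
def keys_per_node(slot_keys, slot_owner):
    """Sum key counts per owning node.

    slot_keys:  {slot number: key count}
    slot_owner: {slot number: node name}
    """
    totals = {}
    for slot, count in slot_keys.items():
        owner = slot_owner[slot]
        totals[owner] = totals.get(owner, 0) + count
    return totals


def pick_target(slot, slot_keys, slot_owner, nodes):
    """Pick the node with the fewest keys, excluding the slot's owner."""
    totals = {node: 0 for node in nodes}
    totals.update(keys_per_node(slot_keys, slot_owner))
    source = slot_owner[slot]
    candidates = {n: t for n, t in totals.items() if n != source}
    # Ties are broken by node name so the result is deterministic.
    return min(candidates, key=lambda n: (candidates[n], n))


# Made-up statistics for illustration.
slot_keys = {5970: 1000, 6000: 200, 100: 50, 12000: 300}
slot_owner = {5970: "B", 6000: "B", 100: "A", 12000: "C"}

if __name__ == "__main__":
    print(keys_per_node(slot_keys, slot_owner))
    print(pick_target(5970, slot_keys, slot_owner, ["A", "B", "C"]))
```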
2. Update the hash slot’s state
On the target node, execute CLUSTER SETSLOT <SLOT> IMPORTING <SOURCE-NODE-ID>
valkey-cli -p 30001 -c CLUSTER SETSLOT 5970 IMPORTING b32b8042b9280c6a5d266fcaf68c90f5167f8463
OK
On the source node, execute CLUSTER SETSLOT <SLOT> MIGRATING <TARGET-NODE-ID>
valkey-cli -p 30002 -c CLUSTER SETSLOT 5970 MIGRATING b1bb71c7d39d2c061ce3fc010b444cc20cbfb7b8
OK
When set correctly, commands that create new keys in the hash slot will be directed to the target node:
127.0.0.1:30002> SET 123 hello
-> Redirected to slot [5970] located at 127.0.0.1:30001
OK
Note: remember to double-check the node IDs used in the commands. Currently, Valkey/Redis does not check whether the node ID provided is the actual owner of the hash slot, so incorrect commands like the ones below still execute successfully:
# set the hash slot 5970 as IMPORTING from myself
valkey-cli -p 30001 -c CLUSTER SETSLOT 5970 IMPORTING "$(valkey-cli -p 30001 -c CLUSTER MYID)"
OK
# set the hash slot 5970 as MIGRATING to myself
valkey-cli -p 30002 -c CLUSTER SETSLOT 5970 MIGRATING "$(valkey-cli -p 30002 -c CLUSTER MYID)"
OK
So, when a new key is created (or a migrated key is updated), the request still goes to the original owner, instead of the target node:
127.0.0.1:30002> SET 123 hello
OK
And if you attempt to migrate a duplicated key, you will get the following error:
127.0.0.1:30002> MIGRATE 127.0.0.1 30001 123 0 5000
(error) ERR Target instance replied with error: BUSYKEY Target key name already exists.
Since valkey-cli --cluster check will still report the correct status, debugging this issue will be pretty confusing:
valkey-cli --cluster check 127.0.0.1:30001
127.0.0.1:30001 (ed38d440...) -> 1 keys | 5461 slots | 1 replicas.
127.0.0.1:30002 (654a8caa...) -> 1 keys | 5462 slots | 1 replicas.
127.0.0.1:30003 (e0ed692b...) -> 0 keys | 5461 slots | 1 replicas.
[OK] 2 keys in 3 primaries.
0.00 keys per slot on average.
>>> Performing Cluster Check (using node 127.0.0.1:30001)
M: ed38d44008b75d1b04f9670672e117d5764da51c 127.0.0.1:30001
slots:[0-5460] (5461 slots) master
1 additional replica(s)
S: 00fa8982389641fb2b73b416f08fafb18d9cc404 127.0.0.1:30006
slots: (0 slots) slave
replicates e0ed692bdf8f203322077dfeefb4d72f3ae121d2
S: fa4c2f539e74c18fe3aba7e8cdae7a01b91dd782 127.0.0.1:30004
slots: (0 slots) slave
replicates ed38d44008b75d1b04f9670672e117d5764da51c
M: 654a8caa0b5f53cd02698cf46ef9a102d0277ead 127.0.0.1:30002
slots:[5461-10922] (5462 slots) master
1 additional replica(s)
M: e0ed692bdf8f203322077dfeefb4d72f3ae121d2 127.0.0.1:30003
slots:[10923-16383] (5461 slots) master
1 additional replica(s)
S: 2063030dab7cce9dcc248031f53e6224b7c727c3 127.0.0.1:30005
slots: (0 slots) slave
replicates 654a8caa0b5f53cd02698cf46ef9a102d0277ead
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
[WARNING] Node 127.0.0.1:30001 has slots in importing state 5970.
[WARNING] Node 127.0.0.1:30004 has slots in importing state 5970.
[WARNING] Node 127.0.0.1:30002 has slots in migrating state 5970.
[WARNING] Node 127.0.0.1:30005 has slots in migrating state 5970.
[WARNING] The following slots are open: 5970.
>>> Check slots coverage...
[OK] All 16384 slots covered.
I have created a Pull Request to the Valkey project, disallowing the command CLUSTER SETSLOT MIGRATING/IMPORTING to point to itself, so future versions of Valkey should not encounter this issue.
3. Migrate keys in the slot
After updating the slots’ state, we need to perform the actual data migration in those slots. We need to get the list of keys in the migrating slot using CLUSTER GETKEYSINSLOT, then execute the MIGRATE command on the found keys.
The script below fetches and migrates the keys in DB 0 of slot 5970 from node 127.0.0.1:30002 to 127.0.0.1:30001 in batches of 10 keys, with a timeout of 5000 milliseconds for each MIGRATE command:
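#!/usr/bin/env bash
# Fetch up to 10 key names from slot 5970 on the source node.
keys="$(valkey-cli -p 30002 -c CLUSTER GETKEYSINSLOT 5970 10)"
while [[ -n "${keys}" ]]
do
    # $keys is intentionally unquoted so that each key becomes its own argument.
    # This assumes key names contain no whitespace or glob characters.
    valkey-cli -p 30002 -c MIGRATE 127.0.0.1 30001 "" 0 5000 KEYS $keys
    keys="$(valkey-cli -p 30002 -c CLUSTER GETKEYSINSLOT 5970 10)"
done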
OK
OK
OK
...
The migration is finished when CLUSTER GETKEYSINSLOT returns an empty array.
4. Update the hash slot’s metadata
After the key migration is completed, update the slot’s metadata on both the original and the new node using CLUSTER SETSLOT <SLOT> NODE <TARGET-NODE-ID> to reflect the new ownership:
valkey-cli -p 30001 -c CLUSTER SETSLOT 5970 NODE b1bb71c7d39d2c061ce3fc010b444cc20cbfb7b8
OK
valkey-cli -p 30002 -c CLUSTER SETSLOT 5970 NODE b1bb71c7d39d2c061ce3fc010b444cc20cbfb7b8
OK
It will take a little while for the cluster to agree on the new slot distribution, as the metadata is propagated via the gossip protocol.
5. Validate the cluster’s state
Validate the cluster’s hash slot distribution with valkey-cli --cluster check to ensure that all slots are covered:
We can see that the migration is successful, with all 16384 hash slots covered, and the slot distribution is reported correctly, with node 127.0.0.1:30001 having 1 more slot than the rest.
Conclusion
In this article, we learned how to perform manual hash slot migration to balance/expand a Valkey/Redis cluster. And while manual hash slot migration is a necessary process to resolve hot-slot issues that automated rebalancing cannot, users on Valkey 9.0 and later should leverage the faster, more reliable Atomic Slot Migration (ASM) feature.
5:12 pm
Not All Open Source Is Equal: Choosing a PostgreSQL Operator for Kubernetes in 2026
Choosing an open source PostgreSQL operator for Kubernetes used to be a question about features and community size. In 2026, it has become a question about licensing posture, image distribution, and whether the project you pick today will still be operationally open in three years.
This is part 1 of a 3-part series on running PostgreSQL on Kubernetes with a fully open-source operator.
Part 1 (this post): how the open-source landscape has shifted under your feet, and what to look for in an operator before you commit
Part 2: migrating from the Crunchy Data PostgreSQL Operator to the Percona PostgreSQL Operator using the standby cluster method (near-zero downtime)
Part 3: two simpler migration paths: backup-and-restore and persistent-volume reuse
In this post, you will learn about:
What has changed in the open-source landscape over the last few years, with specific examples
What licensing and redistribution actually mean for Kubernetes operators in production
How to evaluate whether a project is “open source in theory” or open source in practice
Where Percona’s PostgreSQL Operator fits in, and what the practical migration looks like
Open source isn’t what it used to be
The landscape of open source has undergone significant changes in recent years, and selecting the right operator and tooling for PostgreSQL clusters in Kubernetes has never been more important. Three recent shifts illustrate the pattern.
MinIO
MinIO was the default open-source S3-compatible storage backend for Kubernetes workloads for years. The trajectory over the last few years tells the story:
Entered what amounted to maintenance mode, narrowing community engagement, limiting support to paid subscriptions, and reducing acceptance of community contributions
On April 25, 2026, the github.com/minio/minio repository was archived by the project owner, ending public development of the open-source version
The code is still cloneable, but the project is no longer maintained as open source. Teams running MinIO in production now need an exit plan.
Bitnami
August 28, 2025: deprecation of non-hardened Debian-based images in the free tier began, and non-latest images started to be removed
September 29, 2025 (after community pushback): the public docker.io/bitnami catalog was reduced. The remaining free images were limited to a small curated set of latest-version, hardened images intended for development use; older versions of most applications were moved to a “Bitnami Legacy” repository
For Kubernetes teams, the practical impact was immediate: any Helm chart that pinned a specific Bitnami image version (a recommended practice) found that image gone or moved, breaking CI pipelines and air-gapped deployments.
Crunchy Data PostgreSQL images
Crunchy Data illustrates the same dynamic in the Postgres operator space. To be clear: the Crunchy Data PostgreSQL Operator is a mature, well-engineered project, and the team behind it has done a lot of valuable work upstream and around pgBackRest and Patroni integrations. The point of this section is not the engineering; it is the redistribution and usage terms that govern the official builds.
Crunchy’s licensing shifts, 2022 to 2024
Between 2022 and 2024, several shifts occurred:
Redistribution restrictions. While the PostgreSQL code is open source, Crunchy’s official Docker images include branding and enterprise features that are not freely redistributable. The Crunchy Data Developer Program terms describe the software as intended for internal or personal use; production use by larger organizations typically requires an active support subscription.
Restrictions on consulting and resale. The terms explicitly prohibit using Crunchy’s images to deliver support or consulting services to others without an authorized agreement. The PostgreSQL source code remains open source, but the official images and their packaging are not freely redistributable, which limits practical use in commercial and customer-facing settings.
Registry move. Most images were moved to registry.developers.crunchydata.com, which requires authentication and acceptance of terms before pulling. That draws a clearer line between open-source code and proprietary builds.
In other words, the project is open source on the code side, but the practical artifacts (images, Helm releases) are gated.
What these restrictions really mean for Kubernetes users
When container images and operators come with redistribution limits, authentication requirements, or “internal-use-only” clauses, the impact on Kubernetes environments is immediate and concrete. Teams can no longer:
Build air-gapped clusters by mirroring images to a private registry without working through a license review
Rely on GitOps workflows that assume publicly accessible OCI images
Fork or customize the operator freely, because official images cannot be redistributed with modifications
Use the software in commercial or customer-facing products without additional licensing
Run multi-cluster or multi-tenant Postgres at scale without bumping into usage terms
For a database operator, where almost every operational pattern depends on the container images you can pull and run, these restrictions effectively turn a project into a “source-available but not operationally open” solution. The code is open. The operating story is not. As a result, many teams are switching to fully open-source alternatives: the Percona Operator for PostgreSQL, CloudNativePG, Zalando Postgres Operator, StackGres, and a few others.
How to evaluate “open source” in 2026
The bigger picture here is that “open source” today often exists more in theory than in practice. It pays to look past the badge and check the operating reality. Three questions to ask before you commit to an operator:
1. Are the container images publicly redistributable?
If you cannot pull the official images without authentication, or you cannot mirror them to your private registry without a license review, your air-gapped and GitOps stories are constrained from day one. This is the question that turned out to be the most consequential one for MinIO, Bitnami, and Crunchy users in 2025.
2. Are core operational features in the open-source build, or behind a paywall?
Backup, monitoring, HA, and security features should be in the build everyone uses, not gated behind an enterprise tier. A “community edition” that omits the feature most teams actually need is a marketing build, not a real open-source build.
3. Is the governance and roadmap public?
A project where you can see the issues, the PRs, and the roadmap is one you can plan around. The Percona PG Operator’s public roadmap is an example of what this looks like in practice. A project run inside a vendor’s private tracker, by contrast, gives you no visibility.
These are not gotchas. They are the questions that decide whether a project will still serve you the same way in three years.
Migrate to freedom
Announcing the hard fork
We strongly believe in fully open-source software and want to increase our investment in the PostgreSQL and Kubernetes ecosystems. To back that up, we have decided to hard fork the Crunchy Data PostgreSQL Kubernetes Operator. Starting from version 3.0.0 (coming soon), the Percona PostgreSQL Kubernetes Operator is a fully independent project, with a public roadmap, public issue tracker, and freely redistributable images.
The hard fork is not a critique of Crunchy’s engineering. It is a commitment that the operator will keep evolving in a fully open-source direction, with no surprises about which features will be available to which audience.
Why migration is straightforward
Because the Percona PostgreSQL Operator is a hard fork of the Crunchy operator, the migration paths are surprisingly straightforward. The same underlying tools (Patroni, pgBackRest, PgBouncer) and the same overall design are used in both, which means migration can be done in multiple ways, sometimes with near-zero downtime, sometimes faster with a small downtime window. The next two posts in this series walk through three concrete options.
What’s next
This was the “why.” The next two posts are the “how”:
Part 2: Standby cluster migration. Bring up a Percona cluster as a standby of the Crunchy primary, catch it up via pgBackRest plus streaming replication, and promote it at cutover. The only downtime is the cutover itself.
Part 3: Backup-restore and PV reuse. Two simpler paths: bootstrap a Percona cluster directly from a Crunchy pgBackRest backup, or retain the existing PGDATA persistent volume and have Percona pick up where Crunchy left off.
Reversibility and exit options
All three paths are reversible: because Percona’s operator, images, and tooling are 100 percent open source and remain compatible with the same backup format and the same Patroni HA model, you keep full control. You can migrate back to Crunchy if your team decides to, or out to another open-source operator (CloudNativePG, Zalando, StackGres) using the same patterns. That last journey is a topic for a future article.
This series covers basic deployment patterns and simplified configuration examples. If your environment is more complex, uses custom images, includes Crunchy enterprise features like TDE, or otherwise needs tailored migration steps, contact the Percona team and we will help you plan and execute the move.
1:08 pm
Keeping pgBackRest Open, Healthy, and Community Driven
When the future of pgBackRest suddenly became uncertain, the PostgreSQL ecosystem reacted quickly.
At Percona, we believed the most important question was not:
what replaces it?
but:
how do we ensure pgBackRest remains healthy, sustainable, and open for everyone?
That distinction matters.
pgBackRest is critical infrastructure used by enterprises around the world to protect some of their most important data. When projects like this face maintainership or sustainability challenges, organizations need trusted open source partners that can help provide continuity, stability, and confidence.
Supporting continuity, not fragmentation
From the beginning, Percona believed the best outcome for pgBackRest was not fragmentation, forks, or closed alternatives.
What the project needed was continuity.
That meant working collaboratively across the ecosystem to help strengthen the project itself:
– coordinating funding discussions
– contributing engineering resources
– helping expand the maintainer base
– encouraging participation from multiple organizations
The goal was never to control the project. The goal was to help ensure pgBackRest remained open, healthy, and sustainable for the entire PostgreSQL community.
A joint effort across maintainers, contributors, and multiple companies is helping ensure pgBackRest returns in a stronger and healthier position than before. Funding, engineering support, and long-term sustainability discussions are now happening collaboratively across the ecosystem.
Percona is proud to play a part in that effort. Just as importantly, this moment would likely never have happened without David Steele bringing visibility to the sustainability realities behind maintaining critical open source infrastructure.
For more than a decade, David built pgBackRest into one of the most trusted backup and recovery solutions in the PostgreSQL ecosystem. The current momentum around the project reflects the value of that work and the trust the community has in what he created.
That is how healthy open source ecosystems should work.
The role of trusted open source partners
At Percona, this is not simply a business decision. It reflects how we see open source itself: the strongest ecosystems are built in the open, through collaboration, shared responsibility, and long-term commitment.
Enterprises need more than software alone. They need trusted partners that can help support continuity, sustainability, and long-term ecosystem health.
We believe critical open source infrastructure is strongest when it remains:
– community driven
– vendor neutral
– collaboratively maintained
– available to everyone
The pgBackRest story is also a reminder that the PostgreSQL ecosystem needs stronger long-term sustainability structures around critical community infrastructure.
Whether that ultimately takes the form of an ecosystem foundation or another collaborative model, the goal should remain the same: ensuring the projects enterprises rely on stay healthy, trusted, and sustainably maintained.
1:52 am
Bringing pt-query-digest-Style Slow Query Analysis to PostgreSQL with pg_enhanced_query_logging
In this blog post, we are going to briefly discuss pg_enhanced_query_logging (PEQL for short), a PostgreSQL extension that produces slow query logs in the same format MySQL and Percona Server users have been feeding into pt-query-digest for years. The idea is simple: reuse the tried-and-true tools and concepts we have been using to perform full query audits with a low performance impact. This tool was conceived and developed for the recent Percona Build with AI Competition.
A quick word of caution before we begin: PEQL is under active development and has not been validated for production use. We will use it in a development environment here, and you should do the same.
Why a new slow log for PostgreSQL?
Out of the box, PostgreSQL gives us log_min_duration_statement and a handful of related GUCs that print slow queries to the server log. That is useful, but the format is line-oriented and mixed in with everything else PostgreSQL writes there. On the MySQL side, the Percona Server extended slow query log goes much further: per-query counters, lock and I/O times, plan-quality flags, and a structured format that pt-query-digest can group by query fingerprint and rank by total time, average time, lock time, etc. This introduces the more powerful concept of performance of a family of queries, and not just individual query executions.
PEQL ports that same workflow to PostgreSQL. It hooks into the executor and planner, captures timing, buffer I/O, WAL, JIT and row-count metrics for every query slower than a configurable threshold, and writes them to a dedicated log file using a pt-query-digest-compatible format.
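To illustrate what grouping by query fingerprint means, here is a deliberately simplified sketch. It is not pt-query-digest's actual algorithm and is unrelated to PEQL's code: the normalization rules and function names are assumptions chosen to show the idea of collapsing literals so that executions of the same query family add up.

```python
import re
from collections import defaultdict


def fingerprint(sql):
    """Reduce a query to a rough family signature by removing literals."""
    s = sql.strip().lower()
    s = re.sub(r"'(?:[^'\\]|\\.|'')*'", "?", s)            # quoted strings
    s = re.sub(r"\b\d+(?:\.\d+)?\b", "?", s)               # numeric literals
    s = re.sub(r"\s+", " ", s)                             # collapse whitespace
    s = re.sub(r"\(\s*\?(?:\s*,\s*\?)+\s*\)", "(?+)", s)   # value lists
    return s


def rank_by_total_time(entries):
    """Group (sql, seconds) pairs by fingerprint and rank by total time.

    Returns a list of (fingerprint, total_seconds, executions) tuples,
    largest total first.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for sql, seconds in entries:
        bucket = totals[fingerprint(sql)]
        bucket[0] += seconds
        bucket[1] += 1
    rows = [(fp, total, count) for fp, (total, count) in totals.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    log = [
        ("SELECT * FROM a WHERE id = 1", 0.5),
        ("SELECT * FROM a WHERE id = 2", 0.25),
        ("SELECT * FROM b", 0.5),
    ]
    for row in rank_by_total_time(log):
        print(row)
```

In this toy log, no single execution against table a is slower than the query against b, but the family as a whole ranks first once its executions are summed.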
Motivation and benefits
The original idea behind this extension is doing query audits with minimal impact on the running server. We want to be able to ask “which queries will benefit most from tuning?” without paying for it in latency, I/O or in a flood of unrelated log lines.
That goal drives most of the design decisions:
Statistically accurate sampling with low overhead. We don’t need to log every single query to draw useful conclusions. PEQL can sample 1 out of every N queries (or 1 out of every N sessions); run for long enough, this yields a sample that represents the overall workload for that period. The cost on the producer side stays low even on busy servers.
pt-query-digest compatibility out of the box. The output format mirrors the MySQL/Percona Server slow log, so the same toolchain we already use for MySQL audits works for PostgreSQL with no extra steps.
Logging to a separate file. All entries go to a dedicated file (default peql-slow.log), not to PostgreSQL’s main error log. That keeps the error log clean for actual errors and lets us point the slow log at a separate mountpoint if we want to isolate its I/O from the rest of the server.
Rate limiting by both queries and bytes per second. On top of the per-session/per-query 1-in-N sampling, peql.rate_limit_auto_max_queries and peql.rate_limit_auto_max_bytes give us a cluster-wide cap on logged queries per second and on bytes written per second. Useful for guaranteeing that the slow log itself never becomes a performance issue.
Always-log override for slow outliers. Even when sampling is on, peql.rate_limit_always_log_duration lets us say “but always log anything that takes longer than X ms”. The common queries get randomly sampled; the long-running ones always get logged.
Extended resource usage metrics. Each entry includes buffer hit/read/dirtied/written counts (shared, local and temp), block I/O timings, WAL records/bytes/full-page images, JIT compilation timings, planning time, optional memory context allocations and an optional wait-event histogram.
Execution plans embedded in the entry. With peql.log_query_plan = on, the full EXPLAIN ANALYZE output (text or JSON) is appended to each entry, so the plan that produced the metrics is right there next to them when we are reviewing the log later.
Automatic pause when disk space is low. If the log mountpoint drops below a configurable free-space threshold, PEQL pauses logging on its own (with optional auto-purge of old rotated files) and resumes once there is room again. The database keeps serving traffic; the slow log gets out of the way.
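To make the interaction between the threshold, the 1-in-N sampling and the always-log override concrete, here is a minimal Python sketch of the decision described above. It is not PEQL's actual code: the function name `should_log` and its parameters are made up for illustration, and the real extension implements this logic in C inside the executor hooks.

```python
import random


def should_log(duration_ms, min_duration_ms, sample_every_n, always_log_ms, rng=random):
    """Decide whether one finished query is written to the slow log.

    All names here are hypothetical; they only mirror the behaviour
    described in the text (threshold, 1-in-N sampling, always-log override).
    """
    if duration_ms < min_duration_ms:
        return False  # faster than the threshold: never logged
    if always_log_ms is not None and duration_ms >= always_log_ms:
        return True  # slow outlier: bypasses sampling entirely
    if sample_every_n <= 1:
        return True  # sampling disabled: log everything above the threshold
    # Random 1-in-N sampling: each query is kept with probability 1/N.
    return rng.randrange(sample_every_n) == 0


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic workload: exponentially distributed durations, mean 20 ms.
    durations = [rng.expovariate(1 / 20.0) for _ in range(100_000)]
    logged = [d for d in durations if should_log(d, 0, 10, 500, rng)]
    print(f"logged {len(logged)} of {len(durations)} queries")
```

With N = 10 roughly one tenth of the ordinary queries end up in the log, while anything at or above the always-log duration is captured every time.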
Installing the extension
PEQL is a regular PGXS extension. To build it, we can execute the following steps:
git clone https://github.com/guriandoro/pg_enhanced_query_logging.git
cd pg_enhanced_query_logging
make USE_PGXS=1
sudo make install USE_PGXS=1
This installs the shared library into $(pg_config --pkglibdir) and the SQL/control files into $(pg_config --sharedir)/extension/. The hooks live in the shared library, so we need to preload it. Add the following line to postgresql.conf (or edit your current value to include it):
shared_preload_libraries = 'pg_enhanced_query_logging'
Restart PostgreSQL, and then create the extension in any database where we want the SQL helper functions:
CREATE EXTENSION pg_enhanced_query_logging;
To easily test it, the repository ships a Docker-based quick start that builds Rocky Linux 9 + PostgreSQL 18 with the extension preloaded:
./test/deploy_docker_pg18_rhel.sh
A minimal configuration
For a first look, the easiest thing to do is to log every query at full verbosity:
shared_preload_libraries = 'pg_enhanced_query_logging'
peql.log_min_duration = 0 # log every query
peql.log_verbosity = 'full' # emit all metric lines
While we are at it, we can also silence PostgreSQL’s native query logging so we have a single place to look:
log_statement = 'none'
log_min_duration_statement = -1
log_duration = off
By default, PEQL writes to peql-slow.log inside PostgreSQL’s log_directory. The location and filename are configurable via peql.log_directory and peql.log_filename.
What an entry looks like
After running a few queries, opening peql-slow.log shows entries like this one (trimmed for brevity):
# Time: 2026-03-11T09:15:32.847291
# User@Host: app_user[app_user] @ 10.0.1.42 []
# Thread_id: 48712 Schema: mydb.public
# Query_id: -6432758210044805760
# Query_time: 1.285034 Lock_time: 0.000000 Rows_sent: 256 Rows_examined: 87500
# Shared_blks_hit: 4096 Shared_blks_read: 312 Shared_blks_dirtied: 0 Shared_blks_written: 0
# Temp_blks_read: 0 Temp_blks_written: 48
# Shared_blk_read_time: 0.024310 Shared_blk_write_time: 0.000000
# WAL_records: 0 WAL_bytes: 0 WAL_fpi: 0
# Plan_time: 0.003210
# Full_scan: Yes Temp_table: No Temp_table_on_disk: Yes Filesort: Yes Filesort_on_disk: No
# JIT_functions: 4 JIT_generation_time: 0.001250 JIT_emission_time: 0.003100
SET timestamp=1741680931;
SELECT o.id, o.total, c.name FROM orders o JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'pending' ORDER BY o.total DESC LIMIT 256;
The full breakdown of every field, with the GUCs that produce it, lives in doc/annotated-sample.md. This is a great place to start reading the documentation.
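Because the metric lines are plain "Key: value" pairs behind a "# " prefix, they are easy to pick apart by hand when we want a quick look without any tooling. The following Python sketch shows the idea on two lines from the sample entry. The helper name `parse_metric_line` is our own, and the regular expression only handles the simple metric lines; it is not a full parser for the format (the User@Host line, for instance, needs special treatment).

```python
import re

# One "Key: value" pair; values in the metric lines contain no spaces.
PAIR = re.compile(r"(\w+): (\S+)")


def parse_metric_line(line):
    """Turn '# Key: value Key: value ...' into a dict of strings."""
    if not line.startswith("# "):
        return {}  # query text, SET timestamp, etc. are not metric lines
    return dict(PAIR.findall(line))


sample = [
    "# Query_time: 1.285034 Lock_time: 0.000000 Rows_sent: 256 Rows_examined: 87500",
    "# Full_scan: Yes Temp_table: No Temp_table_on_disk: Yes Filesort: Yes Filesort_on_disk: No",
]

metrics = {}
for line in sample:
    metrics.update(parse_metric_line(line))

# Rows examined per row sent is a quick hint at how selective the query was.
rows_ratio = int(metrics["Rows_examined"]) / int(metrics["Rows_sent"])
print(metrics["Query_time"], metrics["Full_scan"], round(rows_ratio, 1))
```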
Feeding it to pt-query-digest
Because the format mirrors the MySQL slow log, we can point pt-query-digest at it directly:
We get the familiar profile at the top (queries grouped by fingerprint, ranked by total time), followed by the per-query detail blocks. The plan-quality flags above can also be used as filters, for example to look only at queries that did a sequential scan:
Example pt-query-digest output:
[Image: queries grouped by fingerprint, ranked by total time]
[Image: per-query detail blocks]
A few useful knobs
Once we move beyond logging everything, there are a handful of GUCs worth knowing about:
peql.rate_limit: 1-in-N sampling, either per session or per query, with a peql.rate_limit_always_log_duration override so that very slow queries are always captured even when sampling is on.
peql.log_parameter_values: include actual bind parameter values for prepared statements alongside the placeholder query text.
peql.log_query_plan: embed the full EXPLAIN ANALYZE output (text or JSON) inside the log entry, so the plan that produced the metrics is right there next to them. This can be expensive in terms of I/O, so use sparingly and only if needed.
The full list, with default values and contexts, is documented in doc/configuration.md.
Future work
The pt-query-digest compatibility is a feature, but it’s also a constraint: the MySQL slow log format was designed to be human readable, which means it’s way too verbose. For instance, the plan-quality flags line only has 5 bits of actual information, but uses around 100 bytes to encode them:
# Full_scan: Yes Temp_table: No Temp_table_on_disk: No Filesort: Yes Filesort_on_disk: No
We can do better by simply logging YNNYN or 10010 (hence the 5 bits of information mentioned above), and letting the position within the query log entry make it self-explanatory what this information is.
Compared with the five-character YNNYN form, the current line is roughly a 20x amplification! And other lines suffer from similar issues… Multiply that by every query on a busy server and the overhead adds up quickly, both in disk space and in the I/O the backend has to do to write the entries out.
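The arithmetic behind that comparison can be checked in a few lines of Python. This is only an illustration: the compact encodings are the hypothetical ones suggested in the text, not an existing PEQL format.

```python
line = "# Full_scan: Yes Temp_table: No Temp_table_on_disk: No Filesort: Yes Filesort_on_disk: No"

# Pull out the Yes/No answers in the order they appear on the line.
answers = [tok for tok in line.split() if tok in ("Yes", "No")]

compact = "".join("Y" if a == "Yes" else "N" for a in answers)  # positional letters
bits = "".join("1" if a == "Yes" else "0" for a in answers)     # same data as bits

verbose_bytes = len(line.encode("utf-8"))
ratio = verbose_bytes / len(compact)

print(compact, bits)
print(f"{verbose_bytes} bytes verbose vs {len(compact)} bytes compact: about {ratio:.0f}x")
```

Measured against the five raw bits instead of five characters, the gap is eight times larger again.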
There are two pieces of follow-up work we have in mind to address this:
A PEQL-native log format. A more compact, structured format (think key-value pairs with short keys, bitfields for the boolean flags, or a binary framing for the numeric metrics) that drops the bytes-per-query cost without losing any of the information we currently emit. The verbose pt-query-digest-compatible format could still be available for users that want it; the native format would be the recommended option for high-throughput workloads.
Tooling for the new format. Once the native format exists, we will either contribute a parser to pt-query-digest so that it can ingest it natively (--type peql or similar), or ship a small companion tool that either post-processes the native-format logs or produces the same kind of profile reports pt-query-digest does today. Either way, the goal is to keep the analysis workflow we are used to while removing the format-imposed overhead from the producer side.
If any of this sounds interesting and you would like to help shape it, the repository’s doc/contributing.md is the right place to start.
Conclusion
PostgreSQL has had rich per-query metrics available for a while now, but stitching them together into the kind of “show me the worst-performing family of queries from the last hour” workflow MySQL users have enjoyed for years has taken more effort. PEQL closes that gap by emitting a single, pt-query-digest-compatible log file with timing, buffer, WAL, JIT and plan-quality data attached to every query.
If you want to dig deeper, the doc/ directory in the repository has detailed pages on the output format, the architecture of the hooks, the rate limiter and the disk-space protection logic. And if you have not used pt-query-digest before, this is a great time to do it!
5:12 pm
PSMDB Sandbox: A Browser-Based UI for Deploying MongoDB with Terraform and Ansible
If you’ve ever wrestled with .tfvars files, juggled Ansible inventory paths, or tried to remember the exact command sequence for a MongoDB setup — this post is for you.
PSMDB Sandbox is a lightweight web frontend built in Go that ships inside the Percona MongoDB Automation repository. It puts a clean browser interface on top of the full Terraform + Ansible automation stack, so you can spin up, manage, and tear down MongoDB environments without ever touching a config file by hand.
This project was built using vibe coding — the result is a fully functional application developed rapidly without writing every line from scratch. It’s a great example of how AI-assisted development can accelerate tooling projects that would otherwise sit in the backlog forever.
Why a Web UI?
The mongo_terraform_ansible project already automates a lot: it can deploy Percona Server for MongoDB (PSMDB), Percona Backup for MongoDB (PBM), and Percona Monitoring and Management (PMM) across AWS, GCP, Azure, Docker, and Libvirt/KVM. That’s powerful — but the workflow traditionally meant editing .tfvars files, running commands in the right order, and tracking state in your head.
The Go UI changes that. It wraps the same Terraform and Ansible automation in a wizard-style interface, streams live output to your browser, and keeps track of environment state so you always know what’s running, stopped, or in progress.
It’s particularly useful as a testing sandbox for PSMDB features. You can quickly spin up a replica set or sharded cluster, test backup and restore workflows with PBM, explore audit logging, and observe everything through PMM monitoring — all from the browser, and all torn down just as easily when you’re done.
What You Can Configure
Cluster Topology
Define how many clusters and replica sets you want, the number of nodes per replica set, and whether to deploy a sharded cluster or a simple replica set. Each cluster is independently configurable.
PSMDB Version and Packages
Pick the exact Percona Server for MongoDB release you want to test — package identifiers are fetched automatically from the Percona repository listing on startup, so you’re always selecting from what’s genuinely available. For Docker-based environments, image tags are pulled live from Docker Hub and cached for five minutes.
Backup and Restore with PBM
Percona Backup for MongoDB (PBM) can be included in the deployment. PBM is configured with the native storage backend for the supported environments (e.g. an S3 bucket is automatically created for AWS). This makes the sandbox ideal for testing backup policies, point-in-time recovery, and restore scenarios without touching production.
PMM Monitoring
You can include a PMM Server in your environment so every PSMDB node is monitored from the moment it comes up. This makes it straightforward to test alerting rules, explore query analytics, or simply validate that your monitoring setup looks right before applying it elsewhere.
Live Deployment Logs
When you hit Deploy, the UI kicks off terraform init && terraform apply (plus Ansible playbooks for cloud platforms) in a background goroutine and streams the output directly to your browser via Server-Sent Events. No more tailing log files in a separate terminal.
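For readers who have not worked with Server-Sent Events, the wire format is simple: each message is one or more "data:" lines followed by a blank line. The sketch below shows the general pattern of turning a child process's output into SSE frames. It is written in Python purely for illustration and is not the project's code (the UI itself is written in Go); the function names are ours, and a tiny Python one-liner stands in for the real terraform and Ansible commands.

```python
import subprocess
import sys


def sse_event(text):
    """Frame one message as a Server-Sent Event (one 'data:' line per text line)."""
    lines = text.splitlines() or [""]
    return "".join(f"data: {line}\n" for line in lines) + "\n"


def stream_command(cmd):
    """Yield one SSE frame per output line of a child process, then its exit code."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    for line in proc.stdout:
        yield sse_event(line.rstrip("\n"))
    proc.wait()
    yield sse_event(f"[exit code {proc.returncode}]")


if __name__ == "__main__":
    # Stand-in for a long-running deployment command.
    demo = [sys.executable, "-c", "print('init'); print('apply')"]
    for frame in stream_command(demo):
        sys.stdout.write(frame)
```

A web handler would write these frames to a response with the text/event-stream content type, and the browser's EventSource API would receive each line as it is produced.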
Hosts & Connections Panel
After a successful deployment, the environment detail page shows every host (or container) with:
Its IP address
A ready-to-copy connect command (ssh user@host or docker exec -it <name> bash)
MongoDB connection strings for every replica set and cluster
Clickable Open buttons for PMM and MinIO Console URLs
Stop, Restart, Reset, and Destroy
Full lifecycle management is available from the UI. For Docker environments, Stop and Restart call docker stop / docker restart filtered by the environment’s prefix. For cloud environments, the corresponding Ansible stop.yml and restart.yml playbooks run. Destroy calls terraform destroy and, on success, automatically cleans up the inventory and redirects you back to the environments list.
Getting Started
git clone https://github.com/percona/mongo_terraform_ansible.git
cd mongo_terraform_ansible/ui-go
go run .
You can customize the bind address and port with environment variables:
Security note: The UI is designed for local use. It binds to 127.0.0.1 by default. Don’t expose it to the public internet without adding authentication.
Try It and Share Your Feedback
PSMDB Sandbox is a community-contributed tool. If you try it out, run into issues, or have ideas for improvements, open an issue or pull request on GitHub. The project is licensed under Apache 2.0.
12:43 pm
I Know Kung Fu
You might find this hard to believe, but AI has become kind of a thing around here.
Bennie published a post on our Build with AI competition last week, in which he shared that I was lucky enough to land the second place prize. Genuinely flattered, and a real thank you to Peter F, PZ, Vadim, and Bennie for organizing it. The recognition is great. But the part that does not quite come through in the recap is what those six weeks actually felt like from the inside. Forty-plus submissions, 10+ teams, marathon demo sessions that ran out of time twice over, and a constant drumbeat of ideas where every fifth one made me think “wait, we can just… ship that?”
Two submissions that really impressed me (and are worthy of high praise): Kedar Vaijanapurkar shipped a four-tool MySQL stack (Advisor, random data generator, CleanPrompt, and a Query Reviewer), any one of which on its own would have been a strong submission. And Daniil built a leaderboard for Percona ecosystem contributors plus a vector-search prototype running on Percona’s own products, which is exactly the dogfood story we want.
There were a lot more than three projects worth backing, which is part of why a second contest round is being coordinated later this year. A lot of the entries are not waiting for it either – they are already developing into real, operational utilities (some of mine included).
The two submissions of my own that I would point to first are IBEX and percona-dk.
IBEX (Integration Bridge for EXtended systems) is a local MCP multi-tool server that connects either a local model or a Percona-owned LLM to the systems where the most valuable context actually lives. Slack, Notion, Jira, ServiceNow, Salesforce, etc. A solution was needed here since we could not point the standard Claude or ChatGPT connectors at our sensitive internal data, and obviously most of the context that makes LLMs so valuable is precisely that kind of data.
percona-dk is the other one. It started as a way to keep AI honest about our own products by giving the AI tools our teams use (Claude, Cursor, anything that speaks MCP) direct access to Percona’s documentation, so the answer to a question about our products comes from real docs with linked citations instead of stale training data or even scraped web results that can get things wrong. It has evolved a fair bit since the contest. The Percona Community blog and forums are now indexed alongside the docs, Perconians are getting real day-to-day value out of it, and it is starting to look like the kind of thing that could grow into a community utility (perhaps even beyond Percona docs).
Those two were just the start. Once IBEX worked, I needed shared memory across LLMs, so I built that. Once I had three MCP servers running, the boilerplate got annoying, so I built CAIRN, a scaffolding tool that builds on Anthropic’s official MCP builder skill. The official skill walks you through writing a server step by step, but CAIRN spins up a complete, working project in minutes with a streamlined install wizard for non-technical users. It is now in the hands of other Perconians building their own MCP tools, and providing real value of its own. Then I learned about .mcpb files and Desktop Extensions (.dxt), packaged everything that way, and stood up an internal Claude plugin marketplace so any Perconian can install the lot from one place. Each layer opened a door I did not know existed until I was already through it. Some of those doors seemingly materialized from thin air as they magically aligned with new releases from Anthropic.
What started as a competition entry is now a small internal ecosystem. I am still a product person, not a software engineer. I am not going to pretend any of the code is pristine, and a lot of it was vibe-coded with Claude as a partner. But the architecture holds together, it works, and most of it is in daily use by people who are not me. That last part is the bit I am most proud of.
The next batch is pointed squarely at product operations. Making customer signals legible. Making internal telemetry something any teammate can talk to in plain English. The early returns are promising, and what gets me most excited is not the tech itself, it is watching people across Product, Engineering, and Support pull in the same direction with an AI colleague in the room. Turns out the interesting part of AI at work is not the model. It is the connective tissue.
I know Kung Fu
For a product guy who does not code for a living, this era is my “I know kung fu” moment. Not because I suddenly learned to fight. Because the move set I already had – product judgment, systems thinking, customer empathy, the ability to spec a thing precisely – just got a massive upgrade. The gap between “that would be useful” and “that exists now” is short enough to cross in an evening. I do not see it getting longer again.
Thanks for reading this far. If you want more detail or want to try anything not linked here, ping me. I am happy to share more.
5:45 am
Curious case of PXC node that refused to start due to SSL
In this blog, I am going to share a real-world debugging case study where a routine Percona XtraDB Cluster node restart led to an unexpected failure. I will walk through what we observed, what we checked, and how we ultimately identified the root cause.
Let’s see how the maintenance goes. It was supposed to be a simple restart. The kind you’ve done a hundred times. You SSH in, run the maintenance, bring the node back up, and go grab a coffee. Except this time, the coffee went cold on the desk… because MySQL refused to start.
The Problem
The error log of Percona XtraDB Cluster (8.0) had the following information:
2025-11-05T05:26:10.982984Z 0 [ERROR] [MY-000059] [Server] SSL error: Unable to get certificate from '/var/lib/mysql/server-cert.pem'.
2025-11-05T05:26:10.983030Z 0 [Warning] [MY-013595] [Server] Failed to initialize TLS for channel: mysql_main. See below for the description of exact issue.
2025-11-05T05:26:10.983045Z 0 [Warning] [MY-010069] [Server] Failed to set up SSL because of the following SSL library error: Unable to get certificate
2025-11-05T05:26:10.983052Z 0 [Note] [MY-000000] [WSREP] New joining cluster node configured to use specified SSL artifacts
2025-11-05T05:26:10.983083Z 0 [Note] [MY-000000] [Galera] Loading provider /usr/lib64/galera4/libgalera_smm.so initial position: 07c67757-0d18-11ef-b5a9-ee5d87b39aa8:4147053897
2025-11-05T05:26:10.983098Z 0 [Note] [MY-000000] [Galera] wsrep_load(): loading provider library '/usr/lib64/galera4/libgalera_smm.so'
2025-11-05T05:26:10.983742Z 0 [Note] [MY-000000] [Galera] wsrep_load(): Galera 4.22(f6c0465) by Codership Oy <info@codership.com> (modified by Percona <https://percona.com/>) loaded successfully.
2025-11-05T05:26:10.983771Z 0 [Note] [MY-000000] [Galera] Resolved symbol 'wsrep_node_isolation_mode_set_v1'
2025-11-05T05:26:10.983784Z 0 [Note] [MY-000000] [Galera] Resolved symbol 'wsrep_certify_v1'
2025-11-05T05:26:10.983807Z 0 [Note] [MY-000000] [Galera] CRC-32C: using 64-bit x86 acceleration.
2025-11-05T05:26:10.983995Z 0 [Note] [MY-000000] [Galera] not using SSL compression
2025-11-05T05:26:10.984341Z 0 [ERROR] [MY-000000] [Galera] Bad value '/var/lib/mysql/server-cert.pem' for SSL parameter 'socket.ssl_cert': 336245135: 'error:140AB18F:SSL routines:SSL_CTX_use_certificate:ee key too small'
at /mnt/jenkins/workspace/pxc80-autobuild-RELEASE/test/rpmbuild/BUILD/Percona-XtraDB-Cluster-8.0.42/percona-xtradb-cluster-galera/galerautils/src/gu_asio.cpp:ssl_prepare_context():471
2025-11-05T05:26:10.984401Z 0 [ERROR] [MY-000000] [Galera] Failed to create a new provider '/usr/lib64/galera4/libgalera_smm.so' with options 'gcache.size=1G;gcache.recover=yes;socket.ssl=yes;socket.ssl_ca=/data00/mysqldata/ca.pem;socket.ssl_cert=/data00/mysqldata/server-cert.pem;socket.ssl_key=/data00/mysqldata/server-key.pem;socket.ssl_key=/var/lib/mysql/server-key.pem;socket.ssl_ca=/var/lib/mysql/ca.pem;socket.ssl_cert=/var/lib/mysql/server-cert.pem': Failed to initialize wsrep provider
2025-11-05T05:26:10.984434Z 0 [ERROR] [MY-000000] [WSREP] Failed to load provider
2025-11-05T05:26:10.984448Z 0 [ERROR] [MY-010119] [Server] Aborting
2025-11-05T05:26:10.984602Z 0 [System] [MY-010910] [Server] /usr/sbin/mysqld: Shutdown complete (mysqld 8.0.42-33.1) Percona XtraDB Cluster (GPL), Release rel33, Revision 6673f8e, WSREP version 26.1.4.3.
2025-11-05T05:26:10.985473Z 0 [ERROR] [MY-010065] [Server] Failed to shutdown components infrastructure.
MySQL was down, and the maintenance clock was running. The certificate file sitting at /var/lib/mysql/server-cert.pem was the same file that had been working perfectly fine before the restart! From past history, we knew that the following commands had been executed successfully on the same cluster node:
SET GLOBAL ssl_ca = '/var/lib/mysql/ca.pem';
SET GLOBAL ssl_cert = '/var/lib/mysql/server-cert.pem';
SET GLOBAL ssl_key = '/var/lib/mysql/server-key.pem';
ALTER INSTANCE RELOAD TLS;
Clients connected over TLS. Galera nodes communicated securely. There were zero complaints from the error log. In other words, the SSL reload at runtime inherited the process environment that existed when MySQL originally booted. Everything was smooth, but after a restart? MySQL complains and declines to start. So what has changed?
Checking Usual Suspects
File permissions
We checked the PEM files.
Ownership: mysql:mysql. Permissions: 644 for the cert, 600 for the key.
We compared them against the other Galera nodes, and they were identical. This didn’t look like a permissions problem.
Is SELinux to blame here?
SELinux has ruined enough DBA time that it holds one of the top spots on such checklists, but here it was in permissive mode.
$ getenforce
Permissive
That means it was logging any security issues, but not blocking. And there were no AVC denials related to MySQL or the PEM files in /var/log/audit/audit.log or dmesg!
File corruption
Did the files get corrupted/replaced during or before the MySQL restart?
$ openssl x509 -in /var/lib/mysql/server-cert.pem -noout -text
# Output looked perfectly valid when compared to the output from other nodes
$ openssl rsa -in /var/lib/mysql/server-key.pem -check
RSA key ok
The files were fine. They parsed cleanly. OpenSSL could read them. So why couldn’t MySQL?
More Logs review
We scanned /var/log/messages and journalctl for anything unusual around the time of the restart. No disk errors. No OOM kills. No kernel panics. Nothing that screamed “I am the Dhurandhar that’s destroyed your node.” At this point, most of the usual suspects were guilt-free, staring at us, asking, “Who did it?”
The Clue
It is good to communicate with stakeholders, and we did. We asked the client, “Was there any recent change on your side?”, and they uttered the golden words: “Last week the crypto-policy was updated on all of the DB servers to comply with PCI.”
PCI > Crypto-policy – Let’s go and check it !!
$ update-crypto-policies --show
FUTURE
The system was running RHEL’s FUTURE cryptographic policy.
For those unfamiliar (including me at the time), Red Hat Enterprise Linux (and its derivatives, such as Rocky, Alma, and Oracle Linux) ships with a system-wide cryptographic policy framework. It’s a centralized way to enforce minimum standards for TLS versions, cipher suites, key lengths, and signature algorithms across all applications on the system that use the core cryptographic libraries, including OpenSSL, and yes, anything that links against those libraries… like MySQL.
Here’s a table that shows information about the crypto-policy levels:
| Policy | RSA Minimum | TLS Minimum | SHA-1 Signatures | Use Case |
| --- | --- | --- | --- | --- |
| LEGACY | 1024-bit | TLS 1.0 | Allowed | Old systems compatibility |
| DEFAULT | 2048-bit | TLS 1.2 | Allowed | Standard operations |
| FUTURE | 3072-bit | TLS 1.2 | Blocked | Forward-looking hardening |
| FIPS | 2048-bit | TLS 1.2 | Blocked | FIPS 140 compliance |
So FUTURE demands a 3072-bit RSA key; otherwise, it is blocked. What do we have?
2048 bits! C’mon! And now I recall the error log again… The hint was there:
error:140AB18F:SSL routines:SSL_CTX_use_certificate:ee key too small
Now we have our story straight. On restart, our PXC cluster node started a new process linked against OpenSSL, which now enforced the FUTURE policy. OpenSSL looked at the 2048-bit RSA certificate and said: “Nope. Too small.”
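A check like this is easy to script before the next maintenance window. Here is a small Python sketch that reads the key size out of the text produced by openssl x509 -noout -text and compares it with the RSA minimums from the policy list in this post. The helper names are hypothetical, and the sample text is an illustrative excerpt in the usual OpenSSL layout, not output captured from the affected node.

```python
import re

# RSA minimums per policy, as listed in this post.
RSA_MIN_BITS = {"LEGACY": 1024, "DEFAULT": 2048, "FUTURE": 3072, "FIPS": 2048}


def public_key_bits(x509_text):
    """Extract the key size from `openssl x509 -noout -text` output, or None."""
    match = re.search(r"Public-Key: \((\d+) bit\)", x509_text)
    return int(match.group(1)) if match else None


def accepted_by(policy, bits):
    """True if a key of this size meets the policy's RSA minimum."""
    return bits is not None and bits >= RSA_MIN_BITS[policy]


# Illustrative excerpt only; real output contains many more fields.
sample_output = """
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                RSA Public-Key: (2048 bit)
"""

bits = public_key_bits(sample_output)
for policy in ("DEFAULT", "FUTURE"):
    verdict = "accepts" if accepted_by(policy, bits) else "rejects"
    print(f"{policy} {verdict} a {bits}-bit RSA key")
```

Running the same comparison against every node's certificate right after a policy change would surface the mismatch before a restart does.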
The Fix
The quick fix here would be to adjust the policy to DEFAULT.
sudo update-crypto-policies --set DEFAULT
This will accept the current certificates, and the node will rejoin the cluster readily.
Alternatively, to remain compliant with the stricter security policy, the fix is to:
Generate new certificates with RSA keys of at least 3072 bits
Deploy the keys/certs to all Galera nodes
Perform a rolling restart
Conclusion
This was a classic case of a problem hiding at the boundary between two domains, database administration and operating system security. The DBA saw valid certificates and correct MySQL configuration. The sysadmin saw a properly hardened system with a strong crypto policy. Neither was wrong. But the intersection of their two correct configurations produced a failure.
This incident reinforces the importance of cross-domain awareness, where resolving database issues sometimes requires understanding and challenging system-level security decisions.
5:00 am
Building Query Analysis and Insights Dashboard in PMM
Percona Monitoring and Management (PMM) is a great open source database monitoring, observability, and management tool. Query Analytics is one of its most prominent features, and DBAs use it actively to investigate incidents and identify query performance problems.
We all know and love the Query Analytics (QAN) dashboard… It’s the first place we look when an incident alert fires or when a developer asks, “Why is the app slow?” or “What was going on during the midnight production outage?”
But sometimes, the standard dashboards just don’t tell the whole story or maybe are not clear enough. QAN is great, but shouldn’t we have more? If you have PMM running, you already have a Ferrari engine under the hood: ClickHouse. Most of us just drive it in first gear using the default UI.
In this post, we are going to take the training wheels off. We will bypass the standard QAN interface and talk directly to the ClickHouse backend to build highly specialised dashboards. We aren’t just looking for “slow” queries anymore; we are hunting for inefficiency, volatility, and the “silent killers” that standard monitoring often misses.
This is a hands-on post, so grab your coffee and let’s turn that PMM instance into a deep-dive forensic tool.
Create a New Dashboard in PMM
Connect to PMM > Dashboards > Create New Dashboard
Save it with name “Slow Query Analysis” and Description “Slow Query Analysis from PMM’s QAN database (clickhouse)”
Click on add visualisation & select datasource “ClickHouse”
Choose SQL Builder
Paste the following query to get top 10 slow queries from the database
SELECT fingerprint
FROM pmm.metrics
WHERE service_type = 'mysql'
AND $__timeFilter(period_start)
GROUP BY fingerprint
ORDER BY sum(m_query_time_sum) DESC
LIMIT 10
Choose “Table View” on the top to view the list. When you click “Run Query”, you will see the top 10 slow queries in the chosen time period.
Let’s Save the dashboard after Panel Options updates as follows:
7.1 Change Panel Name and Description to: “Slow Query Analysis”
7.2 Legend Placement to “Bottom”, Values to “min”, “max”, “mean”
7.3 Change Axis’ Scale to “Logarithmic”. Logarithmic scale on an axis compresses large ranges of data, making it ideal for visualizing metrics with vastly different magnitudes. This provides good visualisation for queries of different execution time frames.
7.4 Save Dashboard
Alright, we’re at our first step. This first result set shows the top 10 slow query fingerprints across all MySQL services tracked by PMM for the selected time range. It provides a quick, environment-wide view of the most expensive query patterns. But this does not provide a clear picture. Let’s refine the dashboard to focus on specific queries, servers and observe their performance over time.
Now, let’s introduce a variable to filter the data.
Click on Settings on Dashboard’s home page
8.1 Choose “Variables” tab and click on “Add Variable”
8.2 Add variable configuration and Save Dashboard
Go Back to Dashboard and Edit “Slow Query Analysis” Panel.
Now you should see the Query ID filter on the top.
Change the query to the following
SELECT
  period_start AS time,
  left(fingerprint, 80) AS query_text,
  -- average time per execution: total time divided by number of executions
  sum(m_query_time_sum) / sum(m_query_time_cnt) AS query_time
FROM
  pmm.metrics
WHERE
  service_type = 'mysql'
  AND $__timeFilter(period_start)
  AND fingerprint IN (
    -- top 10 fingerprints by total query time in the selected range
    SELECT fingerprint
    FROM pmm.metrics
    WHERE service_type = 'mysql'
      AND $__timeFilter(period_start)
      -- an empty variable means "no filter"
      AND ('$queryid' = '' OR queryid = '$queryid')
    GROUP BY fingerprint
    ORDER BY sum(m_query_time_sum) DESC
    LIMIT 10
  )
GROUP BY
  time,
  fingerprint
ORDER BY
  time,
  query_time DESC
Basically the query is fetching start time, query text and average query time for the selected period for the top 10 Queries in that time-frame.
There is a filter for the “queryid” variable which you may use if you want to filter on a specific queryid.
Choose “Time Series” as “Query Type”
Adjust Panel Options
11.1 Choose “Standard options” > “Unit” as “Time / Seconds (s)” from drop down.
11.2 Choose “Standard options” > “Display name” as “${__field.labels.query_text}”
11.3 Click on “Save Dashboard”
Your dashboard should be ready
Now, by default this dashboard is plotting the top 10 queries. If you have a query ID handy, you can filter the panel down to that specific query. That said, this is still plotting queries across all the monitored instances. Let’s move on to add the service_name filter.
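If the nested SQL is hard to picture, the same two-stage logic can be sketched in plain Python over a handful of made-up rows: first pick the top fingerprints by total query time, then compute the average time per execution for each period and fingerprint. The rows, values and function names below are invented for illustration; only the column names mirror pmm.metrics.

```python
from collections import defaultdict

# Made-up rows: (period_start, fingerprint, m_query_time_sum, m_query_time_cnt)
rows = [
    ("10:00", "SELECT * FROM orders WHERE id = ?", 4.0, 8),
    ("10:00", "SELECT * FROM orders WHERE id = ?", 2.0, 2),
    ("10:00", "UPDATE stock SET qty = ? WHERE sku = ?", 9.0, 3),
    ("10:01", "UPDATE stock SET qty = ? WHERE sku = ?", 6.0, 3),
    ("10:01", "SELECT 1", 0.1, 100),
]


def top_fingerprints(rows, limit):
    """Stage 1: fingerprints ranked by total query time (the subquery)."""
    totals = defaultdict(float)
    for _, fingerprint, time_sum, _ in rows:
        totals[fingerprint] += time_sum
    return sorted(totals, key=totals.get, reverse=True)[:limit]


def avg_time_series(rows, limit):
    """Stage 2: average time per execution for each (period, fingerprint)."""
    keep = set(top_fingerprints(rows, limit))
    acc = defaultdict(lambda: [0.0, 0])
    for period, fingerprint, time_sum, count in rows:
        if fingerprint in keep:
            acc[(period, fingerprint)][0] += time_sum
            acc[(period, fingerprint)][1] += count
    # Skip groups with no executions to avoid dividing by zero.
    return {key: s / c for key, (s, c) in sorted(acc.items()) if c}


series = avg_time_series(rows, limit=2)
for (period, fingerprint), avg in series.items():
    print(period, fingerprint[:30], round(avg, 3))
```

Each key in the result corresponds to one point on the time series panel: the period is the x value, the fingerprint is the series, and the average is the y value.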
Adding service_name filter
Add Variable
Create new variable named “service_name”
Use variable type “Query”
Use Data Source as “ClickHouse”
Query:
select distinct service_name from pmm.metrics where service_type = 'mysql';
Unselect all checkboxes in “Selection options”
Save Dashboard
Update Query
SELECT
  period_start AS time,
  left(fingerprint, 80) AS query_text,
  -- average time per execution: total time divided by number of executions
  sum(m_query_time_sum) / sum(m_query_time_cnt) AS query_time
FROM
  pmm.metrics
WHERE
  -- an empty variable means "all services"
  ('$service_name' = '' OR service_name = '$service_name')
  AND service_type = 'mysql'
  AND $__timeFilter(period_start)
  AND fingerprint IN (
    -- top 10 fingerprints by total query time for the selected service
    SELECT fingerprint
    FROM pmm.metrics
    WHERE service_type = 'mysql'
      AND $__timeFilter(period_start)
      AND ('$service_name' = '' OR service_name = '$service_name')
    GROUP BY fingerprint
    ORDER BY sum(m_query_time_sum) DESC
    LIMIT 10
  )
GROUP BY
  time,
  left(fingerprint, 80)
ORDER BY
  time,
  query_time DESC
I know many of you are naturally curious and enjoy experimenting with PMM and Grafana… So you’ve probably already started thinking about how far this can be taken. Feel free to share your ideas or custom dashboards in the comments.
Sample Dashboards:
The Query Analysis and Insights Dashboard
Okay, for those who are looking to have quick results, I’ve prepared the complete Query Analysis and Insights Dashboard for you to import and use instantly.
By importing the JSON file, you’ll get the full working dashboard with all panels preconfigured, including:
Slow Query Analysis
Latency Distribution Heatmap
Query Volatility (P99 vs Average)
Lock Wait Ratio Over Time (Top Contended Queries)
Temporary Table Usage (Disk & Memory)
Query Efficiency (Rows Examined vs Rows Sent)
Error Rate vs Throughput
Workload Distribution by User
Query Volume by Client Host
Execution Time vs Lock Wait Time
This allows you to instantly explore PMM Query Analytics data, adjust time ranges and filters, and correlate query performance, contention, and workload behavior without recreating the dashboard from scratch.