F.32. multimaster
multimaster is a Postgres Pro Enterprise extension with a set of patches that turns Postgres Pro Enterprise into a synchronous shared-nothing cluster to provide Online Transaction Processing (OLTP) scalability for read transactions and high availability with automatic disaster recovery.
As compared to a standard PostgreSQL master-standby cluster, a cluster configured with the multimaster extension offers the following benefits:
Cluster-wide transaction isolation
Synchronous logical replication and DDL replication
Working with temporary tables on each cluster node
Fault tolerance and automatic node recovery
PostgreSQL online upgrades
Important
The multimaster extension is no longer supported in Postgres Pro Enterprise 10. Please consider migrating to the latest Postgres Pro Enterprise major version to ensure stable performance of multimaster.
The multimaster extension replicates your database to all nodes of the cluster and allows write transactions on each node. To ensure data consistency in the case of concurrent updates, multimaster enforces transaction isolation cluster-wide, using multiversion concurrency control (MVCC) at the read committed or repeatable read isolation levels. Any write transaction is synchronously replicated to all nodes, which increases commit latency for the time required for synchronization. Read-only transactions and queries are executed locally, without any measurable overhead.
To ensure high availability and fault tolerance of the cluster, multimaster uses three-phase commit protocol and heartbeats for failure discovery. A multi-master cluster of N nodes can continue working while the majority of the nodes are alive and reachable by other nodes. To be configured with multimaster, the cluster must include at least two nodes. In most cases, three cluster nodes are enough to ensure high availability. Since the data on all cluster nodes is the same, you do not typically need more than five cluster nodes.
When a failed node is reconnected to the cluster, multimaster can automatically fast-forward the node to the actual state based on the Write-Ahead Log (WAL) data in the corresponding replication slot. If WAL data is no longer available for the time when the node was excluded from the cluster, you can restore the node using pg_basebackup.
Note
When using multimaster, make sure to take its replication restrictions into account. For details, see Section F.32.1.
To learn more about the multimaster internals, see Section F.32.2.
F.32.1. Limitations
The multimaster extension takes care of the database replication in a fully automated way. You can perform write transactions on any node and work with temporary tables on each cluster node simultaneously. However, make sure to take the following replication restrictions into account:
Microsoft Windows operating system is not supported.
multimastercan only replicate one database per cluster. This database must be specified in themultimaster.conn_stringsvariable. If you connect to a different database, all operations will fail with the corresponding error message.The replicated tables must have primary keys or replica identity because of logical replication restrictions. Although tables without primary keys can be replicated,
multimasterdoes not allowUPDATEandDELETEoperations on such tables. For details, seemultimaster.ignore_tables_without_pk. Unlogged tables are not replicated, as in the standard PostgreSQL.Isolation level. The
multimasterextension supportsread committedandrepeatable readisolation levels.Serializableisolation level is currently not supported.Important
Using
repeatable readisolation level increases the probability of serialization failure at commit time. Unlike in the standard PostgreSQL,read committedlevel can also cause serialization failures on a multi-master cluster (with an SQLSTATE code '40001').When performing a write transaction,
multimasterblocks the affected objects only on the node on which the transaction is performed. However, since write transactions are allowed on all nodes, other transactions can try to change the same objects on the neighbor nodes at the same time. In this case, the replication of the first transaction can fail because the affected objects on the neighbor nodes are already blocked by another transaction. Similarly, the latter transaction cannot be replicated to the first node. In this case, a distributed deadlock occurs. As a result, one of the transactions is automatically rolled back and needs to be repeated. The application must be ready to retry transactions.If your typical workload has too many rollbacks, it is recommended to use
read committedisolation level. However, theread committedstill does not guarantee the absence of deadlocks on a multi-master cluster. If using theread committedlevel does not help, you can try directing all the write transactions to a single node.In a multi-master cluster, the
ALTER SYSTEMcommand only affects the configuration of the current node. If you would like to change configuration parameters across the whole database cluster, you need to run this command on each node.Sequence generation. To avoid conflicts between unique identifiers on different nodes,
multimastermodifies the default behavior of sequence generators. By default, ID generation on each node is started with this node number and is incremented by the number of nodes. For example, in a three-node cluster, 1, 4, and 7 IDs are allocated to the objects written onto the first node, while 2, 5, and 8 IDs are reserved for the second node. If you change the number of nodes in the cluster, the incrementation interval for new IDs is adjusted accordingly. Thus, the generated sequence values are not monotonic. If it is critical to get a monotonically increasing sequence cluster-wide, you can set themultimaster.monotonic_sequencestotrue.DDL replication. While
multimasterreplicates data on the logical level, DDL is replicated on the statement level, which results in distributed commits of the same statement on different nodes. As a result, complex DDL scenarios, such as stored procedures and temporary tables, may work differently as compared to the standard PostgreSQL.Commit latency. In the current implementation of logical replication,
multimastersends data to subscriber nodes only after the local commit, so you have to wait for transaction processing twice: first on the local node, and then on all the other nodes simultaneously. In the case of a heavy-write transaction, this may result in a noticeable delay.
If you have any data that must be present on one of the nodes only, you can exclude a particular table from replication, as follows:
SELECT mtm.make_table_local('table_name') F.32.2. Architecture
F.32.2.1. Replication
Since each server in a multi-master cluster can accept writes, any server can abort a transaction because of a concurrent update — in the same way as it happens on a single server between different backends. To ensure high availability and data consistency on all cluster nodes, multimaster uses logical replication and the three-phase E3PC commit protocol.
When Postgres Pro Enterprise loads the multimaster shared library, multimaster sets up a logical replication producer and consumer for each node, and hooks into the transaction commit pipeline. The typical data replication workflow consists of the following phases:
PREPAREphase.multimastercaptures and implicitly transforms eachCOMMITstatement to aPREPAREstatement. All the nodes that get the transaction via the replication protocol (the cohort nodes) send their vote for approving or declining the transaction to the arbiter process on the initiating node. This ensures that all the cohort can accept the transaction, and no write conflicts occur. For details onPREPAREtransactions support in PostgreSQL, see the PREPARE TRANSACTION topic.PRECOMMITphase. If all the cohort nodes approve the transaction, the arbiter process sends aPRECOMMITmessage to all the cohort nodes to express an intention to commit the transaction. The cohort nodes respond to the arbiter with thePRECOMMITTEDmessage. In case of a failure, all the nodes can use this information to complete the transaction using a quorum-based voting procedure.COMMITphase. IfPRECOMMITis successful, the arbiter commits the transaction to all nodes.
Important
multimaster currently supports the read committed and repeatable read isolation levels only, which can cause unexpected serialization failures in your workload. For details, see Section F.32.1.
If a node crashes or gets disconnected from the cluster between the PREPARE and COMMIT phases, the PRECOMMIT phase ensures that the survived nodes have enough information to complete the prepared transaction. The PRECOMMITTED messages help avoid the situation when the crashed node has already committed or aborted the transaction, but has not notified other nodes about the transaction status. In a two-phase commit (2PC), such a transaction would block resources (hold locks) until the recovery of the crashed node. Otherwise, you could get data inconsistencies in the database when the failed node is recovered. For example, if the failed node committed the transaction but the survived node aborted it.
To complete the transaction, the arbiter must receive a response from the majority of the nodes. For example, for a cluster of 2N+1 nodes, at least N+1 responses are required. Thus, multimaster ensures that your cluster is available for reads and writes while the majority of the nodes are connected, and no data inconsistencies occur in case of a node or connection failure. For details on the failure detection mechanism, see Section F.32.2.2.
F.32.2.2. Failure Detection and Recovery
Since multimaster allows writes to each node, it has to wait for responses about transaction acknowledgment from all the other nodes. Without special actions in case of a node failure, each commit would have to wait until the failed node recovery. To deal with such situations, multimaster periodically sends heartbeats to check the node state and the connectivity between nodes. When several heartbeats to the node are lost in a row, this node is kicked out of the cluster to allow writes to the remaining alive nodes. You can configure the heartbeat frequency and the response timeout in the multimaster.heartbeat_send_timeout and multimaster.heartbeat_recv_timeout parameters, respectively.
For alive nodes, there is no way to distinguish between a failed node that stopped serving requests and a network-partitioned node that can be accessed by database users, but is unreachable for other nodes. To avoid conflicting writes to nodes in different network partitions, multimaster only allows writes to the nodes that see the majority of other nodes.
For example, suppose a five-node multi-master cluster experienced a network failure that split the network into two isolated subnets, with two and three cluster nodes. Based on heartbeats propagation information, multimaster will continue accepting writes at each node in the bigger partition, and deny all writes in the smaller one. Thus, a cluster consisting of 2N+1 nodes can tolerate N node failures and stay alive if any N+1 nodes are alive and connected to each other.
Tip
For clusters with an even number of nodes, you can override this behavior. For details, see Section F.32.3.3.
In case of a partial network split when different nodes have different connectivity, multimaster finds a fully connected subset of nodes and disconnects other nodes. For example, in a three-node cluster, if node A can access both B and C, but node B cannot access node C, multimaster isolates node C to ensure data consistency on nodes A and B.
If you try to access a disconnected node, multimaster returns an error message indicating the current status of the node. To prevent stale reads, read-only queries are also forbidden. Additionally, you can break connections between the disconnected node and the clients using the multimaster.break_connection variable.
Each node maintains a data structure that keeps the information about the state of all nodes in relation to this node. You can get this data in the mtm.get_nodes_state() view.
When a failed node connects back to the cluster, multimaster starts automatic recovery:
The reconnected node selects a random cluster node and starts catching up with the current state of the cluster based on the Write-Ahead Log (WAL).
When the node gets synchronized up to the minimum recovery lag, all the cluster nodes get locked for write transactions to allow the recovery process to finish. By default, the minimum recovery lag is 10KB. You can change this value in the
multimaster.min_recovery_lagvariable.When the recovery is complete,
multimasterpromotes the reconnected node to the online state and includes it into the replication scheme.
Note
Automatic recovery is only possible if the failed node WAL lag behind the working ones does not exceed the multimaster.max_recovery_lag value. When the WAL lag is bigger than multimaster.max_recovery_lag, you can manually restore the node from one of the working nodes using pg_basebackup.
See Also
F.32.3. Installation and Setup
To use multimaster, you need to install Postgres Pro Enterprise on all nodes of your cluster. Postgres Pro Enterprise includes all the required dependencies and extensions.
F.32.3.1. Setting up a Multi-Master Cluster
After installing Postgres Pro Enterprise on all nodes, you need to configure the cluster with multimaster.
Suppose you are setting up a cluster of three nodes, with node1, node2, and node3 domain names. First, set up the database to be replicated, and make sure you have a DBMS user with superuser rights to perform replication:
If you are starting from scratch, initialize a cluster, create an empty database
mydband a new DBMS usermyuserwith superuser rights, on each node of the cluster. For details, see Section 18.2.If you already have a database
mydbrunning onnode1, create a new DBMS usermyuserwith superuser rights and initialize new nodes from the working node using pg_basebackup on behalf of this user. Run the following command on each node you are going to add:pg_basebackup -D
datadir-h node1 -U myuser -c fastwhere
datadiris the directory containing the database cluster. This directory is specified at the cluster initialization stage, or set in thePGDATAenvironment variable. You can also use any other DBMS user with superuser rights instead ofmyuserto perform this task.For details on using pg_basebackup, see pg_basebackup.
Once the database is set up, complete the following steps on each cluster node:
Allow replication of the
mydbdatabase to each cluster node on behalf ofmyuser, as explained in Section 20.1. Make sure to use the authentication method that satisfies your security requirements.Modify the
postgresql.confconfiguration file, as follows:Add
multimasterto theshared_preload_librariesvariable:shared_preload_libraries = 'multimaster'
Tip
If the
shared_preload_librariesvariable is already defined inpostgresql.auto.conf, you will need to modify its value using the ALTER SYSTEM command. For details, see Section 19.1.2. Note that in a multi-master cluster, theALTER SYSTEMcommand only affects the configuration of the node from which it was run.Specify the transaction isolation level for your cluster.
multimastercurrently supports read committed and repeatable read isolation levels.default_transaction_isolation = 'read committed'
Important
Using
repeatable readisolation level increases the probability of serialization failure at commit time. If such cases are not handled by your application, you are recommended to useread committedisolation level.Set up PostgreSQL parameters related to replication.
wal_level = logical max_connections = 100 max_prepared_transactions = 300 max_wal_senders = 10 # at least the number of nodes max_replication_slots = 10 # at least the number of nodes
You must change the replication level to
logicalasmultimasterrelies on logical replication. For a cluster ofNnodes, enable at leastNWAL sender processes and replication slots. Sincemultimasterimplicitly adds aPREPAREphase to eachCOMMITtransaction, make sure to set the number of prepared transactions toN*max_connections. Otherwise, prepared transactions may be queued.Make sure you have enough background workers allocated for each node:
max_worker_processes = 250
For example, for a three-node cluster with
max_connections= 100,multimastermay need up to 206 background workers at peak times: 200 workers for connections from the neighbor nodes, two workers for WAL sender processes, two workers for WAL receiver processes, and two workers for the arbiter sender and receiver processes. When setting this parameter, remember that other modules may also use background workers at the same time.Add
multimaster-specific options:multimaster.max_nodes = 3 # cluster size multimaster.node_id = 1 # the 1-based index of this node # in the cluster multimaster.conn_strings = 'dbname=mydb user=myuser host=node1 port=5432 arbiter_port=5433,dbname=mydb user=myuser host=node2 port=5432 arbiter_port=5433,dbname=mydb user=myuser host=node3 port=5432 arbiter_port=5433' # comma-separated list # of connection strings # to neighbor nodes multimaster.arbiter_port = 5433The
multimaster.max_nodesvariable defines the maximum cluster size. If you plan to add new nodes to your cluster, themultimaster.max_nodesvalue should exceed the initial number of nodes. In this case, you can add new nodes without restarting Postgres Pro Enterprise until the specified number of nodes is reached.In most cases, three cluster nodes are enough to ensure high availability. Since the data on all cluster nodes is the same, you do not typically need more than five cluster nodes.
Important
The
multimaster.node_idvariable takes natural numbers starting from 1, without any gaps in numbering. For example, for a cluster of five nodes, set node IDs to 1, 2, 3, 4, and 5. In themultimaster.conn_stringsvariable, make sure to list the nodes in the order of their IDs. Themultimaster.conn_stringsvariable must be the same on all nodes.You can customize connection parameters by adding other libpq connection options to connection strings in the
multimaster.conn_stringsvariable. By default,multimasteruses port 5432 to establish connections between the nodes and port 5433 for the arbiter process to listen for connections. If these ports are already in use, you must specify other ports inportandarbiter_portoptions in each connection string in themultimaster.conn_stringsvariable.If you change the
arbiter_portoption, you must also specify this port in themultimaster.arbiter_portvariable. For details, seemultimaster.arbiter_portandmultimaster.conn_strings.To check whether the default ports are available, you can run the following command:
netstat -ln | grep -E '5432|5433'
Make sure these ports are not blocked by firewall.
Depending on your network environment and usage patterns, you may want to tune other
multimasterparameters. For details, see Section F.32.3.2.On behalf of an OS superuser, restart Postgres Pro Enterprise:
pg_ctl -D
datadir-lpg.logrestart
When Postgres Pro Enterprise is started on all nodes, connect to any node on behalf of the postgres OS user and create the multimaster extension in the mydb database that will be replicated:
psql -h node1 -d mydb CREATE EXTENSION multimaster;
The CREATE EXTENSION query is replicated to all the cluster nodes.
To ensure that multimaster is enabled, check the mtm.get_cluster_state() view:
SELECT mtm.get_cluster_state();
If liveNodes is equal to allNodes, your cluster is successfully configured and ready to use.
See Also
Tuning Configuration Parameters
F.32.3.1.1. Sample Cluster Configuration
This section shows a sample configuration of a three-node cluster with host names node1, node2 and node3. The name of the database to be replicated is mydb. The name of DBMS user with superuser rights to perform database replication is myuser.
First, run the following commands on behalf of OS user
postgres, on all cluster nodes:# Configure each node, and then create the myuser user # and mydb database owned by myuser. export PATH=/opt/pgpro/ent-10/bin:$PATH export PGDATA=/var/lib/pgpro/ent-10/data for i in `seq 1 3`; do echo "host replication myuser node$i md5" >> $PGDATA/pg_hba.conf echo "host mydb myuser node$i md5" >> $PGDATA/pg_hba.conf echo "node$i:5432:mydb:myuser:myuserpassword" >> ~/.pgpass done chmod 0600 ~/.pgpass cat << EOF | psql --dbname=postgres --username=postgres --port=5432 ALTER SYSTEM SET default_transaction_isolation = 'read committed'; ALTER SYSTEM SET wal_level = logical; ALTER SYSTEM SET max_connections = 100; ALTER SYSTEM SET max_prepared_transactions = 300; ALTER SYSTEM SET max_wal_senders = 10; ALTER SYSTEM SET max_replication_slots = 10; ALTER SYSTEM SET max_worker_processes = 250; ALTER SYSTEM SET shared_preload_libraries = 'multimaster'; CREATE USER myuser WITH SUPERUSER PASSWORD 'myuserpassword'; CREATE DATABASE mydb OWNER myuser; EOF # Define the number of nodes in the cluster, assign node IDs, # and connection strings to the nodes. This example assumes # that node hostnames are node1, node2, and node3, and their # IDs are the corresponding numbers starting with 1. cat << EOF >> $PGDATA/postgresql.conf multimaster.max_nodes = 3 multimaster.node_id = `hostname | awk '{ print substr($1,5,1) }'` multimaster.arbiter_port = 5433 multimaster.conn_strings = 'dbname=mydb user=myuser host=node1 port=5432 arbiter_port=5433,dbname=mydb user=myuser host=node2 port=5432 arbiter_port=5433,dbname=mydb user=myuser host=node3 port=5432 arbiter_port=5433' EOFOn behalf of an OS user with superuser privileges, restart Postgres Pro Enterprise service on all cluster nodes:
sudo service postgrespro-ent-10 restart
Now create the
multimasterextension on one of the cluster nodes, on behalf of OS userpostgres. It will be replicated to all the other nodes automatically. The following example illustrates extension creation on node1.psql --dbname=mydb --username=myuser --host=node1 --port=5432 -c "CREATE EXTENSION IF NOT EXISTS multimaster"
The cluster is set up and ready to use.
F.32.3.2. Tuning Configuration Parameters
While you can use multimaster in the default configuration, you may want to tune several parameters for faster failure detection or more reliable automatic recovery.
F.32.3.2.1. Setting Timeout for Failure Detection
To check availability of the neighbor nodes, multimaster periodically sends heartbeat packets to all nodes. You can define the timeout for failure detection with the following variables:
The
multimaster.heartbeat_send_timeoutvariable defines the time interval between the heartbeats. By default, this variable is set to 200ms.The
multimaster.heartbeat_recv_timeoutvariable sets the timeout for the response. If no heartbeats are received during this time, the node is assumed to be disconnected and is excluded from the cluster. By default, this variable is set to 1000ms.
It's a good idea to set multimaster.heartbeat_send_timeout based on typical ping latencies between the nodes. Small recv/send ratio decreases the time of failure detection, but increases the probability of false-positive failure detection. When setting this parameter, take into account the typical packet loss ratio between your cluster nodes.
F.32.3.2.2. Configuring Automatic Recovery Parameters
If a cluster node fails, multimaster can automatically restore it based on the WAL collected on other cluster nodes. To control the recovery settings, use the following variables:
multimaster.min_recovery_lag— sets the minimal WAL lag between the node to be restored and the current cluster state. By default,multimaster.min_recovery_lagis set to 10KB. When the disconnected node is fast-forwarded up to themultimaster.min_recovery_lagthreshold,multimasterstops all new commits to the alive nodes until the node fully catches up with the current state of the cluster. When the data is fully synchronized, the disconnected node is promoted to the online state, and the cluster resumes its work.multimaster.max_recovery_lag— sets the maximum size of WAL. Upon reaching themultimaster.max_recovery_lagthreshold, WAL for the disconnected node is overwritten. At this point, automatic recovery is no longer possible. In this case, you can restore the node manually by cloning the data from one of the alive nodes using pg_basebackup.
By default, multimaster.max_recovery_lag is set to 1GB. Setting multimaster.max_recovery_lag to a larger value increases the timeframe for automatic recovery, but requires more disk space for WAL collection.
See Also
F.32.3.3. Defining Quorum Settings for Clusters with an Even Number of Nodes
By default, multimaster uses a majority-based algorithm to determine whether the cluster nodes have a quorum: a cluster can only continue working if the majority of its nodes are alive and can access each other. For clusters with an even number of nodes, this approach is not optimal. For example, if a network failure splits the cluster into equal parts, or one of the nodes fails in a two-node cluster, all the nodes stop accepting queries, even though at least half of the cluster nodes are running normally.
To enable a smooth failover for such cases, you can modify the multimaster majority-based behavior using one of the following options:
Set up a standalone referee node to assign the quorum status to a subset of nodes that constitutes half of the cluster.
Choose the major node that continues working regardless of the status of other nodes. Use this option in two-node cluster configurations only.
Important
To avoid split-brain problems, do not use the major node together with a referee in the same cluster.
F.32.3.3.1. Setting up a Standalone Referee Node
A referee is a voting node used to determine which subset of nodes has a quorum if the cluster is split into equal parts. The referee node does not store any cluster data, so it is not resource-intensive and can be configured on virtually any system with Postgres Pro Enterprise installed.
To set up a referee for your cluster:
Install Postgres Pro Enterprise on the node you are going to make a referee and create the
refereeextension:CREATE EXTENSION referee;
Make sure the
pg_hba.conffile allows access to the referee node.On all your cluster nodes, specify the referee connection string in the
postgresql.conffile:multimaster.referee_connstring =
connstringwhere
connstringholds libpq options required to access the referee.
The first subset of nodes that gets connected to the referee wins the voting and continues working. The referee keeps the voting result until all the other cluster nodes get online again. Then the result is discarded, and a new winner can be chosen in case of another network failure.
To avoid split-brain problems, you can only have a single referee in your cluster. Do not set up a referee if you have already configured the major node.
F.32.3.3.2. Configuring the Major Node
If you configure one of the nodes to be the major one, this node will continue accepting queries even if it is isolated by a network failure, or other nodes get broken. This setting is useful in a two-node cluster configuration, or to quickly restore a single node in a broken cluster.
To make one of the nodes major, enable the multimaster.major_node parameter on this node:
ALTER SYSTEM SET multimaster.major_node TO on; SELECT pg_reload_conf();
Do not set the major_node parameter on more than one cluster node. When enabled on several nodes, it can cause the split-brain problem. If you have already set up a referee for your cluster, the major_node option is forbidden.
Important
If your cluster has more than two nodes, do not use the major node setting, as it can lead to split-brain problems in case of network failures, and reduce the number of possible failover options. If the major node gets isolated by a network failure while the majority of nodes are connected to each other, both the major node and the subset that has a quorum will continue working. Consider setting up a standalone referee instead.
F.32.4. Multi-Master Cluster Administration
F.32.4.1. Monitoring Cluster Status
multimaster provides several views to check the current cluster state.
To check node-specific information, use mtm.get_nodes_state():
SELECT mtm.get_nodes_state();
To check the status of the whole cluster, use the mtm.get_cluster_state() view:
SELECT mtm.get_cluster_state();
For details on all the returned information, see Section F.32.5.2.
F.32.4.2. Adding New Nodes to the Cluster
With the multimaster extension, you can add or drop cluster nodes without stopping the database service.
To add a new node, you need to change the cluster configuration on alive nodes, load all the data to the new node using pg_basebackup, and start the node.
Suppose we have a working cluster of three nodes, with node1, node2, and node3 domain names. To add node4, follow these steps:
Check whether the current number of cluster nodes has reached the value specified in the
multimaster.max_nodesvariable. If this value is reached, increase themultimaster.max_nodesvalue on each node and restart all nodes. You can restart the nodes one by one, without stopping the database. If the maximum number of nodes is not reached, proceed to the next step.Figure out the required connection string to access the new node. For example, for the database
mydb, usermyuser, and the new nodenode4, the connection string can be"dbname=mydb user=myuser host=node4". For details, seemultimaster.conn_strings.In
psqlconnected to any alive node, run:SELECT mtm.add_node('dbname=mydb user=myuser host=node4');This command changes the cluster configuration on all nodes and starts replication slots for the new node.
Connect to the new node and clone all the data from one of the alive nodes to the new node:
pg_basebackup -D
datadir-h node1 -U myuser -c fastpg_basebackup copies the entire data directory from
node1, together with configuration settings.Update
postgresql.confsettings onnode4:multimaster.node_id = 4 multimaster.conn_strings = 'dbname=mydb user=myuser host=node1, dbname=mydb user=myuser host=node2, dbname=mydb user=myuser host=node3, dbname=mydb user=myuser host=node4'Start Postgres Pro on the new node:
pg_ctl -D
datadir-lpg.logstartWhen the node gets synchronized up to the minimum recovery lag, all the cluster nodes get locked for write transactions until the new node retrieves all the updates. When data recovery is complete,
multimasterpromotes the new node to the online state and includes it into the replication scheme.
To ensure that the new configuration is loaded in the case of PostgreSQL restart, update configuration settings on all the cluster nodes:
Change
multimaster.conn_stringsto include the new node.Make sure the
pg_hba.conffile allows replication to the new node.
See Also
F.32.4.3. Removing Nodes from the Cluster
multimaster provides the mtm.stop_node() function that can temporarily or permanently remove nodes from the cluster.
To temporarily exclude a node from the cluster, run the mtm.stop_node() function specifying the node ID. For example, to exclude node 3, run the following command on any other node of the cluster:
SELECT mtm.stop_node(3);
This command excludes node 3 from the cluster and stops replication to this node.
While the WAL lag between the node and the current cluster state is less than the multimaster.max_recovery_lag value, you can restore the node by running the following command:
SELECT mtm.recover_node(3);
Otherwise, follow the procedure described in Section F.32.4.4.
Note
If you simply shut down a node, it will be excluded from the cluster as well. However, all transactions in the cluster will be frozen until other nodes detect the offline state of the node. This time interval is defined by the multimaster.heartbeat_recv_timeout parameter.
To permanently drop the node from the cluster:
Run the
mtm.stop_node()function with thedrop_slotparameter set totrue:SELECT mtm.stop_node(3, true);
This disables replication slots for node 3 on all cluster nodes and stops replication to this node.
Adjust
multimaster.node_idandmultimaster.conn_stringssettings inpostgresql.confon the remaining cluster nodes to reflect the new state of the cluster.Edit the
pg_hba.conffile on the remaining cluster nodes to disable replication to the removed node, if required.
If you would like to return the node to the cluster later, you will have to add it as a new node, as explained in Section F.32.4.2.
F.32.4.4. Restoring a Cluster Node Manually
The multimaster extension can automatically restore a failed node if the WAL is available for the time when the node was disconnected from the cluster. However, if the data updates on the alive nodes exceed the allowed WAL size specified in the multimaster.max_recovery_lag variable, automatic recovery is impossible. In this case, you can manually restore the failed node.
Suppose node2 got disconnected from your three-node cluster and needs to be manually restored. The typical workflow is as follows:
In
psqlconnected to any alive node, create a new replication slot for the disconnected node with the following command:SELECT mtm.recover_node(2);
where 2 is the ID of the disconnected node specified in the
multimaster.node_idvariable.Connect to
node2and clone all the data from one of the alive nodes:pg_basebackup -D
datadir-h node1 -U myuser -c fastpg_basebackup copies the entire data directory from
node1, together with configuration settings.On the restored node, update the
multimaster.node_idsetting to the value this node used to have before the failure.Make sure replication is enabled between the restored node and the rest of the cluster.
Start Postgres Pro on the restored node:
pg_ctl -D
datadir-lpg.logstartWhen the node gets synchronized up to the minimum recovery lag, all the cluster nodes get locked for write transactions until the restored node retrieves all the updates. When data recovery is complete,
multimasterpromotes the new node to the online state and includes it into the replication scheme.
See Also
F.32.5. Reference
F.32.5.1. GUC Variables
multimaster.node_idNode ID — a unique natural number identifying the node of a multi-master cluster. You must start node numbering from 1. There must be no gaps in numbering. For example, for a cluster of five nodes, set node IDs to 1, 2, 3, 4, and 5.
multimaster.conn_stringsConnection strings for each node of a multi-master cluster, separated by commas. The
multimaster.conn_stringsparameter must be identical on all nodes. Each connection string must include the name of the database to replicate and the cluster node domain name. For example, 'dbname=mydb host=node1,dbname=mydb host=node2,dbname=mydb host=node3'. Optionally, you can add other connection parameters to change the default connection settings. Connection strings must appear in the order of the node IDs specified in themultimaster.node_idvariable. Connection string for the i-th node must be on the i-th position. If you specify a custom port in themultimaster.arbiter_port, you must provide this value in thearbiter_portparameter in the connection string for the corresponding node.multimaster.max_nodesThe maximum number of nodes allowed in the cluster. If you plan to add new nodes to your cluster, the
multimaster.max_nodesvalue should exceed the initial number of nodes. In this case, you can add new nodes without restarting Postgres Pro Enterprise until the specified number of nodes is reached. In most cases, three cluster nodes are enough to ensure high availability. Since the data on all cluster nodes is the same, you do not typically need more than five cluster nodes. The maximum possible number of nodes is limited to 64.Default: the number of nodes specified in the
multimaster.conn_stringsvariablemultimaster.arbiter_portPort for the arbiter process to listen on. If you change the default value, you must specify this value in the
arbiter_portparameter in the connection string for the corresponding node.Default: 5433
multimaster.heartbeat_send_timeoutTime interval between heartbeat messages, in milliseconds. An arbiter process broadcasts heartbeat messages to all nodes to detect connection problems.
Default: 200
multimaster.heartbeat_recv_timeoutTimeout, in milliseconds. If no heartbeat message is received from the node within this timeframe, the node is excluded from the cluster.
Default: 1000
multimaster.min_recovery_lagMinimal WAL lag between the node to be restored and the current cluster state, in kB. When this threshold is reached during node recovery, the cluster is locked for write transactions until the recovery is complete.
Default: 10KB
multimaster.max_recovery_lagMaximal WAL lag size, in kB. When a node is disconnected from the cluster, other nodes copy WAL data for all new transactions into the replication slot of this node. Upon reaching the
multimaster.max_recovery_lagvalue, the replication slot for the disconnected node is dropped to avoid overflow. At this point, automatic recovery of the node is no longer possible. In this case, you can restore the node manually by cloning the data from one of the alive nodes using pg_basebackup or a similar tool. If you set this variable to zero, replication slot will not be dropped.Default: 1GB
multimaster.ignore_tables_without_pkBoolean. This variable enables/disables replication of
INSERToperations for tables without primary keys. By default, such replication is enabled. Setting this parameter toondisables replication ofINSERToperations for these tables. Regardless of this setting, DDL operations on tables without primary keys are always replicated, whileUPDATEandDELETEoperations are not replicated because of the logical replication restrictions.Default:
falsemultimaster.cluster_nameName of the cluster. If you define this variable when setting up the cluster,
multimasterchecks that the cluster name is the same for all the cluster nodes.multimaster.break_connectionBreak connection with clients connected to the node if this node disconnects from the cluster. If this variable is set to
false, the client stays connected to the node but receives an error that the node is in minority.Default:
falsemultimaster.major_nodeThe node with this flag continues working even if there is no quorum. This may be required to break the symmetry in two-node clusters, or to quickly restore a single node in a broken cluster.
Important
This parameter should be used with caution to avoid split-brain problems:
Do not use the
major_nodeparameter on clusters with more than two nodes. If the major node gets isolated by a network failure while the majority of nodes are connected to each other, both the major node and the subset that has a quorum will continue working. Consider setting up a standalone referee instead.Never set more than one major node in the cluster.
multimaster.referee_connstringConnection string to access the referee node. You must set this parameter on all cluster nodes if the referee is set up.
multimaster.max_workersThe maximum number of
walreceiverworkers on this server.Important
This parameter should be used with caution. If the number of simultaneous transactions in the whole cluster is bigger than the provided value, it can lead to undetected deadlocks.
Default: 100
multimaster.trans_spill_thresholdThe maximal size of transaction, in kB. When this threshold is reached, the transaction is written to the disk.
Default: 100MB
multimaster.monotonic_sequencesDefines the sequence generation mode for unique identifiers. This variable can take the following values:
false(default) — ID generation on each node is started with this node number and is incremented by the number of nodes. For example, in a three-node cluster, 1, 4, and 7 IDs are allocated to the objects written onto the first node, while 2, 5, and 8 IDs are reserved for the second node. If you change the number of nodes in the cluster, the incrementation interval for new IDs is adjusted accordingly.true— the generated sequence increases monotonically cluster-wide. ID generation on each node is started with this node number and is incremented by the number of nodes, but the values are omitted if they are smaller than the already generated IDs on another node. For example, in a three-node cluster, if 1, 4 and 7 IDs are already allocated to the objects on the first node, 2 and 5 IDs will be omitted on the second node. In this case, the first ID on the second node is 8. Thus, the next generated ID is always higher than the previous one, regardless of the cluster node.
Default:
falsemultimaster.remote_functionsProvides a comma-separated list of function names that should be executed remotely on all multimaster nodes instead of replicating the result of their work.
multimaster.use_rdmaEnables node-to-node connections using Remote Direct Memory Access (RDMA) technology. You must have an RDMA implementation set up on all the cluster nodes. For details, see Section 18.11.
F.32.5.2. Functions
-
mtm.get_nodes_state() Shows the status of all nodes in the cluster. Returns a tuple of the following values:
id,integerNode ID.
enabled,booleanShows whether the node is excluded from the cluster. The node can only be disabled if responses to heartbeats are not received within the
heartbeat_recv_timeouttime interval. When the node starts responding to heartbeats,multimastercan automatically restore the node and switch it back to the enabled state. Automatic recovery is only possible if the replication slot is still active. Otherwise, you can restore the node manually.connected,booleanShows whether the node is connected to the WAL sender.
slot_active,booleanShows whether the node has an active replication slot. For a disabled node, the slot remains active until the
max_recovery_lagvalue is reached.stopped,booleanShows whether replication to this node was stopped by the
mtm.stop_node()function. A stopped node acts as a disabled one, but cannot be automatically recovered. Callmtm.recover_node()to re-enable such a node.catchUp,booleanDuring the node recovery, shows whether the data is recovered up to the
min_recovery_lagvalue.slotLag,bigintThe size of WAL data that the replication slot holds for a disabled/stopped node. The slot is dropped when
slotLagreaches themax_recovery_lagvalue.avgTransDelay,bigintAn average commit delay caused by this node, in microseconds.
lastStatusChange,timestampLast time when the node changed its status (enabled/disabled).
oldestSnapshot,bigintThe oldest global snapshot existing on this node.
SenderPid,integerProcess ID of the WAL sender.
SenderStartTime,timestampWAL sender start time.
ReceiverPid,integerProcess ID of the WAL receiver.
ReceiverStartTime,timestampWAL receiver start time.
connStr,textConnection string to this node.
connectivityMask,bigintBitmask representing connectivity to neighbor nodes. Each bit represents a connection to node.
nHeartbeats,integerThe number of heartbeat responses received from this node.
-
mtm.collect_cluster_info() Collects the data returned by the
mtm.get_cluster_state()function from all available nodes. For this function to work, in addition to replication connections,pg_hba.confmust allow ordinary connections to the node with the specified connection string.-
mtm.get_cluster_state() Shows the status of the
multimasterextension. Returns a tuple of the following values:id,integerNode ID.
status,textNode status. Possible values are:
Initialization,Offline,Connected,Online,Recovery,Recovered,InMinor,OutOfService.disabledNodeMask,bigintBitmask of disabled nodes.
disconnectedNodeMask,bigintBitmask of disconnected nodes.
catchUpNodeMask,bigintBitmask of nodes that completed the recovery.
liveNodes,integerNumber of enabled nodes.
allNodes,integerNumber of nodes in the cluster. The majority of alive nodes is calculated based on this parameter.
nActiveQueries,integerNumber of queries being currently processed on this node.
nPendingQueries,integerNumber of queries waiting for execution on this node.
queueSize,bigintSize of the pending query queue, in bytes.
transCount,bigintThe total number of replicated transactions processed by this node.
timeShift,bigintGlobal snapshot shift caused by unsynchronized clocks on nodes, in microseconds.
recoverySlot,integerThe node from which a failed node gets data updates during automatic recovery.
xidHashSize,bigintSize of xid2state hash.
gidHashSize,bigintSize of gid2state hash.
oldestXid,bigintThe oldest transaction ID on this node.
configChanges,integerNumber of state changes (enabled/disabled) since the last reboot.
stalledNodeMask,bigintBitmask of nodes for which replication slots were dropped.
stoppedNodeMask,bigintBitmask of nodes that were stopped by
mtm.stop_node().lastStatusChange,timestampTimestamp of the last state change.
-
mtm.add_node(connstrtext) Adds a new node to the cluster.
Arguments:
connstr— connection string for the new node. For example, for the databasemydb, usermyuser, and the new nodenode4, the connection string is"dbname=mydb user=myuser host=node4".Type:
text
-
mtm.alter_sequences() Fixes unique identifiers on all cluster nodes. This may be required after restoring all nodes from a single base backup.
-
mtm.stop_node(nodeinteger,drop_slotbooldefault false) Excludes a node from the cluster.
Arguments:
node— ID of the node to be dropped that you specified in themultimaster.node_idvariable.Type:
integerdrop_slot— Optional. Defines whether the replication slot should be dropped together with the node. Set this option totrueif you do not plan to restore the node in the future.Type:
booleanDefault:
false
-
mtm.recover_node(nodeinteger) Creates a replication slot for the node that was previously dropped together with its slot.
Arguments:
node— ID of the node to be restored.
-
mtm.make_table_local('table_name'regclass) Stops replication for the specified table.
Arguments:
table_name— the table you would like to exclude from the replication scheme.Type:
regclass
-
mtm.copy_table(table_nameregclass,node_idinteger) Copies the specified table to another node. You can use this function to restore the corrupted data on one or more nodes of the cluster. This function must be called on the node from which the table is copied.
Arguments:
table_name— the table you would like to copy.Type:
regclassnode_id— the node to copy the table to.Type:
integer
-
mtm.broadcast_table(table_nameregclass) Copies the specified table to all the alive nodes of the cluster. You can use this function to restore the corrupted data. This function must be called on the node from which the table is copied.
Arguments:
table_name— the table you would like to copy.Type:
regclass
F.32.6. Compatibility
The multimaster extension currently passes almost all PostgreSQL regression tests, except for a few tests for edge cases related to working with temporary tables and updating enum, which are not always transactional in PostgreSQL. We are working right now on providing full compatibility with the standard PostgreSQL.
F.32.7. Authors
Postgres Professional, Moscow, Russia.
F.32.7.1. Credits
The replication mechanism is based on logical decoding and an earlier version of the pglogical extension provided for community by the 2ndQuadrant team.
The three-phase E3PC commit protocol is based on the following works:
Idit Keidar, Danny Dolev. Increasing the Resilience of Distributed and Replicated Database Systems.
Tim Kempster, Colin Stirling, Peter Thanisch. A More Committed Quorum-Based Three Phase Commit Protocol.