Fix self-deadlock during DROP SUBSCRIPTION.
author Amit Kapila <akapila@postgresql.org>
Tue, 19 Aug 2025 04:54:19 +0000 (04:54 +0000)
committer Amit Kapila <akapila@postgresql.org>
Tue, 19 Aug 2025 04:54:19 +0000 (04:54 +0000)

The DROP SUBSCRIPTION command performs several operations: it stops the
subscription workers, removes subscription-related entries from system
catalogs, and deletes the replication slot on the publisher server.
Previously, this command acquired an AccessExclusiveLock on
pg_subscription before initiating these steps.

However, while holding this lock, the command attempts to connect to the
publisher to remove the replication slot. In cases where the connection is
made to a newly created database on the same server as the subscriber, the
cache-building process in the backend serving that connection tries to
acquire an AccessShareLock on pg_subscription. That request blocks behind
the AccessExclusiveLock held by DROP SUBSCRIPTION, which is itself waiting
for the connection to complete, resulting in a self-deadlock.

To resolve this issue, we reduce the lock level on pg_subscription during
DROP SUBSCRIPTION from AccessExclusiveLock to RowExclusiveLock. Earlier,
the higher lock level was used to prevent the launcher from starting a new
worker during the drop operation, as a restarted worker could become
orphaned.
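
To illustrate why the lower lock level removes the conflict, the following
standalone C sketch models a small slice of PostgreSQL's table-level lock
conflict rules. It is not backend code: the Model* names are hypothetical,
only the three lock modes mentioned in this commit are modelled, and the
table describes conflicts between two different sessions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; only the three modes this commit talks about. */
typedef enum
{
    MODEL_ACCESS_SHARE,
    MODEL_ROW_EXCLUSIVE,
    MODEL_ACCESS_EXCLUSIVE,
    MODEL_NUM_MODES
} ModelLockMode;

/*
 * model_conflicts[held][requested] is true when a lock already held by one
 * session blocks a request from another session.  AccessExclusiveLock
 * conflicts with every mode; AccessShareLock and RowExclusiveLock conflict
 * (among these three) only with AccessExclusiveLock.
 */
static const bool model_conflicts[MODEL_NUM_MODES][MODEL_NUM_MODES] = {
    /* held: AccessShare     */ {false, false, true},
    /* held: RowExclusive    */ {false, false, true},
    /* held: AccessExclusive */ {true, true, true},
};

/* Would "requested" have to wait behind "held"? */
bool
model_lock_conflicts(ModelLockMode held, ModelLockMode requested)
{
    assert((int) held >= 0 && held < MODEL_NUM_MODES);
    assert((int) requested >= 0 && requested < MODEL_NUM_MODES);
    return model_conflicts[held][requested];
}
```

In this model, model_lock_conflicts(MODEL_ACCESS_EXCLUSIVE,
MODEL_ACCESS_SHARE) is true, which corresponds to the old behaviour where
the connecting backend waits behind DROP SUBSCRIPTION, while
model_lock_conflicts(MODEL_ROW_EXCLUSIVE, MODEL_ACCESS_SHARE) is false,
which corresponds to the behaviour after this change.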

Now, instead of relying on a strict lock on pg_subscription, the worker
acquires an AccessShareLock on its own subscription during initialization
and re-validates the subscription's existence after acquiring the lock. If
the subscription is no longer valid, the worker exits gracefully. This
approach avoids the deadlock while still ensuring that orphan workers are
not created.
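
The lock-then-re-validate pattern can be sketched in isolation as follows.
This is a simplified, single-threaded model rather than the worker code:
the ModelSubscription type and the model_* functions are hypothetical, and
the locking step is represented only by a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a subscription catalog entry. */
typedef struct
{
    unsigned int subid;
    bool         dropped;   /* true once the subscription has been dropped */
} ModelSubscription;

/* Return the live entry for subid, or NULL if it is missing or dropped. */
const ModelSubscription *
model_get_subscription(const ModelSubscription *subs, size_t n,
                       unsigned int subid)
{
    assert(subs != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
    {
        if (subs[i].subid == subid && !subs[i].dropped)
            return &subs[i];
    }
    return NULL;
}

/*
 * Worker startup in the model: returns true if the worker may continue,
 * false if it should exit because its subscription no longer exists.
 */
bool
model_worker_may_start(const ModelSubscription *subs, size_t n,
                       unsigned int subid)
{
    /*
     * Step 1 (not modelled): take a share lock on the subscription object
     * here, so that a concurrent drop cannot proceed past this point.
     */

    /* Step 2: re-check existence only after the lock has been obtained. */
    return model_get_subscription(subs, n, subid) != NULL;
}
```

The ordering is the point of the sketch: checking existence before taking
the lock would leave a window in which the subscription could be dropped
between the check and the lock.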

Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: vignesh C <vignesh21@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Backpatch-through: 13
Discussion: https://postgr.es/m/18988-7312c868be2d467f@postgresql.org

src/backend/commands/subscriptioncmds.c
src/backend/replication/logical/worker.c
src/test/subscription/t/100_bugs.pl

diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index e1260fc0e9014a61d6852d96a21ab65465e7e033..aebfaecd9d84b5ebb6dace1283a470b047afb502 100644
@@ -1491,10 +1491,12 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
    bool        must_use_password;
 
    /*
-    * Lock pg_subscription with AccessExclusiveLock to ensure that the
-    * launcher doesn't restart new worker during dropping the subscription
+    * The launcher may concurrently start a new worker for this subscription.
+    * During initialization, the worker checks for subscription validity and
+    * exits if the subscription has already been dropped. See
+    * InitializeLogRepWorker.
     */
-   rel = table_open(SubscriptionRelationId, AccessExclusiveLock);
+   rel = table_open(SubscriptionRelationId, RowExclusiveLock);
 
    tup = SearchSysCache2(SUBSCRIPTIONNAME, MyDatabaseId,
                          CStringGetDatum(stmt->subname));
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 4df0a594dc9397bf5297ac52cc5be9a2f4751627..9b5c641941fad28f724c92f3d622c1795d9096cb 100644
@@ -4492,6 +4492,13 @@ InitializeApplyWorker(void)
    StartTransactionCommand();
    oldctx = MemoryContextSwitchTo(ApplyContext);
 
+   /*
+    * Lock the subscription to prevent it from being concurrently dropped,
+    * then re-verify its existence. After the initialization, the worker will
+    * be terminated gracefully if the subscription is dropped.
+    */
+   LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid, 0,
+                    AccessShareLock);
    MySubscription = GetSubscription(MyLogicalRepWorker->subid, true);
    if (!MySubscription)
    {
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 6744998b620401392d2bc1656cabdbcb14713a5c..87643b8e620251dfa69e1c97eab4934b13e3a450 100644
@@ -568,4 +568,34 @@ is($result, 't',
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
+# BUG #18988
+# The bug happened due to a self-deadlock between the DROP SUBSCRIPTION
+# command and the walsender process for accessing pg_subscription. This
+# occurred when DROP SUBSCRIPTION attempted to remove a replication slot by
+# connecting to a newly created database whose caches are not yet
+# initialized.
+#
+# The bug is fixed by reducing the lock-level during DROP SUBSCRIPTION.
+$node_publisher->start();
+
+$publisher_connstr = $node_publisher->connstr . ' dbname=regress_db';
+$node_publisher->safe_psql(
+   'postgres', qq(
+   CREATE DATABASE regress_db;
+   CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub WITH (connect=false);
+));
+
+my ($ret, $stdout, $stderr) =
+  $node_publisher->psql('postgres', q{DROP SUBSCRIPTION regress_sub1});
+
+isnt($ret, 0, "replication slot does not exist: exit code not 0");
+like(
+   $stderr,
+   qr/ERROR:  could not drop replication slot "regress_sub1" on publisher/,
+   "could not drop replication slot: error message");
+
+$node_publisher->safe_psql('postgres', "DROP DATABASE regress_db");
+
+$node_publisher->stop('fast');
+
 done_testing();