Repair PANIC condition in hash indexes when a previous index extension attempt

author Tom Lane <tgl@sss.pgh.pa.us>

Thu, 19 Apr 2007 20:24:36 +0000 (20:24 +0000)

committer Tom Lane <tgl@sss.pgh.pa.us>

Thu, 19 Apr 2007 20:24:36 +0000 (20:24 +0000)
author Tom Lane <tgl@sss.pgh.pa.us>
Thu, 19 Apr 2007 20:24:36 +0000 (20:24 +0000)
committer Tom Lane <tgl@sss.pgh.pa.us>
Thu, 19 Apr 2007 20:24:36 +0000 (20:24 +0000)
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README

new file mode 100644 (file)

index 0000000..edd2193
--- /dev/null
+++ b/src/backend/access/hash/README
@@ -0,0 +1,435 @@
+$Header$
+
+This directory contains an implementation of hash indexing for Postgres.
+
+A hash index consists of two or more "buckets", into which tuples are
+placed whenever their hash key maps to the bucket number.  The
+key-to-bucket-number mapping is chosen so that the index can be
+incrementally expanded.  When a new bucket is to be added to the index,
+exactly one existing bucket will need to be "split", with some of its
+tuples being transferred to the new bucket according to the updated
+key-to-bucket-number mapping.  This is essentially the same hash table
+management technique embodied in src/backend/utils/hash/dynahash.c for
+in-memory hash tables.
+
+Each bucket in the hash index comprises one or more index pages.  The
+bucket's first page is permanently assigned to it when the bucket is
+created.  Additional pages, called "overflow pages", are added if the
+bucket receives too many tuples to fit in the primary bucket page.
+The pages of a bucket are chained together in a doubly-linked list
+using fields in the index page special space.
+
+There is currently no provision to shrink a hash index, other than by
+rebuilding it with REINDEX.  Overflow pages can be recycled for reuse
+in other buckets, but we never give them back to the operating system.
+There is no provision for reducing the number of buckets, either.
+
+
+Page addressing
+---------------
+
+There are four kinds of pages in a hash index: the meta page (page zero),
+which contains statically allocated control information; primary bucket
+pages; overflow pages; and bitmap pages, which keep track of overflow
+pages that have been freed and are available for re-use.  For addressing
+purposes, bitmap pages are regarded as a subset of the overflow pages.
+
+Primary bucket pages and overflow pages are allocated independently (since
+any given index might need more or fewer overflow pages relative to its
+number of buckets).  The hash code uses an interesting set of addressing
+rules to support a variable number of overflow pages while not having to
+move primary bucket pages around after they are created.
+
+Primary bucket pages (henceforth just "bucket pages") are allocated in
+power-of-2 groups, called "split points" in the code.  Buckets 0 and 1
+are created when the index is initialized.  At the first split, buckets 2
+and 3 are allocated; when bucket 4 is needed, buckets 4-7 are allocated;
+when bucket 8 is needed, buckets 8-15 are allocated; etc.  All the bucket
+pages of a power-of-2 group appear consecutively in the index.  This
+addressing scheme allows the physical location of a bucket page to be
+computed from the bucket number relatively easily, using only a small
+amount of control information.  We take the log2() of the bucket number
+to determine which split point S the bucket belongs to, and then simply
+add "hashm_spares[S] + 1" (where hashm_spares[] is an array stored in the
+metapage) to compute the physical address.  hashm_spares[S] can be
+interpreted as the total number of overflow pages that have been allocated
+before the bucket pages of splitpoint S.  hashm_spares[0] is always 0,
+so that buckets 0 and 1 (which belong to splitpoint 0) always appear at
+block numbers 1 and 2, just after the meta page.  We always have
+hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the
+former.  The difference between the two represents the number of overflow
+pages appearing between the bucket page groups of splitpoints N and N+1.
+
+When S splitpoints exist altogether, the array entries hashm_spares[0]
+through hashm_spares[S] are valid; hashm_spares[S] records the current
+total number of overflow pages.  New overflow pages are created as needed
+at the end of the index, and recorded by incrementing hashm_spares[S].
+When it is time to create a new splitpoint's worth of bucket pages, we
+copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is
+stored in the hashm_ovflpoint field of the meta page).  This has the
+effect of reserving the correct number of bucket pages at the end of the
+index, and preparing to allocate additional overflow pages after those
+bucket pages.  hashm_spares[] entries before S cannot change anymore,
+since that would require moving already-created bucket pages.
+
+The last page nominally used by the index is always determinable from
+hashm_spares[S].  To avoid complaints from smgr, the logical EOF as seen by
+the filesystem and smgr must always be greater than or equal to this page.
+We have to allow the case "greater than" because it's possible that during
+an index extension we crash after allocating filesystem space and before
+updating the metapage.  Note that on filesystems that allow "holes" in
+files, it's entirely likely that pages before the logical EOF are not yet
+allocated: when we allocate a new splitpoint's worth of bucket pages, we
+physically zero the last such page to force the EOF up, and the first such
+page will be used immediately, but the intervening pages are not written
+until needed.
+
+Since overflow pages may be recycled if enough tuples are deleted from
+their bucket, we need a way to keep track of currently-free overflow
+pages.  The state of each overflow page (0 = available, 1 = not available)
+is recorded in "bitmap" pages dedicated to this purpose.  The entries in
+the bitmap are indexed by "bit number", a zero-based count in which every
+overflow page has a unique entry.  We can convert between an overflow
+page's physical block number and its bit number using the information in
+hashm_spares[] (see hashovfl.c for details).  The bit number sequence
+includes the bitmap pages, which is the reason for saying that bitmap
+pages are a subset of the overflow pages.  It turns out in fact that each
+bitmap page's first bit represents itself --- this is not an essential
+property, but falls out of the fact that we only allocate another bitmap
+page when we really need one.  Bit number zero always corresponds to block
+number 3, which is the first bitmap page and is allocated during index
+creation.
+
+
+Lock definitions
+----------------
+
+We use both lmgr locks ("heavyweight" locks) and buffer context locks
+(LWLocks) to control access to a hash index.  lmgr locks are needed for
+long-term locking since there is a (small) risk of deadlock, which we must
+be able to detect.  Buffer context locks are used for short-term access
+control to individual pages of the index.
+
+We define the following lmgr locks for a hash index:
+
+LockPage(rel, 0) represents the right to modify the hash-code-to-bucket
+mapping.  A process attempting to enlarge the hash table by splitting a
+bucket must exclusive-lock this lock before modifying the metapage data
+representing the mapping.  Processes intending to access a particular
+bucket must share-lock this lock until they have acquired lock on the
+correct target bucket.
+
+LockPage(rel, page), where page is the page number of a hash bucket page,
+represents the right to split or compact an individual bucket.  A process
+splitting a bucket must exclusive-lock both old and new halves of the
+bucket until it is done.  A process doing VACUUM must exclusive-lock the
+bucket it is currently purging tuples from.  Processes doing scans or
+insertions must share-lock the bucket they are scanning or inserting into.
+(It is okay to allow concurrent scans and insertions.)
+
+The lmgr lock IDs corresponding to overflow pages are currently unused.
+These are available for possible future refinements.
+
+Note that these lock definitions are conceptually distinct from any sort
+of lock on the pages whose numbers they share.  A process must also obtain
+read or write buffer lock on the metapage or bucket page before accessing
+said page.
+
+Processes performing hash index scans must hold share lock on the bucket
+they are scanning throughout the scan.  This seems to be essential, since
+there is no reasonable way for a scan to cope with its bucket being split
+underneath it.  This creates a possibility of deadlock external to the
+hash index code, since a process holding one of these locks could block
+waiting for an unrelated lock held by another process.  If that process
+then does something that requires exclusive lock on the bucket, we have
+deadlock.  Therefore the bucket locks must be lmgr locks so that deadlock
+can be detected and recovered from.  This also forces the page-zero lock
+to be an lmgr lock, because as we'll see below it is held while attempting
+to acquire a bucket lock, and so it could also participate in a deadlock.
+
+Processes must obtain read (share) buffer context lock on any hash index
+page while reading it, and write (exclusive) lock while modifying it.
+To prevent deadlock we enforce these coding rules: no buffer lock may be
+held long term (across index AM calls), nor may any buffer lock be held
+while waiting for an lmgr lock, nor may more than one buffer lock
+be held at a time by any one process.  (The third restriction is probably
+stronger than necessary, but it makes the proof of no deadlock obvious.)
+
+
+Pseudocode algorithms
+---------------------
+
+The operations we need to support are: readers scanning the index for
+entries of a particular hash code (which by definition are all in the same
+bucket); insertion of a new tuple into the correct bucket; enlarging the
+hash table by splitting an existing bucket; and garbage collection
+(deletion of dead tuples and compaction of buckets).  Bucket splitting is
+done at conclusion of any insertion that leaves the hash table more full
+than the target load factor, but it is convenient to consider it as an
+independent operation.  Note that we do not have a bucket-merge operation
+--- the number of buckets never shrinks.  Insertion, splitting, and
+garbage collection may all need access to freelist management, which keeps
+track of available overflow pages.
+
+The reader algorithm is:
+
+       share-lock page 0 (to prevent active split)
+       read/sharelock meta page
+       compute bucket number for target hash key
+       release meta page
+       share-lock bucket page (to prevent split/compact of this bucket)
+       release page 0 share-lock
+-- then, per read request:
+       read/sharelock current page of bucket
+               step to next page if necessary (no chaining of locks)
+       get tuple
+       release current page
+-- at scan shutdown:
+       release bucket share-lock
+
+By holding the page-zero lock until lock on the target bucket is obtained,
+the reader ensures that the target bucket calculation is valid (otherwise
+the bucket might be split before the reader arrives at it, and the target
+entries might go into the new bucket).  Holding the bucket sharelock for
+the remainder of the scan prevents the reader's current-tuple pointer from
+being invalidated by other processes.  Notice though that the reader need
+not prevent other buckets from being split or compacted.
+
+The insertion algorithm is rather similar:
+
+       share-lock page 0 (to prevent active split)
+       read/sharelock meta page
+       compute bucket number for target hash key
+       release meta page
+       share-lock bucket page (to prevent split/compact of this bucket)
+       release page 0 share-lock
+-- (so far same as reader)
+       read/exclusive-lock current page of bucket
+       if full, release, read/exclusive-lock next page; repeat as needed
+       >> see below if no space in any page of bucket
+       insert tuple
+       write/release current page
+       release bucket share-lock
+       read/exclusive-lock meta page
+       increment tuple count, decide if split needed
+       write/release meta page
+       done if no split needed, else enter Split algorithm below
+
+It is okay for an insertion to take place in a bucket that is being
+actively scanned, because it does not change the position of any existing
+item in the bucket, so scan states are not invalidated.  We only need the
+short-term buffer locks to ensure that readers do not see a
+partially-updated page.
+
+It is clearly impossible for readers and inserters to deadlock, and in
+fact this algorithm allows them a very high degree of concurrency.
+(The exclusive metapage lock taken to update the tuple count is stronger
+than necessary, since readers do not care about the tuple count, but the
+lock is held for such a short time that this is probably not an issue.)
+
+When an inserter cannot find space in any existing page of a bucket, it
+must obtain an overflow page and add that page to the bucket's chain.
+Details of that part of the algorithm appear later.
+
+The page split algorithm is entered whenever an inserter observes that the
+index is overfull (has a higher-than-wanted ratio of tuples to buckets).
+The algorithm attempts, but does not necessarily succeed, to split one
+existing bucket in two, thereby lowering the fill ratio:
+
+       exclusive-lock page 0 (assert the right to begin a split)
+       read/exclusive-lock meta page
+       check split still needed
+       if split not needed anymore, drop locks and exit
+       decide which bucket to split
+       Attempt to X-lock old bucket number (definitely could fail)
+       Attempt to X-lock new bucket number (shouldn't fail, but...)
+       if above fail, drop locks and exit
+       update meta page to reflect new number of buckets
+       write/release meta page
+       release X-lock on page 0
+       -- now, accesses to all other buckets can proceed.
+       Perform actual split of bucket, moving tuples as needed
+       >> see below about acquiring needed extra space
+       Release X-locks of old and new buckets
+
+Note the page zero and metapage locks are not held while the actual tuple
+rearrangement is performed, so accesses to other buckets can proceed in
+parallel; in fact, it's possible for multiple bucket splits to proceed
+in parallel.
+
+Split's attempt to X-lock the old bucket number could fail if another
+process holds S-lock on it.  We do not want to wait if that happens, first
+because we don't want to wait while holding the metapage exclusive-lock,
+and second because it could very easily result in deadlock.  (The other
+process might be out of the hash AM altogether, and could do something
+that blocks on another lock this process holds; so even if the hash
+algorithm itself is deadlock-free, a user-induced deadlock could occur.)
+So, this is a conditional LockAcquire operation, and if it fails we just
+abandon the attempt to split.  This is all right since the index is
+overfull but perfectly functional.  Every subsequent inserter will try to
+split, and eventually one will succeed.  If multiple inserters failed to
+split, the index might still be overfull, but eventually, the index will
+not be overfull and split attempts will stop.  (We could make a successful
+splitter loop to see if the index is still overfull, but it seems better to
+distribute the split overhead across successive insertions.)
+
+A problem is that if a split fails partway through (eg due to insufficient
+disk space) the index is left corrupt.  The probability of that could be
+made quite low if we grab a free page or two before we update the meta
+page, but the only real solution is to treat a split as a WAL-loggable,
+must-complete action.  I'm not planning to teach hash about WAL in this
+go-round.
+
+The fourth operation is garbage collection (bulk deletion):
+
+       next bucket := 0
+       read/sharelock meta page
+       fetch current max bucket number
+       release meta page
+       while next bucket <= max bucket do
+               Acquire X lock on target bucket
+               Scan and remove tuples, compact free space as needed
+               Release X lock
+               next bucket ++
+       end loop
+       exclusive-lock meta page
+       check if number of buckets changed
+       if so, release lock and return to for-each-bucket loop
+       else update metapage tuple count
+       write/release meta page
+
+Note that this is designed to allow concurrent splits.  If a split occurs,
+tuples relocated into the new bucket will be visited twice by the scan,
+but that does no harm.  (We must however be careful about the statistics
+reported by the VACUUM operation.  What we can do is count the number of
+tuples scanned, and believe this in preference to the stored tuple count
+if the stored tuple count and number of buckets did *not* change at any
+time during the scan.  This provides a way of correcting the stored tuple
+count if it gets out of sync for some reason.  But if a split or insertion
+does occur concurrently, the scan count is untrustworthy; instead,
+subtract the number of tuples deleted from the stored tuple count and
+use that.)
+
+The exclusive lock request could deadlock in some strange scenarios, but
+we can just error out without any great harm being done.
+
+
+Free space management
+---------------------
+
+(Question: why is this so complicated?  Why not just have a linked list
+of free pages with the list head in the metapage?  It's not like we
+avoid needing to modify the metapage with all this.)
+
+Free space management consists of two sub-algorithms, one for reserving
+an overflow page to add to a bucket chain, and one for returning an empty
+overflow page to the free pool.
+
+Obtaining an overflow page:
+
+       read/exclusive-lock meta page
+       determine next bitmap page number; if none, exit loop
+       release meta page lock
+       read/exclusive-lock bitmap page
+       search for a free page (zero bit in bitmap)
+       if found:
+               set bit in bitmap
+               write/release bitmap page
+               read/exclusive-lock meta page
+               if first-free-bit value did not change,
+                       update it and write meta page
+               release meta page
+               return page number
+       else (not found):
+       release bitmap page
+       loop back to try next bitmap page, if any
+-- here when we have checked all bitmap pages; we hold meta excl. lock
+       extend index to add another overflow page; update meta information
+       write/release meta page
+       return page number
+
+It is slightly annoying to release and reacquire the metapage lock
+multiple times, but it seems best to do it that way to minimize loss of
+concurrency against processes just entering the index.  We don't want
+to hold the metapage exclusive lock while reading in a bitmap page.
+(We can at least avoid repeated buffer pin/unpin here.)
+
+The normal path for extending the index does not require doing I/O while
+holding the metapage lock.  We do have to do I/O when the extension
+requires adding a new bitmap page as well as the required overflow page
+... but that is an infrequent case, so the loss of concurrency seems
+acceptable.
+
+The portion of tuple insertion that calls the above subroutine looks
+like this:
+
+       -- having determined that no space is free in the target bucket:
+       remember last page of bucket, drop write lock on it
+       call free-page-acquire routine
+       re-write-lock last page of bucket
+       if it is not last anymore, step to the last page
+       update (former) last page to point to new page
+       write-lock and initialize new page, with back link to former last page
+       write and release former last page
+       insert tuple into new page
+       -- etc.
+
+Notice this handles the case where two concurrent inserters try to extend
+the same bucket.  They will end up with a valid, though perhaps
+space-inefficient, configuration: two overflow pages will be added to the
+bucket, each containing one tuple.
+
+The last part of this violates the rule about holding write lock on two
+pages concurrently, but it should be okay to write-lock the previously
+free page; there can be no other process holding lock on it.
+
+Bucket splitting uses a similar algorithm if it has to extend the new
+bucket, but it need not worry about concurrent extension since it has
+exclusive lock on the new bucket.
+
+Freeing an overflow page is done by garbage collection and by bucket
+splitting (the old bucket may contain no-longer-needed overflow pages).
+In both cases, the process holds exclusive lock on the containing bucket,
+so need not worry about other accessors of pages in the bucket.  The
+algorithm is:
+
+       delink overflow page from bucket chain
+       (this requires read/update/write/release of fore and aft siblings)
+       read/share-lock meta page
+       determine which bitmap page contains the free space bit for page
+       release meta page
+       read/exclusive-lock bitmap page
+       update bitmap bit
+       write/release bitmap page
+       if page number is less than what we saw as first-free-bit in meta:
+       read/exclusive-lock meta page
+       if page number is still less than first-free-bit,
+               update first-free-bit field and write meta page
+       release meta page
+
+We have to do it this way because we must clear the bitmap bit before
+changing the first-free-bit field (hashm_firstfree).  It is possible that
+we set first-free-bit too small (because someone has already reused the
+page we just freed), but that is okay; the only cost is the next overflow
+page acquirer will scan more bitmap bits than he needs to.  What must be
+avoided is having first-free-bit greater than the actual first free bit,
+because then that free page would never be found by searchers.
+
+All the freespace operations should be called while holding no buffer
+locks.  Since they need no lmgr locks, deadlock is not possible.
+
+
+Other notes
+-----------
+
+All the shenanigans with locking prevent a split occurring while *another*
+process is stopped in a given bucket.  They do not ensure that one of
+our *own* backend's scans is not stopped in the bucket, because lmgr
+doesn't consider a process's own locks to conflict.  So the Split
+algorithm must check for that case separately before deciding it can go
+ahead with the split.  VACUUM does not have this problem since nothing
+else can be happening within the vacuuming backend.
+
+Should we instead try to fix the state of any conflicting local scan?
+Seems mighty ugly --- got to move the held bucket S-lock as well as lots
+of other messiness.  For now, just punt and don't split.
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c

index 1f89611258293d440e5990343144cc1e985bdb1b..04e5442ff0d777dd587349025c3249433da4446c 100644 (file)
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -271,19 +271,12 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
         blkno = bitno_to_blkno(metap, bit);
  
         /*
-        * We have to fetch the page with P_NEW to ensure smgr's idea of the
+        * Fetch the page with _hash_getnewbuf to ensure smgr's idea of the
          * relation length stays in sync with ours.  XXX It's annoying to do this
          * with metapage write lock held; would be better to use a lock that
-        * doesn't block incoming searches.  Best way to fix it would be to stop
-        * maintaining hashm_spares[hashm_ovflpoint] and rely entirely on the
-        * smgr relation length to track where new overflow pages come from;
-        * then we could release the metapage before we do the smgrextend.
-        * FIXME later (not in beta...)
+        * doesn't block incoming searches.
          */
-       newbuf = _hash_getbuf(rel, P_NEW, HASH_WRITE);
-       if (BufferGetBlockNumber(newbuf) != blkno)
-               elog(ERROR, "unexpected hash relation size: %u, should be %u",
-                        BufferGetBlockNumber(newbuf), blkno);
+       newbuf = _hash_getnewbuf(rel, blkno, HASH_WRITE);
  
         metap->hashm_spares[splitnum]++;
  
@@ -508,19 +501,14 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno)
         /*
          * It is okay to write-lock the new bitmap page while holding metapage
          * write lock, because no one else could be contending for the new page.
-        * Also, the metapage lock makes it safe to extend the index using P_NEW,
-        * which we want to do to ensure the smgr's idea of the relation size
-        * stays in step with ours.
+        * Also, the metapage lock makes it safe to extend the index using
+        * _hash_getnewbuf.
          *
          * There is some loss of concurrency in possibly doing I/O for the new
          * page while holding the metapage lock, but this path is taken so
          * seldom that it's not worth worrying about.
          */
-       buf = _hash_getbuf(rel, P_NEW, HASH_WRITE);
-       if (BufferGetBlockNumber(buf) != blkno)
-               elog(ERROR, "unexpected hash relation size: %u, should be %u",
-                        BufferGetBlockNumber(buf), blkno);
-
+       buf = _hash_getnewbuf(rel, blkno, HASH_WRITE);
         pg = BufferGetPage(buf);
  
         /* initialize the page */
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c

index 2778c207df37c10a15272a6257dbdf150d1d62f8..4aa8bddb3ba2e7b60e3558e25ee8b085c1b7f801 100644 (file)
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -35,7 +35,8 @@
  #include "utils/lsyscache.h"
  
  
-static BlockNumber _hash_alloc_buckets(Relation rel, uint32 nblocks);
+static bool _hash_alloc_buckets(Relation rel, BlockNumber firstblock,
+                                                               uint32 nblocks);
  static void _hash_splitbucket(Relation rel, Buffer metabuf,
                                                           Bucket obucket, Bucket nbucket,
                                                           BlockNumber start_oblkno,
@@ -105,14 +106,18 @@ _hash_droplock(Relation rel, BlockNumber whichlock, int access)
   *             requested buffer and its reference count has been incremented
   *             (ie, the buffer is "locked and pinned").
   *
- *             blkno == P_NEW is allowed, but it is caller's responsibility to
- *             ensure that only one process can extend the index at a time.
+ *             P_NEW is disallowed because this routine should only be used
+ *             to access pages that are known to be before the filesystem EOF.
+ *             Extending the index should be done with _hash_getnewbuf.
   */
  Buffer
  _hash_getbuf(Relation rel, BlockNumber blkno, int access)
  {
         Buffer          buf;
  
+       if (blkno == P_NEW)
+               elog(ERROR, "hash AM does not use P_NEW");
+
         buf = ReadBuffer(rel, blkno);
  
         if (access != HASH_NOLOCK)
@@ -122,6 +127,51 @@ _hash_getbuf(Relation rel, BlockNumber blkno, int access)
         return buf;
  }
  
+/*
+ *     _hash_getnewbuf() -- Get a new page at the end of the index.
+ *
+ *             This has the same API as _hash_getbuf, except that we are adding
+ *             a page to the index, and hence expect the page to be past the
+ *             logical EOF.  (However, we have to support the case where it isn't,
+ *             since a prior try might have crashed after extending the filesystem
+ *             EOF but before updating the metapage to reflect the added page.)
+ *
+ *             It is caller's responsibility to ensure that only one process can
+ *             extend the index at a time.
+ *
+ *             All call sites should call _hash_pageinit on the returned page.
+ *             Also, it's difficult to imagine why access would not be HASH_WRITE.
+ */
+Buffer
+_hash_getnewbuf(Relation rel, BlockNumber blkno, int access)
+{
+       BlockNumber     nblocks = RelationGetNumberOfBlocks(rel);
+       Buffer          buf;
+
+       if (blkno == P_NEW)
+               elog(ERROR, "hash AM does not use P_NEW");
+       if (blkno > nblocks)
+               elog(ERROR, "access to noncontiguous page in hash index \"%s\"",
+                        RelationGetRelationName(rel));
+
+       /* smgr insists we use P_NEW to extend the relation */
+       if (blkno == nblocks)
+       {
+               buf = ReadBuffer(rel, P_NEW);
+               if (BufferGetBlockNumber(buf) != blkno)
+                       elog(ERROR, "unexpected hash relation size: %u, should be %u",
+                                BufferGetBlockNumber(buf), blkno);
+       }
+       else
+               buf = ReadBuffer(rel, blkno);
+
+       if (access != HASH_NOLOCK)
+               LockBuffer(buf, access);
+
+       /* ref count and lock type are correct */
+       return buf;
+}
+
  /*
   *     _hash_relbuf() -- release a locked buffer.
   *
@@ -254,12 +304,11 @@ _hash_metapinit(Relation rel)
  
         /*
          * We initialize the metapage, the first two bucket pages, and the
-        * first bitmap page in sequence, using P_NEW to cause smgrextend()
-        * calls to occur.  This ensures that the smgr level has the right
-        * idea of the physical index length.
+        * first bitmap page in sequence, using _hash_getnewbuf to cause
+        * smgrextend() calls to occur.  This ensures that the smgr level
+        * has the right idea of the physical index length.
          */
-       metabuf = _hash_getbuf(rel, P_NEW, HASH_WRITE);
-       Assert(BufferGetBlockNumber(metabuf) == HASH_METAPAGE);
+       metabuf = _hash_getnewbuf(rel, HASH_METAPAGE, HASH_WRITE);
         pg = BufferGetPage(metabuf);
         _hash_pageinit(pg, BufferGetPageSize(metabuf));
  
@@ -312,8 +361,7 @@ _hash_metapinit(Relation rel)
          */
         for (i = 0; i <= 1; i++)
         {
-               buf = _hash_getbuf(rel, P_NEW, HASH_WRITE);
-               Assert(BufferGetBlockNumber(buf) == BUCKET_TO_BLKNO(metap, i));
+               buf = _hash_getnewbuf(rel, BUCKET_TO_BLKNO(metap, i), HASH_WRITE);
                 pg = BufferGetPage(buf);
                 _hash_pageinit(pg, BufferGetPageSize(buf));
                 pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
@@ -361,7 +409,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)
         Bucket          old_bucket;
         Bucket          new_bucket;
         uint32          spare_ndx;
-       BlockNumber firstblock = InvalidBlockNumber;
         BlockNumber start_oblkno;
         BlockNumber start_nblkno;
         uint32          maxbucket;
@@ -413,39 +460,15 @@ _hash_expandtable(Relation rel, Buffer metabuf)
         if (metap->hashm_maxbucket >= (uint32) 0x7FFFFFFE)
                 goto fail;
  
-       /*
-        * If the split point is increasing (hashm_maxbucket's log base 2
-        * increases), we need to allocate a new batch of bucket pages.
-        */
-       new_bucket = metap->hashm_maxbucket + 1;
-       spare_ndx = _hash_log2(new_bucket + 1);
-       if (spare_ndx > metap->hashm_ovflpoint)
-       {
-               Assert(spare_ndx == metap->hashm_ovflpoint + 1);
-               /*
-                * The number of buckets in the new splitpoint is equal to the
-                * total number already in existence, i.e. new_bucket.  Currently
-                * this maps one-to-one to blocks required, but someday we may need
-                * a more complicated calculation here.
-                */
-               firstblock = _hash_alloc_buckets(rel, new_bucket);
-               if (firstblock == InvalidBlockNumber)
-                       goto fail;                      /* can't split due to BlockNumber overflow */
-       }
-
         /*
          * Determine which bucket is to be split, and attempt to lock the old
          * bucket.  If we can't get the lock, give up.
          *
          * The lock protects us against other backends, but not against our own
          * backend.  Must check for active scans separately.
-        *
-        * Ideally we would lock the new bucket too before proceeding, but if
-        * we are about to cross a splitpoint then the BUCKET_TO_BLKNO mapping
-        * isn't correct yet.  For simplicity we update the metapage first and
-        * then lock.  This should be okay because no one else should be trying
-        * to lock the new bucket yet...
          */
+       new_bucket = metap->hashm_maxbucket + 1;
+
         old_bucket = (new_bucket & metap->hashm_lowmask);
  
         start_oblkno = BUCKET_TO_BLKNO(metap, old_bucket);
@@ -456,6 +479,45 @@ _hash_expandtable(Relation rel, Buffer metabuf)
         if (!_hash_try_getlock(rel, start_oblkno, HASH_EXCLUSIVE))
                 goto fail;
  
+       /*
+        * Likewise lock the new bucket (should never fail).
+        *
+        * Note: it is safe to compute the new bucket's blkno here, even though
+        * we may still need to update the BUCKET_TO_BLKNO mapping.  This is
+        * because the current value of hashm_spares[hashm_ovflpoint] correctly
+        * shows where we are going to put a new splitpoint's worth of buckets.
+        */
+       start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
+
+       if (_hash_has_active_scan(rel, new_bucket))
+               elog(ERROR, "scan in progress on supposedly new bucket");
+
+       if (!_hash_try_getlock(rel, start_nblkno, HASH_EXCLUSIVE))
+               elog(ERROR, "could not get lock on supposedly new bucket");
+
+       /*
+        * If the split point is increasing (hashm_maxbucket's log base 2
+        * increases), we need to allocate a new batch of bucket pages.
+        */
+       spare_ndx = _hash_log2(new_bucket + 1);
+       if (spare_ndx > metap->hashm_ovflpoint)
+       {
+               Assert(spare_ndx == metap->hashm_ovflpoint + 1);
+               /*
+                * The number of buckets in the new splitpoint is equal to the
+                * total number already in existence, i.e. new_bucket.  Currently
+                * this maps one-to-one to blocks required, but someday we may need
+                * a more complicated calculation here.
+                */
+               if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
+               {
+                       /* can't split due to BlockNumber overflow */
+                       _hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
+                       _hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
+                       goto fail;
+               }
+       }
+
         /*
          * Okay to proceed with split.  Update the metapage bucket mapping info.
          */
@@ -480,20 +542,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)
                 metap->hashm_ovflpoint = spare_ndx;
         }
  
-       /* now we can compute the new bucket's primary block number */
-       start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
-
-       /* if we added a splitpoint, should match result of _hash_alloc_buckets */
-       if (firstblock != InvalidBlockNumber &&
-               firstblock != start_nblkno)
-               elog(PANIC, "unexpected hash relation size: %u, should be %u",
-                        firstblock, start_nblkno);
-
-       Assert(!_hash_has_active_scan(rel, new_bucket));
-
-       if (!_hash_try_getlock(rel, start_nblkno, HASH_EXCLUSIVE))
-               elog(PANIC, "could not get lock on supposedly new bucket");
-
         /*
          * Copy bucket mapping info now; this saves re-accessing the meta page
          * inside _hash_splitbucket's inner loop.  Note that once we drop the
@@ -556,23 +604,16 @@ fail:
   * for the purpose.  OTOH, adding a splitpoint is a very infrequent operation,
   * so it may not be worth worrying about.
   *
- * Returns the first block number in the new splitpoint's range, or
- * InvalidBlockNumber if allocation failed due to BlockNumber overflow.
+ * Returns TRUE if successful, or FALSE if allocation failed due to
+ * BlockNumber overflow.
   */
-static BlockNumber
-_hash_alloc_buckets(Relation rel, uint32 nblocks)
+static bool
+_hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
  {
-       BlockNumber     firstblock;
         BlockNumber     lastblock;
         BlockNumber     endblock;
         char            zerobuf[BLCKSZ];
  
-       /*
-        * Since we hold metapage lock, no one else is either splitting or
-        * allocating a new page in _hash_getovflpage(); hence it's safe to
-        * assume that the relation length isn't changing under us.
-        */
-       firstblock = RelationGetNumberOfBlocks(rel);
         lastblock = firstblock + nblocks - 1;
  
         /*
@@ -580,9 +621,7 @@ _hash_alloc_buckets(Relation rel, uint32 nblocks)
          * extend the index anymore.
          */
         if (lastblock < firstblock || lastblock == InvalidBlockNumber)
-               return InvalidBlockNumber;
-
-       /* Note: we assume RelationGetNumberOfBlocks did RelationOpenSmgr for us */
+               return false;
  
         MemSet(zerobuf, 0, sizeof(zerobuf));
  
@@ -604,7 +643,7 @@ _hash_alloc_buckets(Relation rel, uint32 nblocks)
  
         rel->rd_nblocks = lastblock+1;
  
-       return firstblock;
+       return true;
  }
  
  
diff --git a/src/include/access/hash.h b/src/include/access/hash.h

index 2f8012dd40d8c36f184d0300f5681602ceae8066..866a77bfd4de3ee1953a6454b12596d49ab599d7 100644 (file)
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -278,6 +278,7 @@ extern void _hash_getlock(Relation rel, BlockNumber whichlock, int access);
  extern bool _hash_try_getlock(Relation rel, BlockNumber whichlock, int access);
  extern void _hash_droplock(Relation rel, BlockNumber whichlock, int access);
  extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno, int access);
+extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno, int access);
  extern void _hash_relbuf(Relation rel, Buffer buf);
  extern void _hash_dropbuf(Relation rel, Buffer buf);
  extern void _hash_wrtbuf(Relation rel, Buffer buf);
author	Tom Lane <tgl@sss.pgh.pa.us>
	Thu, 19 Apr 2007 20:24:36 +0000 (20:24 +0000)
committer	Tom Lane <tgl@sss.pgh.pa.us>
	Thu, 19 Apr 2007 20:24:36 +0000 (20:24 +0000)
src/backend/access/hash/README	[new file with mode: 0644]	patch \| blob
src/backend/access/hash/hashovfl.c		patch \| blob \| blame \| history
src/backend/access/hash/hashpage.c		patch \| blob \| blame \| history
src/include/access/hash.h		patch \| blob \| blame \| history