--- /dev/null
+$Header$
+
+This directory contains an implementation of hash indexing for Postgres.
+
+A hash index consists of two or more "buckets", into which tuples are
+placed whenever their hash key maps to that bucket's number.  The
+key-to-bucket-number mapping is chosen so that the index can be
+incrementally expanded. When a new bucket is to be added to the index,
+exactly one existing bucket will need to be "split", with some of its
+tuples being transferred to the new bucket according to the updated
+key-to-bucket-number mapping. This is essentially the same hash table
+management technique embodied in src/backend/utils/hash/dynahash.c for
+in-memory hash tables.
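+
+As a concrete illustration of that mapping: the metapage records the highest
+bucket number in existence plus two bit masks, and a hash key is assigned to
+a bucket roughly as sketched below.  This is a simplified sketch of the idea
+only; the mask fields correspond to the metapage's hashm_highmask and
+hashm_lowmask, and the real lookup code lives in the hash AM's utility
+routines.
+
+    /* Sketch of the incrementally-expandable key-to-bucket mapping. */
+    static Bucket
+    sketch_hashkey2bucket(uint32 hashkey, Bucket maxbucket,
+                          uint32 highmask, uint32 lowmask)
+    {
+        Bucket      bucket = hashkey & highmask;
+
+        /* buckets beyond maxbucket don't exist yet; use the unsplit "parent" */
+        if (bucket > maxbucket)
+            bucket = bucket & lowmask;
+
+        return bucket;
+    }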
+
+Each bucket in the hash index comprises one or more index pages. The
+bucket's first page is permanently assigned to it when the bucket is
+created. Additional pages, called "overflow pages", are added if the
+bucket receives too many tuples to fit in the primary bucket page.
+The pages of a bucket are chained together in a doubly-linked list
+using fields in the index page special space.
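+
+For illustration, the chain links kept in the special space look roughly like
+the trimmed-down sketch below (field names follow the HashPageOpaqueData
+struct used by the C code; other fields are omitted here):
+
+    typedef struct HashPageOpaqueSketch
+    {
+        BlockNumber hasho_prevblkno;    /* previous page in bucket chain */
+        BlockNumber hasho_nextblkno;    /* next page in bucket chain */
+        Bucket      hasho_bucket;       /* bucket this page belongs to */
+        uint16      hasho_flag;         /* page type: bucket/overflow/bitmap/meta */
+    } HashPageOpaqueSketch;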
+
+There is currently no provision to shrink a hash index, other than by
+rebuilding it with REINDEX. Overflow pages can be recycled for reuse
+in other buckets, but we never give them back to the operating system.
+There is no provision for reducing the number of buckets, either.
+
+
+Page addressing
+---------------
+
+There are four kinds of pages in a hash index: the meta page (page zero),
+which contains statically allocated control information; primary bucket
+pages; overflow pages; and bitmap pages, which keep track of overflow
+pages that have been freed and are available for re-use. For addressing
+purposes, bitmap pages are regarded as a subset of the overflow pages.
+
+Primary bucket pages and overflow pages are allocated independently (since
+any given index might need more or fewer overflow pages relative to its
+number of buckets). The hash code uses an interesting set of addressing
+rules to support a variable number of overflow pages while not having to
+move primary bucket pages around after they are created.
+
+Primary bucket pages (henceforth just "bucket pages") are allocated in
+power-of-2 groups, called "split points" in the code. Buckets 0 and 1
+are created when the index is initialized. At the first split, buckets 2
+and 3 are allocated; when bucket 4 is needed, buckets 4-7 are allocated;
+when bucket 8 is needed, buckets 8-15 are allocated; etc. All the bucket
+pages of a power-of-2 group appear consecutively in the index. This
+addressing scheme allows the physical location of a bucket page to be
+computed from the bucket number relatively easily, using only a small
+amount of control information.  We take the log2() of the bucket number
+to determine which split point S the bucket belongs to, and then simply
+add "hashm_spares[S] + 1" (where hashm_spares[] is an array stored in the
+metapage) to the bucket number to compute the physical block number.
+hashm_spares[S] can be
+interpreted as the total number of overflow pages that have been allocated
+before the bucket pages of splitpoint S. hashm_spares[0] is always 0,
+so that buckets 0 and 1 (which belong to splitpoint 0) always appear at
+block numbers 1 and 2, just after the meta page. We always have
+hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the
+former. The difference between the two represents the number of overflow
+pages appearing between the bucket page groups of splitpoints N and N+1.
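+
+A sketch of the resulting bucket-number-to-block-number computation, written
+out from the rules above (the C code packages this as the BUCKET_TO_BLKNO
+macro; the function below is only illustrative):
+
+    static BlockNumber
+    sketch_bucket_to_blkno(HashMetaPage metap, Bucket bucket)
+    {
+        uint32      splitpoint = 0;
+        uint32      b = bucket;
+
+        /* splitpoint 0 holds buckets 0-1; splitpoint S > 0 holds 2^S .. 2^(S+1)-1 */
+        while (b > 1)
+        {
+            b >>= 1;
+            splitpoint++;
+        }
+
+        /* block 0 is the metapage, hence the extra 1 */
+        return (BlockNumber) (bucket + metap->hashm_spares[splitpoint] + 1);
+    }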
+
+When the highest splitpoint in use is S, the array entries hashm_spares[0]
+through hashm_spares[S] are valid; hashm_spares[S] records the current
+total number of overflow pages. New overflow pages are created as needed
+at the end of the index, and recorded by incrementing hashm_spares[S].
+When it is time to create a new splitpoint's worth of bucket pages, we
+copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is
+stored in the hashm_ovflpoint field of the meta page). This has the
+effect of reserving the correct number of bucket pages at the end of the
+index, and preparing to allocate additional overflow pages after those
+bucket pages. hashm_spares[] entries before S cannot change anymore,
+since that would require moving already-created bucket pages.
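+
+In code form, starting a new splitpoint amounts to two metapage field updates,
+roughly as sketched below (compare the corresponding steps in the bucket-split
+code later in this patch):
+
+    /* Reserve block numbers for the next power-of-2 group of bucket pages. */
+    static void
+    sketch_start_new_splitpoint(HashMetaPage metap)
+    {
+        uint32      new_ovflpoint = metap->hashm_ovflpoint + 1;
+
+        /* carry the running overflow-page count into the new spares slot */
+        metap->hashm_spares[new_ovflpoint] = metap->hashm_spares[new_ovflpoint - 1];
+        metap->hashm_ovflpoint = new_ovflpoint;
+    }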
+
+The last page nominally used by the index is always determinable from
+hashm_spares[S]. To avoid complaints from smgr, the logical EOF as seen by
+the filesystem and smgr must always be greater than or equal to this page.
+We have to allow the case "greater than" because it's possible that during
+an index extension we crash after allocating filesystem space and before
+updating the metapage. Note that on filesystems that allow "holes" in
+files, it's entirely likely that pages before the logical EOF are not yet
+allocated: when we allocate a new splitpoint's worth of bucket pages, we
+physically zero the last such page to force the EOF up, and the first such
+page will be used immediately, but the intervening pages are not written
+until needed.
+
+Since overflow pages may be recycled if enough tuples are deleted from
+their bucket, we need a way to keep track of currently-free overflow
+pages. The state of each overflow page (0 = available, 1 = not available)
+is recorded in "bitmap" pages dedicated to this purpose. The entries in
+the bitmap are indexed by "bit number", a zero-based count in which every
+overflow page has a unique entry. We can convert between an overflow
+page's physical block number and its bit number using the information in
+hashm_spares[] (see hashovfl.c for details). The bit number sequence
+includes the bitmap pages, which is the reason for saying that bitmap
+pages are a subset of the overflow pages. It turns out in fact that each
+bitmap page's first bit represents itself --- this is not an essential
+property, but falls out of the fact that we only allocate another bitmap
+page when we really need one. Bit number zero always corresponds to block
+number 3, which is the first bitmap page and is allocated during index
+creation.
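+
+Within the bitmap pages themselves, locating the page and bit for a given bit
+number is simple division and remainder, roughly as below.  In this sketch
+bits_per_page stands in for the number of usable bits on one bitmap page, and
+hashm_mapp[] for a metapage array of bitmap-page block numbers; treat both
+names as assumptions for illustration only.
+
+    static void
+    sketch_locate_bit(HashMetaPage metap, uint32 bitno, uint32 bits_per_page,
+                      BlockNumber *bitmap_blkno, uint32 *bit_within_page)
+    {
+        *bitmap_blkno = metap->hashm_mapp[bitno / bits_per_page];
+        *bit_within_page = bitno % bits_per_page;
+    }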
+
+
+Lock definitions
+----------------
+
+We use both lmgr locks ("heavyweight" locks) and buffer context locks
+(LWLocks) to control access to a hash index. lmgr locks are needed for
+long-term locking since there is a (small) risk of deadlock, which we must
+be able to detect. Buffer context locks are used for short-term access
+control to individual pages of the index.
+
+We define the following lmgr locks for a hash index:
+
+LockPage(rel, 0) represents the right to modify the hash-code-to-bucket
+mapping. A process attempting to enlarge the hash table by splitting a
+bucket must exclusive-lock this lock before modifying the metapage data
+representing the mapping. Processes intending to access a particular
+bucket must share-lock this lock until they have acquired lock on the
+correct target bucket.
+
+LockPage(rel, page), where page is the page number of a hash bucket page,
+represents the right to split or compact an individual bucket. A process
+splitting a bucket must exclusive-lock both old and new halves of the
+bucket until it is done. A process doing VACUUM must exclusive-lock the
+bucket it is currently purging tuples from. Processes doing scans or
+insertions must share-lock the bucket they are scanning or inserting into.
+(It is okay to allow concurrent scans and insertions.)
+
+The lmgr lock IDs corresponding to overflow pages are currently unused.
+These are available for possible future refinements.
+
+Note that these lock definitions are conceptually distinct from any sort
+of lock on the pages whose numbers they share. A process must also obtain
+read or write buffer lock on the metapage or bucket page before accessing
+said page.
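+
+To make the distinction concrete, here is a sketch of reading one page of a
+bucket while holding both kinds of lock.  _hash_getbuf, _hash_relbuf, and
+_hash_droplock appear elsewhere in this patch; _hash_getlock is assumed here
+to be the unconditional counterpart of _hash_try_getlock, and HASH_SHARE and
+HASH_READ are assumed to be the share-mode lmgr and buffer lock levels.
+
+    static void
+    sketch_read_bucket_page(Relation rel, BlockNumber bucket_blkno)
+    {
+        Buffer      buf;
+
+        /* lmgr lock: prevents split/compaction of this bucket */
+        _hash_getlock(rel, bucket_blkno, HASH_SHARE);
+
+        /* buffer lock: short-term, guards against concurrent page updates */
+        buf = _hash_getbuf(rel, bucket_blkno, HASH_READ);
+
+        /* ... examine tuples on the page here ... */
+
+        _hash_relbuf(rel, buf);                         /* drop buffer lock and pin */
+        _hash_droplock(rel, bucket_blkno, HASH_SHARE);  /* drop lmgr lock */
+    }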
+
+Processes performing hash index scans must hold share lock on the bucket
+they are scanning throughout the scan. This seems to be essential, since
+there is no reasonable way for a scan to cope with its bucket being split
+underneath it. This creates a possibility of deadlock external to the
+hash index code, since a process holding one of these locks could block
+waiting for an unrelated lock held by another process. If that process
+then does something that requires exclusive lock on the bucket, we have
+deadlock. Therefore the bucket locks must be lmgr locks so that deadlock
+can be detected and recovered from. This also forces the page-zero lock
+to be an lmgr lock, because as we'll see below it is held while attempting
+to acquire a bucket lock, and so it could also participate in a deadlock.
+
+Processes must obtain read (share) buffer context lock on any hash index
+page while reading it, and write (exclusive) lock while modifying it.
+To prevent deadlock we enforce these coding rules: no buffer lock may be
+held long term (across index AM calls), nor may any buffer lock be held
+while waiting for an lmgr lock, nor may more than one buffer lock
+be held at a time by any one process. (The third restriction is probably
+stronger than necessary, but it makes the proof of no deadlock obvious.)
+
+
+Pseudocode algorithms
+---------------------
+
+The operations we need to support are: readers scanning the index for
+entries of a particular hash code (which by definition are all in the same
+bucket); insertion of a new tuple into the correct bucket; enlarging the
+hash table by splitting an existing bucket; and garbage collection
+(deletion of dead tuples and compaction of buckets). Bucket splitting is
+done at conclusion of any insertion that leaves the hash table more full
+than the target load factor, but it is convenient to consider it as an
+independent operation. Note that we do not have a bucket-merge operation
+--- the number of buckets never shrinks. Insertion, splitting, and
+garbage collection may all need access to freelist management, which keeps
+track of available overflow pages.
+
+The reader algorithm is:
+
+ share-lock page 0 (to prevent active split)
+ read/sharelock meta page
+ compute bucket number for target hash key
+ release meta page
+ share-lock bucket page (to prevent split/compact of this bucket)
+ release page 0 share-lock
+-- then, per read request:
+ read/sharelock current page of bucket
+ step to next page if necessary (no chaining of locks)
+ get tuple
+ release current page
+-- at scan shutdown:
+ release bucket share-lock
+
+By holding the page-zero lock until lock on the target bucket is obtained,
+the reader ensures that the target bucket calculation is valid (otherwise
+the bucket might be split before the reader arrives at it, and the target
+entries might go into the new bucket). Holding the bucket sharelock for
+the remainder of the scan prevents the reader's current-tuple pointer from
+being invalidated by other processes. Notice though that the reader need
+not prevent other buckets from being split or compacted.
+
+The insertion algorithm is rather similar:
+
+ share-lock page 0 (to prevent active split)
+ read/sharelock meta page
+ compute bucket number for target hash key
+ release meta page
+ share-lock bucket page (to prevent split/compact of this bucket)
+ release page 0 share-lock
+-- (so far same as reader)
+ read/exclusive-lock current page of bucket
+ if full, release, read/exclusive-lock next page; repeat as needed
+ >> see below if no space in any page of bucket
+ insert tuple
+ write/release current page
+ release bucket share-lock
+ read/exclusive-lock meta page
+ increment tuple count, decide if split needed
+ write/release meta page
+ done if no split needed, else enter Split algorithm below
+
+It is okay for an insertion to take place in a bucket that is being
+actively scanned, because it does not change the position of any existing
+item in the bucket, so scan states are not invalidated. We only need the
+short-term buffer locks to ensure that readers do not see a
+partially-updated page.
+
+It is clearly impossible for readers and inserters to deadlock, and in
+fact this algorithm allows them a very high degree of concurrency.
+(The exclusive metapage lock taken to update the tuple count is stronger
+than necessary, since readers do not care about the tuple count, but the
+lock is held for such a short time that this is probably not an issue.)
+
+When an inserter cannot find space in any existing page of a bucket, it
+must obtain an overflow page and add that page to the bucket's chain.
+Details of that part of the algorithm appear later.
+
+The page split algorithm is entered whenever an inserter observes that the
+index is overfull (has a higher-than-wanted ratio of tuples to buckets).
+The algorithm attempts, but does not necessarily succeed, to split one
+existing bucket in two, thereby lowering the fill ratio:
+
+ exclusive-lock page 0 (assert the right to begin a split)
+ read/exclusive-lock meta page
+ check split still needed
+ if split not needed anymore, drop locks and exit
+ decide which bucket to split
+ Attempt to X-lock old bucket number (definitely could fail)
+ Attempt to X-lock new bucket number (shouldn't fail, but...)
+ if above fail, drop locks and exit
+ update meta page to reflect new number of buckets
+ write/release meta page
+ release X-lock on page 0
+ -- now, accesses to all other buckets can proceed.
+ Perform actual split of bucket, moving tuples as needed
+ >> see below about acquiring needed extra space
+ Release X-locks of old and new buckets
+
+Note the page zero and metapage locks are not held while the actual tuple
+rearrangement is performed, so accesses to other buckets can proceed in
+parallel; in fact, it's possible for multiple bucket splits to proceed
+in parallel.
+
+Split's attempt to X-lock the old bucket number could fail if another
+process holds S-lock on it. We do not want to wait if that happens, first
+because we don't want to wait while holding the metapage exclusive-lock,
+and second because it could very easily result in deadlock. (The other
+process might be out of the hash AM altogether, and could do something
+that blocks on another lock this process holds; so even if the hash
+algorithm itself is deadlock-free, a user-induced deadlock could occur.)
+So, this is a conditional LockAcquire operation, and if it fails we just
+abandon the attempt to split. This is all right since the index is
+overfull but perfectly functional.  Every subsequent inserter that finds
+the index overfull will try again to split, and eventually one will
+succeed; once the index is no longer overfull, split attempts stop.
+(We could make a successful
+splitter loop to see if the index is still overfull, but it seems better to
+distribute the split overhead across successive insertions.)
+
+A problem is that if a split fails partway through (eg due to insufficient
+disk space) the index is left corrupt. The probability of that could be
+made quite low if we grab a free page or two before we update the meta
+page, but the only real solution is to treat a split as a WAL-loggable,
+must-complete action. I'm not planning to teach hash about WAL in this
+go-round.
+
+The fourth operation is garbage collection (bulk deletion):
+
+ next bucket := 0
+ read/sharelock meta page
+ fetch current max bucket number
+ release meta page
+ while next bucket <= max bucket do
+ Acquire X lock on target bucket
+ Scan and remove tuples, compact free space as needed
+ Release X lock
+ next bucket ++
+ end loop
+ exclusive-lock meta page
+ check if number of buckets changed
+ if so, release lock and return to for-each-bucket loop
+ else update metapage tuple count
+ write/release meta page
+
+Note that this is designed to allow concurrent splits. If a split occurs,
+tuples relocated into the new bucket will be visited twice by the scan,
+but that does no harm. (We must however be careful about the statistics
+reported by the VACUUM operation. What we can do is count the number of
+tuples scanned, and believe this in preference to the stored tuple count
+if the stored tuple count and number of buckets did *not* change at any
+time during the scan. This provides a way of correcting the stored tuple
+count if it gets out of sync for some reason. But if a split or insertion
+does occur concurrently, the scan count is untrustworthy; instead,
+subtract the number of tuples deleted from the stored tuple count and
+use that.)
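+
+A sketch of that tuple-count bookkeeping follows (variable names are
+illustrative; hashm_ntuples and hashm_maxbucket are the metapage fields
+involved):
+
+    static void
+    sketch_update_ntuples(HashMetaPage metap,
+                          uint32 orig_maxbucket, double orig_ntuples,
+                          double tuples_seen, double tuples_removed)
+    {
+        if (metap->hashm_maxbucket == orig_maxbucket &&
+            metap->hashm_ntuples == orig_ntuples)
+            metap->hashm_ntuples = tuples_seen;      /* scan count is trustworthy */
+        else
+            metap->hashm_ntuples -= tuples_removed;  /* fall back to adjustment */
+    }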
+
+The exclusive lock request could deadlock in some strange scenarios, but
+we can just error out without any great harm being done.
+
+
+Free space management
+---------------------
+
+(Question: why is this so complicated? Why not just have a linked list
+of free pages with the list head in the metapage? It's not like we
+avoid needing to modify the metapage with all this.)
+
+Free space management consists of two sub-algorithms, one for reserving
+an overflow page to add to a bucket chain, and one for returning an empty
+overflow page to the free pool.
+
+Obtaining an overflow page:
+
+ read/exclusive-lock meta page
+ determine next bitmap page number; if none, exit loop
+ release meta page lock
+ read/exclusive-lock bitmap page
+ search for a free page (zero bit in bitmap)
+ if found:
+ set bit in bitmap
+ write/release bitmap page
+ read/exclusive-lock meta page
+ if first-free-bit value did not change,
+ update it and write meta page
+ release meta page
+ return page number
+ else (not found):
+ release bitmap page
+ loop back to try next bitmap page, if any
+-- here when we have checked all bitmap pages; we hold meta excl. lock
+ extend index to add another overflow page; update meta information
+ write/release meta page
+ return page number
+
+It is slightly annoying to release and reacquire the metapage lock
+multiple times, but it seems best to do it that way to minimize loss of
+concurrency against processes just entering the index. We don't want
+to hold the metapage exclusive lock while reading in a bitmap page.
+(We can at least avoid repeated buffer pin/unpin here.)
+
+The normal path for extending the index does not require doing I/O while
+holding the metapage lock. We do have to do I/O when the extension
+requires adding a new bitmap page as well as the required overflow page
+... but that is an infrequent case, so the loss of concurrency seems
+acceptable.
+
+The portion of tuple insertion that calls the above subroutine looks
+like this:
+
+ -- having determined that no space is free in the target bucket:
+ remember last page of bucket, drop write lock on it
+ call free-page-acquire routine
+ re-write-lock last page of bucket
+ if it is not last anymore, step to the last page
+ update (former) last page to point to new page
+ write-lock and initialize new page, with back link to former last page
+ write and release former last page
+ insert tuple into new page
+ -- etc.
+
+Notice this handles the case where two concurrent inserters try to extend
+the same bucket. They will end up with a valid, though perhaps
+space-inefficient, configuration: two overflow pages will be added to the
+bucket, each containing one tuple.
+
+The last part of this violates the rule about holding write lock on two
+pages concurrently, but it should be okay to write-lock the previously
+free page; there can be no other process holding lock on it.
+
+Bucket splitting uses a similar algorithm if it has to extend the new
+bucket, but it need not worry about concurrent extension since it has
+exclusive lock on the new bucket.
+
+Freeing an overflow page is done by garbage collection and by bucket
+splitting (the old bucket may contain no-longer-needed overflow pages).
+In both cases, the process holds exclusive lock on the containing bucket,
+so need not worry about other accessors of pages in the bucket. The
+algorithm is:
+
+ delink overflow page from bucket chain
+ (this requires read/update/write/release of fore and aft siblings)
+ read/share-lock meta page
+ determine which bitmap page contains the free space bit for page
+ release meta page
+ read/exclusive-lock bitmap page
+ update bitmap bit
+ write/release bitmap page
+ if page number is less than what we saw as first-free-bit in meta:
+ read/exclusive-lock meta page
+ if page number is still less than first-free-bit,
+ update first-free-bit field and write meta page
+ release meta page
+
+We have to do it this way because we must clear the bitmap bit before
+changing the first-free-bit field (hashm_firstfree). It is possible that
+we set first-free-bit too small (because someone has already reused the
+page we just freed), but that is okay; the only cost is the next overflow
+page acquirer will scan more bitmap bits than he needs to. What must be
+avoided is having first-free-bit greater than the actual first free bit,
+because then that free page would never be found by searchers.
+
+All the freespace operations should be called while holding no buffer
+locks. Since they need no lmgr locks, deadlock is not possible.
+
+
+Other notes
+-----------
+
+All the shenanigans with locking prevent a split occurring while *another*
+process is stopped in a given bucket. They do not ensure that one of
+our *own* backend's scans is not stopped in the bucket, because lmgr
+doesn't consider a process's own locks to conflict. So the Split
+algorithm must check for that case separately before deciding it can go
+ahead with the split. VACUUM does not have this problem since nothing
+else can be happening within the vacuuming backend.
+
+Should we instead try to fix the state of any conflicting local scan?
+Seems mighty ugly --- got to move the held bucket S-lock as well as lots
+of other messiness. For now, just punt and don't split.
#include "utils/lsyscache.h"
-static BlockNumber _hash_alloc_buckets(Relation rel, uint32 nblocks);
+static bool _hash_alloc_buckets(Relation rel, BlockNumber firstblock,
+ uint32 nblocks);
static void _hash_splitbucket(Relation rel, Buffer metabuf,
Bucket obucket, Bucket nbucket,
BlockNumber start_oblkno,
* requested buffer and its reference count has been incremented
* (ie, the buffer is "locked and pinned").
*
- * blkno == P_NEW is allowed, but it is caller's responsibility to
- * ensure that only one process can extend the index at a time.
+ * P_NEW is disallowed because this routine should only be used
+ * to access pages that are known to be before the filesystem EOF.
+ * Extending the index should be done with _hash_getnewbuf.
*/
Buffer
_hash_getbuf(Relation rel, BlockNumber blkno, int access)
{
Buffer buf;
+ if (blkno == P_NEW)
+ elog(ERROR, "hash AM does not use P_NEW");
+
buf = ReadBuffer(rel, blkno);
 	if (access != HASH_NOLOCK)
 		LockBuffer(buf, access);
 
 	/* ref count and lock type are correct */
 	return buf;
}
+/*
+ * _hash_getnewbuf() -- Get a new page at the end of the index.
+ *
+ * This has the same API as _hash_getbuf, except that we are adding
+ * a page to the index, and hence expect the page to be past the
+ * logical EOF. (However, we have to support the case where it isn't,
+ * since a prior try might have crashed after extending the filesystem
+ * EOF but before updating the metapage to reflect the added page.)
+ *
+ * It is caller's responsibility to ensure that only one process can
+ * extend the index at a time.
+ *
+ * All call sites should call _hash_pageinit on the returned page.
+ * Also, it's difficult to imagine why access would not be HASH_WRITE.
+ */
+Buffer
+_hash_getnewbuf(Relation rel, BlockNumber blkno, int access)
+{
+ BlockNumber nblocks = RelationGetNumberOfBlocks(rel);
+ Buffer buf;
+
+ if (blkno == P_NEW)
+ elog(ERROR, "hash AM does not use P_NEW");
+ if (blkno > nblocks)
+ elog(ERROR, "access to noncontiguous page in hash index \"%s\"",
+ RelationGetRelationName(rel));
+
+ /* smgr insists we use P_NEW to extend the relation */
+ if (blkno == nblocks)
+ {
+ buf = ReadBuffer(rel, P_NEW);
+ if (BufferGetBlockNumber(buf) != blkno)
+ elog(ERROR, "unexpected hash relation size: %u, should be %u",
+ BufferGetBlockNumber(buf), blkno);
+ }
+ else
+ buf = ReadBuffer(rel, blkno);
+
+ if (access != HASH_NOLOCK)
+ LockBuffer(buf, access);
+
+ /* ref count and lock type are correct */
+ return buf;
+}
+
/*
* _hash_relbuf() -- release a locked buffer.
*
/*
* We initialize the metapage, the first two bucket pages, and the
- * first bitmap page in sequence, using P_NEW to cause smgrextend()
- * calls to occur. This ensures that the smgr level has the right
- * idea of the physical index length.
+ * first bitmap page in sequence, using _hash_getnewbuf to cause
+ * smgrextend() calls to occur. This ensures that the smgr level
+ * has the right idea of the physical index length.
*/
- metabuf = _hash_getbuf(rel, P_NEW, HASH_WRITE);
- Assert(BufferGetBlockNumber(metabuf) == HASH_METAPAGE);
+ metabuf = _hash_getnewbuf(rel, HASH_METAPAGE, HASH_WRITE);
pg = BufferGetPage(metabuf);
_hash_pageinit(pg, BufferGetPageSize(metabuf));
*/
for (i = 0; i <= 1; i++)
{
- buf = _hash_getbuf(rel, P_NEW, HASH_WRITE);
- Assert(BufferGetBlockNumber(buf) == BUCKET_TO_BLKNO(metap, i));
+ buf = _hash_getnewbuf(rel, BUCKET_TO_BLKNO(metap, i), HASH_WRITE);
pg = BufferGetPage(buf);
_hash_pageinit(pg, BufferGetPageSize(buf));
pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
Bucket old_bucket;
Bucket new_bucket;
uint32 spare_ndx;
- BlockNumber firstblock = InvalidBlockNumber;
BlockNumber start_oblkno;
BlockNumber start_nblkno;
uint32 maxbucket;
if (metap->hashm_maxbucket >= (uint32) 0x7FFFFFFE)
goto fail;
- /*
- * If the split point is increasing (hashm_maxbucket's log base 2
- * increases), we need to allocate a new batch of bucket pages.
- */
- new_bucket = metap->hashm_maxbucket + 1;
- spare_ndx = _hash_log2(new_bucket + 1);
- if (spare_ndx > metap->hashm_ovflpoint)
- {
- Assert(spare_ndx == metap->hashm_ovflpoint + 1);
- /*
- * The number of buckets in the new splitpoint is equal to the
- * total number already in existence, i.e. new_bucket. Currently
- * this maps one-to-one to blocks required, but someday we may need
- * a more complicated calculation here.
- */
- firstblock = _hash_alloc_buckets(rel, new_bucket);
- if (firstblock == InvalidBlockNumber)
- goto fail; /* can't split due to BlockNumber overflow */
- }
-
/*
* Determine which bucket is to be split, and attempt to lock the old
* bucket. If we can't get the lock, give up.
*
* The lock protects us against other backends, but not against our own
* backend. Must check for active scans separately.
- *
- * Ideally we would lock the new bucket too before proceeding, but if
- * we are about to cross a splitpoint then the BUCKET_TO_BLKNO mapping
- * isn't correct yet. For simplicity we update the metapage first and
- * then lock. This should be okay because no one else should be trying
- * to lock the new bucket yet...
*/
+ new_bucket = metap->hashm_maxbucket + 1;
+
old_bucket = (new_bucket & metap->hashm_lowmask);
start_oblkno = BUCKET_TO_BLKNO(metap, old_bucket);
if (!_hash_try_getlock(rel, start_oblkno, HASH_EXCLUSIVE))
goto fail;
+ /*
+ * Likewise lock the new bucket (should never fail).
+ *
+ * Note: it is safe to compute the new bucket's blkno here, even though
+ * we may still need to update the BUCKET_TO_BLKNO mapping. This is
+ * because the current value of hashm_spares[hashm_ovflpoint] correctly
+ * shows where we are going to put a new splitpoint's worth of buckets.
+ */
+ start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
+
+ if (_hash_has_active_scan(rel, new_bucket))
+ elog(ERROR, "scan in progress on supposedly new bucket");
+
+ if (!_hash_try_getlock(rel, start_nblkno, HASH_EXCLUSIVE))
+ elog(ERROR, "could not get lock on supposedly new bucket");
+
+ /*
+ * If the split point is increasing (hashm_maxbucket's log base 2
+ * increases), we need to allocate a new batch of bucket pages.
+ */
+ spare_ndx = _hash_log2(new_bucket + 1);
+ if (spare_ndx > metap->hashm_ovflpoint)
+ {
+ Assert(spare_ndx == metap->hashm_ovflpoint + 1);
+ /*
+ * The number of buckets in the new splitpoint is equal to the
+ * total number already in existence, i.e. new_bucket. Currently
+ * this maps one-to-one to blocks required, but someday we may need
+ * a more complicated calculation here.
+ */
+ if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
+ {
+ /* can't split due to BlockNumber overflow */
+ _hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
+ _hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
+ goto fail;
+ }
+ }
+
/*
* Okay to proceed with split. Update the metapage bucket mapping info.
*/
metap->hashm_ovflpoint = spare_ndx;
}
- /* now we can compute the new bucket's primary block number */
- start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket);
-
- /* if we added a splitpoint, should match result of _hash_alloc_buckets */
- if (firstblock != InvalidBlockNumber &&
- firstblock != start_nblkno)
- elog(PANIC, "unexpected hash relation size: %u, should be %u",
- firstblock, start_nblkno);
-
- Assert(!_hash_has_active_scan(rel, new_bucket));
-
- if (!_hash_try_getlock(rel, start_nblkno, HASH_EXCLUSIVE))
- elog(PANIC, "could not get lock on supposedly new bucket");
-
/*
* Copy bucket mapping info now; this saves re-accessing the meta page
* inside _hash_splitbucket's inner loop. Note that once we drop the
* for the purpose. OTOH, adding a splitpoint is a very infrequent operation,
* so it may not be worth worrying about.
*
- * Returns the first block number in the new splitpoint's range, or
- * InvalidBlockNumber if allocation failed due to BlockNumber overflow.
+ * Returns TRUE if successful, or FALSE if allocation failed due to
+ * BlockNumber overflow.
*/
-static BlockNumber
-_hash_alloc_buckets(Relation rel, uint32 nblocks)
+static bool
+_hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
{
- BlockNumber firstblock;
BlockNumber lastblock;
BlockNumber endblock;
char zerobuf[BLCKSZ];
- /*
- * Since we hold metapage lock, no one else is either splitting or
- * allocating a new page in _hash_getovflpage(); hence it's safe to
- * assume that the relation length isn't changing under us.
- */
- firstblock = RelationGetNumberOfBlocks(rel);
lastblock = firstblock + nblocks - 1;
/*
* extend the index anymore.
*/
if (lastblock < firstblock || lastblock == InvalidBlockNumber)
- return InvalidBlockNumber;
-
- /* Note: we assume RelationGetNumberOfBlocks did RelationOpenSmgr for us */
+ return false;
MemSet(zerobuf, 0, sizeof(zerobuf));
rel->rd_nblocks = lastblock+1;
- return firstblock;
+ return true;
}