diff --git a/_posts/2025-01-06-ibdNinja.md b/_posts/2025-01-06-ibdNinja.md new file mode 100644 index 0000000..2e85b51 --- /dev/null +++ b/_posts/2025-01-06-ibdNinja.md @@ -0,0 +1,327 @@ +--- +layout: post +title: 'ibdNinja: A powerful tool for parsing and analyzing MySQL 8.0 (.ibd) data files' +--- + + + + +MySQL 8.0 introduced **instant add column** and **instant drop column** operations, allowing multiple data formats to exist within the same table. This added complexity makes developing external tools for record-level parsing and analysis of MySQL `.ibd` files more complex compared to versions 5.6 and 5.7. Additionally, MySQL Server lacks robust mechanisms for detailed table usage statistics, such as accurately determining the total size and proportion of valid data in a specific index or the total size of dropped columns hidden within a primary index. + +Taking advantage of the Christmas holiday, I developed **ibdNinja**, a C++ tool for multidimensional and fine-grained analysis of `.ibd` files. I hope it aids MySQL enthusiasts in understanding internals better and provides DBAs with advanced statistical insights. + +**Github**: [ibdNinja](https://github.com/KernelMaker/ibdNinja) + +--- + + + +**Contents** + +**1. Key Features of ibdNinja** + +**2. Examples of ibdNinja Usage** + +**3. Highlight: Parsing Records with Instant Add/Drop Columns** + +**4. Limitations** + + + +# 1. Key Features of ibdNinja + +### 1. Parsing SDI Metadata + +Extracts and analyzes the dictionary information of all tables and indexes contained in an ibd file from its SDI (Serialized Dictionary Information). + +### 2. Dynamic Parsing of Records Across Multiple Table Definition Versions * + +With the parsed dictionary information, ibdNinja supports parsing and printing **any record** from **any page** of **any index\*** in **any table\*** (supporting all column types). +**Moreover, it can dynamically adapt to parse records in tables with multiple coexisting schema versions caused by repeated `instant add column` and `instant drop column` operations.** + +***Detailed explanations and examples are provided in*** [Section 3](#third-section) + + +### 3. Multi-Dimensional Data Analysis + +Powered by its record parsing capabilities, ibdNinja enables comprehensive data analysis across multiple levels, including Record, Page, Index, and Table levels. It computes and presents multi-dimensional statistics: + +**Record Level:** + +- Total size of the record (header + body), the number of fields, and whether the record contains a deleted mark. +- Hexadecimal content of the header. +- Detailed information for each field (including user-defined columns, system columns, and instant added/dropped columns), such as: + - Field name + - Field size in bytes + - Field type + - **Hexadecimal content of the field value** + +**Page Level:** + +- The number of valid records, their total size, and the percentage of page space they occupy. +- The count of records containing `instant dropped columns` and the size and page space percentage of these dropped but still allocated columns. +- The count, total size, and page space percentage of records marked as deleted. +- The space utilized internally by InnoDB (e.g., page header, **record headers**, page directory), along with its percentage of the page. +- The size and percentage of free space within the page. + +**Index Level:** + +- For a specific index, analyzes and aggregates statistics for all its pages starting from the root page. 
+- Statistics are presented separately for non-leaf levels and leaf levels, similar to the statistics provided at the page level. + +**Table Level:** + +- For a given table, starts from its primary index and analyzes each index to display its statistics + +### 4. Printing Leftmost Pages of Each Index Level: + +Allows users to print the leftmost page number of each level for a specified index, making it easier to manually traverse and print every record in the index page by page. + +### 5. [TODO] Repairing Corrupted ibd Files + +With ibdNinja's capability to parse records, it is possible to address ibd files with corrupted index pages. By removing damaged records from pages or excluding corrupted pages from indexes, the tool can attempt to recover the file to the greatest extent possible. + + +# 2. Examples of ibdNinja Usage + +**Compiling is straightforward—just run `make` in the current directory.** + +### 1. Display Help Information (`--help`, `-h`) + +image-1 + +### 2. List Tables and Indexes in the ibd File (`--list-tables`, `-l`) + +Using the system tablespace file **mysql.ibd** as an example, after specifying the file with the `--file` or `-f` option, the output provides: + +image-2 + +1. A summary of the ibd file, including the number of tables and indexes successfully parsed and loaded from the data dictionary. +2. The table IDs and names of all tables in the file. +3. For each table, all index IDs, root page numbers, and index names. + +With this information, you can explore the ibd file further using other commands. + +### 3. Parse and Print a Specific Page (`--parse-page`, `-p PAGE_ID`) + +Continuing with **mysql.ibd** as an example, let’s parse the root page of the `PRIMARY` index for the `mysql.innodb_index_stats` table (its root page number is 7, as shown in the previous example). + +Run the following command: + +``` +./ibdNinja -f ../innodb-run/mysqld/data/mysql.ibd -p 7 +``` + +The output consists of three parts: + +1. **Page Summary:** Information such as sibling page numbers (left and right), the index the page belongs to, the page level, etc. + +image-3 + +2. **Record Details:** For each record in the page, details like: + +image-4 + +- Total length of the record (header + body), field count, and whether it has a delete mark. +- A hexadecimal dump of the record header. +- Detailed information for each field (e.g., name, length, type, and the hexadecimal value). + +### 3. Page Analysis Summary: + + Includes statistics such as: + +image-5 + +- Number, total size, and space usage percentage of valid records. +- Number and size of records with `instant dropped columns`, as well as their space usage percentage. +- Number and size of delete-marked records, and their space usage percentage. +- Space used by InnoDB internal components (e.g., page headers), along with their percentages. +- Free space size and percentage. + +### 4. Analyze a Specific Index (`--analyze-index`, `-i INDEX_ID`) + +Using **mysql.ibd** again, first obtain the table and index information using the `--list-tables` (`-l`) command. + +image-6 + +For example, the `mysql.tables` table has an ID of 29 and contains 10 indexes. To analyze the `PRIMARY` index (ID 78), run: + +``` +./ibdNinja -f ../innodb-run/mysqld/data/mysql.ibd -i 78 +``` + +image-7 + +ibdNinja traverses the `PRIMARY` index from its root page, analyzing it level by level and page by page, then summarizes the statistics: + +1. **Overview:** Includes the index name, number of levels, and number of pages. +2. 
**Non-Leaf Levels Statistics:** Provides page count, record count, and various space usage details. +3. **Leaf Level Statistics:** Similar to the above, but specific to the leaf level. + +### 5. Analyze a Specific Table (`--analyze-table`, `-t TABLE_ID`) + +Using **mysql.ibd** again, first run the `--list-tables` (`-l`) command to get table and index information. +For the `mysql.tables` table with an ID of 29, execute: + +``` +./ibdNinja -f ../innodb-run/mysqld/data/mysql.ibd -t 29 +``` + +This command analyzes all 10 indexes of the `mysql.tables` table and outputs their statistics. Each index's structure is similar to the output of `--analyze-index`. + +### 6. List the Leftmost Page Number for Each Level of an Index (`--list-leafmost-pages`, `-e INDEX_ID`) + +Continuing with the **mysql.ibd** example, the `PRIMARY` index of the `mysql.tables` table has an ID of 78. +Run the following command: + +``` +./ibdNinja -f ../innodb-run/mysqld/data/mysql.ibd -e 78 +``` + +image-8 + +The output shows the leftmost page number for each level of the index. For example: + +- Level 1 (non-leaf) has a leftmost page number of 82. +- Level 0 (leaf level) has a leftmost page number of 161. + +You can then use the `--parse-page` (`-p PAGE_NO`) command to print detailed information for these pages. From the sibling page numbers, you can continue parsing the left and right pages to traverse the entire index. + +***Note:*** To skip printing record details for a page (e.g., to avoid excessive output), use the `--no-print-record` (`-n`) option along with `-p`, as in:`-p 161 -n` + + + +# 3. Highlight: Parsing Records with Instant Add/Drop Columns + +### Table Setup + +We start by creating a table: + +``` +CREATE TABLE `ninja_tbl` ( + `col_uint` int unsigned NOT NULL, + `col_datetime_0` datetime DEFAULT NULL, + `col_varchar` varchar(10) DEFAULT NULL, + PRIMARY KEY (`col_uint`) +) ENGINE=InnoDB; +``` + +Based on the current table definition (V1), we insert one record: + +``` +INSERT INTO ninja_tbl values (1, NOW(), "Row_V1"); +``` + +Next, we use `ALTER TABLE` to add two columns to the table: + +``` +ALTER TABLE ninja_tbl ADD COLUMN col_datetime_6 datetime(6); +ALTER TABLE ninja_tbl ADD COLUMN col_char char(10) DEFAULT "abc"; +``` + +Based on the updated table definition (V2), we insert another record: + +``` +INSERT INTO ninja_tbl values (2, NOW(), "Row_V2", NOW(), "ibdNinja"); +``` + +Then, we drop two columns from the table: + +``` +ALTER TABLE ninja_tbl DROP COLUMN col_varchar; +ALTER TABLE ninja_tbl DROP COLUMN col_char; +``` + +Finally, based on the updated table definition (V3), we insert a third record: + +``` +INSERT INTO ninja_tbl values (3, NOW(), NOW()); +``` + +### Parsing Records with ibdNinja + +Through the operations above, we constructed three different table definitions (V1, V2, V3) and inserted one record for each version. Now, let’s use ibdNinja to parse these three records. Since there are only three records, the primary key index of `ninja_tbl` must fit into a single page (root number 4). We can directly use the `-p` command to parse this page. Here, we skip most of the output and focus on the parsed records: + +1. **Record 1:** + + image-9 + + - **FIELD 1 (col_uint):** The value is `1`, inserted under table definition V1. + - **FIELD 5 (col_varchar):** This field was defined in V1 and is part of record 1. Its value is present but marked as `!hidden!_dropped_v3_p4_col_varchar` because the column was instantly dropped in V3. Although hidden from queries, the data remains in the page. 
+ - **FIELD 6 (col_datetime_6):** Added in V2, this field has no value in record 1, as it did not exist when the record was inserted (length is 0). + - **FIELD 7 (col_char):** Also added in V2 and dropped in V3, this field has no value in record 1 for the same reason. + +2. **Record 2:** + + image-10 + + - **FIELD 1 (col_uint):** The value is `2`, inserted under table definition V2. + - **FIELD 5 (col_varchar):** This column was defined in V1 and dropped in V3. Since record 2 was inserted before the drop, it still contains a value. + - **FIELD 6 (col_datetime_6):** Added in V2, this field contains a value for record 2. + - **FIELD 7 (col_char):** Added in V2 and dropped in V3, this field also contains a value for record 2. + +3. **Record 3:** + + image-11 + + - **FIELD 1 (col_uint):** The value is `3`, inserted under table definition V3. + - **FIELD 5 and FIELD 7:** Both fields were dropped in V3. Since record 3 was inserted after the drop, these fields are empty. + +### Page Analysis Results + +The page analysis output highlights the following key details: + +image-12 + +- As shown in the red box of the analysis, two records in the page still contain old values for columns that were dropped. +- The analysis shows the total size and percentage of space wasted due to these dropped columns. + +This information helps to quantify the space overhead caused by instant column drops. + +Similarly, if the page contains deleted-marked records, their size and percentage are also displayed. + +These statistics are not only available at the page level but can also be aggregated at the index level using the `--analyze-index` (`-i INDEX_ID`) command. + +# 4. Limitations + +This is the first version I developed during the Christmas holiday, so there are some functional limitations and potential bugs (feel free to raise issues): + +1. **Supported MySQL Versions**: + + Currently supports MySQL 8.0 (8.0.16 - 8.0.40). + + *(Earlier versions of MySQL 8.0, prior to 8.0.16, contain a bug in SDI generation that leads to missing metadata in `dd_object::indexes::elements`.)* + +3. **Supported Platforms**: + + Currently supports Linux and macOS. + + *(I don't have Windows.)* + +5. **Functional Limitations**: + + **Tablespace:** + - Encrypted tablespaces are not yet supported. + + **Table:** + - Tables using table compression or page compression are not yet supported. + - Encrypted tables are not yet supported. + - Partition tables are not yet supported. + - Auxiliary and common index tables of FTS are not yet supported. + + **Index:** + - Full-text indexes are not yet supported (only `FTS_DOC_ID_INDEX` is supported). + - Spatial indexes are not yet supported. + - Indexes using virtual columns as key columns are not yet supported. + + **Page:** + - Only `INDEX` pages (those in B+Tree) are currently supported. + + **Record:** + - Records in the `redundant` row format are not yet supported. + +*Note: The analysis in ibdNinja is currently based on the InnoDB data pages written to the ibd file. Pages in the redo log that have not yet been flushed to the ibd file are not included in the statistics. In scenarios with a large number of dirty pages, the analysis results may have some deviations or errors.* + + + +Special thanks to [MySQL](https://github.com/mysql/mysql-server) for being an invaluable reference in developing ibdNinja. 
diff --git a/_posts/2025-01-14-ibdNinja-in-action.md b/_posts/2025-01-14-ibdNinja-in-action.md new file mode 100644 index 0000000..e440458 --- /dev/null +++ b/_posts/2025-01-14-ibdNinja-in-action.md @@ -0,0 +1,221 @@ +--- +layout: post +title: 'Dissecting an Extreme Sparse MySQL Table Using ibdNinja' +--- + +I previously wrote a blog post to introduce the concept of MySQL B+ Tree splits and explained how to deliberately construct a highly sparse table by hacking the current rules. Then in the next post, I introduced **ibdNinja**, a tool I developed to parse the MySQL data file. In this follow-up post, I demonstrate how to use **ibdNinja** to analyze and verify this table's data file. + +This post combines two topics: + +1. Constructing a highly sparse InnoDB table. + + *Reference:* [Hack MySQL: How to Create an Extremely Sparse InnoDB Table (B+Tree)](https://kernelmaker.github.io/Hack-InnoDB) + +2. Using [ibdNinja](https://github.com/KernelMaker/ibdNinja) to analyze the table's data file and identify interesting findings. + + *Reference:*[ibdNinja: A powerful tool for parsing and analyzing MySQL 8.0 (.ibd) data files](https://kernelmaker.github.io/ibdNinja) + +## 1. Constructing a Highly Sparse Table + +By running the following SQL file using the `source` command in the MySQL client, you can create a highly sparse InnoDB table called `tbl`: + +``` +mysql> source /PATH_TO_THIS_FILE/load.sql +``` + +**load.sql:** + +``` +-- Create the database and table +CREATE DATABASE IF NOT EXISTS test; +USE test; + +DROP TABLE IF EXISTS tbl; +CREATE TABLE tbl ( + pk INT NOT NULL, + val CHAR(240) NOT NULL DEFAULT 'a', + PRIMARY KEY (pk) +) ENGINE=InnoDB CHARSET=LATIN1; + +-- Create the stored procedure +DELIMITER // + +CREATE PROCEDURE PopulateTable() +BEGIN + DECLARE i INT; + + -- Insert initial values from 1 to 32 + SET i = 1; + WHILE i <= 32 DO + INSERT INTO test.tbl (pk, val) VALUES (i, 'a'); + SET i = i + 1; + END WHILE; + + -- Insert values from 10000001 to 10000027 + SET i = 10000001; + WHILE i <= 10000027 DO + INSERT INTO test.tbl (pk, val) VALUES (i, 'a'); + SET i = i + 1; + END WHILE; + -- The first leaf page splits into 2 leaf pages: + -- [1 ... 29],[30, 31, 32, 10000001 ... 10000027] + + -- Insert values from 10000028 to 10000054 + SET i = 10000028; + WHILE i <= 10000054 DO + INSERT INTO test.tbl (pk, val) VALUES (i, 'a'); + SET i = i + 1; + END WHILE; + + -- Insert alternating values starting from 33 to 1000000 + SET i = 33; + WHILE i < 100000 DO + INSERT INTO test.tbl (pk, val) VALUES (i + 1, 'a'); + INSERT INTO test.tbl (pk, val) VALUES (i, 'a'); + SET i = i + 2; + END WHILE; +END // + +DELIMITER ; + +-- Call the procedure +CALL PopulateTable(); +``` + +After executing the script, you will have created the table `tbl` with **100,054 records**, each having a size of **262 bytes**. The resulting B+ Tree structure will be highly sparse, as illustrated in the following diagram: + +image-1 + +### Expected B+ Tree Structure + +- **Leaf Level**: The table's B+ Tree uses **49,986 leaf pages**, requiring **49,986 × 16KB = 781MB** of space to store only **24MB** of actual records (100,054 × 262 bytes). + +## 2. Using ibdNinja to Analyze and Verify the Table + +### 1️⃣ **Verifying that the Leftmost Page at Level 0 Contains 29 Records** + +First, run the following command to get an overview of the `tbl.ibd` data file: + +``` +./ibdNinja -f ../innodb-run/mysqld/data/test/tbl.ibd +``` + +image-2 + +From the output, you can see that the table contains a B+ Tree index with **Index ID = 547**. 
+ +Next, run the following command to get the number of levels in the B+ Tree and the `page ID` of the leftmost page at each level: + +``` +./ibdNinja -f ../innodb-run/mysqld/data/test/tbl.ibd -e 547 +``` + +image-3 + +The output will show that the **leftmost page ID at Level 0 (leaf level) is 5**. Let's analyze this page: + +``` +./ibdNinja -f ../innodb-run/mysqld/data/test/tbl.ibd -p 5 +``` + +The tool will display detailed information about all the records in Page 5. Let's take a look at the last record as an example: + +image-4 + +- **Record Size**: 262 bytes (including a 5-byte header and 257-byte body), which matches our expectation. + +Finally, the tool will print a summary of Page 5: + +image-5 + +- **Record Count**: 29 records +- **Space Utilization**: 46.37% + +This confirms our expectation that the leftmost page contains 29 records. + +### 2️⃣ **Verifying that Subsequent Pages Contain Only 2 Records** + +We expect that, starting from the second page, each page in the leaf level will contain only **2 records**, making the tree extremely sparse. Let's verify this. + +From the analysis of Page 5, we know that its **next page ID is 7**. + +image-6 + +Let's analyze Page 7: + +``` +./ibdNinja -f ../innodb-run/mysqld/data/test/tbl.ibd -p 7 +``` + +image-7 + +The tool shows that Page 7 indeed contains only **2 records**, with a **space utilization of just 3.20%**, confirming our expectation. + +### 3️⃣ **Verifying the Overall Sparsity of the B+ Tree** + +To analyze the entire B+ Tree, run the following command: + +``` +./ibdNinja -f ../innodb-run/mysqld/data/test/tbl.ibd -i 547 +``` + +The tool will print a summary: + +image-8 + +- **Number of Levels**: 3 +- **Number of Non-Leaf Pages**: 44 +- **Number of Leaf Pages**: 49,986 + +The tool then provides detailed statistics for both non-leaf and leaf levels: + +image-9 + +- **Non-Leaf Page Utilization**: 90.22% (not sparse) +- **Leaf Page Utilization**: 3.20% (extremely sparse) + +This confirms that the B+ Tree is highly sparse at the leaf level. + +### 4️⃣ **Making the Situation Even Worse** + +The leaf level currently has a utilization rate of only **3.20%**, which is already very low. But we can make it worse by dropping a column from the table. + +Since the `val` column (a `CHAR(240)`) occupies **240 bytes** of each record, we can drop this column using **Instant Drop Column**. The space previously occupied by this column remains allocated but becomes unusable, creating unutilizable gaps in the pages + +To drop the column, execute the following command in the MySQL client: + +``` +ALTER TABLE tbl DROP COLUMN val; +``` + +Then, reanalyze the B+ Tree using the `ibdNinja` tool: + +``` +./ibdNinja -f ../innodb-run/mysqld/data/test/tbl.ibd -i 547 +``` + +image-10 + +The analysis shows that **2.93%** of the space in the index is occupied by dropped columns, which cannot be reused. The free space ratio remains unchanged. + +### 5️⃣ **Releasing Space with OPTIMIZE TABLE** + +To reclaim the wasted space, run the following command in the MySQL client: + +``` +OPTIMIZE TABLE tbl; +``` + +After running this command, analyze the B+ Tree again (note that the **Index ID will change**): + +``` +./ibdNinja -f ../innodb-run/mysqld/data/test/tbl.ibd -i 548 +``` + +image-11 + +The leaf level's size is reduced from **781MB to 2MB**, with a space utilization of **90%**. The B+ Tree is no longer sparse. + + + +The **ibdNinja** tool provides a practical way to explore `.ibd` files and gain insights into InnoDB's internal structures. 
It can be helpful for understanding MySQL's storage engine and troubleshooting data-related issues. Whether you're exploring InnoDB for learning or operational purposes, this tool offers assistance. \ No newline at end of file diff --git a/_posts/2025-02-28-MySQL-implicit-locks.md b/_posts/2025-02-28-MySQL-implicit-locks.md new file mode 100644 index 0000000..23c8ed6 --- /dev/null +++ b/_posts/2025-02-28-MySQL-implicit-locks.md @@ -0,0 +1,209 @@ +--- +layout: post +title: 'Deep Dive into MySQL - Implicit Locks' +--- + +The previous 2 blog posts introduced MySQL's locking mechanisms for insert and locking read operations. + +[Deep Dive into MySQL - Transaction lock - PART 1](https://kernelmaker.github.io/MySQL-Lock-1) + +[Deep Dive into MySQL - Transaction lock - PART 2](https://kernelmaker.github.io/MySQL-Lock-2) + +In this post, we will discuss an optimization for locking during insert operations: implicit locks. We will cover their implementation, benefits, and the trade-offs involved. + +**Table of Contents** + +1. What is an Implicit Lock? +2. Implicit Lock Determination for Primary Key Records +3. Implicit Lock Determination for Secondary Index Records ⭐️ +4. Worst-Case Scenario for Secondary Index Record Implicit Lock Determination ⭐️ + +### 1. What is an Implicit Lock? + +When a MySQL transaction inserts a record, it holds a row lock on that record until the transaction is committed. But how is this row lock implemented? + +Each record in a B+Tree can be located using ``, where `page id` consists of ``, identifying the data page within the data file, and `heap no` represents the record's position within that page. + +MySQL maintains a global lock hash table for transaction locks. The hash key is the `page id`, and the hash value contains all locked records on that page. Locks for different records within the same page are organized into a linked list and distinguished using `heap no`. + +Acquiring a row lock involves inserting a record into the lock hash table. However, before inserting into the hash, the hash itself must be locked to ensure consistency. This introduces a performance bottleneck in high-concurrency insert scenarios, even when there are no conflicts between inserted records. + +As illustrated below: +image-1 +In this scenario, multiple concurrent transactions perform simple insert operations without conflicts. However, since each insert requires a row lock, they all contend for the global lock hash table, creating a performance bottleneck. (Partitioned lock hashes can alleviate this to some extent, but the problem remains.) + +To address this issue, MySQL introduced **implicit locks**. Instead of registering row locks in the global lock hash table, insert operations leverage the record itself to express the lock state indirectly. This prevents concurrent insert operations from competing for the lock hash table. These implicit locks are automatically released upon transaction commit. + +If another transaction accesses the inserted record before the transaction commits (e.g., through a locking read), it checks whether the record has an implicit lock. If an implicit lock is detected, this transaction converts it into an explicit lock, registers it in the lock hash table, and waits if necessary. If no implicit lock is found, it finally checks the lock hash table to determine if the record is locked. + +Now, let's explore how implicit locks are implemented in detail. + +### 2. 
Implicit Lock Determination for Primary Key Records + +The implementation of implicit locks for primary key records is straightforward. Each primary key record includes a hidden column, `DATA_TRX_ID`, which stores the transaction ID of the last operation (for inserts, this is the inserting transaction's ID). Implicit locks leverage this transaction ID. + +Each transaction is assigned a unique transaction ID when it starts. The `DATA_TRX_ID` of all inserted or modified records is set to this transaction ID. Additionally, MySQL maintains an **active transaction list**, which tracks uncommitted transactions. Once a transaction commits, its ID is removed from this list. + +To determine whether a record has an implicit lock: + +image-2 + +1. Retrieve the `DATA_TRX_ID` from the record to get the transaction ID of the last insert/update operation. + +2. Check if this transaction ID is present in the active transaction list: + - If **not present**, the transaction has already committed, so the record cannot have an implicit lock. + - If **present**, the transaction is still active, meaning the record has an implicit lock. In this case, convert it into an explicit lock and register it in the lock hash table. + +3. Attempt to acquire a lock on the record in the lock hash table. If another transaction already holds the lock, it must wait; otherwise, it acquires the lock successfully. + +By introducing implicit locks, MySQL defers the costly operation of acquiring the lock hash mutex and registering row locks. This only happens if another concurrent transaction attempts to access the record. If no other transaction accesses the record, it avoids the lock hash operation entirely, improving concurrency for pure insert workloads. + +#### Benefits and Trade-offs + +- **Benefit**: Insert operations no longer need to register explicit locks in the lock hash table. Instead, explicit locks are registered only when another transaction tries to access the record. +- **Trade-off**: If a concurrent transaction tries to lock the record before the inserting transaction commits, it must check whether the `DATA_TRX_ID` is still active, incurring additional overhead. + +Fortunately, for primary key records, this overhead is minimal because: + +1. The `DATA_TRX_ID` is readily available on the primary key record. +2. MySQL optimizes `trx_sys` operations to reduce the cost of checking whether a transaction ID is still active (though brief locking of `trx_sys` may occur). + +**However, implicit lock determination for secondary index records is much more expensive**. Let's examine why. + +### 3. Implicit Lock Determination for Secondary Index Records + +Implicit lock determination for secondary index records is costly because: + +1. Unlike primary key records, secondary index records do not have a `DATA_TRX_ID` field. +2. Instead, `DATA_TRX_ID` exists at the **page level**, representing the transaction ID of the last insert/update operation on that page. + +The implementation of implicit lock determination for a secondary index record in MySQL is quite complex. Before diving into the details, let's first go over some essential background information about secondary indexes: + +1. As mentioned earlier, secondary index records do not have a hidden field for `DATA_TRX_ID`; this information is only available at the secondary index page level. +2. Secondary index updates do not generate undo logs, meaning they do not support multi-versioning (MVCC) like primary key records. 
To locate the corresponding version of a secondary index record, it is necessary to first access the corresponding primary key record and then traverse its historical versions. +3. The insertion/update sequence always processes the primary index before the secondary index. This means that for any given secondary index record, the primary key record version it points to must be either the same version or a newer version than itself. +4. Updates to secondary index records are handled by marking the old record with a delete mark and inserting a new record. + +To determine whether a secondary index record's transaction has committed, we must rely on the page-level `DATA_TRX_ID`: + +1. If the page's `DATA_TRX_ID` corresponds to a committed transaction, the secondary index record definitely does not hold an implicit lock. +2. If the page's `DATA_TRX_ID` is still active, we must **locate the corresponding primary key record** and check its transaction status. +3. Since primary key records maintain multiple versions (undo logs), finding the corresponding version can be expensive. + +Step 3 is the most intricate part of the process. To understand how to accurately determine the corresponding primary key record version for a given secondary index record, we can refer to the following 8 representative cases. + +**Example Table Definition:** + +``` +CREATE TABLE `tbl` ( + `pk` int NOT NULL, + `value` int NOT NULL, + `sk` int NOT NULL, + PRIMARY KEY (`pk`), + KEY `sk` (`sk`) +) +``` + +In the following 8 cases, we aim to identify the **corresponding** primary key record version for the secondary index record `<10, 1>`. By **corresponding**, we mean the specific version of the primary key record that resulted in the creation of the given secondary index record. + +**Case 1** + +image-3 + +1. Transaction T1 inserts record `(1, 6, 10)`: The primary key record `P<1, 6, 10>` and the secondary index record `S<10, 1>` are both inserted. + +**Case 2** + +image-4 + +1. Transaction T1 inserts record `(1, 6, 10)`: The primary key record `P<1, 6, 10>` and the secondary index record `S<10, 1>` are both inserted. +2. Transaction T1 deletes record `(1, 6, 10)`: The primary key record `P<1, 6, 10>` generates a historical version and is marked with delete mark, but the secondary index record `S<10, 1>` has not yet been marked. + +**Case 3** + +image-5 + +1. Transaction T1 inserts record `(1, 6, 10)`: The primary key record `P<1, 6, 10>` and the secondary index record `S<10, 1>` are both inserted. +2. Transaction T1 deletes record `(1, 6, 10)`: The primary key record `P<1, 6, 10>` and the secondary index record `S<10, 1>` are both marked with delete mark. + +**Case 4** + +image-6 + +1. Transaction T1 inserts record `(1, 6, 10)`: The primary key record `P<1, 6, 10>` and the secondary index record `S<10, 1>` are both inserted. +2. Transaction T1 deletes record `(1, 6, 10)`: The primary key record `P<1, 6, 10>` and the secondary index record `S<10, 1>` are both marked with delete mark. +3. Transaction T1 inserts record `(1, 6, 10)`: The primary key record `P<1, 6, 10>` is inserted, but the secondary index record `S<10, 1>` has not yet been inserted. + +**Case 5** + +image-7 + +1. Transaction T1 inserts record `(1, 6, 10)`: The primary key record `P<1, 6, 10>` and the secondary index record `S<10, 1>` are both inserted. +2. Transaction T1 deletes record `(1, 6, 10)`: The primary key record `P<1, 6, 10>` and the secondary index record `S<10, 1>` are both marked with delete mark. +3. 
Transaction T1 inserts record `(1, 6, 10)`: The primary key record `P<1, 6, 10>` and the secondary index record `S<10, 1>` are both inserted again. + +**Case 6** + +image-8 + +1. Transaction T1 inserts record `(1, 6, 10)`: The primary key record `P<1, 6, 10>` and the secondary index record `S<10, 1>` are both inserted. +2. Transaction T1 deletes record `(1, 6, 10)`: The primary key record `P<1, 6, 10>` and the secondary index record `S<10, 1>` are both marked with delete mark. +3. Transaction T1 inserts record `(1, 6, 11)`: The primary key record `P<1, 6, 11>` is inserted. Whether the secondary index record `S<11, 1>` is inserted or not is irrelevant in this case (since we only focus on `S<10, 1>`). + +**Case 7** + +image-9 + +1. Transaction T1 inserts record `(1, 6, 10)`: The primary key record `P<1, 6, 10>` and the secondary index record `S<10, 1>` are both inserted. +2. Transaction T1 deletes record `(1, 6, 10)`: The primary key record `P<1, 6, 10>` and the secondary index record `S<10, 1>` are both marked with delete mark. +3. Transaction T1 inserts record `(1, 6, 11)`: The primary key record `P<1, 6, 11>` and the secondary index record `S<11, 1>` are both inserted. +4. Transaction T1 deletes record `(1, 6, 11)`: The primary key record `P<1, 6, 11>` and the secondary index record `S<11, 1>` are both marked with delete mark. +5. Transaction T1 inserts record `(1, 6, 10)`: The primary key record `P<1, 6, 10>` is inserted, but the secondary index record `S<10, 1>` has not yet been inserted. + +**Case 8** + +image-10 + +1. Transaction T1 inserts record `(1, 6, 10)`: The primary key record `P<1, 6, 10>` and the secondary index record `S<10, 1>` are both inserted. +2. Transaction T1 deletes record `(1, 6, 10)`: The primary key record `P<1, 6, 10>` and the secondary index record `S<10, 1>` are both marked with delete mark. +3. Transaction T1 inserts record `(1, 6, 11)`: The primary key record `P<1, 6, 11>` and the secondary index record `S<11, 1>` are both inserted. +4. Transaction T1 deletes record `(1, 6, 11)`: The primary key record `P<1, 6, 11>` and the secondary index record `S<11, 1>` are both marked with delete mark. +5. Transaction T1 inserts record `(1, 6, 10)`: The primary key record `P<1, 6, 10>` and the secondary index record `S<10, 1>` are both inserted. + +The above 8 cases clearly illustrate how to determine the **corresponding** primary key record version for a given secondary index record. Once the correct primary key record version is identified, we can extract its `DATA_TRX_ID` field to precisely determine whether the secondary index record currently holds an implicit lock. + +Summary of InnoDB's Core Approach to Finding the "Corresponding" Primary Key Record Version for a Given Secondary Index Record: + +1. If the secondary index record does not have a **delete mark**, start from the **previous version** of its associated latest primary key record and traverse backward until the first **non-matching** primary key record version is found. The **next version after this point** is the corresponding version of the secondary index record. +2. If the secondary index record **has a delete mark**, start from the **previous version** of its associated latest primary key record and traverse backward until the first **matching** primary key record version is found. The **next version after this point** corresponds to the delete operation of the secondary index record. 
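+
+To make the two rules above more concrete, here is a minimal illustrative sketch (not InnoDB's actual code) written against made-up, simplified types and helpers (`PkVersion`, `SecRec`, `matches()`); the meaning of "matching" is defined right below:
+
+```
+#include <cstdint>
+#include <string>
+
+// Simplified, made-up structures -- not InnoDB's real types.
+struct SecRec {
+  std::string key;      // value of the indexed column(s) in the secondary record
+  bool delete_marked;   // whether the secondary record carries a delete mark
+};
+
+struct PkVersion {
+  std::string sec_key;  // value of the secondary index column(s) in this version
+  bool delete_marked;   // delete mark of this primary key record version
+  uint64_t trx_id;      // DATA_TRX_ID of this version
+  PkVersion *prev;      // older version reconstructed from undo, or nullptr
+};
+
+// "Matching" as defined right below: no delete mark and the same indexed value.
+static bool matches(const PkVersion *v, const SecRec &s) {
+  return !v->delete_marked && v->sec_key == s.key;
+}
+
+// Find the primary key record version that "corresponds" to secondary record s,
+// following rule 1 (s has no delete mark) or rule 2 (s has a delete mark) above.
+static PkVersion *corresponding_pk_version(PkVersion *latest, const SecRec &s) {
+  PkVersion *candidate = latest;  // "the next version after the stop point"
+  for (PkVersion *cur = latest->prev; cur != nullptr; cur = cur->prev) {
+    bool stop = s.delete_marked ? matches(cur, s)    // rule 2: stop at first match
+                                : !matches(cur, s);  // rule 1: stop at first non-match
+    if (stop) break;
+    candidate = cur;              // keep walking backward
+  }
+  return candidate;  // its DATA_TRX_ID decides whether an implicit lock may exist
+}
+```
+
+In the real implementation the backward traversal can also stop early once the `DATA_TRX_ID` of a version changes, as explained below.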
+ +**Definition of "Matching":** +A primary key record (or its historical version) `P` is considered **matching** a secondary index record `S` if: + +- `P` does **not** have a **delete mark**, and +- The value of the secondary index field in `P` is **exactly the same** as that in `S`. + +Why Not Start from the Latest Primary Key Record Directly? + +Since the search process involves traversing backward until a version meeting specific criteria is found, the next version after the stopping point is the target version. Thus, the stopping version cannot be the latest version. This serves as a small optimization by eliminating one unnecessary comparison. + +Additionally, it is not always necessary to traverse all the way back to find the "corresponding" primary key record version. If, during traversal, the transaction ID of a primary key record version changes, it indicates that the previous transaction has already committed. Consequently, the secondary index record cannot have an implicit lock, and the traversal can be terminated early. + +InnoDB's implementation follows this approach, with the key function being `row_vers_impl_x_locked_low()`. Once you understand the explanation above, the actual code implementation becomes much easier to comprehend. + +### 4. Worst-Case Scenario for Determining Implicit Locks on Secondary Index Records + +From the implementation details discussed earlier, we can see that determining whether a secondary index record holds an implicit lock relies on traversing the historical versions of its corresponding primary key record. Based on this approach, we can construct a worst-case scenario where this determination process becomes extremely costly: +image-11 + +1. Transaction T1 inserts record (1, 6, 10), where both the primary key record P<1, 6, 10> and the secondary index record S<10, 1> are inserted. +2. T1 remains uncommitted and continuously updates the non-index field of this record, changing it as follows: + (1, 6, 10) → (1, 7, 10) → (1, 8, 10) → (1, N, 10). + +In this scenario, since only a non-index field is being updated, the secondary index record S<10, 1> remains unchanged. Additionally, based on the earlier definition of "matching", all versions of the primary key record are considered matching the secondary index record. + +According to the search approach described earlier, in order to find the corresponding primary key record version for S<10, 1>, the system must traverse all historical versions until it encounters the first non-matching version. In this case, that would require traversing back to the very first version. If N is particularly large, the cost of this backward traversal can be extremely high. + +Thus, while implicit locks improve concurrency for insert operations, they also introduce a potentially significant overhead when determining implicit locks for secondary index records. + +**"There’s no free lunch."** \ No newline at end of file diff --git a/_posts/2025-04-11-duckdb-db-mgr.md b/_posts/2025-04-11-duckdb-db-mgr.md new file mode 100644 index 0000000..b27a6b5 --- /dev/null +++ b/_posts/2025-04-11-duckdb-db-mgr.md @@ -0,0 +1,160 @@ +--- +layout: post +title: 'DuckDB Source Dive - #1 Exploring the DatabaseManager' +--- + +As the “SQLite for OLAP,” DuckDB has been gaining significant popularity. At Hopsworks, we’ve also had many interesting discussions around its design and capabilities. Diving into the source code of such a star project is both exciting and insightful. 
This post marks the beginning of my **DuckDB Source Dive** series, where I will break down and explain the internals of DuckDB from a source code perspective. + +For this first article, I’ll start with a deep dive into one of DuckDB’s components: the DatabaseManager class. This class handles how databases are organized, along with their schemas, tables, indexes, and other related resources. Understanding this part is essential because it’s often the first concept users encounter when working with DuckDB. + +------ + +## 1. Core Concepts + +DuckDB organizes its data resources into a four-level hierarchy: + +1. **Database**: A single DuckDB instance can dynamically `ATTACH` or `DETACH` databases at runtime, typically stored as `.db` files. This concept doesn't have a direct counterpart in MySQL. If we had to draw a comparison, one DuckDB database is **”roughly equivalent“** to a MySQL instance. +2. **Schema**: Each database contains multiple schemas, which can be created or dropped via `CREATE/DROP SCHEMA`. This is conceptually equivalent to a MySQL *database*. A schema holds tables, indexes, and other resources. +3. **Table**: Each schema can contain multiple tables, which are created or dropped with `CREATE/DROP TABLE`. This maps directly to MySQL tables. +4. **Index**: Tables can have multiple indexes, defined with `CREATE/DROP INDEX`. These are equivalent to indexes in MySQL. + +Here’s a diagram that illustrates how these concepts align between DuckDB and MySQL: + +image-1 + +------ + +## 2. Code Breakdown + +DuckDB’s codebase is well-structured, but it heavily uses inheritance, encapsulation, and templates, which can make it challenging to follow initially. However, once you understand the key classes and how they relate to each other, the overall design becomes much clearer. + +Let’s walk through a concrete example. Suppose we have a DuckDB database file `zhao.db`, with a default schema (`main`) containing the following table: + +``` +CREATE TABLE tbl ( + col_a INTEGER, + col_b INTEGER, + col_c INTEGER, + col_d INTEGER, + PRIMARY KEY(col_a), + UNIQUE(col_b) +); + +CREATE INDEX idx ON tbl(col_c); +``` + +This table, tbl, includes: + +* 4 INTEGER columns: `col_a`, `col_b`, `col_c`, and `col_d` + +* 3 indexes: a primary key on `col_a`, a unique constraint on `col_b`, and a regular index `idx` on `col_c` + +We now start the DuckDB shell and load the file: + +``` +duckdb ./zhao.db +``` + +What happens under the hood? I’ll break it down into two main parts. + +------ + +### 2.1 DatabaseManager + +When DuckDB starts, it creates a DuckDB object and loads the `zhao.db` file. Skipping over unrelated details, it reads metadata from the file and deserializes it into in-memory structures representing all database resources. + +image-2 + +The DuckDB class has a member called instance, which is a pointer to a DatabaseInstance object — the core object representing the DuckDB instance. It contains several key components, and for this article, we’ll focus on the db_manager field. This is the DatabaseManager object that manages all attached databases and their contents (schemas, tables, indexes, etc.). It is the core module for data resource management in DuckDB. + +The DatabaseManager includes two major members: + +1. **system**: This is the system database, containing built-in function sets and other predefined resources. It’s rebuilt from scratch on each startup and is not persisted in `.db` files. +2. **databases**: A collection of user-attached databases. 
Every time a user executes an ATTACH, the corresponding Database object is added to this collection. + +Here’s an expanded diagram that shows how this works: + +image-3 + +Key classes and concepts: + +1. **Catalog**: Represents the collection of resources in a database. Each database has a corresponding Catalog object. Its main job is to manage schemas. +2. **CatalogSet**: Stores various catalog entries like schemas, tables, etc. Internally, it holds a map (of type CatalogEntryMap) which contains all entries. +3. **CatalogEntry**: The base class for all catalog items (schemas, tables, indexes, etc.). For example, DuckSchemaEntry represents a schema and inherits from CatalogEntry. +4. **AttachedDatabase**: Represents an attached database. Although it's a top-level concept, it also inherits from CatalogEntry`and fits into the catalog structure. + +To summarize: + +- **system** points to an AttachedDatabase representing the system catalog, with a DuckCatalog holding its schemas. Within that is a CatalogSet of schemas, one of which is main. This schema holds built-in functions stored as FunctionEntry objects. +- **databases** is a CatalogSet containing user databases. Each AttachedDatabase contains its own catalog of schemas, each a DuckSchemaEntry. These, in turn, contain CatalogSets for tables, indexes, functions, types, sequences, etc., all stored as CatalogEntry-derived types like TableCatalogEntry or IndexCatalogEntry. + +We’ve covered the high-level organization of databases and schemas. Next, let’s zoom into DuckSchemaEntry and explore the schema’s internal implementation. + +------ + +### 2.2 DuckSchemaEntry + +DuckSchemaEntry is a central class that represents a schema and contains all associated resources. As shown earlier, it holds tables, indexes, and more. + +This object is constructed during startup, when CheckpointReader::LoadCheckpoint() is invoked. This function reads the `.db` file’s meta blocks and deserializes them into various catalog entries, including schemas, tables, indexes, and functions. + +Here’s a rough pseudocode outline: + +``` +CheckpointReader::LoadCheckpoint(): +1. Read metadata from the .db file +2. Deserialize by TYPE: + switch (TYPE) { + case SCHEMA_ENTRY: + Create DuckSchemaEntry; + break; + case TYPE_ENTRY: + Create DuckTableEntry; + break; + case INDEX_ENTRY: + Create DuckIndexEntry; + break; + case VIEW_ENTRY: + Create DuckViewEntry; + break; + case MACRO_ENTRY: + Create ScalarMacroCatalogEntry; + break; + case TABLE_MACRO_ENTRY: + Create TableMacroCatalogEntry; + break; + case TYPE_ENTRY: + Create TypeCatalogEntry; + break; + case SEQUENCE_ENTRY: + Create SequenceCatalogEntry; + break; + } +``` + +DuckSchemaEntry is the top-level catalog entry for a schema. All other entries — tables, views, macros, sequences — are nested under it. + +Here’s the deserialization flow: + +image-4 + +1. Read an entry from the file and deserialize it. +2. Based on its type, convert it into a CreateInfo structure. +3. Use that CreateInfo to construct the corresponding CatalogEntry. +4. Depending on the entry type: + - If it's a schema, insert it into the schemas map of AttachedDatabase. + - Otherwise (tables, indexes, etc.), locate the target schema and add it there. + +This next figure gives a detailed view of DuckSchemaEntry's internals and how child entries are managed: + +image-5 + +This diagram will greatly clarify the structure as we read the code. I won’t go over every part in detail here — instead, I’ll highlight some important elements: + +1. 
**DuckTableEntry** holds detailed information like columns, indexes, and data blocks. Stored under the schema’s tables set. +2. **DuckViewEntry** also lives in the tables set, alongside regular tables. +3. **DuckIndexEntry** represents an index, but note: the actual index data is stored inside the storage of the corresponding DuckTableEntry, and DuckIndexEntry merely references it. +4. Not all indexes are defined via **INDEX_ENTRY**. Primary and unique indexes are created during TABLE_ENTRY parsing, based on constraints. Only regular indexes (like `idx` in our example) are created from INDEX_ENTRY. +5. Functions, table functions, sequences, and types are all schema-level resources. They are created using statements like `CREATE MACRO/FUNCTION`, `CREATE MACRO ... AS TABLE`, `CREATE SEQUENCE`, and `CREATE TYPE`. Just like tables, indexes, and views, the metadata for these resources is also persisted in the `.db` file. + +In future posts, I plan to explore how DuckDB implements columnar storage, the ART index, and its transaction — possibly with comparisons to MySQL. It would be fun. \ No newline at end of file diff --git a/_posts/2025-04-23-mysql-docker.md b/_posts/2025-04-23-mysql-docker.md new file mode 100644 index 0000000..171d4db --- /dev/null +++ b/_posts/2025-04-23-mysql-docker.md @@ -0,0 +1,73 @@ +--- +layout: post +title: Is MySQL Ready for Running in Containers? +--- + +**MySQL 9.3.0** was [released](https://dev.mysql.com/doc/relnotes/mysql/9.3/en/news-9-3-0.html) last week (April 15, 2025), and one particular update caught my attention: + +> “InnoDB now supports container-aware resource allocation, allowing it to adhere to the restrictions imposed by the container.” + +Curious about this, I went through the related patch and did a quick review of what MySQL has done to support container environments so far. I’d like to share a few thoughts on it. + +In my opinion, for MySQL to be **fully container-friendly**, it needs to support the following three capabilities in order of importance: + +### ✅ Step 1: Auto-detect container resource limits at startup + +On startup, MySQL should be able to automatically detect CPU and memory limits configured for the container and use them to generate sensible default values for critical parameters—without requiring manual intervention. Otherwise, users must manually tune parameters to match each container’s resources, which is inefficient and error-prone. + +Some of the key parameters that should be auto-tuned include: + +- **CPU-related:** + - `innodb_buffer_pool_instances` + - `innodb_page_cleaners` + - `innodb_purge_threads` + - `innodb_read_io_threads` + - `innodb_parallel_read_threads` + - `innodb_log_writer_threads` + - `innodb_redo_log_capacity` +- **Memory-related:** + - `innodb_buffer_pool_size` + - `innodb_buffer_pool_instances` + - `temptable_max_ram` + +### ✅ Step 2: Support online adjustment of these parameters + +This is absolutely critical. If the container's CPU or memory resources are updated at runtime, but MySQL requires a restart to apply updated parameters, that causes unnecessary downtime and potentially expensive data reloading—an unacceptable tradeoff in many production environments. + +### ✅ Step 3: Allow MySQL to dynamically respond to container resource changes + +If Steps 1 and 2 are supported, then adapting to container changes becomes feasible, though still a bit manual: + +1. Dynamically adjust container CPU and memory from the outside. +2. 
Recalculate the relevant MySQL parameters and apply them using the online configuration capabilities from Step 2. + +This workflow is workable—but still tedious. It would be much more elegant if MySQL could: + +1. Detect new CPU and memory limits in real time. +2. Automatically adjust internal configuration accordingly **without user input(or by simply passing in the latest CPU and memory info)**. + + +### So… where does MySQL stand today? + +Unfortunately, not quite there yet. + +image-1 + +- As of **MySQL 9.3.0**, Step 1 is now properly supported. +- Step 2 is **partially** implemented—some parameters can be modified at runtime, while others still require a restart. +- Step 3 is **not supported at all**. + +Looking at the progress over time, it’s clear that development in this area has been relatively slow. + +image-2 + +To be honest, most of the work to support containers isn't very difficult—probably the most technically challenging piece is online resizing of `innodb_buffer_pool_size`, but MySQL has supported that since **version 5.7.5**! + +The rest should be manageable in a minor release cycle or two. + + +### ✨ + +I hope the MySQL team can recapture the spirit of the 5.6/5.7 era and deliver some bold innovations in upcoming releases—especially in the Innovation Track. There's a lot of potential to make MySQL significantly better—with improved performance, smarter features... + +--- *"Make MySQL Great Again..."* :) \ No newline at end of file diff --git a/_posts/2025-04-27-btr_search.md b/_posts/2025-04-27-btr_search.md new file mode 100644 index 0000000..b1c44ed --- /dev/null +++ b/_posts/2025-04-27-btr_search.md @@ -0,0 +1,48 @@ +--- +layout: post +title: How Many B+Tree Searches Are Hidden Inside a MySQL Insert? +--- + +image-1 + +### 1️⃣ What is a B+Tree Search? + + * A B+Tree search starts from the root page and traverses down level by level, comparing keys until the target page at the specified level is located. Typically, the target level is 0 (leaf level), but during page splits or merges, the parent level may also be involved. + * B+Tree search is a core operation in InnoDB's B+Tree handling. The key function is 'btr_cur_search_to_nth_level()'. Insert, update, delete, and read operations all require B+Tree searches. + * During a search, latches are acquired depending on the operation type. A read typically only latches the leaf page (S latch). For writes, optimistic inserts latch the leaf page (X latch), while pessimistic inserts may X latch the entire subtree involved in a page split or merge. + * The cost of a B+Tree search includes potential page reads from disk, page latching and index locking, and key comparisons at each level. + +### 2️⃣ Number of B+Tree Searches When Inserting into a Primary Key Index + + * 1st search: Locate the leaf page where the new record should be inserted. If there's enough space, insert directly. Otherwise: + * 2nd search: Latch the subtree involved and split the leaf page, then insert. + * 3rd search: Locate the parent page of the split leaf and insert a pointer to the new page. If the parent page has enough space, insert directly. Otherwise, split the parent page: + * Extra searches: If the parent page also needs to split, recursively search and update higher-level parent pages, up to the root. This could require up to (B+Tree height - 2) additional searches. + * 4th search: If the record contains large BLOB fields stored externally, locate the newly inserted record again to insert the external BLOB, and update the external reference. 
+ +➔ Maximum number of B+Tree searches: 5 + (B+Tree height - 2) + +### 3️⃣ Number of B+Tree Searches When Inserting into a Unique Secondary Index + + * 1st search: Locate the leaf page to find the appropriate insertion position and check for potential duplicate keys. If potential duplicates are found: + * 2nd search: Locate the first candidate and perform a rightward lock scan (Next-Key Locking) to check for delete-marked records. If a non-deleted duplicate is found, return a duplicate key error. Otherwise, continue scanning to the right until encountering the first non-matching record, at which point the scan stops. + * 3rd search: Re-locate the correct position to insert after ensuring no duplicates. If the leaf page lacks space: + * 4th search: Latch the subtree and split the page. + * 5th search: Locate the parent page to insert a new pointer for the split page. If the parent page is full, recursively split: + * Extra searches: As with the primary key, this may cascade upward, requiring up to (B+Tree height - 2) additional searches. + +➔ Maximum number of B+Tree searches: 6 + (B+Tree height - 2) + +### 💡Additional Note on the 2nd Search for Unique Secondary Index Insertions + +Some might wonder: why is a second search necessary to check for duplicates? +Why can't we verify duplicates directly during the first search? + +The reason is that, unlike the primary key index, secondary indexes in InnoDB do not perform in-place updates. Instead, updates are handled by marking the old record as deleted (delete mark) and inserting a new record. +As a result, even for a **unique** secondary index, there may be multiple records internally with the same indexed column values. +The position where the first search stops might not point to a valid, active duplicate entry — it could be a delete-marked record — while the real active record might be located elsewhere. +Therefore, a second search is needed, starting from the leftmost potential duplicate, to scan and validate whether a true duplicate exists. + +### 💡A Potential Optimization Idea I Can Think Of: +If the first search already stops at a valid (non-delete-marked) duplicate entry, it might be possible to directly acquire the necessary locks and return a duplicate key error, without requiring a second full search. +This could potentially reduce some overhead in certain cases. \ No newline at end of file diff --git a/_posts/2025-05-10-mariadb-vector.md b/_posts/2025-05-10-mariadb-vector.md new file mode 100644 index 0000000..5e619d7 --- /dev/null +++ b/_posts/2025-05-10-mariadb-vector.md @@ -0,0 +1,69 @@ +--- +layout: post +title: Reviewing the Internals of MariaDB’s Vector Index +--- + + + +### 1. How It's implemented + +image-1 + +MariaDB’s vector index is implemented at the server layer, not at the storage engine. InnoDB doesn’t know vector indexes exist. Each vector index is backed by an internal table, making this effectively a “vector index on a B+Tree.” + +💬 **Internal Table Design**: + +1. Created when a vector index is defined. +2. Named as 'user_table_name'#i#'vector_column_position'. +3. No primary key, uses InnoDB’s DB_ROW_ID as the implicit PK. +4. Each node is stored as a row: + * layer: node layer (indexed). + * tref: user table's PK, used for fast cross-reference (UNIQUE KEY). + * vec: halfvec (header + N * float16), derived from user vector (N * float32) via normalization and quantization. + * neighbors: neighbor list by layer, each neighbor stored as its DB_ROW_ID. + +### 2. 
Insertion Flow and Cost + +image-2 + +Inserting a record into a vector index involves multiple InnoDB API calls and lock acquisitions: + +1. ha_index_last() locates and X-locks the top-layer entry point. +2. From layer L to T+1, ha_rnd_pos() reads neighbors and picks the best to descend. +3. At layer T, ha_rnd_pos() selects top M candidates. +4. From T-1 to 1, ha_rnd_pos() expands M candidates' neighbors, keeping the top M. +5. At layer 0, ha_rnd_pos() expands and picks top 2*M. +6. ha_write_row() inserts the new node. +7. ha_rnd_pos() reads selected neighbors for back-link updates. +8. ha_update_row() updates neighbor lists. + +👉 **Max InnoDB Call Counts**: + +* ha_index_last(): 1 + +* ha_rnd_pos(): ((L - T) * M + M + T * M * M + M * 2M) + (T * M + 2M) + +* ha_write_row(): 1 + +* ha_update_row(): T * M + 2M + + (Pruning may reduce actual calls in step 3 - 5) + +👉 **Locking Cost**: + + \- All calls acquire X-locks, including on unused nodes. Locks are held until transaction commit. + + \- The top-layer entry node becomes a write hotspot, every insert starts by X-locking this record—resulting in almost no write-write concurrency. + + \- Reads use snapshot reads and don't acquire locks, so read concurrency is not impacted by locking—only by the number of InnoDB calls. I remember [Mark](https://www.blogger.com/profile/09590445221922043181) did a detailed [benchmark](https://smalldatum.blogspot.com/2025/01/evaluating-vector-indexes-in-mariadb.html) on this—worth checking out if you're interested. + +### 3. Review + +image-3 + +I guess this server-layer design exists because the vector index is built on a MariaDB table, which must follow all the transactional constraints of the main table. Reusing InnoDB’s transactional capabilities is thus a natural way to achieve durability and crash recovery, while significantly reducing engineering complexity. + +However, each insert behaves like a large update transaction involving hotspot records. It cannot perform fine-grained concurrency control on the vector index the way InnoDB does with B+Tree latches. + +Still, it represents a good and pragmatic trade-off—implementing a native vector index would require significant investment in I/O and concurrency optimizations. + diff --git a/_posts/2025-05-26-mariadb-vector-2.md b/_posts/2025-05-26-mariadb-vector-2.md new file mode 100644 index 0000000..d8454f8 --- /dev/null +++ b/_posts/2025-05-26-mariadb-vector-2.md @@ -0,0 +1,62 @@ +--- +layout: post +title: Reviewing the Internals of MariaDB’s Vector Index 2 +--- + + + +image-1 + +### 1. What optimizations does MariaDB apply to vector indexes? + +#### 1.1 SIMD Acceleration + +- To improve performance, vector fields in the records are quantized from raw float arrays into fixed-point representations (scale + `int16_t` arrays) using normalization and quantization. These `int16_t` arrays allow fast dot product computation using AVX2 or AVX512 instructions, which significantly speeds up vector distance calculations. +- When expanding neighbors for a given node, MariaDB uses a `PatternedSimdBloomFilter` to efficiently skip previously visited neighbors. This filter groups visited memory addresses in batches of 8 and uses SIMD to accelerate the matching process. + +#### 1.2 Node Caching to Reduce Storage Engine Access + +- Each table's `TABLE_SHARE` structure holds an `MHNSW_Share` object, which contains a global cache shared across sessions (since `TABLE_SHARE` is global). 
- The cache improves read performance but introduces additional locking overhead, which is worth a closer look. Three types of locks are used to manage concurrency:

  - `cache_lock`: guards the entire cache structure.
  - `node_lock[8]`: partitioned node-level locks that reduce contention on `cache_lock`. A thread first uses `cache_lock` to locate the node, then grabs `node_lock[x]` for fine-grained protection, allowing `cache_lock` to be released right after.
  - `commit_lock`: a read-write lock that ensures consistency during writes. Readers hold the read lock throughout the query to prevent concurrent cache modifications. Writers acquire the write lock during commit, invalidate the cache, bump the version number, and notify any ongoing reads (executing between `hlindex_read_first()` and `hlindex_read_next()`) to switch to the new cache generated by the writer.

  **Observations:**

  - In pure read workloads, this caching scheme works well, and the two-tier locking mechanism (`cache_lock` + `node_lock`) minimizes contention.
  - In read-write workloads, however, every write invalidates the entire cache, making it less effective for concurrent readers.

------

### 2. What Is the Transaction Isolation on the Vector Index?

(This section refers specifically to vector indexes on InnoDB.)

#### 1. Read Committed

*(tx_isolation = 'READ COMMITTED')*

- **Consistent reads** remain unchanged.
- **Locking reads**, however, become stronger: the shared lock (S-lock) acquired on the entry-point node blocks writes completely, effectively elevating isolation to Serializable.

#### 2. Repeatable Read

*(tx_isolation = 'REPEATABLE READ')*

- **Consistent reads** remain unchanged.
- **Locking reads** now acquire a next-key S-lock on the entry-point node, which again blocks concurrent writes, behaving like Serializable isolation.

Interestingly, under the Repeatable Read isolation level, the correctness of locking reads is not guaranteed by InnoDB’s next-key locks or gap locks. Since gap protection applies to ordered indexes, the concept of a “gap” does not really exist in a vector index structure. However, locking reads in this case effectively behave like Serializable, which still satisfies the requirements of Repeatable Read.

Another notable quirk: the cache layer disrupts normal locking behavior. If a node is found in the cache, no InnoDB-level lock is acquired. Locking only happens on cache misses. This makes the locking behavior somewhat unpredictable under high cache hit rates.

------

### 3. Conclusion

Based on the reviews in my last post and this one, I believe MariaDB’s current implementation of vector indexes offers an excellent case study of how to integrate vector search into a relational database. It achieves a strong balance between engineering complexity, performance, and applicability.

Looking forward to seeing even more powerful iterations in the future!

diff --git a/_posts/2025-06-05-opt-uniq-index.md b/_posts/2025-06-05-opt-uniq-index.md
new file mode 100644
index 0000000..24776ac
--- /dev/null
+++ b/_posts/2025-06-05-opt-uniq-index.md
@@ -0,0 +1,30 @@
---
layout: post
title: A Potential Optimization for Inserting into MySQL Unique Indexes
---

In my previous [blog](https://kernelmaker.github.io/btr_search), I discussed how many B+Tree searches MySQL performs when inserting a record, detailing the process for inserting into both the primary index and unique secondary indexes.
I believe there is room for optimization in the unique secondary index insertion process. In my previous post, I already reported one such optimization, which has since been verified by the MySQL team. In this post, I’d like to discuss another optimization I recently proposed.

image-1

Currently, when inserting an entry into a unique secondary index, MySQL performs the following steps:

1. It first does a B+Tree search using **all columns of the entry** to locate the leaf page insertion point. If it finds a record with matching index columns, this suggests a potential duplicate, so it proceeds to the next step.
2. It does another B+Tree search using **only the index columns** to locate the first record with matching index columns. It then iterates (using `next`) to check for actual duplicates, since records marked as deleted don’t count as duplicates. If no duplicate is found, it continues to the next step.
3. It performs a final B+Tree search using **all columns of the entry** to find the insertion point and then inserts the entry.

As you can see, this process involves 3 separate B+Tree searches. This is mainly because unique secondary indexes in InnoDB cannot be updated in place; they rely on a delete-mark and insert approach. Since multiple records can share the same index column values (including delete-marked ones), MySQL has to perform these extra checks to ensure uniqueness.

Each of these B+Tree searches acquires at least one page latch at each tree level, which can become a concurrency bottleneck, especially during page splits or merges.

### How can we optimize this?

I believe we can reduce the number of B+Tree searches for unique secondary indexes. Specifically, we could skip the initial B+Tree search that uses all columns as the search pattern. The revised process would be:

1. Use a B+Tree search with **only the index columns** to locate the first record with matching index columns. Then iterate (using `next`) through subsequent records to confirm whether an actual duplicate exists, while also identifying the final insertion point (including comparing the full entry columns when needed). If no duplicate is found, we can directly insert the entry at the determined insertion point.

This approach would reduce the number of B+Tree searches from 3 to **1**, significantly reducing the chances of concurrency conflicts. All duplicate checks would happen within one or just a few adjacent leaf pages, making the lock granularity much smaller. Importantly, even in the worst case, the number of entry-record comparisons wouldn’t exceed what the current implementation requires.

I’ve already submitted a report with this idea to the MySQL team. I’m hoping it can generate some interesting discussions around this optimization.

diff --git a/_posts/2025-07-14-pgvector.md b/_posts/2025-07-14-pgvector.md
new file mode 100644
index 0000000..e933944
--- /dev/null
+++ b/_posts/2025-07-14-pgvector.md
@@ -0,0 +1,66 @@
---
layout: post
title: Exploring the Internals of pgvector
---

pgvector brings vector similarity search capabilities to PostgreSQL. Thanks to PostgreSQL’s flexible extension architecture, pgvector can focus purely on implementing the vector index: defining the vector data type, implementing index operations like insert, search, and vacuum, managing index page and tuple formats, and handling WAL logging. PostgreSQL handles the rest — index file and page buffer management, MVCC, crash recovery, and more.
Because of this, pgvector is able to implement its vector index at a low level, on par with PostgreSQL’s built-in nbtree index.

This is one of the key differences from MariaDB’s vector index. Due to MariaDB’s pluggable storage engine architecture and engineering trade-offs, its vector index is implemented at the server layer through the storage engine’s transactional interface. The storage engine itself is unaware of the vector index — it just sees a regular table. Curious about the internals of the MariaDB vector index? Take a look at my previous posts [here[1]](https://kernelmaker.github.io/mariadb-vector) and [here[2]](https://kernelmaker.github.io/mariadb-vector-2).

pgvector supports two types of vector indexes: HNSW and IVFFlat. This post focuses on the HNSW implementation, particularly its concurrency control mechanisms — a topic that has received relatively little attention.

## 1. Key Interfaces

1. **hnswinsert**: This is the core interface for inserting into the index. Its implementation closely follows the HNSW paper, aligning with the INSERT, SEARCH-LAYER, and SELECT-NEIGHBORS-HEURISTIC algorithms (Algorithms 1, 2, and 4). One difference I noticed is that pgvector omits the extendCandidates step from SELECT-NEIGHBORS-HEURISTIC.

2. **hnswgettuple**: This is the search interface. It invokes GetScanItems to perform the HNSW search, which aligns closely with Algorithm 5 (K-NN-SEARCH) from the paper. In iterative scans, GetScanItems not only returns the best candidates but also retains the discarded candidates — those rejected during neighbor selection. Once all the best candidates are exhausted, it revisits some of these discarded candidates at layer 0 for additional search rounds. This continues until hnsw_max_scan_tuples is exceeded, after which the remaining discarded candidates are returned as-is and the scan ends.

3. **hnswbulkdelete**: This is the vacuum-related interface. It's the heaviest part of pgvector, involving three steps:
   1. Scanning all index pages to collect invalid tuples.

   2. RepairGraph, the most complex step, removes the invalid tuples from the graph and from their neighbors' neighbor lists, then repairs the graph to ensure correctness. This requires scanning all pages again and performing multiple search-layer and select-neighbors operations.

   3. Safely deleting the invalid tuples from the index, again requiring a full scan of all pages.

   As you can see, the three full traversals of the index pages plus extensive HNSW operations make this function very heavy.

4. **hnswbuild**: This handles index creation. It uses similar logic to hnswinsert, but with PostgreSQL’s index build APIs. Notably, it supports concurrent builds. Initially, it builds the index in memory; only when the memory limit is exceeded does it flush pages to disk and switch to on-disk insertions. WAL logging is disabled throughout the build phase.

## 2. Concurrency Control

In an earlier post, I analyzed the concurrency control design of MariaDB’s vector index. There, read-write concurrency is supported through InnoDB’s MVCC, but write-write concurrency is not.

pgvector goes further by supporting true write-write concurrency. Let’s dive into how pgvector handles concurrency between hnswinsert and hnswbulkdelete.

It introduces multiple lock types:

1. **Page Locks**: These are PostgreSQL’s standard buffer locks and are used to protect individual pages. Pages in pgvector fall into three categories:

   * Meta Page: Stores index metadata.

   * Element Pages: Store index tuples.
   * Entrypoint Page: A special element page containing the HNSW entrypoint.

2. **HNSW_UPDATE_LOCK**: A read-write lock used in pgvector to protect critical sections that span a larger scope than a single page lock.
   * Most inserts (hnswinsert) acquire it in shared mode, allowing concurrent inserts.

   * If an insert needs to update the entrypoint, it upgrades to exclusive mode to ensure only one insert can modify the entrypoint.

   * hnswbulkdelete briefly acquires and immediately releases the lock in exclusive mode after collecting invalid index tuples and just before RepairGraph, ensuring that all in-flight inserts have completed. Otherwise, concurrent inserts might reintroduce the invalid elements being removed, making them neighbors again.

3. **HNSW_SCAN_LOCK**: Similar to HNSW_UPDATE_LOCK, but used to coordinate between hnswbulkdelete and hnswgettuple.

All locking operations in hnswinsert and hnswbulkdelete are well-structured. The diagram below shows the detailed lock scopes in both implementations, where solid lines indicate exclusive locks and dashed lines indicate shared locks. I won’t go into all the details here — I may write a separate post covering the implementation specifics — but the diagram clearly illustrates that exclusive HNSW_UPDATE_LOCK usage is infrequent. Most operations acquire it in shared mode and hold short-lived page locks only as needed, keeping contention low.

image-1

**What about deadlocks?** The answer is simple: as shown in the diagram, in most cases only one page buffer lock is held at a time, eliminating the risk of circular dependencies. In rare cases, both an element page and one of its neighbor pages (also an element page) may be locked simultaneously. However, since pgvector maintains a globally unique current insert page, even these scenarios remain safe.

## 3. Summary

pgvector is another textbook example of how to integrate vector search into a traditional OLTP database. Its implementation is elegant, closely aligned with the original HNSW paper.

Compared to MariaDB's vector index, it stands out for its fine-grained concurrency control. However, it lacks the SIMD-level optimizations that MariaDB has introduced for better performance.

A deeper comparison between pgvector and MariaDB’s vector index internals would be an interesting future topic.

If you're interested in performance benchmarks comparing pgvector and MariaDB, Mark Callaghan did detailed tests; check them out [here](https://smalldatum.blogspot.com/2025/01/evaluating-vector-indexes-in-mariadb.html).

diff --git a/_posts/2025-08-11-pgvector_rr.md b/_posts/2025-08-11-pgvector_rr.md
new file mode 100644
index 0000000..513e19b
--- /dev/null
+++ b/_posts/2025-08-11-pgvector_rr.md
@@ -0,0 +1,67 @@
---
layout: post
title: Is pgvector breaking PostgreSQL's Repeatable Read isolation?
---

This thought hit me on the way to work today:
(The table ‘items’ has an HNSW index on the vector column ‘embedding’)

```
BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
……
```

**Can we really say this SELECT is repeatable read safe❓**

I used to assume pgvector, as a PostgreSQL extension, naturally inherits Postgres’s transactional guarantees — but after thinking it through, that might not be the case.

### PostgreSQL MVCC relies on 3 assumptions:

1. **Indexes are append-only**: Write operations only insert new index entries — never update or delete them.
2. **The heap stores version history**: Each row’s versions are retained for snapshot-based visibility checks.
3. **VACUUM coordinates cleanup**: It purges dead heap tuples and their corresponding index entries together.

This works well with native ordered indexes like nbtree. For example:

1. A REPEATABLE READ transaction performs the same SELECT twice.
2. Between them, a new row B is inserted.
3. In the second SELECT, B appears in the index scan but is filtered out after a heap visibility check.

So, the query still returns the same results — consistent with REPEATABLE READ.

image-1

### But HNSW behaves differently…

When inserting a new vector B:

1. B searches the graph to find neighbors.
2. Selected neighbors (say, T) update their neighbor lists to include B.
3. If T's list is full, HNSW re-selects the top-k neighbors — possibly evicting an existing node like D.

Here’s the issue: T’s neighbor list is modified — breaking assumption #1. Now, suppose a REPEATABLE READ transaction had previously discovered D via T. In its second identical query, it may no longer reach D, simply because D was evicted from T's neighbor list. At the same time, the newly inserted B is now reachable — but is correctly rejected due to heap visibility checks.

### Root cause:

1. The HNSW index breaks MVCC’s immutability assumption: it performs in-place modifications to graph nodes during insertions.
2. No versioning in the HNSW index: there's no way to preserve historical neighbor lists for concurrent transactions.

Even though I prefer pgvector’s low-level, native integration (at the same level as nbtree), MariaDB’s design may provide better transactional isolation here: its HNSW index is implemented as a separate InnoDB table, which naturally supports MVCC, including versioned index “rows.”

This question came to mind today — I reached a tentative conclusion through some code review and thought experiments. I haven’t verified this with a test case yet, so feel free to correct me if I’m wrong.

🤔 BTW, lately I’ve been comparing how vector search is implemented in transactional databases vs. dedicated vector databases by reading through their code. It’s exciting to see traditional databases embracing new trends — but what do you think: do transactions bring real value to vector search, or are they more of a burden in practice? And what about the other way around?

### Discussion

This post has sparked some discussion on LinkedIn, with two main points being raised:

1. HNSW is approximate search by nature, so strict Repeatable Read isn’t required.
2. PostgreSQL doesn’t currently guarantee identical results in all cases anyway (e.g., non-unique indexes with `SELECT ... ORDER BY ... LIMIT ...`), because different execution plans can produce different result orders.

I’m not convinced by either of these arguments:

1. Approximate search is an inherent trade-off in the vector search domain. It’s unrelated to PostgreSQL’s ACID guarantees, and using vector search shouldn’t be a reason to compromise on them.
2. The core issue here isn’t about result **order** — it’s about the result **set** itself. Query plan variability doesn’t explain this away, because even if we strictly control every runtime condition to ensure identical execution plans, HNSW can still produce different result sets (not just differently ordered sets) due to the root cause I described above.
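To make that root cause concrete, here is a tiny, self-contained C++ model of the eviction scenario (my own sketch, not pgvector code; the node IDs and the one-dimensional "vectors" are purely illustrative): a node T keeps a bounded neighbor list, inserting a closer element B evicts D in place, and a traversal from T can no longer reach D afterwards.

```c++
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <unordered_map>
#include <vector>

// Toy model of an HNSW-style node whose neighbor list is updated in place.
struct Node {
    int id;
    float value;                 // 1-D "vector" to keep the example tiny
    std::vector<int> neighbors;  // bounded list, re-selected on overflow
};

constexpr size_t kMaxNeighbors = 2;

// Add an edge from -> to; on overflow, keep only the closest nodes (eviction in place).
void Link(std::unordered_map<int, Node>& graph, int from, int to) {
    auto& nb = graph.at(from).neighbors;
    nb.push_back(to);
    if (nb.size() > kMaxNeighbors) {
        std::sort(nb.begin(), nb.end(), [&](int x, int y) {
            return std::fabs(graph.at(x).value - graph.at(from).value) <
                   std::fabs(graph.at(y).value - graph.at(from).value);
        });
        nb.resize(kMaxNeighbors);  // the farthest neighbor is dropped, with no old version kept
    }
}

bool ReachableFrom(const std::unordered_map<int, Node>& graph, int from, int target) {
    const auto& nb = graph.at(from).neighbors;
    return std::find(nb.begin(), nb.end(), target) != nb.end();
}

int main() {
    std::unordered_map<int, Node> graph;
    graph.emplace(0, Node{0, 0.0f, {}});  // T: the node the search enters through
    graph.emplace(1, Node{1, 9.0f, {}});  // D: an existing, committed element
    graph.emplace(2, Node{2, 5.0f, {}});
    Link(graph, 0, 1);
    Link(graph, 0, 2);
    std::printf("D reachable via T before insert: %d\n", ReachableFrom(graph, 0, 1));  // 1

    // A concurrent insert adds B, which is closer to T than D is; D gets evicted from T's list.
    graph.emplace(3, Node{3, 1.0f, {}});  // B
    Link(graph, 0, 3);
    std::printf("D reachable via T after insert:  %d\n", ReachableFrom(graph, 0, 1));  // 0
    return 0;
}
```

In a real HNSW graph the effect is probabilistic rather than guaranteed, but the mechanism is the same: the edge that made D reachable in the first query has been overwritten, and there is no older version of T's neighbor list left for the repeatable-read snapshot to fall back on.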
diff --git a/_posts/2025-09-08-simd.md b/_posts/2025-09-08-simd.md new file mode 100644 index 0000000..821aab8 --- /dev/null +++ b/_posts/2025-09-08-simd.md @@ -0,0 +1,326 @@ +--- +layout: post +title: SIMD in Vector Search - "Hand-Tuned SIMD vs Compiler Auto-Vectorization" +--- + + + +**SIMD** (Single instruction, multiple data) is often one of the key optimization techniques in vector search. In particular, when computing the distance between two vectors, SIMD can transform what was originally a one-dimensional-at-a-time calculation into 8- or 16-dimensions-at-a-time, significantly improving performance. + +Here, as I mentioned in previous posts, MariaDB and pgvector take different approaches: + +1. **MariaDB**: directly implements distance functions using SIMD instructions. +2. **pgvector**: implements distance functions in a naive way and relies on compiler optimization (`-ftree-vectorize`) for vectorization. + +To better understand the benefits of SIMD vectorization, and to compare these two approaches, I ran a series of benchmarks — and **discovered some surprising performance results along the way.** + +## 1. Test Environment and Method + +**Environment** + +1. AWS EC2: c5.4xlarge, 16 vCPUs, 32 GiB memory +2. Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz +3. gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 + +**Method** + +1. First, I implemented 4 different squared L2 distance (L2sq) functions (i.e., Euclidean distance without the square root): + + - Naive L2sq implementation + + ```c++ + static inline double l2sq_naive_f32(const float* a, const float* b, size_t n) { + float acc = 0.f; + for (size_t i = 0; i < n; ++i) { float d = a[i] - b[i]; acc += d * d; } + return (double)acc; + } + ``` + + - Naive high-precision L2sq (converting float to double before computation) + + ```c++ + static inline double l2sq_naive_f64(const float* a, const float* b, size_t n) { + double acc = 0.0; + for (size_t i = 0; i < n; ++i) { double d = (double)a[i] - (double)b[i]; acc += d * d; } + return acc; + } + ``` + + - SIMD (AVX2) L2sq implementation, computing 8 dimensions at a time + + ```c++ + // Reference: simSIMD + SIMSIMD_PUBLIC void simsimd_l2sq_f32_haswell(simsimd_f32_t const *a, + simsimd_f32_t const *b, + simsimd_size_t n, + simsimd_distance_t *result) { + + __m256 d2_vec = _mm256_setzero_ps(); + simsimd_size_t i = 0; + for (; i + 8 <= n; i += 8) { + __m256 a_vec = _mm256_loadu_ps(a + i); + __m256 b_vec = _mm256_loadu_ps(b + i); + __m256 d_vec = _mm256_sub_ps(a_vec, b_vec); + d2_vec = _mm256_fmadd_ps(d_vec, d_vec, d2_vec); + } + + simsimd_f64_t d2 = _simsimd_reduce_f32x8_haswell(d2_vec); + for (; i < n; ++i) { + float d = a[i] - b[i]; + d2 += d * d; + } + + *result = d2; + } + SIMSIMD_INTERNAL simsimd_f64_t _simsimd_reduce_f32x8_haswell(__m256 vec) { + // Convert the lower and higher 128-bit lanes of the input vector to double precision + __m128 low_f32 = _mm256_castps256_ps128(vec); + __m128 high_f32 = _mm256_extractf128_ps(vec, 1); + + // Convert single-precision (float) vectors to double-precision (double) vectors + __m256d low_f64 = _mm256_cvtps_pd(low_f32); + __m256d high_f64 = _mm256_cvtps_pd(high_f32); + + // Perform the addition in double-precision + __m256d sum = _mm256_add_pd(low_f64, high_f64); + return _simsimd_reduce_f64x4_haswell(sum); + } + SIMSIMD_INTERNAL simsimd_f64_t _simsimd_reduce_f64x4_haswell(__m256d vec) { + // Reduce the double-precision vector to a scalar + // Horizontal add the first and second double-precision values, and third and fourth + __m128d vec_low = 
_mm256_castpd256_pd128(vec); + __m128d vec_high = _mm256_extractf128_pd(vec, 1); + __m128d vec128 = _mm_add_pd(vec_low, vec_high); + + // Horizontal add again to accumulate all four values into one + vec128 = _mm_hadd_pd(vec128, vec128); + + // Convert the final sum to a scalar double-precision value and return + return _mm_cvtsd_f64(vec128); + } + ``` + + - SIMD (AVX-512) L2sq implementation, computing 16 dimensions at a time + + ```c++ + // Reference: simSIMD + SIMSIMD_PUBLIC void simsimd_l2sq_f32_skylake(simsimd_f32_t const *a, + simsimd_f32_t const *b, + simsimd_size_t n, + simsimd_distance_t *result) { + __m512 d2_vec = _mm512_setzero(); + __m512 a_vec, b_vec; + + simsimd_l2sq_f32_skylake_cycle: + if (n < 16) { + __mmask16 mask = (__mmask16)_bzhi_u32(0xFFFFFFFF, n); + a_vec = _mm512_maskz_loadu_ps(mask, a); + b_vec = _mm512_maskz_loadu_ps(mask, b); + n = 0; + } + else { + a_vec = _mm512_loadu_ps(a); + b_vec = _mm512_loadu_ps(b); + a += 16, b += 16, n -= 16; + } + __m512 d_vec = _mm512_sub_ps(a_vec, b_vec); + d2_vec = _mm512_fmadd_ps(d_vec, d_vec, d2_vec); + if (n) goto simsimd_l2sq_f32_skylake_cycle; + + *result = _simsimd_reduce_f32x16_skylake(d2_vec); + } + ...... + ``` + +2. I generated a dataset of 10,000 float vectors (dimension = 1024, 64B aligned) and one target vector. Then, for the following 5 scenarios, I searched for the vector with the closest L2sq distance to the target. Each distance computation was repeated 16 times (to create a CPU-intensive workload), and each scenario was executed 5 times, taking the median runtime to eliminate random fluctuations: + + 1. SIMD L2sq implementation + 2. Naive L2sq implementation + 3. Naive L2sq with compiler vectorization disabled (`-fno-tree-vectorize -fno-builtin -fno-lto -Wno-cpp -Wno-pragmas`) + 4. Naive high-precision L2sq implementation + 5. Naive high-precision L2sq with compiler vectorization disabled + +3. Compile with AVX2 (`-O3 -mavx2 -mfma -mf16c -mbmi2`) and run the 5 scenarios. + +4. Compile with AVX-512 (`-O3 -mavx512f -mavx512dq -mavx512bw -mavx512vl -mavx512cd -mfma -mf16c -mbmi2`) and run the 5 scenarios again. + +## 2. Results and Analysis + +image-1 + +#### Expected results: + +1. SIMD L2sq implementations are much faster than others, and AVX-512 outperforms AVX2 since it processes 16 dimensions at once instead of 8. + +2. Under AVX2, naive L2sq (178.385ms) is faster than naive high-precision L2sq (183.973ms), because the latter incurs float→double conversion overhead. + +3. Under both AVX2 and AVX-512, naive implementations with compiler vectorization disabled perform the worst, since they are forced into scalar execution. + +#### Unexpected Results + +In addition to the expected results above, some surprising findings appeared: + +1. For naive L2sq, **AVX-512 performance (208.822ms) was actually slower than AVX2 (178.385ms).** +2. With AVX-512, **naive L2sq was slower than naive high-precision L2sq.** + +Both deserve deeper analysis. + +**(1) Why was naive L2sq with AVX-512 slower than with AVX2?** + +Although this was a naive implementation, with `-O3` we would expect the compiler to auto-vectorize. However, the vectorized result generated by the compiler was far worse than our manual SIMD implementation, and AVX-512 even performed worse than AVX2. + +To investigate further, I used `objdump` to examine the AVX2 and AVX-512 binaries for `l2sq_naive_f32()`. + +- Under AVX2: + + ``` + 0000000000007090 <_ZL19l2sq_naive_f32PKfS0_m>: + ... ... 
+ 70b7: 48 c1 ee 03 shr rsi,0x3 + 70bb: 48 c1 e6 05 shl rsi,0x5 + 70bf: 90 nop + 70c0: c5 fc 10 24 07 vmovups ymm4,YMMWORD PTR [rdi+rax*1] + 70c5: c5 dc 5c 0c 01 vsubps ymm1,ymm4,YMMWORD PTR [rcx+rax*1] + 70ca: 48 83 c0 20 add rax,0x20 + 70ce: c5 f4 59 c9 vmulps ymm1,ymm1,ymm1 + + 70d2: c5 fa 58 c1 vaddss xmm0,xmm0,xmm1 + 70d6: c5 f0 c6 d9 55 vshufps xmm3,xmm1,xmm1,0x55 + 70db: c5 f0 c6 d1 ff vshufps xmm2,xmm1,xmm1,0xff + 70e0: c5 fa 58 c3 vaddss xmm0,xmm0,xmm3 + 70e4: c5 f0 15 d9 vunpckhps xmm3,xmm1,xmm1 + 70e8: c4 e3 7d 19 c9 01 vextractf128 xmm1,ymm1,0x1 + 70ee: c5 fa 58 c3 vaddss xmm0,xmm0,xmm3 + 70f2: c5 fa 58 c2 vaddss xmm0,xmm0,xmm2 + 70f6: c5 f0 c6 d1 55 vshufps xmm2,xmm1,xmm1,0x55 + 70fb: c5 fa 58 c1 vaddss xmm0,xmm0,xmm1 + 70ff: c5 fa 58 c2 vaddss xmm0,xmm0,xmm2 + 7103: c5 f0 15 d1 vunpckhps xmm2,xmm1,xmm1 + 7107: c5 f0 c6 c9 ff vshufps xmm1,xmm1,xmm1,0xff + 710c: c5 fa 58 c2 vaddss xmm0,xmm0,xmm2 + 7110: c5 fa 58 c1 vaddss xmm0,xmm0,xmm1 + ... ... + ``` + + The compiler did use vector instructions (`vmovups`, `vsubps`, `vmulps`) to compute L2sq in groups of 8 floats. But when folding the 8 results horizontally into `xmm0`, it extracted elements using `vshufps`, `vunpckhps`, `vextractf128`, etc., and then added them one by one with scalar `vaddss`. Worse, this folding happened **in every iteration**. + + image-2 + + This per-iteration horizontal reduction became the bottleneck. Instead, like the manual SIMD implementation, it should have accumulated vector results across the whole loop and performed just one horizontal reduction at the end. + + + +- Under AVX-512: + + ``` + a057: 48 c1 ee 04 shr rsi,0x4 + a05b: 48 c1 e6 06 shl rsi,0x6 + a05f: 90 nop + a060: 62 f1 7c 48 10 2c 07 vmovups zmm5,ZMMWORD PTR [rdi+rax*1] + a067: 62 f1 54 48 5c 0c 01 vsubps zmm1,zmm5,ZMMWORD PTR [rcx+rax*1] + a06e: 48 83 c0 40 add rax,0x40 + a072: 62 f1 74 48 59 c9 vmulps zmm1,zmm1,zmm1 + + a078: c5 f0 c6 e1 55 vshufps xmm4,xmm1,xmm1,0x55 + a07d: c5 f0 c6 d9 ff vshufps xmm3,xmm1,xmm1,0xff + a082: 62 f3 75 28 03 d1 07 valignd ymm2,ymm1,ymm1,0x7 + a089: c5 fa 58 c1 vaddss xmm0,xmm0,xmm1 + a08d: c5 fa 58 c4 vaddss xmm0,xmm0,xmm4 + a091: c5 f0 15 e1 vunpckhps xmm4,xmm1,xmm1 + a095: c5 fa 58 c4 vaddss xmm0,xmm0,xmm4 + a099: c5 fa 58 c3 vaddss xmm0,xmm0,xmm3 + a09d: 62 f3 7d 28 19 cb 01 vextractf32x4 xmm3,ymm1,0x1 + a0a4: c5 fa 58 c3 vaddss xmm0,xmm0,xmm3 + a0a8: 62 f3 75 28 03 d9 05 valignd ymm3,ymm1,ymm1,0x5 + a0af: c5 fa 58 c3 vaddss xmm0,xmm0,xmm3 + a0b3: 62 f3 75 28 03 d9 06 valignd ymm3,ymm1,ymm1,0x6 + a0ba: 62 f3 7d 48 1b c9 01 vextractf32x8 ymm1,zmm1,0x1 + a0c1: c5 fa 58 c3 vaddss xmm0,xmm0,xmm3 + a0c5: c5 f0 c6 d9 55 vshufps xmm3,xmm1,xmm1,0x55 + a0ca: c5 fa 58 c2 vaddss xmm0,xmm0,xmm2 + a0ce: c5 f0 c6 d1 ff vshufps xmm2,xmm1,xmm1,0xff + a0d3: c5 fa 58 c1 vaddss xmm0,xmm0,xmm1 + a0d7: c5 fa 58 c3 vaddss xmm0,xmm0,xmm3 + a0db: c5 f0 15 d9 vunpckhps xmm3,xmm1,xmm1 + a0df: c5 fa 58 c3 vaddss xmm0,xmm0,xmm3 + a0e3: c5 fa 58 c2 vaddss xmm0,xmm0,xmm2 + a0e7: 62 f3 7d 28 19 ca 01 vextractf32x4 xmm2,ymm1,0x1 + a0ee: c5 fa 58 c2 vaddss xmm0,xmm0,xmm2 + a0f2: 62 f3 75 28 03 d1 05 valignd ymm2,ymm1,ymm1,0x5 + a0f9: c5 fa 58 c2 vaddss xmm0,xmm0,xmm2 + a0fd: 62 f3 75 28 03 d1 06 valignd ymm2,ymm1,ymm1,0x6 + a104: 62 f3 75 28 03 c9 07 valignd ymm1,ymm1,ymm1,0x7 + a10b: c5 fa 58 c2 vaddss xmm0,xmm0,xmm2 + a10f: c5 fa 58 c1 vaddss xmm0,xmm0,xmm1 + ``` + + The first part similarly used vector instructions to compute 16 values at a time. 
But folding 16 results was even more complex and expensive, involving `vshufps`, `valignd`, `vunpckhps`, `vextractf32x4`, `vextractf32x8`, etc. This additional complexity canceled out the gains from processing 16 dimensions per iteration, which explains why AVX-512 was slower. + +**(2) Why was naive float L2sq slower than naive high-precision L2sq under AVX-512?** + +Theoretically, high-precision L2sq should be slower because of float→double conversions. So why was it faster? + +Looking at the disassembly of `l2sq_naive_f64`: + +``` +000000000000a280 <_ZL19l2sq_naive_f64PKfS0_m>: + a280: f3 0f 1e fa endbr64 + a284: 48 85 d2 test rdx,rdx + a287: 74 37 je a2c0 <_ZL19l2sq_naive_f64_oncePKfS0_m+0x40> + a289: c5 e0 57 db vxorps xmm3,xmm3,xmm3 + a28d: 31 c0 xor eax,eax + a28f: c5 e9 57 d2 vxorpd xmm2,xmm2,xmm2 + a293: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0] + a298: c5 e2 5a 04 87 vcvtss2sd xmm0,xmm3,DWORD PTR [rdi+rax*4] + a29d: c5 e2 5a 0c 86 vcvtss2sd xmm1,xmm3,DWORD PTR [rsi+rax*4] + a2a2: c5 fb 5c c1 vsubsd xmm0,xmm0,xmm1 + a2a6: 48 83 c0 01 add rax,0x1 + a2aa: c5 fb 59 c0 vmulsd xmm0,xmm0,xmm0 + a2ae: c5 eb 58 d0 vaddsd xmm2,xmm2,xmm0 + a2b2: 48 39 c2 cmp rdx,rax + a2b5: 75 e1 jne a298 <_ZL19l2sq_naive_f64_oncePKfS0_m+0x18> + a2b7: c5 eb 10 c2 vmovsd xmm0,xmm2,xmm2 + a2bb: c3 ret + a2bc: 0f 1f 40 00 nop DWORD PTR [rax+0x0] + a2c0: c5 e9 57 d2 vxorpd xmm2,xmm2,xmm2 + a2c4: c5 eb 10 c2 vmovsd xmm0,xmm2,xmm2 + a2c8: c3 ret + a2c9: 0f 1f 80 00 00 00 00 nop DWORD PTR [rax+0x0] +``` + +- The code is much shorter than the float version. +- Although it includes scalar float→double conversions (`vcvtss2sd`) and computes one dimension at a time, it avoids the complex and costly 16-element horizontal folding. + +In other words, even with the conversion overhead, the simpler scalar path was still faster than the float version with vector folding. The compiler likely chose the conservative scalar path here, avoiding vectorization. + +**(3) How to Improve Naive L2sq for Better Compiler Vectorization?** + +The reason for horizontal folding is likely that the compiler strictly follows IEEE 754 semantics, preserving the exact order of floating-point additions. This prevents the compiler from reordering additions into vectorized accumulations. + +To relax this, we can explicitly allow reassociation: + +```c++ +static inline double l2sq_naive_f32(const float* a, const float* b, size_t n) { + float acc = 0.f; + #pragma omp simd reduction(+:acc) + for (size_t i = 0; i < n; ++i) { + float d = a[i] - b[i]; + acc += d * d; + } + return (double)acc; +} +``` + +And compile with `-fopenmp-simd` to enable this directive. + +Running again shows a significant improvement: compiler auto-vectorization now achieves performance close to manual SIMD implementations. Using `-ffast-math` also works. + +image-3 + + + +## 3. Summary + +1. SIMD significantly improves distance computation performance. +2. Hand-written SIMD implementations perform best. +3. For naive implementations, **allowing reassociation** (via `#pragma omp simd reduction(+:acc)` or appropriate subsets of `-ffast-math`) is the key to approaching hand-written SIMD performance. Under strict IEEE semantics, the compiler conservatively generates per-iteration folding, which creates slow paths where AVX-512 does not necessarily have an advantage. 
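As a side note on point 3 (my own sketch, not part of the benchmark above): the reduction can also be made vectorizer-friendly without `-ffast-math` or OpenMP by keeping several independent partial sums, so that the additions within each lane stay in their original order and no reassociation is required. Whether a given compiler actually maps the lane array onto a single SIMD register should be verified with `objdump`, as in the experiments above, and note that the final rounding differs slightly from the strictly sequential float accumulation.

```c++
#include <cstddef>

static inline double l2sq_naive_f32_lanes(const float* a, const float* b, size_t n) {
  // Eight independent accumulators: each lane's additions keep their original
  // order, so the compiler may keep acc[] in one 256-bit register and emit a
  // vector FMA per iteration without violating IEEE 754 evaluation order.
  float acc[8] = {0.f};
  size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    for (size_t j = 0; j < 8; ++j) {
      float d = a[i + j] - b[i + j];
      acc[j] += d * d;
    }
  }
  // Scalar tail, then a single horizontal reduction at the very end.
  double sum = 0.0;
  for (; i < n; ++i) { float d = a[i] - b[i]; sum += (double)d * d; }
  for (size_t j = 0; j < 8; ++j) sum += acc[j];
  return sum;
}
```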
diff --git a/_posts/2025-12-01-mysql-blob.md b/_posts/2025-12-01-mysql-blob.md new file mode 100644 index 0000000..1903bc9 --- /dev/null +++ b/_posts/2025-12-01-mysql-blob.md @@ -0,0 +1,103 @@ +--- +layout: post +title: MySQL BLOB Internals - Partial Update Implementation and Multi-Versioning +--- + +In this blog, I would like to introduce the implementation of BLOB and BLOB partial update in MySQL, and explain how the current design works together with the MVCC module to support multi-version control for BLOB columns. + +# 1. Background + +Before going into the details, I would like to briefly introduce two important concepts that are closely related to this topic. + +## 1. Basic Principles of MySQL MVCC (Multi-Version Concurrency Control) + +MySQL supports snapshot reads. Each read transaction reads data based on a certain snapshot, so even if other write transactions modify the data during the execution of a read transaction, the read transaction will always see the version it is supposed to see. + +The underlying mechanism is that a write transaction directly updates the data in place on the primary key record. However, before the update happens, the old value of the field to be modified is copied into the undo space. At the same time, there is a ROLL_PTR field in the row that points to the exact location in the undo space where the old value (the undo log record) is stored. + +image-1 + +As shown in the figure above, there is a row in the primary key index that contains three fields. Suppose a write transaction is modifying Field 2. It will first copy the original value of Field 2 into the undo space, and then overwrite Field 2 directly in the row. After that, two important system fields of the row are updated: + +- TRX_ID is set to the ID of the current write transaction and is used later by read transactions to determine visibility. +- ROLL_PTR points to the exact location in the undo space where the old value of the modified field is stored, and is used to reconstruct the previous version of the row when needed. + +After the update is finished, if a previously existing read transaction reads this row again, it will find, based on the TRX_ID, that the row has been modified by a later write transaction. Therefore, the current version of the row is not visible to this read transaction. It must roll back to the previous version. At this point, it uses the ROLL_PTR to locate the old value in the undo space, applies it to the current row, and thus reconstructs the version that it is supposed to see. + +## 2. Basic Implementation of MySQL BLOB + +The primary key record in MySQL contains the values of all fields and is stored in the clustered index. However, BLOB columns are an exception. Since they are usually very large, MySQL stores their data in separate data pages called external pages. + +A BLOB value is split into multiple parts and stored sequentially across multiple external pages. These pages are linked together in order, like a linked list. So how does the primary key record locate the corresponding BLOB data stored in those external pages? For each BLOB column, the clustered record stores a reference (lob::ref_t). This ref_t contains some metadata about the column and a pointer to the first external page where the BLOB data starts. + +image-2 + +When reading the row, MySQL first locates the row via the primary key index, then follows this reference to find the external pages and reconstructs the full BLOB value by copying the data from those pages. 
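To make the reference-plus-external-pages layout easier to picture, here is a simplified C++ model (my own sketch; the struct fields and names are illustrative and not meant to mirror InnoDB's actual lob::ref_t layout): the clustered record keeps only a small fixed-size reference, and reading the column means walking the chain of external pages and concatenating their payloads.

```c++
#include <cstdint>
#include <vector>

// Illustrative stand-in for the reference stored inside the clustered-index record.
struct BlobRef {
    uint32_t space_id;    // tablespace that holds the external pages
    uint32_t first_page;  // page number of the first external page
    uint64_t total_len;   // total length of the externally stored data
};

// Illustrative external page: a slice of the BLOB plus a link to the next page.
struct ExternalPage {
    std::vector<uint8_t> payload;
    uint32_t next_page;   // kNoPage when this is the last page of the chain
};

constexpr uint32_t kNoPage = UINT32_MAX;

// Reassemble the full BLOB value by following the page chain, as described above.
std::vector<uint8_t> ReadBlob(const BlobRef& ref,
                              const std::vector<ExternalPage>& pages /* indexed by page number */) {
    std::vector<uint8_t> value;
    value.reserve(ref.total_len);
    for (uint32_t page_no = ref.first_page; page_no != kNoPage;
         page_no = pages[page_no].next_page) {
        const ExternalPage& page = pages[page_no];
        value.insert(value.end(), page.payload.begin(), page.payload.end());
    }
    return value;
}
```

Everything that follows builds on this picture: partial update changes how the pages behind the reference are organized and versioned, not the fact that the clustered record itself only stores a small reference.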
+ +This is a very straightforward and intuitive design, simple and sufficient. It is also exactly how BLOB was implemented in older versions of MySQL. + +## 3. A “Thought Exercise” + +Based on the two points above, here is a question: + +**How is MVCC implemented for BLOB in MySQL?** + +The intuitive answer is as follows: the lob::ref_t stored in the primary key record follows the same MVCC rules. Every time a BLOB column is updated, the old BLOB value is read out, modified, and then the entire modified BLOB is written into newly allocated external pages. The corresponding lob::ref_t in the primary key record is overwritten with the new reference. At the same time, following the MVCC mechanism, the old lob::ref_t is copied into the undo space. + +image-3 + +After the modification, the situation looks like this (as shown in the figure): the undo space stores the lob::ref_t that points to the old BLOB value, while the lob::ref_t in the primary key record points to the new value. + +This is exactly how older versions of MySQL worked. The next question is: + +**What are the pros and cons of this design?** + +The advantage is that the undo log only needs to record the lob::ref_t, and it does not need to store the entire old BLOB value. + +The disadvantage is that no matter how small the change to the BLOB is, even if only a single byte is modified, the entire modified BLOB still has to be written into newly allocated external pages. BLOB columns are usually very large, so if each update only changes a very small portion, this design introduces a lot of extra I/O and space overhead. + +A typical example is JSON. Internally, MySQL stores JSON as BLOB. Usually, updates to JSON are local and small. However, with the old design, each small partial update still requires reading the entire JSON, modifying a part of it, and then inserting the whole value back again. This is obviously very heavy. + +So how to solve this problem? MySQL introduced BLOB partial update to address it. + +# 2. Implementation of BLOB Partial Update + +MySQL optimized the format of the external pages used to store BLOB data and redesigned the original simple linked-list structure: + +image-4 + +1. Each external page now has a corresponding index entry. +2. These index entries are organized as a linked list and stored in the **BLOB first page**. (If there are too many index entries to fit, they are stored in separate BLOB index pages.) +3. Under normal circumstances, these index entries are linked together in order, just like the external pages in the old implementation. +4. To support partial updates, MySQL changes the granularity of BLOB updates from the whole BLOB to individual external pages. Only the external pages involved in the current modification are updated. The modified external page is copied into a new page and updated there, while the other external pages remain unchanged. + +Then the question becomes: **how can MySQL make sure that it can read the correct new and old BLOB values?** The answer is that the new external page and the old external page share the same logical position in the index entry list. In other words, at this specific position in the list, there are now two versions, version 1 and version 2. Which one is used is determined by the version number recorded in the current lob::ref_t. The idea is illustrated in the figure below. + +image-5 + +In summary, MySQL transforms the original external-page linked list into a linked list of index entries. 
For each index entry in this list, if the corresponding external page is modified, a new version of the index entry is created at the same horizontal position to point to the new version of that external page. Essentially, this introduces multi-versioning for external pages.

## Special Case: BLOB Small Changes

The implementation described above is not the whole story. MySQL makes a practical trade-off between creating a new index entry (which requires copying the entire external page) and copying only the modified portion into the undo space.

In small-change scenarios, when the modification to a BLOB is smaller than 100 bytes, MySQL does not create a new index entry and link it into the version chain for that page. Instead, it modifies the page in place. Following MVCC principles, the portion to be modified is first written into the undo space before the in-place update happens.

image-6

It is worth noting that in this case, the lob::ref_t stored in the primary key record does not advance its base version number; it shares the same base as the previous version. When a read transaction needs to read the previous version, it first constructs the latest BLOB value based on the lob::ref_t and the index entry list. Then, following the MVCC logic, it finds that the TRX_ID indicates that this version is not visible. At this point, it follows the ROLL_PTR to the undo space, where the old value of the modified external page is stored. By applying that old data back onto the current value, the complete and correct historical BLOB value can be reconstructed.

In this scenario, the recovery process is a combination of two steps:

1. First, the version corresponding to the lob::ref_t is reconstructed via the index entry version chain.
2. Then, the version visible to the current transaction is reconstructed via the ROLL_PTR chain.

## Index Entry Details

Index entries are the key to the implementation of BLOB partial update. To make them easier to understand, I drew the following diagram to illustrate the logical relationships among index entries. It is a two-dimensional linked list: the horizontal dimension represents the sequential position when assembling the full BLOB value, and the vertical dimension represents multiple versions at the same position. Each time the page at a position is modified, a new node is added vertically.

image-7

Of course, this is only a logical model; the physical layout is not organized exactly like this. Each BLOB has a BLOB first page. This page stores a portion of the BLOB data (the initial part) and 10 index entries, each of which corresponds to one BLOB data page. When all 10 index entries are used up, a new BLOB index page is allocated, and additional index entries are allocated from there. In reality, the index entries distributed across the BLOB first page and the BLOB index pages are linked together to form the logical structure shown in the diagram above.
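To tie the logical model together before the physical layout diagram below, here is a small C++ sketch of the two-dimensional list described above (names and fields are mine and are not meant to mirror InnoDB's actual index entry format): each position in the horizontal list can carry a vertical chain of versions, and a reader picks, at every position, the newest entry visible to the version recorded in its lob::ref_t.

```c++
#include <cstdint>
#include <vector>

// Illustrative node of the two-dimensional index-entry list described above.
struct IndexEntry {
    uint32_t data_page;        // external page holding this slice of the BLOB
    uint32_t created_version;  // BLOB version that created this entry
    IndexEntry* next = nullptr;   // horizontal: next slice of the BLOB
    IndexEntry* older = nullptr;  // vertical: previous version at this position
};

// Walk the horizontal list; at each position, descend the vertical chain until
// an entry visible to the version stored in the lob::ref_t is found.
std::vector<uint32_t> CollectVisiblePages(const IndexEntry* first, uint32_t ref_version) {
    std::vector<uint32_t> pages;
    for (const IndexEntry* pos = first; pos != nullptr; pos = pos->next) {
        const IndexEntry* version = pos;
        while (version != nullptr && version->created_version > ref_version) {
            version = version->older;  // too new for this reader, try the older version
        }
        if (version != nullptr) {
            pages.push_back(version->data_page);  // page belonging to the visible BLOB value
        }
    }
    return pages;
}
```

A small-change update, as described earlier, would not add a vertical node at all; the reader instead reconstructs the older page content from the undo record reached through ROLL_PTR.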
+ +image-8 diff --git a/public/images/2025-01-14/1.png b/public/images/2025-01-14/1.png new file mode 100644 index 0000000..ca43cc2 Binary files /dev/null and b/public/images/2025-01-14/1.png differ diff --git a/public/images/2025-01-14/10.png b/public/images/2025-01-14/10.png new file mode 100644 index 0000000..41d6590 Binary files /dev/null and b/public/images/2025-01-14/10.png differ diff --git a/public/images/2025-01-14/11.png b/public/images/2025-01-14/11.png new file mode 100644 index 0000000..842608f Binary files /dev/null and b/public/images/2025-01-14/11.png differ diff --git a/public/images/2025-01-14/2.png b/public/images/2025-01-14/2.png new file mode 100644 index 0000000..4d04f83 Binary files /dev/null and b/public/images/2025-01-14/2.png differ diff --git a/public/images/2025-01-14/3.png b/public/images/2025-01-14/3.png new file mode 100644 index 0000000..075e3c7 Binary files /dev/null and b/public/images/2025-01-14/3.png differ diff --git a/public/images/2025-01-14/4.png b/public/images/2025-01-14/4.png new file mode 100644 index 0000000..301f9f3 Binary files /dev/null and b/public/images/2025-01-14/4.png differ diff --git a/public/images/2025-01-14/5.png b/public/images/2025-01-14/5.png new file mode 100644 index 0000000..46c506e Binary files /dev/null and b/public/images/2025-01-14/5.png differ diff --git a/public/images/2025-01-14/6.png b/public/images/2025-01-14/6.png new file mode 100644 index 0000000..10fd4b2 Binary files /dev/null and b/public/images/2025-01-14/6.png differ diff --git a/public/images/2025-01-14/7.png b/public/images/2025-01-14/7.png new file mode 100644 index 0000000..3d3ed84 Binary files /dev/null and b/public/images/2025-01-14/7.png differ diff --git a/public/images/2025-01-14/8.png b/public/images/2025-01-14/8.png new file mode 100644 index 0000000..b1acaf6 Binary files /dev/null and b/public/images/2025-01-14/8.png differ diff --git a/public/images/2025-01-14/9.png b/public/images/2025-01-14/9.png new file mode 100644 index 0000000..b274fa6 Binary files /dev/null and b/public/images/2025-01-14/9.png differ diff --git a/public/images/2025-02-28/1.png b/public/images/2025-02-28/1.png new file mode 100644 index 0000000..bdf6d72 Binary files /dev/null and b/public/images/2025-02-28/1.png differ diff --git a/public/images/2025-02-28/10.png b/public/images/2025-02-28/10.png new file mode 100644 index 0000000..3b00b90 Binary files /dev/null and b/public/images/2025-02-28/10.png differ diff --git a/public/images/2025-02-28/11.png b/public/images/2025-02-28/11.png new file mode 100644 index 0000000..296d68b Binary files /dev/null and b/public/images/2025-02-28/11.png differ diff --git a/public/images/2025-02-28/2.png b/public/images/2025-02-28/2.png new file mode 100644 index 0000000..55e1a10 Binary files /dev/null and b/public/images/2025-02-28/2.png differ diff --git a/public/images/2025-02-28/3.png b/public/images/2025-02-28/3.png new file mode 100644 index 0000000..e62df64 Binary files /dev/null and b/public/images/2025-02-28/3.png differ diff --git a/public/images/2025-02-28/4.png b/public/images/2025-02-28/4.png new file mode 100644 index 0000000..8247f00 Binary files /dev/null and b/public/images/2025-02-28/4.png differ diff --git a/public/images/2025-02-28/5.png b/public/images/2025-02-28/5.png new file mode 100644 index 0000000..01396bd Binary files /dev/null and b/public/images/2025-02-28/5.png differ diff --git a/public/images/2025-02-28/6.png b/public/images/2025-02-28/6.png new file mode 100644 index 0000000..f0d6657 Binary files /dev/null and 
b/public/images/2025-02-28/6.png differ diff --git a/public/images/2025-02-28/7.png b/public/images/2025-02-28/7.png new file mode 100644 index 0000000..41afe61 Binary files /dev/null and b/public/images/2025-02-28/7.png differ diff --git a/public/images/2025-02-28/8.png b/public/images/2025-02-28/8.png new file mode 100644 index 0000000..c0c89f0 Binary files /dev/null and b/public/images/2025-02-28/8.png differ diff --git a/public/images/2025-02-28/9.png b/public/images/2025-02-28/9.png new file mode 100644 index 0000000..06bd6d6 Binary files /dev/null and b/public/images/2025-02-28/9.png differ diff --git a/public/images/2025-04-11/1.png b/public/images/2025-04-11/1.png new file mode 100644 index 0000000..2e4dc97 Binary files /dev/null and b/public/images/2025-04-11/1.png differ diff --git a/public/images/2025-04-11/2.png b/public/images/2025-04-11/2.png new file mode 100644 index 0000000..b7973d8 Binary files /dev/null and b/public/images/2025-04-11/2.png differ diff --git a/public/images/2025-04-11/3.png b/public/images/2025-04-11/3.png new file mode 100644 index 0000000..8f5087a Binary files /dev/null and b/public/images/2025-04-11/3.png differ diff --git a/public/images/2025-04-11/4.png b/public/images/2025-04-11/4.png new file mode 100644 index 0000000..b104ec6 Binary files /dev/null and b/public/images/2025-04-11/4.png differ diff --git a/public/images/2025-04-11/5.png b/public/images/2025-04-11/5.png new file mode 100644 index 0000000..15d5fc9 Binary files /dev/null and b/public/images/2025-04-11/5.png differ diff --git a/public/images/2025-04-23/1.png b/public/images/2025-04-23/1.png new file mode 100644 index 0000000..0082325 Binary files /dev/null and b/public/images/2025-04-23/1.png differ diff --git a/public/images/2025-04-23/2.png b/public/images/2025-04-23/2.png new file mode 100644 index 0000000..84ee8fe Binary files /dev/null and b/public/images/2025-04-23/2.png differ diff --git a/public/images/2025-04-27/1.png b/public/images/2025-04-27/1.png new file mode 100644 index 0000000..94217d1 Binary files /dev/null and b/public/images/2025-04-27/1.png differ diff --git a/public/images/2025-05-10/1.png b/public/images/2025-05-10/1.png new file mode 100644 index 0000000..4ae45c9 Binary files /dev/null and b/public/images/2025-05-10/1.png differ diff --git a/public/images/2025-05-10/2.png b/public/images/2025-05-10/2.png new file mode 100644 index 0000000..133f1f2 Binary files /dev/null and b/public/images/2025-05-10/2.png differ diff --git a/public/images/2025-05-10/3.png b/public/images/2025-05-10/3.png new file mode 100644 index 0000000..612cabb Binary files /dev/null and b/public/images/2025-05-10/3.png differ diff --git a/public/images/2025-05-26/1.png b/public/images/2025-05-26/1.png new file mode 100644 index 0000000..4f08205 Binary files /dev/null and b/public/images/2025-05-26/1.png differ diff --git a/public/images/2025-06-05/1.png b/public/images/2025-06-05/1.png new file mode 100644 index 0000000..7264bf3 Binary files /dev/null and b/public/images/2025-06-05/1.png differ diff --git a/public/images/2025-07-14/1.png b/public/images/2025-07-14/1.png new file mode 100644 index 0000000..6c0f50a Binary files /dev/null and b/public/images/2025-07-14/1.png differ diff --git a/public/images/2025-08-11/1.png b/public/images/2025-08-11/1.png new file mode 100644 index 0000000..68d7b25 Binary files /dev/null and b/public/images/2025-08-11/1.png differ diff --git a/public/images/2025-09-08/1.png b/public/images/2025-09-08/1.png new file mode 100644 index 0000000..917bf93 Binary 
files /dev/null and b/public/images/2025-09-08/1.png differ diff --git a/public/images/2025-09-08/2.png b/public/images/2025-09-08/2.png new file mode 100644 index 0000000..fcbd2bb Binary files /dev/null and b/public/images/2025-09-08/2.png differ diff --git a/public/images/2025-09-08/3.png b/public/images/2025-09-08/3.png new file mode 100644 index 0000000..0e50619 Binary files /dev/null and b/public/images/2025-09-08/3.png differ diff --git a/public/images/2025-12-01/1.png b/public/images/2025-12-01/1.png new file mode 100644 index 0000000..7b5abcd Binary files /dev/null and b/public/images/2025-12-01/1.png differ diff --git a/public/images/2025-12-01/2.png b/public/images/2025-12-01/2.png new file mode 100644 index 0000000..5561f28 Binary files /dev/null and b/public/images/2025-12-01/2.png differ diff --git a/public/images/2025-12-01/3.png b/public/images/2025-12-01/3.png new file mode 100644 index 0000000..c7566e0 Binary files /dev/null and b/public/images/2025-12-01/3.png differ diff --git a/public/images/2025-12-01/4.png b/public/images/2025-12-01/4.png new file mode 100644 index 0000000..945c06f Binary files /dev/null and b/public/images/2025-12-01/4.png differ diff --git a/public/images/2025-12-01/5.png b/public/images/2025-12-01/5.png new file mode 100644 index 0000000..30b1737 Binary files /dev/null and b/public/images/2025-12-01/5.png differ diff --git a/public/images/2025-12-01/6.png b/public/images/2025-12-01/6.png new file mode 100644 index 0000000..73ed43a Binary files /dev/null and b/public/images/2025-12-01/6.png differ diff --git a/public/images/2025-12-01/7.png b/public/images/2025-12-01/7.png new file mode 100644 index 0000000..04972f1 Binary files /dev/null and b/public/images/2025-12-01/7.png differ diff --git a/public/images/2025-12-01/8.png b/public/images/2025-12-01/8.png new file mode 100644 index 0000000..ec7d269 Binary files /dev/null and b/public/images/2025-12-01/8.png differ diff --git a/public/images/2025-12-01/cover.png b/public/images/2025-12-01/cover.png new file mode 100644 index 0000000..eab1694 Binary files /dev/null and b/public/images/2025-12-01/cover.png differ diff --git a/public/images/ibdNinja-diagram/1.png b/public/images/ibdNinja-diagram/1.png new file mode 100644 index 0000000..fb28477 Binary files /dev/null and b/public/images/ibdNinja-diagram/1.png differ diff --git a/public/images/ibdNinja-diagram/10.png b/public/images/ibdNinja-diagram/10.png new file mode 100644 index 0000000..dc29fd8 Binary files /dev/null and b/public/images/ibdNinja-diagram/10.png differ diff --git a/public/images/ibdNinja-diagram/11.png b/public/images/ibdNinja-diagram/11.png new file mode 100644 index 0000000..5fde4ac Binary files /dev/null and b/public/images/ibdNinja-diagram/11.png differ diff --git a/public/images/ibdNinja-diagram/12.png b/public/images/ibdNinja-diagram/12.png new file mode 100644 index 0000000..74ca5a8 Binary files /dev/null and b/public/images/ibdNinja-diagram/12.png differ diff --git a/public/images/ibdNinja-diagram/2.png b/public/images/ibdNinja-diagram/2.png new file mode 100644 index 0000000..2fa7224 Binary files /dev/null and b/public/images/ibdNinja-diagram/2.png differ diff --git a/public/images/ibdNinja-diagram/3.png b/public/images/ibdNinja-diagram/3.png new file mode 100644 index 0000000..afaca3a Binary files /dev/null and b/public/images/ibdNinja-diagram/3.png differ diff --git a/public/images/ibdNinja-diagram/4.png b/public/images/ibdNinja-diagram/4.png new file mode 100644 index 0000000..ce5fe56 Binary files /dev/null and 
b/public/images/ibdNinja-diagram/4.png differ diff --git a/public/images/ibdNinja-diagram/5.png b/public/images/ibdNinja-diagram/5.png new file mode 100644 index 0000000..01a4b0f Binary files /dev/null and b/public/images/ibdNinja-diagram/5.png differ diff --git a/public/images/ibdNinja-diagram/6.png b/public/images/ibdNinja-diagram/6.png new file mode 100644 index 0000000..8008dad Binary files /dev/null and b/public/images/ibdNinja-diagram/6.png differ diff --git a/public/images/ibdNinja-diagram/7.png b/public/images/ibdNinja-diagram/7.png new file mode 100644 index 0000000..2252ec9 Binary files /dev/null and b/public/images/ibdNinja-diagram/7.png differ diff --git a/public/images/ibdNinja-diagram/8.png b/public/images/ibdNinja-diagram/8.png new file mode 100644 index 0000000..9e72a00 Binary files /dev/null and b/public/images/ibdNinja-diagram/8.png differ diff --git a/public/images/ibdNinja-diagram/9.png b/public/images/ibdNinja-diagram/9.png new file mode 100644 index 0000000..1aa638b Binary files /dev/null and b/public/images/ibdNinja-diagram/9.png differ