-
Page storage for RustDb
Any general purpose database will typically store information in pages, like pages in a book. When a page becomes full, it will generally need to be split (typically into two halves) before new data can be added.
Most database implementations I have looked at use fixed size pages; however, this has the drawback that after a page is split, it may be only half full, meaning that 50% of the storage is unused. While it may be possible to “repack” pages to reduce wasted storage, RustDb has an efficient implementation of variable size pages, in fact two different implementations of the PageStorage trait.
The first option is “CompactFile”. This (roughly speaking) splits pages into 1-16 kilobyte chunks. However, more recently I have developed another option, “BlockPageStorage”, which first divides the underlying file into a small number of variable size sub-files ( each built out of fixed size blocks ), and then allocates pages from these sub-files, with each sub-file being used to store pages of a particular size. This is implemented as three structs: BlockStg, which manages fixed size blocks; DividedStg, which implements multiple sub-files out of fixed size blocks; and finally BlockPageStorage, which uses the sub-files to manage variable size pages. Since blocks are fixed size, when there is a free block, the last block in the file can be relocated to fill the gap. Similarly, when there is a free page, the last page in the sub-file can be relocated to fill the gap. Thus the amount of unused file storage is minimal.
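A rough sketch of the relocation idea is shown below (this is purely illustrative and not the actual RustDb code; the function and its Vec-of-blocks model are invented for the example):
// Free a block by moving the final block of the file into the gap, then
// shrinking the file by one block, so no block is left unused.
fn free_block(blocks: &mut Vec<Vec<u8>>, freed: usize) {
    let last = blocks.len() - 1;
    blocks.swap(freed, last);
    blocks.pop();
    // A real implementation must also update the mapping from logical block
    // numbers to physical block positions at this point.
}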
The choice of block size and page sizes is not fixed and can be configured. There is some advantage in choosing a block size which can be divided exactly by a range of integers – this means that when a page is read, the read does not cross a block boundary, so there is a single read to the underlying file system (disregarding “small” reads which are required to establish the page location, which can be buffered). For example, a block size of 27,720 ( = 5 x 7 x 8 x 9 x 11 ) can be divided evenly by integers in the range 6 to 12, producing pages of size 2310, 2520, 2772, 3080, 3465, 3960 and 4620 bytes.
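As a quick standalone check (this snippet is not part of RustDb), the page sizes above follow directly from dividing the block size by each integer in the range:
fn main() {
    let block_size: u64 = 27_720; // 5 x 7 x 8 x 9 x 11
    for divisor in (6..=12u64).rev() {
        // Each divisor gives a page size that fits exactly within one block.
        assert_eq!(block_size % divisor, 0);
        println!("{} pages of {} bytes fit in one block", divisor, block_size / divisor);
    }
}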
-
Database replication and backup
For most databases, a primary concern is ensuring that data is not lost. RustDb does not have database backup and replication built in; however, the RustWeb2 program built on it does.
The fundamental idea is that transactions that modify the database are saved in a log table, then fetched and stored in a copy of the log table by one or more backup servers. These backup servers can then re-run the transactions on a copy of the database (by calling the stored procedure log.Roll, shown below ) to reproduce the database at any stage in its evolution, using an initial copy of the database made (automatically) when the replication process is set up.
CREATE FN log.Roll() AS
BEGIN
  -- This applies all updates. Updates may need to be limited or filtered in some way.
  DECLARE nt int, a int, d binary
  SET nt = log.NextTransaction()
  SET a = Done FROM log.Status
  WHILE a < nt
  BEGIN
    SET d = BINUNPACK(data) FROM log.Transaction WHERE Id = a
    SET a += 1
    IF DOLOG(d) = 1 BREAK
  END
  UPDATE log.Status SET Done = a WHERE true
  IF a < nt SELECT '<p>Roll incomplete ' | nt - a | ' transactions outstanding'
  ELSE SELECT '<p>Roll complete'
END
The table that stores the transactions is log.Transaction. It has a single binary column “data” which holds the compressed, serialised data representing the transaction. The table log.Status has a single row with a single integer column “Done” which tracks which transactions have been applied to the database copy. Note that if function, schema or table definitions are modified, DOLOG() returns 1 and Roll must be called repeatedly until all updates have been applied.
The backup servers do not have to run the transactions immediately; the roll forward can be delayed. This allows data that has been inadvertently deleted to be recovered. However, the transactions are fetched and copied almost immediately (provided the network is functioning), meaning that in the event of a disastrous loss of the master database, it can be reconstructed up to the moment before the loss.
A reasonable setup would be to keep one copy that is never rolled forward, which can be copied and rolled forward to recover the state of the database at any point in time if required, and another copy which is regularly rolled forward with only minimal delay ( perhaps a few hours or days ) so that an up-to-date copy of the database is available at short notice. In addition, if some mistake is made, such as deleting a table, the forward roll can be modified to omit the mistake using a customised version of log.Roll. The source for the backup server can be either the master server or another backup server.
Note: RustWeb2 also has more conventional backup, in the form of scripting which will rapidly take a complete copy of the database as SQL text. However, this is not a good solution for backing up the database in “real time”, that is, within a second or two of the master database being updated, which could happen many thousands or even millions of times every day.
Trimming the log table
Once all the backup servers have fetched all existing rows from the current log table, it is possible to delete the rows that are no longer required; however, some care is needed to ensure that “future” rows in the backup databases are not deleted before they are applied. One way is to specify explicitly which rows should be deleted, for example:
DELETE FROM log.Transaction WHERE Id < 20
It is also possible to suppress the logging of a transaction by calling the NOLOG() built-in function.
DECLARE dummy int SET dummy = NOLOG()
DELETE FROM log.Transaction WHERE Id < 20
This means that the transaction is executed only on the master database. Note that transactions run on a backup database running in replication mode are never logged since this would create a conflict with the usage of the log.Transaction table by the master database.
-
Keeping a family tree with RustDb
This blog post describes how to make a simple one-table database that stores family tree data, with an HTML-based browser interface. Trying this out (including creating your own local database and web server) should take about 15 minutes.
Step 1 : install and run RustWeb2
Follow the “Installation and Starting Server” instructions here ( cargo install rustweb2 then run rustweb2 ).
Step 2 : create a schema and table.
To create a schema and table to store the family tree, click on the “Execute Sql” link in the menu, and execute the following SQL commands:
CREATE SCHEMA famt
GO
CREATE TABLE famt.Person(Male bool, Firstname string, Surname string, Notes string, BirthDate int, DeathDate int, Mother int, Father int)
GO
Step 3 : configure the date columns as dates.
Note that dates are represented as integers at the database level; however, we will want to display and enter dates as months/days/years. To configure this, return to the main menu, click on the famt schema, and click on the Person table. Now click on BirthDate, set “Datatype” to Date, then save. Do the same for DeathDate. For DeathDate, also configure the Default as 0.
Step 4 : start to configure the Mother and Father columns.
Next, to let the system know that the Mother and Father columns refer to people, click on Father and change RefersTo to [famt].Person. Do the same for Mother.
Step 5 : finish the configuration of the Mother and Father columns.
Next, to let the system know how a person is named, use “Execute Sql” to create a “naming” function:
CREATE FN [famt].[PersonName]( id int ) RETURNS string AS
BEGIN
  SET result = Firstname | ' ' | Surname | ' ' | date.YearMonthDayToString(BirthDate) | '-' | date.YearMonthDayToString(DeathDate) FROM famt.Person WHERE Id = id
END
GO
Also, create the following functions:
CREATE FN [famt].[FatherDisplay]( colid int, k int, ba string ) AS
BEGIN
  DECLARE male bool
  SET male = Male FROM famt.Person WHERE Id = k
  IF male EXEC browse.ChildDisplay( colid, k, ba )
END
GO
CREATE FN [famt].[FatherSelect]( colid int, sel int ) RETURNS string AS
BEGIN
  RETURN famt.ParentSelect(colid, sel, true)
END
GO
CREATE FN [famt].[MotherDisplay]( colid int, k int, ba string ) AS
BEGIN
  DECLARE female bool
  SET female = NOT Male FROM famt.Person WHERE Id = k
  IF female EXEC browse.ChildDisplay( colid, k, ba )
END
GO
CREATE FN [famt].[MotherSelect]( colid int, sel int ) RETURNS string AS
BEGIN
  RETURN famt.ParentSelect(colid, sel, false)
END
GO
CREATE FN [famt].[ParentSelect]( colid int, sel int, male bool ) RETURNS string AS
BEGIN
  DECLARE col string
  SET col = Name FROM sys.Column WHERE Id = colid
  DECLARE opt string, options string
  DECLARE by int, k int, ks string
  SET ks = web.Query( 'k' )
  IF ks != '' SET k = Id FROM famt.Person WHERE Id = PARSEINT(ks)
  FOR opt = '<option ' | CASE WHEN Id = sel THEN ' selected' ELSE '' END | ' value=' | Id | '>' | web.Encode( famt.PersonName(Id) ) | '</option>' FROM famt.Person WHERE Male = male ORDER BY Firstname, Surname
    SET options = options | opt
  RETURN '<select id="' | col | '" name="' | col | '">' | options | '<option ' | CASE WHEN sel = 0 THEN ' selected' ELSE '' END | ' value=0></option>' | '</select>'
END
GO
Finally, configure the table and column data to use these functions. From the menu, again click on the famt schema, then click on the Person table. Now click on Settings, set NameFunction to famt.PersonName, and Save.
Click on the Mother column and set Input Function to famt.MotherSelect and ChildDisplayFunction to famt.MotherDisplay. Similarly, click on the Father column and set Input Function to famt.FatherSelect and ChildDisplayFunction to famt.FatherDisplay.
Now everything is set up. Use Add to add a person to the database, entering information as required.
Note that to save the data, you can use the “Script” link to save the entire famt schema as SQL commands.
-
Storage of variable length values in RustDb
Fragments
The basic storage mechanism in RustDb ( a high-performance database written entirely in Rust ) is SortedFile. SortedFile supports an unlimited number of records, stored in key order. However, the records are fixed in size ( and also limited in size ).
To store arbitrary size values, they need to be split into fragments of fixed size. To minimise overhead, instead of using a single fragment size, RustDb uses four fragment sizes. Each fragment size has an associated SortedFile. Currently the fragment sizes are 40, 127, 333 and 1011 bytes.
The overhead for a single fragment is 12 bytes, made up of 8 bytes to store the record ID, 3 bytes for node overhead within a page, and one byte for the length of the fragment data.
When storing a value, the best fragment size is chosen for the length of the value, to minimise overhead and unused space in the last fragment. For example, to store a value of 100 bytes, using the 40 byte fragment size requires 3 x ( 40 + 12 ) = 156 bytes, whereas using the 127 byte fragment size needs only 1 x ( 127 + 12 ) = 139 bytes, so in this case the 127 byte fragment size is chosen. The code, taken from the Bytes module, is shown below:
/// =4. Number of fragment types.
pub const NFT: usize = 4;

/// Bytes per fragment.
pub static BPF: [usize; NFT] = [40, 127, 333, 1011];

/// Total bytes used taking into account all overhead ( 3 + 1 + 8 = 12 bytes, per fragment ).
fn total(len: usize, ft: usize) -> usize {
    let bpf = BPF[ft];
    let nf = (len + bpf - 1) / bpf;
    nf * (bpf + 12)
}

/// Calculate best fragment type from byte length.
pub fn fragment_type(len: usize) -> usize {
    let mut best = usize::MAX;
    let mut result = 0;
    for ft in 0..NFT {
        let t = total(len, ft);
        if t <= best {
            best = t;
            result = ft;
        }
    }
    result
}

Fragment Length
In versions before version 3, the fragment length was stored as a single byte. In addition, a bit is needed to indicate that a fragment is the last fragment for the value; together these limited the fragment size to 127 bytes. However, this meant that for large values, about 10% of storage was taken up in overhead.
Starting from version 3, the amount of unused space in the fragment is stored instead of the length. As well as the flag bit to indicate the last fragment, another flag bit indicates whether the unused space is more than 63 bytes, in which case a second byte (taken from the unused space) is used for the unused length. This means the fragment size is no longer limited to 127 bytes, reducing the overhead for large values to about 1% instead of 10%.
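As an illustration of how such an encoding can work, here is a sketch with an assumed bit layout (the exact layout used by RustDb may differ): bit 7 of the first byte marks the last fragment, bit 6 indicates that a second byte follows, and the remaining bits hold the unused count.
/// Encode the last-fragment flag and the unused byte count (assumed layout, for illustration only).
fn encode(last: bool, unused: u16) -> (u8, Option<u8>) {
    let mut b0 = (last as u8) << 7;
    if unused > 63 {
        b0 |= 1 << 6; // flag: the unused count needs a second byte
        b0 |= (unused & 0x3f) as u8; // low 6 bits of the count
        (b0, Some((unused >> 6) as u8)) // high bits go in the second byte
    } else {
        (b0 | unused as u8, None)
    }
}
/// Decode the flag and unused count from one or two bytes.
fn decode(b0: u8, b1: Option<u8>) -> (bool, u16) {
    let last = b0 & 0x80 != 0;
    let mut unused = (b0 & 0x3f) as u16;
    if b0 & 0x40 != 0 {
        unused |= (b1.expect("second byte expected") as u16) << 6;
    }
    (last, unused)
}
The fragment data length is then simply the fragment size minus the decoded unused count.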
In addition, the largest fragment size (1011 bytes) has been chosen to minimise unused space at the page level ( pages are 16kb, allocated in units of 1kb ).
Inline storage
Finally, small variable length values are stored without using fragments. The amount of inline storage (from 15 to 250 bytes) is specified when a table column is created. For example:
CREATE TABLE dbo.Cust( Name string(20) )
means that names up to 20 bytes are stored without using fragments. Nevertheless, the Name column is not restricted to 20 bytes. The default is 15 bytes.
-
Concurrency in RustDb
Introduction
This blog post discusses the way RustDb ( a high-performance database written entirely in Rust ) handles multiple concurrent queries.
Before talking about concurrency, it may help to state plainly what a database is. A database is essentially a substantial store of information, facts, or knowledge. Something like a large library. It could be holding books, weather data, astronomical data, historical data, business data.
The main function of a database is to answer questions, or queries, from clients. A database can be large, and queries may need to access a substantial portion of the data. On the other hand, new information may be acquired and need to be added to the database.
Concurrency
The question arises: what should happen if multiple clients want to query and update the database? What should happen if a long query is in progress, but new information needs to be added to the database?
Possible solution – delay the update
The update is delayed until the long query has finished.
This may not be satisfactory. As a database can be large, it could be minutes before an update could be processed, and if new queries are continually arriving, an update could be deferred indefinitely.
Possible solution – locking
As the long query progresses, objects in the database ( tables, rows, indexes ) are locked. This is, I believe, the widely adopted approach; for example, Microsoft SQL Server uses locking. However, this is still not very satisfactory: if we are lucky the update can still happen, but if not we still have the long wait. Or perhaps the long query can be cancelled and retried later (there are also potential deadlock situations). If updates happen, say, every 10 seconds, and the long query takes several minutes, then the long query may have to be retried many times, and could end up taking hours instead of minutes.
Possible solution – use a copy of the database
The long query is performed on a copy of the database. This is essentially the solution adopted by RustDb. It sounds expensive and impractical: if the database is many gigabytes of data, taking a copy of the entire database could take a long time, and in the meantime no updates can happen. This leads us to the solution actually adopted by RustDb.
Possible solution – use a virtual copy of the database
The long query is performed on a “virtual copy” of the database. The database is divided into logical pages, and when an update is to be performed, before storing an updated page, the software takes a copy of the old page.
When a page is fetched, we specify not just the page number, but also which “virtual copy” of the database is required. When an update concludes, or when a read-only query finishes, it may be possible to remove some of the copied pages.
A large update will prevent other updates from running. However, due to the incremental nature of information, it should be possible to divide updates into shorter chunks. In such cases, queries may need to take into account incomplete data. For example, a report that summarises sales data should not include data for the current sales period until all data for that period has been inserted into the database, or alternatively note it as incomplete.
Implementation
Keeping track of the old copies is straightforward in Rust. The data structure is shown below:
struct PageData {
    /// Current data for the page.
    current: Option<Data>,
    /// Historic data for the page.
    history: BTreeMap<u64, Data>,
}
Here the current data for the page is an option, because the value may be stored in backing storage ( Data is simply defined as Arc<Vec<u8>> ).
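As a hedged sketch (not the actual RustDb code, which is linked at the end of this post), a read for a particular virtual copy might resolve a page as follows, assuming each historic copy is keyed by the write time at which it was superseded:
impl PageData {
    /// Return the page data as seen by a reader whose snapshot time is `time`.
    fn get_data(&self, time: u64) -> Option<Data> {
        // The oldest copy saved by a write that happened after `time` is what the
        // reader should see; if there is none, the page has not changed since the
        // reader started, so the current data applies.
        if let Some((_t, d)) = self.history.range(time + 1..).next() {
            Some(d.clone())
        } else {
            self.current.clone()
        }
    }
}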
The page history is defined as a map from database time to data. The main complication is deciding when the history can be trimmed, that is, when there is no longer any possibility that a reader could access a historic copy of the page. The central data structure ( slightly simplified ) is this:
pub struct Stash {
    /// Write time - number of writes.
    time: u64,
    /// Page number -> page info.
    pages: HashMap<u64, PageInfoPtr>,
    /// Time -> reader count. Number of readers for given time.
    rdrs: BTreeMap<u64, usize>,
    /// Time -> set of page numbers. Page copies held for given time.
    vers: BTreeMap<u64, HashSet<u64>>,
}
Here “time” is the number of write transactions that have completed. “pages” is used to look up the page information, “rdrs” counts the number of outstanding readers for a given time, and “vers” has, for each time, the set of pages copied at that time.
When a read or write transaction completes, “history” pages may be freed. The first step is to establish the range of times for which there is no longer any reader. Having done this, “vers” is used to find each potentially affected page and check whether a copy can be freed. The Rust code is shown below:
fn trim(&mut self, time: u64) {
    let (s, r) = (self.start(time), self.retain(time));
    if s != r {
        let mut empty = Vec::<u64>::new();
        for (t, pl) in self.vers.range_mut(s..r) {
            pl.retain(|pnum| {
                let p = self.pages.get(pnum).unwrap();
                let mut p = p.d.lock().unwrap();
                p.trim(*t, s)
            });
            if pl.is_empty() {
                empty.push(*t);
            }
        }
        for t in empty {
            self.vers.remove(&t);
        }
    }
}
Here the function parameter “time” is either the time for which the reader count has just gone to zero, or the time of the write transaction which has just finished. The slight complication is keeping track of the empty sets in vers, and subsequently removing them, using the “empty” list.
The complete code can be found here:
https://docs.rs/rustdb/2.14.1/src/rustdb/pstore.rs.html
Note: after implementing this, and writing the post, I discovered this article on Wikipedia, which pretty much describes the same thing: https://en.wikipedia.org/wiki/Multiversion_concurrency_control
[ This is the first of a series of articles I am planning on writing about RustDb ]