dev-notes/docs/databases/database-systems/database-storage.md
2024-12-20 00:27:25 +01:00

Disk Oriented Database Storage

The DBMS assumes that the primary storage location of the database is non-volatile disk.

The DBMS's components manage the movement of data between non-volatile and volatile storage.

storage-hierarchy

Access Times:

  • L1 Cache Ref: 0.5 ns
  • L2 Cache Ref: 7 ns
  • RAM: 100 ns
  • SSD: 150 µs
  • HDD: 10 ms
  • Network Storage: ~30 ms
  • Tape Archives: 1 s

The goal is to allow the DBMS to manage databases that exceed the amount of memory available.

Reading from and writing to disk is expensive, so it must be managed carefully to avoid large stalls and performance degradation.

disk-operations

OS Memory Management

It's possible to use memory mapping (mmap) to map the contents of a file into a process's address space. The OS is responsible for moving the file pages in and out of memory.

The OS does not know the intended usage for the memory and so fine-grained control over the pages is not possible:

  • When physical memory is full, loading and evicting memory pages can slow or stall the thread.
  • Using multiple threads works for concurrent reads but is complicated for concurrent writes.
  • The order in which pages are evicted is not controllable.

There are some solutions to this problem:

  • madvise: tell the OS how certain pages are expected to be read.
  • mlock: tell the OS that memory ranges cannot be paged out (can still write on it).
  • msync: tell the OS to flush memory ranges out to disk.
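As a rough illustration of these calls, here is a minimal sketch using Python's standard `mmap` module. The function name `demo_mmap`, the scratch file, and the 4096-byte `PAGE_SIZE` are assumptions made for the example. `mlock` is left out of the sketch, and the `madvise` hint is applied only where the platform exposes it.

```python
import mmap
import os
import tempfile

PAGE_SIZE = 4096  # assumed page size for this sketch


def demo_mmap() -> bytes:
    # Create a scratch file holding two zeroed pages.
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, bytes(2 * PAGE_SIZE))
        # Map the whole file; the OS decides when pages are loaded or evicted.
        with mmap.mmap(fd, 2 * PAGE_SIZE) as mm:
            # Hint the expected access pattern (madvise) where it is available.
            if hasattr(mm, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                mm.madvise(mmap.MADV_SEQUENTIAL)
            # Write into the second page through the mapping.
            mm[PAGE_SIZE:PAGE_SIZE + 5] = b"hello"
            # Ask the OS to flush the dirty pages of the mapping to disk.
            mm.flush()
        # Read back through the regular file API to confirm the write landed.
        os.lseek(fd, PAGE_SIZE, os.SEEK_SET)
        return os.read(fd, 5)
    finally:
        os.close(fd)
        os.remove(path)
```

Note that the program never decides which page is resident or when it is written back, apart from the explicit flush; that is the loss of control the list above describes.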

Using mmap leads to performance and correctness problems.

The DBMS (almost) always wants to control things itself and can do a better job at it:

  • flushing dirty pages to disk in the correct order
  • specialized pre-fetching
  • buffer replacement policy
  • thread/process scheduling

File Storage

The DBMS stores a database as one or more files on disk. The OS does not know anything about the contents of those files.

The storage manager is responsible for maintaining a database's files. It organizes files as a collection of pages. It tracks data read/written to pages and the available space.

A page is a fixed-size block of data. It can contain tuples, metadata, indexes, log records, etc.
Most systems do not mix page types and some require pages to be self-contained for disaster recovery purposes.

Each page is given a unique identifier. The DBMS uses an indirection layer to map page ids to physical locations.
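The indirection layer can be pictured as a small lookup table. The sketch below is only an illustration: the `StorageManager` class, its method names, and the 4096-byte `PAGE_SIZE` are hypothetical, and a real system would persist the mapping rather than keep it in a Python dict.

```python
import io

PAGE_SIZE = 4096  # assumed database page size


class StorageManager:
    """Hypothetical storage manager over a single file-like object."""

    def __init__(self, file):
        self.file = file
        self.page_table = {}  # page id -> byte offset in the file
        self.next_page_id = 0

    def allocate_page(self) -> int:
        # Append a zeroed page to the file and remember where it lives.
        offset = self.file.seek(0, io.SEEK_END)
        self.file.write(bytes(PAGE_SIZE))
        page_id = self.next_page_id
        self.next_page_id += 1
        self.page_table[page_id] = offset
        return page_id

    def write_page(self, page_id: int, data: bytes) -> None:
        if len(data) != PAGE_SIZE:
            raise ValueError("a page must be exactly PAGE_SIZE bytes")
        self.file.seek(self.page_table[page_id])
        self.file.write(data)

    def read_page(self, page_id: int) -> bytes:
        self.file.seek(self.page_table[page_id])
        return self.file.read(PAGE_SIZE)
```

Callers only ever see page ids, so the storage manager is free to move a page to another offset (or another file) by updating `page_table`.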

There are three different notions of "pages" in a DBMS:

  • Hardware Page (usually 4KB)
  • OS Page (usually 4KB)
  • Database Page (512B - 16KB)

A hardware page is the largest block of data that the storage device can guarantee to write atomically (a "failsafe write").

Different DBMSs manage pages in files on disk in different ways:

  • Heap File Organization
  • Sequential / Sorted File Organization
  • Hashing File Organization

At this point in the hierarchy, the content of a page does not matter.

Database Heap File

A heap file is an unordered collection of pages where tuples are stored in random order. Metadata is needed to keep track of which pages exist and which have free space.

There are two ways to represent a heap file:

  • Linked List
  • Page Directory

Linked List

Maintain a header page at the beginning of the file that stores two pointers:

  • HEAD of the free page list
  • HEAD of the data page list

Each page keeps track of the number of free slots in itself.

linked-list-heap-file

Page Directory

The DBMS maintains special pages that track the location of data pages in the database files.
The directory also records the number of free slots per page.

The DBMS has to make sure that the directory pages are in sync with the data pages.
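A minimal sketch of the bookkeeping a page directory performs, assuming a hypothetical in-memory `PageDirectory` class that maps page ids to free-slot counts. In a real system the directory lives in its own pages on disk.

```python
class PageDirectory:
    """Hypothetical page directory: page id -> number of free slots."""

    def __init__(self):
        self.free_slots = {}

    def register_page(self, page_id: int, slots: int) -> None:
        # Called when a new, empty data page is added to the heap file.
        self.free_slots[page_id] = slots

    def find_page_with_space(self):
        # Return any page that still has a free slot, or None if all are full.
        for page_id, free in self.free_slots.items():
            if free > 0:
                return page_id
        return None

    def record_insert(self, page_id: int) -> None:
        # Must happen together with the data-page write to stay in sync.
        if self.free_slots[page_id] == 0:
            raise ValueError("page is full")
        self.free_slots[page_id] -= 1
```

The sketch makes the synchronization problem visible: if a tuple is written to a data page but `record_insert` is never called, the directory overstates the free space.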

page-directory

Page Layout

Every page contains a header of metadata about the page contents:

  • Page Size
  • Checksum
  • DBMS Version
  • Transaction Visibility
  • Compression Information

Data inside a page can be organized with two approaches:

  • Tuple oriented
  • Log oriented

Log Oriented Page Layout

Instead of storing tuples in pages, the DBMS only stores log records in the files.

The system appends log records of how the database was modified:

  • Inserts store the entire tuple.
  • Deletes mark the tuple as deleted.
  • Updates contain the delta of just the attributes that were modified.

To read a record, the DBMS scans the log and "recreates" the tuple to find what it needs.
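The read path can be sketched as a replay of the log. The record shape `(operation, key, payload)` and the function name `replay` are assumptions made for illustration, and the sketch scans the whole log from oldest to newest for simplicity.

```python
def replay(log, key):
    """Rebuild the current version of `key` by scanning the log in order."""
    current = None
    for op, record_key, payload in log:
        if record_key != key:
            continue
        if op == "INSERT":
            current = dict(payload)      # inserts carry the entire tuple
        elif op == "UPDATE" and current is not None:
            current.update(payload)      # updates carry only the changed attributes
        elif op == "DELETE":
            current = None               # deletes mark the tuple as gone
    return current


log = [
    ("INSERT", 1, {"a": 1, "b": "x"}),
    ("INSERT", 2, {"a": 2, "b": "y"}),
    ("UPDATE", 1, {"b": "z"}),
    ("DELETE", 2, None),
]
```

With this log, `replay(log, 1)` yields the inserted tuple with `b` overwritten by the update, and `replay(log, 2)` yields `None` because the tuple was deleted.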

log-record

Tuple Oriented Page Layout

The most common layout scheme is called slotted pages. The slot array maps "slots" to the tuples' starting position offsets.

The header keeps track of:

  • the number of used slots.
  • the offset of the starting location of the last used slot.

The tuples start from the end of the page and grow "upwards" while the slot array starts from the beginning of the page and grows "downwards".
When the two meet the page is full.
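The layout described above can be sketched in a few lines. This is a simplified illustration, not any particular system's format: the `SlottedPage` class, the 4096-byte page, and the 2-byte fields are all assumptions, and deletion and compaction are left out.

```python
import struct

PAGE_SIZE = 4096                 # assumed page size
HEADER = struct.Struct("<HH")    # number of used slots, offset where the last tuple starts
SLOT = struct.Struct("<HH")      # tuple offset, tuple length


class SlottedPage:
    def __init__(self):
        self.data = bytearray(PAGE_SIZE)
        # Empty page: no slots, tuple area starts at the very end.
        HEADER.pack_into(self.data, 0, 0, PAGE_SIZE)

    def _header(self):
        return HEADER.unpack_from(self.data, 0)

    def insert(self, tuple_bytes: bytes):
        num_slots, tuples_start = self._header()
        slot_pos = HEADER.size + num_slots * SLOT.size
        new_start = tuples_start - len(tuple_bytes)
        # The page is full when the slot array and the tuple area would overlap.
        if new_start < slot_pos + SLOT.size:
            return None
        self.data[new_start:tuples_start] = tuple_bytes
        SLOT.pack_into(self.data, slot_pos, new_start, len(tuple_bytes))
        HEADER.pack_into(self.data, 0, num_slots + 1, new_start)
        return num_slots  # slot number of the new tuple

    def read(self, slot: int) -> bytes:
        num_slots, _ = self._header()
        if not 0 <= slot < num_slots:
            raise IndexError("no such slot")
        offset, length = SLOT.unpack_from(self.data, HEADER.size + slot * SLOT.size)
        return bytes(self.data[offset:offset + length])
```

Because readers go through the slot array, the page could later move tuples around internally without changing the slot numbers that the rest of the system holds.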

Note: the position and growth direction are similar to a program's stack and heap memory.

page-layout

Record IDs

The DBMS needs a way to keep track of individual tuples. Each tuple is assigned a unique record identifier.
The most common form is page_id + offset/slot. Record ids can also contain file location info.

Note: an application cannot rely on record ids to derive meaning.

Tuple Layout

A tuple is essentially a sequence of bytes. It's the job of the DBMS to interpret those bytes into types and values.

The DBMS's catalogs contain the schema information about tables and can be used to understand the tuple layout.

Each tuple is prefixed with a header that contains metadata about it:

  • visibility info (concurrency control)
  • bitmap for NULL values

Attributes are typically stored in the order in which they are defined while creating the table.

Note: metadata about the schema is not stored in the tuple.

tuple-layout

CREATE TABLE <table> (
    a INT PRIMARY KEY,
    b INT NOT NULL,
    c FLOAT,
    d INT
)
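For the table above, a tuple could be serialized roughly as follows. The sketch assumes 4-byte little-endian integers, a 4-byte float for `c`, and a one-byte header holding only the NULL bitmap (visibility info is omitted); the names `TUPLE`, `encode`, and `decode` are hypothetical.

```python
import struct

# Header (1-byte NULL bitmap) followed by a, b, c, d in declaration order.
TUPLE = struct.Struct("<Biifi")
COLUMNS = ("a", "b", "c", "d")


def encode(row: dict) -> bytes:
    bitmap = 0
    values = []
    for i, name in enumerate(COLUMNS):
        value = row.get(name)
        if value is None:
            bitmap |= 1 << i   # mark the attribute as NULL
            value = 0          # placeholder bytes, ignored on decode
        values.append(value)
    return TUPLE.pack(bitmap, *values)


def decode(raw: bytes) -> dict:
    bitmap, *values = TUPLE.unpack(raw)
    return {
        name: None if bitmap & (1 << i) else value
        for i, (name, value) in enumerate(zip(COLUMNS, values))
    }
```

The column names and types appear only in the code (standing in for the catalog), never in the encoded bytes, which is why the DBMS needs the catalog to interpret a tuple.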

Data Representation

  • Variable Precision Numbers: IEEE-754
  • Variable Length Data: header with length, followed by data bytes
  • Time & Date: 32/64-bit integer of (micro)seconds since Unix epoch

Variable Precision Numbers

Inexact, variable-precision numeric type stored as specified by IEEE-754. Typically faster than arbitrary precision numbers but can have rounding errors.

Fixed Precision Numbers

Numeric data types with arbitrary precision and scale. Used when rounding errors are unacceptable.

Typically stored in exact, variable-length binary representation with additional metadata.
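The difference between the two families is easy to see in a short experiment. Python's `float` is an IEEE-754 double, and `decimal.Decimal` stands in here for a DBMS's NUMERIC/DECIMAL type; that correspondence is an analogy for illustration, not a claim about how any DBMS implements it.

```python
from decimal import Decimal

# IEEE-754 binary floats cannot represent 0.1 or 0.2 exactly.
float_sum = 0.1 + 0.2
float_is_exact = float_sum == 0.3

# Decimal arithmetic keeps the base-10 digits exactly.
decimal_sum = Decimal("0.1") + Decimal("0.2")
decimal_is_exact = decimal_sum == Decimal("0.3")
```

Here `float_is_exact` is `False` and `decimal_is_exact` is `True`, which is the rounding behaviour the two sections above describe.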

Large Values

Most DBMSs don't allow a tuple to exceed the size of a single page.

To store values that are larger than a page, the DBMS uses separate overflow storage pages.
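One way to picture overflow storage is as a chain of pages, each holding a chunk of the value and a pointer to the next page. The sketch below assumes that chained design, a dict standing in for the page store, and 8 bytes of per-page header; the function names are hypothetical and real systems differ in the details.

```python
PAGE_SIZE = 4096
CHUNK = PAGE_SIZE - 8  # assume 8 bytes per page are reserved for a header


def store_large_value(pages: dict, value: bytes) -> int:
    """Split `value` across overflow pages; return the id of the first page."""
    chunks = [value[i:i + CHUNK] for i in range(0, len(value), CHUNK)] or [b""]
    first_id = len(pages)
    for n, chunk in enumerate(chunks):
        page_id = first_id + n
        next_id = page_id + 1 if n + 1 < len(chunks) else None
        pages[page_id] = (next_id, chunk)  # (pointer to next page, data)
    return first_id


def load_large_value(pages: dict, page_id) -> bytes:
    """Follow the chain of overflow pages and reassemble the value."""
    parts = []
    while page_id is not None:
        page_id, chunk = pages[page_id]
        parts.append(chunk)
    return b"".join(parts)
```

The tuple in the main page would then store only the id of the first overflow page instead of the value itself.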

Some systems allow storing really large values in an external file, but the DBMS cannot manipulate the contents of the external file.
This means no durability protections and no transaction protections.

To BLOB or Not to BLOB Paper

overflow-page

System Catalog

A DBMS stores metadata about databases in its internal catalogs. The catalogs contain metadata about:

  • Tables, columns, indexes, views
  • Users, permissions
  • Internal statistics

Almost every DBMS stores its catalog inside the database itself.

It's possible to query the database's internal INFORMATION_SCHEMA catalog to get info about the database. This catalog is an ANSI-standard set of read-only views that provide info about all of the tables, views, columns, and procedures in a database.

SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE table_catalog = '<db_name>';  -- list of tables
SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE table_name = '<table>';  -- schema of a table

Note: DBMSs also have non-standard shortcuts to retrieve this information.
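SQLite is one example of a system with its own catalog interface instead of INFORMATION_SCHEMA: it exposes the `sqlite_master` table and the `PRAGMA table_info` command. The sketch below uses Python's built-in `sqlite3` module; the table name `demo` is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo (a INTEGER PRIMARY KEY, b INTEGER NOT NULL, c REAL, d INTEGER)"
)

# List of tables: the catalog is itself queryable as a table.
tables = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]

# Schema of a table: each row is (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(demo)")]
```

After this runs, `tables` holds the single table name and `columns` holds one `(name, type)` pair per column in declaration order.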

Database Workloads & Data Storage Models

The DBMS can store tuples in different ways that are better for either OLTP or OLAP workloads.

N-Ary Storage Model (NSM)

The DBMS stores all attributes for a single tuple contiguously in a page (aka row store). It's ideal for OLTP workloads, where queries tend to operate on an individual entity, and for insert-heavy workloads.

Advantages:

  • fast inserts, updates and deletes.
  • good for queries that need the entire tuple.

Disadvantages:

  • Not good for scanning large portions of the table and/or subsets of the attributes.

nary-storage-model

Decomposition Storage Model (DSM)

The DBMS stores the values of a single attribute for all tuples contiguously in a page (aka column store). It's ideal for OLAP workloads where read-only queries perform large scans over a subset of the table attributes.

decomposition-storage-model

To identify all the parts of the same tuple in the different pages there are two choices:

  • Fixed-length Offsets: each value has the same length and is in the same position in every column
  • Embedded Tuple Ids: each value is stored with its tuple id in a column
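To make the contrast with a row store concrete, the sketch below lays out the same three tuples both ways and uses fixed-length offsets to stitch a tuple back together from its columns. The sample data, the three-attribute schema, and the function names are assumptions for illustration.

```python
import struct

rows = [(1, 10, 1.5), (2, 20, 2.5), (3, 30, 3.5)]  # sample tuples (a, b, c)

# Row store: all attributes of a tuple are contiguous.
ROW = struct.Struct("<iif")
row_store = b"".join(ROW.pack(*r) for r in rows)

# Column store: all values of one attribute are contiguous.
INT = struct.Struct("<i")
FLT = struct.Struct("<f")
column_store = {
    "a": b"".join(INT.pack(r[0]) for r in rows),
    "b": b"".join(INT.pack(r[1]) for r in rows),
    "c": b"".join(FLT.pack(r[2]) for r in rows),
}


def nsm_read(i: int):
    # One contiguous read returns the whole tuple.
    return ROW.unpack_from(row_store, i * ROW.size)


def dsm_read(i: int):
    # Fixed-length offsets: the i-th value of every column belongs to tuple i.
    return (
        INT.unpack_from(column_store["a"], i * INT.size)[0],
        INT.unpack_from(column_store["b"], i * INT.size)[0],
        FLT.unpack_from(column_store["c"], i * FLT.size)[0],
    )


def dsm_sum_b() -> int:
    # A scan over one attribute touches only that column's bytes.
    return sum(v for (v,) in INT.iter_unpack(column_store["b"]))
```

Reading one tuple takes a single lookup in the row store but one lookup per column in the column store, while the aggregate over `b` never touches `a` or `c`.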

dsm-tuple-identification

Advantages:

  • Reduces the amount of wasted disk I/O because DBMS reads only the data that it needs
  • Better query processing and data compression

Disadvantages:

  • Slow for point queries, inserts, updates and deletes because of tuple splitting/stitching.