# Disk Oriented Database Storage

The DBMS assumes that the primary storage location of the database is non-volatile disk. The DBMS components manage the movement of data between non-volatile and volatile storage.

![storage-hierarchy]

Access Times:

- L1 Cache Ref: 0.5 ns
- L2 Cache Ref: 7 ns
- RAM: 100 ns
- SSD: 150 µs
- HDD: 10 ms
- Network Storage: ~30 ms
- Tape Archives: 1 s

The goal is to allow the DBMS to manage databases that exceed the amount of memory available. Reading from and writing to disk is expensive, so it must be managed carefully to avoid large stalls and performance degradation.

![disk-operations]

## OS Memory Management

It's possible to use memory mapping (`mmap`) to map the contents of a file into a process's address space. The OS is responsible for moving the file pages in and out of memory.

The OS does not know the intended usage of the memory, so fine-grained control over the pages is not possible:

- When physical memory is full, the (un)loading of memory pages can slow or stall the thread.
- Using multiple threads is viable for concurrent reads but complicated for concurrent writes.
- The order in which pages are unloaded is not controllable.

There are some solutions to this problem:

- `madvise`: tell the OS how certain pages are expected to be read.
- `mlock`: tell the OS that memory ranges cannot be paged out (they can still be written to).
- `msync`: tell the OS to flush memory ranges out to disk.

Using `mmap` leads to performance and correctness problems. The DBMS (almost) always wants to control things itself and can do a better job at it:

- flushing dirty pages to disk in the correct order
- specialized pre-fetching
- buffer replacement policy
- thread/process scheduling

## File Storage

The DBMS stores a database as one or more files on disk. The OS does not know anything about the contents of those files.

The **storage manager** is responsible for maintaining a database's files. It organizes files as a collection of _pages_.
It tracks data read/written to pages and the available space.

A **page** is a fixed-size block of data. It can contain tuples, metadata, indexes, log records, etc. Most systems do not mix page types, and some require pages to be self-contained for disaster recovery purposes.

Each page is given a _unique identifier_. The DBMS uses an indirection layer to map page ids to physical locations.

There are three different notions of "pages" in a DBMS:

- Hardware Page (usually 4 KB)
- OS Page (usually 4 KB)
- Database Page (512 B - 16 KB)

A hardware page is the largest unit for which the device can guarantee a "failsafe write".

Different DBMSs manage pages in files on disk in different ways:

- Heap File Organization
- Sequential / Sorted File Organization
- Hashing File Organization

At this point in the hierarchy, the content of a page does not matter.

### Database Heap File

A **heap file** is an unordered collection of pages where tuples are stored in random order. Metadata is needed to keep track of which pages exist and which have free space.

There are two ways to represent a heap file:

- Linked List
- Page Directory

### Linked List

Maintain a _header page_ at the beginning of the file that stores two pointers:

- HEAD of the _free page list_
- HEAD of the _data page list_

Each page keeps track of the number of free slots in itself.

![linked-list-heap-file]

### Page Directory

The DBMS maintains special pages that track the location of data pages in the database files. The directory also records the number of free slots per page.

The DBMS has to make sure that the directory pages are in sync with the data pages.

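
The page directory idea can be illustrated with a minimal in-memory sketch. Everything here is an assumption made for illustration: the names `PageDirectory`, `DirectoryEntry`, and `PAGE_SIZE`, the fixed 4 KB page size, and the rule that page `n` lives at byte offset `n * PAGE_SIZE`. A real DBMS stores the directory in dedicated pages on disk and must keep it in sync with the data pages.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed database page size in bytes (hypothetical)


@dataclass
class DirectoryEntry:
    offset: int      # byte offset of the page inside the database file
    free_slots: int  # number of free slots left in that page


class PageDirectory:
    """Hypothetical directory: maps page ids to locations and free slots."""

    def __init__(self, slots_per_page):
        self.slots_per_page = slots_per_page
        self.entries = {}      # page_id -> DirectoryEntry
        self.next_page_id = 0

    def allocate_page(self):
        # Assumption: pages are laid out back to back in a single file.
        page_id = self.next_page_id
        self.next_page_id += 1
        self.entries[page_id] = DirectoryEntry(
            offset=page_id * PAGE_SIZE, free_slots=self.slots_per_page
        )
        return page_id

    def locate(self, page_id):
        # The indirection layer: page id -> physical location.
        return self.entries[page_id].offset

    def find_page_with_space(self):
        # Reuse the first page that still has a free slot, else allocate.
        for page_id, entry in self.entries.items():
            if entry.free_slots > 0:
                return page_id
        return self.allocate_page()

    def record_insert(self, page_id):
        # Keep the directory in sync after a tuple is placed in the page.
        entry = self.entries[page_id]
        if entry.free_slots == 0:
            raise ValueError(f"page {page_id} is full")
        entry.free_slots -= 1
```

With `slots_per_page=2`, two inserts fill page 0, and the next call to `find_page_with_space()` allocates page 1 at offset `PAGE_SIZE`.
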
![page-directory]

## Page Layout

Every page contains a _header_ of metadata about the page contents:

- Page Size
- Checksum
- DBMS Version
- Transaction Visibility
- Compression Information

Data inside a page can be organized with two approaches:

- Tuple oriented
- Log oriented

### Log Oriented Page Layout

Instead of storing tuples in pages, the DBMS only stores **log records** in the files. The system appends log records of _how_ the database was modified:

- Inserts store the entire tuple.
- Deletes mark the tuple as deleted.
- Updates contain the delta of just the attributes that were modified.

To read a record, the DBMS scans the log and "recreates" the tuple to find what it needs.

![log-record]

### Tuple Oriented Page Layout

The most common layout scheme is called _slotted pages_. The slot array maps "slots" to the tuples' starting position offsets.

The header keeps track of:

- the number of used slots.
- the offset of the starting location of the last used slot.

The tuples start from the end of the page and grow "upwards", while the slot array starts from the beginning of the page and grows "downwards". When the two meet, the page is full.

> **Note**: the position and growing behavior are similar to a program's stack and heap memory.

![page-layout]

### Record IDs

The DBMS needs a way to keep track of individual tuples. Each tuple is assigned a unique **record identifier**. The most common form is `page_id + offset/slot`. Record ids can also contain file location info.

> **Note**: an application _cannot_ rely on record ids to derive meaning.

## Tuple Layout

A tuple is essentially a sequence of bytes. It's the job of the DBMS to interpret those bytes into types and values.

The DBMS's _catalogs_ contain the schema information about tables and can be used to understand the tuple layout.

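
As a rough illustration of interpreting raw bytes through catalog information, the sketch below uses Python's standard `struct` module. The `SCHEMA` list, the column names, and the helper functions are hypothetical stand-ins for a catalog entry, and the packed little-endian encoding is an assumption, not the format of any particular DBMS.

```python
import struct

# Hypothetical catalog entry: (column name, struct format code) in the
# order the columns were declared. "i" = 4-byte int, "d" = 8-byte float,
# "4s" = 4-byte fixed-length string.
SCHEMA = [("id", "i"), ("score", "d"), ("code", "4s")]


def tuple_format(schema):
    # "<" selects little-endian with no alignment padding (an assumption).
    return "<" + "".join(code for _, code in schema)


def encode_tuple(schema, values):
    # Serialize the values into the raw bytes that would be stored in a page.
    return struct.pack(tuple_format(schema), *values)


def decode_tuple(schema, raw):
    # The bytes carry no type information; the schema supplies the meaning.
    values = struct.unpack(tuple_format(schema), raw)
    return {name: value for (name, _), value in zip(schema, values)}


raw = encode_tuple(SCHEMA, (7, 2.5, b"abcd"))
row = decode_tuple(SCHEMA, raw)
```

The stored bytes contain only the values. Decoding the same bytes with a different schema would yield different (and likely meaningless) values, which is why the catalog is needed to read a tuple.
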
Each tuple is prefixed with a _header_ that contains metadata about it:

- visibility info (concurrency control)
- bitmap for `NULL` values

Attributes are typically stored in the order in which they are defined when the table is created.

> **Note**: metadata about the schema is _not_ stored.

![tuple-layout]

```sql
CREATE TABLE