add database systems notes

fix git branch image link
This commit is contained in:
Marcello 2023-10-04 17:12:04 +02:00
parent 48a415e7e4
commit f6eeb09dd4
Signed by: m-lamonaca
SSH key fingerprint: SHA256:8db8uii6Gweq7TbKixFBioW2T8CbgtyFETyYL3cr3zk
35 changed files with 18634 additions and 0 deletions

# Advanced SQL
SQL is composed of:
- Data Manipulation Language (DML)
- Data Definition Language (DDL)
- Data Control Language (DCL)
It also includes:
- View definition
- Integrity & Referential Constraints
- Transactions
> **Note**: SQL is based on **bags** (duplicates) and **sets** (no duplicates)
## Aggregations
Functions that return a single value from a bag of tuples:
- `AVG()`: return the average value
- `MIN()`: return the minimum value
- `MAX()`: return the maximum value
- `SUM()`: return the sum of values
- `COUNT()`: count the number of values
> **Note**: aggregate functions can (almost) only be used in the `SELECT` output list.
> **Note**: `COUNT`, `SUM`, `AVG` support `DISTINCT`.
### Group By
Project tuples into subsets and calculate aggregates against each subset.
> **Note**: Not-aggregated values in `SELECT` output clause must appear in the `GROUP BY` clause
```sql
SELECT <table>.<field_1>, AVG(<table>.<field_2>) FROM <table>
GROUP BY <table>.<field_1>;
```
### Having
Filters results based on aggregation computation. Like a `WHERE` clause for a `GROUP BY`.
```sql
SELECT COUNT(1) AS count FROM <table>
GROUP BY <table>.<field>
HAVING count > 0;
```
## String Operations
> **Note**: strings are case sensitive and defined with single-quotes (`'`).
`LIKE` is used for string matching: `%` matches any substring (including empty strings) while `_` matches any single character.
```sql
SELECT * FROM <table>
WHERE <table>.<field> LIKE '%@c_';
```
Other common string functions/operators:
- `UPPER()`: convert string to uppercase
- `LOWER()`: convert string to lowercase
- `||` is used to concatenate two strings.
```sql
SELECT * FROM <table>
WHERE <table>.<field> = LOWER(<table>.<field> || '-suffix');
```
## DateTime Operations
DateTime Functions:
- `NOW()`: get current timestamp
- `DATE('<timestamp>')`: convert string to date
- `UNIX_TIMESTAMP(<date>)`: convert to unix epoch
## Output Redirection
Store query results in another table, provided that the number and types of the columns are the same.
```sql
SELECT * INTO <new_table> FROM <table>; -- write result into new table (must not exist)
INSERT INTO <dest_table> (SELECT * FROM <src_table>); -- write result into existing table
```
## Output Control
`ORDER BY` sorts the results based on a specific attribute(s).
```sql
SELECT * FROM <table> ORDER BY <table>.<field> ASC; -- default direction
SELECT * FROM <table> ORDER BY <table>.<field> DESC;
SELECT * FROM <table> ORDER BY <table>.<field_1> ASC, <table>.<field_2> DESC;
```
`LIMIT` limits the number of tuples returned in output. Can set an offset to return a range.
```sql
SELECT * FROM <table> LIMIT 10;
SELECT * FROM <table> LIMIT 10 OFFSET 5;
```
## Nested Queries
Inner queries can appear (almost) anywhere in the query. They are often difficult to optimize.
> **Note**: inner queries can reference attributes and tables of the outer query but not vice-versa.
```sql
SELECT * FROM <table> WHERE <table>.<field> IN (SELECT <table.field> ...);
SELECT (SELECT <table>.<field> FROM <table> WHERE ...) FROM <table> WHERE <expr>;
```
Nested query operators:
- `ALL()`: must satisfy expression for all rows in sub-query.
- `ANY()`: must satisfy expression for at least one row in sub-query.
- `IN`: equivalent to `=ANY()`.
- `EXISTS`: at least one row is returned.
## Window Functions
Perform a calculation across a set of tuples related to a single row. Like an aggregation but tuples are not grouped into a single output tuple.
The `OVER` keyword specifies how to group together tuples when computing the window function.
```sql
SELECT ..., FUNC_NAME(<expr>)
OVER (
PARTITION BY <expr>, ...
ORDER BY <expr>, ...
)
FROM <table>;
```
Special window functions:
- `ROW_NUMBER()`: number of the current row
- `RANK()`: order position of the current row
## Common Table Expressions (CTE)
Provides a way to write auxiliary statements for a larger query. Alternative to nested queries and views.
```sql
WITH <cte_name> (<column>, ...) AS (SELECT ...) -- temporary table from query result
SELECT <column>, ... FROM <cte_name>;
```
> **Note**: CTEs can be recursive with the `RECURSIVE` keyword.

# Indexes & B+Trees
## B+Tree
[B+Tree Visualization](https://www.cs.usfca.edu/~galles/visualization/BPlusTree.html "B+Tree Visualization")
A **B+Tree** is a self-balancing tree data structure that keeps data sorted and allows searches, sequential access, insertions and deletions always in `O(log n)`.
It's a generalization of a binary search tree, since a node can have more than two children. It's optimized for systems that need to read and write large blocks of data.
A B+Tree is an `M`-way search tree with the following properties:
- It's perfectly balanced (every leaf node is at the same depth).
- Every node other than the root is at least half-full (`(M/2)-1 <= #keys <= M-1`).
- Every inner node with `k` keys has `k+1` non-null children.
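Assuming in-memory nodes with sorted key arrays, the `M`-way search can be sketched as follows (an illustrative sketch, not a full implementation; the `Node` class and `search` function are assumptions for this example):

```python
from bisect import bisect_right

# Minimal sketch of B+Tree lookup. Inner nodes hold sorted keys and
# len(keys) + 1 children; leaves hold (key, value) data in sorted key order.

class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted list of keys
        self.children = children    # child nodes (inner nodes only)
        self.values = values        # values (leaf nodes only)

    @property
    def is_leaf(self):
        return self.children is None

def search(node, key):
    """Descend from the root to the leaf that may contain `key`."""
    while not node.is_leaf:
        # bisect_right picks the child whose key range covers `key`
        node = node.children[bisect_right(node.keys, key)]
    if key in node.keys:
        return node.values[node.keys.index(key)]
    return None
```

Every lookup walks one root-to-leaf path, which is what gives the `O(log n)` bound.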
![btree-structure](../../img/b+tree.webp "B+Tree Structure")
### Nodes
Every B+Tree node is comprised of an array of key-value pairs. The keys are derived from the attribute(s) that the index is based on. The values will differ based on whether the node is classified as an _inner node_ or a _leaf node_.
The arrays are (usually) kept in sorted key order.
Leaf node values:
- **Record ID**: pointer to the location of the tuple to which the index entry corresponds.
- **Tuple Data**: the leaf node stores the actual contents of the tuple. Secondary indexes must store record ids as their values.
> **Note**: if the workload is mostly primary key lookup then prefer index with tuple data. If the workload has several secondary key lookups prefer record ids. Tuple data indexes need a double search for secondary keys since a secondary index will return a primary key to be searched again in the primary index.
### B-Tree vs B+Tree
The **B-Tree** stores keys and values in all nodes of the tree. It's more space-efficient since each key only appears once in the tree.
A **B+Tree** only stores values in leaf nodes, inner nodes only guide the search process. Since the values are in a linked list at the bottom of the tree a sequential scan does not need to jump around the tree.
### Insertion
Find leaf node in which the entry belongs.
- **Case 1**: If the leaf is not full, insert the key into the leaf node in increasing order. Done.
- **Case 2**: If the leaf is full, insert the key into the leaf node in increasing order and balance the tree:
1. Break the node at the `M/2`th position. The new leaf contains entries `M/2`..`M-1`.
2. Add the `M/2`th key to the parent node as well.
3. If the parent node is already full, repeat steps 1 and 2 on it.
> **Note** B+Trees grow at the root and not at the leaves.
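The split in Case 2 can be sketched as follows (a minimal sketch assuming 0-indexed positions and a key array that has just overflowed to `M` sorted entries):

```python
# Sketch of the leaf-split step (Case 2): the leaf has overflowed to M keys.
def split_leaf(keys, M):
    """Split an overflowing leaf; return (left, right, separator for parent)."""
    mid = M // 2
    left, right = keys[:mid], keys[mid:]   # new leaf holds entries M/2..M-1
    return left, right, right[0]           # first key of right leaf goes to parent
```

For example, with `M = 4`, `split_leaf([1, 2, 3, 4], 4)` splits into `[1, 2]` and `[3, 4]` and pushes `3` up to the parent.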
![b+tree-insert](../../img/b+tree-insertion.webp "B+Tree Insertion")
### Deletion
Find leaf node in which the entry belongs and remove the entry.
- **Case 1**: The leaf has more than `M/2` entries. Done.
- **Case 2**: The leaf has only `(M/2)-1` entries. Two options:
- re-distribute borrowing from sibling, update parent if needed.
- merge with sibling, must delete entry from parent.
> **Note**: merging can bubble up and reduce the depth of the tree.
### Key Lookups
A DBMS can use a B+Tree index if the query provides any of the attributes of the search key. For a hash index all attributes of the key are needed.
> **Note**: not all DBMSs support partial key lookup if the leading attribute is missing
![key-lookup](../../img/b+tree-key-lookup.excalidraw.svg)
### Duplicate Keys
- **Append Record Id**: Add the tuple's unique record id as part of the key to ensure all keys are unique. The DBMS can still use partial keys to find tuples.
- **Overflow Leaf Nodes**: Allow leaf nodes to split into overflow nodes that contain the duplicate keys (more complex to maintain and modify).

# Buffer Pools & Memory
## Buffer Pool Manager
The **buffer pool** is a memory region organized as an array of fixed-size pages. An array entry is called a _frame_.
When a DBMS requests a page, an exact copy is placed into one of these frames.
The **page table** keeps track of pages that are currently in memory. It also maintains additional metadata per page:
- Dirty Flag
- Pin/Reference Counter
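A minimal sketch of this bookkeeping (names like `Frame`, `pin`, and `unpin` are illustrative assumptions, not a specific DBMS's API):

```python
from dataclasses import dataclass

# Illustrative sketch of the per-frame metadata a page table tracks.

@dataclass
class Frame:
    page_id: int
    dirty: bool = False       # page modified since it was loaded
    pin_count: int = 0        # active users; pinned frames cannot be evicted

class PageTable:
    def __init__(self):
        self.frames = {}      # page_id -> Frame

    def pin(self, page_id):
        frame = self.frames.setdefault(page_id, Frame(page_id))
        frame.pin_count += 1
        return frame

    def unpin(self, page_id, dirty=False):
        frame = self.frames[page_id]
        frame.pin_count -= 1
        frame.dirty = frame.dirty or dirty
```

A real buffer pool would also block on concurrent loads and consult the replacement policy when all frames are occupied.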
![buffer-pool]
### Page Table vs Page Directory
The _page directory_ is the mapping from page ids to page locations in the database file. All changes must be recorded on disk to allow the DBMS to find the data on restart.
The _page table_ is the mapping from page ids to a copy of the page in the buffer pool frames. This is an in-memory data structure that does not need to be stored on disk.
### Allocation Polices
_Global Policies_ make decisions for all active transactions.
_Local Policies_ allocate frames for specific transactions without considering the behaviour of concurrent transactions. They still need to support sharing pages.
## Buffer Pool Optimizations
### Multiple Buffer Pools
The DBMS does not always have a single buffer pool for the entire system:
- multiple buffer pool instances
- per-database buffer pool
- per-page type buffer pool
Multiple buffer pools help to reduce latch contention and improve locality.
There are multiple approaches to manage multiple buffer pools:
- Object Id: embed an object id in record ids and then maintain a mapping from objects to specific buffer pools.
- Hashing: hash the page id to select which buffer pool to access.
![multiple-buffer-pools]
### Pre-Fetching
The DBMS can also prefetch pages based on a query plan in order to reduce the wait between the request of a page and its load into memory.
### Scan Sharing
Queries can reuse data retrieved from storage or operator computations. This is done by allowing multiple queries to attach to a single cursor that scans a table. It's also possible to share intermediate results.
If a query starts a scan and another one is already performing the same operation, the DBMS will attach the new query's cursor to the existing cursor.
The DBMS keeps track of where the second query joined the first so that it can retrieve the pages that were already scanned by the first query.
> **Note**: _Scan Sharing_ is different from _Output Caching_.
### Buffer Pool Bypass
The sequential scan will not store fetched pages in the buffer pool to avoid overhead. The used memory is local to the query thread and will be discarded as the scan finishes. This avoids the overhead of going to the page table and of evicting pages if the buffer is full.
### OS Page Cache
Most disk operations go through the OS API. Unless the DBMS tells it to, the OS maintains its own filesystem cache.
Most DBMSs use direct I/O (`O_DIRECT`) to bypass the OS's cache to avoid redundant copies of pages and the loss of control over I/O, since the DBMS has different eviction policies.
## Page Replacement Policies
When the DBMS needs to free up a frame to make space for a new page, it must decide which page to _evict_ from the buffer pool.
Goals:
- **Correctness**: do not evict _pinned_ data.
- **Accuracy**: evict rarely used pages.
- **Speed**
- **Metadata Overhead**
### LRU (Least Recently Used)
Maintain a timestamp of when each page was last accessed. When the DBMS needs to evict a page, select the one with the oldest timestamp.
> **Note**: Keep the pages in sorted order to reduce search time on eviction
### CLOCK
Approximation of LRU without needing a separate timestamp per page.
Each page has a _reference bit_. When the page is accessed the bit is set to 1.
The pages are organised in a circular buffer with a "clock hand". Upon sweeping, check if a page's bit is set to 1: clear the bit if it is 1, evict the page if it is 0.
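One sweep of the clock hand can be sketched as follows (illustrative; `ref_bits` stands in for the per-frame reference bits):

```python
# Sketch of one CLOCK eviction sweep: frames sit in a circular buffer,
# each with a reference bit that accesses set to 1.
def clock_evict(ref_bits, hand):
    """Return (victim_index, new_hand); ref_bits is mutated in place."""
    n = len(ref_bits)
    while True:
        if ref_bits[hand] == 1:
            ref_bits[hand] = 0          # second chance: clear bit and move on
            hand = (hand + 1) % n
        else:
            return hand, (hand + 1) % n  # bit is 0: evict this frame
```

A frame that was touched since the last sweep survives one full rotation before becoming a candidate again.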
> **Note**: LRU and CLOCK replacement policies are susceptible to _sequential flooding_. A sequential scan that reads every page pollutes the buffer pool with pages that are read only once. In this case the most recently used page is actually the most unneeded one.
### LRU-K
Track the history of the last _K_ references to each page as timestamps and compute the interval between subsequent accesses.
The DBMS then uses this history to estimate the next time that page is going to be accessed.
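A sketch of victim selection under LRU-K using the backward K-distance idea (pages with fewer than K recorded accesses are evicted first; names are assumptions):

```python
# Sketch of LRU-K victim selection: evict the page whose K-th most recent
# access is furthest in the past (infinitely far if it has < K accesses).
def lru_k_victim(history, k):
    """history: page_id -> list of access timestamps (ascending)."""
    def kth_recent(timestamps):
        return timestamps[-k] if len(timestamps) >= k else float('-inf')
    return min(history, key=lambda page: kth_recent(history[page]))
```

With `k = 1` this degenerates to plain LRU; larger `k` resists sequential flooding because a single scan touch does not make a page look hot.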
### Localization
The DBMS chooses which page to evict on a transaction/query basis. This minimizes the pollution of the pool from each query.
### Priority Hints
The DBMS knows the context of each page during query execution. It can provide hints to the buffer pool on whether a page is important or not.
### Dirty Pages
**Fast**: if a page in the buffer pool is _not_ dirty, then the DBMS can simply drop it.
**Slow**: if a page is dirty, then the DBMS must write back to disk to ensure that its changes are persisted.
It's a tradeoff between fast evictions versus writing dirty pages that will not be read again in the future.
> **Note**: a page is _dirty_ if a query has modified it while it's in the buffer pool (not yet written to disk).
### Background Writing
The DBMS can periodically walk through the page table and write dirty pages to disk. When a dirty page is safely written, the DBMS can either evict it or just unset the dirty flag.
> **Note**: need to be careful to avoid writing dirty pages before their log records have been written.
<!-- images -->
[buffer-pool]: ../../img/buffer-pool.excalidraw.svg "Buffer Pool"
[multiple-buffer-pools]: ../../img/multiple-buffer-pools.excalidraw.svg "Multiple Buffer Pools"

# Disk Oriented Database Storage
The DBMS assumes that the primary storage location of the database is non-volatile disk.
The DBMS's components manage the movement of data between non-volatile and volatile storage.
![storage-hierarchy]
Access Times:
- L1 Cache Ref: 0.5 ns
- L2 Cache Ref: 7 ns
- RAM: 100 ns
- SSD: 150 µs
- HDD: 10 ms
- Network Storage: ~30 ms
- Tape Archives: 1 s
The goal is to allow the DBMS to manage databases that exceed the amount of memory available.
Reading/writing to disk is expensive, so it must be managed carefully to avoid large stalls and performance degradation.
![disk-operations]
## OS Memory Management
It's possible to use memory mapping (`mmap`) to store the contents of a file into a process' address space. The OS is responsible for moving the file pages in and out of memory.
The OS does not know the intended usage for the memory and so fine-grained control over the pages is not possible:
- When the physical memory is full the (un)loading of memory pages can slow or stall the thread.
- Using multiple threads is viable for concurrent reads but complicated for concurrent writes.
- The order of unloaded pages is not controllable.
There are some solutions to this problem:
- `madvise`: tell the OS how certain pages are expected to be read.
- `mlock`: tell the OS that memory ranges cannot be paged out (can still write on it).
- `msync`: tell the OS to flush memory ranges out to disk.
Using `mmap` leads to performance and correctness problems.
DBMS (almost) always wants to control things itself and can do a better job at it:
- flushing dirty pages to disk in the correct order
- specialized pre-fetching
- buffer replacement policy
- thread/process scheduling
## File Storage
The DBMS stores a database as one or more files on disk. The OS does not know anything about the contents of those files.
The **storage manager** is responsible for maintaining a database's files. It organizes files as a collection of _pages_. It tracks data read/written to pages and the available space.
A **page** is a fixed-size block of data. It can contain tuples, metadata, indexes, log records, etc.
Most systems do not mix page types and some require pages to be self-contained for disaster recovery purposes.
Each page is given a _unique identifier_. The DBMS uses an indirection layer to map page ids to physical locations.
There are three different notions of "pages" in a DBMS:
- Hardware Page (usually 4KB)
- OS Page (usually 4KB)
- Database Page (512B - 16KB)
Hardware pages are the size at which the device can guarantee a "failsafe write".
Different DBMSs manage pages in files on disk in different ways:
- Heap File Organization
- Sequential / Sorted File Organization
- Hashing File Organization
At this point in the hierarchy the content of a page does not matter.
### Database Heap File
A **Heap file** is an unordered collection of pages where tuples are stored in random order. Metadata is needed to keep track of which pages exist and which have free space.
There are two ways to represent a heap file:
- Linked List
- Page Directory
### Linked List
Maintain a _header page_ at the beginning of the file that stores two pointers:
- HEAD of the _free page list_
- HEAD of the _data page list_
Each page keeps track of the number of free slots in itself.
![linked-list-heap-file]
### Page Directory
The DBMS maintains special pages that track the location of data pages in the database files.
The directory also records the number of free slots per page.
The DBMS has to make sure that the directory pages are in sync with the data pages.
![page-directory]
## Page Layout
Every page contains a _header_ of metadata about the page contents:
- Page Size
- Checksum
- DBMS Version
- Transaction Visibility
- Compression Information
Data inside a page can be organized with two approaches:
- Tuple oriented
- Log oriented
### Log Oriented Page Layout
Instead of storing tuples in pages, the DBMS only stores **log records** in the files.
The system appends log records of _how_ the database was modified:
- Inserts store the entire tuple.
- Deletes mark the tuple as deleted.
- Updates contain the delta of just the attributes that were modified.
To read a record, the DBMS scans the log and "recreates" the tuple to find what it needs.
![log-record]
### Tuple Oriented Page Layout
The most common layout scheme is called _slotted pages_. The slot array maps "slots" to the tuples' starting position offsets.
The header keeps track of:
- the number of used slots.
- the offset of the starting location of the last used slot.
The tuples start from the end of the page and grow "upwards" while the slot array starts from the beginning of the page and grows "downwards".
When the two meet the page is full.
> **Note**: the position and growing behavior is similar to a program's stack and heap memory
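The slotted-page bookkeeping can be sketched as follows (sizes and field names are illustrative assumptions, not a specific DBMS's layout):

```python
# Sketch of a slotted page: the slot array grows down from the header,
# tuple data grows up from the end of the page.
class SlottedPage:
    def __init__(self, size=4096):
        self.data = bytearray(size)
        self.slots = []               # slot index -> tuple start offset
        self.free_end = size          # tuples are written before this offset

    def insert(self, tuple_bytes):
        # Page is full when the slot array would meet the tuple area.
        if self.free_end - len(tuple_bytes) < self._slot_area_end() + 4:
            raise ValueError("page full")
        self.free_end -= len(tuple_bytes)
        self.data[self.free_end:self.free_end + len(tuple_bytes)] = tuple_bytes
        self.slots.append(self.free_end)
        return len(self.slots) - 1    # slot number acts as the record id

    def get(self, slot, length):
        start = self.slots[slot]
        return bytes(self.data[start:start + length])

    def _slot_area_end(self):
        return 8 + 4 * len(self.slots)   # assumed 8-byte header, 4-byte slots
```

The indirection through the slot array lets the page compact or move tuples internally without changing their record ids.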
![page-layout]
### Record IDs
The DBMS needs a way to keep track of individual tuples. Each tuple is assigned a unique **record identifier**.
The most common is `page_id + offset/slot`. Record ids can also contain file location info.
> **Note**: an application _cannot_ rely on record ids to derive meaning.
## Tuple Layout
A tuple is essentially a sequence of bytes. It's the job of the DBMS to interpret those bytes into types and values.
The DBMS's _catalogs_ contain the schema information about tables and can be used to understand the tuple layout.
Each tuple is prefixed with a _header_ that contains metadata about it:
- visibility info (concurrency control)
- bitmap for `NULL` values
Attributes are typically stored in the order in which they are defined while creating the table.
> **Note**: metadata about the schema is _not_ stored.
![tuple-layout]
```sql
CREATE TABLE <table> (
a INT PRIMARY KEY,
b INT NOT NULL,
c FLOAT,
d INT
)
```
## Data Representation
- Fixed Precision Numbers: [IEEE-754]
- Variable Length Data: header with length, followed by data bytes
- Time & Date: 32/64-bit integer of microseconds since Unix epoch
### Variable Precision Numbers
Inexact, variable-precision numeric type stored as specified by [IEEE-754]. Typically faster than arbitrary precision numbers but can have rounding errors.
### Fixed Precision Numbers
Numeric data types with arbitrary precision and scale. Used when rounding errors are unacceptable.
Typically stored in exact, variable-length binary representation with additional metadata.
### Large Values
Most DBMSs don't allow a tuple to exceed the size of a single page.
To store values that are larger than a page, the DBMS uses separate **overflow** storage pages.
Some systems allow to store really large values in an external file but the DBMS _cannot manipulate_ the contents of the external file.
This means no durability protections and no transaction protections.
[To BLOB or Not to BLOB Paper](to-blob-or-not-to-blob-larg-object-storage-in-a-database-or-a-filesystem.pdf "To BLOB or not to BLOB PDF")
![overflow-page]
### System Catalog
A DBMS stores metadata about databases in its internal catalogs. The catalogs contain metadata about:
- Tables, columns, indexes, views
- Users, permissions
- Internal statistics
Almost every DBMS stores its database catalog inside itself.
It's possible to query the database's internal `INFORMATION_SCHEMA` catalog to get info about the database. This catalog is an ANSI-standard set of read-only views that provide info about all of the tables, views, columns, and procedures in a database.
```sql
SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE table_catalog = '<db_name>'; -- list of tables
SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE table_name = '<table>'; -- schema of a table
```
> **Note**: DBMS also have non-standard shortcuts to retrieve this information.
## Database Workloads & Data Storage Models
- [**On-Line Transaction Processing** (`OLTP`)][OLTP]: Simple queries that read/update a small amount of data that is related to a single entity in the database.
- [**On-Line Analytical Processing** (`OLAP`)][OLAP]: Complex queries that read large portions of the database spanning multiple entities.
The DBMS can store tuples in different ways that are better for either OLTP or OLAP workloads.
### N-Ary Storage Model (`NSM`)
The DBMS stores all attributes for a single tuple contiguously in a page (aka _row store_). It's ideal for OLTP workloads where queries tend to operate only on an individual entity and insert-heavy workloads.
Advantages:
- fast inserts, updates and deletes.
- good for queries that need the entire tuple.
Disadvantages:
- Not good for scanning large portions of the table and/or subsets of the attributes.
![nary-storage-model]
### Decomposition Storage Model (`DSM`)
The DBMS stores the values of a single attribute for all tuples contiguously in a page (aka _column store_). It's ideal for OLAP workloads where read-only queries perform large scans over a subset of the table attributes.
![decomposition-storage-model]
To identify all the parts of the same tuple in the different pages there are two choices:
- Fixed-length Offsets: each value has the same length and is in the same position
- Embedded Tuple Ids: each value is stored with its tuple id in a column
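With fixed-length offsets, "stitching" tuple `i` back together is just reading position `i` in every column (a minimal sketch; the dict-of-lists representation is an assumption for illustration):

```python
# Sketch of tuple stitching in a column store with fixed-length offsets:
# the i-th value of every column belongs to tuple i.
def stitch(columns, i):
    """columns: dict attr_name -> list of values; rebuild tuple i."""
    return {attr: values[i] for attr, values in columns.items()}
```

This per-tuple reassembly is exactly the cost that makes point queries and updates slow in DSM.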
![dsm-tuple-identification]
Advantages:
- Reduces the amount of wasted disk I/O because DBMS reads only the data that it needs
- Better query processing and data compression
Disadvantages:
- Slow for point queries, inserts, updates and deletes because of tuple splitting/stitching.
<!-- images -->
[storage-hierarchy]: ../../img/storage-hierarchy.excalidraw.svg "Storage Hierarchy"
[disk-operations]: ../../img/disk-operations.excalidraw.svg "Disk Operations"
[linked-list-heap-file]: ../../img/linked-list.excalidraw.svg "Heap File as a Linked List"
[page-directory]: ../../img/page-directory.excalidraw.svg "Page Directory"
[page-layout]: ../../img/page-layout.excalidraw.svg "Page Layout"
[log-record]: ../../img/log-record.excalidraw.svg "Log Record"
[tuple-layout]: ../../img/tuple-layout.excalidraw.svg "Tuple Layout"
[overflow-page]: ../../img/overflow-page.excalidraw.svg "Overflow Page"
[nary-storage-model]: ../../img/nary-storage-model.excalidraw.svg "N-Ary Storage Model"
[decomposition-storage-model]: ../../img/decomposition-storage-model.excalidraw.svg "Decomposition Storage Model"
[dsm-tuple-identification]: ../../img/dsm-tuple-identification.excalidraw.svg "DSM Tuple Identification"
<!-- references -->
[IEEE-754]: https://wikipedia.org/wiki/IEEE_754 "IEEE-754 Standard"
[OLTP]: https://en.wikipedia.org/wiki/Online_transaction_processing "On-Line Transaction Processing - Wikipedia"
[OLAP]: https://en.wikipedia.org/wiki/Online_analytical_processing "On-Line Analytical Processing - Wikipedia"

# Databases
Organized collection of inter-related data that models some aspect of the real-world. Databases are the core component of most computer applications.
A **DBMS** is the software that allows applications to store and analyze information in a database.
A general-purpose DBMS is designed to allow the definition, creation, querying, update and administration of databases.
A **data model** is a collection of concepts for describing the data in a database. A **schema** is a description of a particular collection of data, using a given data model.
## Relational Model
It was proposed by Ted Codd in 1970. It's an abstraction to avoid high maintenance of the DBMS software:
- store the database in simple data structures.
- access data through a high-level language.
- leave physical storage to the implementation.
Concepts:
- **Structure**: the definition of relations and their contents.
- **Integrity**: ensures that the database's contents satisfy constraints.
- **Manipulation**: how to access and modify a database's content.
### Relations
A **relation** (aka **table**) is an _unordered_ set that contains the relationship of the _attributes_ (aka _fields_) that represent entities.
The **domain** of a relation is the set of possible values that the relation can contain.
A **tuple** (aka **record**) is a set of attribute values in a relation. Values are (normally) atomic/scalar. The special value `NULL` is a member of every domain.
### Primary Keys (PK)
A relation's **primary key** uniquely identifies a single tuple. Some DBMSs automatically create an internal primary key if one is not provided.
### Foreign Keys (FK)
A **foreign key** specifies that an attribute from one relation has to map to a tuple in another relation.
## Data Manipulation Language (DML)
The **data manipulation language** describes how to _store_ and _retrieve_ information from a database.
Kinds of DMLs:
- **Procedural**: the query specifies the (high-level) strategy that the DBMS should use to find the desired result. (Relational Algebra)
- **Non-Procedural**: the query specifies only what data is wanted and not how to find it. (Relational Calculus)
## Relational Algebra
Set of fundamental operations to retrieve and manipulate tuples in a relation. Each operator takes one or more relations as input and outputs a new relation. This allows operations to be chained together to create more complex ones.
Relational algebra describes the steps needed to obtain a particular result. The order of the steps does influence the performance of the complete operation.
Fundamental Operators:
- Selection (`σ`)
- Projection (`π`)
- Union (`U`)
- Intersection (`∩`)
- Difference (`-`)
- Product (`x`)
- Natural Join (`|X|`)
Extra Operators:
- Rename (`p`)
- Assignment (`R ← S`)
- Duplicate Elimination
- Aggregation (`Y`)
- Sorting (`τ`)
- Division (`÷`)
> **Note**: relational algebra operates on sets. A set is an unordered collection of unique values.
### Select (`σ`)
Choose a subset of tuples from a relation that satisfies a selection predicate. Predicates act as filters to retain only tuples that fulfill their qualifying requirements. It's possible to combine multiple predicates using conjunctions/disjunctions.
Syntax: `σ(R)`
```sql
SELECT * FROM R WHERE <predicate>;
```
### Projection (`π`)
Generate a relation with tuples that contain only the specified attributes. Allows rearranging the attribute ordering and manipulating the values.
Syntax: `π(R)`
```sql
SELECT <attribute>, ... FROM R;
```
### Union (`U`)
Generate a relation that contains all tuples that appear in either one or both of the input relations.
Syntax: `(R U S)`
```sql
(SELECT * FROM R) UNION ALL (SELECT * FROM S);
```
### Intersection (`∩`)
Generate a relation that contains only the tuples that appear in both of the input relations.
Syntax: `(R ∩ S)`
```sql
(SELECT * FROM R) INTERSECT (SELECT * FROM S);
```
### Difference (`-`)
Generate a relation that contains only the tuples that appear in the first and not the second of the input relations.
Syntax: `(R - S)`
```sql
(SELECT * FROM R) EXCEPT (SELECT * FROM S);
```
### Product (`x`)
Generate a relation that contains all possible combinations of tuples from the input relations.
Syntax: `(R x S)`
```sql
SELECT * FROM R CROSS JOIN S;
SELECT * FROM R, S;
```
### Natural Join (`|X|`)
Generate a relation that contains all tuples that are a combination of two tuples (one from each relation) with common value(s) for one or more attributes.
> **Note**: the matching values must be on the same fields.
Syntax: `(R |X| S)`
```sql
SELECT * FROM R NATURAL JOIN S;
```

# Hash Tables & Trees
Used in the DBMS for:
- Internal Metadata
- Core Data Storage
- Temporary Data Structures
- Table Indexes
Design Decisions:
- **Data Organization**: How to layout the data structure in memory/pages and what information to store to support efficient access.
- **Concurrency**: How to enable multiple threads to access the data structure at the same time without causing problems.
## Hash Functions & Hash Tables
A **hash table** implements an _unordered_ associative array (aka dictionary) that maps keys to values.
It uses a **hash function** to compute an offset into the array for a given key, from which the desired value can be found.
Space Complexity: `O(n)`
Time Complexity:
- Average: `O(1)`
- Worst: `O(n)`
Design Decisions:
1. **Hash Function**: How to map a large key space into a smaller domain. Trade-off between speed vs collision rate.
2. **Hashing Scheme**: How to handle key collisions after hashing. Trade-off between table size vs extra logic for key placement.
**Hash Functions**:
- `CRC-64` (1975): used in networking for error detection.
- `MurmurHash` (2008): designed as fast, general-purpose hash function.
- `XXHash` (2012): from the creator of zstd compression.
## Static Hashing
Allocate a giant array that has one slot for _every_ element that needs to be stored.
To find an entry, mod the hash of the key by the number of elements to find the offset in the array.
### Linear Probe Hashing (aka Open-Address Hashing)
Allocate a giant array of slots. The slot for an item is determined by `hash(key) mod N` where N is the number of slots.
If a collision happens (`hash(key_1) == hash(key_2)`), linearly scan for the next free slot and place the element there. To determine whether an element is present, hash to its location and scan onwards, since the item may have been shifted due to collisions.
Insertions and deletions are generalizations of lookups. Deletions leave "tombstones" to keep the state consistent. A tombstone marks the slot as occupied when doing a search scan but can be overwritten by an insertion.
> **Note**: key is stored with value to know when to stop searching.
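The probe/insert/delete logic above can be sketched as follows (an illustrative sketch; `TOMBSTONE` is an assumed sentinel value, not a library feature):

```python
# Sketch of linear-probe hashing with tombstones. The table stores
# (key, value) pairs; TOMBSTONE keeps probe chains intact after deletes.
TOMBSTONE = object()

def lp_insert(table, key, value):
    n = len(table)
    i = hash(key) % n
    for _ in range(n):
        if table[i] is None or table[i] is TOMBSTONE or table[i][0] == key:
            table[i] = (key, value)
            return i
        i = (i + 1) % n                 # collision: probe the next slot
    raise ValueError("table full")

def lp_find(table, key):
    n = len(table)
    i = hash(key) % n
    for _ in range(n):
        if table[i] is None:            # empty slot: key cannot be present
            return None
        if table[i] is not TOMBSTONE and table[i][0] == key:
            return table[i][1]
        i = (i + 1) % n                 # keep scanning past tombstones
    return None

def lp_delete(table, key):
    n = len(table)
    i = hash(key) % n
    for _ in range(n):
        if table[i] is None:
            return False
        if table[i] is not TOMBSTONE and table[i][0] == key:
            table[i] = TOMBSTONE        # mark, do not empty, the slot
            return True
        i = (i + 1) % n
    return False
```

Note how a search cannot stop at a tombstone: a later key in the same probe chain may still be present.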
![linear-probing]
**Non-Unique Keys Strategies**:
- Separate Linked List: store values with the same key in a linked list
- Redundant Keys: store duplicate keys entries together in the hash table (easier to implement)
### Robin Hood Hashing
Variant of linear probing that steals slots from "rich" keys and gives them to "poor" keys.
Each key tracks the number of positions it is from its optimal position in the table. On insert, a key takes the slot of another key if the first is further away from its optimal position than the second.
The objective is to average the distance to the optimal position for each key.
![robin-hood]
### Cuckoo Hashing
Use multiple hash tables with different hash function seeds.
On insert, check every table and pick the one that has a free slot. If no table has a free slot, evict the element from one of them and re-hash it to find a new location.
Lookup and deletions are always `O(1)` because only one location per hash table is checked.
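A sketch with two tables. The two hash functions below (raw hash vs. shifted hash) are arbitrary stand-ins for differently seeded hashes, and cycle handling is reduced to a kick limit:

```python
class CuckooHash:
    """Two tables; each key lives in exactly one of its two candidate
    slots, so a lookup checks at most two locations."""

    def __init__(self, capacity=8):
        self._tables = [[None] * capacity, [None] * capacity]

    def _slot(self, t, key):
        # stand-in for two independently seeded hash functions
        h = hash(key)
        return (h if t == 0 else h >> 3) % len(self._tables[t])

    def lookup(self, key):
        for t in (0, 1):
            slot = self._tables[t][self._slot(t, key)]
            if slot is not None and slot[0] == key:
                return slot[1]
        return None

    def insert(self, key, value, max_kicks=32):
        entry, t = (key, value), 0
        for _ in range(max_kicks):
            idx = self._slot(t, entry[0])
            # place the entry, evicting whatever was in the slot
            entry, self._tables[t][idx] = self._tables[t][idx], entry
            if entry is None:
                return
            t = 1 - t  # re-hash the evicted entry into the other table
        # a real implementation would rebuild with new seeds here
        raise RuntimeError("eviction cycle detected")
```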
## Dynamic Hashing
### Chained Hashing
Maintain a linked list of _buckets_ for each slot in the hash table. Resolve collisions by placing all elements with the same hash into the same bucket. To determine if an element exists, hash to its bucket and scan for it.
Insertions and deletions are generalization of lookups.
![bucket-chain]
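A sketch of the idea (buckets are plain Python lists that grow without bound; a real implementation would chain fixed-size pages):

```python
class ChainedHash:
    """Each slot holds a bucket; collisions append to the bucket."""

    def __init__(self, capacity=8):
        self._buckets = [[] for _ in range(capacity)]

    def _bucket(self, key):
        return self._buckets[hash(key) % len(self._buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite existing key
                return
        bucket.append((key, value))

    def lookup(self, key):
        # hash to the bucket, then scan it linearly
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None

    def delete(self, key):
        bucket = self._bucket(key)
        bucket[:] = [(k, v) for k, v in bucket if k != key]
```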
### Extendible Hashing

Extension of chained hashing in which the buckets are split instead of growing forever. Multiple slot locations can point to the same bucket chain. The leading bits of the hash indicate which bucket holds the value. When a bucket is split, its values are reshuffled and the number of significant leading bits increases.
![extendible]
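A minimal sketch of the directory/split mechanics. For simplicity it uses the *low-order* bits of the hash rather than the leading bits shown in the diagram (either convention works), and buckets hold two entries:

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth  # local depth: how many hash bits this bucket owns
        self.items = {}


class ExtendibleHash:
    """Directory of 2**global_depth slots pointing into shared buckets;
    a full bucket splits and only its own entries are reshuffled."""

    BUCKET_SIZE = 2

    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _idx(self, key):
        return hash(key) & ((1 << self.global_depth) - 1)

    def insert(self, key, value):
        bucket = self.directory[self._idx(key)]
        if key in bucket.items or len(bucket.items) < self.BUCKET_SIZE:
            bucket.items[key] = value
            return
        self._split(bucket)
        self.insert(key, value)  # retry; may trigger another split

    def _split(self, bucket):
        if bucket.depth == self.global_depth:  # directory must double first
            self.directory += self.directory
            self.global_depth += 1
        bucket.depth += 1
        sibling = Bucket(bucket.depth)
        high_bit = 1 << (bucket.depth - 1)
        # repoint half of the directory slots that referenced the old bucket
        for i, b in enumerate(self.directory):
            if b is bucket and i & high_bit:
                self.directory[i] = sibling
        # reshuffle only the split bucket's entries
        items, bucket.items = bucket.items, {}
        for k, v in items.items():
            self.directory[self._idx(k)].items[k] = v

    def lookup(self, key):
        return self.directory[self._idx(key)].items.get(key)
```

The key property is visible in `_split`: only the overflowing bucket's entries move, and the directory doubles only when that bucket's local depth already equals the global depth.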
### Linear Hashing
The hash table maintains a _split pointer_ that tracks the next bucket to split. When any bucket overflows, split the bucket at the pointer location, not necessarily the one that overflowed.

Use multiple hash functions to find the right bucket for a given key.

Different overflow criteria can be used:
- Space utilization
- Average Length of Overflow Chains
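A sketch of the split-pointer mechanics, using a simple "bucket exceeds capacity 2" criterion. Overflow chains are elided by letting the bucket dict grow temporarily:

```python
class LinearHash:
    """Linear hashing: a split pointer walks the buckets in order; an
    overflow anywhere splits the bucket *at the pointer*."""

    BUCKET_SIZE = 2

    def __init__(self, n=2):
        self.n = n       # initial number of buckets
        self.level = 0   # current round: base hash is mod n * 2**level
        self.split = 0   # next bucket to split
        self.buckets = [dict() for _ in range(n)]

    def _idx(self, key):
        idx = hash(key) % (self.n * 2 ** self.level)
        if idx < self.split:  # bucket already split this round: next hash
            idx = hash(key) % (self.n * 2 ** (self.level + 1))
        return idx

    def insert(self, key, value):
        bucket = self.buckets[self._idx(key)]
        overflow = key not in bucket and len(bucket) >= self.BUCKET_SIZE
        bucket[key] = value
        if overflow:
            self._split_next()

    def _split_next(self):
        # split the bucket at the pointer using the next round's hash
        self.buckets.append({})
        old = self.buckets[self.split]
        mod = self.n * 2 ** (self.level + 1)
        items = dict(old)
        old.clear()
        for k, v in items.items():
            self.buckets[hash(k) % mod][k] = v
        self.split += 1
        if self.split == self.n * 2 ** self.level:  # round complete
            self.split, self.level = 0, self.level + 1

    def lookup(self, key):
        return self.buckets[self._idx(key)].get(key)
```

`_idx` shows the "multiple hashes" rule: buckets before the pointer have already been split this round, so their keys must be located with the next round's hash function.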
<!-- links -->
[linear-probing]: ../../img/static-hash-linear-probing.excalidraw.svg "Linear Probe Hashing"
[robin-hood]: ../../img/static-hash-robin-hood.excalidraw.svg "Robin Hood Hashing"
[bucket-chain]: ../../img/dynamic-hash-chain.excalidraw.svg "Chain Hashing"
[extendible]: ../../img/dynamic-hash-extendible.excalidraw.svg "Extendible Hashing"

# Table Indexes
A **table index** is a replica of a subset of a table's attributes that are organized and/or sorted for efficient access using those attributes. The DBMS ensures that the contents of the table and the index are synchronized.
There is a trade-off regarding the number of indexes to create per database:
- storage overhead
- maintenance overhead
## Clustered Indexes
A **Clustered Index** is a tree-organized table. Instead of storing the records in an unsorted heap table space, a clustered index is a B+Tree whose _leaf nodes_, stored in the sort order specified by the _primary key_, contain the actual table records.
The Clustered Index can speed up queries that filter records by the clustered index key, like the usual CRUD statements.
Since the records are located in the leaf nodes, there's no additional lookup for extra column values when locating records by their Primary Key values.
![clustered](../../img/clustered-index.png)
![non-clustered](../../img/non-clustered-index.png)
Retrieving tuples in the order they appear in a non-clustered index is inefficient due to redundant reads: since the index leaves point to pages in random order, a page can be retrieved from disk multiple times for a single query.

The DBMS must first figure out all the tuples it needs and then sort them by page id to maximize read efficiency.
## Index Concurrency Control
A DBMS needs to allow multiple workers (threads or processes) to safely access data structures to take advantage of additional CPU cores and hide disk I/O stalls.
A **concurrency control** protocol is the method that the DBMS uses to ensure "correct" results for concurrent operations on a shared object.
A protocol correctness criteria can vary:
- **Logical Correctness**: can a thread see the data that it is supposed to see?
- **Physical Correctness**: is the internal representation of the object sound?
### Locks vs Latches
Locks (Transactions):
- Protect the database's logical contents from other transactions.
- Held for transaction duration.
- Need to be able to roll back changes
Latches (Workers):
- Protect the critical sections of the DBMS's internal data structure from other workers.
- Held for operation duration.
- Do not need to be able to roll back changes.
| What | Locks | Latches |
|:-----------:|:------------------------------------:|:-------------------------:|
| Separate... | Transactions | Workers |
| Protect... | Database Contents | In-Memory Data Structures |
| During... | Entire Transactions | Critical Sections |
| Modes... | Shared, Exclusive, Update, Intention | Read, Write |
| Deadlock... | Detection & Resolution | Avoidance |
| ...by... | Waits-for, Timeouts, Aborts | Coding Discipline |
| Kept in... | Lock Manager | Protected Data Structure |
### Latch Modes
**Read Mode**:
- Multiple threads can read the same object at the same time
- A thread can acquire the read latch if another thread has it in read mode
**Write Mode**:
- Only one thread can access the object
- A thread cannot acquire a write latch if another thread has it in any mode
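The two modes can be sketched with a condition variable. This is a minimal reader-writer latch; writer preference, fairness, and upgrades are elided:

```python
import threading


class RWLatch:
    """Reader-writer latch: many concurrent readers OR one writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:  # readers only wait out a writer
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()  # last reader wakes waiting writers

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:  # writer needs exclusivity
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```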
### Latch Implementation
Goals:
- Small memory footprint
- Fast execution path with no contention
- De-schedule thread when it has been waiting for too long to avoid burning cycles
- Each latch should not have to implement its own queue to track waiting threads
Types:
- Test-and-Set Spinlock
- Blocking OS Mutex
- Reader-Writer Lock

View file

@@ -85,6 +85,14 @@ nav:
- Redis: databases/redis.md
- SQL: databases/sql.md
- MongoDB: databases/mongo-db.md
- Database Systems:
- Databases: databases/database-systems/databases.md
- Advanced SQL: databases/database-systems/advanced-sql.md
- Storage: databases/database-systems/database-storage.md
- Buffer Pools: databases/database-systems/buffer-pools.md
- Hash Tables: databases/database-systems/hash-tables.md
- B+ Trees: databases/database-systems/b+trees.md
- Table Indexes: databases/database-systems/table-indexes.md
- Languages:
- HTML: languages/html/html.md
- Markdown: languages/markdown.md