Databases and Storage - System Design Course

Contents

1) SQL vs. NoSQL

1.1) SQL Databases

1.2) NoSQL Databases

2) Indexing, replication, and sharding

2.1) Indexing, Replication, and Sharding

2.2) Indexing

2.3) Replication

2.4) Sharding

3) Caching strategies (Redis, Memcached)

3.1) What Is Caching?

3.2) Popular Tools: Redis vs. Memcached

SQL vs. NoSQL

Overview:
SQL (relational) and NoSQL ("Not Only SQL", non-relational) databases are two broad categories of database technology. They differ in data model, typical use cases, and trade-offs.

SQL Databases

Definition:
SQL databases are relational databases that use structured schemas with rows and columns. Data is stored in tables, and relationships between data are maintained using keys.

Key Features:

Structured schema with fixed table definitions.
Uses SQL for defining and manipulating data.
ACID compliance ensures reliable transactions.
Ideal for complex queries and relational data.
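
The features above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the users and orders tables and the place_order helper are hypothetical names, not part of the course material. It shows a fixed schema, a key relationship between two tables, and a transaction that either commits fully or rolls back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Fixed schema: every row in each table has the same columns.
conn.executescript("""
CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE orders (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id),
    total   REAL NOT NULL
);
""")

def place_order(conn, user_id, total):
    """Insert an order atomically; return False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "INSERT INTO orders (user_id, total) VALUES (?, ?)",
                (user_id, total),
            )
        return True
    except sqlite3.IntegrityError:
        return False

with conn:
    conn.execute("INSERT INTO users (id, name) VALUES (1, 'Ada')")

ok = place_order(conn, 1, 42.5)        # valid user
rejected = place_order(conn, 999, 10)  # no such user: foreign key violation

# A relational query: join the two tables through the key.
rows = conn.execute(
    "SELECT u.name, o.total FROM users u JOIN orders o ON o.user_id = u.id"
).fetchall()
```

The second call fails because user 999 does not exist, so the database refuses to store an order that would break the relationship.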

Common SQL Databases:

MySQL
PostgreSQL
SQLite
Oracle Database

NoSQL Databases

Definition:
NoSQL databases are non-relational and store data in flexible formats like key-value pairs, documents, graphs, or wide-columns, allowing for dynamic, scalable data handling.

Key Features:

Flexible schema, can handle unstructured or semi-structured data.
Designed for horizontal scalability and high performance.
Many systems favor eventual consistency over strict ACID guarantees, though this varies by product (MongoDB, for example, supports multi-document ACID transactions).
Supports large-scale distributed data systems.
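
To make "flexible schema" concrete, here is a toy in-memory document collection in Python. It is a sketch of the idea only and does not reflect how any particular NoSQL product stores data; the insert and find helpers are hypothetical names.

```python
import json

# A toy "collection": document id -> JSON-like dict. No schema is declared up front.
collection = {}

def insert(doc_id, document):
    # Round-trip through JSON to mimic storing a serialized document.
    collection[doc_id] = json.loads(json.dumps(document))

def find(**criteria):
    """Return documents whose top-level fields match every criterion."""
    return [
        doc for doc in collection.values()
        if all(doc.get(field) == value for field, value in criteria.items())
    ]

# Two documents in the same collection with different shapes.
insert("u1", {"name": "Ada", "email": "ada@example.com"})
insert("u2", {"name": "Lin", "tags": ["admin"], "address": {"city": "Oslo"}})

by_name = find(name="Ada")
admins = [doc for doc in collection.values() if "admin" in doc.get("tags", [])]
```

Neither document had to match a predefined table definition, which is the trade-off against the fixed schemas of SQL databases: more flexibility, but the application must cope with missing fields.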

Common NoSQL Databases:

MongoDB (Document)
Redis (Key-Value)
Cassandra (Wide-Column)
Neo4j (Graph)

Key Differences

Data Model: SQL uses tables and relations; NoSQL uses flexible formats like JSON or key-value.
Schema: SQL requires predefined schemas; NoSQL allows dynamic schemas.
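Scalability: SQL databases traditionally scale vertically (a bigger server), though they can also be replicated and sharded; NoSQL databases are typically designed to scale horizontally (more servers).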
Use Cases: SQL is ideal for structured data and transactions; NoSQL is great for big data, real-time apps, and distributed systems.

Indexing, replication, and sharding

Indexing, Replication, and Sharding

Indexing

Definition: Indexing is the process of creating data structures that improve the speed of data retrieval operations in a database, typically by reducing the number of records the system needs to scan.

How It Works: When an index is created on one or more columns of a table, the database engine maintains a separate data structure (commonly a B-tree) that maps indexed values to the corresponding records and keeps it up to date on every write. This works like the index of a book: instead of scanning every page, the engine jumps directly to the matching entries.
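
The effect of an index can be observed with SQLite's EXPLAIN QUERY PLAN, as in the sketch below. The users table, the idx_users_email index name, and the query_plan helper are assumptions made for illustration. The exact wording of the plan text varies between SQLite versions, so the example prints it rather than promising specific output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users (name, email) VALUES (?, ?)",
    [(f"user{i}", f"user{i}@example.com") for i in range(1000)],
)

def query_plan(conn, sql, params=()):
    """Return SQLite's human-readable plan text for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail text

lookup = "SELECT name FROM users WHERE email = ?"
args = ("user500@example.com",)

plan_before = query_plan(conn, lookup, args)   # expected to report a full table scan
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = query_plan(conn, lookup, args)    # expected to report a search using the index

print(plan_before)
print(plan_after)
```

Before the index exists, the engine has to scan the table; afterwards the plan names idx_users_email, meaning only the matching entries are visited.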

Types of Indexes:

Single-column index
Composite (multi-column) index
Unique index
Full-text index

Benefits:

Faster query performance for search, sort, and filtering operations
Reduces CPU and I/O cost for frequent read operations

Trade-offs:

Consumes additional disk space
May slow down write operations (INSERT, UPDATE, DELETE)

Replication

Definition: Replication is the process of copying data from one database server (primary/master) to one or more servers (replicas/slaves) to improve availability and fault tolerance.

How It Works: In a single-primary setup, the primary server handles all write operations and propagates data changes to the replicas. The replicas typically serve read-only queries, which distributes read load and leaves a copy of the data available if the primary fails.
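
The flow described above can be sketched as a toy single-primary model in Python. The Primary and Replica classes are hypothetical and ignore networking, failures, and failover; the point is only to show why a replica can briefly return stale data when changes are shipped asynchronously.

```python
from collections import deque

class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, change):
        key, value = change
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)

class Primary:
    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas
        self.pending = deque()  # changes not yet shipped to replicas

    def write(self, key, value, synchronous=False):
        self.data[key] = value
        self.pending.append((key, value))
        if synchronous:
            self.flush()  # wait for replicas before acknowledging the write

    def flush(self):
        """Ship pending changes to every replica, in write order."""
        while self.pending:
            change = self.pending.popleft()
            for replica in self.replicas:
                replica.apply(change)

replica = Replica()
primary = Primary([replica])

primary.write("balance", 100)      # asynchronous: the replica lags behind
stale = replica.read("balance")    # None until the change is shipped
primary.flush()
fresh = replica.read("balance")    # 100 once replication catches up

primary.write("balance", 80, synchronous=True)
in_sync = replica.read("balance")  # 80 immediately
```

Synchronous replication removes the stale read at the cost of making every write wait for the replicas.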

Types of Replication:

Master-slave (primary-secondary) replication
Master-master (multi-primary) replication
Synchronous vs. asynchronous replication

Benefits:

High availability and data redundancy
Improved performance by load-balancing read operations
Disaster recovery readiness

Trade-offs:

Possible data lag in asynchronous replication
Increased complexity in system design and maintenance

Sharding

Definition: Sharding is a method of database partitioning that distributes data across multiple machines or databases to handle large datasets and improve scalability.

How It Works: Instead of storing all records in a single database, sharding splits them based on a shard key (e.g., user ID, region, etc.). Each shard holds a portion of the data and operates independently.

Shard Strategies:

Range-based sharding
Hash-based sharding
Geo-based sharding
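
Each strategy boils down to a routing function from a shard key to a shard number. The sketch below is illustrative only: the shard count, the range boundaries, and the region names are made-up assumptions. It uses hashlib rather than Python's built-in hash() because the built-in hash of a string changes between processes, which would route the same key to different shards after a restart.

```python
import bisect
import hashlib

NUM_SHARDS = 4  # assumed shard count for illustration

def hash_shard(key, num_shards=NUM_SHARDS):
    """Hash-based: a stable hash of the key, modulo the shard count."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Range-based: each shard owns the user IDs below its upper bound (hypothetical bounds).
RANGE_UPPER_BOUNDS = [1_000, 10_000, 100_000]  # shard 3 takes everything else

def range_shard(user_id):
    return bisect.bisect_right(RANGE_UPPER_BOUNDS, user_id)

# Geo-based: a static region -> shard lookup (hypothetical regions).
REGION_TO_SHARD = {"eu": 0, "us": 1, "apac": 2}

def geo_shard(region):
    return REGION_TO_SHARD[region]
```

Range-based routing keeps neighboring keys together, which helps range scans but can create hot shards; hash-based routing spreads keys evenly but scatters ranges across shards.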

Benefits:

Improves read/write performance by distributing load
Enables horizontal scaling to handle large volumes of data
Isolates failures to specific shards

Trade-offs:

Complexity in query routing and data aggregation
Rebalancing shards can be operationally challenging
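
One reason rebalancing is hard: with plain "hash modulo shard count" routing, adding a shard changes the destination of most keys. The sketch below compares that with a consistent-hashing ring, a commonly used alternative that this section does not otherwise cover. The shard names, key format, and the choice of 100 virtual nodes per shard are assumptions for illustration.

```python
import bisect
import hashlib

def stable_hash(value):
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def modulo_shard(key, shards):
    return shards[stable_hash(key) % len(shards)]

class HashRing:
    """Consistent hashing with virtual nodes."""

    def __init__(self, shards, vnodes=100):
        # Each shard is placed at many pseudo-random points on the ring.
        self._points = sorted(
            (stable_hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def shard_for(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._hashes, stable_hash(key)) % len(self._points)
        return self._points[idx][1]

keys = [f"user-{i}" for i in range(10_000)]
old_shards = ["s0", "s1", "s2", "s3"]
new_shards = old_shards + ["s4"]

moved_modulo = sum(
    modulo_shard(k, old_shards) != modulo_shard(k, new_shards) for k in keys
)

old_ring, new_ring = HashRing(old_shards), HashRing(new_shards)
moved_ring = sum(old_ring.shard_for(k) != new_ring.shard_for(k) for k in keys)

print(f"modulo: {moved_modulo / len(keys):.0%} of keys moved")
print(f"ring:   {moved_ring / len(keys):.0%} of keys moved")
```

With a uniform hash, going from 4 to 5 shards under modulo routing is expected to move about four fifths of the keys, while the ring is expected to move only about one fifth, all of them onto the new shard. The exact percentages printed depend on the keys used.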

Caching strategies (Redis, Memcached)

Caching is a technique used to store frequently accessed data in a high-speed storage layer (cache) to reduce database load, decrease latency, and improve application performance.

What Is Caching?

Caching involves temporarily storing copies of data so future requests for that data can be served faster. It reduces the need to repeatedly access slower storage layers such as databases or APIs.

Why Use Caching?

Reduces latency and response times
Decreases load on databases and servers
Improves application scalability
Enhances user experience by speeding up repeated data access

Common Caching Strategies

1. Write-Through Cache

Data is written to the cache and the database simultaneously.
Keeps the cache consistent with the database for every key written through it.
Slightly slower for write operations due to dual writes.
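
A minimal write-through sketch, using plain dictionaries as stand-ins for a real cache and database client. The class name is hypothetical, and writing the database first is a design choice of this sketch, not something the strategy prescribes.

```python
class WriteThroughCache:
    def __init__(self, database):
        self.database = database  # stand-in for a real database client
        self.cache = {}

    def write(self, key, value):
        # Database first, so a failed database write never leaves
        # a value that exists only in the cache.
        self.database[key] = value
        self.cache[key] = value

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value
        return value

database = {}
store = WriteThroughCache(database)
store.write("user:1", {"name": "Ada"})
```

Every write pays for two updates, which is the "slightly slower" cost noted above, but a read immediately after a write is served from the cache.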

2. Write-Around Cache

Data is written only to the database.
The cache is populated only when the data is later read (on a cache miss).
Prevents cache pollution for infrequently read data.
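
A minimal write-around sketch under the same assumptions (dictionaries as stand-ins, hypothetical class name). Dropping any existing cached copy on write is an addition of this sketch so that reads do not return a stale value.

```python
class WriteAroundCache:
    def __init__(self, database):
        self.database = database  # stand-in for a real database client
        self.cache = {}

    def write(self, key, value):
        self.database[key] = value
        self.cache.pop(key, None)  # drop any stale copy; do not cache the new value

    def read(self, key):
        if key not in self.cache:
            value = self.database.get(key)
            if value is None:
                return None
            self.cache[key] = value  # populate on the first read after a write
        return self.cache[key]

database = {}
store = WriteAroundCache(database)
store.write("report:2024", "large, rarely read payload")
cached_after_write = "report:2024" in store.cache   # False: the write bypassed the cache
first_read = store.read("report:2024")
cached_after_read = "report:2024" in store.cache    # True: the read populated it
```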

3. Write-Back (Write-Behind) Cache

Data is written to the cache first, then asynchronously written to the database.
Improves write performance but increases risk of data loss if the cache fails.
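
A minimal write-back sketch, again with dictionaries as stand-ins and a hypothetical class name. Here flush() is called explicitly; a real system would typically run it in the background on a timer or when a batch fills up.

```python
class WriteBackCache:
    def __init__(self, database):
        self.database = database  # stand-in for a real database client
        self.cache = {}
        self.dirty = set()  # keys written to the cache but not yet persisted

    def write(self, key, value):
        self.cache[key] = value  # fast path: no database round trip
        self.dirty.add(key)

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        return self.database.get(key)

    def flush(self):
        """Persist all dirty entries to the database."""
        for key in list(self.dirty):
            self.database[key] = self.cache[key]
        self.dirty.clear()

database = {}
store = WriteBackCache(database)
store.write("counter", 1)
store.write("counter", 2)                 # overwrites in the cache only
persisted_before = database.get("counter")  # None: nothing written yet
store.flush()
persisted_after = database.get("counter")   # 2: only the latest value is written
```

Anything still in the dirty set when the cache process dies is lost, which is the data-loss risk noted above.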

4. Cache-Aside (Lazy Loading)

Data is loaded into the cache only when requested.
Application checks cache first, then fetches from the database if not found.
Most flexible and commonly used strategy.
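
The cache-aside read path can be sketched as follows. UserRepository, the dictionary stand-ins, and the db_reads counter are hypothetical and exist only to make the hit and miss paths visible. Invalidating on update is one common choice; updating the cached value in place is another.

```python
class UserRepository:
    """Cache-aside: the application, not the cache, decides when to load."""

    def __init__(self, database):
        self.database = database  # stand-in for a real database client
        self.cache = {}
        self.db_reads = 0

    def get(self, user_id):
        if user_id in self.cache:        # cache hit
            return self.cache[user_id]
        self.db_reads += 1               # cache miss: go to the database
        user = self.database.get(user_id)
        if user is not None:
            self.cache[user_id] = user   # store for future requests
        return user

    def update(self, user_id, user):
        self.database[user_id] = user
        self.cache.pop(user_id, None)    # invalidate; the next get() reloads

repo = UserRepository({1: {"name": "Ada"}})
repo.get(1)   # miss: loads from the database
repo.get(1)   # hit: served from the cache
reads_after_two_gets = repo.db_reads
```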

Popular Tools: Redis vs. Memcached

Redis

Data Types: Supports strings, lists, sets, sorted sets, hashes, bitmaps, hyperloglogs, and streams.
Persistence: Offers point-in-time snapshots (RDB) and an append-only file (AOF) for data persistence.
Advanced Features: Pub/Sub, Lua scripting, transactions, geospatial indexing.
Use Cases: Session storage, leaderboards, caching complex objects, message queues.

Memcached

Data Types: Supports strings only (key-value pairs).
Persistence: No built-in persistence; all data is stored in-memory only.
Simplicity: Lightweight, fast, and easy to set up.
Use Cases: Simple caching needs, session storage, read-heavy workloads.

Choosing Between Redis and Memcached

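Feature	Redis	Memcached
Persistence	Yes (RDB snapshots, AOF)	No (in-memory only)
Data Structures	Rich (strings, lists, sets, hashes, etc.)	Strings only
Performance	Fast, with some overhead from its richer feature set	Lightweight and fast
Scalability	Redis Cluster (built-in sharding)	Distribution handled by the client across independent servers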

Best Practices

Choose appropriate eviction policies (e.g., LRU, LFU).
Set TTLs (Time-to-Live) to automatically expire stale data.
Avoid caching sensitive data unless encrypted.
Monitor cache hit/miss ratios to optimize performance.
Use consistent hashing for distributed caching systems.
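
Several of these practices can be seen together in a small in-process cache. The sketch below combines LRU eviction, a TTL per entry, and hit/miss counters. It is a teaching example with a hypothetical class name, not a replacement for Redis or Memcached, both of which provide eviction and expiry themselves. The clock parameter is an assumption added so that expiry can be demonstrated without sleeping.

```python
import time
from collections import OrderedDict

class LRUCacheWithTTL:
    def __init__(self, capacity, ttl_seconds, clock=time.monotonic):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = OrderedDict()  # key -> (value, expires_at), oldest first
        self.hits = 0
        self.misses = 0

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None or entry[1] <= self.clock():
            self._entries.pop(key, None)   # drop the expired entry, if any
            self.misses += 1
            return None
        self._entries.move_to_end(key)     # mark as most recently used
        self.hits += 1
        return entry[0]

    def set(self, key, value):
        self._entries[key] = (value, self.clock() + self.ttl)
        self._entries.move_to_end(key)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

# Usage with a fake clock so the TTL can be advanced by hand.
now = [0.0]
cache = LRUCacheWithTTL(capacity=2, ttl_seconds=10, clock=lambda: now[0])
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")      # "a" becomes the most recently used entry
cache.set("c", 3)   # capacity exceeded: "b" is evicted
now[0] = 11.0       # every entry is now past its TTL
```

Tracking hit_ratio() over time is a simple way to judge whether the capacity and TTL are well chosen: a low ratio suggests entries are evicted or expired before they are reused.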