Distributed Systems
System reliability is a crucial aspect of designing and managing distributed systems. It ensures that systems perform consistently, meeting user expectations even in the face of failures. This article explores key concepts, challenges, and strategies related to system reliability in large-scale distributed systems.
Distributed Systems and Failures
Distributed systems are composed of multiple independent components that communicate and coordinate to achieve a common goal. These systems are more prone to failures due to their complexity and scale.
More Likely to Fail
- Failure Can Be:
- Partial
- Independent
Failures in Large Scale Distributed Systems
- Definition: Large-scale systems are typically distributed systems, composed of many independent components that communicate and coordinate to achieve a common goal.
- Characteristics:
- Large number of components: Many different parts (software or hardware) make up the system.
- Large number of component instances: Each component may have multiple instances to handle load and provide scalability.
Types of Failures
Partial Failures
- Definition: A specific part of the system malfunctions while the rest continues to operate.
- Example: One service stops working, but other services remain functional.
Independent Failures
- Definition: A failure that is contained within one component and does not impact other services or components in the system.
- Example: If one service fails, it doesn't affect the functionality of other services.
Single Point of Failure (SPOF)
- Definition: A component or part of the system that, if it fails, can cause the entire system to fail.
- Example: A critical service required by many others fails.
Challenges in Large Scale Distributed Systems
- Increased chance of partial failures due to numerous components.
- Cascading effects: Partial failures can trigger further failures and bring down the whole system (a domino effect).
- Identifying SPOFs: Determining which components could bring down the entire system.
- Mitigation strategies: Implementing redundancy, fault tolerance, and monitoring.
Reliability Engineering
- Definition: Systematic practice of predicting, preventing, and managing failure probabilities.
- Key Concepts:
- Reliability
- Availability
- Fault Tolerance
Reliability
- Definition: The probability that a system performs its intended function adequately for a specified period of time under stated conditions.
- Characteristics:
- A reliable system remains functional despite partial failures.
- Measured as the probability of correct operation over a given time interval (a formula sketch follows the breakdown below).
Breakdown of Definition
- Probability: Likelihood of the system functioning correctly.
- Intended Function: Purpose the system is designed to achieve.
- Adequately: Meeting acceptable performance standards.
- Specified Time: Reliability varies over different durations.
- Stated Conditions: Operating conditions influence reliability.
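To make the definition concrete, reliability over an interval can be computed once a failure model is chosen. The sketch below assumes a constant failure rate (the exponential model), which these notes do not specify; `mtbf_hours` and `mission_hours` are illustrative names.

```python
import math

def reliability(mtbf_hours: float, mission_hours: float) -> float:
    """Probability of failure-free operation for `mission_hours`, assuming a
    constant failure rate (exponential model): R(t) = exp(-t / MTBF)."""
    return math.exp(-mission_hours / mtbf_hours)

# e.g., with a mean time between failures of 1,000 hours, the chance of
# operating correctly for a 100-hour interval is about 90.5%.
print(f"{reliability(mtbf_hours=1_000, mission_hours=100):.3f}")   # ~0.905
```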
Availability
- Definition: Probability that a system is operational and accessible when needed.
- Calculation Methods (both are shown in the sketch after this list):
- Time-Based Availability:
- Formula: Availability = Uptime / (Uptime + Downtime) × 100
- Example: 90% availability if uptime is 900 hours and downtime is 100 hours over a 1,000-hour period.
- Request-Based Availability:
- Formula: Availability = Total Successful Requests / Total Requests × 100
- Example: 94.44% availability if 8500 of 9000 requests are successful.
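A minimal sketch of both calculation methods, reusing the figures from the examples above; the function names are illustrative.

```python
def time_based_availability(uptime_hours: float, downtime_hours: float) -> float:
    """Availability = Uptime / (Uptime + Downtime) x 100."""
    return uptime_hours / (uptime_hours + downtime_hours) * 100

def request_based_availability(successful: int, total: int) -> float:
    """Availability = Successful Requests / Total Requests x 100."""
    return successful / total * 100

print(time_based_availability(900, 100))                   # 90.0
print(round(request_based_availability(8_500, 9_000), 2))  # 94.44
```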
High Availability
- Goal: Minimize downtime and ensure continuous operation.
- Trade-offs:
- Shipping new features vs. maintaining availability.
- Operational costs of redundancy and fault tolerance.
- The Nines of Availability:
| Availability | Nines | Downtime per year |
|---|---|---|
| 99% | 2 Nines | ~3.65 days |
| 99.9% | 3 Nines | ~8.77 hours |
| 99.99% | 4 Nines | ~52.6 minutes |
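The downtime column follows directly from the availability percentage. A quick sketch, using a 365-day year (so the results differ slightly from the table above, which appears to assume 365.25 days):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # 365-day year; the table uses ~365.25 days

def downtime_per_year_hours(availability_percent: float) -> float:
    """Allowed downtime per year (in hours) at a given availability."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100) / 3600

for availability in (99.0, 99.9, 99.99):
    print(f"{availability}% -> {downtime_per_year_hours(availability):.2f} hours/year")
# 99.0%  -> 87.60 hours/year  (~3.65 days)
# 99.9%  -> 8.76 hours/year
# 99.99% -> 0.88 hours/year   (~52.6 minutes)
```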
Fault Tolerance
- Definition: Techniques to enable continued operation despite faults.
- Key Elements:
- Detect, handle, and recover from partial failures.
- Design Strategies:
- Redundancy: Duplicate critical components (e.g., RAID, replicated databases).
- Types: Active, Passive, Cold.
- Fault Detection: Identify response, timeout, and crash failures using health checks.
- Recovery: Use techniques like hot standby, master-slave failover, and load balancers.
Redundancy Types
- Active (Hot Spare): All nodes actively process tasks simultaneously.
- Passive (Warm Spare): Standby nodes ready to take over on failure.
- Cold Redundancy: Spare nodes activated only during failover.
Health Checks
- Ping-Based: External monitoring sends ping requests.
- Heartbeat-Based: Cluster servers exchange heartbeat signals.
- Application Health Checks: The application exposes an HTTP/TCP endpoint that is probed periodically (see the sketch below).
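A minimal sketch of an application-level health check: probe an HTTP endpoint periodically and declare the node failed after several consecutive misses. The URL, interval, and threshold are illustrative values, not ones prescribed here.

```python
import time
import urllib.request

def is_healthy(url: str, timeout_s: float = 2.0) -> bool:
    """Application health check: an HTTP 200 within the timeout counts as healthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:                          # timeout, connection error, HTTP error
        return False

def monitor(url: str, interval_s: float = 10.0, failure_threshold: int = 3) -> None:
    """Probe periodically; declare the node down after several consecutive misses,
    which catches crash failures as well as timeout failures."""
    consecutive_failures = 0
    while True:
        if is_healthy(url):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                print(f"{url} considered down -- trigger failover")
                return
        time.sleep(interval_s)
```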
Recovery Techniques
- Stateless: Fail over to a hot or warm standby instance (sketched below).
- Stateful: Database or cache failover, which also requires recovering or replicating the stored state.
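For a stateless service, recovery can be as simple as retrying the same request against a standby; this also illustrates passive redundancy. A minimal sketch, with made-up hostnames and port.

```python
import urllib.request

# The primary first, then warm standbys (illustrative hostnames and port).
ENDPOINTS = ["http://primary:8080", "http://standby-1:8080", "http://standby-2:8080"]

def call_with_failover(path: str, timeout_s: float = 2.0) -> bytes:
    """Send the request to the primary; on failure, fall through to a standby."""
    last_error = None
    for base in ENDPOINTS:
        try:
            with urllib.request.urlopen(base + path, timeout=timeout_s) as resp:
                return resp.read()
        except OSError as exc:              # connection refused, timeout, HTTP error, ...
            last_error = exc                # try the next endpoint
    raise RuntimeError("all endpoints failed") from last_error
```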
System Stability
- Timeouts: Safeguards against blocked threads and dependency failures.
- Retries: Retry transient errors using exponential back-off (both retries and the circuit breaker are sketched after this list).
- Circuit Breaker Pattern:
- Closed: Normal operations.
- Open: Immediate error response during failures.
- Half-Open: Limited request testing before resuming normal operations.
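A minimal sketch of two of the stability mechanisms above: retries with exponential back-off (plus jitter) and a circuit breaker that moves between closed, open, and half-open. The thresholds, timeouts, and class names are illustrative assumptions.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a retryable error such as a timeout or an HTTP 503."""

def retry_with_backoff(operation, max_attempts: int = 5, base_delay_s: float = 0.1):
    """Retry a transient failure with exponential back-off plus jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise                                   # out of attempts
            delay = base_delay_s * (2 ** attempt) + random.uniform(0, base_delay_s)
            time.sleep(delay)

class CircuitBreaker:
    """Closed: pass calls through. Open: fail fast. Half-open: allow one trial call."""

    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None                           # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                raise RuntimeError("circuit open: failing fast")   # open state
            # reset timeout elapsed: half-open, let one trial call through
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()       # trip (or re-trip) the breaker
            raise
        self.failures = 0
        self.opened_at = None                           # success: back to closed
        return result
```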
Conclusion
Reliability in distributed systems is vital for ensuring seamless operations, especially in large-scale environments. By addressing challenges, implementing redundancy, detecting faults proactively, and recovering efficiently, systems can achieve high reliability and availability. Continuous improvement in design and monitoring is essential for building robust distributed systems.