- Overview
- What is Cache-Aware Programming?
- Why is Cache Performance Important?
- How CPU Caches Work
- Cache Architecture
- Cache Performance Concepts
- Memory Access Patterns
- Cache Optimization Techniques
- Implementation
- Multi-core Cache Considerations
- Performance Profiling
- Common Pitfalls
- Best Practices
- Interview Questions
- Additional Resources
Cache-aware programming optimizes code to work efficiently with CPU cache memory, improving performance by reducing cache misses and maximizing cache utilization. In embedded systems, understanding cache behavior is crucial for real-time performance and power efficiency.
- Cache locality - Keeping frequently accessed data close together in memory
- Cache line alignment - Aligning data structures to cache line boundaries
- Memory access patterns - Optimizing how data is accessed sequentially
- False sharing prevention - Avoiding cache line conflicts in multi-threaded code
- Cache-aware data structures - Designing data structures for cache efficiency
Cache-aware programming is a technique that optimizes code to work efficiently with the CPU's cache memory hierarchy. It involves understanding how caches work and designing algorithms and data structures to minimize cache misses and maximize cache hits.
- Spatial Locality: Accessing data that is close together in memory
- Temporal Locality: Reusing recently accessed data
- Cache Line Awareness: Understanding cache line boundaries and alignment
- Memory Access Patterns: Optimizing sequential vs. random access patterns
- Data Structure Design: Designing structures for cache-friendly access
Modern CPUs are much faster than memory. The cache acts as a high-speed buffer between the CPU and main memory, significantly reducing memory access latency.
Memory Hierarchy Performance:
┌─────────────┬────────────┬────────────┐
│ Level       │ Latency    │ Size       │
├─────────────┼────────────┼────────────┤
│ L1 Cache    │ 1-3 ns     │ 32-64 KB   │
│ L2 Cache    │ 10-20 ns   │ 256-512 KB │
│ L3 Cache    │ 40-80 ns   │ 4-32 MB    │
│ Main Memory │ 100-300 ns │ 4-32 GB    │
│ Disk/Flash  │ 10-100 μs  │ 100+ GB    │
└─────────────┴────────────┴────────────┘
- Speed: Cache hits are 10-100x faster than memory accesses
- Power Efficiency: Cache accesses consume less power than memory accesses
- Real-time Performance: Predictable cache behavior improves real-time performance
- Scalability: Cache-efficient code scales better with larger datasets
- Up to 10x performance improvement for cache-friendly algorithms
- Significant power reduction in cache-optimized embedded systems
- Predictable timing for real-time applications
- Better scalability for multi-core systems
High Impact Scenarios:
- Large data processing applications
- Real-time systems with strict timing requirements
- Multi-core systems with shared caches
- Memory-intensive algorithms
- Embedded systems with limited cache
Low Impact Scenarios:
- Small datasets that fit entirely in cache
- I/O-bound applications
- Simple algorithms with minimal memory access
- Systems with abundant memory bandwidth
A cache is a small, fast memory that stores frequently accessed data. When the CPU needs data, it first checks the cache. If the data is found (cache hit), it's retrieved quickly. If not found (cache miss), the data must be fetched from slower memory.
Cache Structure:
┌─────────────────────────────────────────────────────────────┐
│ Cache Memory │
├─────────────────┬─────────────────┬─────────────────┬───────┤
│ Cache Line 0 │ Cache Line 1 │ Cache Line 2 │ ... │
│ ┌─────────────┐│ ┌─────────────┐│ ┌─────────────┐│ │
│ │ Tag ││ │ Tag ││ │ Tag ││ │
│ │ Data ││ │ Data ││ │ Data ││ │
│ │ Valid ││ │ Valid ││ │ Valid ││ │
│ └─────────────┘│ └─────────────┘│ └─────────────┘│ │
└─────────────────┴─────────────────┴─────────────────┴───────┘
A cache line is the smallest unit of data that can be transferred between cache and memory. Typical cache line sizes are 32, 64, or 128 bytes.
Cache Line Structure:
┌─────────────────────────────────────────────────────────────┐
│ Cache Line (64 bytes) │
├─────────┬─────────┬─────────┬─────────┬─────────┬───────────┤
│ Byte 0 │ Byte 1 │ Byte 2 │ ... │ Byte 62│ Byte 63 │
└─────────┴─────────┴─────────┴─────────┴─────────┴───────────┘
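On hosted Linux/glibc targets, the line size can be queried at runtime rather than hard-coded. A small sketch (the `_SC_LEVEL1_DCACHE_LINESIZE` name is a glibc extension, and the 64-byte fallback is an assumption, not from the text):

```c
#include <unistd.h>

// Query the L1 data cache line size at runtime.
// Falls back to a common default when the value is unavailable.
long l1_dcache_line_size(void) {
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (line > 0) {
        return line;
    }
#endif
    return 64; // common default on x86-64 and many ARM cores
}
```

On bare-metal embedded targets this facility does not exist; there the line size comes from the CPU's identification registers, as shown later in this document.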
Cache Hit:
- CPU requests data at address X
- Cache controller checks if data is in cache
- Data found in cache (hit)
- Data returned to CPU quickly (1-3 cycles)
Cache Miss:
- CPU requests data at address X
- Cache controller checks if data is in cache
- Data not found in cache (miss)
- Cache line containing address X is fetched from memory
- Data is stored in cache and returned to CPU (100+ cycles)
When a cache miss occurs and the cache is full, some data must be evicted to make room for new data.
Common Replacement Policies:
- LRU (Least Recently Used): Evict least recently accessed data
- FIFO (First In, First Out): Evict oldest data
- Random: Randomly select data to evict
- LFU (Least Frequently Used): Evict least frequently accessed data
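As an illustration of the LRU policy above, here is a toy fully-associative cache that evicts the least recently used entry on a miss (the type names, 4-way size, and timestamp scheme are illustrative, not a hardware description):

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS 4 // toy fully-associative cache with 4 entries

typedef struct {
    uint32_t tag;
    bool     valid;
    uint64_t last_used; // timestamp used to rank recency
} way_t;

typedef struct {
    way_t    ways[WAYS];
    uint64_t clock;
} lru_cache_t;

// Returns true on hit; on a miss, fills an invalid way if one exists,
// otherwise evicts the least recently used way.
bool lru_access(lru_cache_t* c, uint32_t tag) {
    c->clock++;
    for (int i = 0; i < WAYS; i++) {
        if (c->ways[i].valid && c->ways[i].tag == tag) {
            c->ways[i].last_used = c->clock; // refresh recency on hit
            return true;
        }
    }
    // Miss: pick an invalid way, else the least recently used one
    int victim = 0;
    for (int i = 0; i < WAYS; i++) {
        if (!c->ways[i].valid) { victim = i; break; }
        if (c->ways[i].last_used < c->ways[victim].last_used) victim = i;
    }
    c->ways[victim] = (way_t){ .tag = tag, .valid = true, .last_used = c->clock };
    return false;
}
```

Real caches approximate LRU in hardware (e.g., pseudo-LRU trees), since exact timestamps are too expensive per line.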
Modern processors have multiple levels of cache, each with different characteristics:
Cache Hierarchy:
┌─────────────────────────────────────────────────────────────┐
│ CPU Core │
│   ┌─────────────────┬─────────────────┐                     │
│   │    L1 Data      │ L1 Instruction  │                     │
│   │     Cache       │     Cache       │                     │
│   │   (32-64 KB)    │   (32-64 KB)    │                     │
│   └─────────────────┴─────────────────┘                     │
├─────────────────────────────────────────────────────────────┤
│ L2 Cache │
│ (256-512 KB) │
├─────────────────────────────────────────────────────────────┤
│ L3 Cache │
│ (4-32 MB) │
├─────────────────────────────────────────────────────────────┤
│ Main Memory │
│ (4-32 GB) │
└─────────────────────────────────────────────────────────────┘
// Cache configuration for ARM Cortex-M7
typedef struct {
    uint32_t l1_data_size;        // L1 data cache size
    uint32_t l1_instruction_size; // L1 instruction cache size
    uint32_t l2_size;             // L2 cache size
    uint32_t cache_line_size;     // Cache line size in bytes
    uint32_t associativity;       // Cache associativity
} cache_config_t;

cache_config_t get_cache_config(void) {
    cache_config_t config = {0};
    // Read the Cache Type Register (CTR). DminLine (bits 19:16) encodes
    // the data cache line size as log2 of the number of 4-byte words.
    uint32_t ctr = __builtin_arm_mrc(15, 0, 0, 0, 1);
    config.cache_line_size = 4 << ((ctr >> 16) & 0xF);
    // Cache sizes and associativity are not encoded in CTR; read them
    // from CCSIDR after selecting the cache level via CSSELR.
    return config;
}
// Cache line aligned data structure
#define CACHE_LINE_SIZE 64

typedef struct {
    uint8_t data[CACHE_LINE_SIZE];
} __attribute__((aligned(CACHE_LINE_SIZE))) cache_line_t;

// Array of cache-line aligned data
typedef struct {
    cache_line_t lines[100];
} cache_aligned_array_t;

// Ensure data fits in cache lines
typedef struct {
    uint32_t value1;
    uint32_t value2;
    char padding[CACHE_LINE_SIZE - 2 * sizeof(uint32_t)]; // pad to a full line
} __attribute__((aligned(CACHE_LINE_SIZE))) separated_data_t;

Spatial Locality: The tendency to access data that is close together in memory.
Example of Spatial Locality:
┌─────────────────────────────────────────────────────────────┐
│ Memory Array │
├─────────┬─────────┬─────────┬─────────┬─────────┬───────────┤
│ Data[0]│ Data[1]│ Data[2]│ Data[3]│ Data[4]│ Data[5] │
└─────────┴─────────┴─────────┴─────────┴─────────┴───────────┘
↑
Sequential access pattern
(Good spatial locality)
Temporal Locality: The tendency to access the same data repeatedly over time.
Example of Temporal Locality:
for (int pass = 0; pass < 10; pass++) {
    for (int i = 0; i < 1000; i++) {
        sum += data[i]; // data[] is reused on every pass
    }
}
- Compulsory Misses: First access to data (unavoidable)
- Capacity Misses: Cache is too small to hold all needed data
- Conflict Misses: Multiple data items map to the same cache location
- Coherence Misses: Cache invalidation in multi-core systems
Hit Rate: Percentage of memory accesses that result in cache hits
Hit Rate = (Cache Hits) / (Cache Hits + Cache Misses) × 100%
Miss Rate: Percentage of memory accesses that result in cache misses
Miss Rate = (Cache Misses) / (Cache Hits + Cache Misses) × 100%
Average Memory Access Time (AMAT):
AMAT = Hit Time + Miss Rate × Miss Penalty
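The AMAT formula can be applied directly; a minimal helper (the specific numbers in the usage note are illustrative):

```c
// Average Memory Access Time, straight from the formula above
double amat_ns(double hit_time_ns, double miss_rate, double miss_penalty_ns) {
    return hit_time_ns + miss_rate * miss_penalty_ns;
}
```

For example, a 2 ns hit time, a 5% miss rate, and a 100 ns miss penalty give 2 + 0.05 × 100 = 7 ns: halving the miss rate would cut the average access time by more than a third.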
Sequential access patterns are cache-friendly because they exhibit good spatial locality.
Sequential Access Pattern:
┌─────────────────────────────────────────────────────────────┐
│ Memory Layout │
├─────────┬─────────┬─────────┬─────────┬─────────┬───────────┤
│ Data[0]│ Data[1]│ Data[2]│ Data[3]│ Data[4]│ Data[5] │
└─────────┴─────────┴─────────┴─────────┴─────────┴───────────┘
↑ ↑ ↑ ↑ ↑ ↑
Access 1 Access 2 Access 3 Access 4 Access 5 Access 6
Benefits:
- Excellent spatial locality
- High cache hit rates
- Predictable performance
- Easy to optimize
Random access patterns can cause cache misses and poor performance.
Random Access Pattern:
┌─────────────────────────────────────────────────────────────┐
│ Memory Layout │
├─────────┬─────────┬─────────┬─────────┬─────────┬───────────┤
│ Data[0]│ Data[1]│ Data[2]│ Data[3]│ Data[4]│ Data[5] │
└─────────┴─────────┴─────────┴─────────┴─────────┴───────────┘
↑ ↑ ↑ ↑ ↑
Access 1 Access 2 Access 3 Access 4 Access 5
Challenges:
- Poor spatial locality
- Low cache hit rates
- Unpredictable performance
- Difficult to optimize
Strided access patterns access data with a fixed stride (step size).
Strided Access Pattern (stride = 2):
┌─────────────────────────────────────────────────────────────┐
│ Memory Layout │
├─────────┬─────────┬─────────┬─────────┬─────────┬───────────┤
│ Data[0]│ Data[1]│ Data[2]│ Data[3]│ Data[4]│ Data[5] │
└─────────┴─────────┴─────────┴─────────┴─────────┴───────────┘
↑ ↑ ↑
Access 1 Access 2 Access 3
Optimization Strategies:
- Adjust data layout for better locality
- Use cache-aware data structures
- Implement prefetching
Align data structures to cache line boundaries to avoid cache line splits.
// Optimize data structures for cache
typedef struct {
    uint32_t frequently_accessed;      // Hot data
    uint32_t rarely_accessed;          // Cold data
    char padding[CACHE_LINE_SIZE - 8]; // Push following data to the next line
} __attribute__((aligned(CACHE_LINE_SIZE))) hot_cold_separated_t;

// Array of structures (AoS) vs. structure of arrays (SoA)
// AoS - traversing one field drags unused fields into cache
typedef struct {
    uint32_t x, y, z;
} point_aos_t;
point_aos_t points_aos[1000]; // Array of structures

// SoA - better cache locality for per-field traversals
typedef struct {
    uint32_t x[1000];
    uint32_t y[1000];
    uint32_t z[1000];
} points_soa_t;
points_soa_t points_soa; // Structure of arrays

Use padding to separate frequently accessed data and prevent false sharing.
// Cache line padding to prevent false sharing
typedef struct {
    uint32_t counter;
    char padding[CACHE_LINE_SIZE - sizeof(uint32_t)];
} __attribute__((aligned(CACHE_LINE_SIZE))) padded_counter_t;

- Loop Optimization: Optimize loops for cache-friendly access patterns
- Data Layout: Arrange data for sequential access
- Prefetching: Prefetch data before it's needed
- Blocking: Process data in cache-sized blocks
// Optimized sequential access
void optimized_sequential_access(uint32_t* data, size_t size) {
    // Process data in cache-line sized blocks
    const size_t block_size = CACHE_LINE_SIZE / sizeof(uint32_t);
    for (size_t i = 0; i < size; i += block_size) {
        size_t end = (i + block_size < size) ? i + block_size : size;
        // Process one block while its cache lines are resident
        for (size_t j = i; j < end; j++) {
            data[j] = process_data(data[j]);
        }
    }
}

// Optimized strided access
void optimized_strided_access(uint32_t* data, size_t size, size_t stride) {
    // Process one cache-sized region at a time so that the lines fetched
    // for one offset pass are reused by the following passes
    const size_t block_size = CACHE_LINE_SIZE / sizeof(uint32_t);
    for (size_t block_start = 0; block_start < size; block_start += block_size * stride) {
        size_t block_end = block_start + block_size * stride;
        if (block_end > size) block_end = size;
        for (size_t offset = 0; offset < stride; offset++) {
            for (size_t i = block_start + offset; i < block_end; i += stride) {
                data[i] = process_data(data[i]);
            }
        }
    }
}

False sharing occurs when two or more threads access different variables that happen to be on the same cache line, causing unnecessary cache invalidations.
False Sharing Example:
┌─────────────────────────────────────────────────────────────┐
│ Cache Line │
├─────────────┬─────────────┬─────────────┬───────────────────┤
│ Thread 1 │ Thread 2 │ Thread 3 │ Padding │
│ Counter │ Counter │ Counter │ │
└─────────────┴─────────────┴─────────────┴───────────────────┘
- Cache Line Padding: Add padding to separate variables
- Cache Line Alignment: Align structures to cache line boundaries
- Data Layout: Arrange data to minimize false sharing
- Thread-Local Storage: Use thread-local variables
// Prevent false sharing with padding
typedef struct {
    uint32_t counter;
    char padding[CACHE_LINE_SIZE - sizeof(uint32_t)];
} __attribute__((aligned(CACHE_LINE_SIZE))) padded_counter_t;

// Array of padded counters
padded_counter_t counters[NUM_THREADS];

Prefetching is a technique that loads data into cache before it's needed, reducing cache misses.
- Hardware Prefetching: Automatic prefetching by the CPU
- Software Prefetching: Explicit prefetching by the programmer
- Compiler Prefetching: Automatic prefetching by the compiler
// Software prefetching example
void prefetch_example(uint32_t* data, size_t size) {
    const size_t line_elems = CACHE_LINE_SIZE / sizeof(uint32_t);
    for (size_t i = 0; i < size; i++) {
        // Hint the next cache line into cache (read access, high locality)
        if (i + line_elems < size) {
            __builtin_prefetch(&data[i + line_elems], 0, 3);
        }
        // Process current data
        data[i] = process_data(data[i]);
    }
}

- DMA Operations: Before/after DMA transfers
- Multi-core Systems: When sharing data between cores
- I/O Operations: Before/after I/O operations
- Security: When clearing sensitive data
// Cache flush (clean) and invalidate for an address range.
// The per-line operation is platform specific: on Cortex-M7, CMSIS
// provides SCB_CleanDCache_by_Addr()/SCB_InvalidateDCache_by_Addr();
// on ARMv7-A these are the DCCMVAC/DCIMVAC system instructions.
extern void dcache_clean_line(void* line);      // platform hook
extern void dcache_invalidate_line(void* line); // platform hook

void cache_flush(void* addr, size_t size) {
    // Clean (write back) every cache line overlapping the range
    uintptr_t start = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
    uintptr_t end = ((uintptr_t)addr + size + CACHE_LINE_SIZE - 1) &
                    ~(uintptr_t)(CACHE_LINE_SIZE - 1);
    for (uintptr_t line = start; line < end; line += CACHE_LINE_SIZE) {
        dcache_clean_line((void*)line);
    }
}

void cache_invalidate(void* addr, size_t size) {
    // Invalidate (discard) every cache line overlapping the range
    uintptr_t start = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
    uintptr_t end = ((uintptr_t)addr + size + CACHE_LINE_SIZE - 1) &
                    ~(uintptr_t)(CACHE_LINE_SIZE - 1);
    for (uintptr_t line = start; line < end; line += CACHE_LINE_SIZE) {
        dcache_invalidate_line((void*)line);
    }
}

In multi-core systems, each core has its own cache, and cache coherency protocols ensure data consistency.
- MESI Protocol: Modified, Exclusive, Shared, Invalid
- MOESI Protocol: Modified, Owned, Exclusive, Shared, Invalid
- MSI Protocol: Modified, Shared, Invalid
- False Sharing Prevention: Use padding and alignment
- Cache-Aware Data Layout: Arrange data for minimal contention
- Thread Affinity: Bind threads to specific cores
- NUMA awareness: Consider NUMA architecture
// Multi-core cache-aware data structure: one cache line per core
typedef struct {
    uint32_t data[NUM_CORES][CACHE_LINE_SIZE / sizeof(uint32_t)];
} __attribute__((aligned(CACHE_LINE_SIZE))) cache_aligned_data_t;

- Cache Hit Rate: Percentage of cache hits
- Cache Miss Rate: Percentage of cache misses
- Cache Miss Types: Compulsory, capacity, conflict misses
- Memory Bandwidth: Data transfer rate
- Hardware Counters: CPU performance counters
- Cachegrind: Cache simulation tool
- perf: Linux performance analysis tool
- Intel VTune: Intel performance profiler
// Cache performance profiling
void profile_cache_performance(void) {
    // Start profiling (Clang builtin; on Cortex-M, DWT->CYCCNT is an
    // alternative cycle counter)
    uint64_t start_cycles = __builtin_readcyclecounter();
    // Perform cache-intensive operation
    cache_intensive_operation();
    // End profiling
    uint64_t end_cycles = __builtin_readcyclecounter();
    uint64_t cycles = end_cycles - start_cycles;
    printf("Operation took %llu cycles\n", (unsigned long long)cycles);
}

Problem: Data structures not aligned to cache lines
Solution: Use cache line alignment and padding
// Incorrect: not aligned
typedef struct {
    uint32_t a, b, c;
} unaligned_struct_t;

// Correct: aligned to a cache line
typedef struct {
    uint32_t a, b, c;
    char padding[CACHE_LINE_SIZE - 3 * sizeof(uint32_t)];
} __attribute__((aligned(CACHE_LINE_SIZE))) aligned_struct_t;

Problem: Multiple threads accessing different variables on the same cache line
Solution: Use padding and alignment
// Incorrect: potential false sharing
uint32_t counters[NUM_THREADS];

// Correct: padded to prevent false sharing
typedef struct {
    uint32_t counter;
    char padding[CACHE_LINE_SIZE - sizeof(uint32_t)];
} __attribute__((aligned(CACHE_LINE_SIZE))) padded_counter_t;
padded_counter_t counters[NUM_THREADS];

Problem: Random or strided access patterns
Solution: Optimize data layout and access patterns
// Incorrect: column-major traversal of a row-major array
for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
        data[j][i] = process(data[j][i]); // strides a full row per access
    }
}

// Correct: row-major traversal matches the memory layout
for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
        data[i][j] = process(data[i][j]); // sequential access
    }
}

Problem: Assuming the cache is larger than it actually is
Solution: Query the cache configuration and design accordingly
// Query cache configuration
cache_config_t config = get_cache_config();
printf("L1 Data Cache: %u KB\n", config.l1_data_size);
printf("Cache Line Size: %u bytes\n", config.cache_line_size);

- Align to cache lines: Use cache line alignment for frequently accessed data
- Separate hot and cold data: Keep frequently and rarely accessed data separate
- Use appropriate data structures: Choose structures for cache-friendly access
- Consider data layout: Arrange data for sequential access patterns
- Sequential access: Prefer sequential over random access
- Blocking: Process data in cache-sized blocks
- Prefetching: Use prefetching for predictable access patterns
- Stride optimization: Optimize strided access patterns
- False sharing prevention: Use padding and alignment
- Cache coherency: Understand cache coherency protocols
- Thread affinity: Bind threads to specific cores
- NUMA awareness: Consider NUMA architecture
- Profile regularly: Monitor cache performance
- Use appropriate tools: Use cache profiling tools
- Measure impact: Measure the impact of optimizations
- Iterate: Continuously improve cache performance
- Cache-aware algorithms: Design algorithms for cache efficiency
- Data locality: Keep related data close together
- Memory layout: Optimize memory layout for access patterns
- Compilation flags: Use appropriate compilation flags
- What is cache-aware programming?
- Technique to optimize code for CPU cache efficiency
- Focuses on spatial and temporal locality
- Reduces cache misses and improves performance
- What are the different types of cache misses?
- Compulsory misses: First access to data
- Capacity misses: Cache too small
- Conflict misses: Multiple data items map to same location
- Coherence misses: Cache invalidation in multi-core systems
- What is false sharing and how do you prevent it?
- Multiple threads accessing different variables on same cache line
- Causes unnecessary cache invalidations
- Prevent with padding and alignment
- How would you optimize a matrix multiplication for cache performance?
- Use blocking/tiling techniques
- Optimize memory access patterns
- Consider cache line alignment
- Use cache-aware data structures
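The blocking/tiling idea above can be sketched as follows (the tile size `BLOCK` is an illustrative tuning parameter, chosen so that three tiles fit in L1 on a typical target):

```c
#include <stddef.h>

#define BLOCK 32 // tile size; tune so three BLOCK x BLOCK tiles fit in L1

// Cache-blocked (tiled) matrix multiply: C += A * B for n x n matrices
// stored row-major. Tiling keeps the working set resident in cache
// while each tile of C is updated.
void matmul_blocked(const double* A, const double* B, double* C, size_t n) {
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t kk = 0; kk < n; kk += BLOCK)
            for (size_t jj = 0; jj < n; jj += BLOCK)
                for (size_t i = ii; i < ii + BLOCK && i < n; i++)
                    for (size_t k = kk; k < kk + BLOCK && k < n; k++) {
                        double a = A[i * n + k];
                        // Innermost loop walks B and C sequentially
                        for (size_t j = jj; j < jj + BLOCK && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

Note the `i, k, j` inner loop order: both `B` and `C` are then accessed row-by-row, so every loaded cache line is fully used.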
- How would you design a cache-efficient hash table?
- Use cache-line aligned buckets
- Optimize for sequential access
- Consider cache-friendly collision resolution
- Use appropriate data structures
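One way to apply the first two points is to make each bucket exactly one cache line, so a lookup touches a single line in the common case. A hedged sketch (the layout, hash constant, and fixed table size are illustrative; a real table would handle bucket overflow and resizing):

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64
#define NUM_BUCKETS 128    // power of two (illustrative)
#define SLOTS_PER_BUCKET 7 // 7 key/value slots plus a bitmap fit in one line

// Bucketized hash table: each bucket occupies one cache line
typedef struct {
    uint8_t  used;                  // bitmap of occupied slots
    uint32_t keys[SLOTS_PER_BUCKET];
    uint32_t values[SLOTS_PER_BUCKET];
    // 1 + 28 + 28 = 57 used bytes; alignment pads the rest of the line
} __attribute__((aligned(CACHE_LINE_SIZE))) bucket_t;

typedef struct { bucket_t buckets[NUM_BUCKETS]; } line_table_t;

static uint32_t hash32(uint32_t k) { return k * 2654435761u; } // Fibonacci hash

bool lt_put(line_table_t* t, uint32_t key, uint32_t value) {
    bucket_t* b = &t->buckets[hash32(key) & (NUM_BUCKETS - 1)];
    for (int i = 0; i < SLOTS_PER_BUCKET; i++) {
        if (!(b->used & (1u << i)) || b->keys[i] == key) {
            b->keys[i] = key;
            b->values[i] = value;
            b->used |= 1u << i;
            return true;
        }
    }
    return false; // bucket full (a real table would resize or chain)
}

bool lt_get(const line_table_t* t, uint32_t key, uint32_t* out) {
    const bucket_t* b = &t->buckets[hash32(key) & (NUM_BUCKETS - 1)];
    for (int i = 0; i < SLOTS_PER_BUCKET; i++) {
        if ((b->used & (1u << i)) && b->keys[i] == key) {
            *out = b->values[i];
            return true;
        }
    }
    return false;
}
```

Scanning seven slots linearly within one line is cheap; the expensive operation is touching a second line, which this layout avoids until a bucket overflows.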
- How would you profile cache performance in an embedded system?
- Use hardware performance counters
- Implement cache simulation
- Monitor cache hit/miss rates
- Use profiling tools
- Write a cache-efficient matrix multiplication algorithm
- Implement a cache-aware hash table
- Design a cache-friendly data structure for multi-threaded access
- Write code to detect and prevent false sharing
- "Computer Architecture: A Quantitative Approach" by Hennessy and Patterson
- "The Art of Computer Programming, Volume 1" by Donald Knuth
- "High Performance Computing" by Kevin Dowd
- Cachegrind: Cache simulation and profiling
- perf: Linux performance analysis
- Intel VTune: Intel performance profiler
- Valgrind: Memory and cache profiling
- C11: C standard; `_Alignas` gives portable alignment control
- C++11: C++ standard; `alignas` plus a standard threads and atomics model
- POSIX: Portable Operating System Interface
Next Steps: Explore Memory Management to understand memory allocation strategies, or dive into Performance Optimization for broader performance techniques.