

Smarter CPU Testing – How to Benchmark Kaby Lake & Haswell Memory Latency

Published September 2nd

This article originally appeared on Medium.

Modern CPU Basics: Cache Hierarchy

Intel Kaby Lake Cache Hierarchy and Access Latency. Source: Intel SDM

L1 cache hit latency:   5 cycles / 2.5 GHz = 2 ns
L2 cache hit latency:  12 cycles / 2.5 GHz = 4.8 ns
L3 cache hit latency:  42 cycles / 2.5 GHz = 16.8 ns
Memory access latency: L3 cache latency + DRAM latency = ~60-100 ns
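Each line above divides a latency in cycles by the clock frequency. At 2.5 GHz the core completes 2.5 cycles per nanosecond, so 12 cycles take 12 / 2.5 = 4.8 ns. A small C++ helper that reproduces the conversion is shown below; the function names are illustrative and do not come from the original article.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

// Convert a latency in CPU cycles to nanoseconds (illustrative helper).
// A clock of f GHz completes f cycles per nanosecond, so ns = cycles / GHz.
constexpr double cycles_to_ns(double cycles, double clock_ghz) {
  return cycles / clock_ghz;
}

// Print one line in the same "cycles / GHz = ns" form as the list above.
void print_latency(const char *level, double cycles, double clock_ghz) {
  std::printf("%s cache hit latency: %.0f cycles / %.1f GHz = %.2f ns\n",
              level, cycles, clock_ghz, cycles_to_ns(cycles, clock_ghz));
}
```

For example, `print_latency("L2", 12, 2.5);` prints `L2 cache hit latency: 12 cycles / 2.5 GHz = 4.80 ns`. The same division at 2.6 GHz gives the Haswell figures listed further down, such as 34 / 2.6 ≈ 13.08 ns.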

Benchmarking Intel Kaby Lake Memory Latency

List Nodes Padding

Adjacent Nodes Share the Same Cache Line
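#include <cstddef>
// Hypothetical helper literal: 4_KB == 4 * 1024 bytes
constexpr std::size_t operator""_KB(unsigned long long kb) { return kb * 1024; }
// Memory page size. Default page size is 4 KB
constexpr auto kPageSize = 4_KB;
// Pad each node by a full page so adjacent nodes never share a cache line
struct ListNode {
  ListNode *next;
  std::byte padding[kPageSize];
};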
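The sketch below checks that the padding keeps adjacent nodes apart. It rebuilds the node layout with a plain constant instead of the `4_KB` literal, then measures the distance between two neighbouring nodes in a `std::vector`. The helper name `adjacent_node_distance` is hypothetical, and the sketch assumes 4 KB pages and 64-byte cache lines.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed sizes: 4 KB pages and 64-byte cache lines (x86-64 defaults).
constexpr std::size_t kPageSize = 4 * 1024;
constexpr std::size_t kCachelineSize = 64;

// Same layout as above: a next pointer followed by a page of padding.
struct ListNode {
  ListNode *next;
  std::byte padding[kPageSize];
};

// The padding makes every node larger than a page, so two consecutive
// next pointers can never fall into the same cache line.
static_assert(sizeof(ListNode) > kPageSize, "node must span more than a page");

// Hypothetical helper: byte distance between the next fields of the first
// two elements of a vector (vector elements are stored contiguously).
std::size_t adjacent_node_distance(const std::vector<ListNode> &list) {
  assert(list.size() >= 2);
  auto a = reinterpret_cast<std::uintptr_t>(&list[0].next);
  auto b = reinterpret_cast<std::uintptr_t>(&list[1].next);
  return b - a;
}
```

On a 64-bit build, `sizeof(ListNode)` is 8 + 4096 = 4104 bytes. Consecutive `next` pointers are therefore 4104 bytes apart, far more than one 64-byte cache line, and they always sit on different 4 KB pages.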

Working Set Size vs Number of Nodes

// Cache line size: 64 bytes on x86-64; some AArch64 cores use 128 bytes
const auto kCachelineSize = 64_B;
// Each memory access fetches a whole cache line, so the number of list nodes
// is the working set size divided by the cache line size
const auto num_nodes = mem_block_size / kCachelineSize;
// Allocate a contiguous list of nodes for an iteration
std::vector<ListNode> list(num_nodes);
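Each node is reached through exactly one cache line, so the node count for a given working set is its size divided by 64 bytes. The sketch below reproduces the sweep used in the results tables, doubling the size from 1 KB up to a maximum. The names `nodes_for`, `sweep_kb` and `print_sweep` are hypothetical, and a 64-byte cache line is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

constexpr std::size_t kCachelineSize = 64;  // bytes, assumed x86-64

// One touched cache line per node: number of nodes for a working set size.
constexpr std::size_t nodes_for(std::size_t mem_block_size) {
  return mem_block_size / kCachelineSize;
}

// Working set sizes in KB from 1 up to max_kb, doubling at each step.
std::vector<std::size_t> sweep_kb(std::size_t max_kb) {
  std::vector<std::size_t> sizes;
  for (std::size_t kb = 1; kb <= max_kb; kb *= 2) sizes.push_back(kb);
  return sizes;
}

// Print the size/node pairs of the sweep, one per line.
void print_sweep(std::size_t max_kb) {
  for (std::size_t kb : sweep_kb(max_kb))
    std::printf("size KB:%-6zu nodes: %zu\n", kb, nodes_for(kb * 1024));
}
```

`print_sweep(16384)` prints 15 rows, from 16 nodes at 1 KB to 262144 (256k) nodes at 16384 KB, which matches the Nodes column of the results. Each node is padded to more than a page, so the largest list touches only 16 MB of cache lines but occupies roughly 1 GB of virtual memory.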

The List Traversal Benchmark

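// Make a cycle of the list nodes: each node points to the next one,
// and the last node points back to the first
for (size_t i = 0; i < list.size() - 1; i++)
  list[i].next = &list[i + 1];
list[list.size() - 1].next = &list[0];

// Follow the pointers num_ops times. Each load depends on the previous one,
// so the CPU cannot overlap the accesses and the loop measures latency
static auto traverse_list(ListNode *head, size_t num_ops) {
  while (num_ops--) head = head->next;
  return head;
}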
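The `memory_latency_list/size KB:` names in the results look like Google Benchmark output. As a self-contained stand-in, the sketch below uses `std::chrono::steady_clock` for timing instead. It builds the cycle, runs one warm-up lap, then times `traverse_list` and divides by the number of hops. The names `make_cycle` and `measure_latency_ns` are hypothetical. The `volatile` sink is a simple guard against the compiler discarding the traversal; a real harness such as Google Benchmark's `benchmark::DoNotOptimize` is more robust.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

constexpr std::size_t kPageSize = 4 * 1024;  // assumed 4 KB pages
constexpr std::size_t kCachelineSize = 64;   // assumed 64-byte cache lines

struct ListNode {
  ListNode *next;
  std::byte padding[kPageSize];
};

// Hypothetical helper: node i points to node i + 1, the last to the first.
void make_cycle(std::vector<ListNode> &list) {
  assert(!list.empty());
  for (std::size_t i = 0; i + 1 < list.size(); i++) list[i].next = &list[i + 1];
  list.back().next = &list.front();
}

// Each hop depends on the previous load, so the hops cannot overlap.
static ListNode *traverse_list(ListNode *head, std::size_t num_ops) {
  while (num_ops--) head = head->next;
  return head;
}

// Hypothetical helper: average time per node access in nanoseconds for a
// working set of mem_block_size bytes of touched cache lines.
double measure_latency_ns(std::size_t mem_block_size, std::size_t num_ops) {
  std::vector<ListNode> list(mem_block_size / kCachelineSize);
  make_cycle(list);

  // Store results in a volatile so the traversals are not optimized away.
  ListNode *volatile sink = traverse_list(&list[0], list.size());  // warm-up lap
  const auto start = std::chrono::steady_clock::now();
  sink = traverse_list(&list[0], num_ops);
  const auto stop = std::chrono::steady_clock::now();
  (void)sink;

  const std::chrono::duration<double, std::nano> elapsed = stop - start;
  return elapsed.count() / static_cast<double>(num_ops);
}
```

Calling `measure_latency_ns(kb * 1024, 1000000)` for each size in the sweep yields one time per row. The absolute numbers will vary with the CPU, its clock speed and system noise.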

Intel Kaby Lake Memory Latency Results

-----------------------------------------------------------
Benchmark                                   Time      Nodes
-----------------------------------------------------------
memory_latency_list/size KB:1            1.01 ns         16
memory_latency_list/size KB:2            1.02 ns         32
memory_latency_list/size KB:4            1.10 ns         64
memory_latency_list/size KB:8            3.32 ns        128
memory_latency_list/size KB:16           3.32 ns        256
memory_latency_list/size KB:32           3.33 ns        512
memory_latency_list/size KB:64           5.33 ns       1024
memory_latency_list/size KB:128          8.58 ns         2k
memory_latency_list/size KB:256          13.9 ns         4k
memory_latency_list/size KB:512          14.2 ns         8k
memory_latency_list/size KB:1024         14.2 ns        16k
memory_latency_list/size KB:2048         17.3 ns        32k
memory_latency_list/size KB:4096         63.1 ns        64k
memory_latency_list/size KB:8192         96.4 ns       128k
memory_latency_list/size KB:16384         104 ns       256k
Intel Kaby Lake List Node Access Latency

Benchmarking Intel Haswell Memory Latency

L1 cache hit latency:   5 cycles / 2.6 GHz = 1.92 ns
L2 cache hit latency:  11 cycles / 2.6 GHz = 4.23 ns
L3 cache hit latency:  34 cycles / 2.6 GHz = 13.08 ns
Memory access latency: L3 cache latency + DRAM latency = ~60-100 ns
-----------------------------------------------------------
Benchmark                                   Time      Nodes
-----------------------------------------------------------
memory_latency_list/size KB:1            1.30 ns         16
memory_latency_list/size KB:2            1.29 ns         32
memory_latency_list/size KB:4            1.30 ns         64
memory_latency_list/size KB:8            3.92 ns        128
memory_latency_list/size KB:16           4.03 ns        256
memory_latency_list/size KB:32           3.90 ns        512
memory_latency_list/size KB:64           6.54 ns       1024
memory_latency_list/size KB:128          9.71 ns         2k
memory_latency_list/size KB:256          17.4 ns         4k
memory_latency_list/size KB:512          21.2 ns         8k
memory_latency_list/size KB:1024         21.2 ns        16k
memory_latency_list/size KB:2048         28.9 ns        32k
memory_latency_list/size KB:3072         43.3 ns        48k
memory_latency_list/size KB:4096         52.0 ns        64k
memory_latency_list/size KB:8192         71.8 ns       128k
memory_latency_list/size KB:16384        80.4 ns       256k
Intel Haswell List Node Access Latency

Analyzing the Memory Latency Benchmark

Intel Kaby Lake Level 1 Cache Latency Analysis

------------------------------------------------------
Benchmark                              Time      Nodes
------------------------------------------------------
memory_latency_list/size KB:1       1.02 ns         16
memory_latency_list/size KB:2       1.03 ns         32
memory_latency_list/size KB:3       1.03 ns         48
memory_latency_list/size KB:4       1.05 ns         64
memory_latency_list/size KB:5       3.33 ns         80
memory_latency_list/size KB:6       3.33 ns         96
memory_latency_list/size KB:7       3.33 ns        112
memory_latency_list/size KB:8       3.35 ns        128
Intel Kaby Lake Level 1 Cache Access Latency Jump

Modern CPU Basics: Virtual and Physical Memory

Process Virtual Address Space and Physical Memory

Level 1 Instruction Cache TLB (ITLB) for 4 KB pages: 128 entries
Level 1 Data Cache TLB (DTLB) for 4 KB pages:        64 entries
Level 1 Data Cache TLB (DTLB) for 2/4 MB pages:      32 entries
Level 1 Data Cache TLB (DTLB) for 1 GB pages:        4 entries
Level 2 Unified TLB (STLB) for 4 KB and 2 MB pages:  1536 entries
Level 2 Unified TLB (STLB) for 1 GB pages:           16 entries
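Address translation works at page granularity. The TLB caches one translation per virtual page, and for 4 KB pages the page number is the virtual address divided by 4096; the low 12 bits are the offset within the page. The sketch below counts how many distinct 4 KB pages the padded benchmark list touches. `page_number` and `pages_touched` are illustrative names, and 4 KB pages are assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

constexpr std::size_t kPageSize = 4 * 1024;  // assumed 4 KB pages

struct ListNode {
  ListNode *next;
  std::byte padding[kPageSize];
};

// Virtual page number: the address with its 12-bit page offset removed.
// The TLB holds one translation per page number.
constexpr std::uintptr_t page_number(std::uintptr_t addr) { return addr / kPageSize; }

// Hypothetical helper: number of distinct 4 KB pages holding the next
// pointers, i.e. the number of translations a full traversal needs.
std::size_t pages_touched(const std::vector<ListNode> &list) {
  std::set<std::uintptr_t> pages;
  for (const ListNode &node : list)
    pages.insert(page_number(reinterpret_cast<std::uintptr_t>(&node.next)));
  return pages.size();
}
```

Each node is 4104 bytes on a 64-bit build, so every `next` pointer lands on a different page and `pages_touched` returns the node count. A 64-node (4 KB) list therefore needs exactly as many translations as there are Level 1 DTLB entries for 4 KB pages, while an 80-node (5 KB) list needs more.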

Level 1 Cache Latency Analysis (continued)

Intel Kaby Lake Level 1 Cache Access Latency Jump
Level 1 Data Cache TLB (DTLB) for 4 KB pages:        64 entries
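Every padded node occupies its own 4 KB page. A traversal of N nodes therefore needs N address translations but touches only N cache lines (N × 64 bytes). A TLB with E entries can hold all the translations only up to a working set of about E × 64 bytes. The compile-time arithmetic below uses a hypothetical helper, `max_working_set_bytes`, and assumes 64-byte cache lines. It shows that the 64-entry DTLB runs out at 4 KB, which is consistent with the measured jump from 1.05 ns at 4 KB to 3.33 ns at 5 KB. Other data pages in use, such as the stack, also compete for entries, so the boundary is approximate.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kCachelineSize = 64;  // assumed bytes per touched line

// Hypothetical helper: one TLB entry per padded node, one cache line per node,
// so a TLB with `tlb_entries` entries covers this many bytes of working set.
constexpr std::size_t max_working_set_bytes(std::size_t tlb_entries) {
  return tlb_entries * kCachelineSize;
}

// Level 1 DTLB, 4 KB pages: 64 entries -> 64 nodes -> 4 KB working set.
static_assert(max_working_set_bytes(64) == 4 * 1024, "DTLB boundary");
// Level 2 STLB, 4 KB and 2 MB pages: 1536 entries -> 96 KB working set.
static_assert(max_working_set_bytes(1536) == 96 * 1024, "STLB boundary");
```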

Intel Kaby Lake Level 2 Cache Latency Analysis

Intel Kaby Lake Level 2 Cache Latency Jump
--------------------------------------------------------
Benchmark                                Time      Nodes
--------------------------------------------------------
memory_latency_list/size KB:48        4.72 ns        768
memory_latency_list/size KB:64        5.42 ns       1024
memory_latency_list/size KB:80        5.34 ns      1.25k
memory_latency_list/size KB:96        5.80 ns       1.5k
memory_latency_list/size KB:112       7.97 ns      1.75k
memory_latency_list/size KB:128       8.34 ns         2k
memory_latency_list/size KB:144       8.36 ns      2.25k
memory_latency_list/size KB:160       8.82 ns       2.5k

Level 2 Unified TLB (STLB) for 4 KB and 2 MB pages:  1536 entries
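With 1536 STLB entries and one padded node per 4 KB page, all translations fit up to 1536 nodes (96 KB). The scan below runs over the rows of the table above, with the values copied verbatim. It finds the largest step-to-step latency increase and checks that this step spans 1536 nodes. `Sample` and `largest_jump` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One row of the benchmark output: number of list nodes and time per access.
struct Sample {
  std::size_t nodes;
  double ns;
};

// Hypothetical helper: index of the row whose latency grew the most,
// relative to the row before it.
std::size_t largest_jump(const std::vector<Sample> &rows) {
  assert(rows.size() >= 2);
  std::size_t best = 1;
  for (std::size_t i = 1; i < rows.size(); i++)
    if (rows[i].ns / rows[i - 1].ns > rows[best].ns / rows[best - 1].ns) best = i;
  return best;
}

// Kaby Lake rows from the Level 2 table above (48 KB to 160 KB).
const std::vector<Sample> kKabyLakeL2 = {
    {768, 4.72},  {1024, 5.42}, {1280, 5.34}, {1536, 5.80},
    {1792, 7.97}, {2048, 8.34}, {2304, 8.36}, {2560, 8.82}};
```

The largest step is from 5.80 ns at 1536 nodes (96 KB) to 7.97 ns at 1792 nodes (112 KB), so the jump appears just after the list outgrows the 1536 STLB entries.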

Intel Kaby Lake Level 3 Cache Latency Analysis

Intel Kaby Lake Level 3 Cache Latency Jump

What About Cache Prefetching?
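Hardware prefetchers try to predict the next address from past accesses. A list laid out in order, with a constant stride between nodes, is the kind of pattern a stride prefetcher may be able to follow. A common way to keep prefetching out of a latency measurement is to link the nodes in a random order; this is a general technique, not necessarily the one this article uses. Below is a minimal sketch with a hypothetical helper, `make_random_cycle`, reusing the padded `ListNode` layout with an assumed 4 KB page size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

constexpr std::size_t kPageSize = 4 * 1024;  // assumed 4 KB pages

struct ListNode {
  ListNode *next;
  std::byte padding[kPageSize];
};

// Hypothetical helper: link all nodes into a single cycle in a random order.
// Node 0 stays the entry point; every other node appears exactly once.
void make_random_cycle(std::vector<ListNode> &list, std::uint32_t seed) {
  assert(!list.empty());
  std::vector<std::size_t> order(list.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::mt19937 rng(seed);
  std::shuffle(order.begin() + 1, order.end(), rng);
  for (std::size_t i = 0; i + 1 < order.size(); i++)
    list[order[i]].next = &list[order[i + 1]];
  list[order.back()].next = &list[order.front()];
}
```

Each node still occupies its own page and its own cache line, so the working set is unchanged. Only the order in which the nodes are visited becomes unpredictable.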

How to Avoid TLB Misses
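One standard way to reduce TLB misses is to use larger pages. According to the TLB list above, the Level 1 DTLB holds 32 entries for 2/4 MB pages, which covers 32 × 2 MB = 64 MB. The 64 entries for 4 KB pages cover only 64 × 4 KB = 256 KB. The sketch below is Linux-specific and assumes transparent huge pages are available. It allocates 2 MB-aligned memory with `std::aligned_alloc` and hints the kernel with `madvise(MADV_HUGEPAGE)`. `alloc_huge` is a hypothetical helper, and the kernel may ignore the hint, in which case the memory stays on 4 KB pages.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>
#include <sys/mman.h>

constexpr std::size_t kHugePageSize = 2 * 1024 * 1024;  // assumed 2 MB huge pages

// Hypothetical helper: allocate at least `bytes` bytes aligned to a huge-page
// boundary and ask the kernel to back them with transparent huge pages.
// Release the memory with std::free().
void *alloc_huge(std::size_t bytes) {
  assert(bytes > 0);
  // std::aligned_alloc requires the size to be a multiple of the alignment.
  const std::size_t rounded =
      (bytes + kHugePageSize - 1) / kHugePageSize * kHugePageSize;
  void *p = std::aligned_alloc(kHugePageSize, rounded);
  if (p == nullptr) throw std::bad_alloc();
#ifdef MADV_HUGEPAGE
  // Only a hint: the result is ignored because THP may be disabled.
  (void)madvise(p, rounded, MADV_HUGEPAGE);
#endif
  return p;
}
```

Nodes placed in such a buffer, for example with placement `new`, would share a few 2 MB translations instead of needing one 4 KB translation each. The `AnonHugePages` field in `/proc/self/smaps` shows whether the kernel actually used huge pages.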

Key Takeaways