This is a small microbenchmark that uses a loop body plus a "large array access" so that demand accesses to the cache hierarchy and DRAM can be observed in a controllable way.
The main goals are:
- To roughly understand L1/L2/DRAM miss frequency (MPKI/PKI).
- To switch access patterns (dense / strided) and see how the following change:
  - demand accesses that reach DRAM, and
  - L1/L2 hit rates.
- In the future, to use this as the basis for trace generation for wrong-path prefetch / loop-tail experiments.
A Python script that wraps `perf stat` automatically computes MPKI and DRAM fill PKI.
- `benchmark.c`
  - Loop body that accesses a small array `A` (which blows out L1) and a large array `B`.
- `scripts/run_perf_mpki.py`
  - Wrapper that calls `perf stat`, then computes and prints L1/L2 miss rates, MPKI, DRAM fill PKI, and IPC.
You can just use gcc:

```
gcc -O3 -march=native -Wall -o benchmark benchmark.c
```

Usage:

```
./benchmark A_bytes B_bytes chunk_bytes [access_mode] [stride_elems] [outer_scale]
```
- `A_bytes`
  - Size of the small array `A` in bytes.
  - On every outer iteration, all of `A` is scanned to blow out L1.
- `B_bytes`
  - Logical size of `B` in bytes.
  - Divided by `chunk_bytes`, this determines the outer loop count.
- `chunk_bytes`
  - Size (in bytes) used to cut `B` into "chunks".
  - `B_bytes` must be a multiple of `chunk_bytes`.
- `access_mode` (optional, default 0)
  - `0` = dense (contiguous) access
  - `1` = strided access
- `stride_elems` (optional, default 8)
  - Stride when `access_mode=1`, in units of `double` elements.
  - Even when `access_mode=0`, it still affects the allocation size of array `B` (see below).
- `outer_scale` (optional, default 1)
  - How many times to run the entire loop; a knob that scales the effective outer loop count.
The main parameters in the code are:

- `A_elems = A_bytes / sizeof(double)`
- `B_elems = B_bytes / sizeof(double)` … logical number of elements in `B`
- `chunk_elems = chunk_bytes / sizeof(double)`

Constraints:

- `A_elems > 0`
- `B_elems > 0`
- `chunk_elems > 0`
- `B_elems % chunk_elems == 0`
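As a minimal sketch of the derivation and checks (function and variable names here are assumptions, not necessarily those in `benchmark.c`):

```c
#include <stdio.h>
#include <stdlib.h>

/* Sketch: derive element counts from the byte-size arguments and
 * validate the constraints listed above. Returns 0 on success. */
static int derive_params(size_t A_bytes, size_t B_bytes, size_t chunk_bytes,
                         size_t *A_elems, size_t *B_elems, size_t *chunk_elems)
{
    *A_elems     = A_bytes / sizeof(double);
    *B_elems     = B_bytes / sizeof(double);
    *chunk_elems = chunk_bytes / sizeof(double);

    if (*A_elems == 0 || *B_elems == 0 || *chunk_elems == 0) {
        fprintf(stderr, "all sizes must be at least sizeof(double) bytes\n");
        return -1;
    }
    if (*B_elems % *chunk_elems != 0) {
        fprintf(stderr, "B_bytes must be a multiple of chunk_bytes\n");
        return -1;
    }
    return 0;
}
```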
When the logical `B` is divided into chunks:

`base_outer_iters = B_elems / chunk_elems`

Then everything is scaled by `outer_scale`:

`outer_iters = base_outer_iters * outer_scale`
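For example, with the recommended values below (`B_bytes = 512*1024*1024`, `chunk_bytes = 512*1024`): `B_elems = 67108864`, `chunk_elems = 65536`, so `base_outer_iters = 1024`; with `outer_scale = 100`, the kernel runs `outer_iters = 102400` outer iterations.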
The actual loops in the kernel are:

```c
for (outer = 0; outer < outer_iters; outer++) {
    // Sweep the entire A to blow out L1
    for (i = 0; i < A_elems; i++) { ... }

    size_t base = outer * chunk_elems * stride_elems;

    // Always loop exactly chunk_elems times
    for (j = 0; j < chunk_elems; j++) {
        size_t idx = base + j * stride_elems;
        sum += B[idx];
    }
}
```
- When `access_mode == 0` (dense):
  - `stride_elems` is effectively fixed to 1.
  - However, `user_stride` still affects the allocation size of `B`.
- When `access_mode == 1` (strided):
  - `stride_elems = user_stride` is used as is.
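In code form, the mode selection is just the following (a sketch; `access_mode` and `user_stride` are assumed names for the parsed command-line values):

```c
/* Sketch: dense mode forces a stride of 1; strided mode uses the
 * command-line stride_elems value (user_stride) as is. */
size_t stride_elems = (access_mode == 0) ? 1 : user_stride;
```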
In summary:

- The logical `B` size (`B_bytes`) and `chunk_bytes` determine the outer and inner loop counts.
- Changing only `stride_elems` means:
  - the outer and inner loop counts stay the same,
  - the number of executed instructions is almost the same, and
  - only the access pattern and working set change.
The actual number of elements allocated for `B` is:

`B_elems_alloc = B_elems * user_stride * outer_scale;`

- In the dense case (`access_mode=0`):
  - Access uses `stride_elems = 1`.
  - But we allocate with slack, multiplied by `user_stride` and `outer_scale`.
- In the strided case (`access_mode=1`):
  - `stride_elems = user_stride`.
  - The required size is `B_elems * user_stride * outer_scale`, which exactly matches `B_elems_alloc`.
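A minimal allocation sketch under these rules (again, names are assumptions):

```c
#include <stdio.h>
#include <stdlib.h>

/* Sketch: allocate B with enough slack to cover the strided case.
 * Over-allocating in the dense case keeps the rest of the code
 * identical across both access modes. */
static double *alloc_B(size_t B_elems, size_t user_stride, size_t outer_scale)
{
    size_t B_elems_alloc = B_elems * user_stride * outer_scale;
    double *B = malloc(B_elems_alloc * sizeof(double));
    if (B == NULL) {
        perror("malloc");
        exit(EXIT_FAILURE);
    }
    return B;
}
```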
The reasons for this design are:

- Keep the same loop counts and the same function.
- Change only `stride_elems`, so that:
  - the number of instructions is almost unchanged, and
  - only the memory access pattern and the rate of reaching DRAM change, which makes comparison easier.
- `A_bytes`
  - Slightly larger than the core's L1D.
  - Example: if L1D = 32 KiB, use `A_bytes = 32*1024`.
- `B_bytes`
  - Much larger than the LLC, so that you reach DRAM often enough.
  - Example: `B_bytes = 512*1024*1024` (512 MiB).
- `chunk_bytes`
  - Around the L2 size, or a subset of it.
  - Example: `chunk_bytes = 512*1024` (512 KiB).
- `access_mode` / `stride_elems`
  - Dense: `access_mode = 0`; `stride_elems` can be anything (it only affects the allocation size).
  - Strided: `access_mode = 1`, with for example:
    - `stride_elems = 8` (8 doubles = 64 bytes, i.e. roughly one access per cache line)
    - `stride_elems = 16`, etc.
- `outer_scale`
  - Use a larger value when you want to emphasize the IPC / MPKI of the loop body.
  - Example: `outer_scale = 100`.
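Putting these together, a strided run with the example values above would be `./benchmark 32768 536870912 524288 1 8 100` (the script examples below use `stride_elems = 16`).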
The wrapper script `scripts/run_perf_mpki.py`:

- Calls `perf stat -x,` to collect CSV output.
- Parses the following counters:
  - `cycles`
  - `instructions`
  - `L1-dcache-loads`
  - `L1-dcache-load-misses`
  - `l2_cache_accesses_from_dc_misses`
  - `l2_cache_misses_from_dc_misses`
  - DRAM-related events (these differ by node)
- From these, computes and displays:
  - L1 miss rate
  - L2 miss rate (using L1D misses as the denominator)
  - L1 / L2 MPKI
  - DRAM fill PKI (demand fills per 1K instructions)
  - IPC
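In formula form, the metrics follow the definitions above (a sketch; the script's exact handling, e.g. of multiplexed counters, may differ):

- `IPC = instructions / cycles`
- `L1 miss rate = L1-dcache-load-misses / L1-dcache-loads`
- `L2 miss rate = l2_cache_misses_from_dc_misses / L1-dcache-load-misses`
- `L1 MPKI = L1-dcache-load-misses / (instructions / 1000)`
- `L2 MPKI = l2_cache_misses_from_dc_misses / (instructions / 1000)`
- `DRAM fill PKI = dram_fills / (instructions / 1000)`, where `dram_fills` is the node-specific sum described below.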
Format:

```
./scripts/run_perf_mpki.py [--node demeter|artemis] <binary> [binary-args...]
```

If you omit `--node`, it runs in demeter mode.

Example (demeter):

```
./scripts/run_perf_mpki.py ./benchmark 32768 536870912 524288 1 16 100
```

In this mode, DRAM fills are counted as the sum of:

- `ls_refills_from_sys.ls_mabresp_lcl_dram`
- `ls_refills_from_sys.ls_mabresp_rmt_dram`

Example (artemis):

```
./scripts/run_perf_mpki.py --node artemis ./benchmark 32768 536870912 524288 1 16 100
```

In this mode, DRAM fills are treated as:

- `ls_dmnd_fills_from_sys.mem_io_local` as the local part, and
- remote treated as 0.
The script prints:

- Raw stderr from `perf` (CSV)
- The parsed counter list
- Miss rates and IPC
- MPKI / PKI
Example:

```
=== perf raw output (stderr) ===
... (perf output as is) ...

=== Parsed counters ===
node : demeter
cycles : 1921158234
instructions : 4649560413
L1-dcache-loads : 78971700
L1-dcache-load-misses : 134840330
l2_cache_accesses_from_dc_misses : 134891535
l2_cache_misses_from_dc_misses : 13975630
Demand DRAM fills (L1D): 57480933 (local=57480933, remote=0)

=== Rates / IPC ===
L1 miss rate : 170.75 %
L2 miss rate (on L1D misses): 10.36 %
IPC : 2.420

=== Per-1K-instruction metrics (MPKI/PKI) ===
L1 MPKI : 28.954
L2 MPKI : 3.029
Demand DRAM fills (L1D) PKI : 12.464
```
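As a sanity check, the rates follow directly from the parsed counters: `IPC = 4649560413 / 1921158234 ≈ 2.420`, `L1 miss rate = 134840330 / 78971700 ≈ 170.75 %` (the loads counter is smaller than the misses counter here, which is why the rate can exceed 100 %), and `L2 miss rate = 13975630 / 134840330 ≈ 10.36 %`.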
Here:

- L1 MPKI / L2 MPKI
  - L1/L2 misses per 1000 instructions.
- DRAM fill PKI
  - The sum of the `ls_refills_from_sys.*` or `ls_dmnd_fills_from_sys.*` counts, interpreted as "demand DRAM fill count per 1000 instructions".

What to expect:

- L1 MPKI / L2 MPKI
  - Normally increase when you increase the stride.
  - When you increase `outer_scale`, the effect of initialization is reduced and the miss characteristics of the loop body show up more clearly.
- DRAM fill PKI
  - A rough indicator of "how much this loop is hitting DRAM".
  - For wrong-path prefetch evaluation:
    - the absolute number of DRAM fills may not change much, but
    - you want to see how the ratio of demand vs. prefetch fills changes; this benchmark serves as a baseline for that.
Caveats:

- This is a very simple streaming-style microbenchmark.
  - It hardly includes factors that matter in real OSes or applications, such as TLB behavior, snoops, coherence, and virtualization.
- `perf` event names differ by CPU generation and kernel, so:
  - when you use a new node, check `perf list` first, and
  - assume you may need to extend the event definitions in `run_perf_mpki.py`.
- Depending on the value of `perf_event_paranoid`, the kernel may restrict perf and you may not be able to use it at all.
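  - You can check the current setting with `cat /proc/sys/kernel/perf_event_paranoid`; lowering it (for example `sudo sysctl kernel.perf_event_paranoid=1`) requires root.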
If you use this README as a base, it should make it much easier later to:

- check your assumptions when you wonder "are these numbers strange?", and
- do the porting work when you bring this to a new node.

If needed, you can also add a table of recommended parameter sets (dense/strided × `outer_scale`) on top of this.