A Hands-On Guide to Uncore Performance Monitors: Measuring Throughput and Histogram-Based Latency

Measuring Real I/O Performance

When you need to know how fast data actually moves through PCIe® links, fabrics, or DMA engines, CPU IPC/CPI isn’t the right lens; those are core metrics. Real I/O bandwidth (BW) and latency live in the uncore (I/O controllers, fabrics, memory subsystem). Silicon vendors instrument these blocks with performance counters that software can read in real time. For example, Intel’s IIO (Integrated I/O) uncore PMU is documented for analyzing PCIe controller traffic from software, while NVIDIA Jetson platforms expose HSIO/UCF uncore PMUs (readable via Linux perf) to measure bus/I/O events directly in software.

1) Throughput from Uncore PMON/PMU Counters 

What the hardware exposes (registers you read) 

Most I/O PMON blocks provide: 

  • Cycle accounting (all in the PMON clock domain):
      • active_cnt: cycles where data moves forward without obstruction
      • busy_cnt: cycles where data is ready but blocked by downstream flow control
      • idle_cnt: cycles where the interface is completely inactive

 

The window time base in cycles is: 

  Δcycles = Δactive_cnt + Δbusy_cnt + Δidle_cnt
 

Data movement: 

  • byte_cnt: payload bytes observed by the path (preferred; handles variable sizes & protocol effects) 
  • pkt_cnt: number of packets/transactions (use only when packet size is fixed)
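
To make the later sampling discussion concrete, the sketches below gather these counters into a simple per-sample structure. This is an illustrative assumption, not a vendor-defined layout; field names mirror the list above and the widths are generic.

    #include <stdint.h>

    /* One snapshot of the throughput-related PMON counters listed above.
     * Actual register widths are implementation specific; 64-bit fields
     * are wide enough to hold any of them. */
    struct pmon_bw_sample {
        uint64_t active_cnt;   /* cycles moving data                     */
        uint64_t busy_cnt;     /* cycles blocked by downstream flow ctrl */
        uint64_t idle_cnt;     /* cycles with no activity                */
        uint64_t byte_cnt;     /* payload bytes observed                 */
        uint64_t pkt_cnt;      /* packets/transactions (fixed-size only) */
    };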
     

How software samples BW counters at t₀ and t₁ 

These counter values live in memory‑mapped registers (CSRs) exposed by the uncore PMU driver: 

1. Reading CSRs

Each counter is a register (often 32‑ or 48/64‑bit). Software reads them via: 

  • A kernel interface (e.g., Linux perf PMU events or ioctls) that abstracts address/encoding, or 
  • Direct MMIO/CSR reads in firmware/RTOS environments.
    (Jetson docs show perf stat with uncore events; libpfm exposes Intel IIO events similarly.) [wseas.us][techcommun…rosoft.com] 
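
For the direct-MMIO path, a minimal Linux user-space sketch (requires root) maps the PMON block through /dev/mem and reads one counter through a volatile pointer. The base address and register offset below are placeholders for illustration only; in practice you would take them from your platform documentation, or simply go through the kernel perf interface instead.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define PMON_BASE_ADDR  0xFED00000UL   /* placeholder physical address */
    #define PMON_BYTE_CNT   0x40           /* placeholder register offset  */

    int main(void)
    {
        int fd = open("/dev/mem", O_RDONLY | O_SYNC);
        if (fd < 0) { perror("open /dev/mem"); return 1; }

        volatile uint8_t *base = mmap(NULL, 0x1000, PROT_READ, MAP_SHARED,
                                      fd, PMON_BASE_ADDR);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        /* Single 32-bit CSR read; wider counters need the HI/LO handling
         * discussed below. */
        uint32_t bytes_lo = *(volatile uint32_t *)(base + PMON_BYTE_CNT);
        printf("byte_cnt (low 32 bits) = %u\n", (unsigned)bytes_lo);

        munmap((void *)base, 0x1000);
        close(fd);
        return 0;
    }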

 

2. Snapshot semantics (common patterns) 

  • Free‑running counters: read each register back‑to‑back at t₀, then again at t₁; deltas are your window values. 
  • Freeze/snapshot: some PMUs provide a freeze bit to latch all counters atomically; software sets freeze, reads, then clears it—repeat at t₁. 
  • Read‑to‑clear windowing: reading (or a control bit) clears counters, so the next read yields the next window directly. Vendors document which mode applies for a given PMU. [wseas.us] 
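
With free-running counters the main subtlety is wraparound between t₀ and t₁. A small helper handles it with modular arithmetic; the 48-bit width below is an assumption, so check the documented width of your PMU's counters.

    #include <stdint.h>

    #define COUNTER_WIDTH_BITS 48   /* assumption: check your PMU's width */
    #define COUNTER_MASK ((1ULL << COUNTER_WIDTH_BITS) - 1)

    /* Delta between two raw samples of a free-running counter,
     * correct across at most one wraparound within the window. */
    static uint64_t counter_delta(uint64_t t0, uint64_t t1)
    {
        return (t1 - t0) & COUNTER_MASK;
    }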

 

3. 64‑bit reads

If counters are split into HI/LO, the driver or your access routine handles atomicity (e.g., read LO, then HI, re‑check LO). Platform PMU drivers usually abstract this. 
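
If you do have to stitch the halves yourself, the re-check pattern described above looks roughly like the following sketch; pmon_read32() is a placeholder for whatever 32-bit CSR read primitive your platform provides.

    #include <stdint.h>

    /* Platform-specific 32-bit CSR read; assumed to exist elsewhere. */
    extern uint32_t pmon_read32(uint32_t offset);

    /* Read a 64-bit counter split into LO/HI halves.
     * Retries if LO wrapped between the LO and HI reads, since the HI
     * value may then belong to the wrong LO. */
    static uint64_t pmon_read64(uint32_t lo_off, uint32_t hi_off)
    {
        uint32_t lo, hi, lo2;

        do {
            lo  = pmon_read32(lo_off);
            hi  = pmon_read32(hi_off);
            lo2 = pmon_read32(lo_off);
        } while (lo2 < lo);   /* LO wrapped mid-sequence: retry */

        return ((uint64_t)hi << 32) | lo;
    }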

Example:

  Counter        Value at t₀        Value at t₁
  active_cnt       125,000,000        215,000,000
  busy_cnt          25,000,000         55,000,000
  idle_cnt         100,000,000        105,000,000
  byte_cnt       8,000,000,000     20,000,000,000
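
Working this window through (the PMON clock frequency is not part of the example, so f_clk = 1 GHz is assumed purely for illustration):

  Δcycles = Δactive_cnt + Δbusy_cnt + Δidle_cnt
          = 90,000,000 + 30,000,000 + 5,000,000 = 125,000,000 cycles
  Δbytes  = 20,000,000,000 − 8,000,000,000 = 12,000,000,000 bytes

  Δbytes / Δcycles = 96 bytes per cycle
  At f_clk = 1 GHz (assumed): BW ≈ 96 GB/s
  Window utilization: 72% active, 24% busy, 4% idle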

 

 

 

2) Histogram‑Based Latency  

 Latency is the amount of time it takes for a request to travel through a system from initiation to completion. 

 

In a histogram-based implementation, each completed transaction increments exactly one latency bin, selected by its measured latency in cycles. Bin ranges are configurable (linear or exponential), and the bin counters are monotonically increasing.

How a bin counter increments (what the hardware does) 

  1. Ingress timestamp: At request enqueue, the PMON domain starts the latency cycle counter. 
  2. Egress timestamp: At request completion, the PMON domain stops the cycle counter and derives the latency from the elapsed cycles. 
  3. Histogram update: The computed latency is compared against the configured threshold boundaries, and the matching histogram counter (hist_bin_k) is incremented. 

 

      if   L ∈ [L0, U0]: hist_bin_0++
      elif L ∈ [L1, U1]: hist_bin_1++
      …
      else:              hist_bin_N++   # open-ended last bin (L ≥ L_N)
 

What software computes from the bins 

Sample all bins at t₀ and t₁, then  

        Δbin_k = bin_k(t₁) − bin_k(t₀). 

Average latency (use a representative midpoint mid_k per bin): 

        total_pkts = Σ_k Δbin_k
        avg_latency ≈ (Σ_k Δbin_k * mid_k) / total_pkts 
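
A small helper capturing this computation might look as follows; this is a sketch, and num_bins plus the midpoint table come from your PMU's bin configuration (as in the example that follows).

    #include <stdint.h>

    /* Weighted-average latency from per-bin deltas and representative
     * midpoints. mid[] is in the PMU's latency unit (cycles here);
     * returns 0 when the window saw no completions. */
    static double avg_latency(const uint64_t *delta_bin, const double *mid,
                              unsigned num_bins)
    {
        uint64_t total = 0;
        double weighted = 0.0;

        for (unsigned k = 0; k < num_bins; k++) {
            total    += delta_bin[k];
            weighted += (double)delta_bin[k] * mid[k];
        }
        return total ? weighted / (double)total : 0.0;
    }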

 

Example:

  Bin     Latency Range (cycles)    Representative mid_k
  Bin 0   0–15                      8    ((0+15)/2)
  Bin 1   16–31                     24   ((16+31)/2)
  Bin 2   32–63                     48   ((32+63)/2)
  Bin 3   64–127                    96   ((64+127)/2)
  Bin 4   ≥128                      160  (implementation-specific fixed value)

 

PMON histogram counters

  Bin           Count at t₀    Count at t₁
  hist_bin_0      1,000,000      1,180,000
  hist_bin_1        600,000        690,000
  hist_bin_2        200,000        220,000
  hist_bin_3         20,000         24,000
  hist_bin_4          5,000          6,000

      

Step 1: Compute per‑bin deltas

  Bin   Δ Count
  0     180,000
  1      90,000
  2      20,000
  3       4,000
  4       1,000

  

Total packets = Σ_k Δbin_k = 295,000
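
Step 2: Compute the weighted average latency (applying the avg_latency formula with the midpoints from the bin table above; latency here is in PMON cycles):

    Σ_k Δbin_k * mid_k = 180,000*8 + 90,000*24 + 20,000*48 + 4,000*96 + 1,000*160
                       = 1,440,000 + 2,160,000 + 960,000 + 384,000 + 160,000
                       = 5,104,000

    avg_latency ≈ 5,104,000 / 295,000 ≈ 17.3 cycles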

 

3) A Minimal, Real‑Time Monitor Loop (software pattern) 

  1. Pin the sampling thread to avoid migration noise. 
  2. Read (active/busy/idle, byte_cnt, all hist_bin_k) at t₀ (use freeze/snapshot if available). 
  3. Sleep Δt (e.g., 100 ms). 
  4. Read again at t₁. 
  5. Compute (a minimal code sketch follows below):
  • a. Throughput: BW = Δbytes/Δcycles * f_clk
  • b. Latency: weighted mean of the bin midpoints (the avg_latency formula above)
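
A minimal sketch of this loop in C, under stated assumptions: pmon_read_snapshot() and PMON_CLK_HZ are placeholders, not real APIs. In a real monitor the snapshot function would wrap perf reads, a driver ioctl, or the MMIO access sketched earlier, and thread pinning (step 1, e.g. via sched_setaffinity) is omitted for brevity.

    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NUM_HIST_BINS 5        /* assumption: match your PMU's bin count  */
    #define PMON_CLK_HZ   1.0e9    /* assumption: use the real PMON frequency */

    struct pmon_snapshot {
        uint64_t active_cnt, busy_cnt, idle_cnt, byte_cnt;
        uint64_t hist_bin[NUM_HIST_BINS];
    };

    /* Placeholder: a real monitor would fill this from perf, a driver ioctl,
     * or MMIO reads. Stubbed with zeros so the sketch compiles and runs. */
    static void pmon_read_snapshot(struct pmon_snapshot *s)
    {
        *s = (struct pmon_snapshot){0};
    }

    int main(void)
    {
        /* Representative bin midpoints in PMON cycles (example values). */
        static const double mid[NUM_HIST_BINS] = { 8, 24, 48, 96, 160 };
        struct pmon_snapshot s0, s1;

        for (;;) {
            pmon_read_snapshot(&s0);   /* t0 */
            usleep(100 * 1000);        /* Δt = 100 ms */
            pmon_read_snapshot(&s1);   /* t1 */

            uint64_t d_cycles = (s1.active_cnt - s0.active_cnt) +
                                (s1.busy_cnt   - s0.busy_cnt)   +
                                (s1.idle_cnt   - s0.idle_cnt);
            uint64_t d_bytes  = s1.byte_cnt - s0.byte_cnt;

            /* Throughput: Δbytes / Δcycles * f_clk (bytes per second). */
            double bw = d_cycles ?
                (double)d_bytes / (double)d_cycles * PMON_CLK_HZ : 0.0;

            /* Latency: weighted mean of bin midpoints over the window. */
            uint64_t pkts = 0;
            double weighted = 0.0;
            for (unsigned k = 0; k < NUM_HIST_BINS; k++) {
                uint64_t d = s1.hist_bin[k] - s0.hist_bin[k];
                pkts     += d;
                weighted += (double)d * mid[k];
            }
            double lat = pkts ? weighted / (double)pkts : 0.0;

            printf("BW = %.2f MB/s, avg latency = %.1f cycles\n", bw / 1e6, lat);
        }
        return 0;
    }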

References 

  • Intel IIO uncore PMU (PCIe I/O analysis) — the IIO PMU is explicitly described as the facility to analyze PCIe controller traffic from software (available via libpfm / Linux PMU). [techcommun…rosoft.com] 
  • NVIDIA Jetson uncore PMUs (HSIO/UCF) — documentation and perf examples for reading bus/I/O events (including counters and frequencies) to compute bandwidth/latency in software. [wseas.us]