# Distributions
Distributions are an optional extension to metric samples that tell you more about the underlying data that produced them. Properties with a `dist_` prefix describe the data that contributed to a sample in more detail. Each of these properties is optional.
Emitters that are distribution-aware may treat events that carry them differently. `emit_otlp` treats samples carrying an exponential histogram as an OTLP exponential histogram. `emit_term` summarizes these same samples with quartiles.
## Sums and extrema
Say we have the following metric sample:
```rust
emit::count_sample!(name: "http_response", value: 500);
```
This tells us we've seen 500 values, but doesn't tell us anything about those 500 values themselves.
Attaching the sum of values with `dist_sum` gives us some more information:
```rust
emit::count_sample!(
    name: "http_response",
    value: 500,
    props: emit::props! {
        dist_sum: 1689628,
    },
);
```
With both the sum and the count, we can compute the mean as 1689628 / 500 = 3379.256.
The mean gives us a central point for the dataset, but the same mean could come from very different bounds.
Attaching the extrema with `dist_min` and `dist_max` tells us the range those values span:
```rust
emit::count_sample!(
    name: "http_response",
    value: 500,
    props: emit::props! {
        dist_sum: 1689628,
        dist_min: 100,
        dist_max: 29046,
    },
);
```
### Sum and extrema data model
Each well-known metric aggregation (the value of `metric_agg`) has a corresponding `dist_{metric_agg}` property. The type and semantics of a distribution property are the same as its corresponding aggregation.
### Distribution properties vs separate metrics
The `dist_sum`, `dist_count`, `dist_min`, and `dist_max` properties each have a corresponding value for `metric_agg`. For example, if we take the final sample from earlier, we could split it into four individual samples instead:
```rust
emit::count_sample!(name: "http_response", value: 500);
emit::sum_sample!(name: "http_response", value: 1689628);
emit::min_sample!(name: "http_response", value: 100);
emit::max_sample!(name: "http_response", value: 29046);
```
The difference between the two representations is whether the individual samples are valuable in their own right. Emitters may ignore distribution properties, so if you need to reliably track an aggregation, prefer a separate sample for it.
## Exponential histograms
A histogram is a compression of the underlying data source that buckets nearby values together and counts them, rather than storing the raw values themselves. Histograms give you an idea of how values are distributed across their range.
An exponential histogram automatically sizes its buckets using an exponential function, so buckets closer to zero are smaller (more accurate) than buckets further from zero. They're good for light-tail distributions, where values cluster near the low end and extremes are rare: the distribution rises quickly to a peak, then tapers off in a long, thin tail.
Typical web request latencies follow this shape. Most requests for a given endpoint complete around the same time, but in rare circumstances they may take much longer.
`emit` supports attaching an exponential histogram to a metric sample with the `dist_exp_scale` and `dist_exp_buckets` properties:
```rust
emit::count_sample!(
    name: "http_response",
    value: 500,
    props: emit::props! {
        dist_sum: 1689628,
        dist_min: 100,
        dist_max: 29046,
        dist_exp_scale: 2,
        // Buckets have a complex type, so they need to be captured
        // with either `serde` or `sval`
        #[emit::as_sval]
        dist_exp_buckets: [
            (99.07220457217667, 7),
            (117.81737057623761, 7),
            (140.10925536017402, 7),
            (166.61892335205206, 7),
            (198.14440914435335, 6),
            (235.63474115247521, 6),
            (280.218510720348, 9),
            (333.2378467041041, 9),
            (396.2888182887066, 12),
            (471.2694823049503, 11),
            (560.4370214406958, 13),
            (666.475693408208, 13),
            (792.5776365774132, 15),
            (942.5389646099006, 19),
            (1120.8740428813917, 24),
            (1332.951386816416, 24),
            (1585.1552731548263, 20),
            (1885.0779292198008, 21),
            (2241.748085762783, 32),
            (2665.9027736328317, 37),
            (3170.3105463096517, 28),
            (3770.1558584396016, 34),
            (4483.496171525566, 27),
            (5331.8055472656615, 34),
            (6340.621092619303, 34),
            (7540.311716879201, 19),
            (8966.99234305113, 5),
            (10663.611094531323, 2),
            (12681.242185238603, 2),
            (15080.623433758403, 4),
            (21327.222189062646, 3),
            (25362.484370477203, 8),
            (30161.2468675168, 1),
        ],
    },
);
```
### Exponential histogram data model
`emit`'s exponential histograms are a pair of well-known properties:

- `dist_exp_scale`: an integer with the scale of the histogram.
- `dist_exp_buckets`: a 2-dimensional sequence of bucket midpoints and counts. The sequence may be constructed from an array of 2-element tuples, or a map where the keys are bucket midpoints and the values are counts. Buckets have a complex type, so they need to be captured using either the `as_serde` or `as_sval` attributes. See Property Capturing for more details.
### Building exponential histograms
`emit` doesn't directly define a type that builds an exponential histogram for you. What it does provide is the `midpoint` function, which returns a `Point` that can be stored in a `BTreeMap` or `HashMap`.
Here's an example type that can collect an exponential histogram from raw values:
```rust
use std::collections::BTreeMap;

struct MyDistribution {
    scale: i32,
    max_buckets: usize,
    total: u64,
    buckets: BTreeMap<emit::metric::exp::Point, u64>,
}

impl MyDistribution {
    pub fn new() -> Self {
        MyDistribution {
            // Pick a large initial scale; we'll resample automatically
            // when the number of stored buckets overflows `max_buckets`
            scale: 20,
            max_buckets: 160,
            total: 0,
            buckets: BTreeMap::new(),
        }
    }

    pub fn buckets(&self) -> &BTreeMap<emit::metric::exp::Point, u64> {
        &self.buckets
    }

    pub fn total(&self) -> u64 {
        self.total
    }

    pub fn scale(&self) -> i32 {
        self.scale
    }

    pub fn observe(&mut self, value: f64) {
        *self
            .buckets
            .entry(emit::metric::exp::midpoint(value, self.scale))
            .or_default() += 1;

        self.total += 1;

        // If we've overflowed then reduce our scale and resample.
        // Each time `scale` is decremented, the number of buckets is halved
        if self.buckets.len() >= self.max_buckets {
            self.scale -= 1;

            let mut resampled = BTreeMap::new();

            for (value, count) in &self.buckets {
                *resampled
                    .entry(emit::metric::exp::midpoint(value.get(), self.scale))
                    .or_default() += *count;
            }

            self.buckets = resampled;
        }
    }
}
```
An exponential histogram in `MyDistribution` can then be converted into a metric sample:
```rust
let my_distribution = MyDistribution::new();

emit::count_sample!(
    name: "http_response",
    value: my_distribution.total(),
    props: emit::props! {
        dist_exp_scale: my_distribution.scale(),
        #[emit::as_sval]
        dist_exp_buckets: my_distribution.buckets(),
    },
);
```
### How exponential histograms work
Exponential histograms internally use γ, a value close to 1, as a log base for computing the bucket a sample belongs to.
The `midpoint` function computes γ from `scale` and uses it to find the midpoint of the bucket a raw value belongs to. You can also compute these values yourself, if you don't want to store midpoints or want to use a faster but possibly non-portable implementation.
#### Computing γ

From `scale`:
\[ γ = 2^{2^{-scale}} \]
From `error`:
\[ γ = \frac{1 + error}{1 - error} \]
`emit` uses the same scheme as OpenTelemetry for computing γ from `scale`. This form has the benefit of perfect subsetting, where each decrement of the scale exactly halves the number of buckets. This makes it possible to resample or merge histograms with different scales without needing to interpolate any buckets.
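As a concrete sketch, here's how γ could be computed in plain Rust. These helpers (`gamma_from_scale`, `gamma_from_error`) are illustrative, not part of `emit`'s API:

```rust
/// γ from a scale, using the OpenTelemetry scheme: γ = 2^(2^-scale)
fn gamma_from_scale(scale: i32) -> f64 {
    2f64.powf(2f64.powi(-scale))
}

/// γ from a relative error: γ = (1 + error) / (1 - error)
fn gamma_from_error(error: f64) -> f64 {
    (1.0 + error) / (1.0 - error)
}
```

At `scale: 2`, the scale used in the samples above, `gamma_from_scale(2)` is about `1.19`.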
#### Computing index
The index is the bucket that a value belongs to. Values that are close together will share the same bucket.
\[ index = \lceil{\log_γ (\lvert{value}\rvert)}\rceil \]
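A sketch of the index calculation, using a change of log base and the hypothetical `gamma_from_scale` helper from above:

```rust
/// The bucket index a value belongs to: ceil(log_γ(|value|)).
/// Note: a real implementation would need to special-case zero,
/// which has no finite logarithm
fn index_of(value: f64, gamma: f64) -> i32 {
    (value.abs().ln() / gamma.ln()).ceil() as i32
}
```

For example, at `scale: 2` a value of `100` lands in bucket `27`, because `ln(100) / ln(γ)` is about `26.6`.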
#### Computing midpoint

The midpoint is the value at the center of the bucket with a given index.
\[ midpoint = \frac{γ^{index - 1} + γ^{index}}{2} \]
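Continuing the sketch:

```rust
/// The value at the center of the bucket with the given index:
/// (γ^(index - 1) + γ^index) / 2
fn midpoint_of(index: i32, gamma: f64) -> f64 {
    (gamma.powi(index - 1) + gamma.powi(index)) / 2.0
}
```

`midpoint_of(27, gamma_from_scale(2))` is about `99.072`, which is the first midpoint in the `dist_exp_buckets` example above.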
#### Rescaling

If you have a scale and a range of bucket indexes, you can compute a new scale that fits them into a target maximum number of buckets.
\[ scale_1 = scale_0 - \lceil{\log_2 \left({\frac{index_{max} - index_{min}}{size}}\right)}\rceil \]
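A sketch of this calculation, assuming the index range doesn't already fit (a real implementation would clamp the decrement at zero when it does):

```rust
/// A new scale that fits the bucket index range into `size` buckets
fn rescale(scale: i32, index_min: i32, index_max: i32, size: usize) -> i32 {
    let shrink = ((index_max - index_min) as f64 / size as f64).log2().ceil() as i32;

    scale - shrink
}
```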
#### Computing error from scale

If you have a scale, you can compute the error value from it, which gives you an idea of how accurate bucket values are. The error is slightly misleading because it's a percentage rather than an absolute value: larger values can be further from their midpoint than smaller ones.
\[ error = \frac{2^{2^{-scale}} - 1}{2^{2^{-scale}} + 1} \]
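Since γ = 2^(2^-scale), this is just (γ - 1) / (γ + 1). As a sketch:

```rust
/// The relative error of bucket midpoints at a given scale
fn error_from_scale(scale: i32) -> f64 {
    let gamma = 2f64.powf(2f64.powi(-scale));

    (gamma - 1.0) / (gamma + 1.0)
}
```

At `scale: 2` the error is about `8.6%`; midpoints at that scale are within roughly that fraction of the true values they stand in for.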