Diversity & Distance
From Vectors to Diversity
The solver evaluates how diverse a selection is through a three-step process:
Vectors ──> Pairwise Distances ──> Separations ──> Diversity Score
(n x d) (n*(n-1)/2) (k x 1) (scalar)
-
Pairwise distances are computed once upfront between all
nvectors using the chosen distance metric. These are stored as a condensed distance vector (like scipy'spdist). -
Separations are computed for each selected vector: the distance to its nearest neighbor within the current selection. This is the key quantity -- a vector with high separation is well-spread from the rest of the selection; a vector with low separation is close to at least one other selected vector.
-
The diversity score aggregates the
kseparation values into a single scalar using the chosen diversity metric.
Distance Metrics
The distance metric determines how the distance between two vectors is measured.
| Metric | Formula | Notes |
|---|---|---|
L2_EUCLIDEAN |
$\(d = \sqrt{\sum_i (x_i - y_i)^2}\)$ | Standard Euclidean distance. Default. |
L1_MANHATTAN |
$\(d = \sum_i \lvert x_i - y_i \rvert\)$ | Less sensitive to outlier dimensions. |
L2S_EUCLIDEAN_SQUARED |
$\(d = \sum_i (x_i - y_i)^2\)$ | Avoids the square root. Produces identical solutions to L2_EUCLIDEAN when used with GEOMEAN_SEPARATION, since the geometric mean preserves distance ordering regardless of squaring. |
Separation
The separation of a selected vector is its minimum distance to any other selected vector:
where \(S\) is the current selection and \(d\) is the chosen distance metric.
Separation is maintained incrementally during optimization:
- Adding a vector only requires checking its distances to all other vectors and updating any separations that decrease.
- Removing a vector requires recomputing separation only for vectors whose nearest neighbor was the removed vector.
This incremental update is much cheaper than recomputing all separations from scratch after each swap.
Diversity Metrics
Each diversity metric aggregates the k separation values differently:
| Metric | Formula | Characteristics |
|---|---|---|
GEOMEAN_SEPARATION |
$\(\exp\!\left(\frac{1}{k}\sum_{v \in S} \ln(\text{sep}(v))\right)\)$ | Default. Balances all separations. Sensitive to any vector with low separation -- a single poorly-placed vector drags down the score. |
MIN_SEPARATION |
$\(\min_{v \in S} \text{sep}(v)\)$ | Only considers the worst-off vector (the closest pair). Equivalent to the p-dispersion problem. Many swaps produce tied scores. |
MEAN_SEPARATION |
$\(\frac{1}{k}\sum_{v \in S} \text{sep}(v)\)$ | Averages all separations. Less sensitive to individual outliers than geomean, but can be dominated by a few very high separations. |
APPROX_GEOMEAN_SEPARATION |
Same as geomean but using fast log/exp approximations | Slightly less accurate but faster per iteration. Useful for large-scale problems where iteration speed matters more than per-iteration precision. |
Which metric to choose?
GEOMEAN_SEPARATIONis the best default. It naturally penalizes any clustering in the selection while remaining smooth and differentiable in most of the search space.MIN_SEPARATIONis appropriate when you specifically care about the worst-case nearest neighbor distance (e.g., facility placement where minimum coverage radius matters). Expect slower convergence due to many tied scores.MEAN_SEPARATIONmaximizes total spread. It may tolerate some clustering as long as other vectors compensate with large separations.APPROX_GEOMEAN_SEPARATIONis a drop-in replacement forGEOMEAN_SEPARATIONwhen you want to trade a small amount of precision for more iterations per second.