Getting Started
Installation
Or with uv:
Basic Usage
1. Define a problem
A Maximum Diversity Problem consists of a set of n vectors in d dimensions, from which you want
to select k that are as diverse as possible.
import numpy as np
from max_div import MaxDivProblem, MaxDivSolverBuilder, seconds
# 200 random points in 5 dimensions
rng = np.random.default_rng(42)
vectors = rng.random((200, 5))
# Select 20 that maximize diversity
problem = MaxDivProblem.new(vectors, k=20)
Vectors are passed as a 2D numpy array.
2. Solve
Use the builder to configure and run the solver. The simplest approach is to just specify a
time budget. Here we don't specify a preset, so the default (SMART) is used:
The solver will print progress as it runs. Pass verbosity=0 for silent operation, or
verbosity=20 for a detailed progress table.
3. Interpret the results
# One-line summary
print(solution)
# MaxDivSolution: 20 vectors selected | diversity=0.7792 | 5.17s (39_008 iterations)
# Indices of the selected vectors
print(solution.i_selected) # e.g. array([ 3, 7, 14, ...], dtype=int32)
# The selected vectors themselves
selected_vectors = vectors[solution.i_selected]
# How long the solver ran
print(solution.duration) # 5.17s (39_008 iterations)
# Multi-component score of the solution
print(solution.score)
# size=1.0000 | constraints=1.0000 | diversity=0.7792
The score has three main components, shown in strict priority order.
In a final solution, size is always 1.0. For an unconstrained problem, constraints is also
always 1.0:
size-- 1.0 when exactlykvectors are selectedconstraints-- 1.0 when all fairness constraints are satisfieddiversity-- the main diversity score (higher is better)
This priority ordering means that solutions with better constraint satisfaction will always be preferred over solutions with higher diversity.
Adding Fairness Constraints
Constraints let you enforce that a minimum and/or maximum number of selected vectors come from specific subsets. This is useful for ensuring fair representation across groups.
from max_div import Constraint
# Vectors 0-99 are "group A", vectors 100-199 are "group B"
# Require at least 8 and at most 12 from each group
constraints = [
Constraint(int_set=set(range(0, 100)), min_count=8, max_count=12),
Constraint(int_set=set(range(100, 200)), min_count=8, max_count=12),
]
problem = MaxDivProblem.new(vectors, k=20, constraints=constraints)
solver = MaxDivSolverBuilder(problem).with_preset(seconds(5)).build()
solution = solver.solve(verbosity=0)
print(solution)
# MaxDivSolution: 20 vectors selected | diversity=0.7702 | constraints: 2/2 satisfied | ...
Note that the diversity is slightly lower than the unconstrained solution (0.7702 vs 0.7792) -- this is expected, since satisfying constraints restricts the search space.
Constraints can overlap -- a single vector can belong to multiple constraint groups.
If constraints are infeasible (or jointly infeasible), the solver will not fail. Instead, it gracefully finds the least infeasible solution -- the one that violates constraints as little as possible. This follows naturally from the score's priority ordering: a higher constraint score always outweighs a higher diversity score.
Choosing Metrics
Distance metrics
The distance metric determines how the distance between two vectors is measured.
from max_div import DistanceMetric
problem = MaxDivProblem.new(
vectors, k=20,
distance_metric=DistanceMetric.L2_EUCLIDEAN, # default
# distance_metric=DistanceMetric.L1_MANHATTAN,
# distance_metric=DistanceMetric.L2S_EUCLIDEAN_SQUARED,
)
| Metric | Description |
|---|---|
L2_EUCLIDEAN |
Standard Euclidean distance (default) |
L1_MANHATTAN |
Sum of absolute differences |
L2S_EUCLIDEAN_SQUARED |
Squared Euclidean -- avoids the square root, produces identical solutions when used with GEOMEAN_SEPARATION |
Diversity metrics
The diversity metric defines what "diverse" means for the selected subset. It operates on the separation of each selected vector -- its distance to its nearest neighbor in the selection. A well-spread selection will have high separations across all vectors; a clustered selection will have low separations for the vectors that are close together.
from max_div import DiversityMetric
problem = MaxDivProblem.new(
vectors, k=20,
diversity_metric=DiversityMetric.GEOMEAN_SEPARATION, # default
# diversity_metric=DiversityMetric.MIN_SEPARATION,
# diversity_metric=DiversityMetric.MEAN_SEPARATION,
# diversity_metric=DiversityMetric.APPROX_GEOMEAN_SEPARATION,
)
| Metric | Description | Best for |
|---|---|---|
GEOMEAN_SEPARATION |
Geometric mean of all separations (default) | General use -- balances spread across all selected vectors |
MIN_SEPARATION |
Minimum separation (= p-dispersion) | When the closest pair matters most |
MEAN_SEPARATION |
Arithmetic mean of all separations | When total spread is the objective |
APPROX_GEOMEAN_SEPARATION |
Fast approximation of GEOMEAN_SEPARATION |
Large-scale problems where speed matters |
Solver Presets
Presets configure the initialization and optimization strategies. All presets accept a
time budget (via seconds(), minutes(), hours()) or an iteration count
(via iterations()).
from max_div import SolverPreset, minutes, iterations
# Using a named preset
solver = (
MaxDivSolverBuilder(problem)
.with_preset(minutes(2), preset=SolverPreset.SMART)
.build()
)
# Or using iteration count instead of wall-clock time
solver = (
MaxDivSolverBuilder(problem)
.with_preset(iterations(10_000))
.build()
)
| Preset | Description |
|---|---|
RANDOM |
Baseline -- random initialization + random swap optimization. Fast but lower quality. |
GUIDED |
Distance-guided swaps -- biased towards removing low-separation vectors and adding high-separation ones. |
SMART |
Default. Adaptive swap-based optimization that learns which swap sizes and candidate selection strategies work best. |
THOROUGH |
Like SMART but with wider parameter ranges, exploring more of the search space at the cost of slower convergence. Best for long runs. |
As a rule of thumb, SMART (the default) is a good starting point. Use THOROUGH for long runs
(minutes to hours) where you want the best possible solution quality. Use RANDOM or GUIDED as
baselines for comparison.
The time budget should be chosen relative to the problem size. A good starting point is to give the
solver at least 10 * k to 100 * k iterations.
Reproducibility
The solver is deterministic for a given seed. Change the seed to get different solutions:
Running the same solver multiple times with different seeds and keeping the best result is a simple way to improve solution quality.
Next Steps
- Concepts -- how the solver pipeline, scoring, constraints, and diversity metrics work under the hood
- API Reference -- full details on all configuration options
- Benchmark Results -- see how different strategies and presets perform across various test problems
- CLI -- run benchmarks and solve problems from the command line