WRandAi: How Much Does Your Benchmark Depend on the Seed?

January 9, 2024

Evaluation, GNN, Open Source

A leaderboard tells you algorithm A beats algorithm B. But run the whole benchmark again with different random seeds and sometimes B beats A. If the ranking flips when you change nothing but the seed, the ranking wasn’t telling you much in the first place.

This was the central frustration of my PhD work on evaluating unsupervised graph neural networks. Papers reported state-of-the-art results, but the gaps between methods were often smaller than the variance you got from re-seeding. I wanted a single number that answers: how much of this ranking is real, and how much is noise? That number is the W Randomness Coefficient, and wrandai is the package that computes it.

The idea

You evaluate a set of algorithms across a suite of benchmark tests (every dataset–metric combination is one test), and you repeat each test over several random seeds. For each test and seed you get a ranking of the algorithms. If those rankings agree across seeds, the benchmark is stable. If they disagree, the result is fragile.
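
To make the ranking step concrete, here is a minimal sketch. It assumes the scores sit in a NumPy array shaped [tests, seeds, algorithms] with higher being better. The helper name scores_to_ranks is hypothetical and is not part of wrandai, and ties are simply broken by position, which may differ from how the package treats them.

```python
import numpy as np

def scores_to_ranks(scores):
    """Rank algorithms within every (test, seed) cell; rank 1 = best score.

    scores: array shaped [tests, seeds, algorithms], higher is better.
    Ties are broken by position (the earlier algorithm gets the better rank).
    """
    scores = np.asarray(scores, dtype=float)
    # Indices that sort each row from best to worst score.
    order = np.argsort(-scores, axis=-1, kind="stable")
    # Inverting that permutation gives each algorithm's rank (0-based), so add 1.
    return np.argsort(order, axis=-1, kind="stable") + 1

# One test, two seeds, three algorithms: the top two swap places between seeds.
scores = np.array([[[0.90, 0.85, 0.40],
                    [0.84, 0.88, 0.41]]])
ranks = scores_to_ranks(scores)
print(ranks[0, 0])  # [1 2 3]
print(ranks[0, 1])  # [2 1 3]
```

The second seed promotes the second algorithm to first place, which is exactly the kind of flip the coefficient is meant to surface.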

The W Randomness Coefficient measures that agreement. A coefficient near one means the rankings are consistent across seeds and you can trust the ordering; a coefficient near zero means the ordering is essentially being decided by the random seed. The strongest variant uses the Wasserstein distance between the rank distributions, which captures not only whether rankings disagree but also by how much.
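
For intuition about what "agreement across seeds" can look like numerically, here is a sketch that uses Kendall's W, a classical concordance statistic, averaged over tests. This is an illustration only: it is not the Wasserstein variant, it is not claimed to be what wrandai computes internally, and the function names kendalls_w and mean_agreement are hypothetical.

```python
import numpy as np

def kendalls_w(ranks):
    """Kendall's W for one test; ranks is [seeds, algorithms] with no ties."""
    ranks = np.asarray(ranks, dtype=float)
    m, n = ranks.shape                      # m seeds, n algorithms
    rank_sums = ranks.sum(axis=0)           # total rank of each algorithm
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

def mean_agreement(scores):
    """Average Kendall's W over tests for scores shaped [tests, seeds, algorithms]."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, axis=-1, kind="stable")
    ranks = np.argsort(order, axis=-1, kind="stable") + 1   # rank 1 = best
    return float(np.mean([kendalls_w(r) for r in ranks]))

rng = np.random.default_rng(0)
base = np.linspace(0, 1, 10)
stable = np.tile(base, (20, 10, 1))     # identical ordering in every test and seed
noisy = rng.permuted(stable, axis=-1)   # ordering reshuffled independently per row

print(mean_agreement(stable))  # 1.0
print(mean_agreement(noisy))   # well below 1
```

The two extremes mirror the reading described above: identical rankings give a value of one, and rankings that are reshuffled for every seed fall towards zero.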

Using it

pip install wrandai

The input is a single array shaped [benchmark_tests, seeds, algorithms] of absolute scores, where higher is better:

from wrandai import wrandai
import numpy as np

n_algorithms, n_benchmark_tests, n_random_seeds = 10, 20, 10

# perfectly consistent rankings vs. rankings shuffled per seed
perfect = np.zeros((n_benchmark_tests, n_random_seeds, n_algorithms))
random  = np.zeros((n_benchmark_tests, n_random_seeds, n_algorithms))
for test in range(n_benchmark_tests):
    for seed in range(n_random_seeds):
        vals = np.linspace(0, 1, n_algorithms).round(3)
        perfect[test, seed, :] = vals
        np.random.shuffle(vals)
        random[test, seed, :] = vals

print(wrandai.w_randomness(perfect, w_method='w_wasserstein'))  # ~1: rankings agree
print(wrandai.w_randomness(random,  w_method='w_wasserstein'))  # ~0: seed decides the order

There are three methods to choose from — w_wasserstein, w_ties and w_random_coeff — and return_ranks=True gives you each algorithm’s average rank across all tests and seeds.
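
Real results rarely arrive as a neat array, so here is a sketch of one way to pack per-run records into the [benchmark_tests, seeds, algorithms] shape. Everything in it is an assumption made for illustration: the build_score_array helper, the record layout, and the test names are hypothetical. Negating lower-is-better metrics is one simple option that preserves the ordering; it is not documented wrandai behaviour. The sketch leaves missing runs as NaN, and it makes no assumption about how wrandai treats them, so check for NaN before passing the array on.

```python
import numpy as np

def build_score_array(records, tests, seeds, algorithms, lower_is_better=()):
    """Pack per-run scores into an array shaped [tests, seeds, algorithms].

    records: iterable of (test, seed, algorithm, score) tuples.
    lower_is_better: test names whose scores are negated so that
    higher is better everywhere. Runs that are missing stay NaN.
    """
    t_idx = {name: i for i, name in enumerate(tests)}
    s_idx = {seed: i for i, seed in enumerate(seeds)}
    a_idx = {name: i for i, name in enumerate(algorithms)}
    out = np.full((len(tests), len(seeds), len(algorithms)), np.nan)
    for test, seed, algo, score in records:
        sign = -1.0 if test in lower_is_better else 1.0
        out[t_idx[test], s_idx[seed], a_idx[algo]] = sign * score
    return out

# Hypothetical results: one higher-is-better and one lower-is-better test.
records = [
    ("cora/nmi", 0, "A", 0.52), ("cora/nmi", 0, "B", 0.49),
    ("cora/nmi", 1, "A", 0.47), ("cora/nmi", 1, "B", 0.50),
    ("cora/loss", 0, "A", 1.10), ("cora/loss", 0, "B", 1.30),
    ("cora/loss", 1, "A", 1.25), ("cora/loss", 1, "B", 1.20),
]
scores = build_score_array(
    records,
    tests=["cora/nmi", "cora/loss"],
    seeds=[0, 1],
    algorithms=["A", "B"],
    lower_is_better={"cora/loss"},
)
print(scores.shape)            # (2, 2, 2)
print(np.isnan(scores).any())  # False: every (test, seed, algorithm) cell was filled
```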

Why bother packaging it

The coefficient came out of two papers (the consistent-benchmark paper and the comparison-of-measures paper), but a result trapped inside a paper doesn’t change how anyone runs experiments. Putting it behind pip install and a one-function API was the point: the friction of measuring seed-dependence has to be lower than the friction of ignoring it, or people will keep ignoring it.

The broader lesson stuck with me well beyond graphs. Most of the time when we say one model is better than another, we owe ourselves the second question — better across seeds, or better on this one? This little package exists to make that question cheap to ask.
