I show how to construct an  infinite sequence of samples from a set $S$, such that for all $m>0$, the first $m$ terms of the sequence, all subsets of $S$ occur as evenly as possible.

Introduction

Consider the scenario where you need to select $m=5000$ subsets of the set $S$ of 11 items. One reasonable method,  is for randomly select any one of the possible  $9 \choose 4$  $=126$ sets, each time.

where each set must comprise of exactly  $k=4$ out of a possible $n=11$ items.

That is, if we denote the $n=9$ items as ${A,B,C,D,E,F,G,H,I}$, then one possibility for the first 20 subsets is:

$$ABEG,AGHI,BCDE,ADGI,ABFI,$$

$$BCDF,CEFG,CEFI,BEGH,BFGH,$$

$$ADEF,AEGI,ABDG,DFGH,ABCI,$$

$$BFGI,EFGH,ACHI,BEFG,AEHI$$

A commonly encountered problem with uniform random sampling, is that although each item is statistically equally likely to appear, for any finite number of samples, it almost certain that some items may be selected considerably more often than others. In many situations, it would be useful if such a discrepancy could be minimized.

For example, in the above example, item $G$ occurs 12 times and yet item $C$ only occurs 6 times. Similarly, particular pairs of items (2-sets) also occur more often than others. For example, in our case, both elements of the 2-set $\{ A,I\}$ are simultaneously selected 7 out of the 20 times, and yet the elements from the 2-set $\{C,G\}$ are only simultaneously selected once.

We now show how to create an open (infinite) sequence $\{s_0,s_1,s_2,\ldots, \}$, where each $s_i$ is an subset of $S$; such that for all values of $m$ and $j$ such that $(1 \leq j \leq k)$, that among the first $m$ terms of the sequence, the frequency of all possible subsets of size $j$ is distributed as evenly as possible.

We will also show how to make a minor modification to this method if you are only wishing to select samples of a specific cardinality. For example, each sample must comprise of exactly $k$ elements.

Low Discrepancy Sequences

Point distributions that are more even than ‘uniform random sampling’ are called low discrepancy sequences (also called quasirandom sequences). See wikipedia article “[low discrepancy sequences][1]” for a brief overview on these sequences. There are numerous ways to construct low discrepancy sequences, including Van der Corput/Halton, Kronecker/Weyl, Niederreiter, and Sobol to name a few. Furthermore, all of these can be used to create $d$-dimensional sequences.

The figure below, taken from my other post, “The Unreasonable effectiveness of quasirandom sequences” shows a comparison of various low discrepancy sequences in two-dimensions. The important feature to note is how all of the distributions are more evenly distributed and have less ‘gaps’ and less ‘clusters’ than the Random distribution [bottom right].

Low Discrepancy Combinatorics

This section shows how we can leverage the properties of low discrepancy point distributions to construct sequences of subsets that have low discrepancy. The following method works with the use of any low discrepancy sequences however, for the Halton, and Kronecker/Weyl sequences it only offers modest improvements to random sampling. In contrast to this, due to their intrinsic relationship with binary sequences, and especially to a special characteristic that Sobol termed, “Property A”, this method produces excellent results when either Sobol and Niederreiter sequences are used.

Methodology for constructing an evenly distributed sequence of  subsets

1. Let $i=1$.
2. Calculate the $i$-th term in the $n$-dimensional Sobol sequence
3. Convert this to a binary vector, $b_i$.
4. If $b_i$  has exactly $k$ ones in it, then this represents a subset that has exactly $k$ items, and so let this be the next subset; otherwise reject this $b_i$
5. $i \leftarrow i+1$.
6. Goto step 2, until number of desired subsets is achieved.

In many cases, this method requires rejecting a modest fraction of subsets. However, as most computers can still typically scan through several million terms in under a second, an acceptance rate of only a few percent, is most likely not an issue for most people. In practice, this means that this method will almost certainly work well for situations where $d \lesssim 20$.

An example, (continued)

To illustrate, let’s continue with the original example, where we need to select a $m =5000$ samples, with each sample having exactly $k=4$ objects out of a possible $N=9$ objects. It should be clear from how this extends to arbitrary $m,n$ and $k$.

First we calculate the $n=9$-dimensional Sobol sequence. This is usually available via most statistical software libraries. Typically each software implementation varies slightly as there are various choices for key parameters that need to be made. Thus when you try to replicate this method for your own application, the values in the your sequence may differ slightly, however, the final results will still hold!

Using Mathematica, the first few 11 terms of the $n=9$-dimensional Sobol sequence are:

$$t_0 = (0.710,0.954,0.116,0.087,0.559,0.001,0.874,0.466,0.654)$$

$$t_1 = (0.460,0.704,0.866,0.338,0.309,0.251,0.124,0.716,0.154)$$

$$t_2 = (0.960,0.204,0.366,0.838,0.809,0.751,0.624,0.216,0.216)$$

$$t_3 = (0.022,0.516,0.053,0.900,0.872,0.689,0.061,0.903,0.716)$$

$$t_4 = (0.522,0.016,0.554,0.400,0.372,0.189,0.562,0.403,0.966)$$

$$t_5 = (0.272,0.266,0.304,0.150,0.622,0.439,0.312,0.653,0.466)$$

$$t_6 = (0.772,0.766,0.804,0.650,0.122,0.939,0.812,0.153,0.341)$$

$$t_7 = (0.397,0.141,0.179,0.025,0.247,0.564,0.687,0.778,0.841)$$

$$t_8 = (0.897,0.641,0.679,0.525,0.747,0.063,0.187,0.278,0.591)$$

$$t_9 = (0.147,0.891,0.429,0.775,0.497,0.314,0.937,0.528,0.091)$$

$$t_{10} = (0.647,0.391,0.929,0.275,0.997,0.814,0.437,0.028,0.122).$$

Now convert each of the components in each term, into a binary $0/1$ value via the Round[.] function. This, results in the following sequence

$$b_0 = (1,1,0,0,1,0,1,0,1)$$

$$b_1 = (0,1,1,0,0,0,0,1,0)$$

$$b_2 = (1,0,0,1,1,1,1,0,0)$$

$$b_3 = (0,1,0,1,1,1,0,1,1)$$

$$b_4 = (1,0,1,0,0,0,1,0,1)$$

$$b_5 = (0,0,0,0,1,0,0,1,0)$$

$$b_6 = (1,1,1,1,0,1,1,0,0)$$

$$b_7 = (0,0,0,0,0,1,1,1,1)$$

$$b_8 = (1,1,1,1,1,0,0,0,1)$$

$$b_9 = (0,1,0,1,0,0,1,1,0)$$

$$b_{10} = (1,0,1,0,1,1,0,0,0)$$

The key part of the method now is to understand that each of the binary vectors defines a unique subset.

For examplem the first term $b_0 = (1,1,0,0,1,0,1,0,1)$ corresponds to selecting only the 1st, 2nd, 5th, 7th and 9th (reading from left to right). That is, the subset ${ABEGI}$. We note that this subset has 5 items not 4, so we reject this sample. Similarly for $b_1,b_2$ and $b_3$.

However, $b_4 = (1,0,1,0,0,0,1,0,1)$corresponds to $\{ACGI\}$ which has the exactly $k=4$ items, so this accept this as the first sample.

Thus the first few (accepted) terms are:

$$\{s_0,s_1,s_2,s_3,…\} = \{b_4,b_7,b_9,b_{10},…\} = \{ ACGI,FGHI, BDGH, ACEF ,… \}$$

If we follow this method for our particular case of $n=9,k=4$, the first 21000 terms of the Sobol sequence will typically result in 5000 acceptable subsets of length $k=4$.

Empirical Results

We can verify that for our example, where $n=9$, $k=4$, $m=5000$, this produces evenly combinations by listing the frequency counts for all  $j$-subsets, for $1\leq j \leq 4$.  (Note that the first column refers to the fact that for $j=1$ there are 9 different individual items; for $j=2$ there are 36 different pairs of items$\{x_p,x_q\}$; for $j=3$ there are 84 different triples $\{x_p,x_q,x_r\}$; and for $j=4$ there are 126 different 4-sets. $\{x_p,x_q,x_r,x_s\}$ )

$$\begin{array}{l|c|l|c|} & \text{# combinations}& \text{Frequency Counts} & \text{Std dev}, \sigma \\ \hline j=1 & 9 & 2220, 2220, 2221, 2221, 2222, 2222, 2223, 2225, 2226 & 1.6 \\ \hline j=2 & 36 & 830, 830, 831, 832, 832, \ldots, 835, 835, 837, 837, 837 & 1.6 \\ \hline j=3 & 84 & 236, 236, 236, 236, 236, \ldots, 240, 240, 240, 240, 240 & 1.1 \\ \hline j=4 & 126 & 39, 39, 39, 39, 39 \ldots, 40, 40, 40, 40, 40 & 0.5 \\ \hline \end{array}$$

This confirms that the quasirandom methodology produces frequency counts that are very evenly distributed for all subcombinations of items.

Comparison with Uniform Random Sampling

To show how much this is an improvement over simple uniform random sample, the follow frequency counts are given for a corresponding sample of 5000 randomly sampled $k=4$-tuples.

$$\begin{array}{l|c|l|c|} & \text{# combinations} & \text{Frequency Counts} & \text{Std dev}, \sigma \\ \hline j=1 & 9& 2176,2194,2200,2217,2218,2233,2236,2244,2282 & 31.3 \\ \hline j=2 & 36& 778, 793, 801, 805, 806, \ldots, 861, 863, 863, 872, 881 & 24.2 \\ \hline j=3 & 84& 199,211,212,212,214,\ldots,263,266,267,267 & 14.5 \\ \hline j=4 & 126& 19, 23, 27, 27, 28, \ldots, 51, 51, 52, 52, 53 & 6.1 \\ \hline \end{array}$$

Summary

The low discrepancy Sobol sequence can be used to construct a sequence of subsets such that the cumulative frequency of all possible subsets of variables is evenly distributed. In many situations, this can offer vast improvements when compared to simple uniform random sampling.

Other Posts you might like: