Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Proposes data thinning, a new approach for splitting an observation into two or more parts
Data thinning can be applied to any observation drawn from a “convolution closed” distribution
Data thinning has applications to model selection, evaluation, and inference
Cross-validation via data thinning provides an alternative to sample splitting
Data thinning can be used to validate the results of unsupervised learning approaches

Paper Content

Introduction

Data sets are growing in size and complexity
There is a need for methods to validate outputs of complex models
Sample splitting is a common method used to validate models
Data fission is an alternative to sample splitting proposed by Leiner et al. [2022]
Data fission involves decomposing a single observation into two parts
In some cases, the two parts are independent
Data thinning is a recipe for decomposing an observation into two parts that are independent and follow the same distribution as the original observation
Data thinning can be applied to any convolution-closed distribution
Data thinning can be extended to decompose an observation into an arbitrary number of independent parts
Data thinning provides an alternative to sample splitting in unsupervised settings

A review of convolution-closed distributions

Convolution-closed distributions are indexed by a parameter λ
Many well-known distributions are convolution-closed
Expectation is linear in λ for convolution-closed families
Density of G λ 1 ,λ 2 ,x can be written down for any F λ with a known density function
G λ 1 ,λ 2 ,x has a simple closed form for several well-known distributions

Data thinning

Algorithm 1 introduced to observe a realization of X ∼ F λ
Theorem 1 introduced, related to a proposal by Joe [1996]
X = X (1) + X (2) , X (1) ⊥ ⊥ X (2) , X (1) and X (2) follow same distribution as X
∈ (0, 1) is a tuning parameter that governs a tradeoff between how much information is in X (1) and X (2)
Table 2 summarizes data thinning proposal for several well-known distributions
Remark 3: additional parameter may be required
Remark 4: Binomial distribution not infinitely divisible
Application 1: model evaluation for unsupervised learning using data thinning
Example 2.1: thinned Poisson distribution
Example 2.2: thinned binomial distribution
Example 2.3: Application 1 with mean squared error loss

Effect of unknown nuisance parameters

Data thinning requires knowledge of a nuisance parameter
Proposition 1: If incorrect value of variance is used for data thinning, X (1) and X (2) are positively or negatively correlated
Proposition 2: If incorrect value of r is used for data thinning, X (1) and X (2) are positively or negatively correlated
Proposition 3: If incorrect value of α is used for data thinning, X (1) and X (2) are positively or negatively correlated
Figure 1 verifies Propositions 1-3 empirically
Algorithm 2 and Theorem 2 provide a general form of multi-fold data thinning
Example 3.2 provides multi-fold thinning of normal distribution
Table 3 reveals simple form of multi-fold thinning for univariate distributions
Application 2 uses multi-fold thinning to evaluate estimator μ(X) for unsupervised learning

Comparison of data thinning and data fission

Leiner et al. [2022] provide alternate strategies to decompose X
X (1) and X (2) are not independent in most cases
Data thinning and data fission proposals are identical in the Poisson case
Data fission does not guarantee that X (1) and X (2) resemble the distribution of X
Parameters of interest are entangled in the conditional distribution of X (2) | X (1)
Tuning parameter is hard to interpret
Data thinning provides a simple recipe to decompose convolution-closed distributions
Data fission requires knowledge of nuisance parameters to do inference on E

Simulation study

Data thinning can be applied to model selection and validation problems.
Data thinning is attractive for unsupervised learning.
Data thinning and multi-fold thinning are applied to unsupervised learning problems.
Data thinning is compared to naive approaches that use the same data to fit and validate unsupervised models.
Data thinning is applied to binomial distributed data and K-means clustering on Gamma distributed data.

Methods

Goal is to quantify NLL associated with approximating a binomial matrix
Algorithm 3 used to evaluate binomial PCA with NLL loss
Inputs: positive integer K, matrices X (train) and X (test)
Algorithm 4 used to evaluate Gamma clusters with NLL loss
Inputs: positive integer K, matrices X (train) and X (test)
Algorithms 3 and 4 applied with X (train) = X (test) and a (train) = a (test) = 1

Data thinning:

Thin data into two sets using Algorithm 1
Apply Algorithms 3 and 4 to the two sets
Thin data into M folds using Algorithm 2
Apply Algorithms 3 and 4 to each fold
Average the loss functions across the M folds
Expect U-shaped NLL curves with data thinning
Naive approach yields monotonically decreasing loss curves
NLL loss used in Algorithms 3 and 4, but other loss functions can be used

Results

Figure 3 displays the average negative log-likelihood loss for three simulation settings as a function of K.
Data thinning approaches correctly select the true value of K in all three settings, except for data thinning with 0.5 in the binomial PCA setting.
Increasing the value of remedies this issue.
The optimal value of is context-dependent.
Multi-fold data thinning selects the correct value of K more often than single-fold data thinning.
Revisiting a single-cell RNA sequencing dataset, heuristic solutions suggest retaining 5-7 principal components.
Data thinning provides a less heuristic approach for estimating the number of principal components.
Data thinning with 0.5 is used, and the relationship between E[ Ỹ (1) ] and E[ Ỹ (2) ] depends on the data processing.
Negative binomial data thinning can also be used.

Discussion

Proposed data thinning as a general technique for decomposing a random variable into two or more independent components
Applied data thinning to develop a version of cross-validation suitable for unsupervised learning
Data thinning preferable to sample splitting in supervised settings with small sample size
Power advantages to approaches such as sample splitting or selective inference in problems such as inference after variable selection
Estimate model’s test set error for a range of distributions
Framework can be used to study convolution-closed distributions
Considered impact of using incorrect value of nuisance parameter when performing data thinning
Future work to consider theoretical and empirical implications of performing data thinning with estimated nuisance parameter
Non-additive decompositions needed for distributions with bounded support
R package implementing data thinning and scripts to reproduce results
Proved (i), (ii) and (iii)
Simulation study described in Section 5
Preprocessing done to matrix X in Seurat tutorial
Preprocessing for X (1) and X (2) for data thinning alternative to Seurat tutorial
Principal components of Ỹ computed
Mean squared error loss used
Average mean squared error curves plotted as a function of K
Proportion of simulations that select correct value of K * using mean squared error loss plotted as a function of
Multi-fold thinning tends to select correct value of K more often than single-fold thinning
Preprocessing for X (1) and X (2) explained
Identity that makes Figure 6(a) and Figure 6(b) mathematically equivalent explained

Link to paper#

Abstract#

Paper Content#

Introduction#

A review of convolution-closed distributions#

Data thinning#

Effect of unknown nuisance parameters#

Comparison of data thinning and data fission#

Simulation study#

Methods#

Data thinning:#

Results#

Discussion#

Link to paper

Abstract

Paper Content

Introduction

A review of convolution-closed distributions

Data thinning

Effect of unknown nuisance parameters

Comparison of data thinning and data fission

Simulation study

Methods

Data thinning:

Results

Discussion