Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Proposes data thinning, a new approach for splitting an observation into two or more parts
  • Data thinning can be applied to any observation drawn from a “convolution closed” distribution
  • Data thinning has applications to model selection, evaluation, and inference
  • Cross-validation via data thinning provides an alternative to sample splitting
  • Data thinning can be used to validate the results of unsupervised learning approaches

Paper Content

Introduction

  • Data sets are growing in size and complexity
  • There is a need for methods to validate outputs of complex models
  • Sample splitting is a common method used to validate models
  • Data fission is an alternative to sample splitting proposed by Leiner et al. [2022]
  • Data fission involves decomposing a single observation into two parts
  • In some cases, the two parts are independent
  • Data thinning is a recipe for decomposing an observation into two parts that are independent, sum to the original observation, and follow the same distribution as the original observation up to a known scaling of a parameter
  • Data thinning can be applied to any convolution-closed distribution
  • Data thinning can be extended to decompose an observation into an arbitrary number of independent parts
  • Data thinning provides an alternative to sample splitting in unsupervised settings

A review of convolution-closed distributions

  • Convolution-closed distributions are indexed by a parameter λ
  • Many well-known distributions are convolution-closed
  • Expectation is linear in λ for convolution-closed families
  • Density of $G_{\lambda_1,\lambda_2,x}$ can be written down for any $F_\lambda$ with a known density function
  • $G_{\lambda_1,\lambda_2,x}$ has a simple closed form for several well-known distributions
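
As a worked illustration of the bullets above, the definitions can be written out in the notation of this summary. The symbols $X'$ and $X''$ are assumptions of this sketch (placeholder names for two independent draws), and the Poisson line is included only as one familiar instance of a simple closed form.

```latex
\begin{align*}
X' \sim F_{\lambda_1},\; X'' \sim F_{\lambda_2},\; X' \perp\!\!\!\perp X''
  &\;\Longrightarrow\; X' + X'' \sim F_{\lambda_1 + \lambda_2}, \\
G_{\lambda_1,\lambda_2,x} &:= \text{the distribution of } X' \mid X' + X'' = x, \\
\text{Poisson: } F_\lambda = \mathrm{Poisson}(\lambda)
  &\;\Longrightarrow\; G_{\lambda_1,\lambda_2,x}
     = \mathrm{Binomial}\!\left(x,\ \tfrac{\lambda_1}{\lambda_1 + \lambda_2}\right).
\end{align*}
```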

Data thinning

  • Algorithm 1 takes a realization of $X \sim F_\lambda$ and thins it into two parts, $X^{(1)}$ and $X^{(2)}$
  • Theorem 1 introduced, related to a proposal by Joe [1996]
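  • $X = X^{(1)} + X^{(2)}$ with $X^{(1)} \perp\!\!\!\perp X^{(2)}$; $X^{(1)} \sim F_{\epsilon\lambda}$ and $X^{(2)} \sim F_{(1-\epsilon)\lambda}$, so both parts stay in the same family as $X$
  • ε ∈ (0, 1) is a tuning parameter that governs a tradeoff between how much information is in $X^{(1)}$ and $X^{(2)}$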
  • Table 2 summarizes data thinning proposal for several well-known distributions
  • Remark 3: additional parameter may be required
  • Remark 4: Binomial distribution not infinitely divisible
  • Application 1: model evaluation for unsupervised learning using data thinning
  • Example 2.1: thinned Poisson distribution
  • Example 2.2: thinned binomial distribution
  • Example 2.3: Application 1 with mean squared error loss
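
The Poisson case listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's R package: the function name `thin_poisson` and the values of `lam` and `eps` are assumptions chosen for the example. It relies on the standard fact that, for Poisson data, drawing the first part as Binomial(x, ε) gives two independent Poisson parts.

```python
import numpy as np

def thin_poisson(x, eps, rng):
    """Split Poisson counts x into two parts, drawing X1 | X = x ~ Binomial(x, eps)."""
    x = np.asarray(x)
    x1 = rng.binomial(x, eps)   # marginally Poisson(eps * lam)
    x2 = x - x1                 # marginally Poisson((1 - eps) * lam)
    return x1, x2

rng = np.random.default_rng(0)
lam, eps = 10.0, 0.3            # illustrative values, not taken from the paper
x = rng.poisson(lam, size=100_000)
x1, x2 = thin_poisson(x, eps, rng)

print("mean of X1:", x1.mean())                      # should be close to eps * lam
print("mean of X2:", x2.mean())                      # should be close to (1 - eps) * lam
print("corr(X1, X2):", np.corrcoef(x1, x2)[0, 1])    # should be close to 0
```

A larger `eps` puts more of the information in the first part, which is the tradeoff the tuning parameter controls.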

Effect of unknown nuisance parameters

  • Data thinning requires knowledge of a nuisance parameter
  • Proposition 1: If an incorrect value of the variance is used for data thinning, $X^{(1)}$ and $X^{(2)}$ are positively or negatively correlated
  • Proposition 2: If an incorrect value of r is used for data thinning, $X^{(1)}$ and $X^{(2)}$ are positively or negatively correlated
  • Proposition 3: If an incorrect value of α is used for data thinning, $X^{(1)}$ and $X^{(2)}$ are positively or negatively correlated
  • Figure 1 verifies Propositions 1-3 empirically
  • Algorithm 2 and Theorem 2 provide a general form of multi-fold data thinning
  • Example 3.2 provides multi-fold thinning of normal distribution
  • Table 3 reveals simple form of multi-fold thinning for univariate distributions
  • Application 2 uses multi-fold thinning to evaluate estimator μ(X) for unsupervised learning
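
The effect described in Proposition 1 can be checked numerically for normal data. The sketch below is an assumption-laden illustration rather than the paper's code: `thin_normal` and `sigma2_assumed` are hypothetical names, and the reference value ε(1 − ε)(σ² − σ̃²) for the covariance is derived here from the thinning step $X^{(1)} \mid X = x \sim N(\epsilon x, \epsilon(1-\epsilon)\tilde\sigma^2)$, not quoted from the paper.

```python
import numpy as np

def thin_normal(x, eps, sigma2_assumed, rng):
    """Thin normal data, plugging sigma2_assumed into the thinning step."""
    x = np.asarray(x, dtype=float)
    noise_sd = np.sqrt(eps * (1 - eps) * sigma2_assumed)
    x1 = eps * x + rng.normal(0.0, noise_sd, size=x.shape)
    x2 = x - x1
    return x1, x2

rng = np.random.default_rng(1)
mu, sigma2, eps = 2.0, 4.0, 0.5          # true variance is 4 (illustrative)
x = rng.normal(mu, np.sqrt(sigma2), size=200_000)

covs = {}
for sigma2_assumed in (1.0, 4.0, 9.0):   # too small, correct, too large
    x1, x2 = thin_normal(x, eps, sigma2_assumed, rng)
    covs[sigma2_assumed] = np.cov(x1, x2)[0, 1]
    reference = eps * (1 - eps) * (sigma2 - sigma2_assumed)
    print(f"assumed variance {sigma2_assumed}: "
          f"sample cov = {covs[sigma2_assumed]:.3f}, reference = {reference:.3f}")
```

Under these assumptions, an assumed variance that is too small leaves the two parts positively correlated, one that is too large makes them negatively correlated, and the correct value gives a covariance near zero.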

Comparison of data thinning and data fission

  • Leiner et al. [2022] provide alternate strategies to decompose X
  • Under data fission, $X^{(1)}$ and $X^{(2)}$ are not independent in most cases
  • Data thinning and data fission proposals are identical in the Poisson case
  • Data fission does not guarantee that $X^{(1)}$ and $X^{(2)}$ resemble the distribution of $X$
  • Under data fission, parameters of interest are entangled in the conditional distribution of $X^{(2)} \mid X^{(1)}$
  • The data fission tuning parameter is hard to interpret
  • Data thinning provides a simple recipe to decompose convolution-closed distributions
  • Data fission requires knowledge of nuisance parameters to do inference on E

Simulation study

  • Data thinning can be applied to model selection and validation problems.
  • Data thinning is attractive for unsupervised learning.
  • Data thinning and multi-fold thinning are applied to unsupervised learning problems.
  • Data thinning is compared to naive approaches that use the same data to fit and validate unsupervised models.
  • Data thinning is applied to binomial distributed data and K-means clustering on Gamma distributed data.
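
For the Gamma-distributed setting mentioned above, a thinning step can be sketched with a Beta draw. This is a hedged illustration: `thin_gamma` is a hypothetical helper, the parameter values are made up, and the shape parameter is treated as known. It uses the standard result that if two independent Gamma variables share a rate, their ratio to the sum is Beta distributed and independent of the sum.

```python
import numpy as np

def thin_gamma(x, eps, shape, rng):
    """Split Gamma(shape, rate) draws into two parts using a Beta-distributed fraction."""
    x = np.asarray(x, dtype=float)
    frac = rng.beta(eps * shape, (1 - eps) * shape, size=x.shape)
    x1 = frac * x          # marginally Gamma(eps * shape, rate)
    x2 = x - x1            # marginally Gamma((1 - eps) * shape, rate)
    return x1, x2

rng = np.random.default_rng(2)
shape, rate, eps = 6.0, 2.0, 0.5                  # illustrative values
x = rng.gamma(shape, 1.0 / rate, size=100_000)    # NumPy uses scale = 1 / rate
x1, x2 = thin_gamma(x, eps, shape, rng)

print("mean of X1:", x1.mean())                     # should be close to eps * shape / rate
print("corr(X1, X2):", np.corrcoef(x1, x2)[0, 1])   # should be close to 0
```

Note that the rate does not enter the thinning step, only the shape does.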

Methods

  • Goal is to quantify the negative log-likelihood (NLL) loss associated with approximating a binomial matrix
  • Algorithm 3 used to evaluate binomial PCA with NLL loss
  • Inputs: positive integer K, matrices $X^{(\text{train})}$ and $X^{(\text{test})}$
  • Algorithm 4 used to evaluate Gamma clusters with NLL loss
  • Inputs: positive integer K, matrices $X^{(\text{train})}$ and $X^{(\text{test})}$
  • Naive approach: Algorithms 3 and 4 applied with $X^{(\text{train})} = X^{(\text{test})}$ and $a^{(\text{train})} = a^{(\text{test})} = 1$

Data thinning:

  • Thin data into two sets using Algorithm 1
  • Apply Algorithms 3 and 4 to the two sets
  • Thin data into M folds using Algorithm 2
  • Apply Algorithms 3 and 4 to each fold
  • Average the loss functions across the M folds
  • Expect U-shaped NLL curves with data thinning
  • Naive approach yields monotonically decreasing loss curves
  • NLL loss used in Algorithms 3 and 4, but other loss functions can be used
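
The multi-fold procedure above can be sketched on toy Poisson data. This is not Algorithm 2, 3, or 4 from the paper: it swaps in mean squared error for the NLL loss, uses two toy estimators instead of PCA or clustering, and all function names are hypothetical. The folds are drawn with sequential binomial draws, which is equivalent to a single equal-probability multinomial split of each count.

```python
import numpy as np

def multifold_thin_poisson(x, n_folds, rng):
    """Split Poisson counts into n_folds equal-weight parts that sum to x."""
    remaining = np.asarray(x).copy()
    folds = []
    for m in range(n_folds - 1):
        part = rng.binomial(remaining, 1.0 / (n_folds - m))
        folds.append(part)
        remaining = remaining - part
    folds.append(remaining)
    return np.stack(folds)          # shape: (n_folds, len(x))

def thinned_cv_mse(x, estimator, n_folds, rng):
    """Fit on x minus one fold, evaluate on the held-out fold, average over folds."""
    folds = multifold_thin_poisson(x, n_folds, rng)
    losses = []
    for test in folds:
        train = x - test
        # The training mean is (n_folds - 1) times the test mean, so rescale.
        pred = estimator(train) / (n_folds - 1)
        losses.append(np.mean((test - pred) ** 2))
    return float(np.mean(losses))

def constant_fit(train):
    """Simple model: one shared mean for every observation."""
    return np.full(train.shape, train.mean())

def saturated_fit(train):
    """Overly flexible model: each observation is its own mean estimate."""
    return train.astype(float)

rng = np.random.default_rng(3)
x = rng.poisson(5.0, size=5_000)    # every observation truly shares one mean

naive_constant = float(np.mean((x - constant_fit(x)) ** 2))
naive_saturated = float(np.mean((x - saturated_fit(x)) ** 2))
cv_constant = thinned_cv_mse(x, constant_fit, n_folds=5, rng=rng)
cv_saturated = thinned_cv_mse(x, saturated_fit, n_folds=5, rng=rng)

print("naive loss   (constant, saturated):", naive_constant, naive_saturated)
print("thinned loss (constant, saturated):", cv_constant, cv_saturated)
```

In this toy setup the naive comparison always favors the more flexible model, because it is fit and scored on the same data, while the thinned comparison penalizes it. That is the same mechanism behind the monotone naive curves and U-shaped thinned curves described above.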

Results

  • Figure 3 displays the average negative log-likelihood loss for three simulation settings as a function of K.
  • Data thinning approaches correctly select the true value of K in all three settings, except for data thinning with ε = 0.5 in the binomial PCA setting.
  • Increasing the value of ε remedies this issue.
  • The optimal value of ε is context-dependent.
  • Multi-fold data thinning selects the correct value of K more often than single-fold data thinning.
  • Revisiting a single-cell RNA sequencing dataset, heuristic solutions suggest retaining 5-7 principal components.
  • Data thinning provides a less heuristic approach for estimating the number of principal components.
  • Data thinning with ε = 0.5 is used, and the relationship between $E[\tilde{Y}^{(1)}]$ and $E[\tilde{Y}^{(2)}]$ depends on the data processing.
  • Negative binomial data thinning can also be used.

Discussion

  • Proposed data thinning as a general technique for decomposing a random variable into two or more independent components
  • Applied data thinning to develop a version of cross-validation suitable for unsupervised learning
  • Data thinning preferable to sample splitting in supervised settings with small sample size
  • Possible power advantages over approaches such as sample splitting or selective inference in problems such as inference after variable selection
  • Estimate model’s test set error for a range of distributions
  • Framework can be used to study convolution-closed distributions
  • Considered impact of using incorrect value of nuisance parameter when performing data thinning
  • Future work to consider theoretical and empirical implications of performing data thinning with estimated nuisance parameter
  • Non-additive decompositions needed for distributions with bounded support
  • R package implementing data thinning and scripts to reproduce results
  • Proved (i), (ii) and (iii)
  • Simulation study described in Section 5
  • Preprocessing done to matrix X in Seurat tutorial
  • Preprocessing for $X^{(1)}$ and $X^{(2)}$ for data thinning alternative to Seurat tutorial
  • Principal components of $\tilde{Y}$ computed
  • Mean squared error loss used
  • Average mean squared error curves plotted as a function of K
  • Proportion of simulations that select correct value of $K^*$ using mean squared error loss plotted as a function of ε
  • Multi-fold thinning tends to select correct value of K more often than single-fold thinning
  • Preprocessing for $X^{(1)}$ and $X^{(2)}$ explained
  • Identity that makes Figure 6(a) and Figure 6(b) mathematically equivalent explained