Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Proposes data thinning, a new approach for splitting an observation into two or more parts
- Data thinning can be applied to any observation drawn from a convolution-closed distribution
- Data thinning has applications to model selection, evaluation, and inference
- Cross-validation via data thinning provides an alternative to sample splitting
- Data thinning can be used to validate the results of unsupervised learning approaches
Paper Content
Introduction
- Data sets are growing in size and complexity
- There is a need for methods to validate outputs of complex models
- Sample splitting is a common method used to validate models
- Data fission is an alternative to sample splitting proposed by Leiner et al. [2022]
- Data fission involves decomposing a single observation into two parts
- In some cases, the two parts are independent
- Data thinning is a recipe for decomposing an observation into two independent parts that follow the same family of distributions as the original observation
- Data thinning can be applied to any convolution-closed distribution
- Data thinning can be extended to decompose an observation into an arbitrary number of independent parts
- Data thinning provides an alternative to sample splitting in unsupervised settings
A review of convolution-closed distributions
- Convolution-closed distributions are indexed by a parameter λ
- Many well-known distributions are convolution-closed
- Expectation is linear in λ for convolution-closed families
- The density of $G_{\lambda_1,\lambda_2,x}$ can be written down for any $F_\lambda$ with a known density function
- $G_{\lambda_1,\lambda_2,x}$ has a simple closed form for several well-known distributions
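The bullets above can be made concrete with a short worked statement. This is a sketch under the assumption that $G_{\lambda_1,\lambda_2,x}$ denotes the conditional distribution of one summand given the sum, which is how the notation is used in this summary; the symbols $X'$, $X''$, $f_\lambda$ and $g_{\lambda_1,\lambda_2,x}$ are introduced here only for illustration.

```latex
% Convolution-closed family: independent members add within the family.
X' \sim F_{\lambda_1},\quad X'' \sim F_{\lambda_2},\quad X' \perp\!\!\!\perp X''
\;\Longrightarrow\; X' + X'' \sim F_{\lambda_1 + \lambda_2}.

% G is the law of one summand given the sum; by Bayes' rule its density is
g_{\lambda_1,\lambda_2,x}(x') \;=\;
\frac{f_{\lambda_1}(x')\, f_{\lambda_2}(x - x')}{f_{\lambda_1+\lambda_2}(x)}.

% Poisson example: if F_\lambda = \mathrm{Poisson}(\lambda), then
G_{\lambda_1,\lambda_2,x} \;=\; \mathrm{Binomial}\!\left(x,\; \frac{\lambda_1}{\lambda_1+\lambda_2}\right),
\qquad
\mathbb{E}[X' + X''] = \lambda_1 + \lambda_2 .
```

The Poisson case also illustrates the linearity of the expectation in $\lambda$ noted above.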
Data thinning
- Algorithm 1 takes an observed realization of $X \sim F_\lambda$ and thins it into two parts
- Theorem 1 introduced, related to a proposal by Joe [1996]
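- $X = X^{(1)} + X^{(2)}$ with $X^{(1)} \perp\!\!\!\perp X^{(2)}$; $X^{(1)}$ and $X^{(2)}$ follow the same family of distributions as $X$, with parameters $\epsilon\lambda$ and $(1-\epsilon)\lambda$
- $\epsilon \in (0, 1)$ is a tuning parameter that governs the tradeoff between how much information is in $X^{(1)}$ and how much is in $X^{(2)}$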
- Table 2 summarizes data thinning proposal for several well-known distributions
- Remark 3: additional parameter may be required
- Remark 4: Binomial distribution not infinitely divisible
- Application 1: model evaluation for unsupervised learning using data thinning
- Example 2.1: thinned Poisson distribution
- Example 2.2: thinned binomial distribution
- Example 2.3: Application 1 with mean squared error loss
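The thinned Poisson case of Example 2.1 can be sketched in a few lines. This is a minimal illustration, not the paper's Algorithm 1 verbatim: it assumes the standard fact that, for Poisson counts, drawing the first part from a Binomial(x, ε) distribution yields independent Poisson parts. The function name `thin_poisson` and the simulation constants are hypothetical.

```python
import numpy as np

def thin_poisson(x, eps, rng):
    """Split Poisson counts x into (x1, x2) with x1 + x2 == x elementwise."""
    x = np.asarray(x)
    x1 = rng.binomial(x, eps)  # X1 | X = x ~ Binomial(x, eps)
    return x1, x - x1

rng = np.random.default_rng(0)
lam, eps = 10.0, 0.3  # hypothetical rate and tuning parameter
x = rng.poisson(lam, size=200_000)
x1, x2 = thin_poisson(x, eps, rng)

# Empirical checks: means near eps*lam and (1-eps)*lam, correlation near 0.
mean1, mean2 = x1.mean(), x2.mean()
corr = np.corrcoef(x1, x2)[0, 1]
print(mean1, mean2, corr)
```

Under these assumptions the two parts should have means close to ελ and (1 − ε)λ and an empirical correlation close to zero, which is what makes one part usable for fitting and the other for validation.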
Effect of unknown nuisance parameters
- Data thinning requires knowledge of a nuisance parameter
- Proposition 1: If an incorrect value of the variance is used for data thinning, $X^{(1)}$ and $X^{(2)}$ are correlated (positively or negatively)
- Proposition 2: If an incorrect value of $r$ is used for data thinning, $X^{(1)}$ and $X^{(2)}$ are correlated (positively or negatively)
- Proposition 3: If an incorrect value of $\alpha$ is used for data thinning, $X^{(1)}$ and $X^{(2)}$ are correlated (positively or negatively)
- Figure 1 verifies Propositions 1-3 empirically
Multi-fold data thinning
- Algorithm 2 and Theorem 2 provide a general form of multi-fold data thinning
- Example 3.2 provides multi-fold thinning of normal distribution
- Table 3 reveals simple form of multi-fold thinning for univariate distributions
- Application 2 uses multi-fold thinning to evaluate estimator μ(X) for unsupervised learning
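Both the nuisance-parameter propositions and multi-fold thinning can be explored numerically in the Gaussian case. The sketch below assumes $X \sim N(\mu, \sigma^2)$ and splits it into $M$ folds by adding centered Gaussian noise to $x/M$. This is one way to sample from the conditional law of $M$ independent $N(\mu/M, \sigma^2/M)$ draws given their sum, and it is not necessarily the exact form of the paper's Algorithm 2. The names `multifold_thin_gaussian`, `fold_cov` and `sigma2_assumed` are hypothetical. Under this construction, the covariance between two folds works out to $(\sigma^2 - \tilde\sigma^2)/M^2$, where $\tilde\sigma^2$ is the variance assumed during thinning.

```python
import numpy as np

def multifold_thin_gaussian(x, n_folds, sigma2_assumed, rng):
    """Split each x[i] into n_folds parts that sum exactly to x[i].

    Each fold is x / n_folds plus centered Gaussian noise whose scale is set
    by sigma2_assumed (the variance the analyst believes X has).
    """
    x = np.asarray(x, dtype=float)
    z = rng.normal(0.0, np.sqrt(sigma2_assumed / n_folds), size=(n_folds, x.size))
    z -= z.mean(axis=0, keepdims=True)  # centering forces the folds to sum to x
    return x / n_folds + z

rng = np.random.default_rng(1)
mu, sigma2, n_folds = 5.0, 4.0, 3  # hypothetical simulation setting
x = rng.normal(mu, np.sqrt(sigma2), size=200_000)

def fold_cov(sigma2_assumed):
    """Empirical covariance between the first two folds."""
    folds = multifold_thin_gaussian(x, n_folds, sigma2_assumed, rng)
    return np.cov(folds[0], folds[1])[0, 1]

cov_correct = fold_cov(sigma2)  # assumed variance equals the truth
cov_too_small = fold_cov(1.0)   # assumed variance below the truth
cov_too_large = fold_cov(9.0)   # assumed variance above the truth
print(cov_correct, cov_too_small, cov_too_large)
```

With the correct variance the folds should be approximately uncorrelated; assuming too small a variance should give positive covariance and assuming too large a variance should give negative covariance, in line with the sign behavior described in Proposition 1.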
Comparison of data thinning and data fission
- Leiner et al. [2022] provide alternate strategies to decompose X
- Under data fission, $X^{(1)}$ and $X^{(2)}$ are not independent in most cases
- Data thinning and data fission proposals are identical in the Poisson case
- Data fission does not guarantee that $X^{(1)}$ and $X^{(2)}$ resemble the distribution of $X$
- Under data fission, parameters of interest are entangled in the conditional distribution of $X^{(2)} \mid X^{(1)}$
- Tuning parameter is hard to interpret
- Data thinning provides a simple recipe to decompose convolution-closed distributions
- Data fission requires knowledge of nuisance parameters to do inference on E
Simulation study
- Data thinning can be applied to model selection and validation problems.
- Data thinning is attractive for unsupervised learning.
- Data thinning and multi-fold thinning are applied to unsupervised learning problems.
- Data thinning is compared to naive approaches that use the same data to fit and validate unsupervised models.
- Data thinning is applied to binomial distributed data and K-means clustering on Gamma distributed data.
Methods
- Goal is to quantify NLL associated with approximating a binomial matrix
- Algorithm 3 used to evaluate binomial PCA with NLL loss
- Inputs: positive integer K, matrices $X^{(\text{train})}$ and $X^{(\text{test})}$
- Algorithm 4 used to evaluate Gamma clusters with NLL loss
- Inputs: positive integer K, matrices $X^{(\text{train})}$ and $X^{(\text{test})}$
- Naive approach: Algorithms 3 and 4 applied with $X^{(\text{train})} = X^{(\text{test})}$ and $a^{(\text{train})} = a^{(\text{test})} = 1$
Data thinning:
- Thin data into two sets using Algorithm 1
- Apply Algorithms 3 and 4 to the two sets
- Thin data into M folds using Algorithm 2
- Apply Algorithms 3 and 4 to each fold
- Average the loss functions across the M folds
- Expect U-shaped NLL curves with data thinning
- Naive approach yields monotonically decreasing loss curves
- NLL loss used in Algorithms 3 and 4, but other loss functions can be used
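The contrast between the naive approach and data thinning can be sketched with a simplified stand-in for the procedure above. The assumptions are loud ones: Poisson counts replace the binomial and Gamma settings, a truncated SVD replaces the paper's Algorithms 3 and 4, and mean squared error replaces the NLL loss. The function `loss_curve` and all simulation constants are hypothetical.

```python
import numpy as np

def loss_curve(train, test, scale, ks):
    """MSE between `test` and the rescaled rank-k SVD fit of `train`, per k."""
    u, s, vt = np.linalg.svd(np.asarray(train, dtype=float), full_matrices=False)
    losses = []
    for k in ks:
        fit = (u[:, :k] * s[:k]) @ vt[:k]  # best rank-k approximation of train
        losses.append(np.mean((test - scale * fit) ** 2))
    return np.array(losses)

rng = np.random.default_rng(0)
n, p, true_k = 200, 50, 3
# Hypothetical setting: Poisson counts whose mean matrix has rank true_k.
mean = rng.gamma(2.0, 2.0, size=(n, true_k)) @ rng.gamma(2.0, 2.0, size=(true_k, p))
x = rng.poisson(mean)

ks = list(range(1, 11))
eps = 0.5

# Naive approach: the same matrix is used to fit and to validate.
naive = loss_curve(x, x, 1.0, ks)

# Data thinning: Poisson thinning gives independent train and test matrices.
x_train = rng.binomial(x, eps)
x_test = x - x_train
# E[x_test] = ((1 - eps) / eps) * E[x_train], so the fit is rescaled first.
thinned = loss_curve(x_train, x_test, (1 - eps) / eps, ks)

k_naive = ks[int(np.argmin(naive))]
k_thinned = ks[int(np.argmin(thinned))]
print(k_naive, k_thinned)
```

The naive curve can only decrease as K grows, since each extra SVD component reduces the reconstruction error on the data it was fit to, so it always favors the largest K considered. The thinned curve is evaluated on an independent test matrix, so one would typically expect it to turn upward once K exceeds the rank of the signal.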
Results
- Figure 3 displays the average negative log-likelihood loss for three simulation settings as a function of K.
- Data thinning approaches correctly select the true value of K in all three settings, except for data thinning with $\epsilon = 0.5$ in the binomial PCA setting.
- Increasing the value of $\epsilon$ remedies this issue.
- The optimal value of $\epsilon$ is context-dependent.
- Multi-fold data thinning selects the correct value of K more often than single-fold data thinning.
- Revisiting a single-cell RNA sequencing dataset, heuristic solutions suggest retaining 5-7 principal components.
- Data thinning provides a less heuristic approach for estimating the number of principal components.
- Data thinning with $\epsilon = 0.5$ is used, and the relationship between $E[\tilde{Y}^{(1)}]$ and $E[\tilde{Y}^{(2)}]$ depends on the data processing.
- Negative binomial data thinning can also be used.
Discussion
- Proposed data thinning as a general technique for decomposing a random variable into two or more independent components
- Applied data thinning to develop a version of cross-validation suitable for unsupervised learning
- Data thinning may be preferable to sample splitting in supervised settings with small sample size
- Data thinning may offer power advantages over approaches such as sample splitting or selective inference in problems such as inference after variable selection
- Data thinning can be used to estimate a model’s test set error for a range of distributions
- Framework can be used to study convolution-closed distributions
- Considered impact of using incorrect value of nuisance parameter when performing data thinning
- Future work to consider theoretical and empirical implications of performing data thinning with estimated nuisance parameter
- Non-additive decompositions needed for distributions with bounded support
- R package implementing data thinning and scripts to reproduce results
- Proved (i), (ii) and (iii)
- Simulation study described in Section 5
- Preprocessing done to matrix X in Seurat tutorial
- Preprocessing of $X^{(1)}$ and $X^{(2)}$ for the data thinning alternative to the Seurat tutorial
- Principal components of $\tilde{Y}$ computed
- Mean squared error loss used
- Average mean squared error curves plotted as a function of K
- Proportion of simulations that select the correct value $K^*$ using mean squared error loss plotted as a function of $\epsilon$
- Multi-fold thinning tends to select correct value of K more often than single-fold thinning
- Preprocessing for $X^{(1)}$ and $X^{(2)}$ explained
- Identity that makes Figure 6(a) and Figure 6(b) mathematically equivalent explained