- Dataset Condensation via Efficient Synthetic-Data Parameterization The great success of machine learning with massive amounts of data comes at a price of huge computation costs and storage for training and tuning. Recent studies on dataset condensation attempt to reduce the dependence on such massive data by synthesizing a compact training dataset. However, the existing approaches have fundamental limitations in optimization due to the limited representability of synthetic datasets without considering any data regularity characteristics. To this end, we propose a novel condensation framework that generates multiple synthetic data with a limited storage budget via efficient parameterization considering data regularity. We further analyze the shortcomings of the existing gradient matching-based condensation methods and develop an effective optimization technique for improving the condensation of training data information. We propose a unified algorithm that drastically improves the quality of condensed data against the current state-of-the-art on CIFAR-10, ImageNet, and Speech Commands. 8 authors · May 30, 2022
1 Mixture of experts models for multilevel data: modelling framework and approximation theory Multilevel data are prevalent in many real-world applications. However, it remains an open research problem to identify and justify a class of models that flexibly capture a wide range of multilevel data. Motivated by the versatility of the mixture of experts (MoE) models in fitting regression data, in this article we extend upon the MoE and study a class of mixed MoE (MMoE) models for multilevel data. Under some regularity conditions, we prove that the MMoE is dense in the space of any continuous mixed effects models in the sense of weak convergence. As a result, the MMoE has a potential to accurately resemble almost all characteristics inherited in multilevel data, including the marginal distributions, dependence structures, regression links, random intercepts and random slopes. In a particular case where the multilevel data is hierarchical, we further show that a nested version of the MMoE universally approximates a broad range of dependence structures of the random effects among different factor levels. 2 authors · Sep 29, 2022
- Frequency Prior Guided Matching: A Data Augmentation Approach for Generalizable Semi-Supervised Polyp Segmentation Automated polyp segmentation is essential for early diagnosis of colorectal cancer, yet developing robust models remains challenging due to limited annotated data and significant performance degradation under domain shift. Although semi-supervised learning (SSL) reduces annotation requirements, existing methods rely on generic augmentations that ignore polyp-specific structural properties, resulting in poor generalization to new imaging centers and devices. To address this, we introduce Frequency Prior Guided Matching (FPGM), a novel augmentation framework built on a key discovery: polyp edges exhibit a remarkably consistent frequency signature across diverse datasets. FPGM leverages this intrinsic regularity in a two-stage process. It first learns a domain-invariant frequency prior from the edge regions of labeled polyps. Then, it performs principled spectral perturbations on unlabeled images, aligning their amplitude spectra with this learned prior while preserving phase information to maintain structural integrity. This targeted alignment normalizes domain-specific textural variations, thereby compelling the model to learn the underlying, generalizable anatomical structure. Validated on six public datasets, FPGM establishes a new state-of-the-art against ten competing methods. It demonstrates exceptional zero-shot generalization capabilities, achieving over 10% absolute gain in Dice score in data-scarce scenarios. By significantly enhancing cross-domain robustness, FPGM presents a powerful solution for clinically deployable polyp segmentation under limited supervision. 3 authors · Jul 30, 2025
26 R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training Data mixing strategies have successfully reduced the costs involved in training language models. While promising, such methods suffer from two flaws. First, they rely on predetermined data domains (e.g., data sources, task types), which may fail to capture critical semantic nuances, leaving performance on the table. Second, these methods scale with the number of domains in a computationally prohibitive way. We address these challenges via R&B, a framework that re-partitions training data based on semantic similarity (Regroup) to create finer-grained domains, and efficiently optimizes the data composition (Balance) by leveraging a Gram matrix induced by domain gradients obtained throughout training. Unlike prior works, it removes the need for additional compute to obtain evaluation information such as losses or gradients. We analyze this technique under standard regularity conditions and provide theoretical insights that justify R&B's effectiveness compared to non-adaptive mixing approaches. Empirically, we demonstrate the effectiveness of R&B on five diverse datasets ranging from natural language to reasoning and multimodal tasks. With as little as 0.01% additional compute overhead, R&B matches or exceeds the performance of state-of-the-art data mixing strategies. 10 authors · May 1, 2025 1
1 On the Trajectory Regularity of ODE-based Diffusion Sampling Diffusion-based generative models use stochastic differential equations (SDEs) and their equivalent ordinary differential equations (ODEs) to establish a smooth connection between a complex data distribution and a tractable prior distribution. In this paper, we identify several intriguing trajectory properties in the ODE-based sampling process of diffusion models. We characterize an implicit denoising trajectory and discuss its vital role in forming the coupled sampling trajectory with a strong shape regularity, regardless of the generated content. We also describe a dynamic programming-based scheme to make the time schedule in sampling better fit the underlying trajectory structure. This simple strategy requires minimal modification to any given ODE-based numerical solvers and incurs negligible computational cost, while delivering superior performance in image generation, especially in 5sim 10 function evaluations. 5 authors · May 18, 2024
- Unearthing InSights into Mars: Unsupervised Source Separation with Limited Data Source separation involves the ill-posed problem of retrieving a set of source signals that have been observed through a mixing operator. Solving this problem requires prior knowledge, which is commonly incorporated by imposing regularity conditions on the source signals, or implicitly learned through supervised or unsupervised methods from existing data. While data-driven methods have shown great promise in source separation, they often require large amounts of data, which rarely exists in planetary space missions. To address this challenge, we propose an unsupervised source separation scheme for domains with limited data access that involves solving an optimization problem in the wavelet scattering covariance representation spacex2014an interpretable, low-dimensional representation of stationary processes. We present a real-data example in which we remove transient, thermally-induced microtiltsx2014known as glitchesx2014from data recorded by a seismometer during NASA's InSight mission on Mars. Thanks to the wavelet scattering covariances' ability to capture non-Gaussian properties of stochastic processes, we are able to separate glitches using only a few glitch-free data snippets. 6 authors · Jan 27, 2023
- Learning Internal Biological Neuron Parameters and Complexity-Based Encoding for Improved Spiking Neural Networks Performance This study introduces a novel approach by replacing the traditional perceptron neuron model with a biologically inspired probabilistic meta neuron, where the internal neuron parameters are jointly learned, leading to improved classification accuracy of spiking neural networks (SNNs). To validate this innovation, we implement and compare two SNN architectures: one based on standard leaky integrate-and-fire (LIF) neurons and another utilizing the proposed probabilistic meta neuron model. As a second key contribution, we present a new biologically inspired classification framework that uniquely integrates SNNs with Lempel-Ziv complexity (LZC) a measure closely related to entropy rate. By combining the temporal precision and biological plausibility of SNNs with the capacity of LZC to capture structural regularity, the proposed approach enables efficient and interpretable classification of spatiotemporal neural data, an aspect not addressed in existing works. We consider learning algorithms such as backpropagation, spike-timing-dependent plasticity (STDP), and the Tempotron learning rule. To explore neural dynamics, we use Poisson processes to model neuronal spike trains, a well-established method for simulating the stochastic firing behavior of biological neurons. Our results reveal that depending on the training method, the classifier's efficiency can improve by up to 11.00%, highlighting the advantage of learning additional neuron parameters beyond the traditional focus on weighted inputs alone. 3 authors · Aug 8, 2025
2 Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes I argue that data becomes temporarily interesting by itself to some self-improving, but computationally limited, subjective observer once he learns to predict or compress the data in a better way, thus making it subjectively simpler and more beautiful. Curiosity is the desire to create or discover more non-random, non-arbitrary, regular data that is novel and surprising not in the traditional sense of Boltzmann and Shannon but in the sense that it allows for compression progress because its regularity was not yet known. This drive maximizes interestingness, the first derivative of subjective beauty or compressibility, that is, the steepness of the learning curve. It motivates exploring infants, pure mathematicians, composers, artists, dancers, comedians, yourself, and (since 1990) artificial systems. 1 authors · Dec 23, 2008