
Differentially Private Synthetic Data via APIs 4: Tabular Data

Toan Tran, Arturs Backurs, Zinan Lin, Victor Reis, Li Xiong, Sergey Yekhanin

cs.LG
#1918 of 5669 · cs.LG
Tournament Score: 1437±42
Win Rate: 53% (10 wins, 9 losses, 19 matches)
Rating: 7.2 / 10
Significance 7.5 · Rigor 7 · Novelty 6.8 · Clarity 8

Abstract

This paper investigates the problem of generating synthetic tabular data with differential privacy (DP) guarantees, enabling data sharing in sensitive domains. Despite extensive study, state-of-the-art methods often focus on minimizing low-order marginal query errors and overlook the challenges posed by high-order correlations. To address this gap, we extend the Private Evolution (PE) framework, originally developed for DP-compliant image and text synthesis, to tabular data. We introduce Tab-PE -- an algorithm for synthetic tabular data generation under DP constraints. Tab-PE iteratively improves a candidate dataset via an evolutionary process that leverages tabular-specialized operators to produce variations, privately scores them, and selects the highest-quality samples to retain and propagate. In contrast to the original PE, which relies on large foundation models, Tab-PE employs heuristic operators with significantly lower computational costs, making PE more practical and scalable for tabular data. Through extensive experiments on real-world and simulation datasets, we demonstrate that Tab-PE substantially outperforms prior baselines on datasets exhibiting high-order correlations. Compared to the best baseline -- AIM, Tab-PE improves classification accuracy by up to 10% while running 28 times faster.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Tab-PE — Differentially Private Synthetic Tabular Data via Private Evolution

1. Core Contribution

The paper identifies a genuine blind spot in the DP synthetic tabular data literature: existing state-of-the-art methods (AIM, PrivMRF, PrivGSD, etc.) are fundamentally built on low-order marginal queries, and standard benchmarks (Adult, Bank, Census) are dominated by low-order correlations, making this limitation invisible. The authors propose Tab-PE, which adapts the Private Evolution (PE) framework to tabular data using lightweight heuristic operators — random initialization, random-walk variation with scheduled mutation decay, and DP nearest-neighbor histogram scoring — instead of the foundation models used in PE for images/text.

The key insight is that full-record nearest-neighbor matching implicitly captures high-dimensional dependencies without explicitly enumerating exponentially many marginal queries. This sidesteps the curse of dimensionality that plagues marginal-based approaches when high-order correlations matter.
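
The nearest-neighbor scoring idea can be illustrated with a minimal sketch. It assumes purely numerical records, plain squared Euclidean distance, and add/remove adjacency; the function name `dp_nn_histogram` and the toy data are hypothetical and are not taken from the paper's code. Each private record votes for its closest synthetic candidate, and Gaussian noise is added to the vote counts.

```python
import numpy as np

def dp_nn_histogram(private, candidates, sigma, rng):
    """Noisy count of private records whose nearest neighbor is each candidate.

    Hypothetical helper: squared Euclidean distance stands in for the
    paper's mixed categorical/numerical metric.
    """
    # Pairwise squared distances, shape (n_private, n_candidates).
    diffs = private[:, None, :] - candidates[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    # Each private record casts exactly one vote, so adding or removing a
    # record changes a single bin by 1 (L2 sensitivity 1 under that adjacency).
    votes = np.bincount(dists.argmin(axis=1), minlength=len(candidates))
    # Gaussian mechanism: independent noise on every bin.
    return votes + rng.normal(0.0, sigma, size=len(candidates))

rng = np.random.default_rng(0)
private = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]])
candidates = np.array([[0.0, 0.1], [0.9, 1.0], [5.0, 5.0]])
# sigma=0.0 disables the noise so the raw votes are visible; it gives no privacy.
noisy = dp_nn_histogram(private, candidates, sigma=0.0, rng=rng)
```

Candidates with high noisy counts sit close to many private records as whole rows, which is why no explicit list of marginal queries is needed. In a real run `sigma` would be calibrated to the privacy budget.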

2. Methodological Rigor

Strengths:

  • The privacy analysis is clean and well-grounded, reusing the standard Gaussian mechanism composition from the original PE framework. The sensitivity analysis (each private sample affects exactly one histogram bin) is straightforward and correct.
  • The formal definition of k-way correlation via total correlation gaps (Equation 1) and Proposition A.1 connecting tree depth gaps to correlation order provides a principled way to characterize when high-order correlations exist.
  • The experimental design is thorough: XOR stress tests provide clean theoretical intuition, SCM simulations offer realistic but controlled settings with known ground truth, and real-world datasets validate practical applicability.
  • The two-stage selection strategy (sampling then ranking) is well-motivated and ablated.
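
To see why an XOR-style dataset stresses marginal-based methods, the sketch below enumerates the simplest such construction, y = x1 XOR x2 with uniform input bits. This is an assumed toy setup and may differ from the paper's actual stress test. Every 2-way marginal is exactly uniform, so low-order queries carry no information about the dependency, while the 3-way joint is far from uniform.

```python
from collections import Counter
from itertools import combinations, product

# The four equally likely records of y = x1 XOR x2 with uniform input bits.
records = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2)]

def marginal(cols):
    """Exact marginal distribution over the given column indices."""
    counts = Counter(tuple(r[c] for c in cols) for r in records)
    return {cell: n / len(records) for cell, n in counts.items()}

# Every 2-way marginal is uniform over its four cells.
uniform_pair = {cell: 0.25 for cell in product((0, 1), repeat=2)}
pairwise_uniform = all(
    marginal(pair) == uniform_pair for pair in combinations(range(3), 2)
)

# The 3-way joint puts mass on only 4 of the 8 possible cells.
joint_support = len(marginal((0, 1, 2)))
```

Here `pairwise_uniform` evaluates to `True` and `joint_support` to 4: a synthesizer that matches all 1-way and 2-way marginals perfectly can still produce labels that are independent of the features.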

Concerns:

  • The distance metric (Equation 4) uses a simple weighted combination of Hamming distance for categoricals and normalized squared Euclidean for numericals. This is acknowledged as a limitation but is a meaningful one — in high-dimensional spaces with many irrelevant features, this metric may degrade significantly.
  • The method assumes known numerical bounds and (effectively) known class distributions. While the authors show robustness to noisy class counts, the bounds assumption is non-trivial in practice.
  • The number of synthetic samples is set to 10-20% of the original dataset at ε=1.0, which could be limiting for downstream tasks requiring larger datasets. The oversampling experiment (random duplication) is simplistic.
  • Hyperparameter sensitivity is explored but the method has many parameters (T, T_sampling, m, μ_init, μ_final, γ, λ), and optimal settings may vary across datasets.
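
The distance metric discussed in the first concern can be sketched as follows. This is an illustrative reading of "Hamming for categoricals plus normalized squared Euclidean for numericals", not the paper's Equation 4: the function name `mixed_distance`, the `weight` parameter, and the choice to normalize by the known column bounds are assumptions.

```python
import numpy as np

def mixed_distance(a_cat, b_cat, a_num, b_num, lower, upper, weight=1.0):
    """Hypothetical mixed-type distance between two records."""
    # Hamming part: number of categorical attributes that differ.
    hamming = np.sum(np.asarray(a_cat) != np.asarray(b_cat))
    # Numerical part: rescale each column difference by its known range,
    # then take the squared Euclidean norm.
    span = np.asarray(upper, dtype=float) - np.asarray(lower, dtype=float)
    scaled = (np.asarray(a_num, dtype=float) - np.asarray(b_num, dtype=float)) / span
    return float(hamming + weight * np.sum(scaled ** 2))

# Two records that differ in one categorical value and by 10 units on a
# numerical column bounded in [0, 100].
d = mixed_distance(
    ["red", "S"], ["red", "M"],
    [30.0, 50_000.0], [40.0, 50_000.0],
    lower=[0.0, 0.0], upper=[100.0, 100_000.0],
)
```

In this sketch every attribute contributes on the same footing, so irrelevant columns add to the distance as much as informative ones, which is the degradation the concern describes. The reliance on `lower` and `upper` also shows why the known-bounds assumption matters.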

3. Potential Impact

Practical significance:

  • The 28× speedup over AIM while achieving better accuracy on high-order datasets is compelling. Running entirely on CPUs without GPUs significantly lowers the barrier to adoption.
  • The identification that standard benchmarks mask a fundamental limitation of existing methods is an important methodological contribution that could redirect evaluation practices in the field.
  • The new benchmark suite (XOR, SCM simulations, curated high-order real-world datasets) fills a genuine evaluation gap.

Broader applicability:

  • Healthcare, finance, and other sensitive domains often have complex multi-feature interactions (e.g., drug interactions, financial fraud patterns) where high-order correlations are critical. Tab-PE directly addresses this need.
  • The demonstration that PE can work with trivially simple operators (no foundation models, no simulators) expands the conceptual scope of the PE framework.

Limitations in impact:

  • On standard low-order benchmarks, Tab-PE is ~1% behind AIM, which may limit adoption in settings where users don't know a priori whether their data has high-order correlations.
  • The gap to non-private upper bounds remains large (e.g., 30% accuracy gap on Artificial Characters), suggesting the method, while better than alternatives, is still far from solving the problem.

4. Timeliness & Relevance

The paper addresses a timely need. As DP synthetic data moves toward real-world deployment, the limitations of marginal-based methods become increasingly important. The PE framework has gained significant traction for images and text, and extending it to tabular data, the most common data modality in practice, is a natural and important step. The concurrent finding by Swanberg et al. (2025) that LLM-based PE for tabular data underperforms makes this contribution more valuable, as it shows that the right API design matters more than model sophistication.

5. Strengths & Limitations

Key strengths:

  • Clear identification of a systematic evaluation gap in the field
  • Elegant simplicity of the method — no foundation models, no training, CPU-only
  • Comprehensive experimental coverage across simulation and real-world settings
  • Strong computational efficiency advantages
  • Well-structured two-stage refinement with principled ablations
  • Open-source code

Notable weaknesses:

  • The method's advantage is largely confined to datasets with demonstrable high-order correlations; for the more common low-order case, it offers no improvement
  • The naive distance metric is a significant limitation for very high-dimensional or sparse data
  • The flattened MNIST experiment, while impressive, conflates "tabular" with "flattened image" — the practical relevance is questionable
  • Limited theoretical analysis of convergence or approximation quality beyond the privacy guarantee
  • The datasets with "high-order correlations" are somewhat cherry-picked; the paper would benefit from a more systematic survey of how common such correlations are in practice

Overall Assessment:

This is a solid contribution that identifies a real problem, proposes a clean and practical solution, and demonstrates its effectiveness convincingly. The impact is somewhat bounded by the specificity of the setting (high-order correlations) and the remaining accuracy gap to non-private baselines, but the efficiency gains and the reframing of evaluation practices could have lasting influence on the field.

Rating: 7.2 / 10
Significance 7.5 · Rigor 7 · Novelty 6.8 · Clarity 8

Generated Jun 9, 2026

Comparison History (19)

    Won vs. Latent World Recovery for Multimodal Learning with Missing Modalities

    Paper 2 addresses a critical bottleneck in sensitive data sharing across numerous fields (healthcare, finance, etc.) by advancing differentially private tabular data synthesis. Its approach successfully captures complex high-order correlations while delivering highly quantifiable and impressive improvements (up to 10% better accuracy and 28x faster than the state-of-the-art baseline). While Paper 1 offers a valuable methodology for missing modalities in multi-omics, Paper 2's broader applicability to virtually any domain utilizing sensitive tabular data, combined with its substantial scalability and efficiency gains, suggests a higher potential for widespread cross-disciplinary impact.

    gemini-3.1-pro-preview · Jun 11, 2026

    Lost vs. Bootstrapped Monitoring: Leveraging Transparent Reasoning to Oversee Stronger AI Agents

    Paper 2 likely has higher impact due to its novelty and timeliness in AI safety/control, proposing a general oversight protocol (bootstrapped monitoring) relevant to rapidly advancing frontier agents. It targets a widely recognized real-world risk (monitoring capability gaps and collusion) and could influence both alignment research and deployment practices across domains where agents act. While Paper 1 is methodologically solid and practically useful for DP tabular synthesis, it is a more incremental advance within an established subfield with narrower cross-field reach than AI control paradigms.

    gpt-5.2 · Jun 11, 2026

    Lost vs. Breaking the Tokenizer Barrier: On-Policy Distillation across Model Families

    Paper 2 addresses a fundamental limitation in LLM knowledge distillation—the tokenizer barrier between model families—which has broad implications across the entire LLM ecosystem. Enabling cross-tokenizer on-policy distillation unlocks numerous teacher-student combinations previously impossible, with wide applicability in post-training pipelines. Paper 1, while solid and practical for DP synthetic tabular data, represents a more incremental extension of the Private Evolution framework to a specific data modality. Paper 2's impact spans more broadly across the rapidly growing LLM field and enables a paradigm shift in how distillation is conducted.

    claude-opus-4-6 · Jun 9, 2026

    Lost vs. Assessing Sample Quality in Conditional Generation under Compositional Shift

    Paper 1 addresses a fundamental bottleneck in applying generative AI to scientific discovery: evaluating generated samples in extrapolative regimes where no ground truth exists. By providing a novel, reference-free trust score, it unlocks broader applications in fields like biological imaging and materials science. While Paper 2 offers significant improvements in differentially private tabular data generation, Paper 1's conceptual innovation and direct relevance to accelerating empirical scientific research give it a higher potential for broad scientific impact.

    gemini-3.1-pro-preview · Jun 9, 2026

    Lost vs. When Do Local Score Models Extrapolate Across Size? A Diagnostic Theory and Benchmark

    Paper 2 has higher potential impact due to its broadly applicable theoretical framing of size extrapolation in local score-based generative models, a key pain point in scientific ML (physics, chemistry, materials). It contributes new theory (quasi-locality via Gaussian-smoothed scores, size-uniform comparison theorem) plus a diagnostic benchmark (FDLF) with exact, controllable ground truth, enabling rigorous evaluation across methods. The insights can influence model design and evaluation standards across diffusion/score modeling. Paper 1 is valuable and practical for DP tabular synthesis, but is more application-narrow and more incremental relative to an active DP synthesis landscape.

    gpt-5.2 · Jun 9, 2026

    Won vs. Lost in the Non-convex Loss Landscape: How to Fine-tune the Large Time Series Model?

    Paper 2 likely has higher impact: it advances a timely, high-stakes area (differentially private synthetic tabular data) with broad applicability in healthcare, finance, and public-sector data sharing. Extending Private Evolution to tabular data while removing reliance on large foundation models is a notable innovation with clear practicality (much faster) and addresses an important unmet need (high-order correlations). The methodological framing (DP guarantees + extensive evaluation) and cross-field relevance of DP data release give it wider potential reach than Paper 1’s more domain-specific fine-tuning technique for large time-series models.

    gpt-5.2 · Jun 9, 2026

    Won vs. The Confidence Trap: Calibration Attacks for Graph Neural Networks

    Paper 2 addresses the broadly important problem of differentially private synthetic data generation for tabular data, which has wide real-world applications across healthcare, finance, and government. It extends the Private Evolution framework to a new domain with practical improvements (28x faster, 10% accuracy gain), addressing a gap in handling high-order correlations. Paper 1, while technically sound, targets a niche problem (calibration attacks on GNNs) with a narrower audience. Paper 2's combination of privacy guarantees, practical scalability, and broad applicability across sensitive data domains gives it higher potential impact.

    claude-opus-4-6 · Jun 9, 2026

    Won vs. Causal Longitudinal Prior-Fitted Networks for Counterfactual Outcome Prediction

    Paper 2 likely has higher impact: it advances a broadly applicable, timely problem (differentially private synthetic tabular data) with direct real-world utility across healthcare, finance, and public-sector data sharing. Methodologically, it extends an existing DP synthesis framework to a new data modality with a practical, compute-efficient design and reports substantial empirical gains (utility and speed) on real datasets, addressing high-order correlation shortcomings of prior work. Paper 1 is innovative but more specialized to longitudinal causal forecasting and depends on synthetic pretraining assumptions, likely narrowing immediate adoption.

    gpt-5.2 · Jun 9, 2026

    Lost vs. Consistency Training Along the Transformer Stack

    Paper 1 introduces a novel, unified framework for AI alignment through consistency training across transformer internals, addressing multiple safety threats with cross-threat generalization. This has broad implications for AI safety—a critically timely field—and provides both practical techniques and mechanistic insights. Paper 2 makes a solid contribution to DP synthetic tabular data but is more incremental, adapting an existing framework (Private Evolution) to a new modality. Paper 1's breadth of impact across alignment, interpretability, and safety, combined with its methodological novelty, gives it higher potential impact.

    claude-opus-4-6 · Jun 9, 2026

    Lost vs. Tangram: Unlocking Non-Uniform KV Cache for Efficient Multi-turn LLM Serving

    Paper 1 addresses a critical bottleneck in the highly active field of LLM deployment: KV cache memory and bandwidth pressure. By proposing a system that improves throughput by up to 2.6x while maintaining accuracy, it offers massive immediate economic and practical benefits for real-world AI applications. While Paper 2 presents a valuable contribution to privacy-preserving synthetic data, the scale, timeliness, and broader industry reliance on efficient LLM serving give Paper 1 a significantly higher potential for widespread scientific and technological impact.

    gemini-3.1-pro-preview · Jun 9, 2026