Generating Synthetic Data for Deep Neural Network Classification Using Global Differential Privacy Based Optimization
Keywords:
Classifier, Differential Privacy, Synthetic Dataset, Privacy Budget, Prompt Variance Loss

Abstract
Anonymizing individual text samples before dissemination is an open research problem in Natural Language Processing (NLP). Significant efforts have been devoted to constructing such mechanisms by employing Local Differential Privacy (LDP) during model training. However, LDP requires substantial noise in the update rule, which often degrades the quality of the generated language. In this study, we address this limitation by adopting Global Differential Privacy (GDP). Specifically, we first train a generative language model in a differentially private manner and subsequently sample data from it. To this end, we introduce a novel Prompt Variance Loss (PVL), which guides the model to generate samples that faithfully follow a given instruction and substantially improves results. Experiments demonstrate that the synthetic datasets preserve privacy, leaking no sensitive information from the original data, while remaining well suited for training models and for further analysis of real-world data. Notably, we show that training classifiers on the private synthetic data outperforms training classifiers directly on real data with DP-SGD.
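As context for the comparison drawn in the abstract, the following is a minimal, self-contained sketch of the DP-SGD update rule (per-example gradient clipping plus Gaussian noise) that the paper uses as its baseline; it is an illustrative toy on synthetic features, not the authors' released code, and all names and hyperparameter values here are assumptions. In the paper's GDP pipeline, an update of this kind is instead applied while fine-tuning a generative language model, after which synthetic data are sampled and an ordinary, non-private classifier is trained on them.

```python
# Illustrative DP-SGD sketch (assumption: toy example, not the paper's code).
# Per-example gradients are clipped to norm C and Gaussian noise with std
# sigma * C is added before the averaged update is applied.
import torch

torch.manual_seed(0)

# Toy data and a toy linear classifier stand in for real text features.
X = torch.randn(256, 16)
y = (X[:, 0] > 0).long()
model = torch.nn.Linear(16, 2)

clip_norm = 1.0         # per-example clipping bound C (assumed value)
noise_multiplier = 1.0  # sigma (assumed value)
lr = 0.1

for step in range(50):
    idx = torch.randint(0, X.size(0), (32,))
    xb, yb = X[idx], y[idx]

    # Accumulate clipped per-example gradients.
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for i in range(xb.size(0)):
        model.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(xb[i:i + 1]), yb[i:i + 1])
        loss.backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = min(1.0, clip_norm / (float(total_norm) + 1e-6))
        for s, g in zip(summed, grads):
            s += g * scale

    # Add Gaussian noise and apply the averaged noisy gradient.
    with torch.no_grad():
        for p, s in zip(model.parameters(), summed):
            noisy = s + noise_multiplier * clip_norm * torch.randn_like(s)
            p -= lr * noisy / xb.size(0)
```

The noise scale needed for a meaningful privacy budget is what degrades utility when this rule is applied directly to a downstream classifier; amortizing it once over a generative model, and then training freely on the sampled synthetic data, is the trade the abstract describes.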