Josep Domingo-Ferrer, Francesc Sebé, Agusti Solanas
Computers & Mathematics with Applications 55 (4), 714-732
- Description: Microaggregation is a family of methods for statistical disclosure control (SDC) of microdata (records on individuals and/or companies), that is, for masking microdata so that they can be released without disclosing private information on the underlying individuals. Microaggregation techniques are currently being used by many statistical agencies. The principle of microaggregation is to group original database records into small aggregates prior to publication. Each aggregate should contain at least k records to prevent disclosure of individual information, where k is a constant value preset by the data protector. In addition to it being a good masking method, microaggregation has recently been shown useful to achieve k-anonymity. In k-anonymity, the parameter k specifies the maximum acceptable disclosure risk, so that, once a value for k has been selected, the only job left is to maximize data utility: if microaggregation is used to implement k-anonymity, maximizing utility can be achieved by microaggregating optimally, i.e. with minimum within-groups variability loss. Unfortunately, optimal microaggregation can only be computed in polynomial time for univariate data. For multivariate data, it has been shown to be NP-hard. We present in this paper a polynomial-time approximation to microaggregate multivariate numerical data for which bounds to optimal microaggregation can be derived at least for two different optimality criteria: minimum within-groups Euclidean distance and minimum within-groups sum of squares. Beyond the theoretical interest of being the first microaggregation proposal with proven approximation bounds for any k, our method is empirically shown to be comparable to the best available heuristics for multivariate microaggregation.