sure.privacy package
Submodules
sure.privacy.privacy module
- sure.privacy.privacy.adversary_dataset(training_set: pd.DataFrame | pl.DataFrame | pl.LazyFrame, validation_set: pd.DataFrame | pl.DataFrame | pl.LazyFrame, original_dataset_sample_fraction: float = 0.2) → pd.DataFrame
Create an adversary dataset for the Membership Inference Test given a training and validation set. The validation set must be smaller than the training set.
The size of the resulting adversary dataset is a fraction of the sum of the training set size and the validation set size.
Half of the rows in the resulting dataset are sampled from the training set and the other half from the validation set. A column is added to mark which rows were sampled from the training set.
Parameters
- training_set : pd.DataFrame
The training set as a pandas DataFrame.
- validation_set : pd.DataFrame
The validation set as a pandas DataFrame.
- original_dataset_sample_fraction : float, optional
How many rows (as a fraction from 0 to 1) to sample from the concatenation of the training and validation sets, by default 0.2.
Returns
- pd.DataFrame
A new pandas DataFrame in which half of the rows come from the training set and the other half come from the validation set.
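The sampling scheme described above can be illustrated with a small dependency-free sketch. It is not the library's implementation: rows are plain dictionaries rather than DataFrame rows, and the function name `build_adversary_rows`, the marker column name `is_training`, and the `seed` argument are all hypothetical.

```python
import random


def build_adversary_rows(training_rows, validation_rows, sample_fraction=0.2, seed=0):
    """Sample half of the adversary rows from each set and mark their origin."""
    if len(validation_rows) >= len(training_rows):
        raise ValueError("the validation set must be smaller than the training set")

    # The target size is a fraction of the combined size of both sets.
    total = int((len(training_rows) + len(validation_rows)) * sample_fraction)
    half = total // 2
    if half > len(validation_rows):
        raise ValueError("the validation set is too small for this sample fraction")

    rng = random.Random(seed)
    from_training = rng.sample(training_rows, half)
    from_validation = rng.sample(validation_rows, half)

    # "is_training" is a hypothetical name for the marker column.
    rows = [dict(row, is_training=True) for row in from_training]
    rows += [dict(row, is_training=False) for row in from_validation]
    rng.shuffle(rows)
    return rows


training = [{"id": i} for i in range(80)]
validation = [{"id": 1000 + i} for i in range(20)]
adversary = build_adversary_rows(training, validation, sample_fraction=0.2)
```

With 80 training rows, 20 validation rows and a fraction of 0.2, the sketch returns 20 rows, 10 from each set.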
- sure.privacy.privacy.membership_inference_test(adversary_dataset: pd.DataFrame | pl.DataFrame | pl.LazyFrame, synthetic_dataset: pd.DataFrame | pl.DataFrame | pl.LazyFrame, adversary_guesses_ground_truth: np.ndarray | pd.DataFrame | pl.DataFrame | pl.LazyFrame | pl.Series, parallel: bool = True, path_to_json: str = '')
Simulate a Membership Inference Attack on the provided synthetic dataset using an adversary dataset.
Parameters
- adversary_dataset : pd.DataFrame, pl.DataFrame, or pl.LazyFrame
The dataset used by the adversary, containing features for the attack simulation.
- synthetic_dataset : pd.DataFrame, pl.DataFrame, or pl.LazyFrame
The synthetic dataset on which the Membership Inference Attack is performed.
- adversary_guesses_ground_truth : np.ndarray, pd.DataFrame, pl.DataFrame, pl.LazyFrame, or pl.Series
Ground truth labels indicating whether a sample is from the original training dataset or not.
- parallel : bool, optional
Whether to use parallel processing for distance calculations, by default True.
- path_to_json : str, optional
Path to save the attack output as a JSON file. If empty, the output is not saved, by default “”.
Returns
- dict
A dictionary containing the attack results, including distance thresholds, precisions, and risk score.
Notes
This function simulates an attack in which an adversary attempts to infer, for each record of the adversary dataset, whether it belongs to the original training dataset.
The attack results are saved to a JSON file if path_to_json is provided.
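A distance-threshold attack of this kind can be sketched as follows. This is only an illustration under stated assumptions, not the library's algorithm: it assumes the adversary guesses "member" whenever a record's distance to the closest synthetic record is at or below a threshold, and it reports the precision of those guesses for each threshold. The function name `attack_precisions` and its inputs are hypothetical, and no risk score is computed because its definition is not given here.

```python
def attack_precisions(distances, is_member, thresholds):
    """Return the precision of 'member' guesses for each distance threshold."""
    results = {}
    for threshold in thresholds:
        # Assumed rule: a record close to the synthetic data is guessed to be a member.
        guesses = [d <= threshold for d in distances]
        true_positives = sum(g and m for g, m in zip(guesses, is_member))
        guessed_members = sum(guesses)
        # Precision is undefined when the adversary makes no positive guess.
        results[threshold] = true_positives / guessed_members if guessed_members else None
    return results


# Distance of each adversary record to its closest synthetic record (made-up values).
distances = [0.0, 0.1, 0.4, 0.5]
# Ground truth: True if the record was sampled from the training set.
is_member = [True, True, False, True]
precisions = attack_precisions(distances, is_member, thresholds=[0.1, 0.5])
```

A precision well above the fraction of true members in the adversary dataset would suggest that closeness to the synthetic data leaks membership information.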
Module contents
- sure.privacy.adversary_dataset(training_set: pd.DataFrame | pl.DataFrame | pl.LazyFrame, validation_set: pd.DataFrame | pl.DataFrame | pl.LazyFrame, original_dataset_sample_fraction: float = 0.2) → pd.DataFrame
Create an adversary dataset for the Membership Inference Test given a training and validation set. The validation set must be smaller than the training set.
The size of the resulting adversary dataset is a fraction of the sum of the training set size and the validation set size.
Half of the rows in the resulting dataset are sampled from the training set and the other half from the validation set. A column is added to mark which rows were sampled from the training set.
Parameters
- training_set : pd.DataFrame
The training set as a pandas DataFrame.
- validation_set : pd.DataFrame
The validation set as a pandas DataFrame.
- original_dataset_sample_fraction : float, optional
How many rows (as a fraction from 0 to 1) to sample from the concatenation of the training and validation sets, by default 0.2.
Returns
- pd.DataFrame
A new pandas DataFrame in which half of the rows come from the training set and the other half come from the validation set.
- sure.privacy.dcr_stats(dcr_name: str, distances_to_closest_record: np.ndarray, path_to_json: str = '') → Dict
Return summary statistics for a previously computed array of DCR (Distance to the Closest Record) values.
Parameters
- dcr_name : str
Name under which the DCR is saved in the JSON file used to generate the final report. Can be one of the following:
synth_train
synth_val
other
- distances_to_closest_record : np.ndarray
A 1D array containing the Distance to the Closest Record for each row of a dataframe, with shape (dataframe rows,).
Returns
- Dict
A dictionary containing mean and percentiles of the given DCR array.
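As an illustration of the kind of summary described above, the sketch below computes the mean and a few percentiles of a DCR array with the standard library only. The helper name `summarize_dcr`, the dictionary keys, and the choice of percentiles (quartiles, minimum and maximum) are assumptions; the dictionary actually returned by `dcr_stats` may differ.

```python
import statistics


def summarize_dcr(dcr):
    """Return the mean and selected percentiles of a sequence of DCR values."""
    if len(dcr) < 2:
        raise ValueError("at least two DCR values are needed to compute quartiles")
    # The "inclusive" method treats the data as the full population,
    # so the minimum and maximum are the 0th and 100th percentiles.
    q1, median, q3 = statistics.quantiles(dcr, n=4, method="inclusive")
    return {
        "mean": statistics.mean(dcr),
        "min": min(dcr),
        "25%": q1,
        "median": median,
        "75%": q3,
        "max": max(dcr),
    }


stats = summarize_dcr([0.0, 0.1, 0.2, 0.3, 0.4])
```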
- sure.privacy.distance_to_closest_record(dcr_name: str, x_dataframe: pd.DataFrame | pl.DataFrame | pl.LazyFrame, y_dataframe: pd.DataFrame | pl.DataFrame | pl.LazyFrame | None = None, feature_weights: np.ndarray | List | None = None, parallel: bool = True, save_output: bool = True, path_to_json: str = '') → np.ndarray
Compute the distances to the closest record of dataframe X from dataframe Y using a modified version of Gower’s distance. The two dataframes may contain mixed datatypes (numerical and categorical).
Paper references:
* A General Coefficient of Similarity and Some of Its Properties, J. C. Gower
* Dimensionality Invariant Similarity Measure, Ahmad Basheer Hassanat
Parameters
- dcr_name : str
Name under which the DCR is saved in the JSON file used to generate the final report. Can be one of the following:
synth_train
synth_val
other
- x_dataframe : pd.DataFrame
A dataset containing numerical and categorical data.
- categorical_features : List
List of booleans that indicates which features are categorical. If categorical_features[i] is True, feature i is categorical. Must have the same length as x_dataframe.columns.
- y_dataframe : pd.DataFrame, optional
Another dataset containing numerical and categorical data, by default None. It must contain the same columns as x_dataframe. If None, the distance matrix is computed between x_dataframe and itself.
- feature_weights : np.ndarray or List, optional
List of feature weights to use when computing distances, by default None. If None, each feature weight is 1.0.
- parallel : bool, optional
Whether to parallelize the computation of the Gower matrix, by default True.
- save_output : bool, optional
If True, saves the DCR information into the JSON file used to generate the final report, by default True.
Returns
- np.ndarray
A 1D array containing the Distance to the Closest Record for each row of x_dataframe, with shape (x_dataframe rows,).
Raises
- TypeError
If dcr_name is not one of the names listed above.
- TypeError
If X and Y don’t have the same (number of) columns.
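To make the idea concrete, here is a minimal sketch of a DCR computation based on the plain Gower distance: range-normalized absolute difference for numerical features, 0/1 mismatch for categorical features, then a weighted average. It does not reproduce the modified distance used by the library, and the names `gower_distance` and `dcr_sketch`, the tuple-based row format, and the explicit `is_categorical` argument are hypothetical.

```python
def gower_distance(a, b, is_categorical, ranges, weights):
    """Weighted plain Gower distance between two records."""
    total = 0.0
    for x, y, categorical, value_range, weight in zip(a, b, is_categorical, ranges, weights):
        if categorical:
            d = 0.0 if x == y else 1.0  # simple mismatch for categorical features
        else:
            # Range-normalized absolute difference; constant features contribute 0.
            d = abs(x - y) / value_range if value_range else 0.0
        total += weight * d
    return total / sum(weights)


def dcr_sketch(x_rows, y_rows, is_categorical, weights=None):
    """For each row of x_rows, return the distance to its closest row in y_rows."""
    weights = weights or [1.0] * len(is_categorical)
    # Numerical ranges are computed over both datasets (an assumption of this sketch).
    ranges = []
    for j, categorical in enumerate(is_categorical):
        if categorical:
            ranges.append(None)
        else:
            values = [row[j] for row in x_rows + y_rows]
            ranges.append(max(values) - min(values))
    return [
        min(gower_distance(x, y, is_categorical, ranges, weights) for y in y_rows)
        for x in x_rows
    ]


synthetic = [(1.0, "a"), (5.0, "b")]
original = [(1.0, "a"), (3.0, "b")]
dcr = dcr_sketch(synthetic, original, is_categorical=[False, True])
```

The first synthetic row is an exact copy of an original row, so its DCR is 0; the second differs from its nearest neighbor only by half the numerical range on one of two features, giving 0.25.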
- sure.privacy.membership_inference_test(adversary_dataset: pd.DataFrame | pl.DataFrame | pl.LazyFrame, synthetic_dataset: pd.DataFrame | pl.DataFrame | pl.LazyFrame, adversary_guesses_ground_truth: np.ndarray | pd.DataFrame | pl.DataFrame | pl.LazyFrame | pl.Series, parallel: bool = True, path_to_json: str = '')
Simulate a Membership Inference Attack on the provided synthetic dataset using an adversary dataset.
Parameters
- adversary_dataset : pd.DataFrame, pl.DataFrame, or pl.LazyFrame
The dataset used by the adversary, containing features for the attack simulation.
- synthetic_dataset : pd.DataFrame, pl.DataFrame, or pl.LazyFrame
The synthetic dataset on which the Membership Inference Attack is performed.
- adversary_guesses_ground_truth : np.ndarray, pd.DataFrame, pl.DataFrame, pl.LazyFrame, or pl.Series
Ground truth labels indicating whether a sample is from the original training dataset or not.
- parallel : bool, optional
Whether to use parallel processing for distance calculations, by default True.
- path_to_json : str, optional
Path to save the attack output as a JSON file. If empty, the output is not saved, by default “”.
Returns
- dict
A dictionary containing the attack results, including distance thresholds, precisions, and risk score.
Notes
This function simulates an attack in which an adversary attempts to infer, for each record of the adversary dataset, whether it belongs to the original training dataset.
The attack results are saved to a JSON file if path_to_json is provided.
- sure.privacy.number_of_dcr_equal_to_zero(dcr_name: str, distances_to_closest_record: np.ndarray, path_to_json: str = '') → int | int8 | uint8 | int16 | uint16 | int32 | uint32 | int64 | uint64
Return the number of 0s in the given DCR array, that is the number of duplicates/clones detected.
Parameters
- distances_to_closest_record : np.ndarray
A 1D array containing the Distance to the Closest Record for each row of a dataframe, with shape (dataframe rows,).
Returns
- int
The number of 0s in the given DCR array.
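The count itself is straightforward; a minimal sketch over a plain Python list follows (the helper name `count_zero_dcr` is hypothetical). A DCR of exactly 0 means the row has an identical record in the dataset it was compared against.

```python
def count_zero_dcr(dcr):
    """Count the rows whose distance to the closest record is exactly zero."""
    return sum(1 for d in dcr if d == 0)


clones = count_zero_dcr([0.0, 0.2, 0.0, 0.5])
```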
- sure.privacy.validation_dcr_test(dcr_synth_train: np.ndarray, dcr_synth_validation: np.ndarray, path_to_json: str = '') → float | float16 | float32 | float64
If the returned percentage is close to (or smaller than) 50%, the synthetic dataset’s records are equally close to the original training set and to the validation set. In this case, the synthetic data does not allow one to infer whether a record was contained in the training dataset.
If the returned percentage is greater than 50%, the synthetic dataset’s records are closer to the training set than to the validation set, indicating that vulnerable records are present in the synthetic dataset.
Parameters
- dcr_synth_train : np.ndarray
A 1D array containing the Distance to the Closest Record for each row of the synthetic dataset with respect to the training dataset, with shape (synthetic rows,).
- dcr_synth_validation : np.ndarray
A 1D array containing the Distance to the Closest Record for each row of the synthetic dataset with respect to the validation dataset, with shape (synthetic rows,).
Returns
- float
The percentage of synthetic rows closer to the training dataset than to the validation dataset.
Raises
- ValueError
If the two DCR arrays given as parameters have different shapes.
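The test above can be sketched in a few lines. This is an illustration rather than the library's code: the helper name `closer_to_training_percentage` is hypothetical, and counting ties as "not closer to the training set" is an assumption.

```python
def closer_to_training_percentage(dcr_synth_train, dcr_synth_validation):
    """Percentage of synthetic rows that are closer to the training set."""
    if len(dcr_synth_train) != len(dcr_synth_validation):
        raise ValueError("the two DCR arrays must have the same shape")
    # Assumption: a tie is not counted as closer to the training set.
    closer = sum(
        1 for d_train, d_val in zip(dcr_synth_train, dcr_synth_validation)
        if d_train < d_val
    )
    return 100.0 * closer / len(dcr_synth_train)


share = closer_to_training_percentage(
    [0.1, 0.3, 0.2, 0.5],  # DCR of each synthetic row with respect to the training set
    [0.2, 0.1, 0.4, 0.5],  # DCR of the same rows with respect to the validation set
)
```

Here two of the four synthetic rows are strictly closer to the training set, so the result is 50%, the value expected when the synthetic data does not favor training records.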