Utils

class sure.Preprocessor(data: LazyFrame | DataFrame | DataFrame, discarding_threshold: float = 0.9, get_discarded_info: bool = False, excluded_col: List = [], time: str | None = None, missing_values_threshold: float = 0.999, n_bins: int = 0, scaling: str = 'normalize', num_fill_null: Literal['interpolate', 'forward', 'backward', 'min', 'max', 'mean', 'zero', 'one'] = 'mean', cat_labels_threshold: float = 0.01, unseen_labels='ignore', target_columns=None)

Bases: object

categorical_features: Tuple[str]
discarded_features: List[str] | Dict[str, str]
extract_ts_features(data: LazyFrame | DataFrame, y: Series | Series | None = None, time: str | None = None, column_id: str | None = None) DataFrame

Extract relevant time-series features from the provided data.

Parameters

data : pl.LazyFrame or pd.DataFrame

The input dataset containing the time-series data. It can be a Polars LazyFrame or a Pandas DataFrame.

y : pl.Series or pd.Series, optional

The label series associated with the data. It can be a Polars Series or a Pandas Series.

time : str, optional

The name of the time column used to sort the data. If not provided, the method will try to use self.time if available.

column_id : str, optional

The name of the ID column, if present in the data. This is used to distinguish different time-series within the same dataset.

Returns

pd.DataFrame

A DataFrame containing the extracted and filtered relevant time-series features.

Raises

ValueError

If the provided data is not a Polars LazyFrame or a Pandas or Polars DataFrame.

ValueError

If the provided label series is not a Polars Series or a Pandas Series.

ValueError

If the time column name is not provided and self.time is not available.

Notes

  • The function uses the extract_relevant_features method from the tsfresh library to extract features from the time-series data.

  • The method stores the filtered features in self.features_filtered for further use.

fill_null_strategy

alias of Literal['interpolate', 'forward', 'backward', 'min', 'max', 'mean', 'zero', 'one']

get_categorical_features() Tuple[str]

Return the list of categorical features.

get_numerical_features() Tuple[str]

Return the list of numerical features.

A class for preprocessing datasets, including feature selection, handling missing values, scaling, and time-series feature extraction.

Attributes

numerical_features : Tuple[str]

Names of the numerical features in the dataset.

categorical_features : Tuple[str]

Names of the categorical features in the dataset.

temporal_features : Tuple[str]

Names of the temporal features in the dataset.

discarded_features : Union[List[str], Dict[str, str]]

Features that were discarded during preprocessing, along with reasons if available.

single_value_columns : Dict[str, str]

Dictionary storing columns with only one unique value, along with the unique value.

numerical_features: Tuple[str]
scaling

alias of Literal['normalize', 'standardize', 'quantile']

single_value_columns: Dict[str, str]
temporal_features: Tuple[str]
transform(data: LazyFrame | DataFrame | DataFrame) DataFrame | DataFrame

Transform the input dataset by processing numerical, temporal, and categorical columns. This includes filling null values, scaling or discretizing numerical features, and encoding categorical features.

Parameters

data : pl.LazyFrame or pl.DataFrame or pd.DataFrame

The input dataset to be transformed. It can be a Polars LazyFrame, Polars DataFrame, or a Pandas DataFrame.

Returns

pl.DataFrame or pd.DataFrame

The transformed dataset, returned as a Polars DataFrame or a Pandas DataFrame, depending on the input data type.

Raises

SystemExit

If the input data type does not match the data type used when the Preprocessor was initialized.

Notes

  • The method identifies and processes numerical, temporal, and categorical features separately.

  • Categorical features are filled with the most frequent value and then one-hot encoded.

  • Numerical features can be normalized, standardized, or discretized based on the specified parameters.

  • Temporal features are filled using interpolation and reordered to the beginning of the dataset.

sure.report(df_real: DataFrame | DataFrame | LazyFrame, df_synth: DataFrame | DataFrame | LazyFrame, path_to_json: str = '')

Generate the report app

Utility

sure.utility.compute_mutual_info(real_data: DataFrame | LazyFrame | DataFrame | ndarray, synth_data: DataFrame | LazyFrame | DataFrame | ndarray, path_to_json: str = '') Tuple[DataFrame, DataFrame, DataFrame]

Compute the correlation matrices for both real and synthetic datasets, and calculate the difference between these matrices.

Parameters

real_data : Union[pl.DataFrame, pl.LazyFrame, pd.DataFrame, np.ndarray]

The real dataset, which can be in the form of a Polars DataFrame, LazyFrame, pandas DataFrame, or numpy ndarray.

synth_data : Union[pl.DataFrame, pl.LazyFrame, pd.DataFrame, np.ndarray]

The synthetic dataset, provided in the same format as real_data.

path_to_json : str, optional

File path to save the correlation matrices and their differences in JSON format, by default “”.

Returns

Tuple[pl.DataFrame, pl.DataFrame, pl.DataFrame]

A tuple containing:

  • real_corr: Correlation matrix of the real dataset with column names included.

  • synth_corr: Correlation matrix of the synthetic dataset with column names included.

  • diff_corr: Difference between the correlation matrices of the real and synthetic datasets, with values smaller than 1e-5 substituted with 0.

Raises

ValueError

If the features in the real and synthetic datasets do not match or if non-numerical features are present.
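As a rough illustration of the returned triple, the sketch below builds two correlation matrices and their thresholded difference in plain Python. It rests on assumptions: Pearson correlation is used as the measure, the 1e-5 cutoff is applied to absolute values, and all helper names are hypothetical rather than part of the library.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def corr_matrix(columns):
    """Correlation matrix (list of rows) for a dict of column name -> values."""
    names = list(columns)
    return [[pearson(columns[a], columns[b]) for b in names] for a in names]


def corr_difference(real, synth, tol=1e-5):
    """Return (real_corr, synth_corr, diff_corr), zeroing small differences."""
    if list(real) != list(synth):
        raise ValueError("real and synthetic features do not match")
    real_corr, synth_corr = corr_matrix(real), corr_matrix(synth)
    diff = [[r - s for r, s in zip(rr, sr)] for rr, sr in zip(real_corr, synth_corr)]
    # Assumption: the cutoff applies to the absolute value of each entry.
    diff = [[0.0 if abs(d) < tol else d for d in row] for row in diff]
    return real_corr, synth_corr, diff


real = {"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 4.0, 6.0, 8.0]}
synth = {"a": [1.0, 2.0, 3.0, 4.0], "b": [8.0, 6.0, 4.0, 2.0]}
real_corr, synth_corr, diff_corr = corr_difference(real, synth)
```

In this toy data the synthetic set reverses the relationship between a and b, so the off-diagonal difference is close to 2 while the diagonal stays at 0.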

sure.utility.compute_statistical_metrics(real_data: DataFrame | LazyFrame | DataFrame | ndarray, synth_data: DataFrame | LazyFrame | DataFrame | ndarray, path_to_json: str = '') Tuple[Dict, Dict, Dict]

Compute statistical metrics for numerical, categorical, and temporal features in both real and synthetic datasets.

Parameters

real_data : Union[pl.DataFrame, pl.LazyFrame, pd.DataFrame, np.ndarray]

The real dataset containing numerical, categorical, and/or temporal features.

synth_data : Union[pl.DataFrame, pl.LazyFrame, pd.DataFrame, np.ndarray]

The synthetic dataset containing numerical, categorical, and/or temporal features.

path_to_json : str, optional

The file path to save the comparison metrics in JSON format, by default “”.

Returns

Tuple[Dict, Dict, Dict]

A tuple containing three dictionaries with statistical comparisons for numerical, categorical, and temporal features, respectively.

sure.utility.compute_utility_metrics_class(X_train: DataFrame | LazyFrame | DataFrame | ndarray, X_synth: DataFrame | LazyFrame | DataFrame | ndarray, X_test: DataFrame | LazyFrame | DataFrame | ndarray, y_train: DataFrame | LazyFrame | Series | Series | DataFrame | ndarray, y_synth: DataFrame | LazyFrame | Series | Series | DataFrame | ndarray, y_test: DataFrame | LazyFrame | Series | Series | DataFrame | ndarray, custom_metric: Callable | None = None, classifiers: List[Callable] = 'all', predictions: bool = False, path_to_json: str = '')

Train and evaluate classification models on both real and synthetic datasets.

Parameters

X_train : Union[pl.DataFrame, pl.LazyFrame, pd.DataFrame, np.ndarray]

Training features for real data.

X_synth : Union[pl.DataFrame, pl.LazyFrame, pd.DataFrame, np.ndarray]

Training features for synthetic data.

X_test : Union[pl.DataFrame, pl.LazyFrame, pd.DataFrame, np.ndarray]

Test features for evaluation.

y_train : Union[pl.DataFrame, pl.LazyFrame, pl.Series, pd.Series, pd.DataFrame, np.ndarray]

Training labels for real data.

y_synth : Union[pl.DataFrame, pl.LazyFrame, pl.Series, pd.Series, pd.DataFrame, np.ndarray]

Training labels for synthetic data.

y_test : Union[pl.DataFrame, pl.LazyFrame, pl.Series, pd.Series, pd.DataFrame, np.ndarray]

Test labels for evaluation.

custom_metric : Callable, optional

Custom metric for model evaluation, by default None.

classifiers : List[Callable], optional

List of classifiers to use, or “all” for all available classifiers, by default “all”.

predictions : bool, optional

If True, returns predictions along with model performance, by default False.

path_to_json : str, optional

Path to save the output JSON files, by default “”.

Returns

Union[pd.DataFrame, pl.DataFrame, np.ndarray]

Model performance metrics for both real and synthetic datasets, and optionally predictions.
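The idea behind this comparison (train once on real data, once on synthetic data, and score both on the same real test set) can be sketched without any dependencies. The 1-nearest-neighbour classifier and the accuracy metric below are stand-ins chosen for brevity, not the models or metrics the library actually uses.

```python
def predict_1nn(X_fit, y_fit, X_query):
    """Label each query row with the label of its nearest fitted row."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    preds = []
    for row in X_query:
        nearest = min(range(len(X_fit)), key=lambda i: sq_dist(X_fit[i], row))
        preds.append(y_fit[nearest])
    return preds


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


X_train, y_train = [[0.0], [1.0], [9.0], [10.0]], [0, 0, 1, 1]
X_synth, y_synth = [[0.5], [9.5]], [0, 1]
X_test, y_test = [[2.0], [8.0]], [0, 1]

# Same test set for both models, so the two scores are directly comparable.
scores = {
    "real": accuracy(y_test, predict_1nn(X_train, y_train, X_test)),
    "synth": accuracy(y_test, predict_1nn(X_synth, y_synth, X_test)),
}
```

When the synthetic data preserves the structure of the real data, the two scores should be close; a large gap suggests a loss of utility.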

sure.utility.compute_utility_metrics_regr(X_train: DataFrame | LazyFrame | DataFrame | ndarray, X_synth: DataFrame | LazyFrame | DataFrame | ndarray, X_test: DataFrame | LazyFrame | DataFrame | ndarray, y_train: DataFrame | LazyFrame | Series | Series | DataFrame | ndarray, y_synth: DataFrame | LazyFrame | Series | Series | DataFrame | ndarray, y_test: DataFrame | LazyFrame | Series | Series | DataFrame | ndarray, custom_metric: Callable | None = None, regressors: List[Callable] = 'all', predictions: bool = False, path_to_json: str = '')

Train and evaluate regression models on both real and synthetic datasets.

Parameters

X_train : Union[pl.DataFrame, pl.LazyFrame, pd.DataFrame, np.ndarray]

Training features for real data.

X_synth : Union[pl.DataFrame, pl.LazyFrame, pd.DataFrame, np.ndarray]

Training features for synthetic data.

X_test : Union[pl.DataFrame, pl.LazyFrame, pd.DataFrame, np.ndarray]

Test features for evaluation.

y_train : Union[pl.DataFrame, pl.LazyFrame, pl.Series, pd.Series, pd.DataFrame, np.ndarray]

Training labels for real data.

y_synth : Union[pl.DataFrame, pl.LazyFrame, pl.Series, pd.Series, pd.DataFrame, np.ndarray]

Training labels for synthetic data.

y_test : Union[pl.DataFrame, pl.LazyFrame, pl.Series, pd.Series, pd.DataFrame, np.ndarray]

Test labels for evaluation.

custom_metric : Callable, optional

Custom metric for model evaluation, by default None.

regressors : List[Callable], optional

List of regressors to use, or “all” for all available regressors, by default “all”.

predictions : bool, optional

If True, returns predictions along with model performance, by default False.

path_to_json : str, optional

Path to save the output JSON files, by default “”.

Returns

Union[pd.DataFrame, pl.DataFrame, np.ndarray]

Model performance metrics for both real and synthetic datasets, and optionally predictions.

Privacy

sure.privacy.adversary_dataset(training_set: DataFrame | DataFrame | LazyFrame, validation_set: DataFrame | DataFrame | LazyFrame, original_dataset_sample_fraction: float = 0.2) DataFrame

Create an adversary dataset for the Membership Inference Test given a training and validation set. The validation set must be smaller than the training set.

The size of the resulting adversary dataset is a fraction of the sum of the training set size and the validation set size.

It takes half of the final rows from the training set and the other half from the validation set. It adds a column to mark which rows were sampled from the training set.

Parameters

training_set : pd.DataFrame

The training set as a pandas DataFrame.

validation_set : pd.DataFrame

The validation set as a pandas DataFrame.

original_dataset_sample_fraction : float, optional

Fraction (between 0 and 1) of the rows in the concatenated training and validation sets to sample, by default 0.2.

Returns

pd.DataFrame

A new pandas DataFrame in which half of the rows come from the training set and the other half come from the validation set.
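The sampling scheme described above can be sketched with lists of dictionaries standing in for DataFrames. The function name, the in_training marker column, and the fixed seed are hypothetical choices for this illustration.

```python
import random


def build_adversary_rows(training_rows, validation_rows, fraction=0.2, seed=0):
    """Sample equal numbers of rows from both sets and mark their origin."""
    rng = random.Random(seed)
    # Target size is a fraction of the combined size of the two sets.
    total = int((len(training_rows) + len(validation_rows)) * fraction)
    half = total // 2
    from_train = rng.sample(training_rows, half)
    from_val = rng.sample(validation_rows, half)
    rows = [dict(r, in_training=True) for r in from_train]
    rows += [dict(r, in_training=False) for r in from_val]
    return rows


training = [{"x": i} for i in range(80)]
validation = [{"x": 100 + i} for i in range(20)]
adversary = build_adversary_rows(training, validation)
```

With 80 training rows, 20 validation rows, and a fraction of 0.2, the result has 20 rows, 10 from each set. The marker column later serves as the ground truth for the membership inference test.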

sure.privacy.dcr_stats(dcr_name: str, distances_to_closest_record: ndarray, path_to_json: str = '') Dict

Return the statistics for a previously computed DCR array.

Parameters

dcr_name : str

Name under which the DCR will be saved in the JSON file used to generate the final report. Can be one of the following:

  • synth_train

  • synth_val

  • other

distances_to_closest_record : np.ndarray

A 1D array containing the Distance to the Closest Record for each row of a dataframe, with shape (dataframe rows, ).

Returns

Dict

A dictionary containing mean and percentiles of the given DCR array.
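A summary of this kind can be sketched with the standard library alone. The exact set of percentiles and the dictionary keys the library reports are not stated here, so the quartiles and key names below are assumptions.

```python
from statistics import mean, quantiles


def summarize_dcr(distances):
    """Mean, extremes, and quartiles of a sequence of DCR values."""
    # 'inclusive' treats the data as the full population, so the
    # quartiles never fall outside the observed range.
    q1, q2, q3 = quantiles(distances, n=4, method="inclusive")
    return {
        "mean": mean(distances),
        "min": min(distances),
        "25%": q1,
        "median": q2,
        "75%": q3,
        "max": max(distances),
    }


stats = summarize_dcr([0.0, 0.1, 0.2, 0.3, 0.4])
```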

sure.privacy.distance_to_closest_record(dcr_name: str, x_dataframe: DataFrame | DataFrame | LazyFrame, y_dataframe: DataFrame | DataFrame | LazyFrame | None = None, feature_weights: ndarray | List | None = None, parallel: bool = True, save_output: bool = True, path_to_json: str = '') ndarray

Compute the distances to the closest record of dataframe X from dataframe Y using a modified version of Gower’s distance. The two dataframes may contain mixed datatypes (numerical and categorical).

Paper references:

  • A General Coefficient of Similarity and Some of Its Properties, J. C. Gower

  • Dimensionality Invariant Similarity Measure, Ahmad Basheer Hassanat

Parameters

dcr_name : str

Name under which the DCR will be saved in the JSON file used to generate the final report. Can be one of the following:

  • synth_train

  • synth_val

  • other

x_dataframe : pd.DataFrame

A dataset containing numerical and categorical data.

categorical_features : List

List of booleans that indicates which features are categorical. If categorical_features[i] is True, feature i is categorical. Must have the same length as x_dataframe.columns.

y_dataframe : pd.DataFrame, optional

Another dataset containing numerical and categorical data, by default None. It must contain the same columns as x_dataframe. If None, the distance matrix is computed between x_dataframe and itself.

feature_weights : List, optional

List of feature weights to use when computing distances, by default None. If None, each feature weight is 1.0.

parallel : bool, optional

Whether to parallelize the computation of the Gower matrix, by default True.

save_output : bool

If True, saves the DCR information into the JSON file used to generate the final report.

Returns

np.ndarray

1D array containing the Distance to the Closest Record for each row of x_dataframe, with shape (x_dataframe rows, ).

Raises

TypeError

If dcr_name is not one of the names listed above.

TypeError

If X and Y don’t have the same (number of) columns.
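To make the notion of a distance to the closest record concrete, the sketch below uses the classic Gower distance on mixed-type rows. The library uses a modified variant whose details are not given here, so this is only an approximation of the idea, and every name in it is hypothetical.

```python
def gower_distance(a, b, is_categorical, ranges, weights=None):
    """Classic Gower distance between two rows with mixed feature types."""
    weights = weights or [1.0] * len(a)
    total = 0.0
    for x, y, cat, rng, w in zip(a, b, is_categorical, ranges, weights):
        if cat:
            d = 0.0 if x == y else 1.0          # simple mismatch for categories
        else:
            d = abs(x - y) / rng if rng else 0.0  # range-normalised difference
        total += w * d
    return total / sum(weights)


def distance_to_closest(X, Y, is_categorical):
    """For each row of X, the smallest Gower distance to any row of Y."""
    ranges = []
    for j, cat in enumerate(is_categorical):
        if cat:
            ranges.append(None)  # unused for categorical features
        else:
            col = [row[j] for row in X + Y]
            ranges.append(max(col) - min(col))
    return [min(gower_distance(x, y, is_categorical, ranges) for y in Y) for x in X]


synth = [(30.0, "rome"), (50.0, "milan")]
train = [(30.0, "rome"), (40.0, "milan"), (70.0, "rome")]
dcr = distance_to_closest(synth, train, is_categorical=[False, True])
```

The first synthetic row is an exact copy of a training row, so its distance is 0; such zeros are what number_of_dcr_equal_to_zero counts as duplicates.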

sure.privacy.membership_inference_test(adversary_dataset: DataFrame | DataFrame | LazyFrame, synthetic_dataset: DataFrame | DataFrame | LazyFrame, adversary_guesses_ground_truth: ndarray | DataFrame | DataFrame | LazyFrame | Series, parallel: bool = True, path_to_json: str = '')

Simulate a Membership Inference Attack on the provided synthetic dataset using an adversary dataset.

Parameters

adversary_dataset : pd.DataFrame, pl.DataFrame, or pl.LazyFrame

The dataset used by the adversary, containing features for the attack simulation.

synthetic_dataset : pd.DataFrame, pl.DataFrame, or pl.LazyFrame

The synthetic dataset on which the Membership Inference Attack is performed.

adversary_guesses_ground_truth : np.ndarray, pd.DataFrame, pl.DataFrame, pl.LazyFrame, or pl.Series

Ground truth labels indicating whether a sample is from the original training dataset or not.

parallel : bool, optional

Whether to use parallel processing for distance calculations, by default True.

path_to_json : str, optional

Path to save the attack output as a JSON file. If empty, the output is not saved, by default “”.

Returns

dict

A dictionary containing the attack results, including distance thresholds, precisions, and risk score.

Notes

  • This function simulates an attack where an adversary attempts to distinguish between real and synthetic samples.

  • The attack results are saved to a JSON file if path_to_json is provided.
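A distance-threshold attack of this kind can be sketched as follows: the adversary guesses that a record was in the training set whenever its distance to the closest synthetic record falls below a threshold, and precision is measured against the ground truth. The thresholds, the toy distances, and the function name are assumptions; the library's risk score is not reproduced here.

```python
def attack_precision(distances, is_member, threshold):
    """Precision of guessing 'member' for rows within `threshold` of a synthetic record."""
    guesses = [d <= threshold for d in distances]
    true_pos = sum(g and m for g, m in zip(guesses, is_member))
    guessed = sum(guesses)
    # With no positive guesses, precision is reported as 0.0 by convention.
    return true_pos / guessed if guessed else 0.0


# Distance from each adversary row to its closest synthetic record (toy values).
dcr_adversary = [0.01, 0.02, 0.30, 0.03, 0.40, 0.50]
ground_truth = [True, True, True, False, False, False]

precisions = {
    t: attack_precision(dcr_adversary, ground_truth, t) for t in (0.02, 0.05, 1.0)
}
```

On a balanced adversary dataset, a precision near 0.5 means the adversary does no better than guessing, while values well above 0.5 at small thresholds point to a membership leak.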

sure.privacy.number_of_dcr_equal_to_zero(dcr_name: str, distances_to_closest_record: ndarray, path_to_json: str = '') int | int8 | uint8 | int16 | uint16 | int32 | uint32 | int64 | uint64

Return the number of 0s in the given DCR array, that is the number of duplicates/clones detected.

Parameters

distances_to_closest_record : np.ndarray

A 1D array containing the Distance to the Closest Record for each row of a dataframe, with shape (dataframe rows, ).

Returns

int

The number of 0s in the given DCR array.

sure.privacy.validation_dcr_test(dcr_synth_train: ndarray, dcr_synth_validation: ndarray, path_to_json: str = '') float | float16 | float32 | float64

Compute the percentage of synthetic records that are closer to the training set than to the validation set.

  • If the returned percentage is close to (or smaller than) 50%, then the synthetic dataset’s records are equally close to the original training set and to the validation set. In this case, the synthetic data does not make it possible to infer whether a record was contained in the training dataset.

  • If the returned percentage is greater than 50%, then the synthetic dataset’s records are closer to the training set than to the validation set, indicating that vulnerable records are present in the synthetic dataset.

Parameters

dcr_synth_train : np.ndarray

A 1D array containing the Distance to the Closest Record for each row of the synthetic dataset with respect to the training dataset, with shape (synthetic rows, ).

dcr_synth_validation : np.ndarray

A 1D array containing the Distance to the Closest Record for each row of the synthetic dataset with respect to the validation dataset, with shape (synthetic rows, ).

Returns

float

The percentage of synthetic rows closer to the training dataset than to the validation dataset.

Raises

ValueError

If the two DCR arrays given as parameters have different shapes.
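The percentage described above can be computed in a few lines. This sketch assumes that ties (equal distances to both sets) do not count as closer to the training set; the function name is hypothetical.

```python
def share_closer_to_training(dcr_synth_train, dcr_synth_validation):
    """Percentage of synthetic rows strictly closer to training than to validation."""
    if len(dcr_synth_train) != len(dcr_synth_validation):
        raise ValueError("the two DCR arrays must have the same length")
    closer = sum(t < v for t, v in zip(dcr_synth_train, dcr_synth_validation))
    return 100.0 * closer / len(dcr_synth_train)


pct = share_closer_to_training([0.1, 0.4, 0.2, 0.5], [0.3, 0.2, 0.6, 0.1])
```

Two of the four synthetic rows are closer to the training set, so the result is 50%, which matches the case where no membership signal is visible.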