sure.distance_metrics package

Submodules

sure.distance_metrics.distance module

sure.distance_metrics.distance.dcr_stats(dcr_name: str, distances_to_closest_record: ndarray, path_to_json: str = '') → Dict

Return summary statistics (mean and percentiles) for a previously computed DCR array.

Parameters

dcr_name : str

Name under which the DCR will be saved in the JSON file used to generate the final report. Can be one of the following:

  • synth_train

  • synth_val

  • other

distances_to_closest_record : np.ndarray

A 1D array containing the Distance to the Closest Record for each row of a dataframe, with shape (dataframe rows,).

Returns

Dict

A dictionary containing mean and percentiles of the given DCR array.
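As a rough illustration of the kind of summary described above, the following sketch computes the mean and a few percentiles of a DCR array with NumPy. The helper name `summarize_dcr` and the dictionary keys are hypothetical; the keys and percentiles actually returned by `dcr_stats` are not specified here, and the sketch does not write anything to a JSON file.

```python
import numpy as np


def summarize_dcr(dcr) -> dict:
    """Hypothetical stand-in for dcr_stats: mean plus a few percentiles."""
    dcr = np.asarray(dcr, dtype=float)
    return {
        "mean": float(np.mean(dcr)),
        "min": float(np.min(dcr)),
        "25%": float(np.percentile(dcr, 25)),
        "median": float(np.percentile(dcr, 50)),
        "75%": float(np.percentile(dcr, 75)),
        "max": float(np.max(dcr)),
    }


# Toy DCR array: one distance per row of a dataframe.
stats = summarize_dcr([0.0, 0.1, 0.2, 0.3, 0.4])
print(stats)
```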

sure.distance_metrics.distance.distance_to_closest_record(dcr_name: str, x_dataframe: DataFrame | DataFrame | LazyFrame, y_dataframe: DataFrame | DataFrame | LazyFrame | None = None, feature_weights: ndarray | List | None = None, parallel: bool = True, save_output: bool = True, path_to_json: str = '') → ndarray

Compute the distance to the closest record (DCR) for each row of dataframe X with respect to dataframe Y, using a modified version of Gower’s distance. The two dataframes may contain mixed data types (numerical and categorical).

Paper references:

  • A General Coefficient of Similarity and Some of Its Properties, J. C. Gower

  • Dimensionality Invariant Similarity Measure, Ahmad Basheer Hassanat

Parameters

dcr_name : str

Name under which the DCR will be saved in the JSON file used to generate the final report. Can be one of the following:

  • synth_train

  • synth_val

  • other

x_dataframe : pd.DataFrame

A dataset containing numerical and categorical data.

categorical_features : List

List of booleans indicating which features are categorical. If categorical_features[i] is True, feature i is categorical. Must have the same length as x_dataframe.columns.

y_dataframe : pd.DataFrame, optional

Another dataset containing numerical and categorical data, by default None. It must contain the same columns as x_dataframe. If None, the distance matrix is computed between x_dataframe and itself.

feature_weights : List, optional

List of feature weights to use when computing distances, by default None. If None, each feature has weight 1.0.

parallel : bool, optional

Whether to parallelize the computation of the Gower matrix, by default True.

save_output : bool

If True, saves the DCR information into the JSON file used to generate the final report.

Returns

np.ndarray

A 1D array containing the Distance to the Closest Record for each row of x_dataframe, with shape (x_dataframe rows,).

Raises

TypeError

If dcr_name is not one of the names listed above.

TypeError

If X and Y don’t have the same (number of) columns.
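To make the idea concrete, here is a minimal sketch of a DCR computation based on the classic (unmodified) Gower distance with equal feature weights. It is not the package's implementation: the helper `gower_dcr_sketch` is hypothetical, the modification mentioned above is not reproduced, and it assumes both dataframes are pandas objects with identical columns and no missing values.

```python
import numpy as np
import pandas as pd


def gower_dcr_sketch(x: pd.DataFrame, y: pd.DataFrame) -> np.ndarray:
    """For each row of x, the smallest plain Gower distance to any row of y."""
    num_cols = [c for c in x.columns if pd.api.types.is_numeric_dtype(x[c])]
    cat_cols = [c for c in x.columns if c not in num_cols]
    total = np.zeros((len(x), len(y)))

    # Numerical features: absolute difference scaled by the observed range.
    for c in num_cols:
        xv = x[c].to_numpy(dtype=float)
        yv = y[c].to_numpy(dtype=float)
        both = np.concatenate([xv, yv])
        value_range = both.max() - both.min()
        if value_range > 0:
            total += np.abs(xv[:, None] - yv[None, :]) / value_range

    # Categorical features: 0 if the values match, 1 otherwise.
    for c in cat_cols:
        xv = x[c].to_numpy()
        yv = y[c].to_numpy()
        total += (xv[:, None] != yv[None, :]).astype(float)

    # Average over features, then keep the closest record per row of x.
    return (total / len(x.columns)).min(axis=1)


synth = pd.DataFrame({"age": [30, 50], "city": ["Rome", "Milan"]})
train = pd.DataFrame({"age": [30, 40], "city": ["Rome", "Milan"]})
dcr = gower_dcr_sketch(synth, train)
print(dcr)  # first synthetic row is an exact copy of a training row
```

The sketch builds the full pairwise matrix in memory, which is only practical for small datasets.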

sure.distance_metrics.distance.number_of_dcr_equal_to_zero(dcr_name: str, distances_to_closest_record: ndarray, path_to_json: str = '') → int | int8 | uint8 | int16 | uint16 | int32 | uint32 | int64 | uint64

Return the number of 0s in the given DCR array, that is the number of duplicates/clones detected.

Parameters

distances_to_closest_record : np.ndarray

A 1D array containing the Distance to the Closest Record for each row of a dataframe, with shape (dataframe rows,).

Returns

int

The number of 0s in the given DCR array.
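The count described above can be reproduced on a toy array with plain NumPy. This is only an illustration of the quantity being reported, assuming an exact comparison with zero; it does not call the package function or write to the JSON report.

```python
import numpy as np

# Toy DCR array: two synthetic rows coincide exactly with an original record.
dcr = np.array([0.0, 0.12, 0.0, 0.4])

# Count the entries that are exactly zero, i.e. the detected duplicates/clones.
n_clones = int(np.count_nonzero(dcr == 0))
print(n_clones)  # 2
```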

sure.distance_metrics.distance.validation_dcr_test(dcr_synth_train: ndarray, dcr_synth_validation: ndarray, path_to_json: str = '') → float | float16 | float32 | float64

Compute the percentage of synthetic rows that are closer to the training dataset than to the validation dataset.

  • If the returned percentage is close to (or smaller than) 50%, then the synthetic dataset’s records are equally close to the original training set and to the validation set. In this case the synthetic data gives no basis for inferring whether a record was contained in the training dataset.

  • If the returned percentage is greater than 50%, then the synthetic dataset’s records are closer to the training set than to the validation set, indicating that vulnerable records are present in the synthetic dataset.

Parameters

dcr_synth_train : np.ndarray

A 1D array containing the Distance to the Closest Record for each row of the synthetic dataset with respect to the training dataset, with shape (synthetic rows,).

dcr_synth_validation : np.ndarray

A 1D array containing the Distance to the Closest Record for each row of the synthetic dataset with respect to the validation dataset, with shape (synthetic rows,).

Returns

float

The percentage of synthetic rows closer to the training dataset than to the validation dataset.

Raises

ValueError

If the two DCR arrays given as parameters have different shapes.
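The comparison can be sketched in a few lines of NumPy. The helper `share_closer_to_train` is hypothetical, and the handling of ties (a row equally distant from both sets is not counted as closer to the training set) is an assumption; the package function may treat ties differently.

```python
import numpy as np


def share_closer_to_train(dcr_synth_train, dcr_synth_validation) -> float:
    """Percentage of synthetic rows strictly closer to train than to validation."""
    train = np.asarray(dcr_synth_train, dtype=float)
    val = np.asarray(dcr_synth_validation, dtype=float)
    if train.shape != val.shape:
        raise ValueError("The two DCR arrays must have the same shape.")
    # Assumption: ties are not counted as closer to the training set.
    return float(np.mean(train < val) * 100)


pct = share_closer_to_train([0.1, 0.3, 0.2, 0.5], [0.2, 0.1, 0.4, 0.5])
print(pct)  # 50.0
```

In this toy case two of the four synthetic rows are closer to the training set, so the result sits at the 50% level described above.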

sure.distance_metrics.gower_matrix_c module

sure.distance_metrics.gower_matrix_c.gower_matrix_c(X_categorical, X_numerical, Y_categorical, Y_numerical, numericals_ranges, features_weight_sum, fill_diagonal, first_index)

Module contents