sure.distance_metrics package
Submodules
sure.distance_metrics.distance module
- sure.distance_metrics.distance.dcr_stats(dcr_name: str, distances_to_closest_record: ndarray, path_to_json: str = '') Dict
Return summary statistics for a previously computed DCR array.
Parameters
- dcr_name : str
Name under which the DCR is saved in the JSON file used to generate the final report. Can be one of the following:
synth_train
synth_val
other
- distances_to_closest_record : np.ndarray
A 1D array containing the Distance to the Closest Record for each row of a dataframe, shape (dataframe rows, )
Returns
- Dict
A dictionary containing mean and percentiles of the given DCR array.
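As a rough illustration of the kind of summary described above, the following sketch computes a mean and a few percentiles with NumPy. The helper name `dcr_summary` and the dictionary keys are assumptions made for this example; the keys actually returned by `dcr_stats` are not listed in this reference, and the sketch does not write anything to a JSON file.

```python
import numpy as np


def dcr_summary(dcr) -> dict:
    """Hypothetical helper: mean and selected percentiles of a 1D DCR array."""
    dcr = np.asarray(dcr, dtype=float)
    return {
        "mean": float(np.mean(dcr)),
        "min": float(np.min(dcr)),
        "25%": float(np.percentile(dcr, 25)),
        "median": float(np.percentile(dcr, 50)),
        "75%": float(np.percentile(dcr, 75)),
        "max": float(np.max(dcr)),
    }


# Made-up DCR values, one per row of some dataframe.
example_dcr = np.array([0.0, 0.1, 0.2, 0.3, 0.4])
summary = dcr_summary(example_dcr)
print(summary)
```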
- sure.distance_metrics.distance.distance_to_closest_record(dcr_name: str, x_dataframe: DataFrame | DataFrame | LazyFrame, y_dataframe: DataFrame | DataFrame | LazyFrame | None = None, feature_weights: ndarray | List | None = None, parallel: bool = True, save_output: bool = True, path_to_json: str = '') ndarray
Compute the distance to the closest record in dataframe Y for each row of dataframe X, using a modified version of Gower's distance. The two dataframes may contain mixed data types (numerical and categorical).
Paper references:
- A General Coefficient of Similarity and Some of Its Properties, J. C. Gower
- Dimensionality Invariant Similarity Measure, Ahmad Basheer Hassanat
Parameters
- dcr_name : str
Name under which the DCR is saved in the JSON file used to generate the final report. Can be one of the following:
synth_train
synth_val
other
- x_dataframe : pd.DataFrame
A dataset containing numerical and categorical data.
- categorical_features : List
List of booleans indicating which features are categorical. If categorical_features[i] is True, feature i is categorical. Must have the same length as x_dataframe.columns.
- y_dataframe : pd.DataFrame, optional
Another dataset containing numerical and categorical data, by default None. It must contain the same columns as x_dataframe. If None, the distance matrix is computed between x_dataframe and itself.
- feature_weights : List, optional
List of feature weights to use when computing distances, by default None. If None, each feature weight is 1.0.
- parallel : bool, optional
Whether to parallelize the computation of the Gower matrix, by default True.
- save_output : bool
If True, saves the DCR information into the JSON file used to generate the final report.
Returns
- np.ndarray
A 1D array containing the Distance to the Closest Record for each row of x_dataframe, shape (x_dataframe rows, )
Raises
- TypeError
If dcr_name is not one of the names listed above.
- TypeError
If X and Y don’t have the same (number of) columns.
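To make the idea of a distance to the closest record concrete, here is a minimal sketch based on the plain, unweighted Gower distance. It is not the library's implementation: the source describes a modified Gower distance with optional feature weights and parallelism, none of which are reproduced here. The function name `gower_dcr_sketch`, the rule that non-numeric columns are treated as categorical, and the example data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def gower_dcr_sketch(x: pd.DataFrame, y: pd.DataFrame) -> np.ndarray:
    """Illustrative unweighted Gower-style DCR of each row of x w.r.t. y."""
    # Assumption: numeric dtypes are numerical features, everything else is categorical.
    num_cols = x.select_dtypes(include="number").columns
    cat_cols = x.columns.difference(num_cols)

    # Numerical part: absolute differences scaled by the range over both frames.
    x_num = x[num_cols].to_numpy(dtype=float)
    y_num = y[num_cols].to_numpy(dtype=float)
    both = np.vstack([x_num, y_num])
    ranges = both.max(axis=0) - both.min(axis=0)
    ranges[ranges == 0] = 1.0  # avoid division by zero for constant columns
    num_dist = np.abs(x_num[:, None, :] - y_num[None, :, :]) / ranges

    # Categorical part: 0 when the categories match, 1 otherwise.
    x_cat = x[cat_cols].to_numpy(dtype=object)
    y_cat = y[cat_cols].to_numpy(dtype=object)
    cat_dist = (x_cat[:, None, :] != y_cat[None, :, :]).astype(float)

    # Average over all features, then keep the distance to the closest row of y.
    gower = (num_dist.sum(axis=2) + cat_dist.sum(axis=2)) / x.shape[1]
    return gower.min(axis=1)


# Made-up data: two synthetic rows compared against three training rows.
synth = pd.DataFrame({"age": [30, 50], "city": ["Rome", "Milan"]})
train = pd.DataFrame({"age": [30, 40, 60], "city": ["Rome", "Milan", "Turin"]})
dcr = gower_dcr_sketch(synth, train)
print(dcr)
```

In this toy example the first synthetic row is an exact copy of a training row, so its distance is 0. The second row's closest match differs only in age, by 10 over a range of 30, averaged over two features.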
- sure.distance_metrics.distance.number_of_dcr_equal_to_zero(dcr_name: str, distances_to_closest_record: ndarray, path_to_json: str = '') int | int8 | uint8 | int16 | uint16 | int32 | uint32 | int64 | uint64
Return the number of 0s in the given DCR array, that is the number of duplicates/clones detected.
Parameters
- distances_to_closest_record : np.ndarray
A 1D array containing the Distance to the Closest Record for each row of a dataframe, shape (dataframe rows, )
Returns
- int
The number of 0s in the given DCR array.
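Counting zeros in a DCR array is a one-liner in NumPy. The snippet below is only a sketch with made-up values; it uses an exact comparison with 0, which is an assumption about how duplicates are detected, and it does not save anything to the report file.

```python
import numpy as np

# Made-up DCR values: two rows coincide exactly with a record in the other dataset.
dcr = np.array([0.0, 0.12, 0.0, 0.4])

# Count the entries that are exactly zero (duplicates/clones).
n_clones = int(np.count_nonzero(dcr == 0))
print(n_clones)
```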
- sure.distance_metrics.distance.validation_dcr_test(dcr_synth_train: ndarray, dcr_synth_validation: ndarray, path_to_json: str = '') float | float16 | float32 | float64
If the returned percentage is close to (or smaller than) 50%, then the synthetic dataset's records are equally close to the original training set and to the validation set. In this case the synthetic data does not allow one to infer whether a record was or was not contained in the training dataset.
If the returned percentage is greater than 50%, then the synthetic dataset's records are closer to the training set than to the validation set, indicating that vulnerable records are present in the synthetic dataset.
Parameters
- dcr_synth_train : np.ndarray
A 1D array containing the Distance to the Closest Record for each row of the synthetic dataset with respect to the training dataset, shape (synthetic rows, )
- dcr_synth_validation : np.ndarray
A 1D array containing the Distance to the Closest Record for each row of the synthetic dataset with respect to the validation dataset, shape (synthetic rows, )
Returns
- float
The percentage of synthetic rows closer to the training dataset than to the validation dataset.
Raises
- ValueError
If the two DCR arrays given as parameters have different shapes.
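The percentage described above can be sketched as an element-wise comparison of the two DCR arrays. The helper name `share_closer_to_train` is hypothetical, and the handling of ties (a strict less-than, so ties count as not closer to training) is an assumption; the library may treat them differently.

```python
import numpy as np


def share_closer_to_train(dcr_train, dcr_val) -> float:
    """Hypothetical helper: percentage of rows strictly closer to training data."""
    dcr_train = np.asarray(dcr_train, dtype=float)
    dcr_val = np.asarray(dcr_val, dtype=float)
    if dcr_train.shape != dcr_val.shape:
        raise ValueError("Both DCR arrays must have the same shape.")
    # Assumption: a tie is not counted as "closer to the training set".
    return float(np.mean(dcr_train < dcr_val) * 100)


# Made-up distances for four synthetic rows.
dcr_synth_train = np.array([0.1, 0.3, 0.2, 0.5])
dcr_synth_validation = np.array([0.2, 0.1, 0.4, 0.5])
pct = share_closer_to_train(dcr_synth_train, dcr_synth_validation)
print(pct)
```

Here two of the four rows are strictly closer to the training set, giving 50%, which by the interpretation above would not suggest that training records are exposed.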
sure.distance_metrics.gower_matrix_c module
- sure.distance_metrics.gower_matrix_c.gower_matrix_c(X_categorical, X_numerical, Y_categorical, Y_numerical, numericals_ranges, features_weight_sum, fill_diagonal, first_index)