anomed_challenge.challenge

This module provides the means to define challenges for the AnoMed competition platform. It focuses on the definition of data and utility and deliberately leaves out any web communication concerns.

Module Contents

Classes

InMemoryNumpyArrays

An instance of a NumpyDataset for NumPy arrays that reside in main memory.

NpzFromDisk

An instance of a NumpyDataset, suitable for .npz files saved to disk.

NumpyDataset

An abstract class defining a basic interface for handling datasets, based on Numpy arrays.

SupervisedLearningMIAChallenge

This class represents supervised learning ML challenges, where the threat model is concerned with membership inference attacks (MIA).

TabularDataReconstructionChallenge

This class represents challenges that aim to protect tabular data via anonymization, while maintaining certain utility. The threat model involves attackers that aim to reconstruct the tabular data from the anonymized data plus background knowledge.

Functions

discard_targets

A dataset transformer which discards the target array, replacing it with an empty one.

evaluate_membership_inference_attack

Evaluate the prediction of a membership estimator in terms of binary accuracy, true positive rate and false positive rate.

evaluate_MIA

A shorter alias for evaluate_membership_inference_attack.

strict_binary_accuracy

Calculate the ‘strict’ binary accuracy of the prediction with respect to the ground truth.

Data

AnonymizationScheme

What scheme does the anonymized data follow?

API

anomed_challenge.challenge.AnonymizationScheme: TypeAlias = None

What scheme does the anonymized data follow?

  • same: Leaky data and anonymized data match regarding shape (columns) and datatype.

  • generalization: Cardinal data will be represented by (min, max) ranges, nominal data will be one-hot encoded.

  • microaggregation: TODO
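As a rough illustration of the generalization scheme, the following sketch turns a cardinal column into (min, max) bounds and one-hot encodes a nominal column using plain pandas. The bin width of 20, the "_min"/"_max" column names and the example data are assumptions made for this example; the layout the library actually expects may differ.

```python
import pandas as pd

leaky = pd.DataFrame({"age": [23, 37, 58], "sex": ["f", "m", "f"]})

# Cardinal column: replace each value by the bounds of a 20-year bin.
# Bin width and the "_min"/"_max" column names are illustrative assumptions.
age_min = (leaky["age"] // 20) * 20
generalized = pd.DataFrame({"age_min": age_min, "age_max": age_min + 19})

# Nominal column: one-hot encode the categories.
one_hot = pd.get_dummies(leaky["sex"], prefix="sex")
generalized = pd.concat([generalized, one_hot], axis=1)
```

Each original age then lies within its [age_min, age_max] range, and every nominal category has its own indicator column.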

anomed_challenge.challenge.discard_targets(data: anomed_challenge.challenge.NumpyDataset) anomed_challenge.challenge.InMemoryNumpyArrays

A dataset transformer which discards the target array, replacing it with an empty one.

Use this function to avoid leaking targets/labels.

Parameters:

data (NumpyDataset) – The dataset to discard the targets of.

Returns:

A new dataset, containing the old feature array and an empty target array.

Return type:

InMemoryNumpyArrays

anomed_challenge.challenge.evaluate_membership_inference_attack(prediction: numpy.ndarray, ground_truth: numpy.ndarray) dict[str, float]

Evaluate the prediction of a membership estimator in terms of binary accuracy, true positive rate and false positive rate.

The ground truth and the corresponding dataset are provided by SupervisedLearningMIAChallenge.MIA_evaluation_data (the dataset is the first component, the membership mask is the second component).

Parameters:
  • prediction (np.ndarray) – A non-empty, one-dimensional boolean array containing the estimator’s membership prediction.

  • ground_truth (np.ndarray) – A non-empty, one-dimensional boolean array containing the true memberships.

Returns:

A dictionary of metrics, namely accuracy (‘acc’), false positive rate (‘fpr’) and true positive rate (‘tpr’).

Return type:

dict[str, float]

Raises:

ValueError – If prediction or ground_truth is an empty array, if they are not of the same length, or if they are not boolean arrays.
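The three metrics can be sketched in plain NumPy as follows. This is an illustrative stand-in, not the library's implementation; the function name mia_metrics_sketch and the choice to report 0.0 when a rate's denominator is zero are assumptions made for this example.

```python
import numpy as np


def mia_metrics_sketch(prediction: np.ndarray, ground_truth: np.ndarray) -> dict[str, float]:
    """Illustrative stand-in, not the library's implementation."""
    for arr in (prediction, ground_truth):
        if arr.ndim != 1 or arr.size == 0 or arr.dtype != np.bool_:
            raise ValueError("Expected non-empty, one-dimensional boolean arrays.")
    if prediction.shape != ground_truth.shape:
        raise ValueError("prediction and ground_truth must have the same length.")

    true_positives = np.sum(prediction & ground_truth)
    false_positives = np.sum(prediction & ~ground_truth)
    positives = np.sum(ground_truth)
    negatives = ground_truth.size - positives
    return {
        "acc": float(np.mean(prediction == ground_truth)),
        # Assumption: a rate is reported as 0.0 if its denominator is zero.
        "tpr": float(true_positives / positives) if positives else 0.0,
        "fpr": float(false_positives / negatives) if negatives else 0.0,
    }


prediction = np.array([True, True, False, False])
ground_truth = np.array([True, False, True, False])
metrics = mia_metrics_sketch(prediction, ground_truth)
```

In this example, one of the two members is detected and one of the two non-members is wrongly flagged, so all three metrics equal 0.5.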

anomed_challenge.challenge.evaluate_MIA(prediction: numpy.ndarray, ground_truth: numpy.ndarray) dict[str, float]

A shorter alias for evaluate_membership_inference_attack.

class anomed_challenge.challenge.InMemoryNumpyArrays(X: numpy.ndarray, y: numpy.ndarray)

Bases: anomed_challenge.challenge.NumpyDataset

An instance of a NumpyDataset for NumPy arrays that reside in main memory.

Initialization

Parameters:
  • X (np.ndarray) – The feature array.

  • y (np.ndarray) – The target array.

get() tuple[numpy.ndarray, numpy.ndarray]

class anomed_challenge.challenge.NpzFromDisk(npz_filepath: str | pathlib.Path, X_label: str = 'X', y_label: str = 'y')

Bases: anomed_challenge.challenge.NumpyDataset

An instance of a NumpyDataset, suitable for .npz files saved to disk.

Initialization

Parameters:
  • npz_filepath (str | Path) – The path to the .npz file.

  • X_label (str, optional) – The label of the array within the .npz file which should be treated as X, the feature array. By default “X”.

  • y_label (str, optional) – The label of the array within the .npz file which should be treated as y, the target array. By default “y”.

Notes

Other arrays that might reside in the .npz file are ignored.
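The following sketch uses plain NumPy to show what the labels refer to: the keyword names passed to np.savez become the array labels inside the .npz file. The file name and the labels "features", "labels" and "extra" are hypothetical. Going by the signature above, NpzFromDisk(npz_filepath, X_label="features", y_label="labels").get() would be expected to select the same two arrays.

```python
import tempfile
from pathlib import Path

import numpy as np

X = np.arange(6, dtype=np.float64).reshape(3, 2)
y = np.array([0, 1, 0])

with tempfile.TemporaryDirectory() as tmp_dir:
    npz_filepath = Path(tmp_dir) / "example.npz"
    # The keyword names become the array labels inside the archive.
    np.savez(npz_filepath, features=X, labels=y, extra=np.zeros(1))

    # Plain NumPy view of what X_label and y_label select; "extra" is ignored.
    with np.load(npz_filepath) as archive:
        stored_labels = sorted(archive.files)
        loaded_X = archive["features"]
        loaded_y = archive["labels"]
```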

get() tuple[numpy.ndarray, numpy.ndarray]

class anomed_challenge.challenge.NumpyDataset

Bases: abc.ABC

An abstract class defining a basic interface for handling datasets, based on Numpy arrays.

It is intended for the common case of supervised learning, where you have feature arrays and target arrays. Inheriting classes only have to provide a get method. Default implementations for __eq__, __repr__ and __str__ are given.

abstract get() tuple[numpy.ndarray, numpy.ndarray]

An accessor to the feature array and the target array.

Returns:

(X, y) – X is the feature array, y is the target array.

Return type:

tuple[np.ndarray, np.ndarray]
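A custom dataset therefore only needs to implement get. The sketch below uses a hypothetical stand-in base class, NumpyDatasetStandIn, so that it runs without the library installed; in real use, the subclass would inherit from NumpyDataset instead. SyntheticDataset and its parameters are likewise invented for illustration.

```python
from abc import ABC, abstractmethod

import numpy as np


class NumpyDatasetStandIn(ABC):
    """Hypothetical stand-in mirroring only the documented abstract method."""

    @abstractmethod
    def get(self) -> tuple[np.ndarray, np.ndarray]:
        ...


class SyntheticDataset(NumpyDatasetStandIn):
    """Hypothetical dataset that generates its arrays on demand."""

    def __init__(self, n_samples: int, n_features: int, seed: int = 0):
        self._n_samples = n_samples
        self._n_features = n_features
        self._seed = seed

    def get(self) -> tuple[np.ndarray, np.ndarray]:
        # Re-seeding on every call makes repeated calls return equal arrays.
        rng = np.random.default_rng(self._seed)
        X = rng.normal(size=(self._n_samples, self._n_features))
        y = (X.sum(axis=1) > 0).astype(np.int64)
        return X, y


X, y = SyntheticDataset(n_samples=5, n_features=3).get()
```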

__eq__(other) bool

__repr__() str

__str__() str

property shapes: list[tuple[int, ...]]

property dtypes: list[numpy.dtype]

anomed_challenge.challenge.strict_binary_accuracy(prediction: numpy.ndarray, ground_truth: numpy.ndarray) dict[str, float]

Calculate the ‘strict’ binary accuracy of the prediction with respect to the ground truth.

By strict accuracy, we mean the number of indices i, with 0 <= i < len(prediction), for which

`prediction[i] == ground_truth[i]`

is True, divided by len(prediction). Note that this differs from a measure based on sum(prediction==ground_truth), which counts matching elements individually and is therefore more forgiving for higher-dimensional arrays; for one-dimensional arrays, the two are equivalent.

Parameters:
  • prediction (np.ndarray) – An estimator’s prediction. Should have same shape and dtype as ground_truth and should not be empty.

  • ground_truth (np.ndarray) – The respective ground truth. Should have same shape and dtype as prediction and should not be empty.

Returns:

A dictionary with key accuracy and a strict accuracy value.

Return type:

dict[str, float]

Raises:

ValueError – If prediction or ground_truth is empty, or if their shapes or dtypes do not match.
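The difference between the strict and the element-wise view can be seen on a small two-dimensional example. The sketch below uses plain NumPy and assumes that "strict" means an entry i counts as correct only if all of its elements agree; it is not the library's implementation.

```python
import numpy as np

prediction = np.array([[1, 0], [1, 1], [0, 0]])
ground_truth = np.array([[1, 0], [1, 0], [0, 1]])

# Element-wise view: 4 of the 6 individual entries agree.
elementwise = float(np.mean(prediction == ground_truth))

# Strict view (assumed reading): row i counts only if all of its entries agree.
rows_match = np.all(prediction == ground_truth, axis=1)
strict = float(np.mean(rows_match))
```

Here only the first row matches completely, so the strict value is 1/3, whereas the element-wise value is 4/6.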

class anomed_challenge.challenge.SupervisedLearningMIAChallenge(training_data: anomed_challenge.challenge.NumpyDataset, tuning_data: anomed_challenge.challenge.NumpyDataset, validation_data: anomed_challenge.challenge.NumpyDataset, anonymizer_evaluator: Callable[[numpy.ndarray, numpy.ndarray], dict[str, float]], MIA_evaluator: Callable[[numpy.ndarray, numpy.ndarray], dict[str, float]], MIA_evaluation_dataset_length: int, seed: int | None = None)

This class represents supervised learning ML challenges, where the threat model is concerned with membership inference attacks (MIA).

Instances bundle training, tuning and validation datasets for anonymizers (privacy preserving ML models); member, non-member and evaluation datasets for deanonymizers (attacks on anonymizers); and also means of evaluating anonymizers and deanonymizers. If you are able to define your challenge obeying this interface, you will be able to transform it into an AnoMed-compatible web server “for free”, by using anomed_challenge.supervised_learning_MIA_challenge_server_factory.

Initialization

Parameters:
  • training_data (NumpyDataset) – The dataset offered to peers, whose goal is to train anonymizers.

  • tuning_data (NumpyDataset) – The dataset offered to peers, whose goal is to tune their anonymizers which are currently in training (this is also called “validation data” outside of the medical ML community).

  • validation_data (NumpyDataset) – The dataset used to validate (in the regulatory sense) the performance of a fully trained anonymizer. Not to be confused with what is called “validation data” outside of the medical ML field – for that we use the term “tuning data”.

  • anonymizer_evaluator (Callable[[np.ndarray, np.ndarray], dict[str, float]]) – A way to evaluate the prediction (first argument) of an anonymizer compared to the ground truth (second argument, will be substituted by the target array (y) of validation_data).

  • MIA_evaluator (Callable[[np.ndarray, np.ndarray], dict[str, float]]) – A way to evaluate the prediction of a membership inference attack (first argument) compared to the ground truth memberships (second argument, defined by the second component of MIA_evaluation_data(...)).

  • MIA_evaluation_dataset_length (int) – The number of members and also the number of non-members to include in the return value of MIA_evaluation_data. That means in total, the length of that dataset will be 2 * MIA_evaluation_dataset_length. We suggest setting this to at least 100.

  • seed (int | None, optional) – If given, use this seed to create members and non_members. By default None, which means obtain randomness non-deterministically at runtime.
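Both evaluator parameters accept any callable that takes a prediction and a ground truth array and returns a dictionary of float metrics. A minimal sketch of such a callable is shown below; the function name accuracy_evaluator and the metric key "accuracy" are hypothetical choices for this example.

```python
import numpy as np


def accuracy_evaluator(prediction: np.ndarray, ground_truth: np.ndarray) -> dict[str, float]:
    """Hypothetical anonymizer_evaluator returning plain classification accuracy."""
    if prediction.shape != ground_truth.shape:
        raise ValueError("prediction and ground_truth must have the same shape.")
    return {"accuracy": float(np.mean(prediction == ground_truth))}


result = accuracy_evaluator(np.array([0, 1, 1, 0]), np.array([0, 1, 0, 0]))
```

A function like this could be passed as anonymizer_evaluator, while evaluate_membership_inference_attack from this module matches the signature expected for MIA_evaluator.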

property members: anomed_challenge.challenge.NumpyDataset

A dataset to train membership inference attacks. It consists of features and corresponding targets that the model under attack has seen during training (therefore “members”), i.e. the data is a subset of training_data.

property non_members: anomed_challenge.challenge.NumpyDataset

A dataset to train membership inference attacks. It consists of features and corresponding targets that the model under attack has not seen during training (therefore “non_members”), i.e. the data is a subset of validation_data.

MIA_evaluation_data(anonymizer: str, deanonymizer: str, data_split: Literal["tuning", "validation"]) tuple[anomed_challenge.challenge.NumpyDataset, numpy.ndarray]

A dataset and corresponding memberships to evaluate the success of a membership inference attack.

The dataset is individual for the specific combination of anonymizer, deanonymizer and the setting of data_split (at least with high probability). Its size is determined by 2 * MIA_evaluation_dataset_length (provided at initialization).

Parameters:
  • anonymizer (str) – The identifier of the anonymizer under attack.

  • deanonymizer (str) – The identifier of the membership inference attack.

  • data_split (Literal["tuning", "validation"]) – Which ground truth to use for reference. There is one dataset determined to be used once for validation and another, disjoint dataset determined to be used multiple times for tuning.

Returns:

(dataset, memberships) – A dataset of the same width and dtype as the training, tuning and validation data (although not of the same length/height) and a boolean array with corresponding memberships (ground truth). If memberships[i] is True, dataset[i] == (X[i], y[i]) is a member, i.e. part of the training dataset. If memberships[i] is False, it is not a member, i.e. part of the validation dataset.

Return type:

tuple[NumpyDataset, np.ndarray]
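How such an evaluation set could be assembled is sketched below, for feature arrays only. This is an illustration under stated assumptions, not the library's actual sampling procedure: the function name, the idea of deriving the seed from a hash of the three identifiers, and the final shuffle are all invented for this example.

```python
import hashlib

import numpy as np


def evaluation_sample_sketch(train_X, validation_X, n, anonymizer, deanonymizer, data_split):
    """Illustration only; the library's actual sampling may differ."""
    # Assumption: derive a reproducible seed from the three identifiers.
    key = f"{anonymizer}|{deanonymizer}|{data_split}".encode()
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    rng = np.random.default_rng(seed)

    # Draw n members from the training data and n non-members from the validation data.
    member_rows = rng.choice(len(train_X), size=n, replace=False)
    non_member_rows = rng.choice(len(validation_X), size=n, replace=False)

    X = np.concatenate([train_X[member_rows], validation_X[non_member_rows]])
    memberships = np.concatenate([np.ones(n, dtype=bool), np.zeros(n, dtype=bool)])

    # Shuffle so that the position of a row does not reveal its membership.
    order = rng.permutation(2 * n)
    return X[order], memberships[order]


train_X = np.arange(20).reshape(10, 2)
validation_X = np.arange(100, 120).reshape(10, 2)
X, memberships = evaluation_sample_sketch(train_X, validation_X, 3, "anon-a", "attack-b", "tuning")
```

The result has 2 * n rows, exactly n of which are marked as members, and repeating the call with the same identifiers yields the same sample.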

evaluate_anonymizer(prediction: numpy.ndarray, ground_truth: numpy.ndarray) dict[str, float]

Evaluate an anonymizer, governed by anonymizer_evaluator provided at initialization.

Parameters:
  • prediction (np.ndarray) – The prediction of the anonymizer under evaluation, i.e. an array of the same dtype and shape as the target array of the validation dataset.

  • ground_truth (np.ndarray) – The corresponding ground truth (target array of the validation dataset).

Returns:

A dictionary of evaluation metrics, depending on the return value of anonymizer_evaluator.

Return type:

dict[str, float]

evaluate_membership_inference_attack(prediction: numpy.ndarray, ground_truth: numpy.ndarray) dict[str, float]

Evaluate a membership inference attack, governed by MIA_evaluator provided at initialization.

Parameters:
  • prediction (np.ndarray) – The prediction of the membership inference attack under evaluation, i.e. a boolean array of size 2 * MIA_evaluation_dataset_length.

  • ground_truth (np.ndarray) – The corresponding ground truth (second component of MIA_evaluation_data(...)).

Returns:

A dictionary of evaluation metrics, depending on the return value of MIA_evaluator.

Return type:

dict[str, float]

class anomed_challenge.challenge.TabularDataReconstructionChallenge(leaky_data: pandas.DataFrame, background_knowledge: pandas.DataFrame, utility_evaluator: Callable[[pandas.DataFrame, anomed_challenge.challenge.AnonymizationScheme, pandas.DataFrame], dict[str, float]], privacy_evaluator: Callable[[pandas.DataFrame, pandas.DataFrame], dict[str, float]])

This class represents challenges that aim to protect tabular data via anonymization, while maintaining certain utility. The threat model involves attackers that aim to reconstruct the tabular data from the anonymized data plus background knowledge.

Initialization

Parameters:
  • leaky_data (pd.DataFrame) – The original tabular data which should be protected by anonymization.

  • background_knowledge (pd.DataFrame) – Background knowledge that is accessible by attackers.

  • utility_evaluator (Callable[[pd.DataFrame, AnonymizationScheme, pd.DataFrame], dict[str, float]]) –

    How to evaluate the utility of anonymized data, depending on the kind of anonymization. The first argument is the anonymized data, the second argument is the anonymization scheme and the third argument is the original/leaky data for comparison.

    This function parameter is used when evaluate_utility is called.

  • privacy_evaluator (Callable[[pd.DataFrame, pd.DataFrame], dict[str, float]]) –

    How to evaluate the quality of reconstructed data, i.e. the success of the attacker. This is also an indirect measure of how privacy preserving the anonymized data (as an attack target) has been. The first argument is the reconstructed data, the second argument the leaky data, which the attack target’s anonymized data depends on.

    This function parameter is used when evaluate_privacy is called.

evaluate_utility(anonymized_data: pandas.DataFrame, anonymization_scheme: anomed_challenge.challenge.AnonymizationScheme, leaky_data: pandas.DataFrame) dict[str, float]

evaluate_privacy(reconstructed_data: pandas.DataFrame, leaky_data: pandas.DataFrame) dict[str, float]
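A privacy_evaluator matching the documented signature could look like the following sketch, which reports the share of cells an attacker reconstructed exactly. The function name, the metric key "reconstruction_rate" and the example data are hypothetical; a real challenge would likely use a more refined measure.

```python
import pandas as pd


def cell_match_privacy_evaluator(
    reconstructed_data: pd.DataFrame, leaky_data: pd.DataFrame
) -> dict[str, float]:
    """Hypothetical privacy_evaluator: share of cells the attacker got exactly right."""
    if reconstructed_data.shape != leaky_data.shape:
        raise ValueError("Both DataFrames must have the same shape.")
    # Compare cell by cell, ignoring index and column labels.
    matches = reconstructed_data.to_numpy() == leaky_data.to_numpy()
    return {"reconstruction_rate": float(matches.mean())}


leaky = pd.DataFrame({"age": [34, 51, 29], "city": ["Kiel", "Bonn", "Ulm"]})
guess = pd.DataFrame({"age": [34, 50, 29], "city": ["Kiel", "Bonn", "Jena"]})
score = cell_match_privacy_evaluator(guess, leaky)
```

In this example, four of the six cells are reconstructed correctly. A higher value indicates a more successful attack and thus weaker protection by the anonymized data.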