anomed_challenge.challenge
This module provides means to create challenges for the AnoMed competition platform, focusing on the data and utility definition, avoiding any web communication issues.
Module Contents
Classes
InMemoryNumpyArrays | An instance of a NumpyDataset for Numpy arrays that reside in main memory.
NpzFromDisk | An instance of a NumpyDataset, suitable for .npz files saved to disk.
NumpyDataset | An abstract class defining a basic interface for handling datasets, based on Numpy arrays.
SupervisedLearningMIAChallenge | This class represents supervised learning ML challenges, where the threat model is concerned with membership inference attacks (MIA).
TabularDataReconstructionChallenge | This class represents challenges that aim to protect tabular data via anonymization, while maintaining certain utility. The threat model involves attackers that aim to reconstruct the tabular data from the anonymized data plus background knowledge.
Functions
discard_targets | A dataset transformer which discards the target array, replacing it with an empty one.
evaluate_membership_inference_attack | Evaluate the prediction of a membership estimator in terms of binary accuracy, true positive rate and false positive rate.
evaluate_MIA | A shorter alias for evaluate_membership_inference_attack.
strict_binary_accuracy | Calculate the ‘strict’ binary accuracy of the prediction with respect to the ground truth.
Data
AnonymizationScheme | What scheme does the anonymized data follow?
API
- anomed_challenge.challenge.AnonymizationScheme: TypeAlias = None
What scheme does the anonymized data follow?
same: Leaky data and anonymized data match regarding shape (columns) and datatype.
generalization: Cardinal data will be represented by (min, max) ranges, nominal data will be one-hot encoded.
microaggregation: TODO
- anomed_challenge.challenge.discard_targets(data: anomed_challenge.challenge.NumpyDataset) anomed_challenge.challenge.InMemoryNumpyArrays
A dataset transformer which discards the target array, replacing it with an empty one.
Use this function to avoid leaking targets/labels.
- Parameters:
data (NumpyDataset) – The dataset to discard the targets of.
- Returns:
A new dataset, containing the old feature array and an empty target array.
- Return type:
InMemoryNumpyArrays
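The effect of discarding targets can be illustrated without the library. The following is only a sketch of the idea on plain arrays: `discard_targets_sketch` is a hypothetical helper, and the choice of a zero-length array that keeps the dtype of `y` is an assumption, since the exact form of the empty target array is not specified here.

```python
import numpy as np


def discard_targets_sketch(
    X: np.ndarray, y: np.ndarray
) -> tuple[np.ndarray, np.ndarray]:
    """Return the features unchanged together with an empty target array."""
    # Assumption: "empty" means zero elements; the dtype of y is kept.
    empty_y = np.empty(shape=(0,), dtype=y.dtype)
    return X, empty_y


X = np.arange(6).reshape(3, 2)
y = np.array([0, 1, 0])

# The features survive, the labels do not.
X_public, y_public = discard_targets_sketch(X, y)
```
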
- anomed_challenge.challenge.evaluate_membership_inference_attack(prediction: numpy.ndarray, ground_truth: numpy.ndarray) dict[str, float]
Evaluate the prediction of a membership estimator in terms of binary accuracy, true positive rate and false positive rate.
The ground truth and the corresponding dataset are provided by SupervisedLearningMIAChallenge.MIA_evaluation_data (the dataset is the first component, the membership mask is the second component).
- Parameters:
prediction (np.ndarray) – A non-empty, one-dimensional boolean array containing the estimator’s membership prediction.
ground_truth (np.ndarray) – A non-empty, one-dimensional boolean array containing the true memberships.
- Returns:
A dictionary of metrics, namely accuracy (‘acc’), false positive rate (‘fpr’) and true positive rate (‘tpr’).
- Return type:
dict[str, float]
- Raises:
ValueError – If prediction or ground_truth are empty arrays, or if they are not of the same length or not boolean arrays.
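As an illustration of the three metrics, here is a minimal standalone sketch. It is not the library's implementation: `mia_metrics_sketch` is a hypothetical name, and returning NaN for a rate whose class is absent from the ground truth is an assumption made for this example only. The dictionary keys and the validation rules follow the description above.

```python
import numpy as np


def mia_metrics_sketch(
    prediction: np.ndarray, ground_truth: np.ndarray
) -> dict[str, float]:
    """Binary accuracy, false positive rate and true positive rate."""
    for name, arr in (("prediction", prediction), ("ground_truth", ground_truth)):
        if arr.ndim != 1 or arr.size == 0 or arr.dtype != np.bool_:
            raise ValueError(
                f"{name} must be a non-empty, one-dimensional boolean array"
            )
    if len(prediction) != len(ground_truth):
        raise ValueError("prediction and ground_truth must have the same length")

    n_members = int(ground_truth.sum())
    n_non_members = int((~ground_truth).sum())
    true_positives = int((prediction & ground_truth).sum())
    false_positives = int((prediction & ~ground_truth).sum())

    acc = float((prediction == ground_truth).mean())
    # Assumption: a rate is NaN if its class does not occur in ground_truth.
    tpr = true_positives / n_members if n_members else float("nan")
    fpr = false_positives / n_non_members if n_non_members else float("nan")
    return {"acc": acc, "fpr": fpr, "tpr": tpr}


metrics = mia_metrics_sketch(
    prediction=np.array([True, True, False, False]),
    ground_truth=np.array([True, False, True, False]),
)
```

In this example, one of two members and one of two non-members are predicted as members, so all three values are 0.5, which is what random guessing would achieve on a balanced dataset.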
- anomed_challenge.challenge.evaluate_MIA(prediction: numpy.ndarray, ground_truth: numpy.ndarray) dict[str, float]
A shorter alias for
evaluate_membership_inference_attack.
- class anomed_challenge.challenge.InMemoryNumpyArrays(X: numpy.ndarray, y: numpy.ndarray)
Bases:
anomed_challenge.challenge.NumpyDataset
An instance of a NumpyDataset for Numpy arrays that reside in main memory.
Initialization
- Parameters:
X (np.ndarray) – The feature array.
y (np.ndarray) – The target array.
- get() tuple[numpy.ndarray, numpy.ndarray]
- class anomed_challenge.challenge.NpzFromDisk(npz_filepath: str | pathlib.Path, X_label: str = 'X', y_label: str = 'y')
Bases:
anomed_challenge.challenge.NumpyDataset
An instance of a NumpyDataset, suitable for .npz files saved to disk.
Initialization
- Parameters:
npz_filepath (str | Path) – The path to the .npz file.
X_label (str, optional) – The label of the array within the .npz file which should be treated as X, the feature array. By default “X”.
y_label (str, optional) – The label of the array within the .npz file which should be treated as y, the target array. By default “y”.
Notes
Other arrays that might reside in the .npz file are ignored.
- get() tuple[numpy.ndarray, numpy.ndarray]
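To make the expected file layout concrete, the sketch below writes and reads an .npz archive with plain NumPy. It does not use NpzFromDisk itself. The file name and the additional array `extra` are made up for illustration, while the labels "X" and "y" match the defaults described above.

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    npz_path = Path(tmp_dir) / "example.npz"  # hypothetical file name

    # Store features and targets under the default labels "X" and "y".
    # "extra" stands for any other array that may reside in the file.
    np.savez(npz_path, X=np.zeros((4, 3)), y=np.ones(4), extra=np.arange(2))

    # Reading back only the two labelled arrays; "extra" is never touched.
    with np.load(npz_path) as archive:
        X = archive["X"]
        y = archive["y"]
```
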
- class anomed_challenge.challenge.NumpyDataset
Bases:
abc.ABC
An abstract class defining a basic interface for handling datasets, based on Numpy arrays.
It is intended for the common case of supervised learning, where you have feature arrays and target arrays. Inheriting classes only have to provide a get method. Default implementations for __eq__, __repr__ and __str__ are given.
- abstract get() tuple[numpy.ndarray, numpy.ndarray]
An accessor to the feature array and the target array.
- Returns:
(X, y) – X is the feature array, y is the target array.
- Return type:
tuple[np.ndarray, np.ndarray]
- __eq__(other) bool
- __repr__() str
- __str__() str
- property shapes: list[tuple[int, ...]]
- property dtypes: list[numpy.dtype]
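The interface above can be mirrored in a few lines to show what an inheriting class has to supply. This is a sketch only: `DatasetSketch` and `ConstantDataset` are hypothetical classes, and deriving `shapes` and `dtypes` from the return value of `get` is an assumption about how such properties could be implemented, not a statement about the library's code.

```python
from abc import ABC, abstractmethod

import numpy as np


class DatasetSketch(ABC):
    """Hypothetical mirror of the interface; subclasses only define get()."""

    @abstractmethod
    def get(self) -> tuple[np.ndarray, np.ndarray]:
        """Return (X, y), the feature array and the target array."""

    @property
    def shapes(self) -> list[tuple[int, ...]]:
        # Assumption: one shape per array returned by get(), features first.
        return [arr.shape for arr in self.get()]

    @property
    def dtypes(self) -> list[np.dtype]:
        # Assumption: one dtype per array returned by get(), features first.
        return [arr.dtype for arr in self.get()]


class ConstantDataset(DatasetSketch):
    """A toy dataset of five all-zero samples with two features each."""

    def get(self) -> tuple[np.ndarray, np.ndarray]:
        return np.zeros((5, 2), dtype=np.float32), np.zeros(5, dtype=np.int64)


ds = ConstantDataset()
```
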
- anomed_challenge.challenge.strict_binary_accuracy(prediction: numpy.ndarray, ground_truth: numpy.ndarray) dict[str, float]
Calculate the ‘strict’ binary accuracy of the prediction with respect to the ground truth.
By strict accuracy, we mean the number of indices i, with 0 <= i < len(prediction), for which
`prediction[i] == ground_truth[i]`
is True, divided by len(prediction). Note that this is not the same as sum(prediction == ground_truth), which is more forgiving for higher-dimensional arrays (for one-dimensional arrays, it is equivalent though).
- Parameters:
prediction (np.ndarray) – An estimator’s prediction. Should have the same shape and dtype as ground_truth and should not be empty.
ground_truth (np.ndarray) – The respective ground truth. Should have the same shape and dtype as prediction and should not be empty.
- Returns:
A dictionary with key accuracy and a strict accuracy value.
- Return type:
dict[str, float]
- Raises:
ValueError – If prediction or ground_truth is empty, or if their shapes or dtypes do not match.
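The difference between strict and element-wise comparison shows up only for arrays with more than one dimension. The sketch below is one possible reading of the definition above, in which an entry counts as correct only if all of its components match. `strict_accuracy_sketch` is a hypothetical name and not the library's implementation.

```python
import numpy as np


def strict_accuracy_sketch(
    prediction: np.ndarray, ground_truth: np.ndarray
) -> dict[str, float]:
    """Fraction of entries along the first axis that match completely."""
    if prediction.size == 0 or ground_truth.size == 0:
        raise ValueError("prediction and ground_truth must not be empty")
    if prediction.shape != ground_truth.shape:
        raise ValueError("shapes do not match")
    if prediction.dtype != ground_truth.dtype:
        raise ValueError("dtypes do not match")

    matches = prediction == ground_truth
    if matches.ndim > 1:
        # An entry is correct only if every one of its components is correct.
        matches = matches.all(axis=tuple(range(1, matches.ndim)))
    return {"accuracy": float(matches.mean())}


prediction = np.array([[1, 0], [1, 1]])
ground_truth = np.array([[1, 0], [0, 1]])

strict = strict_accuracy_sketch(prediction, ground_truth)
elementwise_fraction = float((prediction == ground_truth).mean())
```

Here the second row differs in one component, so only one of two rows matches completely (strict accuracy 0.5), although three of four individual components agree (element-wise fraction 0.75).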
- class anomed_challenge.challenge.SupervisedLearningMIAChallenge(training_data: anomed_challenge.challenge.NumpyDataset, tuning_data: anomed_challenge.challenge.NumpyDataset, validation_data: anomed_challenge.challenge.NumpyDataset, anonymizer_evaluator: Callable[[numpy.ndarray, numpy.ndarray], dict[str, float]], MIA_evaluator: Callable[[numpy.ndarray, numpy.ndarray], dict[str, float]], MIA_evaluation_dataset_length: int, seed: int | None = None)
This class represents supervised learning ML challenges, where the threat model is concerned with membership inference attacks (MIA).
Instances bundle training, tuning and validation datasets for anonymizers (privacy-preserving ML models); member, non-member and evaluation datasets for deanonymizers (attacks on anonymizers); and also means of evaluating anonymizers and deanonymizers. If you are able to define your challenge obeying this interface, you will be able to transform it into an AnoMed-compatible web server “for free”, by using anomed_challenge.supervised_learning_MIA_challenge_server_factory.
Initialization
- Parameters:
training_data (NumpyDataset) – The dataset offered to peers, whose goal is to train anonymizers.
tuning_data (NumpyDataset) – The dataset offered to peers, whose goal is to tune their anonymizers which are currently in training (this is also called “validation data” outside of the medical ML community).
validation_data (NumpyDataset) – The dataset used to validate (in the regulatory sense) the performance of a fully trained anonymizer. Not to be confused with what is called “validation data” outside of the medical ML field – for that we use the term “tuning data”.
anonymizer_evaluator (Callable[[np.ndarray, np.ndarray], dict[str, float]]) – A way to evaluate the prediction (first argument) of an anonymizer compared to the ground truth (second argument, will be substituted by the target array (y) of validation_data).
MIA_evaluator (Callable[[np.ndarray, np.ndarray], dict[str, float]]) – A way to evaluate the prediction of a membership inference attack (first argument) compared to the ground truth memberships (second argument, defined by the second component of MIA_evaluation_data(...)).
MIA_evaluation_dataset_length (int, optional) – The number of members and also the number of non-members to include in the return value of MIA_evaluation_data. In total, the length of that dataset will therefore be 2 * MIA_evaluation_dataset_length. We suggest setting this to at least 100.
seed (int | None, optional) – If given, use this seed to create members and non_members. By default None, which means randomness is obtained non-deterministically at runtime.
- property members: anomed_challenge.challenge.NumpyDataset
A dataset to train membership inference attacks. It consists of features and corresponding targets that the model under attack has seen during training (therefore “members”), i.e. the data is a subset of
training_data.
- property non_members: anomed_challenge.challenge.NumpyDataset
A dataset to train membership inference attacks. It consists of features and corresponding targets that the model under attack has not seen during training (therefore “non_members”), i.e. the data is a subset of
validation_data.
- MIA_evaluation_data(anonymizer: str, deanonymizer: str, data_split: Literal[tuning, validation]) tuple[anomed_challenge.challenge.NumpyDataset, numpy.ndarray]
A dataset and corresponding memberships to evaluate the success of a membership inference attack.
The dataset is individual for the specific combination of anonymizer, deanonymizer and the setting of data_split (at least with high probability). Its size is determined by 2 * MIA_evaluation_dataset_length (provided at initialization).
- Parameters:
anonymizer (str) – The identifier of the anonymizer being under attack.
deanonymizer (str) – The identifier of the membership inference attack.
data_split (Literal["tuning", "validation"]) – Which ground truth to use for reference. There is one dataset determined to be used once for validation and another, disjoint dataset determined to be used multiple times for tuning.
- Returns:
(dataset, memberships) – A dataset of the same width and dtype as the training, tuning and validation data (although not of the same length/height) and a boolean array with corresponding memberships (ground truth). If memberships[i] is True, dataset[i] == (X[i], y[i]) is a member, i.e. part of the training dataset. If memberships[i] is False, it is not a member, i.e. part of the validation dataset.
- Return type:
tuple[NumpyDataset, np.ndarray]
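The shape of this return value can be sketched on plain arrays. The code below is not the library's sampling procedure: `mia_evaluation_data_sketch` and all of its parameters are hypothetical, and drawing without replacement followed by a shuffle is an assumption. It only illustrates a dataset of length 2 * n with a matching boolean membership mask, where members come from the training data and non-members from the validation data.

```python
import numpy as np


def mia_evaluation_data_sketch(X_train, y_train, X_val, y_val, n, seed=None):
    """Draw n members and n non-members and return them with a mask."""
    rng = np.random.default_rng(seed)

    # Assumption: members and non-members are drawn without replacement.
    member_idx = rng.choice(len(X_train), size=n, replace=False)
    non_member_idx = rng.choice(len(X_val), size=n, replace=False)

    X = np.concatenate([X_train[member_idx], X_val[non_member_idx]])
    y = np.concatenate([y_train[member_idx], y_val[non_member_idx]])
    memberships = np.concatenate(
        [np.ones(n, dtype=bool), np.zeros(n, dtype=bool)]
    )

    # Shuffle so that the position of a sample does not reveal its membership.
    order = rng.permutation(2 * n)
    return (X[order], y[order]), memberships[order]


X_train, y_train = np.arange(20).reshape(10, 2), np.zeros(10, dtype=int)
X_val, y_val = -np.arange(20).reshape(10, 2), np.ones(10, dtype=int)

(X_eval, y_eval), memberships = mia_evaluation_data_sketch(
    X_train, y_train, X_val, y_val, n=3, seed=42
)
```
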
- evaluate_anonymizer(prediction: numpy.ndarray, ground_truth: numpy.ndarray) dict[str, float]
Evaluate an anonymizer, governed by anonymizer_evaluator provided at initialization.
- Parameters:
prediction (np.ndarray) – The prediction of the anonymizer being under evaluation, i.e. an array of the same dtype and shape as the target array of the validation dataset.
ground_truth (np.ndarray) – The corresponding ground truth (target array of the validation dataset).
- Returns:
A dictionary of evaluation metrics, depending on the return value of anonymizer_evaluator.
- Return type:
dict[str, float]
- evaluate_membership_inference_attack(prediction: numpy.ndarray, ground_truth: numpy.ndarray) dict[str, float]
Evaluate a membership inference attack, governed by MIA_evaluator provided at initialization.
- Parameters:
prediction (np.ndarray) – The prediction of the membership inference attack being under evaluation, i.e. a boolean array of size 2 * MIA_evaluation_dataset_length.
ground_truth (np.ndarray) – The corresponding ground truth (second component of MIA_evaluation_data(...)).
- Returns:
A dictionary of evaluation metrics, depending on the return value of MIA_evaluator.
- Return type:
dict[str, float]
- class anomed_challenge.challenge.TabularDataReconstructionChallenge(leaky_data: pandas.DataFrame, background_knowledge: pandas.DataFrame, utility_evaluator: Callable[[pandas.DataFrame, anomed_challenge.challenge.AnonymizationScheme, pandas.DataFrame], dict[str, float]], privacy_evaluator: Callable[[pandas.DataFrame, pandas.DataFrame], dict[str, float]])
This class represents challenges that aim to protect tabular data via anonymization, while maintaining certain utility. The threat model involves attackers that aim to reconstruct the tabular data from the anonymized data plus background knowledge.
Initialization
- Parameters:
leaky_data (pd.DataFrame) – The original tabular data which should be protected by anonymization.
background_knowledge (pd.DataFrame) – Background knowledge that is accessible by attackers.
utility_evaluator (Callable[[pd.DataFrame, AnonymizationScheme, pd.DataFrame], dict[str, float]]) –
How to evaluate the utility of anonymized data, depending on the kind of anonymization. The first argument is the anonymized data, the second argument is the anonymization scheme and the third argument is the original/leaky data for comparison.
This function parameter is used when evaluate_utility is called.
privacy_evaluator (Callable[[pd.DataFrame, pd.DataFrame], dict[str, float]]) –
How to evaluate the quality of reconstructed data, i.e. the success of the attacker. This is also an indirect measure of how privacy-preserving the anonymized data (as an attack target) has been. The first argument is the reconstructed data, the second argument is the leaky data, which the attack target’s anonymized data depends on.
This function parameter is used when evaluate_privacy is called.
- evaluate_utility(anonymized_data: pandas.DataFrame, anonymization_scheme: anomed_challenge.challenge.AnonymizationScheme, leaky_data: pandas.DataFrame) dict[str, float]
- evaluate_privacy(reconstructed_data: pandas.DataFrame, leaky_data: pandas.DataFrame) dict[str, float]
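A callable of the shape expected for privacy_evaluator could look like the sketch below. The metric itself (the fraction of table cells the attacker reconstructed exactly) and the names `cellwise_match_rate` and `match_rate` are assumptions chosen for illustration. The library does not prescribe a particular metric, only the signature of two data frames in and a dictionary of floats out.

```python
import pandas as pd


def cellwise_match_rate(
    reconstructed_data: pd.DataFrame, leaky_data: pd.DataFrame
) -> dict[str, float]:
    """Fraction of cells in which the reconstruction equals the original."""
    if reconstructed_data.shape != leaky_data.shape:
        raise ValueError("reconstructed_data and leaky_data differ in shape")
    # Compare position by position, ignoring index and column labels.
    matches = reconstructed_data.to_numpy() == leaky_data.to_numpy()
    return {"match_rate": float(matches.mean())}


leaky = pd.DataFrame({"age": [30, 40], "sex": ["f", "m"]})
reconstructed = pd.DataFrame({"age": [30, 41], "sex": ["f", "m"]})

privacy_metrics = cellwise_match_rate(reconstructed, leaky)
```

A higher value means a more successful attacker and therefore weaker protection by the anonymized data; in this toy example three of four cells were recovered.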