Game theory, which predicts how the behavior of one participant affects the choices of others, can help researchers determine the best way to share biomedical data while protecting the anonymity of the people who provide it.
Modern biomedical research projects, such as the National COVID Cohort Collaborative and the Personal Genome Project, require large amounts of individual-level data. Making detailed data sets publicly available without violating anyone’s privacy is a major challenge for such projects.
To that end, many programs that collect and disseminate genomic data mask the personal information that could be used to re-identify subjects. Even so, the data that remains can sometimes be linked with information from other sources to reveal a subject’s identity. For example, comparing someone’s DNA data to a public genealogical database like Ancestry.com can sometimes yield the person’s last name, which can be combined with demographic details to track down the person’s identity through an online public records search engine like PeopleFinders.
Our research group, the Center for Genetic Privacy and Identity in Community Settings, has developed several approaches to assess and mitigate the privacy risks of biomedical data sharing. These approaches can be used to protect various types of data, such as personal demographics or genome sequences, from attacks on anonymity.
Our recent work uses a two-player leader-follower game to model the interaction between a data subject and a potentially malicious data user. In this model, the data subject moves first, deciding which data to share. The adversary moves second, deciding whether to attack based on the data that was shared.
Our game-theoretic approach evaluates each data-sharing strategy with two scores: how well it protects privacy and how much value the shared data retains. Every policy involves a trade-off between omitting or masking parts of the data to protect identities and keeping the data as useful as possible.
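To make the structure concrete, here is a minimal sketch in Python of a leader-follower game over four data fields. All field names, risks, and payoff numbers are invented for illustration; they are not the parameters or model from our study.

```python
from itertools import combinations

# Hypothetical fields, risks, and payoffs -- illustration only.
FIELDS = ["birth_year", "zip_code", "sex", "genome_snps"]
RISK = {"birth_year": 0.1, "zip_code": 0.2, "sex": 0.05, "genome_snps": 0.4}
UTILITY_PER_FIELD = 1.0   # public value of each shared field
PRIVACY_LOSS = 10.0       # subject's loss if re-identified
ATTACK_COST = 1.0         # adversary's cost to mount an attack

def reid_prob(shared):
    # Toy assumption: each field independently raises re-identification odds.
    p_safe = 1.0
    for f in shared:
        p_safe *= 1.0 - RISK[f]
    return 1.0 - p_safe

def attacker_attacks(shared):
    # Follower's move: attack only if expected gain exceeds the cost.
    expected_gain = reid_prob(shared) * UTILITY_PER_FIELD * len(shared)
    return expected_gain > ATTACK_COST

def subject_payoff(shared):
    # Leader's payoff: data utility, minus expected privacy loss if attacked.
    payoff = UTILITY_PER_FIELD * len(shared)
    if attacker_attacks(shared):
        payoff -= reid_prob(shared) * PRIVACY_LOSS
    return payoff

# The leader anticipates the follower's best response and picks the policy
# that maximizes its own payoff.
all_policies = [frozenset(c) for r in range(len(FIELDS) + 1)
                for c in combinations(FIELDS, r)]
best_policy = max(all_policies, key=subject_payoff)
```

With these made-up numbers, the best policy shares the three demographic fields but withholds the genomic data: sharing everything would make an attack worthwhile for the adversary, while the smaller release keeps the attack unprofitable.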
The optimal policy allows data subjects to share the most data with the least risk. However, finding the optimal strategy is challenging because genome sequencing data has many dimensions, making an exhaustive search of all possible data-sharing strategies impractical.
To overcome this problem, we developed search algorithms that focus on the small subset of policies most likely to contain the best ones. We demonstrated that our approach is effective at balancing the utility of the data to the public against the privacy of the data subjects.
Why it matters
The worst-case scenario is an attacker with unlimited power and no sensitivity to financial loss, which is highly unlikely in practice. Yet data managers sometimes fixate on such scenarios, which can lead them to overestimate the risk of re-identification and share far less data than they safely could.
The goal of our work is to create a systematic way to reason about risk that also accounts for the value of sharing data. Our game-based approach not only yields a more realistic estimate of re-identification risk, but also identifies data-sharing strategies that strike the right balance between utility and privacy.
What other research is being done
Data managers commonly use encryption to protect biomedical data. Other methods include adding noise to the data and hiding parts of it.
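Noise addition is often done in the style of differential privacy. Here is a sketch, not drawn from our study, of the standard Laplace mechanism applied to a simple count query with sensitivity 1; the cohort data is hypothetical.

```python
import math
import random

random.seed(7)

def laplace(scale):
    """Sample Laplace(0, scale) noise by inverse-transform sampling."""
    u = random.random() - 0.5          # uniform in [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(records, predicate, epsilon):
    """Release a count perturbed with Laplace(1/epsilon) noise, the standard
    mechanism for a sensitivity-1 query under epsilon-differential privacy."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace(1.0 / epsilon)

# Hypothetical cohort: how many subjects carry a particular variant?
cohort = [{"has_variant": i % 3 == 0} for i in range(300)]
released = noisy_count(cohort, lambda r: r["has_variant"], epsilon=0.5)
```

A smaller epsilon adds more noise and therefore more privacy, at the cost of a less accurate released count.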
This work builds on our earlier research, which pioneered the use of game theory to assess the risk of re-identification in health data and to prevent identity attacks on genomic data. Our current study is the first to consider an attacker who can access multiple resources and combine them in a stepwise fashion.
We are now working to extend our game-based approach to account for uncertainty about the players and their degree of rationality. We are also exploring settings with multiple data providers and multiple types of data recipients.