Yaniv Erlich’s lab at MIT has a new paper out in Science today, with a companion policy piece from the National Human Genome Research Institute at NIH. Apologies for the icky paywalls, but these are important papers.
The gist is that a savvy computational scientist can find enough breadcrumbs in a genome to figure out the surnames of participants in supposedly de-identified studies. The method is reminiscent of the re-identification attacks on the Netflix database, the AOL search database, and the Massachusetts health records database: cross-reference the de-identified information with other publicly available information, then link the result to records that carry surnames.
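To make the general idea concrete (this is a toy sketch of a linkage attack, not the paper’s actual pipeline, and every name and field below is invented): an “anonymous” table and a public table are joined on the quasi-identifiers they share, and the names come along for the ride.

```python
# Toy linkage attack: re-identify "anonymous" records by joining on
# quasi-identifiers shared with a public dataset. All data is made up.

deidentified = [  # e.g. a released research dataset, names stripped
    {"zip": "02139", "birth_year": 1975, "sex": "M", "genome_id": "G-17"},
    {"zip": "10027", "birth_year": 1982, "sex": "F", "genome_id": "G-42"},
]

public_records = [  # e.g. a voter roll or genealogy site, names attached
    {"zip": "02139", "birth_year": 1975, "sex": "M", "name": "J. Doe"},
    {"zip": "94110", "birth_year": 1990, "sex": "F", "name": "A. Smith"},
]

def link(anon_rows, named_rows, keys=("zip", "birth_year", "sex")):
    """Match rows whose quasi-identifiers agree on every key."""
    index = {}
    for row in named_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    hits = []
    for row in anon_rows:
        for match in index.get(tuple(row[k] for k in keys), []):
            hits.append((row["genome_id"], match["name"]))
    return hits

print(link(deidentified, public_records))  # [('G-17', 'J. Doe')]
```

The unsettling part is how little the attacker needs: no single field identifies anyone, but the combination often does.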
And it gets bigger - because short tandem repeats on the Y chromosome pass down the paternal line much as surnames do, the 135,000 or so records may potentially identify millions more people. I’ve probably mangled the approach. The short version is perhaps better summed up by a picture:
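Crudely, the Y-STR step is a fuzzy lookup: compare the repeat counts in a genome against a public genealogy database keyed by surname, and take the nearest family line. The sketch below is purely illustrative - the markers, repeat counts, and surnames are invented, and real inference weighs many more markers and their mutation rates.

```python
# Toy Y-STR surname lookup. Y-chromosome short tandem repeats are
# inherited down the paternal line, much like surnames, so a close
# haplotype match in a public genealogy database suggests a surname.
# All marker values and surnames here are made up.

genealogy_db = {  # surname -> repeat counts at a few hypothetical markers
    "Venter": (14, 12, 28, 10),
    "Watson": (15, 13, 29, 11),
    "Quayle": (13, 12, 30, 10),
}

def closest_surname(haplotype, db, max_mismatches=1):
    """Return the surname whose haplotype differs at the fewest markers,
    or None if every candidate exceeds the mismatch budget."""
    best_name, best_dist = None, max_mismatches + 1
    for name, ref in db.items():
        dist = sum(a != b for a, b in zip(haplotype, ref))
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name

# An "anonymous" genome whose Y-STRs nearly match one family line:
print(closest_surname((14, 12, 28, 11), genealogy_db))  # Venter
```

This is also why the numbers multiply: one public haplotype doesn’t just expose its donor, it exposes every male relative sharing that paternal line.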
The dude in red is mathematics. The dude in white is your anonymity.
These two papers force us to be honest in talking about genomic information and identifiability: the basic tools to re-identify significant portions of people from their genomes alone are already in place. And those tools can already lead to surnames, long before we hit the mythical $100 genome.
So what does honesty in this space mean? It means we shouldn’t promise people that we can both de-identify data and keep it useful. It means we should also celebrate the benefits that de-identification brings, and think of it in a risk-reward context for people joining studies that publish genomes online. It doesn’t mean stop sharing, or stop sequencing. It means stop pretending that the methods for de-identification work very well. Some people will walk away once they understand that, but a lot will still share.
Both papers reject the idea of ceasing data sharing as a result of the research, which is heartening. We live in a world where we are simply less anonymous than we’ve ever been. There are enough unique things about each of us, enough devices capturing them, and powerful enough algorithms, that this stuff is simply doable now.
We need to develop a whole spectrum of ways to manage privacy. My own work on consent is just a piece of a tapestry, for those who really want to donate, to share, to be exposed. Hopefully this opens up more space at the table for the new approaches that are bubbling in health privacy management. We need data markets. Data banks (you know, like old-school community banks). Data conservation trusts (like land trusts - I’m going to publish something soon on this topic). We need entrepreneurs to fill the gaps between all open and all closed, to provide products that make someone’s data alive to her, not just to a gearhead with a taste for naive Bayesian inference.
Pandora’s box didn’t open with this paper. It’s been sitting open for quite a while now, just waiting for the right eyes to see it.