YODA yada yada
Yale University published a draft “Open Data Access policy” on November 15th to make available rhBMP-2 clinical trial data that they created in conjunction with Medtronic.
Let me get this out of the way before I go negative: the fact that this policy exists, is on the web, and is open to comment is an unalloyed Good Thing. It’s progress. It’s to be commended.
For those who aren’t initiates in the arcane world of open data or pharmaceutical testing, this is data related to a clinical trial of a kind of biological chemical found in our bodies that plays a role in bone development. Most of the time, this kind of data is collected once, then locked away, never to be seen or used again.
One of the reasons drugs cost so much is that we never re-use the data from clinical trials, so we never learn anything from failures, or from secondary uses of data. It’s an incredibly inefficient system. This project at Yale is an attempt to address that inefficiency by making the data available.
But that’s the extent of my nice words. What follows is a point-by-point review of the policy. The short version: this is not an “open data” policy. It’s a data access policy, and if they’re not going to fix it, they need to rename it. Those of us who care about the definition of “open data” actually meaning something are going to criticize, persistently and loudly, any attempt to claim the title for data under this policy.
I apologize in advance for the length of this rant, but it’s a long policy. Also, they’re using the name of Yoda in vain.

I. Decision: There Will Be a Data Registration Process For Data Access
OK, I can live with this. We do this at Sage Bionetworks with Synapse, for example. Whether or not it’s a good thing depends on what terms and conditions researchers are forced to accept as part of registration, and who is allowed to register. Which brings us to the next point.
II. Decision: Registration For Data Access Will Include Investigator Disclosure And Submission Of A Study Proposal
This is not great. You’ll need all of the below to even apply for registration:
· Principal investigator’s: name, degree(s), SCOPUS ID, primary employer, and contact information, including phone, mailing address, and email address.
· Name, degree, and SCOPUS ID of other key investigative team members.
· Funding source and conflict of interest statement (using a modified version of the ICMJE disclosure form) for all team members.
· Certification of IRB approval (or waiver) from academic/university partner [see section III].
· Project specific aims, main and secondary outcome of interest and analysis plan [an example proposal will be available on the YODA project web site], and timeline.

Huh. I guess this would basically rule out…well, every data scientist in the world that doesn’t have a PhD and a SCOPUS ID. Which is pretty much the vast majority of them. It’s a lockout policy intended to limit liability and contain research to the stuff Yale and Medtronic think is ok.
The only acceptable part of this is that they will publish the registration applications – so we’ll know what they’re turning down, I suppose. But that can also create a disincentive to even apply, as it means that you’ll have told the world the questions you wish to ask before you’ve even had a chance to ask them.
This decision also basically guarantees the data will never be integrated with other data, because then these requirements would have to be syndicated over to all the other data. So trials about similar diseases, genetic networks including rhBMP-2, computational networks? Segregated by this decision, forever.
This decision further creates the possibility of catastrophic success. Should this data actually become essential, the transaction costs of reviewing applications will skyrocket. I’m doubtful that the faculty reviewers involved will enjoy spending their time looking at incrementally different applications to access the data rather than doing, yaknow, novel research that helps them get more funding.
III. Decision: Registration For Data Access Will Require At Least One Key Investigative Team Member From An Academic/University-Based Partner

The justification for this is that “This requirement strengthens the likelihood that the data requester (and eventual user) will have the scientific capability to conduct the proposed analysis” and securely store the data.
Based on my experience, the odds that someone can use data are not strongly correlated with academic credentials. For every whipsmart data scientist in academia there are a dozen more with strong chops, secure Amazon web systems, and killer Bayesian models outside academia.
Imagine if Nate Silver had been required to go through this kind of process to access the polls? Only mathematicians with a political partner need apply! I’m sure all the pundits who called him a wizard would have loved to sponsor his application.
IV. Decision: Requests For Data Access Will Be Reviewed For ‘Completeness’ and to Ensure that Data Use Limitations are Met
This is an unnecessary bit of overhead, but if you’re going to make people file applications, and make faculty members review them, then this makes sense. Might as well make the review process as complete (not to mention time-consuming) for your own faculty as possible.
V. Decision: Limitations Will Be Placed On Data Use

“Data requestors will be required to certify that they will meet the following expectations:
· The data will be used to create generalizable scientific knowledge.
· The findings derived from analysis of the data will only be publicly disseminated through the peer-reviewed biomedical literature or a scientific meeting.
· The data will explicitly not be used for commercial purposes or pursuant of litigation.”

What the hell is generalizable scientific knowledge? Where does it stop and start? Who gets to enforce the definition? Are we talking Karl Popper, or Kuhn, or Feyerabend – or Arbesman – here? This is so vague it gives those who would deny a data access request total power to say no under its rubric. An exploratory data scientist simply looking to test a model? Door, locked.
The findings will only be disseminated through the literature and meetings? No social media? No blogging? They won’t be published as new algorithms directly into R clients? Seriously? Utterly myopic view of how knowledge is now communicated.
The litigation thing I’m actually willing to give them. There are too many law firms that would descend on this with fangs bared to start class action lawsuits, at least for now. Until we have social norms and judicial precedent to deal intelligently with clinical data, this kind of constraint might be part of the deal.
“Data will not be released to applicants whose intent is clearly based on commercial or legal purposes.”
OK, so if I keep a blog with Google ads, is that commercial? If I work on cancer as my 20% project at Google? Who decides what “clearly” means? What if I’m an academic sponsored by a Medtronic competitor (this appears, ironically, to be okey-dokey)? Again, broad reasons to say no to those wishing to do exploratory computational research.
VI. Decision: There Will Not Be A Data Use Fee

Yay! Of course, that’s just for a year.

VII. Decision: Medtronic Will Be A Party To The Data Use Agreement, With Authority To Enforce It
Oh, awesome.

Just go read all the terms and conditions and then don’t even bother applying. Anyone who actually goes through this is either going to be someone they already know or a total masochist.
Worse, the data use agreement gives Medtronic the right to snoop into your research. It’s a party to the deal. It has the right to enforce the deal. Sign with care!
“DATA DISTRIBUTION”
Data will be distributed via an “encrypted USB flash drive via FedEx (or similarly secure shipping company)”.

Glad they specified a “secure” shipping company.
Seriously? A USB stick? The thing that gets left everywhere, falls out of pockets, carries viruses into Iranian nuclear facilities, and is used for corporate espionage? After all this detail to secure the data, the actual delivery method is very likely to result in the leak of at least an encrypted copy of the data into the wild (when there is a Wikipedia heading on “USB drive data leakage” you may have a problem with it).
If you’re going to go to all this effort to lock it down, just keep it simple and use a secure cloud service and encryption. At least then you can monitor the download, and there’s not a physical copy floating around for an exhausted postdoc to drop on the floor and a janitor to pick up.
CONCLUSIONS.
Well, I doubt seriously that this is going to change significantly. This reads as if it were written by Medtronic and then a well-meaning committee of scientists attempted to ameliorate its worst excesses. Still, I’m glad that it was created, placed online, and opened to comment.
We have definitions for openness precisely so the words are not used in ways that mislead us. To make sure that when something claims the mantle of “open”, it is indeed open. To create a certain level of quality control in the open world.
There’s nothing wrong with not being open. What’s wrong is claiming to be open when one is in fact not open. It is important that we hold everyone to account against the open definition, and that we call something what it is. Because this is not an open data policy. It’s a Yale Access to Data Agreement – a YADA, not a YODA.
And there’s nothing inherently wrong with that. It’s not bad to make data available under kind of insane terms, because at least it’s better than not making it available at all. But this isn’t open. Not even close.