Make plan for supplementary data package #29

arlin · 2015-10-13T16:10:47Z

We have a data repo to go with this manuscript. Currently it is private because of the private demographic info. Make a list of what we need to share, and make a plan for creating a supplementary data package, or a repository for sharing the required files.

Then write that up as a separate ticket.

rvosa · 2015-10-16T11:55:03Z

So the steps should involve something like:

1. Make it so that the demographics table has foreign keys to the participants table (right now we are repeating names and event IDs rather than person IDs).
2. Come up with some acceptable way of hiding or encrypting people's demographic data. My first inclination was to think that we could simply have foreign keys from the demographic data pointing to a small table that explains what each demographic category is - but an astute puzzler would probably be able to figure out what the categories are just by looking at people's names (e.g. male vs female, 'murican vs forun), so probably the whole table should be encrypted.
3. Decide how to make the cleaned, ready repository available. My vote would be to link the repo with Zenodo so that we can create releases that have resolvable DOIs slapped onto them.
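
As a rough illustration of the first step, here is a minimal Python sketch that swaps the repeated names in the demographics rows for person IDs looked up from the participants table. The column names (`person_id`, `name`, `event_id`, `category`) and the sample rows are assumptions made up for the example; the real files may be laid out differently.

```python
def link_demographics(participants, demographics):
    """Swap the repeated name in each demographics row for a person ID.

    Both arguments are lists of dicts. The keys 'person_id' and 'name'
    are hypothetical column names, not taken from the real tables.
    """
    # Build a name -> person_id lookup; refuse ambiguous (duplicate) names.
    ids_by_name = {}
    for person in participants:
        name = person["name"]
        if name in ids_by_name:
            raise ValueError(f"duplicate participant name: {name!r}")
        ids_by_name[name] = person["person_id"]

    linked = []
    for row in demographics:
        name = row["name"]
        if name not in ids_by_name:
            raise KeyError(f"no participant record for {name!r}")
        # Keep every column except the name, which the key now replaces.
        new_row = {"person_id": ids_by_name[name]}
        new_row.update((k, v) for k, v in row.items() if k != "name")
        linked.append(new_row)
    return linked


# Made-up sample data, for illustration only.
participants = [
    {"person_id": 1, "name": "Participant A"},
    {"person_id": 2, "name": "Participant B"},
]
demographics = [
    {"name": "Participant B", "event_id": "event-1", "category": "c1"},
]
linked = link_demographics(participants, demographics)
print(linked)
```

Note that this only normalizes the tables. As long as the participants table itself is public, a person ID is no more private than a name.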

arlin · 2015-10-16T13:47:29Z

So far as I can see, only demographics.csv has sensitive information. Yes, anonymized demographics data could be hacked. Apparently this is common. I think it is standard practice to have a license for sharing such data where the user agrees not to try to identify persons. As I see it, we can either (1) keep demographics.csv private except under a confidentiality agreement, or (2) anonymize demographics.csv (e.g., remove the names) and share it under a license prohibiting hacking.

Regarding how to make the data public: can we just move all the public data into the public hackathon manuscript repo?

hlapp · 2015-10-18T02:50:28Z

This will be difficult. A license won't work (licenses grant rights that someone would not otherwise have - these data are facts, so licenses don't apply), and agreement to a Data Use Agreement will be difficult to enforce.

I suggest simply not publishing data that we can't openly and publicly deposit. I'm also afraid that any personal data, whether they were previously public or not, will need to go through IRB approval.

rvosa · 2015-10-18T14:06:02Z

So does this mean that we are inviting acrimony from reviewers because we are discussing data that we don't make available? Do we actually need to identify people at all to say something about demographics? I.e. what can't we say if we make the demographics table available without the names or person ID columns? Would that be private enough?

hlapp · 2015-10-18T14:35:39Z

> So does this mean that we are inviting acrimony from reviewers because we are discussing data that we don't make available?

I don't think so. For example, this is the norm more than the exception in many social science fields - for obvious reasons. We just need to be clear about why we are withholding the data that we don't publish, and ideally there'd be a way to get access to them, for example by request and signing a DUA. (Zenodo supports this reasonably well.)

rvosa · 2015-10-18T14:39:52Z

OK, but could we actually make it available anyway after dropping the columns with names? (I recall there are no person IDs in that table anyway, right? Can't check right now.) I don't think we'd lose anything.

hlapp · 2015-10-18T14:44:18Z

Any data that can potentially be re-identified needs to be cleared with the IRB unless we're keeping it under wraps.

rvosa · 2015-10-18T14:47:14Z

Right, but providing a table that lists that at hackathon X there was a Pacific Islander in attendance (say) - does that have the potential for re-identification? More so than the actual names we provide, and which are public data?


hlapp · 2015-10-18T14:54:52Z

Yes, absolutely. If we had had 10,000 participants and 100 of them were Pacific Islanders, the chance of re-identification would be low. But not so with small numbers. Nobody voluntarily or otherwise stated their ethnicity on the public pages.
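
To make the small-numbers point concrete, here is a minimal Python sketch that counts how many rows share each combination of attributes and flags the combinations held by only a few people. The column names (`event_id`, `ethnicity`), the sample rows, and the threshold `k=5` are all assumptions chosen for illustration, not values from our data or from any IRB rule.

```python
from collections import Counter


def small_cells(rows, keys, k=5):
    """Return the attribute combinations shared by fewer than k rows.

    rows: list of dicts; keys: the columns that together could single
    someone out. The threshold k is an arbitrary illustrative choice.
    """
    counts = Counter(tuple(row[key] for key in keys) for row in rows)
    return {cell: n for cell, n in counts.items() if n < k}


# Made-up sample data: one event, two hypothetical categories.
rows = (
    [{"event_id": "X", "ethnicity": "group-1"}] * 6
    + [{"event_id": "X", "ethnicity": "group-2"}]
)
risky = small_cells(rows, ["event_id", "ethnicity"], k=5)
print(risky)
```

Any combination this reports is a cell where dropping the name column alone would probably not be enough, because the attendee lists for each event are public.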

rvosa · 2015-10-18T14:57:21Z

And, consequently, the chances of getting the IRB's permission to publish such a table (sans names) would be low?


arlin · 2015-10-18T15:10:08Z

Rutger, I think this isn't as problematic as you are implying. We just publish all the data files except demographics.csv. It is OK to publish conclusions based on data that are not released due to privacy concerns. This happens all the time in medicine and the social sciences. The people identified in demographics.csv have a legal right to keep that data private, and we need to protect that right. If we don't believe we can protect that right by anonymizing the data, we should not try to anonymize the data. In fields where data are withheld for privacy reasons, researchers who want the data have to sign an agreement to the effect that they will also safeguard the privacy rights of the people identified. This is a two-way agreement based on trust. The originator can refuse to share the data with someone they don't trust to keep the agreement.

rvosa · 2015-10-20T15:34:52Z

Ok, good to know. I simply don't know how this works as I've never published anything that involves human subjects. If the approach you're describing is how it's done, then let's do that.

arlin · 2015-11-04T16:27:40Z

I'm closing this. We have a plan expressed in #41 (get IRB approval), #50 (clean up hip_hack_howto) and #51 (move data files into public repo).

arlin added the ready label Oct 13, 2015

arlin added in progress and removed ready labels Oct 16, 2015

rvosa self-assigned this Nov 4, 2015

arlin closed this as completed Nov 4, 2015

arlin removed the in progress label Nov 4, 2015

hlapp unassigned rvosa Nov 18, 2015
