Make plan for supplementary data package #29

Closed
arlin opened this issue Oct 13, 2015 · 13 comments

Comments

@arlin
Contributor

arlin commented Oct 13, 2015

We have a data repo to go with this manuscript. Currently it is private because of the private demographic info. Make a list of what we need to share, and make a plan for creating a supplementary data package, or a repository for sharing the required files.

Then write that up as a separate ticket.

@arlin arlin added the ready label Oct 13, 2015
@rvosa
Contributor

rvosa commented Oct 16, 2015

So the steps should involve something like:

  • make it so that the demographics table has foreign keys to the participants table (right now we are repeating names and event IDs rather than person IDs)
  • come up with some acceptable way of hiding or encrypting people's demographic data. My first inclination was to think that we could simply have foreign keys from the demographic data pointing to a small table that explains what each demographic category is - but an astute puzzler would probably be able to figure out what the categories are just by looking at people's names (e.g. male vs female, 'murican vs forun) so probably the whole table should be encrypted.
  • decide how to make the cleaned, ready repository available. My vote would be to link the repo with Zenodo so that we can create releases that have resolvable DOIs slapped onto them.
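
A minimal sketch of the first step above (pointing the demographics table at the participants table by person ID instead of repeating names), assuming both tables have been read into lists of dicts, e.g. with `csv.DictReader`. The column names `person_id`, `name`, and `event_id` and the function name `link_demographics` are hypothetical; the actual headers in the data repo may differ.

```python
def link_demographics(participants_rows, demographics_rows):
    """Replace the repeated name in each demographics row with a person ID.

    Column names (person_id, name, event_id) are assumptions for illustration.
    """
    # Look up person IDs by the (name, event ID) pair the demographics table repeats.
    key_to_id = {
        (p["name"], p["event_id"]): p["person_id"] for p in participants_rows
    }
    linked = []
    for row in demographics_rows:
        key = (row["name"], row["event_id"])
        if key not in key_to_id:
            # Fail loudly rather than silently dropping an unmatched row.
            raise KeyError(f"no participant record for {key}")
        # Keep every column except the name, then add the foreign key.
        new_row = {col: val for col, val in row.items() if col != "name"}
        new_row["person_id"] = key_to_id[key]
        linked.append(new_row)
    return linked
```

Note that this only normalizes the table. It does not anonymize it: anyone with the participants table can still join the IDs back to names.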

@arlin arlin added in progress and removed ready labels Oct 16, 2015
@arlin
Contributor Author

arlin commented Oct 16, 2015

So far as I can see, only demographics.csv has sensitive information. Yes, anonymized demographics data could be hacked. Apparently this is common. I think it is standard practice to have a license for sharing such data where the user agrees not to try to identify persons. As I see it, we can either (1) keep demographics.csv private except under a confidentiality agreement, or (2) anonymize demographics.csv (e.g., remove the names) and share it under a license prohibiting hacking.

Regarding how to make the data public: can we just move all the public data into the public hackathon manuscript repo?

@hlapp
Member

hlapp commented Oct 18, 2015

This will be difficult. A license won't work (licenses grant rights that someone would not otherwise have; these data are facts, so licenses don't apply), and a Data Use Agreement will be difficult to enforce.

I suggest we simply not publish data that we can't openly and publicly deposit. I'm also afraid that any personal data, whether they were previously public or not, will need to go through IRB approval.

@rvosa
Contributor

rvosa commented Oct 18, 2015

So does this mean that we are inviting acrimony from reviewers because we are discussing data that we don't make available? Do we actually need to identify people at all to say something about demographics? I.e. what can't we say if we make the demographics table available without the names or person ID columns? Would that be private enough?

@hlapp
Member

hlapp commented Oct 18, 2015

So does this mean that we are inviting acrimony from reviewers because we
are discussing data that we don't make available?

I don't think so. For example, this is the norm more than the exception in many social science fields - for obvious reasons. We just need to be clear why we are not publishing the data that we aren't, and ideally there'd be a way to get access to them, for example by request and signing a DUA. (Zenodo supports this reasonably.)

@rvosa
Contributor

rvosa commented Oct 18, 2015

OK, but could we actually make it available anyway after dropping the columns with names? (I recall there are no person IDs in that table anyway, right? Can't check right now.) I don't think we'd lose anything.

@hlapp
Member

hlapp commented Oct 18, 2015

Any data that can potentially be re-identified needs to be cleared with the IRB unless we're keeping it under wraps.

@rvosa
Contributor

rvosa commented Oct 18, 2015

Right, but providing a table that lists that at hackathon X there was a
Pacific Islander in attendance (say) - does that have the potential for
re-identification? More so than the actual names we provide, and which are
public data?

@hlapp
Member

hlapp commented Oct 18, 2015

Yes, absolutely. If we had 10,000 participants and 100 of them were Pacific Islanders, the chance of re-identification would be low. But not so with small numbers. Nobody stated their ethnicity on the public pages, voluntarily or otherwise.
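
The small-numbers point can be made concrete with a simple cell-size check. This is only an illustrative sketch: it counts how many rows share each combination of values and reports the combinations held by fewer than `k` rows. The function name `small_cells`, the column names, and the threshold `k=5` are assumptions, not anything agreed in this thread, and passing such a check does not by itself show that a table is safe to publish.

```python
from collections import Counter


def small_cells(rows, columns, k=5):
    """Return value combinations over `columns` that fewer than k rows share.

    `rows` is a list of dicts (e.g. from csv.DictReader); `k` is an
    arbitrary illustrative threshold, not an IRB-endorsed value.
    """
    counts = Counter(tuple(row[col] for col in columns) for row in rows)
    return {combo: n for combo, n in counts.items() if n < k}
```

For example, calling `small_cells(rows, ["event_id", "ethnicity"])` on a table with the names already dropped would still flag a hackathon where a single attendee falls in some category, which is exactly the case that makes re-identification easy.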

@rvosa
Contributor

rvosa commented Oct 18, 2015

And, consequently, the chances of getting the IRB's permission to publish
such a table (sans names) would be low?

@arlin
Contributor Author

arlin commented Oct 18, 2015

Rutger, I think this isn't as problematic as you are implying. We just publish all the data files except demographics.csv. It is OK to publish conclusions based on data that are not released due to privacy concerns. This happens all the time in medicine and the social sciences. The people identified in demographics.csv have a legal right to keep those data private, and we need to protect that right. If we don't believe we can protect that right by anonymizing the data, we should not try to anonymize the data. In fields where data are withheld for privacy reasons, researchers who want the data have to enter into an agreement to the effect that they will also safeguard the privacy rights of the people identified. This is a two-way agreement based on trust. The originator can refuse to share the data with someone they don't trust to keep the agreement.
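
As a rough sketch of "publish all the data files except demographics.csv", the copy into a public repo could be scripted so the private file is never picked up by accident. The directory layout (a flat folder of data files) and the function name `copy_public_files` are assumptions for illustration; only the file name demographics.csv comes from this thread.

```python
import shutil
from pathlib import Path

# The one file identified in this thread as sensitive.
PRIVATE_FILES = {"demographics.csv"}


def copy_public_files(src_dir, dest_dir, private=PRIVATE_FILES):
    """Copy every regular file in src_dir to dest_dir, skipping private ones.

    Assumes a flat directory; subdirectories are ignored.
    Returns the sorted list of file names that were copied.
    """
    src, dest = Path(src_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.iterdir()):
        if path.is_file() and path.name not in private:
            shutil.copy2(path, dest / path.name)
            copied.append(path.name)
    return copied
```

Returning the list of copied names makes it easy to review what is about to become public before committing.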

@rvosa
Contributor

rvosa commented Oct 20, 2015

Ok, good to know. I simply don't know how this works as I've never published anything that involves human subjects. If the approach you're describing is how it's done, then let's do that.

@rvosa rvosa self-assigned this Nov 4, 2015
@arlin
Contributor Author

arlin commented Nov 4, 2015

I'm closing this. We have a plan expressed in #41 (get IRB approval), #50 (clean up hip_hack_howto) and #51 (move data files into public repo).

@arlin arlin closed this as completed Nov 4, 2015
@arlin arlin removed the in progress label Nov 4, 2015
@hlapp hlapp unassigned rvosa Nov 18, 2015