Enable datasource-specific fields for pseudonymisation #438
The remove author processor was modified to allow users to choose which fields to address. It allows you to choose any field for a CSV, but with JSON you have to list the possible fields in the processor itself; it is non-trivial to actually update the JSONs by field. I think the most important thing is to ensure data is secure in 4CAT and that researchers are well aware of what is contained in their data before they publish it. It is, unfortunately, trivial in most cases to de-anonymize a post with a URL, an ID, or even the combination of text body and timestamp. And, if only pseudonymise was used, you can use the author hash to find everything else from the author (this is why I added the option to redact information instead of just hashing it).
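To illustrate the hash-vs-redact distinction (a minimal sketch, not 4CAT's actual code; all names here are made up):

```python
import hashlib

def pseudonymise(value, mode="hash", salt="per-dataset-salt"):
    """Replace a sensitive value with a salted hash, or redact it entirely.

    A hash keeps posts by the same author linkable to each other, which is
    exactly the risk described above; redaction removes that link too.
    """
    if mode == "hash":
        return hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return "REDACTED"

post = {"author": "some_user", "body": "hello", "timestamp": 1700000000}
post["author"] = pseudonymise(post["author"], mode="hash")
```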
Thanks for the suggestions. I meant more pseudonymisation when the dataset is created; 4CAT should handle pseudonymisation as a default instead of having to rely on users to go through the (sometimes complicated) hoops of doing so. And not offering the possibility to pseudonymise data from the NDJSON is also a problem. I.e. I'm suggesting that pseudonymisation happens by default at creation time. Another alternative is to run the remove authors processor straight after the dataset is created.
Pseudonymise is already the default option on 4CAT-created datasets, but it looks like the hook for Zeeschuimer works differently than 4CAT-created datasets. The Zeeschuimer hook works by running the remove authors info processor after creation; I guess the 4CAT-created datasets were never updated to use that processor. The existing code searches through the nested dicts for any key containing the provided term (e.g. `author`).
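That approach would look roughly like this (a sketch of the described behaviour, not the actual processor code):

```python
def redact_matching_keys(obj, term="author"):
    """Recursively walk nested dicts/lists and redact the value of any key
    that contains the given term."""
    if isinstance(obj, dict):
        return {
            key: "REDACTED" if term in key else redact_matching_keys(value, term)
            for key, value in obj.items()
        }
    if isinstance(obj, list):
        return [redact_matching_keys(item, term) for item in obj]
    return obj

item = {"author": "x", "notes": [{"reply_author_id": "y", "text": "hi"}]}
print(redact_matching_keys(item))
# {'author': 'REDACTED', 'notes': [{'reply_author_id': 'REDACTED', 'text': 'hi'}]}
```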
One final consideration:
Thanks! I'll try to integrate this soon. And being greedy in what to pseudonymise is definitely the way to go here imo; if relevant non-personal data is mistakenly hashed, we can always specify the fields later.
Do we want to anonymise JSONs after all? Often we only have a vague idea of what the JSON will look like; we assume some part of its structure is guaranteed (the parts we actually map), but the rest may change without notice. Some options:

1. Leave the NDJSON as it is and only pseudonymise the mapped fields.
2. Pseudonymise to a new CSV and discard the original NDJSON.
3. Keep datasource-specific lists of sensitive NDJSON fields to pseudonymise.
For me, ethically and legally speaking, not allowing pseudonymisation of all data that is collected is a real no-go, even if encrypted (though that feature should also be implemented). Just personally speaking, I'm not comfortable with my phone number being stored in a JSON by what should be an "ethics by design" research tool.

Option 2 could indeed work -- I can't think of many realistic scenarios where users would want to retain the NDJSON. Then the pseudonymise option would become something like "Pseudonymise to new CSV and discard original JSON/CSV". But we have more or less already accepted the maintenance cost of checking changes in the NDJSON data with the introduction of the field mapping. Allowing both option 2 and option 3 is also a possibility.
But is it ethics by design to promise anonymised data when we cannot guarantee that it actually is anonymised? There is indeed already a maintenance cost in keeping the field mapping up to date, but for that we only need to understand the parts of the data we actually use; the rest of the object we can ignore.

If we decide we do need to know that, to be able to effectively anonymise it, we now need to understand the full data object rather than just the 'interesting' parts, and proactively check whether the data that gets uploaded from Zeeschuimer has a different structure or has added fields that we should consider. Otherwise we again cannot guarantee that our predefined list of 'anonymisable fields' is actually correct, and its promise of properly anonymising the data cannot be relied on.
I agree that Option 3 is more maintenance (especially when thinking of LinkedIn), but if we don't go that way I'd vote for Option 2 as the default when creating/importing datasets; at least then we can guarantee pseudonymisation. ...which does raise the question of why we decided to save 'raw' NDJSONs in the first place, but that's another story!
(Option 2 would still benefit from datasource-specific sensitive fields btw, but then it can just be a simple list of column names)
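Applied to the CSV, that could be as simple as the following sketch (assuming a per-datasource list of column names; nothing here is the actual 4CAT API):

```python
import csv

SENSITIVE_COLUMNS = ["author", "post_url"]  # hypothetical per-datasource list

def pseudonymise_csv(source_path, destination_path, sensitive=SENSITIVE_COLUMNS):
    """Write a pseudonymised copy of a CSV; per option 2, the original
    file can be discarded afterwards."""
    with open(source_path, newline="") as infile, \
            open(destination_path, "w", newline="") as outfile:
        reader = csv.DictReader(infile)
        writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            for column in sensitive:
                if column in row:
                    row[column] = "REDACTED"
            writer.writerow(row)
```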
Two reasons:
These are good reasons, but:
I definitely think we could and should encrypt the data. Most users do not anonymize their data even with it as the default. I was just discussing this with students at a summer school who did not understand why they did not have usernames; I pointed out that they did not need them for what they were trying to accomplish. They re-collected their data.

Decoupling data collection and mapping allows researchers to decide what part of the data is interesting and relevant to their research. None of the data is important to me by default; I add fields to the mapping as I need them.

I think what is most important is that we provide robust tools for researchers to make the decision about what needs to be removed and anonymized. We should also make recommendations, as we are familiar with the data, and those can be the defaults. To Stijn's point, we really cannot guarantee that we are effectively removing all relevant information. In my opinion, it takes very little information to begin de-pseudonymizing the datasets.
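The decoupling works roughly like this (a sketch of the idea, assuming a `map_item`-style mapping function as in 4CAT's datasources; the field names are invented):

```python
def map_item(item):
    """Map a raw NDJSON item to the flat fields a researcher actually needs.

    Only fields listed here end up in the analysis CSV; more can be added
    to the mapping later if a research question calls for them.
    """
    return {
        "id": item["id"],
        "timestamp": item["timestamp"],
        "body": item.get("text", ""),
        # "author" is deliberately not mapped unless it is actually needed
    }
```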
I mean, we can do our best to guarantee there's no clearly identifiable data by default, and I think we should. Of course, with effort, almost any data can be de-anonymised, but that doesn't mean that 4CAT shouldn't help in facilitating this as the standard option. I don't see the re-collecting of data as a huge problem in case new stuff needs to be added to the mapping.

So if I'm correct, the consensus seems to be option 2, which 1) works with a simple per-datasource list of sensitive column names and 2) discards the original NDJSON once the pseudonymised CSV is written.
I don't think we're agreed on all details, let's discuss this offline...
Datasource scripts should be able to register what CSV/NDJSON fields are sensitive and should be considered when pseudonymising a dataset. Simply hard-coding `author` fields doesn't cut it; e.g. the Tumblr data source has a field called `post_url` that can easily be used for de-anonymisation.
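What that registration could look like, as a hedged sketch (the attribute names and class are hypothetical, not an existing 4CAT interface):

```python
class SearchTumblr:
    """Hypothetical datasource declaring which of its fields are sensitive."""

    # consulted by the pseudonymisation logic instead of matching on 'author'
    sensitive_csv_fields = ["author", "post_url"]
    # dotted paths into the nested NDJSON object (illustrative only)
    sensitive_ndjson_fields = ["blog.name", "post_url"]
```

The pseudonymise processor could then fall back to the generic author-key matching only for datasources that register nothing.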