
Enhancement of DMI-Tcat making it facilitate GDPR-Compliance #362

Open
fredrikjonasson opened this issue Apr 25, 2019 · 13 comments

@fredrikjonasson
I am working with the tool as part of the VirtEU European project https://virteuproject.eu/. We have drawn some conclusions about how DMI-TCAT can be enhanced to facilitate GDPR compliance.

My suggestions focus primarily on three different areas of the tool.

  1. Pseudonymization of the data.
  2. The right to see the data that is stored.
  3. The right to get one's data deleted.

We would like to integrate our solution into the main branch and think it could be of interest to the community.

@ErikBorra
Member

Hi @frederickjansen,

thanks for your suggestions. Of course we would like DMI-TCAT to be compliant with GDPR. Maybe you could elaborate on how you envision this to be facilitated by TCAT?

My first thoughts: point one could be enforced at the level of capture or at the level of export. E.g. hashed user names could be stored or user names could be hashed when an export (data table or network) is requested. User and tweet ids could of course still be used to deanonymize a user. A graphical representation of a mention network, however, would not show personal information while the structure of the graph could still be presented.
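A rough sketch of what export-time hashing could look like (illustrative Python only, since TCAT itself is PHP; the helper and the salt name are assumptions, not actual TCAT code). A keyed hash maps the same user name to the same pseudonym consistently, while the name itself never appears in the export:

```python
import hmac
import hashlib

# Hypothetical sketch: a keyed hash (HMAC) rather than a bare hash, so an
# attacker cannot simply hash known user names and compare them against
# the exported data. The salt would be an instance-specific secret.
EXPORT_SALT = b"instance-specific secret"  # assumed config value

def hash_username(name: str) -> str:
    """Replace a user name with a stable 16-character pseudonym."""
    return hmac.new(EXPORT_SALT, name.lower().encode("utf-8"),
                    hashlib.sha256).hexdigest()[:16]

# Pseudonymize one export row before writing it out:
row = {"from_user_name": "alice", "text": "hello #tag"}
row["from_user_name"] = hash_username("alice")
print(row)
```

Because the hash is deterministic per deployment, the structure of a mention network is preserved even though the names are not readable.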

As for points two and three: how does a user know that his or her data has been captured by a system interfacing with Twitter's API? Should a researcher contact all captured users? I guess the most elegant solution is to not assume that the user knows his data is captured and to not oblige researchers to contact each user. Instead, TCAT could periodically check whether a tweet has been deleted from Twitter and, if it has, remove it from the TCAT database.
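One way such a periodic check could be sketched (illustrative Python; the `lookup` callable stands in for a batched call to Twitter's statuses/lookup endpoint, which accepts up to 100 ids and returns only tweets that still exist):

```python
def find_deleted(stored_ids, lookup, batch_size=100):
    """Return the ids of stored tweets the API no longer returns.

    `lookup` is any callable taking a list of ids and returning the subset
    that still exists on Twitter (e.g. a wrapper around statuses/lookup).
    Ids missing from the response are deleted, protected, or suspended.
    """
    deleted = []
    for i in range(0, len(stored_ids), batch_size):
        batch = stored_ids[i:i + batch_size]
        existing = set(lookup(batch))
        deleted.extend(tid for tid in batch if tid not in existing)
    return deleted

# Example with a stub standing in for the real (rate-limited) API call:
still_on_twitter = {"1", "3"}
fake_lookup = lambda ids: [t for t in ids if t in still_on_twitter]
print(find_deleted(["1", "2", "3", "4"], fake_lookup))  # ['2', '4']
```

A crontab job could feed the tweet ids of each bin through this check in batches and delete the flagged rows, spreading the lookups out to stay under the rate limit.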

What are your thoughts on this?

Best,

Erik

@bernorieder
Member

Hey guys,

Interesting discussion, and I do think that it makes sense to think about how TCAT can be adapted to facilitate GDPR compliance. But I also think that what 'GDPR compliance' means in a research context is not fully established - the regulation specifies a number of exceptions for research in the public interest, but they are pretty vague. I would consider 'overcompliance' to be as much of a risk as 'undercompliance'.

But I would be very interested in seeing the solution you mention, @frederickjansen - and hear more about the conclusions you have drawn and the thought process.

In the end, I think that the mentioned provisions would allow researchers and admins to make these decisions in their particular contexts and that would be great.

cheers,
Bernhard

@magnanim-old

magnanim-old commented Apr 25, 2019

Hi everyone, thank you for the quick reactions and the nice discussion. I am also part of the EU project mentioned above, where I lead the unit working on this task - a task that was added to the project following some requests we got during the intermediate project review.

@fredrikjonasson will share more details about the specific user stories and tasks he has in mind, but as a more general comment, I agree that a tool cannot be GDPR compliant by itself; in many cases a Data Protection Officer will still have to assess the specific case. So we certainly cannot pre-determine which processes must be executed by the system. At the same time, some processes and safeguards are specifically mentioned in the regulation (such as pseudonymization), and if they can be easily enabled by the user performing the data collection, this can greatly simplify many things: from the definition of the data management plan, to its enforcement, to the impact assessment at the institutional level.

The spirit is exactly the one mentioned by @bernorieder, where the researchers will be able to "activate" some processes depending on context and needs.

Regarding @ErikBorra's comment: yes, I agree that the question of if and how to notify users is a difficult one, which also goes beyond purely legal considerations. The GDPR does not provide a clear answer, because the right to notification depends on the effort required to do so, which at this point is a quite vague concept. We have tried some options, from actual notification of all Twitter users involved in the collection - with our account promptly blocked by Twitter... - to regularly posting tweets with the monitored hashtag and a note about the monitoring process (these and a few other more general considerations are here: https://arxiv.org/abs/1903.03196). This last option is something the user collecting data could turn on, but in no way do we believe that there is one specific solution that should be valid for all cases and enforced by the software, or that the software should be over-compliant by design.

@fredrikjonasson
Author

Hi everyone,
Thanks for the interest, discussion, and good input.

My vision is that besides the interfaces for analysis (analysis/index.php) and capture (capture/index.php), there would be an interface for the GDPR options (gdpr/index.php). One thought is to make it reachable from the capture/index.php interface, acting as a step between the capturing interface and the start of the capture process.

As an example, there could be a checkbox saying "Show GDPR options"; checking it would mean that after filling in the information on capture/index.php you are redirected to the GDPR interface, where you fill out certain information depending on which parts you want to activate (for example, a checkbox for having the collected data pseudonymized).

This will make sure that the activation of the GDPR "tools" is clearly voluntary, which is in line with what @bernorieder and @magnanim mentioned above.

The thought of having the tool periodically check whether a tweet has been deleted is certainly interesting, @ErikBorra, and something I will look into more.

Right now my next step will be to start developing. I will try to use some of the existing functions and might modify them a bit (for example the export functions and the tweetqueue class). That way I am striving for a good balance between introduced complexity and code reuse.

P.S. In all good faith, please note that my username (and name) is @fredrikjonasson and not frederickjansen, so that we don't drown anyone innocent in notifications.

All the best,

Fredrik

@fredrikjonasson
Author

fredrikjonasson commented May 2, 2019

Hi again everyone,

as asked by @ErikBorra, I have tried to make my vision for the GDPR-enhanced TCAT more precise.

For the pseudonymization enhancement I envision the following changes to the TCAT tool:

First, add a checkbox to the capture/index.php admin page where you can check the box if you want to use the pseudonymization alternative.

When it comes to the added checkbox and its functionality my suggestions are:

  • Have a table that lists all the pseudonymized bins. When the checkbox is checked, a modified function create_bin() in capture/query_manager.php adds the bin to the list of pseudonymized bins.
  • Then I create a restricted table (the pseudonymization table) in the database that contains the information needed to de-pseudonymize the data residing in the query bin. Every pseudonymized bin will have its own pseudonymization table.
  • The pseudonymization table will have two columns: the original userID and a pseudonymized userID (which will also be the primary key). The table where the tweets are stored (the tweet table) will have the same pseudonymized userID in both the screenname and the userID column.
  • Further, the pseudonymized userID will act as a foreign key to the tweet table, making sure that no tweets are added without being given a pseudonymized identification.
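The mapping table and foreign key described above could look roughly like this (an illustrative SQLite sketch in Python; the table and column names `pseudonym_map`, `tweets`, `pseudo_user_id` are hypothetical, not the actual TCAT schema, and TCAT itself uses MySQL and PHP):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

# Restricted translation table: pseudonymized ID is the primary key.
con.execute("""
    CREATE TABLE pseudonym_map (
        pseudo_user_id   TEXT PRIMARY KEY,
        original_user_id TEXT NOT NULL UNIQUE
    )""")

# Tweet table stores only the pseudonymized ID, constrained by a foreign key.
con.execute("""
    CREATE TABLE tweets (
        tweet_id       INTEGER PRIMARY KEY,
        pseudo_user_id TEXT NOT NULL REFERENCES pseudonym_map(pseudo_user_id),
        text           TEXT
    )""")

con.execute("INSERT INTO pseudonym_map VALUES ('p_01', '12345')")
con.execute("INSERT INTO tweets VALUES (1, 'p_01', 'hello world')")

# The foreign key rejects any tweet whose author has no registered pseudonym:
try:
    con.execute("INSERT INTO tweets VALUES (2, 'p_99', 'orphan tweet')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Keeping the `pseudonym_map` table access-restricted is what makes this pseudonymization in the GDPR sense: the additional information needed for re-identification is stored separately.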

For the functions that do the work, I have the following suggestions:

  • Extend capture/query_bin.php with functionality that checks whether pseudonymization is activated; if that is the case, create the table that keeps track of the pseudonymized bins, if it is not already created.
  • Extend the class tweetqueue in capture/common/functions.php. I will make a modified version of the function insertDB() that maps the identification variables to the newly created pseudonymization table and replaces the table's removed identification variables with the corresponding pseudonymized IDs from the pseudonymization table.
  • Another possible approach I see is to extend the function processtweets() in the same file. The change there would consist of extending the function so it can split up the data with the mapping described above.
  • A third thought is to modify the class tweet, also residing in capture/common/functions.php. The modification I have in mind in that case is an extension of the function fromJSON(), which according to the documentation is used to “Map Twitter API result object to our database table format”. Again the mapping would be something like the above.

Once the pseudonymization is done, my thought is to use the existing functions in the same way the program does right now.

From a legal standpoint, my opinion is that a modification of the tweet class (the third suggestion) would be best, since we then pseudonymize as early as possible (right when we get the tweet from the API).

As you can see, most of my modifications will reside in the files capture/query_manager.php and capture/common/functions.php.

What are your thoughts on this?

@dentoir
Contributor

dentoir commented May 7, 2019

Hi everyone,

First, to chip in on the broader discussion: adding Let's Encrypt support to the auto-installer would also be a great improvement to general data protection. As it stands, using HTTPS is left up to the sysadmin.

Second, @fredrikjonasson, for your pseudonymisation prototype, the easiest approach would probably be to extend the tcat_query_bins table with a new pseudonymisation attribute, which could potentially have multiple values for the different types of pseudonymisation you would like to support. The Twitter Election Integrity dataset uses a mechanism similar to the one described by @ErikBorra.
@fredrikjonasson I'm interested in what you mean by depseudonymisation - a table which maps 'true user name' to 'hashed user name', for example? If such a table exists inside TCAT, wouldn't it make more sense to do pseudonymisation on the analysis side only?

Best,

Emile

@fredrikjonasson
Author

Hi!

Thanks for your input @dentoir. When it comes to your suggestion about extending the tcat_query_bins table, I think you are right; it would probably be the best approach.

By depseudonymisation I mean that the information required to do the mapping will be a table containing the data to translate between the 'true username' and the 'hashed username'. To be able to do the depseudonymisation you will need access to both the table that is pseudonymised and the table that translates 'true username' to 'hashed username'.

To have the data pseudonymised at all times, I still think the best option is to pseudonymise it during the capture phase, as we are inserting it into the database.

I will post more about my work here. Feel free to chime in with input and suggestions.

Best regards,

Fredrik

@ErikBorra
Member

Hi all,

we just discussed this with our team and propose to implement the following:

  1. auto-install will include Let's Encrypt so that DMI-TCAT installations with a proper domain name have SSL support
  2. a script, to be called by crontab, that regularly checks for deleted tweets. For huge bins this will run into the rate limits of the API keys
  3. provide a check-box in the frontend (/analysis) to allow hashing (or other pseudonymization) of user ids and user names in any output. Hashing of user names can happen on the fly in the analysis scripts and will add little overhead.

We do not see the necessity to change anything in the capture scripts or database tables. If we agree that the tweet id should always be legible, that id (or even parts of a tweet text) can easily be used to deanonymize a user. Changing anything in the backend would thus unnecessarily complicate things without yielding any benefit.

Looking forward to hearing your thoughts.

Best,

Erik

@fredrikjonasson
Author

Hi,

I would be glad to develop both of your suggested implementations, with priority on the pseudonymisation part. I still think there should be a check-box on capture/index.php to flag the bin as a pseudonymised bin.

A user shouldn't be able to export a bin marked for pseudonymisation with the info de-pseudonymised from the analysis page without having proper clearance for depseudonymisation. So later on we can think about how to develop a permission system for a specific user.

To sum up: if a user captures a bin and chooses it to be pseudonymised, he or she can export the bin with the hashed identifiers from the analysis page. But if you want to de-pseudonymise the data, you have to enter some credentials etc. (the permission system).

Best regards,

Fredrik

@ErikBorra
Member

ErikBorra commented May 9, 2019

Hi @fredrikjonasson,

thanks for your offer to help implement this!

Could you please explain why you would like to prevent all users from exporting data as it was retrieved from Twitter? Why would they only be allowed to export pseudonymized data?

Our philosophy, in line with what @bernorieder has mentioned in this thread previously, has always been that end-users are the researchers who want to, and should be able to, be in charge of their own research methodology. Similarly, and as @magnanim has mentioned, "the mentioned provisions [should] allow researchers and admins to make these decisions in their particular contexts." In my view, the admin (who is in control of /capture) should thus not oblige researchers who are solely accessing the front-end (/analysis) to follow particular actions. In other words, all researchers should be able to make their own decisions on whether to pseudonymize or not. (Not to forget that if we only pseudonymize usernames, the user ids and tweet ids can still be used to depseudonymize those usernames.)

Best,

Erik

@fredrikjonasson
Author

Hi,

Thanks for the good questions; when it comes to your philosophy, I agree with you.

First, the implementation won't prevent all users from exporting data as it was retrieved from Twitter. What the implementation will do is introduce the possibility for an admin to, only if he or she likes, choose to present the data to the other non-admin users as pseudonymized. The admin will always see the original data without any values pseudonymized.

An example:
An admin user with access to the capture part decides to collect data. Because he or she is careful not to use more personal identification data than necessary, the pseudonymize checkbox is ticked when creating the capture bin.

The program collects tweets as usual. But now the table tcat_query_bins contains an attribute (a column) stating that the bin shall be pseudonymized for non-admin users.

We then have a non-admin user who only has access to the analysis page. Since the admin user didn't feel comfortable exposing personal identification data, the non-admin user can only access the data without the personal identification information; the data is pseudonymized. If a situation occurs where the non-admin user needs or wants access to the pseudonymized part, then he or she will need to take some action involving contact with the admin.

If the admin user doesn't mind exposing personal data, he or she won't check the pseudonymize box and everything will work as it does right now, without the extension.

When it comes to the information that I think should be pseudonymized, it is not only the username. After a check of the data the tool is capturing, my preliminary plan is to pseudonymize the following fields, taken from the class Tweet in capture/common/functions.php:

  • id
  • id_str
  • from_user_name
  • from_user_id
  • from_user_realname
  • to_user_id
  • to_user_name
  • in_reply_to_status_id
  • in_reply_to_status_id_str
  • in_reply_to_screen_name
  • in_reply_to_user_id
  • quoted_status_id
  • retweeted_status
  • retweeted
  • retweet_id
    Plus
  • mentions of a user in a tweet text

The above list is a first try at covering every field that could be viewed as a possible identifying variable. However, I haven't tried the pseudonymization against the database yet, so the list could change by some field(s).

I have done some analysis of how your suggestion regarding pseudonymization on the analysis side could be implemented, and I have some remarks:

  • Pseudonymization at analysis time has an efficiency cost: we have to pseudonymize the data every time we run an analysis.
  • There is also a matter of consistency. Since we can't store the translation between the pseudonymized values and the real ones, we have to use some kind of hash function to map the values we want to pseudonymize. Because the hashing has to be done every time we use the analysis tool, there is a small risk of two users being given the same hashed ID. With hashing there is also an efficiency penalty.

As an example:
The admin user has captured data with identification values pseudonymized. When user1, who only has access to the analysis part, exports the data, he or she gets it in pseudonymized form, as described in the example above. user1 exports a selection of tweets.

The program now runs a hash function that maps the values to be pseudonymized and replaces the identifying values with the hashed values.

user1 gets the data, and when he or she wants another piece of data, another export takes place.

The program again runs the hash function that pseudonymizes the values as above.

user1 gets the data in pseudonymized form.

Suggestion
As a solution to the efficiency and consistency issues, my recommendation is that we store a table in the database that holds the translations between the already pseudonymized values and the real ones. We could then do a lookup when analysing to see whether the value we want to pseudonymize already has an established pseudonymization ID.
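The lookup idea can be sketched roughly as follows (illustrative Python, not TCAT's actual PHP; the function and table names are hypothetical, and a dict stands in for the database translation table). A pseudonym is generated once, stored, and reused, so repeated exports stay consistent without rehashing:

```python
import hmac
import hashlib

# Assumed instance-specific secret, kept separately from the data
# (in the spirit of GDPR Art. 4(5)).
SECRET = b"server-side secret"

pseudonym_table = {}  # stands in for the stored translation table

def pseudonymize(user_id: str) -> str:
    """Return the stored pseudonym for user_id, creating it on first use."""
    if user_id not in pseudonym_table:
        digest = hmac.new(SECRET, user_id.encode(), hashlib.sha256).hexdigest()
        pseudonym_table[user_id] = "u_" + digest[:12]
    return pseudonym_table[user_id]

# Two exports of the same user yield the same pseudonym:
first = pseudonymize("12345")
second = pseudonymize("12345")
print(first == second)  # True
```

The stored mapping is also what would later support controlled de-pseudonymization: whoever holds access to the translation table can reverse the lookup, while everyone else sees only the `u_…` identifiers.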

I hope this clarified my viewpoint.

Best regards,

Fredrik

@ErikBorra
Member

Hi @fredrikjonasson,

thanks for your elaborate answer.

You state that "What the implementation will do is introduce the possibility for an admin to, only if he or she likes, choose to present the data to the other non-admin users as pseudonymized." I, frankly, still do not see a use case for this. As discussed before, the responsibility for pseudonymization should lie with the researcher (the end-user) and not with the admin. Furthermore, it creates a lot of coding overhead and maintainability issues.

As for the pseudonymization: you can pseudonymize all you want, but a simple text search for a tweet in Twitter's search interface will depseudonymize any user instantly. I think we can agree that it does not make sense to pseudonymize the tweet text itself. As such, again, I do not think we should maintain separate tables for pseudonymization.

As for on-the-fly pseudonymization, I don't suspect the hashing to be very costly (when implemented smartly), and the overhead will be small. As for consistency, a simple hash function may - but most often will not - lead to consistency problems.

Best,

Erik

@magnanim-old

Hi @ErikBorra, I think there might be some terminological misunderstanding; I can see that in our internal discussions we were using terms in a different way - for example, assuming that a researcher may have an admin account, which I now see is not how you are using the term.

I would suggest explicitly using the terms from the GDPR (controller and processor) in the discussion, and also the system's account types (admin and user).

From the GDPR side, "the controller shall ... implement appropriate technical and organisational measures, such as pseudonymisation" (Art. 25.1), and pseudonymisation means that "the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person" (Art. 4.5).

While I agree that the case of publicly available data (such as a tweet, which one can still look up online) is different from other cases, to implement pseudonymization as defined in the GDPR we must enforce some technical and organisational measures, and in our opinion this could be done by having a role in the system that has the responsibility to de-pseudonymize. We chose the admin in order to exploit the current user division in the system without adding one more user type, but of course we are open to other options as long as they are in line with the GDPR.

Do you see any other approaches? We'll be happy to consider alternatives.
