Our first anonymizer source code now available
Please find here the source code of our first anonymizer, tested on the Simuligne forums, which systematically transforms all strings representing personal data or the real names of the participants in a learning situation. This short article presents the principles and provides a link to download the documented source code.
What? Java source code for anonymization: This software (including the Java source code) helps a researcher anonymize a document (a set of mail/chat/forum interactions). The document must be in a specific XML format. This software is no longer maintained.
Who? (Author: E. Gasche): This software was developed in 2006 by the LIUM (Computer Science laboratory of the Université du Maine, France) together with the LIFC (Computer Science laboratory of the Université de Franche-Comté, France) within a national project named « ODIL ». The main scientific contributors are C. Reffay (LIFC) and P. Teutsch (LIUM). The author of the source code released here is Emmanuel Gasche (LIUM, France).
- Whom for? Software coders…
- What for? …to reuse a part or the whole code in order to build new anonymization tools.
What does this anonymizer software actually do?
The following text is assembled from parts of the following communication:
Teutsch P, Piat F, Reffay C. Anonymizing and sharing corpora of online training courses. CSCL’2009 Workshop "Interaction Analysis and Visualization for Asynchronous Communication", Rhodes (Greece), 9-13 June 2009
This work addresses the anonymization process that must be applied to a corpus, for ethical and legal reasons, before it can be shared with a larger research community. This contribution looks for the right boundary: one that preserves the social and cultural context while effectively hiding the identity of each actor in order to protect his or her privacy. The principles and tools presented in this article are applied to a corpus of textual interaction in language learning.
On a theoretical basis, the current models of learning contexts distinguish three kinds of data: those relating to the individual’s identity (first/last name, picture,...), those relating to social characteristics (gender, age, mother tongue,...) and those relating to his/her learner profile (target-language proficiency level and skills, academic profile or history, current situation, ...). Among these data, only the first kind has to be systematically modified, while both of the others may need to remain unchanged for subsequent analyses.
Regarding the individual’s identity, we distinguish the identifying data handled by the training platform on one hand, and on the other hand those used in the messages themselves.
The former refer directly or indirectly to the actors: first/last name, login, id, IP address and so on... All appear as a uniquely-defined character string, easy to automatically search for and replace. This is the case of the name of the author of a message posted in a forum, of the automatic signature of an email, or of the initials preceding the message in a chat.
The latter can be found in the midst of texts produced by the actors themselves: signatures, forms of address, replies or references to one or several other actors. Processing this information is much more complex, given that the names cited inside the messages can be subject to many, and sometimes very different, morphological variations. Indeed, in collaborative, remote on-line training courses, learners usually use nicknames when signing or addressing each other, and it is important for analysts to recognise these. In a language learning context, first and last names can be socially and culturally marked, or they can carry a meaning that is discussed in the interaction.
Searching for and processing the forms of address used for other people, spread throughout all messages, shows that anonymization goes well beyond a purely technical information-processing issue and touches on more semantic issues. Modeling anonymization is not so straightforward. After all identity markers have been defined, we have to choose which techniques to use to find and process them in the corpus. We can then imagine several anonymization strategies:
- Change names into other first and last names, for instance by attributing a masking name, by keeping first names while deleting last names, by harmoniously modifying them, by keeping only initials,... This kind of anonymization aims at making the corpus accessible while maintaining the specific role of each identity.
- Transform the identities into codes directly linked to the characteristics or to the role of the actor (e.g., Tutor, Learner#1, Learner#2, ...). This kind of anonymization focuses on a particular aspect of the corpus and pushes the reader towards a particular interpretation.
- Modify the names and complete them with profile information (mother tongue, for instance).
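The three strategies above can be illustrated with a minimal Java sketch. It is not taken from the released source code: the class name `MaskingStrategies`, the method names, the output formats and the sample names used in the comments are all assumptions made for illustration.

```java
/** Hypothetical sketch of the three masking strategies; formats are illustrative only. */
final class MaskingStrategies {

    private MaskingStrategies() {}

    /** Strategy 1 (one variant): keep only the initials, e.g. "Marie Dupont" becomes "M. D.". */
    static String initials(String firstName, String lastName) {
        return firstName.charAt(0) + ". " + lastName.charAt(0) + ".";
    }

    /**
     * Strategy 2: a code tied to the role of the actor.
     * A rank of 0 or less means the role is unique (e.g. "Tutor");
     * otherwise the rank is appended (e.g. "Learner#2").
     */
    static String roleCode(String role, int rank) {
        return rank <= 0 ? role : role + "#" + rank;
    }

    /** Strategy 3: a masking name completed with profile information (assumed bracket format). */
    static String withProfile(String maskingName, String motherTongue) {
        return maskingName + " [" + motherTongue + "]";
    }
}
```

Each method only builds the replacement string; choosing which strategy fits a given corpus remains the decision of the corpus owner.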
The rest of this text presents the anonymization process used by ViCoDiLi for the Simuligne Corpus. This processing of the corpus is multi-phased and relies on the definition of the identifying data to protect, and on a conversion table that associates each piece of identifying data with the mask substituted for it. Downstream, the anonymized corpus is produced from a list of associations between the original character strings of the identities and their replacement forms.
Upstream, to prepare the conversion table, the corpus owner relies on all the individual data available while taking into account his knowledge of the actors, of the content of interactions, and of the analysis requirements. This process allows the owner to keep the complete profile of the actors, which can be useful for three tasks: restoring at any time the link to some of the characteristics, defining the logic behind the equivalence between the real-life identifying data and the masking names, and if needed defining further equivalences for expressions spotted in the exchanges. The conversion principle implies replacing first and last names, pseudonyms or other nicknames spotted by the operator with new appropriate masking names.
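As a rough illustration of the conversion principle, the sketch below models the conversion table as a map from original forms (names, nicknames, altered forms) to masking names and applies it to a text in a single pass. It is not the released implementation: the class `ConversionTable`, its methods, the case-insensitive whole-word matching and the sample names in the comments are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical conversion table: original identifying forms mapped to masking names. */
final class ConversionTable {

    // Keys are lower-cased original forms so that lookups ignore case.
    private final Map<String, String> maskByForm = new LinkedHashMap<>();

    /** Registers one association, e.g. a nickname and the mask chosen by the operator. */
    void put(String originalForm, String maskingName) {
        maskByForm.put(originalForm.toLowerCase(Locale.ROOT), maskingName);
    }

    /** Replaces every registered form found as a whole word in the text. */
    String apply(String text) {
        if (maskByForm.isEmpty()) {
            return text;
        }
        // Longest forms first, so a full name is tried before a first name alone.
        List<String> forms = new ArrayList<>(maskByForm.keySet());
        forms.sort((a, b) -> Integer.compare(b.length(), a.length()));
        StringBuilder alternation = new StringBuilder();
        for (String form : forms) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(form));
        }
        // The lookarounds reject matches embedded in a longer word or number.
        Pattern pattern = Pattern.compile(
                "(?<![\\p{L}\\p{N}])(?:" + alternation + ")(?![\\p{L}\\p{N}])",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher matcher = pattern.matcher(text);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String mask = maskByForm.get(matcher.group().toLowerCase(Locale.ROOT));
            // Keep the match unchanged if case folding produced an unknown key.
            String replacement = mask != null ? mask : matcher.group();
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

Doing the replacement in one pass avoids a masking name being replaced again when it happens to equal another original form. Morphological variations that the operator has not listed in the table are not detected by this sketch.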
Figure 1. Screenshot of the anonymizer interface: Forms and conversion table.
Figure 1 shows the interface available to the operator in charge of the anonymization, which describes the association between original and modified identity. At first the system displays the list of the actors known from the corpus (the list is extracted from the on-line training platform through an XML file). The user can complete this list, adding nicknames and altered forms found in the corpus.
The system warns the user when duplicates appear in the conversion table. These duplicates can correspond to actual homonyms in the original data; in that case it is recommended to replace their names with the same masking name so as to maintain the original ambiguity. Duplicates can also appear by accident (two identical masking names associated with different original data in the original corpus), in which case the system displays the different forms used so that the operator can check his choices of masking names.
A set of forms accompanies the conversion table between original identities and masking names. Each form contains the real characteristics of an actor of the training course: complete identity, age, location... This information, known only to the owner, can help him choose for the actor a masking name that takes into account some of the characteristics of the actor's profile, such as role, gender, language, culture and so on.
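The two kinds of duplicates described above could be detected along the following lines. This is a sketch under assumptions, not the check performed by the released tool: the `Row` layout (one original form and one masking name per row), the class `DuplicateCheck` and its method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical duplicate check on the rows of a conversion table. */
final class DuplicateCheck {

    /** One row of the table: an original form and the masking name chosen for it. */
    static final class Row {
        final String original;
        final String mask;

        Row(String original, String mask) {
            this.original = original;
            this.mask = mask;
        }
    }

    private DuplicateCheck() {}

    /** Masking names used for several different original forms, with those forms. */
    static Map<String, Set<String>> masksSharedByDifferentOriginals(List<Row> rows) {
        return conflicts(rows, true);
    }

    /** Original forms (e.g. homonyms) that received several different masking names. */
    static Map<String, Set<String>> originalsWithDifferentMasks(List<Row> rows) {
        return conflicts(rows, false);
    }

    private static Map<String, Set<String>> conflicts(List<Row> rows, boolean groupByMask) {
        Map<String, Set<String>> groups = new TreeMap<>();
        for (Row row : rows) {
            String key = groupByMask ? row.mask : row.original;
            String value = groupByMask ? row.original : row.mask;
            groups.computeIfAbsent(key, k -> new TreeSet<>()).add(value);
        }
        // Keep only the keys associated with at least two distinct values.
        groups.values().removeIf(values -> values.size() < 2);
        return groups;
    }
}
```

Both results are meant to be shown to the operator rather than corrected automatically, since only the operator knows whether a shared mask is intended.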
The anonymization process in itself consists of modifying the original corpus (XML file) in two phases: modifying the actors’ identifiers first in the prompts before their messages, then inside the body of all messages. This process alters the XML file’s content while preserving its structure, so that ViCoDiLi can also display the new corpus.
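The two phases can be sketched with the standard Java DOM API. The XML layout used here (a `message` element with an `author` attribute) is purely hypothetical, since the actual corpus format is not described in this text, and the class `TwoPhaseAnonymizer` is likewise an assumption rather than the released code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical two-phase anonymization of an XML corpus; element names are assumed. */
final class TwoPhaseAnonymizer {

    private TwoPhaseAnonymizer() {}

    static String anonymize(String xml, Map<String, String> table) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList messages = doc.getElementsByTagName("message"); // assumed element name

            // Phase 1: identifiers attached to each message (assumed "author" attribute).
            for (int i = 0; i < messages.getLength(); i++) {
                Element message = (Element) messages.item(i);
                String mask = table.get(message.getAttribute("author"));
                if (mask != null) {
                    message.setAttribute("author", mask);
                }
            }

            // Phase 2: names cited inside the body of the messages.
            for (int i = 0; i < messages.getLength(); i++) {
                replaceInTextNodes(messages.item(i), table);
            }

            StringWriter out = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException | TransformerException e) {
            throw new IllegalStateException("Could not process the corpus", e);
        }
    }

    /** Rewrites text nodes only, so that elements and attributes keep their structure. */
    private static void replaceInTextNodes(Node node, Map<String, String> table) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue();
            // Exact, sequential replacement: the map's iteration order matters,
            // so longer forms should come first in the table.
            for (Map.Entry<String, String> entry : table.entrySet()) {
                text = text.replace(entry.getKey(), entry.getValue());
            }
            node.setNodeValue(text);
            return;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            replaceInTextNodes(child, table);
        }
    }
}
```

Because only attribute values and text nodes are rewritten, the element tree is left as it was, which is what allows a viewer expecting the original format to load the anonymized file.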