Purpose & Goals
As academic libraries expand their capacity for and participation in learning analytics (LA) data collection and analysis, the datasets produced by these activities pose increasing ethical and privacy-related risks. Library LA datasets are often presented as “deidentified” once direct identifiers (e.g., name, email address, or student ID) have been removed. However, combinations of demographic information commonly retained in LA datasets can form unique “quasi-identifiers” that allow reidentification of large numbers of the individuals represented. Such quasi-identifiers can therefore render any associated confidential data effectively public, posing a substantial risk to the privacy of research participants and a potential violation of ethical research conduct.
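To make this mechanism concrete, the minimal sketch below (in Python, using hypothetical column names rather than the study’s actual variables) shows how a handful of retained demographic fields can act as a unique key that links a “deidentified” record back to a named individual in an outside source:

```python
import pandas as pd

# Toy "deidentified" LA dataset: direct identifiers removed, but
# demographic quasi-identifiers retained (hypothetical columns).
deidentified = pd.DataFrame({
    "major":          ["Biology", "History", "Biology", "History"],
    "class_year":     [2, 3, 2, 3],
    "gender":         ["F", "M", "F", "F"],
    "ethnicity":      ["Asian", "White", "White", "Hispanic"],
    "library_visits": [41, 3, 17, 28],  # the confidential attribute
})

# A hypothetical public source (e.g., a club roster) pairing a name
# with the same demographic fields.
roster = pd.DataFrame({
    "name": ["J. Doe"], "major": ["History"], "class_year": [3],
    "gender": ["F"], "ethnicity": ["Hispanic"],
})

# Joining on the quasi-identifiers reattaches the name to the
# "deidentified" record, exposing the confidential attribute.
qi = ["major", "class_year", "gender", "ethnicity"]
print(roster.merge(deidentified, on=qi))  # J. Doe -> 28 visits
```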
Design & Methodology
Using the pigeonhole principle, this study evaluated combinations of demographic variables contained in a dataset of approximately 40,000 students and calculated the number of individuals theoretically likely to be uniquely identifiable. These theoretical estimates were then validated against empirical cell counts of demographic combinations in the dataset.
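A minimal sketch of this kind of analysis follows, assuming a pandas DataFrame and hypothetical quasi-identifier columns; the study’s actual variables and computations may differ:

```python
import math
import pandas as pd

# Hypothetical quasi-identifier columns.
QUASI_IDENTIFIERS = ["gender", "ethnicity", "age", "major", "class_year"]

def theoretical_cells(df: pd.DataFrame, qi: list[str]) -> int:
    """Number of possible demographic combinations ("pigeonholes").
    When this far exceeds the number of students, many occupied
    cells are expected to contain exactly one person."""
    return math.prod(df[c].nunique() for c in qi)

def unique_cell_report(df: pd.DataFrame, qi: list[str]) -> None:
    """Validate the theoretical estimate with empirical cell counts:
    individuals in cells of size 1 are uniquely identifiable."""
    # observed=True counts only occupied cells.
    cell_sizes = df.groupby(qi, observed=True).size()
    singletons = int((cell_sizes == 1).sum())
    print(f"{theoretical_cells(df, qi):,} possible cells; "
          f"{len(cell_sizes):,} occupied; "
          f"{singletons:,} of {len(df):,} students "
          f"({singletons / len(df):.1%}) are unique")

# students = pd.read_csv("students.csv")  # ~40,000 rows
# unique_cell_report(students, QUASI_IDENTIFIERS)
```

Run against a mock or historical dataset, a report like this can show, before any new collection begins, how many singleton cells a proposed variable set would produce.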
Findings
This study found that information frequently retained in library learning analytics datasets renders a majority of individuals reidentifiable, and that the burden of this risk falls disproportionately on minority groups. Because these groups are often already subject to heightened discrimination and surveillance, these findings call into question whether learning analytics datasets meet the justice standard for ethical research with human participants.
Action & Impact
This presentation will suggest data collection and aggregation approaches that limit reidentification risk, as well as synthetic data analysis techniques that remove it by statistically substituting quasi-identifiers.
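As one illustration of the aggregation side of such an approach, the sketch below applies generic generalization steps (age banding and collapsing of rare categories, with illustrative column names, bin edges, and thresholds) and checks the k-anonymity condition that every occupied demographic cell holds at least k individuals. This is a standard technique offered as an example, not necessarily the presentation’s exact procedure:

```python
import pandas as pd

def generalize(df: pd.DataFrame) -> pd.DataFrame:
    """Coarsen quasi-identifiers to reduce the number of cells."""
    out = df.copy()
    # Replace exact age with broad bands.
    out["age"] = pd.cut(out["age"], bins=[0, 21, 25, 120],
                        labels=["<=21", "22-25", "26+"])
    # Collapse rare ethnicity categories into an aggregate group.
    counts = out["ethnicity"].value_counts()
    rare = counts[counts < 50].index
    out.loc[out["ethnicity"].isin(rare), "ethnicity"] = "Other/Multiple"
    return out

def is_k_anonymous(df: pd.DataFrame, qi: list[str], k: int = 5) -> bool:
    """True if every occupied quasi-identifier cell has >= k people,
    so no combination of the listed fields isolates an individual."""
    return bool((df.groupby(qi, observed=True).size() >= k).all())
```

In practice, generalization would be iterated (coarser bands, broader category groupings) until the k-anonymity check passes for the chosen quasi-identifier set.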
Practical Implications & Value
This paper provides assessment practitioners with a practical approach for evaluating privacy and reidentification risk in analytical datasets prior to data collection, along with procedures to minimize that risk. By providing an empirical assessment of the reidentification potential of a real student dataset, it contributes to conversations about data ethics and justice with verifiable examples.