EDGI Coauthors Data Risk Categorization Paper with ESIP

By: Kelsey Breseman, EDGI

The Environmental Data & Governance Initiative (EDGI) formed in 2016 as an emergency effort to ensure that public climate data stayed available. Together with the University of Pennsylvania, it designed and organized 48 “Data Rescue” events at which volunteers and activists saved 200 terabytes of government data (EDGI 2018). These events prompted bigger-picture questions about public data: What data is worth saving? Where can data go to be safe? How safe can it be made? As part of Earth Science Information Partners (ESIP), EDGI has started to answer these questions in a new paper entitled “Risk Assessment for Scientific Data.”

The paper outlines a data risk matrix to help data managers understand the risks to a given dataset and select mitigation strategies. EDGI’s more recent (2019) archiving work was assessed as a case study for the new metrics. The assessment of the repository offers lessons that may be helpful to its data manager: for example, some of the structured data saved is at risk from file format obsolescence, and less so from lack of metadata. The major value of the assessment, however, is its contribution to the data risk matrix tool, which can be used across many repositories to determine how endangered the data they contain is.

All data is at risk of loss or deletion. Just as paper files can suffer water damage or get lost in enormous stacks, digital data can be lost to file corruption, to file format obsolescence, or to becoming unsearchable for lack of metadata. Because no data can be secured with an absolute guarantee, risk mitigation is a major part of a data manager’s role.

The risk matrix maps risk factors against various “categorizations”, which separate out different aspects of the risk, such as the length of recovery time should the risk factor come into play, which entity is best positioned to mitigate the risk, and the likelihood of the given risk.

The paper, which is available as a preprint on EarthArXiv, identifies the following risk factors:

Lack of use
Loss of funding for archive
Loss of funding for specific datasets
Loss of knowledge around context or access
Lack of documentation & metadata
Data mislabeling
Catastrophes
Poor data governance
Legal status for ownership and use
Media deterioration
Missing files
Dependence on service provider
Accidental deletion
Lack of planning
Cybersecurity breach
Over-abundance
Political interference
Lack of provenance information
File format obsolescence
Storage hardware breakdown
Bit rot and data corruption

And the categorizations used are:

Severity of risk
Likelihood of occurrence
Length of recovery time
Impact on user
Who is responsible for addressing the problem
Cause of problem
Degree of control
Proactive vs reactive response
Nature of mitigation
Resources required for mitigation
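
As a rough illustration of how risk factors map against categorizations, the matrix can be pictured as a table with one cell per (risk factor, categorization) pair. The Python sketch below is not taken from the paper: the `RiskMatrix` class, its method names, the 1-3 ordinal scale, and the ratings entered are all assumptions made for illustration. Only the risk factor and categorization names come from the lists above.

```python
from dataclasses import dataclass, field

# A subset of the risk factors and categorizations listed above.
RISK_FACTORS = [
    "File format obsolescence",
    "Lack of documentation & metadata",
    "Accidental deletion",
]
CATEGORIZATIONS = [
    "Severity of risk",
    "Likelihood of occurrence",
    "Length of recovery time",
]


@dataclass
class RiskMatrix:
    """Hypothetical container: one cell per (risk factor, categorization)."""

    cells: dict = field(default_factory=dict)

    def rate(self, factor: str, categorization: str, value: int) -> None:
        """Record a rating; the 1-3 ordinal scale is an assumption."""
        if factor not in RISK_FACTORS:
            raise ValueError(f"unknown risk factor: {factor}")
        if categorization not in CATEGORIZATIONS:
            raise ValueError(f"unknown categorization: {categorization}")
        if value not in (1, 2, 3):
            raise ValueError("value must be 1 (low), 2 (medium), or 3 (high)")
        self.cells[(factor, categorization)] = value

    def row(self, factor: str) -> dict:
        """All ratings for one risk factor; unrated cells are None."""
        return {c: self.cells.get((factor, c)) for c in CATEGORIZATIONS}

    def rank(self, categorization: str) -> list:
        """Rated risk factors under one categorization, highest first."""
        rated = [
            (f, self.cells[(f, categorization)])
            for f in RISK_FACTORS
            if (f, categorization) in self.cells
        ]
        return sorted(rated, key=lambda pair: pair[1], reverse=True)


# Placeholder ratings for an imaginary repository, not results from the paper.
matrix = RiskMatrix()
matrix.rate("File format obsolescence", "Likelihood of occurrence", 3)
matrix.rate("Lack of documentation & metadata", "Likelihood of occurrence", 1)

for factor, value in matrix.rank("Likelihood of occurrence"):
    print(f"{factor}: {value}")
```

Keying each cell by the pair keeps every rating tied to both its row and its column, so a data manager could read across one risk factor with `row` or compare all factors under a single categorization with `rank`. Several of the paper's categorizations, such as who is responsible for addressing the problem, are not numeric, so a fuller version would need to store labels as well as ordinal values.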

The paper includes complete descriptions of the risk factors and categorizations, as well as a collection of use cases ranging from physical artifact collections to born-digital data troves.