By: Kelsey Breseman, EDGI
The Environmental Data & Governance Initiative (EDGI) formed in 2016 as an emergency effort to keep public climate data available. Together with the University of Pennsylvania, it designed and organized 48 “Data Rescue” events at which volunteers and activists saved 200 terabytes of government data (EDGI 2018). These events prompted bigger-picture questions about public data: What data is worth saving? Where can data go to be safe? How safe can it be made? As part of Earth Science Information Partners (ESIP), EDGI has started to answer these questions in a new paper entitled “Risk Assessment for Scientific Data.”
The paper outlines a data risk matrix to help data managers understand the risks to a given dataset and select mitigation strategies. EDGI’s more recent (2019) archiving work was assessed as a case study for the new metrics. The repository assessment offers lessons that may be helpful to data managers (for example, some of the saved structured data is at risk from file format obsolescence, but less so from lack of metadata). The major value of the assessment, however, is its contribution to the data risk matrix tool itself, which can be applied across many repositories to determine the degree to which the data they contain is endangered.
All data is at risk of loss or deletion. Just as paper files can suffer water damage or get lost in enormous stacks, digital data can be lost to file corruption, rendered unfindable by missing metadata, or stranded by file format obsolescence. Since no data can be secured with an absolute guarantee, a major role of a data manager is risk mitigation.
The risk matrix maps risk factors against a set of “categorizations,” each of which captures a different aspect of a risk: for example, how long recovery would take if the risk came to pass, which entity is best positioned to mitigate it, and how likely it is to occur.
From the paper, which is available in preprint on EarthArXiv, the full list of identified risk factors is:
- Lack of use
- Loss of funding for archive
- Loss of funding for specific datasets
- Loss of knowledge around context or access
- Lack of documentation & metadata
- Data mislabeling
- Catastrophes
- Poor data governance
- Legal status for ownership and use
- Media deterioration
- Missing files
- Dependence on service provider
- Accidental deletion
- Lack of planning
- Cybersecurity breach
- Over-abundance
- Political interference
- Lack of provenance information
- File format obsolescence
- Storage hardware breakdown
- Bit rot and data corruption
And the categorizations used are:
- Severity of risk
- Likelihood of occurrence
- Length of recovery time
- Impact on user
- Who is responsible for addressing the problem
- Cause of problem
- Degree of control
- Proactive vs reactive response
- Nature of mitigation
- Resources required for mitigation
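
To make the matrix concrete, here is a minimal sketch, in Python, of how its rows might be represented in code. This is an illustration, not the paper’s actual schema: the field names and example values are hypothetical, and only a few of the ten categorizations are included. The risk factor names are taken from the list above.

```python
from dataclasses import dataclass

# Illustrative sketch only: field names are hypothetical, not the paper's schema.
@dataclass
class RiskAssessment:
    risk_factor: str        # one of the paper's risk factors
    severity: str           # "Severity of risk"
    likelihood: str         # "Likelihood of occurrence"
    recovery_time: str      # "Length of recovery time"
    responsible_party: str  # "Who is responsible for addressing the problem"
    mitigation: str         # "Nature of mitigation"

# The matrix is then a collection of assessments, one per risk factor:
matrix = [
    RiskAssessment(
        risk_factor="File format obsolescence",
        severity="moderate",
        likelihood="high",
        recovery_time="weeks (format migration)",
        responsible_party="repository data manager",
        mitigation="proactively migrate to open, well-documented formats",
    ),
    RiskAssessment(
        risk_factor="Lack of documentation & metadata",
        severity="high",
        likelihood="low",
        recovery_time="months (re-document from original sources)",
        responsible_party="data depositor",
        mitigation="require minimum metadata at deposit time",
    ),
]

# A data manager might triage by surfacing the most likely risks first:
for assessment in matrix:
    if assessment.likelihood == "high":
        print(f"{assessment.risk_factor}: {assessment.mitigation}")
```

Structuring the matrix this way means the same assessment can be re-run against many repositories, with each one filling in its own values per risk factor.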
The paper includes complete descriptions of the risk factors and categorizations, as well as a collection of use cases ranging from physical artifact collections to born-digital data troves.