Environmental Data & Governance Initiative (EDGI) Response to the Request for Public Comment on Draft Desirable Characteristics of Repositories for Managing and Sharing Data Resulting From Federally Funded Research
Submitted by: Gretchen Gehrke, Grace Poudrier, Steven Gentry, Rob Brackett, and Kelsey Breseman
The Environmental Data & Governance Initiative (EDGI) is a North American network with members from more than thirty different academic institutions and ten nonprofit or grassroots organizations, as well as caring and committed volunteers who come from a broad spectrum of work and life backgrounds. EDGI promotes open and accessible government data and information along with evidence-based policy making. EDGI supports this OSTP effort to make data from federally funded research more available and accessible. Our comments focus mostly on the importance of version control, data accessibility, and facilitation of data utilization. We address specific RFC elements as enumerated in the published Request.
I.A. Persistent Unique Identifiers: EDGI strongly supports the adoption of persistent unique identifiers (PUIDs) and additionally suggests version control for datasets. Version control would include a PUID for each version of a dataset and allow for automatic identification of changes made between versions, including checksums to identify changes or errors made in the processing of the dataset. This would assist researchers conducting secondary analyses to ensure they use the most accurate data, and would support scientific research integrity efforts emerging across the country. To facilitate collaboration and efficient field-wide research progress, EDGI also supports the creation of preliminary data repositories with PUIDs, which would allow researchers to share work in a timely manner throughout the research process, and spur related work without the risk of being scooped. Below (I.C.) we also recommend that research aims and methods be included alongside produced data, both to contextualize the data and to further support researchers sharing their progress without risking their work being improperly appropriated.
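As an illustration of how checksums can identify changes between dataset versions, the following is a minimal Python sketch, not a prescribed implementation. It assumes each version is a set of named files held in memory; the file names, contents, and the helper names `sha256_hex`, `manifest`, and `diff_manifests` are hypothetical and chosen only for this example.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 checksum of a file's contents as a hex string."""
    return hashlib.sha256(data).hexdigest()


def manifest(files: dict[str, bytes]) -> dict[str, str]:
    """Map each file name in one dataset version to its checksum."""
    return {name: sha256_hex(content) for name, content in files.items()}


def diff_manifests(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """List files added, removed, or changed between two version manifests."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(name for name in set(old) & set(new) if old[name] != new[name])
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical example data: two versions of a small dataset.
v1 = {"sites.csv": b"id,ph\n1,7.1\n", "readme.txt": b"pilot study"}
v2 = {
    "sites.csv": b"id,ph\n1,7.2\n",      # one measurement corrected
    "readme.txt": b"pilot study",        # unchanged
    "units.csv": b"ph,unitless\n",       # new supporting file
}

report = diff_manifests(manifest(v1), manifest(v2))
print(report)
```

A repository could store one such manifest alongside each version's PUID, so that a secondary analyst can confirm which files differ between the version they downloaded and the current one.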
I.B. Long-term sustainability: Long-term sustainability is critical. Particularly crucial is the establishment of contingency plans so that researchers maintain access to data in the context of unforeseen events (such as a government shutdown), and that research outcomes can be verified and further utilized for decades into the future. Plans for long-term maintenance of repositories can be checked against a data risk matrix, as all data without plans for risk management must be assumed to be at risk. Safeguarding repositories begins with simple steps such as plans for automated backup of all data. As outlined in FAIR, metadata should be maintained and available even if the data itself is no longer retained.
I.C. Metadata: In addition to being structured and using a community-standard schema, repositories should anticipate the metadata crosswalks that data users might need to perform in order to use a repository’s data, and provide metadata in simplified expressions accordingly. Metadata should include a brief, plain language description of the data; a description of the purpose and design of the research study that created these data, highlighting topic, specific research aims, keywords, and any significant constraints (e.g. specific geographic location, specific geologic age); a full description of methods for data collection, with linked SOPs where possible; exact instruments used for qualitative methods; uncertainties for data points where appropriate (such as standard deviations where multiple measurements were taken to produce one recorded data point); and data dictionaries and other tools necessary to fully understand and contextualize data. Repositories should support the archiving of code, and the code used to process or create data should be available and citable.
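To make the metadata elements listed above concrete, here is a minimal Python sketch of a completeness check a repository might run on a submitted record. The field names and the example record are hypothetical and do not correspond to any particular community schema; a real repository would map them onto whichever standard it adopts.

```python
# Hypothetical field names standing in for the metadata elements described above.
REQUIRED_FIELDS = (
    "plain_language_description",
    "study_purpose_and_design",
    "keywords",
    "collection_methods",
    "data_dictionary",
)


def missing_fields(record: dict) -> list[str]:
    """Return the required metadata fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Hypothetical submission that lacks a data dictionary.
example_record = {
    "plain_language_description": "Stream pH measured monthly at three sites.",
    "study_purpose_and_design": "Track acidification downstream of a mine.",
    "keywords": ["pH", "streams", "water quality"],
    "collection_methods": "Handheld meter, three readings per visit.",
    "uncertainties": "Standard deviation of the three readings per visit.",
}

print(missing_fields(example_record))
```

A check of this kind only confirms that fields are present; judging whether a description is adequate still requires human review.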
I.E., I.F., I.G. Free and Easy to Access and Reuse: EDGI strongly supports free and easy access to datasets. All data, metadata, and supporting information (e.g. methods) should be freely available via web-based search and download. Data should, at a minimum, be searchable by keyword, topic, location, dataset size, funding agency, and year of completion. Publications that cite a given dataset should be findable from the dataset site. Those citations should be available immediately, and full access to the citing publications should be available one year after their publication. Data repositories should be navigable from agency websites and from publications that cite them. Where data can’t be published in the public domain, a menu of standard licenses is preferable to custom licensing terms. As described above (I.C.), extensive metadata is also crucial for accessibility and reuse, especially descriptions of research design such that other researchers and the public can gauge the appropriateness of a dataset for their aims and be aware of any data quality issues that may hinder their reuse of a given published dataset. As outlined in the FAIR principles, any data not open and accessible to the public should have openly accessible metadata which includes clear protocols and contact information for gaining access to the data.
I.J. Common Format: EDGI strongly supports the use of common data formats, and particularly non-proprietary formats. The Library of Congress has listed several formats and their descriptions that could serve as a basis for format requirements. For rare or specialized data formats, the rendering software should be indicated and (where possible) included in the repository.
I.K. Provenance: Provenance is critical. A repository with good provenance maintains a detailed logfile of changes to datasets and metadata, including date and user, beginning with creation/upload of the dataset, to ensure data integrity. Please see the comment in I.A. suggesting a PUID for each version, with changes between versions identified and tabulated, including changes due to preservation actions. Each version of a dataset should link to every other version. It would also be helpful to see links to archived copies of the data management plans for projects that contributed data to a repository; even after a study concludes, this makes it easier to verify methods used.
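The kind of logfile described here can be sketched as an append-only record of who changed what and when, starting with the upload. The following Python example is one possible shape under stated assumptions, not a reference design: the class names `ProvenanceEntry` and `ProvenanceLog`, the action labels, the user names, and the version identifiers are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceEntry:
    """One immutable log line: when, who, what, and which version."""
    timestamp: str
    user: str
    action: str
    version_id: str
    note: str


class ProvenanceLog:
    """Append-only log that must begin with the dataset upload."""

    def __init__(self) -> None:
        self._entries: list[ProvenanceEntry] = []

    def record(self, user: str, action: str, version_id: str, note: str = "") -> ProvenanceEntry:
        if not self._entries and action != "upload":
            raise ValueError("the first entry must record the dataset upload")
        entry = ProvenanceEntry(
            timestamp=datetime.now(timezone.utc).isoformat(),
            user=user,
            action=action,
            version_id=version_id,
            note=note,
        )
        self._entries.append(entry)
        return entry

    def entries(self) -> tuple[ProvenanceEntry, ...]:
        """Return a read-only view of the log."""
        return tuple(self._entries)

    def to_json_lines(self) -> str:
        """Serialize the log as one JSON object per line."""
        return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in self._entries)


# Hypothetical usage: an upload followed by a preservation action.
log = ProvenanceLog()
log.record("submitter", "upload", "dataset-v1", "initial deposit")
log.record("repository-staff", "format-migration", "dataset-v2", "preservation action")
print(log.to_json_lines())
```

Recording preservation actions in the same log as researcher edits keeps the full history of a dataset in one place, which is the property this comment asks repositories to provide.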
II.G. Retention Guidelines: Retention guidelines are important not just for repositories storing human data, but for all data repositories. These should be developed and applied universally. OSTP or lead funding agencies should institute periodic system-wide monitoring processes to ensure data and any requisite software remain functional and available.
In sum, EDGI supports OSTP’s formation of desirable characteristics for management and sharing of data from federally funded research. The FAIR guidelines are an excellent starting point. EDGI draws on a rich experience across environmental science research, website monitoring, and use of federal data sites to suggest further desirable characteristics of repositories that would aid in research, reuse, archiving, and community access to data.