Archiving Data

We seek to preserve publicly accessible and potentially vulnerable scientific data and to archive web pages from EPA, DOE, NOAA, OSHA, NASA, USDA, DOI, and USGS. Our archiving work has primarily occurred through more than 30 DataRescue events, also called “archive-a-thons,” in cities across the U.S. and Canada. With the help of the Internet Archive and DataRefuge, these efforts directly contributed to the largest recorded End of Term harvest:

Between Fall 2016 and Spring 2017, the Internet Archive archived over 200 terabytes of government websites and data. This includes over 100TB of public websites and over 100TB of public data from federal FTP file servers totaling, together, over 350 million URLs/files. This includes over 70 million html pages, over 40 million PDFs and, towards the other end of the spectrum and for semantic web aficionados, 8 files of the text/turtle mime type. Other End of Term partners have also been vigorously preserving websites and data from the .gov/.mil web domains.

In addition, the EDGI volunteer tech team of over 30 contributors has built open source tools and projects for grassroots archiving, all freely available to the public on our GitHub. We are now moving into the next phase of archiving, Data Together; you can read more on our blog.

 

Vulnerable Data

The End of Term harvest is over, but there are still ways to preserve programs or datasets you are concerned about:

  1. Check whether it has already been preserved
  2. Nominate web pages or datasets using the Government Web & Data Archive bookmarklet or our Chrome Nomination Extension
  3. Securely contact us with information about vulnerable programs or data via EnviroDGI@protonmail.com (learn more about sending secure email to us)
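
For contributors who prefer to script the first step, here is a minimal sketch of one way to check whether a page already has a snapshot. It assumes the Internet Archive's public Wayback Machine availability endpoint, which may not be the same check our own tools perform. The function names (`availability_url`, `closest_snapshot`, `check_preserved`) are hypothetical and chosen only for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the Wayback Machine availability endpoint is used for the check.
WAYBACK_API = "https://archive.org/wayback/available"


def availability_url(page_url):
    """Build the availability query URL for a page (hypothetical helper)."""
    return WAYBACK_API + "?" + urllib.parse.urlencode({"url": page_url})


def closest_snapshot(api_response):
    """Return (snapshot_url, timestamp) from a parsed response, or None."""
    closest = api_response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"], closest["timestamp"]


def check_preserved(page_url, timeout=10):
    """Query the endpoint over the network and return the closest snapshot, if any."""
    with urllib.request.urlopen(availability_url(page_url), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))


if __name__ == "__main__":
    # Requires network access; prints None when no snapshot is reported.
    print(check_preserved("https://www.epa.gov/"))
```

A result of `None` only means that this endpoint reported no snapshot. Datasets behind forms, databases, or FTP servers may still need to be nominated by hand.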

Number of Unique Webpages Seeded (i.e., submitted to the Internet Archive using our Chrome Extension): 63,076
Number of Pages with Datasets in our archiving workflow*: 21,798 (34.6% of Unique Webpages Seeded)

 

Join Us – This is a Movement

Building on this incredible outpouring of support, EDGI convened an online town hall meeting to discuss “lessons learned” from previous events and to develop new strategies for data archiving.

We welcome ongoing contributors to our preservation tools and website change monitoring project:

  1. The EDGI GitHub project overview is the place to start!
  2. We aim to have an inclusive community; please review our Contributor Guidelines and Code of Conduct
  3. Jump on the Archivers chat (archivers.slack.com); anyone can request an invite from archivers-slack.herokuapp.com
    • Introduce yourself in #introductions. Key starting places for conversations are #general, #datatogether, and #community-building
    • Ping one of the EDGI coordinators (@kelsey or @lightandluck) with your questions!

*This number includes web pages that were previously referred to as “Uncrawlable,” as well as other types of datasets or repositories of information to be archived. Here is the breakdown (percentages are of the 21,798 pages with datasets):
Number of Pages with Many Files: 6,792 (31.2%)
Number of Pages with Visualization/Interactive Features: 7,129 (32.7%)
Number of Pages with a Database: 3,576 (16.4%)
Number of FTP pages: 693 (3.2%)