Written by Janak Raj Chadha.
Identifying and Prioritizing Important Website Changes
Introduction
The Environmental Data & Governance Initiative (EDGI) is an international network of academics and non-profits addressing potential threats to federal environmental and energy policy, and to the scientific research infrastructure built to investigate and inform that policy.
EDGI monitors thousands of government webpages to identify important changes in data or information. Each week, hundreds of changes take place, which a team of analysts reviews to identify changes worthy of public reporting. The goal of this project is to improve the efficiency of this process by algorithmically identifying and prioritizing important changes, allowing the analysts to devote their time to understanding these changes and creating impactful reports.
Changes worth reviewing share characteristic features, which machine learning algorithms can learn and use to classify them.
My project provides the following solutions:
- A set of filters which can identify insignificant changes (which may be a major share of all changes), eliminating the need to spend time reviewing them.
- Machine learning models which can prioritize changes and present them to the analyst team for careful analysis.
Work Report
Phase 1
The website monitoring project involves working with data from multiple sources. The three primary sources are Versionista, PageFreezer, and the Internet Archive. An important initial task was researching and clearly documenting the differences between their data formats.
Another initial task was reviewing the changes in detail to identify patterns that would allow filtration of insignificant changes. This included looking at the changes between versions in different modes (rendered, plain text, source) and looking for frequently occurring, similar changes.
Related GitHub Issues:
- Document the differences in the data format of the different sources (PageFreezer, Versionista).
- Analysis of insignificant changes dictionary
- Add functionality to get cabinet ID of a specific URL
Related GitHub Pull Requests:
Phase 2
After identifying patterns in insignificant changes, I worked on filters to label these changes.
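As a rough illustration of what such a filter can look like, the sketch below labels a change as insignificant when two versions differ only by dates or whitespace. The date pattern, the function names, and the label strings are assumptions made for this example; they are not the patterns or code used in the project.

```python
import re

# Hypothetical pattern for this sketch: numeric dates such as 01/02/2017.
# The patterns actually identified in the project are not listed here.
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")


def normalize(text):
    """Mask dates and collapse whitespace so trivial edits compare equal."""
    text = DATE_PATTERN.sub("<DATE>", text)
    return " ".join(text.split())


def label_change(old_text, new_text):
    """Label a change 'insignificant' if only masked patterns differ."""
    if normalize(old_text) == normalize(new_text):
        return "insignificant"
    return "needs review"


# A date-only edit is filtered out; a change in content is not.
date_only = label_change("Last updated 01/02/2017\nHello",
                         "Last updated 03/04/2017\n Hello")
content = label_change("Limit is 10 ppm", "Limit is 20 ppm")
```

Normalizing both versions before comparing keeps each filter small and easy to extend: a newly identified pattern becomes one more substitution in `normalize`.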
The project currently relies on Versionista’s services to view changes. There are plans to compute the differences between versions with services created by our own developers, and the groundwork for these services has already been completed. I worked on adding and improving these “diff” (page comparison) services.
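To give a sense of what a text diff service computes, here is a minimal sketch built on Python's standard `difflib` module. It is only an assumption about the general approach: the function name `text_diff` is hypothetical, and the project's own diff services may be implemented differently.

```python
import difflib


def text_diff(old_text, new_text):
    """Return the lines added and removed between two plain-text versions."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    added, removed = [], []
    # ndiff prefixes each line with '+ ', '- ', '  ' or '? ';
    # only the first two mark actual additions and removals.
    for line in difflib.ndiff(old_lines, new_lines):
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
    return added, removed


added, removed = text_diff("Overview\nOld policy text\nContact us",
                           "Overview\nNew policy text\nContact us")
```

Separating additions from removals makes it straightforward to feed each side into later filtering or classification steps.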
Related GitHub Issues:
- Determine key features in diffs that could be used for filtration
- Create functions that ID/characterize page elements for later use in filtration
- Text diff includes some HTML code
- Add HTML diff for rendering
- Fix tests and CI builds
Related GitHub Pull Requests:
Phase 3
After categorizing insignificant changes, I moved on to important changes. This requires creating machine learning models that are able to classify (and, at a later stage, prioritize) changes.
I am working on two tasks:
- Code to download the content stored in our database which will act as input to the models.
- Framework for the machine learning models which can be used once the data is ready.
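The idea of classifying and then prioritizing changes can be sketched with a small, dependency-free naive Bayes classifier. Everything here is an assumption for illustration: the class and function names are hypothetical, the four training examples are invented toy data, and the project's actual models and features may differ.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesChangeClassifier:
    """Tiny multinomial naive Bayes over the words of a change."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        self.vocab = set()
        for text, label in zip(texts, labels):
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def log_score(self, text, label):
        """Log prior plus Laplace-smoothed log likelihood for one label."""
        total_words = sum(self.word_counts[label].values())
        total_docs = sum(self.label_counts.values())
        score = math.log(self.label_counts[label] / total_docs)
        for word in text.lower().split():
            count = self.word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(self.vocab)))
        return score

    def predict(self, text):
        return max(self.label_counts,
                   key=lambda label: self.log_score(text, label))


def prioritize(classifier, changes, important_label="important"):
    """Sort changes so the most likely important ones come first.

    Assumes the classifier was trained on at least two labels.
    """
    def margin(text):
        others = [l for l in classifier.label_counts if l != important_label]
        best_other = max(classifier.log_score(text, l) for l in others)
        return classifier.log_score(text, important_label) - best_other
    return sorted(changes, key=margin, reverse=True)


# Invented toy data, for illustration only.
clf = NaiveBayesChangeClassifier().fit(
    ["removed climate change section", "updated footer date",
     "deleted emissions data table", "changed copyright year"],
    ["important", "insignificant", "important", "insignificant"],
)
ranked = prioritize(clf, ["updated footer year", "removed climate data"])
```

Ranking by the score margin rather than by the predicted label alone gives analysts an ordered queue, so the changes the model is most confident about are reviewed first.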
Related GitHub Issues:
Related GitHub Pull Requests:
Work Status
Completed
- Documentation of different data formats, with demos
- Identification of patterns in insignificant changes and creation of filters
- Addition and improvement of methods for computing differences
In Progress
- Dataset creation
- Machine learning groundwork
Future Scope
- Explore different types of classifier models
- Create a mechanism to automatically add correctly prioritized changes to the training dataset once a minimum number is collected
- Automate retraining of model after dataset is updated
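One possible shape for the last two items is a buffer that collects analyst-confirmed changes and triggers retraining once a minimum number is reached. This is only a sketch of the idea: `TrainingBuffer`, `min_batch`, and the `retrain` callback are hypothetical names, and none of this has been implemented in the project.

```python
class TrainingBuffer:
    """Collect confirmed changes and retrain once enough have accumulated."""

    def __init__(self, min_batch, retrain):
        self.min_batch = min_batch  # minimum new examples before retraining
        self.retrain = retrain      # callback that receives the full dataset
        self.pending = []
        self.dataset = []

    def confirm(self, text, label):
        """Record a change whose priority was confirmed as correct.

        Returns True if this confirmation triggered a retraining run.
        """
        self.pending.append((text, label))
        if len(self.pending) < self.min_batch:
            return False
        self.dataset.extend(self.pending)
        self.pending.clear()
        self.retrain(self.dataset)
        return True


# Usage with a stand-in retrain callback that records the dataset size.
runs = []
buffer = TrainingBuffer(min_batch=2, retrain=lambda data: runs.append(len(data)))
first = buffer.confirm("removed climate section", "important")
second = buffer.confirm("updated footer date", "insignificant")
```

Batching confirmations avoids retraining after every single review while still keeping the model reasonably current.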
GitHub Links
My Thoughts
I had an amazing time working on this project. It was a really exciting ride and I could not have completed it without the constant help and support from my mentors and all the other wonderful people I interacted with. This was my first experience working with such a large community and I believe I have become a better developer. I’m humbled to be a part of such an incredible network of people and I will forever be grateful to EDGI for this opportunity.