Google Summer of Code 2017: Janak Raj Chadha – Web Monitoring Project


Written by Janak Raj Chadha.

Identifying and Prioritizing Important Website Changes

Introduction

The Environmental Data & Governance Initiative (EDGI) is an international network of academics and non-profits addressing potential threats to federal environmental and energy policy, and to the scientific research infrastructure built to investigate and inform.

EDGI monitors thousands of government webpages to identify important changes in data or information. Each week, hundreds of changes take place, which a team of analysts reviews to identify changes worthy of public reporting. The goal of this project is to improve the efficiency of this process by algorithmically identifying and prioritizing important changes, allowing the analysts to devote their time to understanding these changes and creating impactful reports.

Changes worth reviewing share characteristic features, which machine learning algorithms can use to classify them.

My project provides the following solutions:

  1. A set of filters which can identify insignificant changes (which may be a major share of all changes), eliminating the need to spend time reviewing them.
  2. Machine learning models which can prioritize changes and present them to the analyst team for careful analysis.

 

Work Report

Phase 1

The website monitoring project involves working with data from multiple sources. The three primary sources are Versionista, PageFreezer, and the Internet Archive. An important initial task was researching the differences between their data formats and documenting them clearly.

Another initial task was reviewing the changes in detail to identify patterns that would allow filtration of insignificant changes. This included looking at the changes between versions in different modes (rendered, plain text, source) and looking for frequently occurring, similar changes.

Related GitHub Issues:

Related GitHub Pull Requests:

 

Phase 2

After identifying patterns in insignificant changes, I worked on filters to label these changes.
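
As an illustration of what such a filter might look like, here is a minimal sketch in Python. It assumes that a change counts as insignificant when the two versions differ only in dates, clock times, or whitespace. The patterns and the names `NOISE_PATTERNS`, `normalize`, and `is_insignificant` are hypothetical and are not taken from the project's code.

```python
import re

# Hypothetical patterns for text that often changes without carrying meaning.
NOISE_PATTERNS = [
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),  # dates such as 08/21/2017
    re.compile(r"\b\d{1,2}:\d{2}(:\d{2})?\b"),   # clock times such as 14:05
]


def normalize(text):
    """Mask noisy tokens and collapse runs of whitespace."""
    for pattern in NOISE_PATTERNS:
        text = pattern.sub("<noise>", text)
    return " ".join(text.split())


def is_insignificant(old_text, new_text):
    """Return True if the versions differ only in masked tokens or whitespace."""
    return normalize(old_text) == normalize(new_text)
```

A filter of this kind only labels a change; whether a labeled change is hidden from analysts or simply ranked lower is a separate decision.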

The project currently relies on Versionista’s services to view changes. There are plans to use services created by our own developers to compute the differences between versions, and the groundwork for these services has already been completed. I worked on adding and improving these “diff” (page comparison) services.
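
To give a rough idea of what a page comparison service computes, the sketch below produces a word-level diff of two plain-text versions using Python's standard `difflib` module. It is an assumption-laden example, not the project's actual diff service: the function name `text_diff` and the `("removed" | "added", text)` output format are made up for illustration.

```python
import difflib


def text_diff(old_text, new_text):
    """Return (operation, text) pairs describing how old_text became new_text."""
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # A "replace" opcode contributes both a removal and an addition.
        if tag in ("delete", "replace"):
            changes.append(("removed", " ".join(old_words[i1:i2])))
        if tag in ("insert", "replace"):
            changes.append(("added", " ".join(new_words[j1:j2])))
    return changes
```

Comparing words rather than characters keeps the output readable for analysts, at the cost of ignoring whitespace-only edits.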

Related GitHub Issues:

Related GitHub Pull Requests:

 

Phase 3

After categorizing insignificant changes, I moved on to important changes. This requires creating machine learning models that are able to classify (and, at a later stage, prioritize) changes.

I am working on two tasks:

  1. Code to download the content stored in our database which will act as input to the models.
  2. Framework for the machine learning models which can be used once the data is ready.
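
The kind of model such a framework could host can be sketched with a small example. The code below is a toy multinomial naive Bayes classifier written in plain Python. It assumes that each change is represented by its changed text and a label supplied by an analyst. The class name `ChangeClassifier`, the labels, and the choice of naive Bayes are all assumptions made for illustration, not the models used in the project.

```python
import math
from collections import Counter, defaultdict


class ChangeClassifier:
    """Toy multinomial naive Bayes over the words of a change."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()            # label -> number of examples
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.label_counts[label] += 1
            self.vocabulary.update(words)
        return self

    def log_scores(self, text):
        """Return a log-probability score (up to a constant) for each label."""
        total_examples = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, count in self.label_counts.items():
            score = math.log(count / total_examples)
            label_total = sum(self.word_counts[label].values())
            for word in text.lower().split():
                if word not in self.vocabulary:
                    continue  # words never seen in training carry no evidence
                # Laplace smoothing avoids zero probabilities.
                numerator = self.word_counts[label][word] + 1
                score += math.log(numerator / (label_total + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)
```

The per-label scores could later serve as a basis for prioritization, for example by sorting changes by how strongly the model favors the important label.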

Related GitHub Issues:

Related GitHub Pull Requests:

 

Work Status

Completed

  • Documentation of different data formats, with demos
  • Identification of patterns in insignificant changes and creation of filters
  • Addition and improvement of methods for computing differences

 

In Progress

  • Dataset creation
  • Machine learning groundwork

 

Future Scope

  • Explore different types of classifier models
  • Create a mechanism to automatically add correctly prioritized changes to the training dataset once a minimum number is collected
  • Automate retraining of the model after the dataset is updated
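
One possible shape for the last two items is sketched below. It assumes that analyst-confirmed changes are held in a pending list and merged into the training dataset once a minimum number has been collected, at which point a retraining callback runs. The class `TrainingBuffer`, its parameters, and the default threshold of 50 are hypothetical. Nothing like this exists in the project yet.

```python
class TrainingBuffer:
    """Collect confirmed changes and trigger retraining in batches."""

    def __init__(self, retrain, min_new_examples=50):
        self.retrain = retrain                  # callback that receives the full dataset
        self.min_new_examples = min_new_examples
        self.dataset = []                       # examples already used for training
        self.pending = []                       # confirmed examples not yet merged

    def add_confirmed(self, change_text, priority):
        """Record a confirmed change; return True if retraining was triggered."""
        self.pending.append((change_text, priority))
        if len(self.pending) < self.min_new_examples:
            return False
        self.dataset.extend(self.pending)
        self.pending.clear()
        self.retrain(list(self.dataset))        # pass a copy so callers cannot mutate it
        return True
```

Batching the updates keeps the model from being retrained after every single confirmation, which matters more as the dataset grows.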

 

GitHub Links

 

My Thoughts

I had an amazing time working on this project. It was a really exciting ride and I could not have completed it without the constant help and support from my mentors and all the other wonderful people I interacted with. This was my first experience working with such a large community and I believe I have become a better developer. I’m humbled to be a part of such an incredible network of people and I will forever be grateful to EDGI for this opportunity.