At DataRescue, students collaborate to protect vital government data

Hackers fear Trump administration will attempt to tamper with life-saving information

DataRescue Boston at MIT, a day-long hackathon focused on preserving federal data at risk of manipulation or removal by the Trump administration, took place Saturday in Walker Memorial.

Two hundred sixty people registered for the event, and between 100 and 200 showed up.

The hackathon featured four “tracks”: surveying, seeding, harvesting, and storytelling. This workflow was developed by the Environmental Data and Governance Initiative (EDGI) and the University of Pennsylvania’s DataRefuge.

The seeding track required the least technical expertise: the seeders’ task was to click through a set of assigned links, marking every few links for mirroring by the Internet Archive, and flagging links that contained databases, PDFs, spreadsheets, or other formats the Archive doesn’t back up. The flagged links were sent to a special app developed by EDGI that coordinates them for the harvesting track. About 80,000 total links were seeded at or after the hackathon.

The harvesting track drew the largest number of participants. The harvesters focused their efforts on four organizations: the DOE, NOAA, NASA, and the EPA. They used their technical knowledge to scrape data from webpages identified by the seeders, harvesting 53 datasets and uploading 35 gigabytes of data over the course of the hackathon.

Lead organizer Jeff Liu G noted that at previous events participants had been split closer to half-and-half between harvesting and other tracks. Liu attributed the greater skew towards harvesting at the MIT event to the wealth of technical expertise of the MIT community.

In his introductory speech at the hackathon, Liu noted that the goal was not to create “verbatim versions” of data from government websites, but to preserve the data in research-quality formats. For example, if the data in question existed in a series of HTML webpages, the harvesters and others processing it for storage could import it into a database instead.

Harvesters collaborated with each other, sharing code on GitHub and exchanging ideas. The volunteer coordinators urged more experienced programmers to help their less experienced peers.

At one of the EPA tables, a volunteer coordinator asked what programming languages people were using. R, the person next to her said, and someone across the table nodded assent. “We should switch seats,” the coordinator said. Later, the two R users could be seen hunched over each other’s screens in conversation.

The storytelling track, according to coordinator Renee H. Bell G, aimed to “show why the work actually matters.” Storytellers profiled participants at the hackathon. They also researched who the stakeholders are: who uses the data archived at the hackathon? Several storytellers milled around the room interviewing participants. They also produced in-depth stories covering the datasets of the National Water Information System in the USGS, the Alternative Fuels Data Center in the DOE, and the Global Historical Climatology Network in NOAA.

In the surveying track, participants researched government organizations and wrote primers about their structure and function. The surveyors focused on the Departments of Labor, Justice, and Health and Human Services: organizations to which the DataRescue movement is hoping to expand its efforts.

During the hackathon, the surveyors managed to write five main agency primers and 16 agency sub-primers covering these departments along with the Department of Housing and Urban Development and the Federal Communications Commission.

Participants at the surveying tables discussed their motivations with the storytellers and the press. One student from Harvard Medical School, who asked to remain anonymous, expressed fear that the Trump administration would create fake CDC data linking vaccines to autism. In his research, he is working to develop a tool to detect autism at birth and “show that [it] exists way before the vaccination.”

“I don’t want to see measles killing 1,000 children a year like it used to,” he said.

Another M.D./Ph.D. student at the Health and Human Services table described herself as “not a very political person,” but said she thought it was important to preserve the data.

At another of the surveying tables, Alex V. Konradi G, a master’s student in CSAIL, was writing up a primer on the Executive Office for Immigration Review in the Department of Justice.

“I read the news and it scares me,” he said. To him, the possibility that scientific data could be manipulated by the government behind citizens’ backs is “Orwellian.”

Next to Konradi sat Michael Altman, director of research at MIT Libraries. Trained in social sciences and experienced in methods of research, he was a natural fit for the Department of Justice table. He spoke about the role of MIT Libraries in helping to organize the hackathon. “The library has a long history of engaging in stewardship and preservation,” he said, and preserving the at-risk government data is now a part of that effort.

DataRescue Boston at MIT was hosted by MIT Libraries, the Association of Computational Science and Engineering Students, and EDGI, with support from the MIT Environmental Solutions Initiative, the Center for Computational Engineering, the Department of Civil and Environmental Engineering, and the Graduate Student Council.

The event is part of a larger national movement organized by EDGI and DataRefuge. EDGI is a U.S.-based organization formed in November to monitor changes in federal agencies and archive federal environmental data to ensure it remains publicly accessible.

Organizer Jeff Liu spoke about the origin of the MIT event. Articles related to the DataRescue mission “popped up in one of the news feeds that I read,” he said, and “I wanted to find a way to contribute my background.” He contacted national organizers of the movement to see if there was anyone organizing an event at MIT. As it turned out, no one was doing so, but a number of EDGI members were located in Boston. They joined together to form DataRescue Boston and plan events in the area.

DataRescue Boston previously hosted a hackathon at Harvard on Feb. 1 and plans to host another at Northeastern on March 24. The group also hosts a weekly MIT working group on Thursday evenings from 5 to 8 p.m. in room 5-233.