[Image: decision tree]

The two watercooler chats this week tackled the ins and outs of open source projects and offered an overview of a text classification project at NASA.

Intro to Open Source Projects

Our guest presenter on open source projects was Tom Caswell, a physicist by training who is currently working in the Data Acquisition Management and Analysis group at Brookhaven National Lab in New York. When he’s not wearing an associate computational scientist hat, Tom is the lead developer of Matplotlib, one of the most popular visualization libraries in Python.

Tom dipped his toe in the open source pool by first answering questions and then addressing bug reports on Stack Overflow. When he started submitting pull requests to fix real bugs, he discovered that many of the skills involved in contributing to open source projects translated to his day job—programming, project management, and the social aspect of providing constructive code reviews.

Datanauts who have contributed to or developed their own open source projects agreed that the key word in constructive criticism is ‘constructive.’

“As a contributor, you have the right to decide how you spend your time. Someone saying, ‘This doesn't meet our standards and X needs to be changed’ is fine. ‘You're a moron’ isn't,” said Datanaut Wendy Edwards (Spring 2017 Class). “In my experience [as a coder], you don't want to take feedback too personally—often people are right about stuff….That said, you want to find a project that's not toxic.”

Datanaut Julia Silge (2016 Class) shared developer Brian Anderson’s blog post about being a “Minimally Nice Open Source Maintainer.”

Tom recommended Brett Cannon’s talk from this year’s JupyterCon about the implicit social contracts between the different roles that people play in an open source project. Users and contributors each bring their own set of expectations.

Another social aspect of working on open source projects involves trust among maintainers of the same project. Early on, Tom specialized in certain types of bug reports on Stack Overflow for the sake of efficiency. He recommended carving out a region of the project that is relevant to the overall problem being solved, focusing on that, and trusting that the other parts of the code will work the way they’re intended to work.

Ultimately, Tom encouraged the Datanauts to jump into contributing to open source projects—“No contribution in open source is too small!”

Using Data Science to Solve Federal Records Challenges

A few of our watercooler chats during this Datanauts class have touched on natural language processing, from both an R and a Python perspective. Dr. Andrew Adrian, a member of the Data Analytics Lab team at NASA Headquarters, shared with Datanauts the preliminary results from a project near and dear to his heart: using machine learning to solve federal records challenges at NASA.

As a federal agency, NASA is required to retain certain kinds of official records, including email communications from certain NASA officials, and transmit them to the National Archives and Records Administration (NARA). Non-records, like personal emails, however, do not need to be transmitted. So the challenge is: How do you take terabytes of email data and automatically sort them into records and non-records?

Enter the NASA data science team!

The team’s initial approach involved a Bernoulli naive Bayes classifier that predicted whether an email was business related based on the presence or absence of specific terms. This approach turned out to be a good method for identifying spam emails, but not for sorting records from non-records.
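For readers who want to see what such a classifier looks like in practice, here is a minimal sketch using scikit-learn. The toy emails and labels are hypothetical stand-ins, not NASA data, and the details are assumptions rather than the team's actual code.

    # A minimal Bernoulli naive Bayes text classifier (illustrative sketch).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import BernoulliNB

    # Hypothetical training data: 1 = business-related, 0 = personal.
    emails = [
        "Please review the attached contract before Friday's meeting.",
        "Happy birthday! See you at the party this weekend.",
        "The budget report is ready for your signature.",
        "Don't forget to pick up milk on the way home.",
    ]
    labels = [1, 0, 1, 0]

    # binary=True records only the presence/absence of each term,
    # which is what the Bernoulli event model expects.
    vectorizer = CountVectorizer(binary=True)
    X = vectorizer.fit_transform(emails)

    model = BernoulliNB()
    model.fit(X, labels)

    # Likely classified as business-related, given the overlapping terms.
    new_email = ["Attached is the signed contract and budget report."]
    print(model.predict(vectorizer.transform(new_email)))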

The revised approach applied a random forest classifier to a CSV file containing data about the presence or absence of given terms, building decision trees that ultimately designated emails as records by asking a series of yes/no questions.
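A rough sketch of that step might look like the following; the file name, column names, and hyperparameters are assumptions for illustration, not details the team shared.

    # Random forest over binary term-presence features from a CSV (sketch).
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("email_features.csv")  # one row per email (hypothetical file)
    X = df.drop(columns=["is_record"])      # 0/1 columns: presence of each term
    y = df["is_record"]                     # 1 = record, 0 = non-record

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )

    # Each tree asks a series of yes/no questions ("does the email
    # contain term X?"), and the forest votes on the final label.
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X_train, y_train)
    print("held-out accuracy:", forest.score(X_test, y_test))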

The model was then trained and tested on about 10,000 of the Enron emails, one of the few large collections of business email that is also publicly available. Then the team was ready to apply the model to NASA email. A subset of about 10,000 NASA emails was manually classified over the course of a week for training and testing the model, and the team wrote code to extract certain features from the emails.
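For a sense of what that feature-extraction code might look like, here is a hypothetical sketch using Python's standard email module; the indicator terms are invented for illustration, since the team's actual feature set wasn't described.

    # Hypothetical feature extraction from a raw email (sketch).
    from email import message_from_string

    INDICATOR_TERMS = ["contract", "budget", "meeting", "schedule", "report"]

    def extract_features(raw_email: str) -> dict:
        msg = message_from_string(raw_email)
        # Simplification: only read the body of single-part messages.
        body = msg.get_payload() if not msg.is_multipart() else ""
        text = (msg.get("Subject", "") + " " + str(body)).lower()
        # One 0/1 feature per indicator term, plus an attachment flag.
        features = {term: int(term in text) for term in INDICATOR_TERMS}
        features["has_attachment"] = int(msg.is_multipart())
        return features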

Roughly 75 percent of the manually examined emails were considered records of agency business, erring on the conservative side of over-classifying. The random forest model achieved a weighted, cross-validated average accuracy of about 92 percent—which is about as good as a human can do manually—and a relatively low false positive rate of 0.083. Of the 10,000 emails, about 1 percent were encrypted, and some contained multiple attachments, plaintext passwords, or other privileged information outside the scope of the project that can't be transmitted to NARA.
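A cross-validated accuracy like that might be computed along these lines, reusing the forest and features from the sketch above; balanced accuracy is one plausible reading of "weighted" here, not necessarily the metric the team used.

    # Cross-validated evaluation (sketch; reuses forest, X, y from above).
    from sklearn.model_selection import cross_val_score

    scores = cross_val_score(forest, X, y, cv=5, scoring="balanced_accuracy")
    print("cross-validated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))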

To improve the model in the future, the team would need more email to generalize the model, which right now is fairly NASA-specific. The ultimate objective is to incorporate the model into a full-stack prototype that other agencies can also use to classify their email.

About Veronica

Ronnie has been enthusiastically showcasing NASA data as a member of NASA's Open Data team since 2013. She supports NASA's open source efforts by helping to curate and administer datasets on NASA's Open Data Portal and Open Source Code Catalog, managing citizen and internal requests for NASA data, contributing to the Space Data Daily Open NASA blog, teaching Datanauts courses, and coordinating logistics and data support for the International Space Apps Challenge hackathons.