This week featured two awesome watercooler chats: an intro to natural language processing (NLP) with NASA’s Anthony Buonomo and the continuation of the series all about R with NASA’s David Meza.
Anthony started us off with an overview of natural language processing that will set the stage for future watercooler chats on using Python for NLP and using NLP to solve data science problems, both with NASA’s Yulan Lin.
Natural language processing is a field concerned with enabling machines to better “understand” human language.
“Considering how often humans misunderstand each other, it’s even harder for machines,” says Anthony. Unlike humans, machines lack the context and linguistic experience needed to interpret language.
Language is different from other machine learning tasks because it involves a non-physical, arbitrary signifier (a word) mapping to a signified idea or thing. Also, language consists of many forms: vocal, gestural, and written.
So how do you mathematically express language?
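One way is through word vectors: each term is represented as a list of numbers, so terms can be plotted as points (for example, on a Cartesian plane) and the relationships between them can be seen and measured. The vector map draws on a user-specified schema that helps quantify these relationships.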
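To make the idea concrete, here is a minimal sketch in Python. The two-dimensional vectors are invented for illustration (they did not come from Anthony's chat or from any trained model), and the names `WORD_VECTORS`, `cosine_similarity`, and `most_similar` are hypothetical. The sketch uses cosine similarity, one common way to quantify how closely two word vectors point in the same direction.

```python
import math

# Hypothetical 2-D word vectors, hand-picked for illustration only.
# Real models learn vectors with many more dimensions from large text corpora.
WORD_VECTORS = {
    "rocket": (0.9, 0.1),
    "spacecraft": (0.8, 0.2),
    "banana": (0.1, 0.9),
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors=WORD_VECTORS):
    """Return the other word whose vector points in the closest direction."""
    target = vectors[word]
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine_similarity(target, vectors[w]))


print(most_similar("rocket"))  # prints "spacecraft"
```

With these toy numbers, "rocket" and "spacecraft" sit close together on the plane while "banana" points in a different direction, which is the kind of relationship a word-vector map is meant to capture.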
Anthony demonstrated displaCy, a dependency visualizer built on spaCy, a Python library. Given a sentence entered by the user, it guesses at the syntactic structure (think 7th grade sentence diagramming) and draws it.
The second watercooler chat was a continuation of David Meza’s R series. Following up on last week’s overview of RStudio, this chat covered the capabilities of ggplot2, a plotting system for R that implements the grammar of graphics, a coherent system for describing and building graphs. The package was created by Hadley Wickham, the Chief Scientist at RStudio.
Most of David’s demonstration was derived from R for Data Science by Garrett Grolemund and Hadley Wickham. David loaded Wickham's tidyverse package, which includes ggplot2.
Many of the RStudio packages come pre-loaded with datasets that you can use to practice data analysis techniques. We looked at one of those datasets, the mpg data frame, which provides data to answer the question: What does the relationship between engine size and fuel efficiency look like? The ggplot2 tool allows you to create basic visualizations that you can then manipulate to imbue them with more meaning for the end user.
By following the ggplot2 function template (a ggplot() call that names the data, plus a geom function with an aes() mapping), you can enhance the plot with geometric functions and “aesthetics” that dictate the color, shape, size, and transparency of the plotted data points. You can also layer these effects, but be careful not to introduce visual noise to the plot that might obscure the data.
Help files in each of the packages show which arguments the ggplot functions accept, the order in which they need to be written, and which aesthetics can be changed. The help files also contain examples of aesthetics on sample visualizations, so that you can preview how a given aesthetic will affect your output.
All of this and more will be captured in a tutorial that David will upload to his GitHub—so keep an eye out for that. David’s next R-related chat is scheduled for next week and will focus on machine learning basics.
Ronnie has been enthusiastically showcasing NASA data as a member of NASA's Open Data team since 2013. She supports NASA's open source efforts by helping to curate and administrate datasets on NASA's Open Data Portal and Open Source Code Catalog, managing citizen and internal requests for NASA data, contributing to the Space Data Daily Open NASA blog, teaching Datanauts courses, and coordinating logistics and data support for the International Space Apps Challenge hackathons.