The first was an overview of graph databases given by Datanaut Karen Lopez (2016 Class).

“Graphs are everywhere,” says Karen, a data architect and evangelist who speaks and writes often on the topic of applying data management principles.

“A lot of data lends itself to graphing—even hierarchies. Humans want to believe that hierarchies exist and that the world is structured, but it’s not; it’s really unstructured. Graph databases are designed to depict the unstructured nature of life.” 

In her overview, Karen walked through types of graphs such as networks (think Kevin Bacon and degrees of separation) and common architectures (e.g., Neo4j, triplestores, IBM’s graph database, Titan).

To get your feet wet with graph databases, Karen recommends picking up Seven Databases in Seven Weeks by Eric Redmond and Jim R. Wilson. The book goes over PostgreSQL, Riak, HBase, MongoDB, Neo4j, CouchDB, and Redis.

When making design decisions about your data architecture, “every design decision should include cost, benefit, and risk,” says Karen. “Bringing those together means a fit to the data problem you’re trying to solve.” 

Graph databases address a limitation of relational databases, which, despite their name (“relational” here refers to the mathematical definition of a relation), are not well designed for querying relationships between data. For decades, “we’ve started with relational [databases] and don’t do much else, and we try to wedge every data challenge into a relational database. But we don’t have to do that anymore,” says Karen.

Graph databases work best when:

·      The data are in highly recursive structures, such as hierarchies and networks, that connect data to one another.

·      There are reference data that are shared but don’t involve a lot of transactional back-and-forth, as in master data management (https://en.wikipedia.org/wiki/Master_data_management).

·      There are virtualization, clouds, or layers of data processing, as in networks and IT operations.

·      The data connect the dots of otherwise unrelated interests, like the real-time recommendation engines on Amazon that show shoppers what other people who bought the item they are viewing have also bought.
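To make the recommendation case concrete, here is a minimal sketch in plain Python of the kind of "customers who bought this also bought" walk that a graph database is built to run. It was not part of Karen's talk: the purchase data, the function names, and the adjacency-dictionary representation are all assumptions made for illustration, and a real graph database would express the same traversal in its own query language.

```python
from collections import Counter

# Hypothetical purchase data: each customer maps to the set of items they bought.
purchases = {
    "ana": {"telescope", "star chart", "red flashlight"},
    "ben": {"telescope", "star chart"},
    "cy": {"telescope", "tripod"},
    "dee": {"tripod", "camera"},
}


def build_graph(purchases):
    """Build an undirected customer-item graph as an adjacency dictionary."""
    graph = {}
    for customer, items in purchases.items():
        for item in items:
            graph.setdefault(("customer", customer), set()).add(("item", item))
            graph.setdefault(("item", item), set()).add(("customer", customer))
    return graph


def also_bought(graph, item):
    """Walk item -> customers -> other items, counting how often each item appears."""
    start = ("item", item)
    counts = Counter()
    for customer in graph.get(start, ()):
        for other in graph[customer]:
            if other != start:
                counts[other[1]] += 1
    return counts


graph = build_graph(purchases)
recommendations = also_bought(graph, "telescope")

# Print the co-purchased items, most frequent first.
for name, count in recommendations.most_common():
    print(name, count)
```

The walk only ever follows edges out of the starting item, so its cost depends on how connected that item is rather than on the size of the whole dataset. In a relational database the same question would typically mean joining a purchases table to itself.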

For our second chat, former Datanauts community manager and new Datanaut Elyssa Dole shared her nearly infinite wisdom on organizing community events. Elyssa developed the toolkit that Datanauts leverage when organizing their events. Her chat walked through the various pathways a Datanaut can take when hosting an event and what makes a Datanauts event successful.

Datanauts who have hosted their own events shared their experiences as well. They all stressed starting small, with a group of friends or family, and trying to reach people who don’t normally work with data.

“That’s one of the things with Women in Data,” says Beth Beck, NASA's Open Innovation Program Manager and the founder of the Datanauts program, referring to NASA’s initiative to encourage women to participate in data science (more on that here). Hosting community events reaches “the women who wouldn’t normally see [data]. We want to model them into the data science world.”

Datanaut Cindy Chin (2016 Class), who has hosted three Datanauts events so far ranging from informal to formal, also recommended starting small. “Play towards your strengths and also towards your network and your interests,” says Cindy.

Cindy has spent the last few years traveling to international events and speaking about data, data science, and using space as a common platform for reaching people at all levels of an organization. “That’s my world,” Cindy says. “Find out where your world is and start with [your] level of comfort.”

Continuing on with a series of chats about natural language processing, NASA’s Yulan Lin picked up where Anthony Buonomo left off last week with a chat about text classification.

Yulan, a data scientist with NASA’s Data Analytics Lab, started with a conceptual introduction to supervised and unsupervised text classification, likening text classification to sorting waste into recycling and trash categories (except, of course, that you’re sorting text rather than waste).

For the presentation, Yulan used scikit-learn, a machine learning library with a ton of datasets and tutorials.

For supervised classification, the user defines for the computer the characteristics of each class, typically by supplying labeled examples. For example, continuing with the waste metaphor, you would tell the computer that paper or plastic products are recycling and food waste is trash. Once the computer knows how to sort, the user can hand the computer a new piece of waste and have the computer sort it out.

One method of supervised text classification is Naïve Bayes, a family of algorithms based on the assumption that the characteristics that describe an item (for text, the words or terms in a document) are independent of one another. For example, a soccer ball is both “round” and “leather,” but these are treated as independent characteristics when classifying a given object as a soccer ball. For a computer to be able to sort text, you must first convert that text to numbers. One way to do that is to create a CountVectorizer object and call its fit_transform method on a set of raw text documents. After some additional processing, you can get an idea of how often a term or word is used, both relative to other terms or words in a set of documents and in an absolute sense.
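As a rough illustration of that workflow, the sketch below turns a few labeled documents into word counts with scikit-learn's CountVectorizer and trains a Naïve Bayes classifier on them. This is not the code from Yulan's presentation: the toy documents and labels are invented to match the waste metaphor, and the choice of MultinomialNB (one of several Naïve Bayes variants in scikit-learn) is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labeled examples, following the waste metaphor.
train_docs = [
    "paper newspaper cardboard box",
    "plastic bottle paper cup",
    "banana peel food scraps",
    "leftover food coffee grounds",
]
train_labels = ["recycling", "recycling", "trash", "trash"]

# Convert raw text to a document-by-term matrix of word counts.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Absolute frequency of one term across all training documents.
paper_column = vectorizer.vocabulary_["paper"]
paper_count = int(X_train.toarray()[:, paper_column].sum())

# Train the classifier on the counts and their labels.
classifier = MultinomialNB()
classifier.fit(X_train, train_labels)

# New documents must be vectorized with the vocabulary learned above.
new_docs = ["cardboard box and plastic bottle", "food scraps and banana peel"]
X_new = vectorizer.transform(new_docs)
predictions = classifier.predict(X_new)

print("Occurrences of 'paper':", paper_count)
for doc, label in zip(new_docs, predictions):
    print(f"{doc!r} -> {label}")
```

Note that the new documents go through transform rather than fit_transform, so words the classifier never saw during training (such as "and" here) are simply ignored.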

By contrast, with unsupervised classification, you might know how much waste you have but nothing about the waste itself. When you’re given a piece of waste to examine, you might make guesses about what that piece represents.

One method for unsupervised text classification is latent Dirichlet allocation (LDA), which also involves converting text to numbers but instructs the computer to sort objects (like documents) into a given number of classes, leaving it up to the computer to figure out what the classes are. Once the computer has sorted a given set of documents, you then have to manually evaluate or inspect its results to score its accuracy. In this case, it’s customary to apply the unsupervised classification technique to about 80 percent of a dataset and, once the computer is trained to sort accurately, to introduce the remaining 20 percent.
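Here is a small sketch of what that could look like with scikit-learn's LatentDirichletAllocation. It is an illustration under stated assumptions rather than the code from the chat: the five toy documents, the choice of two topics, and the fixed random_state are all made up, and the 80/20 split is done by simply holding out the last document.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents: the first four are used for fitting, the last is held out.
docs = [
    "rocket launch orbit satellite rocket",
    "satellite orbit launch telemetry",
    "recipe flour sugar oven recipe",
    "oven sugar butter flour",
    "rocket telemetry satellite",
]
train_docs, held_out_docs = docs[:4], docs[4:]

# Convert text to word counts, learning the vocabulary from the training documents only.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_held_out = vectorizer.transform(held_out_docs)

# Ask for a fixed number of topics; the model decides what each topic contains.
n_topics = 2
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
train_topics = lda.fit_transform(X_train)     # one topic mixture per training document
held_out_topics = lda.transform(X_held_out)   # topic mixture for the unseen document

# List the three highest-weighted terms per topic for manual inspection.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
top_terms = []
for weights in lda.components_:
    top_indices = weights.argsort()[::-1][:3]
    top_terms.append([terms[i] for i in top_indices])

for topic_number, words in enumerate(top_terms):
    print(f"Topic {topic_number}: {', '.join(words)}")
print("Held-out document topic mix:", held_out_topics[0].round(2))
```

The topics come back as numbered groups of weighted words with no names attached, which is why the manual inspection step described above is needed. On a corpus this small the groupings may not be meaningful.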

Yulan will be back again with a follow-up chat next week about using NLP to solve data science problems. You can also find her on Twitter @Y3l2n.

About Veronica

Ronnie has been enthusiastically showcasing NASA data as a member of NASA's Open Data team since 2013. She supports NASA's open source efforts by helping to curate and administer datasets on NASA's Open Data Portal and Open Source Code Catalog, managing citizen and internal requests for NASA data, contributing to the Space Data Daily Open NASA blog, teaching Datanauts courses, and coordinating logistics and data support for the International Space Apps Challenge hackathons.