Data Visualization of data.nasa.gov

Executive Summary


Data.nasa.gov is home to tens of thousands of different datasets available to the public. 

Problem

Members of the public looking for data for the SpaceApps hackathon have told NASA staff that it is difficult to know what kinds of data exist on data.nasa.gov and that the sheer volume of data is overwhelming. This lack of context limits their ability to use the search bar effectively to find datasets that fit their needs.

Solution

To give the public a better understanding of the types of datasets available and where they come from, a dedicated data visualization page has been created. It shows, in aggregate, what is on data.nasa.gov in terms of the most common sources, categories, and keywords, so users know what kind of data they can expect to find.

Potential Reusability for Other Federal Agencies

The data for this visualization is extracted from the data.json file at https://data.nasa.gov/data.json. Because this JSON file follows the same schema used across all federal agencies for the public data catalogs that feed into https://data.gov, any other government agency could potentially use the same code to generate a visualization page for its own public data catalog.
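As a rough illustration of why the shared schema makes the code reusable, the sketch below tallies sources, categories, and keywords from a catalog shaped like a federal data.json file. It assumes that "source" corresponds to the schema's publisher name and "category" to its theme field; that mapping is an assumption made for this example, not something confirmed by the NASA code. The sample catalog is made up.

```python
from collections import Counter

def summarize_catalog(catalog):
    """Count sources, categories, and keywords in a data.json-style catalog."""
    sources, categories, keywords = Counter(), Counter(), Counter()
    for dataset in catalog.get("dataset", []):
        # Assumption: "source" maps to publisher.name, "category" to theme.
        publisher = dataset.get("publisher") or {}
        sources[publisher.get("name", "Unknown")] += 1
        categories.update(dataset.get("theme") or [])
        keywords.update(dataset.get("keyword") or [])
    return sources, categories, keywords

# Made-up sample standing in for a downloaded data.json.
sample_catalog = {
    "dataset": [
        {"title": "A", "publisher": {"name": "Source One"},
         "theme": ["Earth Science"], "keyword": ["ocean", "climate"]},
        {"title": "B", "publisher": {"name": "Source One"},
         "keyword": ["ocean"]},
    ]
}

sources, categories, keywords = summarize_catalog(sample_catalog)
print(sources.most_common(3))
print(keywords.most_common(3))
```

For a real catalog, the dictionary would come from loading the agency's data.json with json.load instead of the inline sample.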

Work Done on Data.nasa.gov

The data.nasa.gov visualization page contains a treemap that displays an aggregate view of the contents of the data.nasa.gov data catalog. The data is represented by rectangles scaled by the number of datasets. Each rectangle reflects a unique combination of source, category, and keyword.


Holding your cursor over a rectangle brings up a hover box with its source, category, and keyword. Clicking a rectangle takes the user to the data.nasa.gov data catalog search results for that combination of source, category, and keyword.


Users may also click the legend keys (which display either sources or categories, depending on the grouping order the user has selected) to go to the corresponding data catalog search results.
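The click-through behavior amounts to turning a source, category, and keyword into a search URL. The sketch below shows one way that could look. The base URL and the "q" query parameter are hypothetical placeholders; the real search URL format for data.nasa.gov (or any other agency's site) is not given here and would need to be substituted.

```python
from urllib.parse import urlencode

# Hypothetical search endpoint; replace with the agency's real search page.
SEARCH_BASE_URL = "https://data.example.gov/search"

def search_url(source=None, category=None, keyword=None):
    """Build a search link from whichever of the three terms are provided."""
    terms = [term for term in (source, category, keyword) if term]
    # Assumption: the site accepts a single free-text "q" parameter.
    return SEARCH_BASE_URL + "?" + urlencode({"q": " ".join(terms)})

# A rectangle click would pass all three terms; a legend click only one.
rectangle_link = search_url("Source One", "Earth Science", "ocean")
legend_link = search_url(category="Earth Science")
print(rectangle_link)
print(legend_link)
```

Keeping URL construction in one function means only that function has to change if the search page uses different parameters.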

How it Works

The first step in creating the visualization is to process your data into a specific format, shown below, required by D3.js (the library used to create the visualization).

This is done by running data_processing.py, a Python script, on data.json, which should take no more than a few seconds. The script produces “processed_data.json”, which can be used for many types of visualizations besides the treemap.
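To make the processing step concrete, here is a minimal sketch of turning catalog entries into a nested structure. It is not the actual data_processing.py. It assumes the name/children/value nesting that D3's hierarchy layouts conventionally read, and it assumes that source maps to the publisher name and category to the theme field. The function name build_hierarchy and the sample records are hypothetical.

```python
import json
from collections import Counter

def build_hierarchy(datasets):
    """Nest dataset counts as source > category > keyword."""
    counts = Counter()
    for ds in datasets:
        # Assumption: source = publisher.name, category = theme.
        source = (ds.get("publisher") or {}).get("name", "Unknown")
        for category in ds.get("theme") or ["Uncategorized"]:
            for keyword in ds.get("keyword") or []:
                counts[(source, category, keyword)] += 1

    # Regroup the flat counts into nested dictionaries.
    nested = {}
    for (source, category, keyword), n in counts.items():
        nested.setdefault(source, {}).setdefault(category, {})[keyword] = n

    # Emit the assumed name/children/value layout.
    return {
        "name": "root",
        "children": [
            {
                "name": source,
                "children": [
                    {
                        "name": category,
                        "children": [
                            {"name": kw, "value": n} for kw, n in kws.items()
                        ],
                    }
                    for category, kws in categories.items()
                ],
            }
            for source, categories in nested.items()
        ],
    }

# Made-up records in the shape of data.json "dataset" entries.
sample_datasets = [
    {"publisher": {"name": "Source One"}, "theme": ["Earth Science"],
     "keyword": ["ocean"]},
    {"publisher": {"name": "Source One"}, "theme": ["Earth Science"],
     "keyword": ["ocean", "ice"]},
]
hierarchy = build_hierarchy(sample_datasets)
print(json.dumps(hierarchy, indent=2))
```

Writing the result out with json.dump would give a file playing the role of processed_data.json, with each leaf's value being the number of datasets for that source, category, and keyword combination.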


We pre-process the data and save it as a new file, rather than transforming it live on the website, because the original NASA data.json is large enough that downloading it takes several minutes on most internet connections. We don’t want the website to hang while this large file downloads. Because the data is pre-processed into the smaller “processed_data.json”, the visualization loads quickly. If another agency’s data.json is small, that agency might be able to do this step on the fly with JavaScript.


Visualizations that use this very same data structure include:

Depending on the distribution of sources, categories, and keywords in the data catalog, one of the visualization choices above might show the distribution of datasets to an end user better than a treemap. For example, USDA’s data.json is better visualized as a collapsible tree than as a treemap, because so many of its datasets come from a single source.
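One quick way to judge whether a catalog is dominated by a single source is to compute the share of datasets that come from its largest source. The sketch below does only that; it does not prescribe a cutoff, and it again assumes that source corresponds to the publisher name. The function name and sample records are hypothetical.

```python
from collections import Counter

def top_source_share(datasets):
    """Return the most common source and its fraction of all datasets."""
    counts = Counter(
        (ds.get("publisher") or {}).get("name", "Unknown") for ds in datasets
    )
    if not counts:
        return None, 0.0
    name, n = counts.most_common(1)[0]
    return name, n / sum(counts.values())

# Made-up catalog where one source dominates.
sample = (
    [{"publisher": {"name": "Source A"}}] * 3
    + [{"publisher": {"name": "Source B"}}]
)
top_name, top_share = top_source_share(sample)
print(top_name, top_share)
```

A share close to 1 suggests a treemap grouped by source would be one large block, which is the situation described for USDA above.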

How Other Government Agencies May Use this Code


Before other agencies can use this code, there are a few things that must be changed:

  1. Update the URL where the initial data.json resides here.

  2. Update keyword_count_threshold, which sets the minimum count a keyword must reach to be included in the final processed data.

  3. Update ignoreData.json with any sources, categories or keywords you wish to leave out of the final processed data. An example can be found here.

  4. Update duplicates.json with any duplicate sources you wish to be grouped together in the final processed data. An example can be found here.

  5. Update acronym.json with acronyms you wish to be expanded for the purpose of displaying them in the treemap legend. Each acronym must have a type (either source or category), and a name (the acronym's expansion). An example can be found here.

  6. Update the URL users get sent to when clicking on the treemap rectangles here.

  7. Update the URL users get sent to when clicking on the legend keys here.

  8. Consider whether the automatically generated search functionality will work for your data.<agency>.gov site. If not, remove it or adjust it to fit your data site's search functionality.
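The sketch below illustrates how the settings in steps 2 through 5 might interact during processing. It is not taken from data_processing.py. The layouts assumed for ignoreData.json, duplicates.json, and acronym.json are guesses based only on the descriptions above (the example files linked in those steps show the real layouts), and applying the threshold to each source, category, and keyword combination is likewise an assumption. All counts are illustrative.

```python
def clean_counts(counts, keyword_count_threshold, ignore, duplicates):
    """Merge duplicate sources, drop ignored entries, apply the threshold."""
    merged = {}
    for (source, category, keyword), n in counts.items():
        # Assumed duplicates layout: {"alternate name": "canonical name"}.
        source = duplicates.get(source, source)
        # Assumed ignore layout: lists under sources/categories/keywords.
        if (source in ignore.get("sources", [])
                or category in ignore.get("categories", [])
                or keyword in ignore.get("keywords", [])):
            continue
        key = (source, category, keyword)
        merged[key] = merged.get(key, 0) + n
    # Assumption: the threshold applies to each combination's merged count.
    return {k: n for k, n in merged.items() if n >= keyword_count_threshold}

def expand_acronym(label, kind, acronyms):
    """Expand a legend label if an acronym entry of the right type exists."""
    entry = acronyms.get(label)
    if entry and entry.get("type") == kind:
        return entry["name"]
    return label

# Illustrative inputs; real values would be loaded from the JSON files.
counts = {
    ("NASA GSFC", "Earth Science", "ocean"): 3,
    ("GSFC", "Earth Science", "ocean"): 2,
    ("GSFC", "Earth Science", "rare"): 1,
    ("GSFC", "Test", "ocean"): 9,
}
duplicates = {"NASA GSFC": "GSFC"}
ignore = {"sources": [], "categories": ["Test"], "keywords": []}
acronyms = {"GSFC": {"type": "source", "name": "Goddard Space Flight Center"}}

cleaned = clean_counts(counts, 2, ignore, duplicates)
legend_label = expand_acronym("GSFC", "source", acronyms)
print(cleaned)
print(legend_label)
```

With these inputs, the two spellings of the same source are merged, the ignored category is dropped, and the combination seen only once falls below the threshold of 2.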


Where is the Code?

The code for this visualization is built into the front page of data.nasa.gov, so you can get it from the GitHub repository: https://github.com/nasa/data-nasa-gov-frontpage . Please let us know if you find it useful by leaving an issue comment on the repository.