Recently, the government mandated that federal agencies, such as NASA, improve the public's ability to find and use federal data for artificial intelligence. As a result, the Office of Management and Budget (OMB) requires that public datasets on data.nasa.gov and code.nasa.gov that may be used for A.I. be tagged with specified artificial-intelligence tags (e.g. “usg-artificial-intelligence,” “usg-ai-training-data,” and “usg-ai-testing-data”) to make them more easily discoverable by the public.
However, identifying which public datasets would be suitable for artificial intelligence at scale has proven to be a challenge.
In this post, we summarize a small investigation into whether we could identify machine-learning suitable datasets via programmatic methods applied to dataset metadata from:
data.nasa.gov, an aggregator site for public NASA datasets, and
NTRS, a data site for NASA abstracts, reports, and other publications.
These methods largely did not work. However, we hope that sharing our experiences will help others avoid repeating our approaches. We also hope to encourage other government groups working on similar problems to share their results with us.
There are two main reasons why it is difficult to locate datasets on data.nasa.gov suitable for machine-learning via programmatic methods:
Features of a dataset that might correlate with suitability, such as file type or number of rows, are hard to inspect programmatically on data.nasa.gov because most dataset files are external (not hosted on data.nasa.gov). Many datasets require you to click through to a different data site or fill out a form before you can access the files. The variance in these clicks and forms across the range of data sources makes it impossible to automate access over thousands of datasets.
Dataset description quality in the metadata varies drastically. What metadata does exist tends to focus on what a dataset represents or how it was acquired, not on how the data is represented, and how the data is represented is exactly what you need to know to judge whether a dataset is suitable for machine-learning.
Data.nasa.gov has a mixture of structured data, including files like CSV and HDF files, and unstructured data files like reports. NTRS holds documents approved for public release. There is partial overlap between the two. Some of the NTRS documents also exist on data.nasa.gov.
Our approaches described below focused on working programmatically with the metadata of the structured datasets.
The initial plan was to compile a list of entries whose metadata matched keywords such as “machine learning” and “learn”. Our hypothesis was that at least some datasets suitable for machine-learning would mention this in their descriptions. However, this produced a list of entries that were mostly research papers and project assessments; only about 5% were structured datasets.
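The keyword filter described above can be sketched roughly as follows. The entry records and field names here are illustrative stand-ins, not the exact data.nasa.gov metadata schema:

```python
# Sketch of the phase 1a keyword filter over dataset metadata.
# The sample entries below are made-up stand-ins, not real catalog records.
KEYWORDS = ("machine learning", "learn")

def matches_keywords(entry):
    """True if any keyword appears in the entry's title or description."""
    text = (entry.get("title", "") + " " + entry.get("description", "")).lower()
    return any(kw in text for kw in KEYWORDS)

sample = [
    {"title": "Engine telemetry", "description": "Data used for machine learning."},
    {"title": "Phase II project assessment", "description": "Final report."},
]
hits = [e for e in sample if matches_keywords(e)]
```

Note that a substring match on “learn” also catches “learning,” which is part of why this cast such a wide net over papers and assessments.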
Most of the results from phase 1a were publications, not the structured data files we wanted. In this phase, we refined the string matching to generate a large list of entries with the data formats we wanted. Because file types are not often reported, we continued to match strings against the metadata. From our observations, the entries we wanted to examine tended to contain phrases such as:
The majority of entry titles that included “version” or version strings in the format “Vx.x” or “Vxx” (i.e. V4.1 or V02) were structured datasets.
The majority of entry titles that included a date range in the format “xxxx-xxxx” (i.e. 2007-2016) were structured datasets.
The majority of entry titles that included “Phase” followed by roman numerals (i.e. Phase II) were project assessments, not structured data.
Many of the structured datasets’ descriptions included the words “this dataset” or “this data set.”
Applying these conditions, the updated script produced a much-improved list, with structured datasets comprising roughly 80%-90% of the entries.
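The heuristics above can be expressed as a handful of regular expressions. This is a sketch of the rules, not the exact script we ran:

```python
import re

# Regex versions of the observed title/description heuristics (a sketch).
VERSION_RE = re.compile(r"\bV\d+(?:\.\d+)?\b", re.IGNORECASE)  # V4.1, V02
DATE_RANGE_RE = re.compile(r"\b\d{4}-\d{4}\b")                 # 2007-2016
PHASE_RE = re.compile(r"\bPhase\s+[IVX]+\b")                   # Phase II

def looks_like_structured_dataset(title, description):
    """Apply the heuristics: 'Phase <roman numeral>' suggests a project
    assessment; version strings, date ranges, and 'this dataset' / 'this
    data set' suggest structured data."""
    if PHASE_RE.search(title):
        return False
    if VERSION_RE.search(title) or DATE_RANGE_RE.search(title):
        return True
    desc = description.lower()
    return "this dataset" in desc or "this data set" in desc
```

Here the “Phase” check runs first, so a title matching both patterns is treated as a project assessment; the ordering of the rules is a judgment call.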
When we reviewed a random selection of datasets from this group, we found that the description quality for the structured datasets varied extensively. Some descriptions were very short, with little information of any kind. Almost all of them lacked anything in their “description” metadata field about what could be done with the data; the far more common focus was on how the dataset was produced or what it represented.
Hence, we concluded we couldn't identify datasets suitable for machine-learning from this method.
Previously, a member of the NASA TDD Data Analytics team had run the STI Tagger (a machine-learning model described in this blog post) over all datasets in data.nasa.gov. Specifically, it was run on text pulled from several of the mandatory fields in the data.nasa.gov metadata. The STI Tagger predicts keywords from a standardized list of several thousand keywords. A list was compiled of datasets whose predicted keywords related to machine-learning and artificial intelligence. A visualization of how many times each keyword was predicted is below.
Of these, 5 high-frequency tags were selected:
Real time operation
A list was then compiled of the datasets that had 3 or more of these predicted keywords. Matching on just 1 keyword produced far too many results (17k+), matching on 2 produced about 1.7k, but matching on 3 or more produced a more manageable list of about 70 abstracts.
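The threshold filter can be sketched as below. Only “Real time operation” is one of the actual selected tags; the other tag names here are placeholders, not the real ones:

```python
# Sketch of the keyword-count filter: keep a dataset if its predicted STI
# tags include at least 3 of the 5 selected high-frequency tags.
# "real time operation" is real; "tag-b" through "tag-e" are placeholders.
SELECTED = {"real time operation", "tag-b", "tag-c", "tag-d", "tag-e"}

def count_selected(predicted_tags):
    """Number of a dataset's predicted tags that are in the selected set."""
    return sum(1 for t in predicted_tags if t.lower() in SELECTED)

def keep(predicted_tags, threshold=3):
    """True if the dataset meets the tag-count threshold."""
    return count_selected(predicted_tags) >= threshold
```

Raising the threshold from 1 to 3 is what shrank the list from 17k+ candidates down to roughly 70.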
Unfortunately, after reviewing all of these datasets, we again found metadata that discussed data acquisition or overall context, but never in a way that would let us conclude “this is a dataset that could be used for machine-learning” or “this dataset has been used in machine-learning.”
For phase 2, the plan was to programmatically examine the NASA Scientific and Technical Information (STI) Technical Reports Server (NTRS) instead of data.nasa.gov directly. Our hypothesis was that papers might detail their machine-learning process and explicitly state which datasets they used for training, which could then be traced back to datasets on data.nasa.gov.
A list of abstracts was compiled by matching “machine learning,” “supervised learning,” and “unsupervised learning.” We also tried “label” as a term but found that the vast majority of its matches had no relation to machine-learning. More than 300 abstracts were identified programmatically and reviewed by hand. Only 11 were actually about machine-learning training and explicitly mentioned the datasets used. Of these 11, only 3 explicitly mentioned datasets that could be found on data.nasa.gov:
National Agriculture Imagery Program
Landsat Global Land Survey (GLS)
C-MAPSS Aircraft Engine Simulator Data
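The phase 2 abstract filter amounts to the same kind of term matching, this time over NTRS abstracts. The sample abstracts below are illustrative stand-ins; the real run surfaced 300+ records that were then reviewed by hand:

```python
# Sketch of the phase 2 term filter over NTRS abstracts.
ML_TERMS = ("machine learning", "supervised learning", "unsupervised learning")

def mentions_ml(abstract):
    """True if the abstract mentions any of the machine-learning terms."""
    text = abstract.lower()
    return any(term in text for term in ML_TERMS)

sample_abstracts = [
    "We apply supervised learning to C-MAPSS engine degradation data.",
    "A structural analysis of composite airframes under load.",
]
candidates = [a for a in sample_abstracts if mentions_ml(a)]
```

Even this step only narrows the pool; the hard part, as described above, was the hand review that followed.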
…at least on data catalogs that aggregate other data catalogs
Most datasets in data.nasa.gov lack metadata that describes their structure and format, and such metadata is not mandatory. Determining structure programmatically is effectively impossible because the “download link” field in the metadata often points not to an actual file but to another NASA data site, which then requires human-level browser navigation to reach the actual file.
Datasets’ descriptions are almost always focused on what the data represents or how it was acquired; they almost never mention what the dataset could be used for or has been used for. Additionally, when papers and abstracts do describe machine-learning algorithms and training, they either don’t identify an exact dataset, because the paper describes a method for a type of data rather than something specific, or they don’t describe the dataset in a way that makes it easy to find.
Therefore, one could conclude that:
(1) Data.nasa.gov does not currently have the right metadata to allow for programmatic identification of machine-learning suitable datasets.
(2) We will likely have to source datasets suitable for machine-learning by asking large numbers of people about the datasets they personally know.