Finding the Needle (ML-Suitable Data) in the Haystack (data.nasa.gov)

Summary

Recent government mandates direct federal agencies, including NASA, to improve the public's ability to use federal data for machine-learning and artificial intelligence. NASA's open-innovation websites make a lot of NASA data available to the public, but finding datasets suitable for machine-learning and artificial-intelligence work among the thousands of datasets available on data.nasa.gov is a little like the classic problem of finding a needle in a haystack.

A major factor limiting the public's use of NASA open-data for machine-learning is the time it takes the public to discover datasets appropriate to a given problem, evaluate those datasets for the characteristics needed for machine-learning work, and then compile the data into a form useful for machine-learning. These activities occur before any feature creation or actual machine-learning training and represent what we'll refer to here as the "time-to-start".

You can split time-to-start into two main factors.

1. Is the dataset nicely prepackaged for machine-learning and can an end-user quickly determine that?

2. How easy is it for potential interested parties to discover an appropriate dataset?

Many of NASA's datasets are organized and presented as individual datasets for specific domain users. They aren't necessarily presented as large collections of thousands of datasets in a prepackaged state ideal for machine-learning. Nor do the systems that hold the datasets treat characteristics important to machine-learning tasks as first-level criteria for users to filter on.

How could the open-innovation program, and NASA as a whole, minimize A.I. developers' "time-to-start" and therefore maximize the number of times a NASA dataset is used? 

We discuss a few possible ideas in this blog post. Do you have any ideas? Please add a comment below. We want to hear from you!


Context

Recent Government Mandates Requiring Improvements to the Public's Ability to Use Federal Data For A.I.

Both the H.R.4174 – Foundations for Evidence-Based Policymaking Act of 2018 and the Executive Order on Maintaining American Leadership in Artificial Intelligence have sections that relate to improving how the public makes use of federal code and datasets for Artificial Intelligence. A few relevant mandates or directives that impact NASA open-innovation websites include: 

  1. An OMB (Office of Management and Budget) requirement that datasets on data.nasa.gov and code.nasa.gov be tagged with OMB-specified A.I. tags.
  2. A request for public feedback posted in the Federal Register, “Identifying Priority Access or Quality Improvements for Federal Data and Models for Artificial Intelligence Research and Development (R&D), and Testing.”
  3. The Federal Data Strategy Action Plan, which includes agency requirements and due dates, some of which are specific to open data. By Feb 2020: “…. Specifically, agencies shall improve data and model inventory documentation to enable discovery and usability, and shall prioritize improvements to access and quality of AI data and models based on the AI research community’s user feedback….”

Relevant Characteristics of the NASA Data Universe

There are a few aspects of how NASA supplies data to the public that need to be taken into account when thinking about how to improve public use of NASA data for A.I. 

1. There isn't just one place to find NASA data. There are numerous sites that share public data. Some are very large, others fairly small. Often the smaller sites are run by teams that focus on the unique metadata & search interface needs of their particular users. The larger sites include more generic metadata and search interfaces.

2. According to various legislative and executive mandates, all of these smaller sites are supposed to be harvested into data.nasa.gov, which in turn is harvested by data.gov. 

Many of the datasets mentioned on data.nasa.gov aren't actually stored there. Often the only thing that is stored is the metadata. Sometimes a link to the actual dataset is included within the metadata, but often further navigation is required after following the "download" link. These metadata records are often programmatically uploaded in bulk by admins who control the other NASA open-data sites. 

What do these harvesting relationships mean for trying to improve dataset availability for A.I.?

  1. It is difficult to programmatically interrogate data in bulk (i.e., search by dataset characteristics that correlate with suitability for machine-learning) because the "download" links on data.nasa.gov frequently don't lead directly to a downloadable dataset but rather to another webpage that requires a human to sign in, or at least navigate around a page, before getting to the point of downloading actual data. (A sketch of this kind of bulk metadata interrogation follows this list.)
  2. It is often not possible to ask individual dataset owners for new metadata, as the owners may just be the bulk uploaders, may no longer work on the dataset, or may no longer work at NASA.
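Bulk interrogation usually has to start from the catalog metadata rather than the data itself. As a rough sketch, assuming data.nasa.gov exposes its catalog as a Project Open Data style data.json file (a common convention for federal open-data sites), you could at least count how many records point at a directly downloadable file versus a landing page:

```python
# Sketch: scan catalog metadata for datasets that can actually be fetched directly.
# Assumes a Project Open Data style catalog at this URL; adjust if the endpoint differs.
import requests

CATALOG_URL = "https://data.nasa.gov/data.json"  # assumption: standard federal data.json endpoint

catalog = requests.get(CATALOG_URL, timeout=60).json()

direct, indirect = [], []
for dataset in catalog.get("dataset", []):
    distributions = dataset.get("distribution", [])
    # A record is only machine-friendly if at least one distribution
    # points straight at a downloadable file.
    if any("downloadURL" in d for d in distributions):
        direct.append(dataset.get("title"))
    else:
        indirect.append(dataset.get("title"))

print(f"{len(direct)} datasets expose a direct downloadURL")
print(f"{len(indirect)} datasets only offer landing pages or metadata")
```

Even this only inspects metadata; confirming what the files actually contain would still require navigating the human-oriented download pages described above.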

You can learn more about the NASA data universe and its structure in this presentation created for people looking to find NASA data for the annual SpaceApps Hackathon.

What Is a Machine-Learning Friendly Dataset?

Determining whether a dataset is machine-learning friendly can take up a lot of time for an end-user. 

It is difficult to define all the characteristics that matter for machine-learning, as the answer varies depending on the problem you're trying to solve, the machine-learning task you're executing, and the data types you're working with.

In general, making datasets friendlier for machine-learning often boils down to reducing how much end-users need to inspect or compile the dataset themselves.

The less time required to compile individual datasets into uniformly organized collections for machine-learning or confirm dataset characteristics necessary for machine-learning, the shorter the "time-to-start".

Issues that have made NASA datasets hard to work with for machine-learning: 

  • Unclear what the file types were unless you downloaded, unzipped, and inspected a dataset.
  • Unclear what the file sizes involved were unless you started the download process (a sketch of the manual probing this forces follows this list).
  • Unclear in documentation whether a collection of datasets is uniform enough for machine-learning.
  • No easy option to download a collection of datasets in bulk, or all the parts of a collection within some range.
  • No easy option to know the size, type, variability, file format, etc. of collections unless you downloaded them.
  • Lack of documentation for changes in file formats or folder structure within parts of a larger collection of datasets.
  • Lack of labels in an easily downloadable format tied directly to potential training data.
  • Lack of ability to search based on file types and population sizes.
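Until that information is published up front, an end-user is left probing files by hand before committing to a download. A minimal sketch of that workaround, assuming the dataset exposes plain HTTP links to its files (the URLs below are placeholders), uses HEAD requests to recover sizes and content types:

```python
# Sketch: probe remote files for size and type before downloading anything.
# The URLs are placeholders; substitute real file links from a dataset's metadata.
import requests

candidate_files = [
    "https://example.nasa.gov/archive/part_001.zip",  # hypothetical URLs
    "https://example.nasa.gov/archive/part_002.zip",
]

for url in candidate_files:
    resp = requests.head(url, allow_redirects=True, timeout=30)
    size_mb = int(resp.headers.get("Content-Length", 0)) / 1e6
    ctype = resp.headers.get("Content-Type", "unknown")
    print(f"{url}: {size_mb:.1f} MB, {ctype}")
```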

Potential Suggestions for Improvement

Some areas for improvement that seem relatively common:

Bulk download interfaces built with ML in mind

Many datasets are available as individual per-dataset downloads. For example, a dataset from 2018 and a dataset from 2019 might be separate downloads. This works great for many users, but users who want to do machine-learning typically want a single download, not a thousand separate downloads that they then stitch together later. Additionally, if the data is still being collected, they'll want an easy way for others to programmatically come back and download the exact same dataset in the future, even if the underlying data collection has expanded. For instance, a data scientist in 2020 might want to run new machine-learning code against the same dataset someone downloaded in 2018, which covered only 1992-2018 and didn't include 2019-2020.

Enabling bulk downloads where the end-user can programmatically specify the extent of the download makes a dataset more machine-learning friendly.
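As a sketch of what that could look like from the end-user's side, imagine an endpoint that accepts a dataset identifier and a year range and returns a single archive. The endpoint, dataset name, and parameters below are purely hypothetical, not an existing NASA API:

```python
# Sketch: reproducible bulk download with an explicitly specified extent.
# The endpoint and query parameters are hypothetical illustrations only.
import requests

params = {
    "dataset": "example-telemetry",  # hypothetical dataset identifier
    "start_year": 1992,
    "end_year": 2018,                # pin the extent so a 2020 rerun matches the 2018 pull
    "format": "zip",
}

with requests.get("https://data.example.nasa.gov/bulk", params=params,
                  stream=True, timeout=120) as resp:
    resp.raise_for_status()
    with open("example-telemetry_1992-2018.zip", "wb") as fh:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            fh.write(chunk)
```

Because the extent is spelled out in the request, anyone rerunning it later gets exactly the 1992-2018 slice, no matter how much new data has been collected since.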

Documentation written with ML in mind

Machine-learning end-users want to know not just what data was collected but also the file types, file sizes, any labels, the folder structure, and any potential problems with the uniformity of the data like NaNs, missing data, or changes in file structure. Without clear documentation up front, they will be forced to download and interrogate the dataset to figure all this out. There is always a chance they'll find out, after hours or days of work, that the dataset isn't useful for machine-learning at all or that getting the dataset properly cleaned would take weeks.

Metadata files written with dataset traveling in mind

The same documentation mentioned above in a dataset description on a webpage should also be available in a downloadable metadata file or files. A dataset that is useful for machine-learning will likely travel. It might originally have lived on data.nasa.gov, but the next user might find it in someone's code repository on GitHub. A link back to the original location may not be included with the dataset in its new location. Putting all the relevant metadata in a file (or files) as well increases the chance the metadata will travel with the data.
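A minimal sketch of such a file, written in Python here for concreteness; every field name and value is illustrative rather than an existing NASA or OMB schema:

```python
# Sketch: write an ML-oriented metadata file that travels alongside the data.
# All field names and values are illustrative, not an established schema.
import json

ml_metadata = {
    "title": "Example sensor archive",                    # hypothetical dataset
    "source": "https://data.nasa.gov/dataset/example",    # hypothetical link back to the original record
    "file_format": "csv",
    "file_count": 312,
    "total_size_gb": 4.7,
    "folder_structure": "one folder per year, one file per day",
    "labels": "labels.csv maps each file name to a class label",
    "known_issues": ["missing days in 1999", "column names changed in 2012"],
}

with open("ml_metadata.json", "w") as fh:
    json.dump(ml_metadata, fh, indent=2)
```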

Sharable code for working with specific datasets

How best to load, compile, or clean a dataset is not always straightforward. Shared code for working with a particular dataset can dramatically reduce the time it takes end-users to start doing machine-learning with that dataset.
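As a sketch of the kind of shared helper that saves each new user the same hour of rediscovery, assume a collection of yearly folders of CSV files with a column rename partway through (the file layout and column names here are made up):

```python
# Sketch: shared loader that compiles yearly CSV files into one uniform table.
# The file layout and column names are hypothetical examples.
from pathlib import Path
import pandas as pd

def load_collection(root: str) -> pd.DataFrame:
    frames = []
    for csv_path in sorted(Path(root).glob("*/*.csv")):  # e.g. 2018/jan.csv, 2019/feb.csv
        df = pd.read_csv(csv_path)
        # Paper over a known schema change instead of making every user rediscover it.
        df = df.rename(columns={"temp_f": "temperature_f"})
        df["source_file"] = str(csv_path)
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

# data = load_collection("example_collection/")
```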

Who Makes Datasets More Machine-Learning Friendly?

Dataset Suppliers?

You can't expect all suppliers of potential machine-learning datasets to make their datasets more machine-learning friendly as the dataset supplier may know nothing about machine-learning. Additionally, the dataset might have been supplied years ago. The supplier may no longer work with the dataset or even no longer work at NASA. 

Crowdsourced?

There may be code that the public has written for loading, cleaning, and/or compiling a specific dataset into a more machine-learning-friendly collection. However, it is complicated for NASA to share code the public has written due to rules that restrict NASA from "endorsing" public works.

Open-data Site Admins?

At a minimum, some suggestions of what to include in the documentation and metadata of datasets that are well suited to machine-learning could be shared widely with dataset suppliers. In fact, this is already done for more general data standards. However, that's also not as easy as it might seem due to the variations in ideal characteristics between different types of machine-learning tasks and different data types.

How To Make Datasets Easier To Discover?

Many NASA open-data sites have search interfaces that leverage string matching on dataset titles or categorizations relevant to the domain specialists. Although these features are useful for many users, they don't provide a lot of help in filtering possible datasets to those suitable for machine-learning.  

Two constructs to make it easier to discover datasets suitable for machine-learning are Tags and Lists.

With tags, datasets exist as they currently are on data.nasa.gov, or on their original data-archive site, but additional tags are added to flag them as suitable for machine-learning prediction or training. Tags enable better search results, as end-users can filter results to just the datasets that carry these tags.

This type of approach has recently been started! New OMB guidelines require federal agencies to tag datasets that are suitable for A.I. training. NASA has started to apply these tags, but very few datasets carry them so far.
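Once the tags are applied consistently, filtering on them becomes trivial. The sketch below reuses the data.json catalog idea from earlier; the tag string is a placeholder, since the exact OMB-specified keyword may differ:

```python
# Sketch: filter catalog records by an ML/A.I. keyword tag.
# The tag string is a placeholder; check the catalog for the exact OMB-specified keyword.
import requests

AI_TAG = "machine-learning-ready"  # hypothetical tag value

catalog = requests.get("https://data.nasa.gov/data.json", timeout=60).json()
ai_datasets = [
    d["title"] for d in catalog.get("dataset", [])
    if AI_TAG in [k.lower() for k in d.get("keyword", [])]
]
print(f"{len(ai_datasets)} datasets carry the '{AI_TAG}' tag")
```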

With lists, new compilations of datasets are built around machine-learning tasks, making them easier to find for people less interested in a specific science domain or spacecraft and more interested in datasets suitable for machine-learning. These machine-learning-specific lists could live in a variety of places and be created or edited by a variety of people and tools.

Who Creates the Tags and Lists?

Programmatic Creation?

Theoretically, it might be possible under certain conditions to programmatically describe datasets and then use those characteristics to score their potential for machine-learning. If the score is high enough, a dataset might be tagged as A.I.-suitable or added to a list. Code exists to profile data stored in relatively common file types like JSON and CSV. However, profiling is harder for less common file formats, and harder still with multiple files and folders, where it is often difficult to tell the difference between actual data and accompanying metadata. Additionally, it can be difficult to automate dataset download and interrogation because data-download interfaces are built for humans.
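For the common-format case, a crude profiler is easy to sketch. The thresholds below are arbitrary illustrations, not a vetted rubric, and real criteria would depend on the machine-learning task:

```python
# Sketch: crude profile of a CSV to estimate machine-learning suitability.
# The scoring thresholds are arbitrary illustrations, not a vetted rubric.
import pandas as pd

def profile_csv(path: str) -> dict:
    df = pd.read_csv(path)
    missing_fraction = float(df.isna().mean().mean())
    profile = {
        "rows": len(df),
        "columns": len(df.columns),
        "numeric_columns": int(df.select_dtypes("number").shape[1]),
        "missing_fraction": round(missing_fraction, 3),
    }
    # Toy heuristic: enough rows and mostly complete values.
    profile["looks_ml_friendly"] = profile["rows"] >= 1000 and missing_fraction < 0.2
    return profile

# print(profile_csv("example.csv"))
```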

Dataset Owners?

Asking dataset owners to tag their own datasets as suitable (or not) for machine-learning seems like a natural ask, but there are also complications here. What to do about the thousands of datasets that already exist and whose owners haven't touched them in years? Who will tag them? Additionally, many dataset owners might not know whether their dataset is really suitable for machine-learning.

Crowdsourced?

Asking the public to tag datasets directly quickly becomes awkward as open-data sites are commonly built such that only dataset owners can change anything for that dataset. 

Lists of machine-learning-suitable datasets with links back to the original data sites are easier to crowdsource. Hosting such a list on NASA infrastructure would still be problematic, but there are models that are more social-network based. For example, there are Awesome lists: repositories on GitHub that are just lists of excellent resources on a given topic. Why not one for NASA datasets suitable for machine-learning? It could have sections for topic modeling, speech-to-text, image recognition, object detection, image segmentation, etc.

Accessibility & Contests

Closely related to discoverability is accessibility. This comes into play the most with machine-learning contests. Some datasets have been compiled with machine-learning in mind specifically so they could be used for contests. Although contests are great for getting a lot of engagement over the short term, they're lousy for discoverability and accessibility over the long term.

Sign-up pages, and especially sign-up pages that require waiting for a human to give approval, end up being roadblocks to data access. Currently, contest datasets tend to exist only on the contest webpage. The original separate datasets may exist on data.nasa.gov or another open-data site, but there has been a poor record of actually moving these machine-learning-ready datasets back to an open-data site.

What Are Your Ideas?

Do you have any ideas for how to help people discover datasets that are applicable to artificial intelligence?

Many complexities were listed above. Do you have experiences surmounting them that you could share from your own work?

Is there a dataset you know NASA has, but hasn't yet made public, that would be great for machine-learning training?

What would enable you to better use NASA data for machine-learning projects? 

Please add your comments below!