GBDX Notebooks and Amazon SageMaker for systematic mining of geospatial data

DigitalGlobe’s 100-plus petabyte archive of high-resolution imagery is a rich source of information about our changing planet. But fully exploring and mining those riches requires an efficient way to manage and analyze all the data. We set out to find a solution.

Our first step to unlock the power of the DigitalGlobe image library was to load our data onto Amazon Web Services (AWS), a compute-friendly environment that manages data efficiently and enables analysis at scale. The culmination of our efforts was the launch of our Geospatial Big Data platform called GBDX, a horizontally scalable compute environment for analyzing satellite imagery. But even with a great compute environment and a growing set of analytical methods and algorithms, truly harnessing our data takes a lot of work. This is where machine learning becomes critical: it lets us analyze vast amounts of data and extract meaningful intelligence quickly and efficiently.

Orchestrating a robust machine learning platform can be challenging, even for a data-centric company like DigitalGlobe. That’s why we turned to Amazon SageMaker, which helped by fluidly packaging training data access, a training service and a model-hosting service. With these powerful services available in the same compute environment as our data, doors opened for moving fast and innovating.

We knew that the linchpin for generating high-quality results from a machine learning initiative would be an investment in solid training data.

To provide a foundation for creating training data, the GBDX team built a new data access pattern for DigitalGlobe imagery called RDA (Raster Data Access). Satellite imagery is heavy data. A single strip of imagery can be 20 GB, and 40 GB after pansharpening. Moving around chunks of data that big can be time-consuming and expensive. To make satellite imagery more consumable, RDA breaks these big strips into small chips of imagery, each carrying the relevant data at a size that can be streamed and used efficiently.
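To make the idea of chipping concrete, here is a minimal sketch of how a large raster can be divided into fixed-size pixel windows. It is an illustration only and not RDA's actual implementation or API: the function name `chip_windows`, the 256-pixel chip size and the example raster dimensions are all assumptions made for this example.

```python
from typing import Iterator, Tuple

# (col_offset, row_offset, width, height) in pixels; a hypothetical layout.
Window = Tuple[int, int, int, int]


def chip_windows(width: int, height: int, chip_size: int = 256) -> Iterator[Window]:
    """Yield pixel windows that tile a width x height raster into chips."""
    if chip_size <= 0:
        raise ValueError("chip_size must be positive")
    for row in range(0, height, chip_size):
        for col in range(0, width, chip_size):
            # Edge chips are clipped so they never extend past the raster.
            yield (
                col,
                row,
                min(chip_size, width - col),
                min(chip_size, height - row),
            )


# Example: a small 1000 x 600 pixel raster (real strips are far larger).
windows = list(chip_windows(1000, 600, chip_size=256))
print(len(windows), windows[-1])
```

Each window can then be requested and streamed on its own, which is the property that makes chips cheaper to move than whole strips.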

Imagery chips are also a great substrate for creating training data. We can dynamically generate small chips of imagery with labeled GeoJSON vectors of the objects we’d like to detect with an inference algorithm. The image below highlights a few examples of satellite imagery training data: docked ships (green), ships underway (blue) and airplanes (red).
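As a rough sketch of what a labeled vector for a chip might look like, the snippet below builds a GeoJSON Feature from a bounding box. The helper name `bbox_to_feature`, the `label` property and the coordinates are hypothetical and chosen only for illustration; the actual schema of our training data may differ.

```python
import json


def bbox_to_feature(min_lon, min_lat, max_lon, max_lat, label):
    """Build a GeoJSON Feature whose polygon outlines one labeled object."""
    # Exterior ring, counterclockwise and closed (first point == last point).
    ring = [
        [min_lon, min_lat],
        [max_lon, min_lat],
        [max_lon, max_lat],
        [min_lon, max_lat],
        [min_lon, min_lat],
    ]
    return {
        "type": "Feature",
        "geometry": {"type": "Polygon", "coordinates": [ring]},
        # "label" is an assumed property name, not a documented schema.
        "properties": {"label": label},
    }


# Illustrative coordinates only; they do not refer to a real detection.
feature = bbox_to_feature(-122.3400, 47.6000, -122.3390, 47.6010, "docked_ship")
collection = {"type": "FeatureCollection", "features": [feature]}
print(json.dumps(collection, indent=2))
```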

A few examples of satellite imagery training data: docked ships (green), ships underway (blue) and airplanes (red)

Unlike standard photographs, satellite imagery requires a fair amount of sophisticated post-processing to be visually appealing and useful for analysis. The need to implement remote sensing techniques like orthorectification, pansharpening and atmospheric compensation can scare off many potential data scientists.

Examples of remote sensing techniques

RDA performs these processing steps on the fly to deliver the specific image product each user needs. We do this by trading storage for compute in AWS: instead of storing every processed product, we compute it on request. From a machine learning perspective, this rocks because we can combine SageMaker and RDA to dynamically fetch imagery into a model training environment. This means we now have access to a far larger and more varied corpus of data to build better models.

We’re excited to be able to leverage SageMaker in the context of a dynamic training data environment. This provides the potential for DigitalGlobe to systematically extract intelligence from our imagery. We love to see virtuous loops in machine learning and now all of those ingredients are in place.

Virtuous loops in machine learning

Using SageMaker’s training and model hosting services, we can programmatically find objects of interest in our imagery and use the verified and validated results to enhance our training data. This means our inference on each new satellite collect gets progressively better over time. We can look at this from an architecture perspective by breaking the process into exploration of imagery, orchestration of training and models, and consumption of results.
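The loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not our production pipeline and not the SageMaker API: `feedback_loop`, `train` and `verify` are hypothetical names, chips are represented by plain numbers, and a nearest-neighbour lookup stands in for a real trained model.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Example = Tuple[float, str]  # (chip stand-in, label)
Model = Callable[[float], str]


def feedback_loop(
    train: Callable[[Sequence[Example]], Model],
    verify: Callable[[float, str], bool],
    labeled: Iterable[Example],
    collects: Iterable[Sequence[float]],
) -> List[Example]:
    """Retrain per collect and fold verified predictions into the training set."""
    training_set = list(labeled)
    for chips in collects:
        model = train(training_set)  # retrain on everything verified so far
        predictions = [(chip, model(chip)) for chip in chips]
        # Only predictions that pass verification become new training data.
        training_set.extend(p for p in predictions if verify(*p))
    return training_set


def train(examples: Sequence[Example]) -> Model:
    """Toy trainer: label a chip with the label of its nearest known example."""
    def model(chip: float) -> str:
        return min(examples, key=lambda ex: abs(ex[0] - chip))[1]
    return model


def verify(chip: float, label: str) -> bool:
    """Stand-in for human review: accept only labels matching a known truth."""
    return label == ("ship" if chip >= 50 else "plane")


seed = [(10, "plane"), (90, "ship")]
result = feedback_loop(train, verify, seed, collects=[[20, 80], [45, 55]])
print(len(result))
```

In this toy run the second collect is labeled with help from examples verified in the first, which is the sense in which the training data improves with each pass.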

Explore-Orchestrate-Consume architecture

The resulting well-tuned models can then help us scrub across our 100-petabyte archive of data to find interesting data and put current results in historical context. We can see this in action in the example below, applying a building detection model created with SageMaker to a current satellite image of Las Vegas. We then replicate the analysis across seven years and 300 images to put the result in context.

Las Vegas example of applying building detection with Amazon SageMaker

Las Vegas example of applying Amazon SageMaker part 2

Since the hosted SageMaker models scale fluidly, we can provide an interactive user experience in GBDX Notebooks for customers looking to perform a variety of object detections and segmentations. And there are many more uses we’ve yet to discover. In what other ways can SageMaker empower our customers to answer new and timely questions with satellite imagery? Our exploration continues.

Be among the first to test-drive GBDX Notebooks today.