Processing big data with GeoWave (part 1)

What happens when Big Data repositories become too big to analyze? Collecting billions of observations is useless if it takes weeks to search through them or to simply visualize the information. More and more, customers depend on accurately processed information to effectively do their jobs often from billions of heterogeneous observations being delivered to them in real-time.

These are particular issues for geospatial big data and what GeoWave is designed to solve. Development of GeoWave began at the National Geospatial-Intelligence Agency (NGA). GeoWave was open sourced on June 9, 2014 under the Apache 2.0 License and is under active development by DigitalGlobe developers on GitHub.  At its core, GeoWave is a software library that connects the scalability of distributed computing frameworks and key-value stores with modern geospatial software to store, retrieve and analyze massive geospatial datasets. GeoWave takes multidimensional data, such as spatial or spatial-temporal, and indexes it into a key-value store such as Apache Accumulo or Apache HBase. These distributed storage technologies, in addition to complementary distributing processing frameworks such as Apache Hadoop and Apache Spark, have proven capabilities to unlock the potential of massive datasets across a variety of domains.

However, the complexities of geospatial data carry implicit challenges that must be overcome when working in a distributed environment. One of these concerns is that you can lose locality (“like” values being stored close together) of the spatial and temporal aspects of your data. This will ruin your future performance and defeat the entire purpose of using these technologies. GeoWave overcomes this by storing data in a way that ensures values close in multidimensional space are still close in the single dimensional keys of the datastore and consequently stored physically close together in a clustered environment (i.e., similar latitudes are close together, similar times are close together.). This indexing allows an analyst to quickly search through and process only the needed pieces of data. Instead of having to process billions of geospatial records spread throughout a country to gain insights about one city, you can draw a polygon around that city and GeoWave will only use the part of the data that is relevant for that area. Now, searches that would have taken hours can be done in seconds, and large datasets can be processed and analyzed in a fraction of the time.

Figure 1: Heatmap of 1.3 billion taxi rides taken in New York City
Figure 1: Heatmap of 1.3 billion taxi rides taken in New York City
(Data provided by the NYC Taxi & Limousine Commission)

When you’re ready to render all of the data you’ve collected, you will run into another common choke point. If you are trying to visualize millions or billions of data points you will be putting a massive load on your rendering engine and you are going to be waiting a long, long time. GeoWave solves this by effective use of spatial subsampling. Each pixel on a map can only represent a finite amount of data, so GeoWave transforms the pixel space on the map using the underlying datastore to restrict the amount of data rendered onto a single pixel. In the example below, we are using this spatial subsampling to display 52 million GDELT data points. In the world of Big Data this is a relatively small amount of data, but it would still be more than enough to cause headaches for any analyst while he or she waits hours to view the data. Using GeoWave to help visualize data turns this into a process that takes a second. The subsampling is then performed again as you zoom in so that the best possible visualization of the data is maintained.

Figure 2: 52 million GDELT data points mapped in seconds using GeoWave subsampling
Figure 2: 52 million GDELT data points mapped in seconds using GeoWave subsampling
Figure 3: The subsampling is re-performed as the zoom level is changed
Figure 3: The subsampling is re-performed as the zoom level is changed

GeoWave also offers a suite of functionality to help you analyze your dataset, and we do truly mean “your data.” We released GeoWave version 0.9.3 in December with full HBase support, initial Bigtable support and major renovations to our documentation, including a Quickstart Guide for new users. As our customers’ needs continue to grow and change over time we continue to add features and functionality to GeoWave. We look forward to adding support for additional datastores, advancing our analytics, and further usability enhancements in the near future.

GeoWave is currently packaged with ingest formats for many common geospatial data types, and any other data type can be easily added as a plugin. GeoWave is packaged with a set of statistics such as: ranges over an attribute, cardinality of the number of stored items, histograms over the range of values, and much more. Available analytics include kernel density estimation (as seen in Figure 1), nearest neighbors and KMeans analysis using multiple clustering methods. Look for examples of some of these analytics in future GeoWave blog posts.