The product of an efficient database and a little SQL
One of the most prominent maxims amongst the decision makers of today is to be data-driven. To take a strategic decision of significant impact in today’s world without the support of evidence from data is a risky endeavour. To be data-driven, however, is not just a modern phenomenon; throughout history, decision makers the world over have used evidence from data to base their actions upon.
So why has the phrase ‘data-driven’ become so prominent?
It is perhaps that only in the past couple of decades, with advances in the internet and technology, from satellites and drones to cloud computing, has the supply of and ability to gather data taken off and been growing at such an exponential rate. It is difficult to estimate how much data the world generates on a typical day, however in 2019 Raconteur produced a helpful infographic indicating that 463EB of data is expected to be produced each day by 2025.
With the coming to market of autonomous vehicle technologies, and advances in communications technology, the realm of geospatial data has perhaps seen some of the most prolific growth amongst data types.
What is the value of all this data? The value of data is not in the data itself, but in the unlocking of information and insights from that data.
In the words of Carly Fiorina, former Chief Executive Officer at Hewlett Packard:
The goal is to turn data into information and information into insight.
The efficient elicitation of insights from a dataset, or combination of datasets, is key to bringing value to that data, and thus realising benefit for business and society. As the amount of data increases, accessing and processing that data in order to gain insights becomes increasingly computationally expensive.
Key to unlocking value from today’s very large spatial datasets is intelligent indexing of the data in order to maximise the efficiency of data retrieval. At GeoSpock, we have developed a geospatial database, GeoSpock DB, hosted on AWS, and tailored around the efficient retrieval of geospatial data. To query datasets, we can use standard (ANSI) Presto SQL.
Real world examples with quantified impact and cost
Below, we show four real-world use cases, querying a 370Bn row (64.8 TB) one month global ad-tech dataset to understand customer journeys and behaviour. The dataset consists of GPS positions of device ‘check-ins’. As well as location, each row of the dataset contains anonymised device IDs, information about the ad publisher, and browsing category (e.g. music, dating, shopping). The spatial extent of the dataset is shown below (the colour simply represents the longitude of the data points):
The data points are distributed across the human-populated regions, and are particularly concentrated in the USA, western Europe and SE Asia. This reflects the fact that the dataset is sourced from an ad-tech company whose business is based around internet-connected device use primarily in these regions.
In the examples shown, to highlight the efficiency of insight retrieval with GeoSpock DB, as well as the SQL code, we show the fraction of the dataset scanned, along with the actual query cost and typical run time using a 10 node query cluster.
Use case 1: Single device behaviour
[Query cost: $0.0036]
In our first example, we consider the behaviour of a single device. There are many possible use-cases where this may be desired; some examples are:
- Vehicle maintenance; understanding where a vehicle may be malfunctioning.
- Understanding the behaviour of a vehicle involved in an accident.
- Law enforcement officials may want to understand the behaviour of a device associated with criminal activity.
In particular we wish to ask:
Where are all the points where a particular device (in this case in California, USA) was seen on 1st May 2017?
To implement this we can write a short query as follows:
SELECT latitude, longitude, timestamp FROM geospock.default.sspsadrequestsv1 WHERE device_id = '00ba10ad3dd68e4bde7b3c8d7f06d77bfadf55bc' AND timestamp BETWEEN timestamp '2017-05-01 00:00:00 UTC' AND
timestamp '2017-05-01 23:59:59 UTC'
Rows scanned: 47.6M (0.013%) - Typical run time: 6s - AWS cost: $0.0036
Here we can see the route this device took. Our client may wish to follow up by making further queries around the locations where points are recorded, or where the device spent longer on the journey.
Use case 2: Visitors to a retail store
[Query cost: $0.0048]
Next, our client wishes to understand the journeys of customers visiting a retail store before and after their visits on a particular day in order to best place adverts for future customers. In particular we wish to know:
Where did customers at IKEA Austin TX on 1st May 2017 go during the hour before and after their visits?
To answer this question, we can define a polygon around the IKEA store (shown in the inset below), and write some SQL.
In order to make the query more efficient and reduce the amount of data scan, we pre-filter the dataset to only points within 100km of the store on the day in question (‘windoweddata’), create a list of devices visiting the store on that day (‘devices’), and then join these two tables on the timestamp condition shown.
WITH windoweddata as ( SELECT device_id, latitude, longitude, timestamp FROM geospock.default.sspsadrequestsv1 WHERE gs_great_circle_distance_within( -97.6901982, 30.5575332,
longitude, latitude, 100000 ) AND timestamp BETWEEN timestamp '2017-05-01 00:00:00 US/Central'
AND timestamp '2017-05-01 23:59:59 US/Central'
AND device_id IS NOT NULL ),
devices AS ( SELECT device_id, min(timestamp) mintimestamp, max(timestamp)
maxtimestamp FROM windoweddata WHERE ST_Within( ST_Point( longitude, latitude),
ST_GeometryFromText('POLYGON((<points>))') ) GROUP BY device_id, date(timestamp))
SELECT windoweddata.* FROM windoweddata JOIN devices ON ( windoweddata.device_id = devices.device_id AND windoweddata.timestamp BETWEEN ( devices.mintimestamp - interval
'1' hour ) AND ( devices.maxtimestamp + interval '1' hour ) ) ORDER BY windoweddata.device_id, windoweddata.timestamp
Rows scanned: 99.5M (0.027%) - Typical run time: 8s - AWS cost: $0.0048
As well as local visitors, we see that the I35 is a key route for those travelling from further afield. The traffic status on the I35 may therefore be a key indicator as to throughput of out-of-town customers at the store.
By making use of the SQL functions min_by() and max_by() we can also obtain the origin and destinations for each device:
"startLat", min(windoweddata.timestamp) "startTime", max_by(windoweddata.longitude,windoweddata.timestamp,1) "endlon", max_by(windoweddata.latitude,windoweddata.timestamp,1) "endLat", max(windoweddata.timestamp) "endTime"
Rows scanned: 99.5M (0.027%) - Typical run time: 14s - AWS cost: $0.00835
Having understood where people are travelling from/to around their visits to the store, our client can now have a much richer understanding about the catchment of the store and the modes of travel used by customers, for example. Our client can therefore target their ads accordingly. In addition, they may wish to use the information to work out where there may be value in building a new store, and the local planning authorities may wish to understand the impact of the store on road traffic volume.
Use case 3: Device use by local authority area and time of day
[Query cost: $0.097]
We now cross the Atlantic to London, UK. In this case we want to assist our client in where and when to target their attention across Greater London by considering the device check-in counts by London borough through the day. In particular we want to ask the question:
What is the number of distinct device check-ins in each London borough per hour over the course of one day (1st May 2017)?
In this case we use the same check-in dataset as in the first example, in conjunction with a table of local authority boundary polygons for the UK.
As before, we can write a little SQL query to ask this question:
polygons AS (SELECT lad19nm,geometry
FROM geospock.default.uklocalauthorities WHERE lad19nm in (<list of London Boroughs>))
SELECT LAname,Polygon,hr,count(distinct("device_id")) "devcnt" FROM (select polygons.lad19nm "LAname",polygons.geometry "Polygon", ssp.latitude,ssp.longitude, date_trunc('hour',ssp.timestamp at time zone 'Europe/London') "hr",
ssp.device_id FROM geospock.default.sspsadrequestsv1 ssp JOIN polygons
polygons.geometry))) WHERE ssp.timestamp BETWEEN timestamp '2017-05-01 00:00:00
Europe/London' AND timestamp '2017-05-01 23:59:59 Europe/London') GROUP BY 1,2,3
Rows scanned: 727M (0.2%) - Typical run time: 163s - AWS cost: $0.097
Visualising the results, we see that at around 4am, there is little activity in the London suburbs, but a concentration in the central boroughs. Whereas at 12 noon, there is activity spread across the city as people go about their daily business.
Our client may reasonably conclude that at night time it is worth focussing business activities on the central areas, whereas in the daytime the suburbs are more profitable, especially the strip of boroughs from Barnet in the North West to Croydon in the South.
Use case 4: Out of home value analysis
[Query cost: $0.042]
In our final example, again focussed around London, our client wishes to understand where devices are located when they are away from their home location. There is a large number of use-cases for which this kind of query may be used, for example:
- An ad company may wish to understand the value of advertising on transport networks
- Public authorities may wish to understand crowd density on roads
In particular, here we wish to ask the question:
What is the device count where devices are more than 1km away from their daily start positions for each geohash cell (level 7) across the London area?
For this query, we define Presto UDFs gs_encodegeohash() and gs_decodegeohash() to compute the geohash string and the associated polygon. We structure the query so as to only compute the geohash strings and polygons on the minimum number of points, rather than computing them for the whole 370Bn row dataset. To filter to only the ‘away from home’ points, we make the assumption that a device starts its day at home, and then anywhere that is more than 1km from the starting location is ‘away from home’.
We first select all the points in a rectangular box around London on the day in question (‘ssp’), then compute a table of first points in the day for each unique device (‘startpoints’), before filtering the points in ssp which are more than 1km from the start point for each device. Note that to filter on distance, the function implemented ST_Distance() here uses degrees of arc as radius (where 0.01 degrees ~1km). The assumption that the radius represented by a fixed degree of arc is invariant to direction is an approximation since the distance represented by 1 degree of longitude scales with latitude. The ST_Distance() function is much more computationally efficient, and is sufficient for purpose here, however if accuracy is critical, the gs_great_circle_distance() function used earlier should be used instead.
ssp AS (SELECT *,gs_encodegeohash(latitude,longitude,7) "ghash" FROM geospock.default.sspsadrequestsv1 WHERE longitude<0.29 AND longitude>-0.55 AND latitude>51.255 AND
latitude<51.72 AND timestamp BETWEEN timestamp '2017-05-01 00:00:00 Europe/London'
AND timestamp '2017-05-01 23:59:59 Europe/London'),
startpoints AS (SELECT device_id,min_by(latitude,timestamp,1)
"latitude", min_by(longitude,timestamp,1) "longitude" FROM ssp
GROUP BY 1)
SELECT ssp.ghash, gs_decodegeohash(ssp.ghash,6) "ghashpolygon",
count(ssp.device_id) "count" FROM ssp join startpoints
ON ((startpoints.device_id=ssp.device_id) AND
ts.longitude,startpoints.latitude))>0.01)) GROUP BY 1,2
Rows scanned: 682M (0.18%) — Typical run time: 71s — AWS cost: $0.042
Geohashes can be computed for arbitrarily small scales. In this case we have computed geohash cells where the geohash string is of length 7 characters, meaning that the resolution is around 80m, as illustrated in the zoomed section below where the cell dimension is overlaid onto part of the road structure. In our example here, 153,698 cells are computed across the London area.
Our client can optimise their service delivery or analysis on the scale of a city block across the whole of the Greater London area.
It is commonly the case in today’s data-rich world that the biggest barrier to obtaining insights from a dataset is the size of the dataset itself. In the examples shown, however, with a little SQL, our client can obtain strategy-guiding insights from their 370Bn row dataset within minutes.
It is an often-stated maxim in the world of geospatial ‘big data’ that ‘80% of data is geospatially referenced.’ While this figure doesn’t appear to be grounded in any robust study, and it is genuinely hard to obtain a proper estimate, Hahmann and Burghardt (2013) published a study based on Wikipedia entries, estimating that about 60% of all data is geospatially referenced. With the growth of communications and transportation technology and infrastructure, and with the explosion of the IoT in recent years, it is probable that this percentage has grown since 2013. It is certain, however, that both the number and size of massive geospatial datasets has grown, as has the degree to which such datasets are relied on for driving decisions of importance for businesses, governments, and society, and such growth can only be expected to continue at an increasing rate.
The ability to extract insights rapidly from what, until a few years ago, would have seemed impossibly large datasets, is becoming ever more important for the success of a business and the quality of life for every citizen. Using such datasets with an efficient spatio-temporally indexed database, GeoSpock DB, means that any analyst with a small amount of SQL knowledge can easily elicit and communicate such insights to stakeholders and the public at large.
Dr Alan Roberts is Senior Data Scientist at GeoSpock
Stefan Hahmann & Dirk Burghardt (2013) How much information is geospatially referenced? Networks and cognition, International Journal of Geographical Information Science, 27:6, 1171-1189, DOI: 10.1080/13658816.2012.743664