January 2021 By Felix Sanchez-Garcia

Complex analytics on massive datasets: Creating a massive synthetic connected vehicle dataset

Connected vehicle datasets offer incredibly rich potential for organisations seeking to understand more about a variety of location intelligence use cases.

Analyses of such datasets provide insights into customer behaviour, journey efficiency, and infrastructure management, which can be used to improve operational performance and create value-added services. However, this richness also creates challenges. The sheer size of connected vehicle datasets makes it a daunting task to run rapid analytics that capitalise on the granular detail they contain - and it is one that many of our customers struggle with.

At GeoSpock, we believe that dealing effectively with these types of massive datasets requires a fundamental rethink in database design. To address this, we recently released GeoSpock DB, a space-time analytics database optimised specifically for queries on real-world, connected device data. 

Whilst testing GeoSpock DB, we knew we needed a dataset of sufficient size and complexity on which to benchmark our performance. Densely sampled, city-scale connected vehicle datasets may include millions or even billions of individual journeys undertaken by drivers. In these cases, it is not unusual to accumulate data volumes of upwards of 100 terabytes, spanning over a trillion rows. To our knowledge, there are no equivalent datasets available in the public domain. Equally, whilst traffic simulation software such as MATSim can provide highly realistic simulation of vehicle behaviour, using this type of approach to generate data at the scale required for our tests would have demanded an infeasible amount of computation and time.

This left us only one option - to build our own simulated version of a massive connected vehicle dataset. Complete with unique vehicles, multiple journey profiles, and representative space-time extents, it would allow us to test GeoSpock DB performance.

We sited our simulated connected vehicle dataset in Singapore. The city-state is working to replace its current gantry-operated Electronic Road Pricing (ERP) system with satellite-based vehicle tracking, which will enable more detailed traffic analysis and dynamic congestion charging. Whilst the updated system is not yet in use, the scale and scope of the data it will collect is exactly the type of dataset we wished to simulate for testing GeoSpock DB - making Singapore an ideal proving ground. To populate our synthetic dataset and give it an appropriate sense of scale, we generated 8 million unique vehicle IDs (registration numbers) across 6 different vehicle classes, weighting the vehicle types according to the distribution seen in Singapore.
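As a rough illustration of the weighted class assignment described above, here is a minimal Python sketch. The class names, the weights, and the ID format are hypothetical placeholders (the actual Singapore distribution and registration format are not given here), and the example generates a small sample rather than 8 million vehicles.

```python
import random

# Hypothetical class names and weights, for illustration only;
# the real Singapore vehicle distribution is not listed in this post.
CLASS_WEIGHTS = {
    "car": 0.65,
    "motorcycle": 0.15,
    "light_goods": 0.10,
    "heavy_goods": 0.05,
    "taxi": 0.03,
    "bus": 0.02,
}


def generate_vehicles(n, seed=0):
    """Return n (vehicle_id, vehicle_class) pairs with weighted classes."""
    rng = random.Random(seed)
    classes = list(CLASS_WEIGHTS)
    weights = list(CLASS_WEIGHTS.values())
    # Draw one class per vehicle, in proportion to the weights.
    assigned = rng.choices(classes, weights=weights, k=n)
    # Sequential zero-padded IDs are unique by construction
    # (a made-up format, not real registration numbers).
    return [(f"V{i:07d}", cls) for i, cls in enumerate(assigned)]


vehicles = generate_vehicles(10_000)
```

Seven digits are enough to cover 8 million IDs, and seeding the generator keeps the output reproducible between runs.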

One of the most challenging aspects of generating a large-scale, simulated connected vehicle dataset is the creation of realistic route profiles for a large number of journeys. In our case, we wished to model data for 4 million individual journeys per day over a time frame of one year: a total of 1.46 billion trips!

Existing routing engines typically work in a serialised manner, meaning each individual route is generated one at a time. However, this serialised approach would have been highly inefficient for generating the large number of journeys we required. To solve this, we parallelised the trip creation process by embedding an open source routing engine into Spark. We then used this to create 1.4 million prototype trips, a seed dataset 54 GB in size. For each day in our year of simulated data, we then performed the following actions:

  • Randomly choose a selection of 4 million vehicles to which to assign journeys;
  • Randomly assign a timestamp to the start of each journey undertaken (to help create a more natural pattern of variation within the data, the start time assignment was weighted using recorded traffic volume information from a busy interchange in a North American city);
  • For each individual journey, randomly assign one of the 1.4 million profiled prototype trips;
  • Upsample each journey to a 1 second sampling interval, to replicate the data density planned for ERP 2.0 (the satellite-based successor to the current ERP system) and similar GNSS tracking datasets.
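The four daily steps above can be sketched in plain Python as follows. This is an illustrative single-machine sketch under stated assumptions, not the Spark pipeline itself: the hourly weights are invented, vehicles are assumed to be sampled without replacement, each prototype trip is assumed to be a list of (seconds offset, latitude, longitude) waypoints with increasing integer offsets, and linear interpolation is just one possible way to upsample to 1 second.

```python
import random

# Hypothetical relative traffic volume per hour of day (24 values).
# The real weights came from recorded interchange data, not these numbers.
HOURLY_WEIGHTS = [1, 1, 1, 1, 2, 4, 8, 10, 9, 6, 5, 5,
                  6, 5, 5, 6, 8, 10, 9, 6, 4, 3, 2, 1]


def upsample(trip):
    """Linearly interpolate (t, lat, lon) waypoints to one point per second."""
    points = []
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(trip, trip[1:]):
        for t in range(t0, t1):
            f = (t - t0) / (t1 - t0)
            points.append((t, lat0 + f * (lat1 - lat0), lon0 + f * (lon1 - lon0)))
    points.append(trip[-1])  # keep the final waypoint
    return points


def simulate_day(vehicle_ids, prototype_trips, n_journeys, day_start, rng):
    """Return (vehicle_id, timestamp, lat, lon) rows for one simulated day."""
    # Step 1: choose the vehicles that travel today (assumed without replacement).
    chosen = rng.sample(vehicle_ids, n_journeys)
    # Step 2: draw a start hour for each journey, weighted by traffic volume.
    hours = rng.choices(range(24), weights=HOURLY_WEIGHTS, k=n_journeys)
    rows = []
    for vehicle_id, hour in zip(chosen, hours):
        start = day_start + hour * 3600 + rng.randrange(3600)
        # Step 3: assign one prototype trip at random.
        trip = rng.choice(prototype_trips)
        # Step 4: upsample to a 1 second interval and shift to the start time.
        for offset, lat, lon in upsample(trip):
            rows.append((vehicle_id, start + offset, lat, lon))
    return rows


rng = random.Random(42)
vehicle_ids = [f"V{i:07d}" for i in range(100)]
prototype_trips = [
    [(0, 1.3000, 103.8000), (10, 1.3010, 103.8020), (25, 1.3030, 103.8030)],
    [(0, 1.3500, 103.7000), (30, 1.3520, 103.7050)],
]
rows = simulate_day(vehicle_ids, prototype_trips, n_journeys=50, day_start=0, rng=rng)
```

Because each journey depends only on its own random draws, the per-journey work is independent and can be spread across many workers, which is what makes this kind of generation straightforward to parallelise.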

The end result of this process was a synthetic dataset with a total size of 108 TB, or 1.3 trillion rows, representing 1.46 billion independent journeys made throughout Singapore over a total time period of one year. 
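The headline figures can be sanity-checked with some back-of-envelope arithmetic. The short Python sketch below derives the total journey count and, from the quoted totals, the implied average rows per journey and bytes per row. The derived averages are estimates only; they assume decimal terabytes and treat the rounded totals of 108 TB and 1.3 trillion rows as exact.

```python
JOURNEYS_PER_DAY = 4_000_000
DAYS = 365
TOTAL_ROWS = 1.3e12    # 1.3 trillion rows, as quoted
TOTAL_BYTES = 108e12   # 108 TB, assuming decimal terabytes

total_journeys = JOURNEYS_PER_DAY * DAYS            # 1.46 billion
rows_per_journey = TOTAL_ROWS / total_journeys      # roughly 890 rows
minutes_per_journey = rows_per_journey / 60         # 1 row per second
bytes_per_row = TOTAL_BYTES / TOTAL_ROWS            # roughly 83 bytes

print(f"{total_journeys:,} journeys")
print(f"{rows_per_journey:.0f} rows per journey on average")
print(f"{minutes_per_journey:.1f} minutes per journey at 1 Hz")
print(f"{bytes_per_row:.0f} bytes per row on average")
```

At one row per second, roughly 890 rows per journey corresponds to an average trip of just under 15 minutes.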

Of course, the synthesised dataset is not a typical traffic model used for congestion analysis. Whilst each journey follows a realistic path, and overall traffic volumes fluctuate over the course of a day, each trip is made independently and is unconnected to those before or after it. There is also no notion of congestion. However, the spatial and temporal variation, along with the overall size and density of the dataset produced, is ideal for testing the performance of GeoSpock DB on the queries common in spatial analytics and location intelligence use cases.

As well as helping us to provide indicative performance reports for queries on massive datasets in GeoSpock DB, our creation showcases to customers what we really mean when we talk about big data at GeoSpock – and why there's a need to think differently about how we deal with this volume of data to ensure we are able to manage and analyse it effectively in the future. 

In a future blog post, we plan to detail some of the problem statements and query types we commonly encounter from customers performing analyses on connected device datasets like the one we have produced here. In the meantime, you can explore the dataset yourself through GeoSpock DB Discovery, the free-to-access version of our query engine, which includes the Singapore synthetic dataset as well as a library of other publicly available, sensor-generated datasets to be used for exploration and your own performance testing.

Happy querying!

Dr. Felix Sanchez-Garcia is Head of Data Science at GeoSpock
