R analytics blog - The German fuel prices data set

R analytics blog

The German fuel prices data set

R Analytics Blog – 2017 / 07 / 28

The data set on German fuel prices contains the fuel prices, but not the sales, from more than 14000 fuel stations in Germany since June 2014. It is made available by the webservice Tankerkoenig as a Postgres dump (from June 2014 onwards) under CC4.0.

This is a particularly interesting data set

from the general public point of view as
- it deals with a topic under strong public scrutiny and is already intensively studied by economists
- the data is also constantly updated and allows for continuous monitoring for changes in trends or new phenomena
from the data science point of view as
- this is a real life data set with all the little quirks and bugs whose correction forms large part of an analysts working hours
- the data is small enough (~ 3GB in the original data dump) to be effectively prepared and analysed on a single computer, but big enough to allow to test different techniques for scaling and speeding up the analysis in a cluster
- it allows for the integration of many other (open) data sources (socio-economic data, weather, traffic, holidays). See the graph below.
- interesting results can already be obtained using purely descriptive methods, but the data can also be analysed and modelled under spatial–, time series–, panel– and many other aspects.

Overview of the available data and structure

The original data is available at Tankerkoenig, the current status of the code for the data preparation, some of the additional external data, and some simple first models are available as a “workshop” at my GitHub-account.

Structure of the GitHub project

A) Collection of regional data tables from destatis

Reading, cleaning and consolidating multiple socio-demographic data files
B) Preparation and cleaning of the fuel station master data

Tidying, creation of brand and highway markers using regular expressions, parsing of json-information on opening hours
C) Geo-operations on the station master

Identification of NUTS 3-region per station, station distance matrices to competitors, highways, traffic-counters etc.
D) Preparation of price data

Reading from Postgres, cleaning strange prices, imputing and aggregating the price data (see also the blog entry about efficient missing line imputation). Calculation of competitor prices
E) Creation of models

Moving to AWS, test of different (Linear, Panel, Spatial) models, collection of results
F) Analyses, preparation and visualisation of results
G) Presentation creation (not yet on GitHub)

You can find the print-version of my UseR!2017-talk on this project here. There is also a video, courtesy of Channel 9, that you can watch here:

Back to blog overview

About me

I am a consultant and project manager in marketing and business analytics. Having worked in the area for more than 15 years and having led the Data Science and Analytics teams at IRI Germany from 2009 to 2016, I am now again working as an independent consultant focusing on applications of Big Data and AI in marketing.