TSDF - Data storage for scalable processing of heterogeneous and geospatial time series

Martí Bosch (CEAT); Gionata Ghiggi (LTE); Son Pham-Ba and Charlotte Weil (ENAC-IT4R)

May 28, 2024.
Funded by the ETH Domain Open Research Data (ORD) Program

Motivation: dealing with (spatial) time series data

Example spatial time series (TS) 🕐 data:

  • Consider 1 month of weather station observations, one every 10 min
  • Which data structure would you use?

station                        1                               …           33
variable             temperature  water_vapour  precipitation  …  temperature  water_vapour  precipitation
time
2021-01-01 00:00:00          2.2          99.0            0.2  …          3.4          92.0            0.1
2021-01-01 00:10:00          2.3          99.0            0.2  …          3.2          92.0            0.1
2021-01-01 00:20:00          2.4          99.0            0.1  …          3.2          92.0            0.2
…                              …             …              …  …            …             …              …
2021-01-31 23:30:00          6.1          99.0            0.2  …          6.9          80.0            0.0
2021-01-31 23:40:00          6.1          98.0            0.3  …          6.9          81.0            0.0
2021-01-31 23:50:00          6.1          99.0            0.3  …          6.8          82.0            0.2

4464 rows × 99 columns

Wide data frame

  • Pros:
      • 👍 Efficient TS 🕐 operations on the index, e.g., df.resample
  • Cons:
      • 👎 Requires aligned TS 🕐
      • 👎 Cannot add station attributes, e.g., a “geometry” 🌍 column
  • Alternatives?
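
As a minimal sketch of the wide layout, assuming only pandas and NumPy and using synthetic values (not the actual station data), the frame above can be built with a (station, variable) column MultiIndex and resampled on its shared time index:

```python
import numpy as np
import pandas as pd

# One month of 10-minute timestamps: 31 days * 144 steps per day = 4464 rows.
time = pd.date_range("2021-01-01", "2021-01-31 23:50", freq="10min", name="time")
stations = range(1, 34)  # 33 stations
variables = ["temperature", "water_vapour", "precipitation"]

# Column MultiIndex (station, variable): 33 * 3 = 99 columns.
columns = pd.MultiIndex.from_product(
    [stations, variables], names=["station", "variable"]
)

# Synthetic values, for illustration only.
rng = np.random.default_rng(0)
wide_df = pd.DataFrame(
    rng.random((len(time), len(columns))), index=time, columns=columns
)

# A single call resamples every station and variable on the shared time index.
hourly_df = wide_df.resample("1h").mean()
print(wide_df.shape, hourly_df.shape)
```

The single `resample` call works only because all stations share the same timestamps, which is exactly the alignment requirement listed under the cons.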

Long data frame

variable                      temperature  water_vapour  precipitation
station  time
1        2021-01-01 00:00:00          2.2          99.0            0.2
         2021-01-01 00:10:00          2.3          99.0            0.2
         2021-01-01 00:20:00          2.4          99.0            0.1
         2021-01-01 00:30:00          2.4          99.0            0.2
         2021-01-01 00:40:00          2.5          99.0            0.2
…        …                              …             …              …
33       2021-01-31 23:10:00          5.6         100.0            0.1
         2021-01-31 23:20:00          5.6         100.0            0.0
         2021-01-31 23:30:00          5.7         100.0            0.2
         2021-01-31 23:40:00          5.4         100.0            0.1
         2021-01-31 23:50:00          5.3         100.0            0.4

147312 rows × 3 columns

Long data frame

  • Pros:
      • 👍 Flexible for unaligned TS 🕐
  • Cons:
      • 👎 TS 🕐 operations require a groupby approach
      • 👎 Adding station attributes, e.g., a “geometry” 🌍 column, would result in many repeated values
  • Alternatives?
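
A minimal sketch of the long layout, again with synthetic values and plain pandas: one station loses its first day of data (so the series are no longer aligned), and the hourly resampling has to go through a groupby on the station level. Grouping with `pd.Grouper` is one way to do this, not the only one.

```python
import numpy as np
import pandas as pd

time = pd.date_range("2021-01-01", "2021-01-31 23:50", freq="10min", name="time")
index = pd.MultiIndex.from_product([range(1, 34), time], names=["station", "time"])

# Synthetic values, for illustration only.
rng = np.random.default_rng(0)
long_df = pd.DataFrame(
    rng.random((len(index), 3)),
    index=index,
    columns=["temperature", "water_vapour", "precipitation"],
)

# Stations need not share timestamps: drop the first day for station 1 only.
is_station_1 = long_df.index.get_level_values("station") == 1
is_first_day = long_df.index.get_level_values("time") < pd.Timestamp("2021-01-02")
long_df = long_df[~(is_station_1 & is_first_day)]

# Resampling now requires grouping by station and by hourly time bin.
hourly_df = long_df.groupby(
    [pd.Grouper(level="station"), pd.Grouper(level="time", freq="1h")]
).mean()
```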

Combine two objects

  • A wide time series data frame
  • A station attributes data frame/series, e.g., “geometry” 🌍
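
A small sketch of the two-object idea, assuming plain pandas and made-up planar coordinates (hypothetical `x`/`y` columns in meters stand in for a real geometry column): the selection is done on the attributes frame and then applied to the columns of the wide frame.

```python
import numpy as np
import pandas as pd

# Object 1: a (tiny) wide time series data frame.
time = pd.date_range("2021-01-01", periods=6, freq="10min", name="time")
columns = pd.MultiIndex.from_product(
    [[1, 2, 3], ["temperature", "precipitation"]], names=["station", "variable"]
)
wide_df = pd.DataFrame(np.zeros((len(time), len(columns))), index=time, columns=columns)

# Object 2: station attributes, one row per station (made-up coordinates).
stations_df = pd.DataFrame(
    {"x": [0.0, 5_000.0, 20_000.0], "y": [0.0, 0.0, 0.0]},
    index=pd.Index([1, 2, 3], name="station"),
)

# Select stations within 10 km of the origin, then the matching columns.
distance = np.hypot(stations_df["x"], stations_df["y"])
near = stations_df.loc[distance <= 10_000.0].index
near_df = wide_df.loc[:, wide_df.columns.get_level_values("station").isin(near)]
```

Attributes are stored once per station, but the two objects must be kept in sync by hand.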

Vector data cubes: xvec

xvec dataset

Vector data cubes: xvec

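import geopandas as gpd
import xvec  # registers the `.xvec` accessor on xarray objects

# e.g., stations within 10 km of Lausanne's center (assumes the station CRS uses meters)
query_geom = gpd.tools.geocode("Lausanne").to_crs(ds.station.crs).buffer(10e3)
ds.xvec.query("station", query_geom)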

Vector data cubes: xvec

  • Pros:
      • 👍 Efficient TS 🕐 operations on the time index, e.g., ds.resample
      • 👍 Efficient spatial 🌍 operations on the spatial index
  • Cons:
      • Requires aligned TS 🕐
      • How to store to disk 💾? See xvec/issues/26
          • pickle, joblib: Python only
          • GIS formats: same pros and cons as wide/long tables

Summary

We could not find a tool to deal with:

  • unaligned time series
  • reliable disk storage, e.g., long term, cross-platform, cloud optimized…
  • station (sample) attributes, e.g., “geometry” 🌍

Proposed solution: Enter TStore

What is TStore

TStore is a Python library for flexible storage and processing of (spatial) TS data. It has two key features:

  • TS 🕐 encapsulation: TS, TSDF, TSLong and TSWide objects to organize heterogeneous (spatial) time series data into Python data frames
  • TS 🕐 storage: TStore is a hierarchically structured specification to reliably and efficiently store (spatial) TS data based on Parquet (and GeoParquet)

Time series encapsulation

Consider a TS object that represents a single time series. The long data frame then becomes one row per station:

data
station
1 TS[shape=(4464, 3),start=2021-01-01 00:00:00,e…
2 TS[shape=(4464, 3),start=2021-01-01 00:00:00,e…
3 TS[shape=(4464, 3),start=2021-01-01 00:00:00,e…
… …
31 TS[shape=(4464, 3),start=2021-01-01 00:00:00,e…
32 TS[shape=(4464, 3),start=2021-01-01 00:00:00,e…
33 TS[shape=(4464, 3),start=2021-01-01 00:00:00,e…

Advantages

  • Flexibility:
      • 👍 each station can have its own TS, e.g., useful with different temporal resolutions, periods of maintenance (no data)…
      • 👍 each TS object may be univariate or multivariate

temperature water_vapour precipitation
station
1 TS[shape=(4464,),start=2021-01-01 00:00:00,end… TS[shape=(4464,),start=2021-01-01 00:00:00,end… TS[shape=(4464,),start=2021-01-01 00:00:00,end…
2 TS[shape=(4464,),start=2021-01-01 00:00:00,end… TS[shape=(4464,),start=2021-01-01 00:00:00,end… TS[shape=(4464,),start=2021-01-01 00:00:00,end…
3 TS[shape=(4464,),start=2021-01-01 00:00:00,end… TS[shape=(4464,),start=2021-01-01 00:00:00,end… TS[shape=(4464,),start=2021-01-01 00:00:00,end…
… … … …
31 TS[shape=(4464,),start=2021-01-01 00:00:00,end… TS[shape=(4464,),start=2021-01-01 00:00:00,end… TS[shape=(4464,),start=2021-01-01 00:00:00,end…
32 TS[shape=(4464,),start=2021-01-01 00:00:00,end… TS[shape=(4464,),start=2021-01-01 00:00:00,end… TS[shape=(4464,),start=2021-01-01 00:00:00,end…
33 TS[shape=(4464,),start=2021-01-01 00:00:00,end… TS[shape=(4464,),start=2021-01-01 00:00:00,end… TS[shape=(4464,),start=2021-01-01 00:00:00,end…
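
The encapsulation idea can be sketched with plain pandas. The `SketchTS` class below is a hypothetical stand-in written for this illustration; it is not TStore's TS type and says nothing about how TStore implements it. Each cell wraps a whole time series, so stations can differ in length and temporal resolution.

```python
import pandas as pd


class SketchTS:
    """Hypothetical stand-in for a TS cell: wraps one station's time series."""

    def __init__(self, df):
        self.df = df

    def __repr__(self):
        index = self.df.index
        return f"SketchTS[shape={self.df.shape},start={index[0]},end={index[-1]}]"


def make_ts(periods, freq):
    """Build a small synthetic temperature series starting on 2021-01-01."""
    time = pd.date_range("2021-01-01", periods=periods, freq=freq, name="time")
    values = [float(i) for i in range(periods)]
    return SketchTS(pd.DataFrame({"temperature": values}, index=time))


# One cell per station: station 1 reports every 10 min, station 2 every 30 min.
cells = pd.Series(
    [make_ts(6, "10min"), make_ts(2, "30min")],
    index=pd.Index([1, 2], name="station"),
    name="data",
)

# Per-station operations are applied cell by cell; no alignment is required.
hourly_mean = {station: ts.df.resample("1h").mean() for station, ts in cells.items()}
```

A generic object column like this gives up typed storage and vectorized operations, which is the gap a dedicated cell type is meant to close.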

TSDF object

  • cells (TS) are implemented as a pandas ExtensionDtype
  • columns (TSArray) are implemented as a pandas ExtensionArray

Advantages

GeoPandas compatible:

  • No geometries are repeated

Time series storage

  • TStore: hierarchically structured specification to efficiently store time series using Apache Parquet
  • 🚧 when geometries are present, use GeoParquet

Consider k years of temperature and precipitation data from n stations. Then, the TStore looks like:

<base_tstore_dir>
├── <station-id-1>
│   ├── <temperature>
│   │   ├── <year-1>
│   │   │   ...
│   │   └── <year-k>
│   └── <precipitation>
│       ├── <year-1>
│       │   ...
│       └── <year-k>
...
└── <station-id-n>
    ├── <temperature>
    │   ├── <year-1>
    │   │   ...
    │   └── <year-k>
    └── <precipitation>
        ├── <year-1>
        │   ...
        └── <year-k>

Flexibility of TStore

We can …

  • use different temporal partitionings, e.g., by month, by year/month…, or a different partitioning for each variable
  • choose the TStore structure, e.g., “variable-station” instead of “station-variable”
Advantages

  • TS objects are loaded into the Apache Arrow memory format
      • ➡️ zero-copy conversion to pandas or polars data frames

Example

5 years of 10-min observations from the 33 Agrometeo stations¹ in the Canton of Vaud, Switzerland:

variable                      temperature  water_vapour  precipitation
station  time
1        2019-06-01 00:00:00         17.0          57.0            0.0
         2019-06-01 00:10:00         16.5          60.0            0.0
         2019-06-01 00:20:00         16.3          59.0            0.0
…        …                              …             …              …
305      2024-04-30 23:30:00         14.9          74.0            0.0
         2024-04-30 23:40:00         15.3          69.0            0.0
         2024-04-30 23:50:00         15.3          67.0            0.0

8534361 rows × 3 columns

import tstore

tstore_dir = "agrometeo-tstore"
variables = ...

# long_ts_df is the long data frame shown above
tslong = tstore.TSLong(long_ts_df)
tslong.to_tstore(
    tstore_dir,
    variables,
    # TStore options
    partitioning="year",
    tstore_structure="id-var",
)

Resulting TStore directory structure:

agrometeo-tstore/
├── tstore_metadata.yaml
├── _attributes.parquet
├── 96/
│   └── temperature/
│       ├── _common_metadata
│       ├── _metadata
│       ├── year=2020/
│       │   └── part-0.parquet
│       └── ...
...
└── 27/
    └── precipitation/
        ├── year=2019/
        │   └── part-0.parquet
        ├── year=2021/
        │   └── part-0.parquet
        ├── year=2022/
        │   └── part-0.parquet
        ├── year=2023/
        │   └── part-0.parquet
        └── year=2024/
            └── part-0.parquet

Some stats

  • CSV: write in 133 s, file size 310.5 MB, read in 7.9 s
  • TStore: write in 4.6 s 🚀, file size 249.9 MB (snappy compression, with 194.4 MB). Reading:
      • whole TStore into long data frame: 23 s
      • single variable: 2.28 s
      • single variable, single station: 0.03 s 🚀

Roadmap

  • GeoPandas and GeoParquet support
  • time filters to read only the required data
  • implement more advanced TS/spatial operations, e.g., groupby/apply, regularize, resampling, rolling window…

Thank you

CEAT LTE ENAC-IT4R EPFL

Footnotes

  1. Data from Agrometeo belongs to the Swiss Federal Administration, see the terms and conditions for more information.