Extensive collection of tabular time-series anomaly detection datasets.

OliverHennhoefer/awesome-ts-anomaly-detection-datasets

Awesome Time-Series Anomaly Detection Datasets

Extensive collection of publicly available time-series datasets for anomaly detection with a focus on real-world data or synthetic data that is representative of real-world data.

Before using any of the listed datasets for your experiments you may want to read:

(At least skim through the presentation "Irrational Exuberance: Why we should not believe 95% of Papers on Time-Series Anomaly Detection" for a tl;dr.)

Many of the popular datasets for benchmarking in anomaly detection suffer from:

  • Triviality
  • Unrealistic Anomaly Density
  • Mislabeled Ground Truth
  • (Run-to-Failure Bias)

Among these are, e.g., Yahoo! S5 (ironically listed first here) and the Numenta Anomaly Benchmark (NAB). Do not rely [only] on these kinds of datasets for evaluation and experimentation!
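Two of the flaws listed above can be checked quickly before committing to a dataset. The following is a minimal sketch, not a method taken from the referenced presentation: `anomaly_density` and `one_liner_flags` are hypothetical helper names, the threshold factor `k` is an arbitrary assumption, and the toy series only stands in for a real dataset.

```python
from statistics import mean, pstdev


def anomaly_density(labels):
    """Fraction of time steps labeled as anomalous (labels are 0/1)."""
    return sum(labels) / len(labels)


def one_liner_flags(values, k=3.0):
    """Naive baseline: flag points whose absolute first difference
    exceeds mean + k * std of all absolute first differences."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    threshold = mean(diffs) + k * pstdev(diffs)
    # The first point has no predecessor, so it is never flagged.
    return [0] + [int(d > threshold) for d in diffs]


# Toy data (assumption): a flat signal with one obvious spike at index 50.
values = [1.0] * 100
values[50] = 9.0
labels = [0] * 100
labels[50] = 1

density = anomaly_density(labels)
flags = one_liner_flags(values)
# True if the naive baseline hits at least one labeled anomaly.
hit = any(f and l for f, l in zip(flags, labels))
```

If a baseline this simple already recovers most labeled anomalies, the dataset is probably too trivial to separate good methods from bad ones. A density far above what a real system would exhibit is a similar warning sign. Note that the baseline flags both the jump into the spike and the jump back out of it.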

1 Datasets

1.1 Univariate

This dataset has to be requested for access.

S5 is designed to benchmark anomaly detection algorithms using time-series data with tagged anomalies, including outliers and change-points, representing various Yahoo! services and synthetic variations.

The Chinese AIOps Competition series challenges participants to develop innovative solutions that can detect and diagnose IT system issues using large-scale datasets. The competitions involve tasks like anomaly detection, root cause analysis, and predictive maintenance.

Official Repository: NetManAIOps

Login is needed for access. Data in the linked repository is publicly available.

Repository: KDDCup2021

Datasets composed of synthetic Mackey-Glass time series with non-trivial anomalies. In contrast to other synthetic benchmarks, it is very hard for the human eye to distinguish the introduced anomalies from the normal (chaotic) behavior.

Official Repository: MGAB
Related Publications: Time Series Encodings with Temporal Convolutional Networks

The provided code allows you to generate your own version of the data with different parameter settings.
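For readers unfamiliar with Mackey-Glass series, the sketch below generates one with a simple Euler discretization (step size 1) of the Mackey-Glass delay differential equation. It is not the MGAB generator and does not inject anomalies. The function name `mackey_glass`, the parameter values (commonly used defaults for the chaotic regime), and the constant initial history are all assumptions made for illustration.

```python
def mackey_glass(length, tau=17, beta=0.2, gamma=0.1, n=10, x0=1.2):
    """Euler-discretized (step size 1) Mackey-Glass series.

    dx/dt = beta * x(t - tau) / (1 + x(t - tau)**n) - gamma * x(t)
    """
    # Constant history for the first tau + 1 steps (an assumption).
    history = [x0] * (tau + 1)
    for _ in range(length):
        delayed = history[-(tau + 1)]  # x(t - tau)
        current = history[-1]          # x(t)
        step = beta * delayed / (1.0 + delayed ** n) - gamma * current
        history.append(current + step)
    # Drop the artificial constant history.
    return history[tau + 1:]


series = mackey_glass(500)
```

In practice the initial transient is usually discarded as well, and a smaller integration step gives a closer approximation of the continuous system.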

1.2 Multivariate

Repository with a total of eight SCADA datasets of various wind farms and additional links to related datasets.

Some listed sources need either a platform registration or an application.

Real-world SCADA data from three wind farms.

Corresponding Publication: CARE to Compare: A real-world dataset for anomaly detection in wind turbine data

The TEP dataset is designed for anomaly detection in industrial process control settings. It simulates a complex chemical production process with multiple operating conditions and potential faults, providing time-series data that includes normal operations as well as various types of faults or anomalies.

Detailed information available here. The Python package PyTEP allows for customized simulation scenarios and setups. The package requires an activated MATLAB/Simulink license.

The SMD is a dataset used for anomaly detection in the context of server operations. It consists of several time-series collected from different server machines, capturing various metrics such as CPU load, memory usage and network traffic. The dataset includes labeled anomalies, such as spikes or drops in performance.

Introducing Publication: Robust Anomaly Detection for Multivariate Time Series through Stochastic Recurrent Neural Network

The GECCO 2018 Industrial Challenge invites participants to develop an event detection system for predicting changes in a time series of drinking water composition data, utilizing a real-world dataset provided by Thüringer Fernwasserversorgung (Germany).

The ASD dataset contains data of 12 application servers in a large Internet company.

Corresponding Publication: Multivariate Time Series Anomaly Detection and Interpretation using Hierarchical Inter-Metric and Temporal Embedding

SMAP (Soil Moisture Active Passive satellite) and MSL (Mars Science Laboratory rover) are two public datasets from NASA.

Related Repository: Telemanom and OmniAnomaly
Related Publications: Detecting Spacecraft Anomalies Using LSTMs and Nonparametric Dynamic Thresholding and Robust Anomaly Detection for Multivariate Time Series through Stochastic Recurrent Neural Network
Corresponding Download Versions: OmniAnomaly

This dataset has to be requested for access.

This collection of datasets, provided by the Singapore University of Technology and Design and the iTrust Centre for Research in Cyber Security, contains 5 different datasets suitable for benchmarking anomaly detection algorithms. They are derived from the two available main datasets, SWaT and WADI (see below).

The Secure Water Treatment (SWaT) dataset is a collection of data from a water treatment testbed, covering 11 days of continuous operation: 7 days under normal conditions and 4 days with deliberate attack scenarios. The dataset includes network traffic and readings from 51 sensors and actuators, with labels indicating normal and abnormal behaviors. During the 4 days of attacks, 41 different attack scenarios were executed based on a cyber-physical system (CPS) attack model developed by the research team.

The Water Distribution (WADI) dataset captures data from a water distribution testbed over 16 days of continuous operation: 14 days under normal conditions and 2 days featuring deliberate attack scenarios. The dataset includes readings from 123 sensors and actuators, with the attack scenarios based on a cyber-physical system (CPS) attack model developed by the research team. During the 2 days of attacks, 15 distinct attack scenarios were executed.

The Helicopter Vibration Measurement Dataset is provided by Airbus SAS to automate the validation of vibration data and detect abnormal sensor behavior. Vibration measurements are collected from accelerometers placed at various positions on helicopters, measuring in three directions: longitudinal, vertical, and lateral.

The multivariate PSM dataset comprises 90 key performance indices (KPIs) from eBay, capturing per-minute cart volumes across various sub-dimensions like user location, device type, and cart types, making it suitable for analyzing temporal and spatial dependencies that reflect business availability and health.

Related Publications: Practical Approach to Asynchronous Multivariate Time Series Anomaly Detection and Localization and Real-Time Synchronization in Neural Networks for Multivariate Time Series Anomaly Detection

3W (Petrobras)

The 3W dataset consists of instances from three different sources containing undesirable events occurring in oil wells. Accompanying this dataset is the 3W Toolkit, a software package designed to facilitate experimentation with the dataset for specific problems related to oil well operations.

Related Publications: A realistic and public dataset with rare undesirable real events in oil wells

Data was collected for normal bearings, single-point drive end and fan end defects.

Dataset of current, voltage, and vibration measurements of an electromechanical driving system. The system is a three-phase asynchronous motor that drives a gearbox.

Dataset of speed, current, voltage and vibration measurements of an electromechanical drive system. The system is a three-phase asynchronous motor.

The HAI dataset was collected from a realistic industrial control system (ICS) testbed augmented with a Hardware-In-the-Loop (HIL) simulator that emulates steam-turbine power generation and pumped-storage hydropower generation.

A collection of three datasets regarding power systems, gas pipelines and water storage tanks.

Data contains recordings of five people performing different activities. Each person wore four sensors while performing the same scenario five times.

Refactored Version: Kaggle

The EDEN ISS 2020 Telemetry Dataset consists of equidistant sensor readings stemming from 97 sensors in the EDEN ISS research greenhouse.

Related Publications: Unraveling Anomalies in Time: Unsupervised Discovery and Isolation of Anomalous Behavior in Bio-regenerative Life Support System Telemetry

2 Benchmark Collections

Archive of time-series data for anomaly detection that compensates for the shortcomings of other available anomaly detection datasets, as stated in the corresponding publication(s).

Corresponding Publication: Current Time Series Anomaly Detection Benchmarks are Flawed and are Creating the Illusion of Progress

Skoltech Anomaly Benchmark (SKAB) 💼

The SKAB is a comprehensive framework designed for evaluating anomaly detection algorithms, focusing on outlier and changepoint detection in multivariate time series data. SKAB includes datasets, leaderboards, evaluation modules, and Python tools to support algorithm testing. The dataset consists of 35 files of time series data from sensors monitoring a testbed, with each file containing a single experiment and associated anomaly. SKAB provides both single-point and collective anomaly labels, making it useful for benchmarking various detection algorithms.

Numenta Anomaly Benchmark (NAB) 💼

The NAB is a comprehensive framework designed to evaluate anomaly detection algorithms specifically for real-time, streaming data applications. It includes over 50 labeled time-series datasets from both real-world and synthetic sources, along with a novel scoring mechanism tailored for real-time detection scenarios. NAB provides tools for testing algorithms, a leaderboard for competitive results, and encourages contributions and collaboration from the community. The benchmark and its associated resources support the development and assessment of algorithms in unsupervised real-time anomaly detection.

The CATS dataset is a simulated dataset designed for benchmarking anomaly detection algorithms in multivariate time series. It includes 17 variables representing sensor readings, control commands, and external stimuli, with 200 precisely injected anomalies across 5 million timestamps. The dataset offers fine control over ground truth, context for anomalies, and a pure signal without noise, making it ideal for evaluating the performance, robustness, and explainability of anomaly detection methods in a complex dynamical system.

3 Related Datasets

NYC Taxi Traffic 🖇️

Number of New York taxi passengers, with five anomalies occurring during the NYC marathon, Thanksgiving, Christmas, New Year's Day, and a snowstorm.

Hourly Interstate 94 Westbound traffic volume for MN DoT ATR station 301, roughly midway between Minneapolis and St Paul, MN. Hourly weather features and holidays included for impacts on traffic volume.

4 Data Hubs

EDP Open Data 🌐

Platform providing open datasets in context of solar photovoltaic, wind and thermal technology.

Lists univariate and multivariate time series anomaly detection datasets used in the experimental evaluation paper.

The repository provides free access to a large collection of medical research data, supporting biomedical research and education through the availability of physiological and clinical data alongside related open-source software.

IEEE Dataport 🌐

Public hub for dataset sharing in context of IEEE publications.

Zenodo 🌐

Open Science platform for dataset sharing and more.


Inspiration 💡

Now, what to do with all these datasets? Here are some highly competitive methods (that you have probably never heard of) for inspiration:

Anomaly detection based on time-series discords is a 20-year-old, little-known, and parameter-light (1) technique that outperforms a wide range of contemporary anomaly detection methods. It can find even the most subtle anomalies in time series and is said to yield superhuman results.
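To make the idea concrete: a discord is the subsequence whose distance to its nearest non-overlapping neighbor is the largest in the series. The sketch below is a brute-force illustration of that definition only, not one of the optimized algorithms from the discord literature. The name `top_discord` is hypothetical, plain Euclidean distance is used (z-normalizing each subsequence first is common but omitted here), and the toy sine wave is an assumption.

```python
import math


def top_discord(series, m):
    """Return (start_index, distance) of the length-m subsequence whose
    nearest non-overlapping neighbor is farthest away.

    Brute force, O(n^2 * m); only suitable for short series.
    """
    n_sub = len(series) - m + 1
    best_idx, best_dist = -1, -1.0
    for i in range(n_sub):
        nearest = math.inf
        for j in range(n_sub):
            if abs(i - j) < m:  # skip overlapping (trivial) matches
                continue
            d = math.sqrt(sum((a - b) ** 2
                              for a, b in zip(series[i:i + m], series[j:j + m])))
            nearest = min(nearest, d)
        if nearest != math.inf and nearest > best_dist:
            best_idx, best_dist = i, nearest
    return best_idx, best_dist


# Toy data (assumption): a sine wave with period 20 and a flattened half-cycle.
series = [math.sin(2 * math.pi * t / 20) for t in range(200)]
for t in range(100, 110):
    series[t] = 0.0

idx, dist = top_discord(series, m=20)
```

The subsequence length `m` is the only parameter, which is what "parameter-light" refers to. On the toy series the returned subsequence overlaps the flattened segment, because every undisturbed subsequence has a near-identical copy one period away.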