This is a list of weather and climate datasets preprocessed for AI research. It includes benchmarks, competitions, and ML papers with published data. The list is in alphabetical order.
- Code and Data: https://github.com/NCAR/ai4ess-hackathon-2020
- Source: NCAR, Lawrence Berkeley Lab, and NOAA
- Description: Five challenge problems related to prediction and emulation. The GOES challenge focuses on predicting lightning from GOES-16 satellite imagery. The GECKO-A challenge focuses on emulating the GECKO-A chemistry model from a large set of model time series. The Microphysics challenge focuses on emulating the TAU bin microphysics scheme. The HOLODEC challenge focuses on estimating raindrop distribution properties from synthetic holographic diffraction patterns. The ENSO challenge focuses on predicting ENSO from gridded model output.
- Code and data: https://www.kaggle.com/c/ams-2014-solar-energy-prediction-contest
- Source data: GEFS forecasts and Mesonet solar observations
- Description: Predict total daily solar irradiance from GEFS forecasts and Oklahoma Mesonet data.
- Paper: Betancourt, C., Stomberg, T., Roscher, R., Schultz, M. G., and Stadtler, S.: AQ-Bench: a benchmark dataset for machine learning on global air quality metrics, Earth Syst. Sci. Data, 13, 3013–3033, https://doi.org/10.5194/essd-13-3013-2021, 2021.
- Code and data: https://gitlab.version.fz-juelich.de/toar/ozone-mapping and https://doi.org/10.23728/b2share.30d42b5a87344e82855a486bf2123e9f
- Source data: Database of the Tropospheric Ozone Assessment Report (TOAR)
- Description: Aggregated air quality data from the years 2010–2014 and metadata at more than 5500 air quality monitoring stations all over the world. A well-defined task, a suitable evaluation metric and baseline scores are provided.
- Data: https://ral.ucar.edu/solutions/products/camels
- Paper: https://ncar.github.io/hydrology/datasets/CAMELS_timeseries
- Source data: Weather models (Daymet, NLDAS, Maurer), streamflow observations (USGS), catchment attributes (USGS, MODIS, Daymet, STATSGO, Global Lithological Map (GLiM), GLobal Hydrogeology Maps (GLHYMPS))
- Description: Weather drivers, streamflow observations, and catchment attributes for 671 catchments across the continental US.
- Papers using this dataset: https://doi.org/10.5194/hess-22-6005-2018, https://doi.org/10.1029/2019WR026793
ClimateNet: an expert-labelled open dataset and Deep Learning architecture for enabling high-precision analyses of extreme weather
- Paper: https://gmd.copernicus.org/preprints/gmd-2020-72/
- Code and data: https://portal.nersc.gov/project/ClimateNet/
- Source data: Climate Model simulations and expert labels
- Description: Detect atmospheric rivers and tropical cyclones from climate model simulations. A labeling tool is provided along with a dataset of expert-labelled data.
- Paper: http://doi.org/10.1109/JSTARS.2021.3062936
- Code and data: https://vision.eng.au.dk/cloudcast-dataset/
- Source data: Meteosat-11 with cloud types annotated on a pixel-level
- Description: The CloudCast dataset contains 70080 images with 10 different cloud types for multiple layers of the atmosphere annotated on a pixel level. The dataset has a spatial size of 928 x 1530 pixels recorded with 15-min intervals for the period 2017-2018, with a 3.0 km resolution.
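
As a quick sanity check on the figures above, the stated image count is consistent with a gap-free 15-minute series covering all of 2017 and 2018. The sketch below assumes the period runs from 2017-01-01 up to (but not including) 2019-01-01 with no missing frames; `count_frames` is a hypothetical helper and not part of the CloudCast code.

```python
from datetime import datetime, timedelta


def count_frames(start: datetime, end: datetime, step: timedelta) -> int:
    """Number of timestamps in the half-open interval [start, end) spaced by step."""
    return (end - start) // step


# Assumed period: all of 2017 and 2018, one image every 15 minutes.
n_frames = count_frames(
    datetime(2017, 1, 1),
    datetime(2019, 1, 1),
    timedelta(minutes=15),
)
print(n_frames)  # 730 days * 96 frames per day = 70080
```
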
- Paper: https://arxiv.org/abs/1911.04227
- Code and data: https://github.com/FrontierDevelopmentLab/CUMULO
- Source data: Moderate Resolution Imaging Spectroradiometer (MODIS) from Aqua satellite and 2B-CLDCLASS-LIDAR
- Description: The dataset provides global 1 km-resolution MODIS imagery aligned with accurately measured cloud properties from the CloudSat products. It contains three years of 1354 x 2030 pixel hyperspectral images combined with pixel-width ‘tracks’ of cloud labels, corresponding to the eight World Meteorological Organization genera.
- Paper: http://doi.org/10.1109/JSTARS.2020.3011907
- Competition: https://www.drivendata.org/competitions/72/predict-wind-speeds/page/274/
- Code and data: http://registry.mlhub.earth/10.34911/rdnt.xs53up/
- Source data: GOES
- Description: A collection of tropical storms in the Atlantic and East Pacific Oceans from 2000 to 2019 with corresponding maximum sustained surface wind speed. The dataset is split into training and test sets for the purpose of a competition. The training set consists of 70,257 images and the test set consists of 44,377 images, each one being 366 x 366 pixels.
- Paper: https://arxiv.org/abs/2012.06246
- Code and data: https://www.earthnet.tech/
- Source data: Sentinel 2
- Description: Curated dataset containing target spatio-temporal Sentinel 2 satellite imagery at 20 m resolution, matched with high-resolution topography and mesoscale (1.28 km) weather variables. With over 32000 samples it is suitable for training deep neural networks.
- Paper: https://arxiv.org/abs/1612.02095
- Code and data: https://github.com/eracah/hur-detect, https://extremeweatherdataset.github.io/
- Source data: CAM5
- Description: Consists of 768 × 1152 images of the global atmospheric state with a spatial resolution of 25 km, separated by 6-hour intervals from 1979 to 2005. There are 16 image channels that correspond to different variables such as surface pressure, surface temperature and humidity at the reference altitude. In addition, there are bounding boxes and class labels for 4 types of extreme weather events: Tropical Depressions, Tropical Cyclones, Extratropical Cyclones and Atmospheric Rivers.
- Paper: https://arxiv.org/abs/2012.11154
- Code and data: https://flow-forecast.atlassian.net/wiki/spaces/FF/pages/33456135/FlowDB+Dataset (Not public)
- Source data: USGS, SNOTEL, NOAA, ASOS, EcoNet
- Description: An hourly river flow and precipitation dataset and a second subset of flash flood events with damage estimates and injury counts. Created for general stream flow forecasting and flash flood damage estimation.
- Code and data: https://www.kaggle.com/c/how-much-did-it-rain and https://www.kaggle.com/c/how-much-did-it-rain-ii
- Source data: US Radar and rain gauges
- Description: Estimate the rainfall probability distribution from dual-polarization radar data.
- Code and data: https://github.com/meteofrance/meteonet
- Source data: AROME/ARPEGE forecasts, radar reflectivity and ground stations over France
- Description: Multi-source dataset of forecasts and observations over France spanning 3 years.
- Paper: Rasp and Lerch 2018
- Code and data: https://github.com/slerch/ppnn
- Source data: TIGGE forecasts and station observations over Germany
- Description: Ensemble temperature postprocessing of station observations over Germany. 9 years of data at 500 stations. Predictors include temperature as well as a range of other variables.
- Paper: https://arxiv.org/abs/2012.09670
- Code and data: https://github.com/frontierdevelopmentlab/pyrain
- Source data: IMERG, ERA5 and SimSat
- Description: Multi-modal benchmark dataset for data-driven precipitation forecasting at 3 different spatial resolutions: 0.1deg (IMERG and SimSat) and 0.5deg (ERA5). Presented along with an efficient data-loading pipeline, PyRain.
- Paper: NeurIPS
- Code and data: http://sevir.mit.edu/
- Source data: GOES-16 and NEXRAD over CONUS
- Description: Preprocessed satellite and radar data over the continental US, served in patches. Supports a range of challenges with baselines (check the website for updates).
- Presentation: AMS
- Code and data: https://svrimg.org/
- Source data: GridRad (which in turn is sourced from NOAA NEXRAD Level II archives)
- Description: Over 500,000 data-rich, geospatial radar reflectivity images centered on high-impact weather events. These images have consistent dimensions and intensity values on a grid with relatively low spatial distortion over the conterminous United States. Also includes crowd-sourced labeling.
- Paper: https://doi.org/10.1038/s41597-020-0574-8
- Code and data: https://github.com/MPBA/TAASRAD19
- Source data: Official public meteorological agency of the civil protection department of the Autonomous Province of Trento (Italy)
- Description: Benchmark dataset for radar nowcasting with deep learning. The dataset contains 1,732 radar sequences labeled with precipitation type spanning from 2010 to 2019, for a total of 362,233 radar images. Image size is 480 x 480 at 500m resolution (UTM grid) covering a complex orographic area in the Italian Alps.
- Papers using this dataset: https://doi.org/10.3390/atmos11030267, https://doi.org/10.3390/rs11242922
- Paper: BAMS
- Code and data: https://www.kaggle.com/c/understanding_cloud_organization
- Source data: TERRA and AQUA MODIS visible images
- Description: Cloud classification challenge of 4 human-designed shallow cloud patterns of organization: Sugar, Flower, Fish and Gravel with 30,000 human labels
- Paper: https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/2014EF000259
- Code and data: http://www.value-cost.eu/
- Source data: Station observations
- Description: Framework for evaluating climate model downscaling methods. Validation observations are provided.
- Papers using this dataset: Many
- Paper: https://doi.org/10.1029/2020MS002203
- Code and data: https://github.com/pangeo-data/WeatherBench
- Source data: ERA5 and TIGGE for baselines
- Description: Benchmark dataset for medium-range (3 and 5 day) forecasting of global pressure, temperature and precipitation with preprocessed data (40 years), evaluation and baselines
- Papers using this dataset: https://arxiv.org/abs/2003.11927, https://arxiv.org/abs/2008.08626
- Paper: BAMS
- Code and data: https://renkulab.io/projects/aaron.spring/s2s-ai-challenge-template/
- Source data: S2S access via climetlab-s2s-ai-challenge
- Description: https://s2s-ai-challenge.github.io/
- Papers using this dataset:
- Paper:
- Code and data:
- Source data:
- Description:
- Papers using this dataset: