Quentin Mazars-Simon edited this page May 16, 2014 · 14 revisions

Dataset

Download it here: http://burnbit.com/download/276314/enwiki_20140203_pages_articles_xml_bz2 (⚠️ 40 GB uncompressed)

On the server it is in /data/team11/wikipedia/

The filtered dataset is available here.

To test on a smaller dataset, export a few articles/categories from this page.

Remarks

There is no easy way to get all content related to weather events. Categories may loop, and subcategories may drift to unrelated topics (e.g. drought -> drought resistant plants, floods in Spain -> geography of Spain, etc.).
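One way to cope with looping categories is to walk the category graph breadth-first with a visited set and a depth limit. The sketch below is only an illustration: `collect_subcategories`, the `max_depth` parameter and the toy graph are hypothetical, and it assumes the category graph is already available as a plain dict mapping each category to its subcategories. The depth limit reduces drift to unrelated topics but does not eliminate it.

```python
from collections import deque


def collect_subcategories(graph, root, max_depth=2):
    """Breadth-first walk of a category graph that tolerates loops.

    graph: dict mapping a category name to a list of subcategory names
           (assumed to be built beforehand; not part of the dump itself).
    """
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand categories at the depth limit
        for child in graph.get(category, ()):
            if child not in seen:  # the visited set breaks category loops
                seen.add(child)
                queue.append((child, depth + 1))
    return seen


# Hypothetical toy graph containing a loop and a drifting branch.
TOY_GRAPH = {
    "Droughts": ["Droughts in Africa", "Drought-tolerant plants"],
    "Droughts in Africa": ["Droughts"],
    "Drought-tolerant plants": ["Cacti"],
}
```

With `max_depth=1` the walk keeps the direct subcategories of "Droughts" and never reaches "Cacti"; with a larger limit it still terminates despite the loop.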

Using it

Steps:
1 - Filter articles, keeping only those that belong to a relevant category.
2 - Extract information from them.
3 - Store the information in a predefined format.

Relevant categories/keywords

  • blizzard (cold++, wind++)
  • cyclone / hurricane / typhoon (wind++, rain++)
  • derecho (wind++)
  • drought (temperature++, rain--)
  • nor'easter (temperature--, wind++)
  • storm (wind++, rain++)
  • tornado (wind++)
  • wave (heat / cold) (temperature++ / temperature--)
  • weather event
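The keyword list above could drive the filtering step roughly as follows. This is a minimal sketch, assuming article categories appear in the wikitext as `[[Category:Name]]` or `[[Category:Name|sort key]]` links; the function names are hypothetical, and "heat wave" / "cold wave" are one reading of the "wave (heat / cold)" entry. Matching is a case-insensitive substring test, so expect some false positives.

```python
import re

# Keywords taken from the list above (lower-cased for matching).
KEYWORDS = (
    "blizzard", "cyclone", "hurricane", "typhoon", "derecho", "drought",
    "nor'easter", "storm", "tornado", "heat wave", "cold wave",
    "weather event",
)

# Matches [[Category:Name]] and [[Category:Name|sort key]].
CATEGORY_LINK = re.compile(
    r"\[\[Category:([^\]|]+)(?:\|[^\]]*)?\]\]", re.IGNORECASE
)


def article_categories(wikitext):
    """Return the category names linked from an article's wikitext."""
    return [name.strip() for name in CATEGORY_LINK.findall(wikitext)]


def matching_keywords(wikitext):
    """Return the set of keywords found in any of the article's categories."""
    found = set()
    for category in article_categories(wikitext):
        lowered = category.lower()
        found.update(k for k in KEYWORDS if k in lowered)
    return found


def is_relevant(wikitext):
    """True if at least one category contains a weather keyword."""
    return bool(matching_keywords(wikitext))
```

For instance, an article tagged `[[Category:European windstorms]]` matches through the "storm" keyword, while one tagged only `[[Category:Geography of Spain]]` is dropped.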

Output format

A tab-separated CSV file with the following columns: Title, Category, Start date, End date, Location(s). For example: 2003 European heat wave \t Heat Wave \t 1 June 2003 \t 31 August 2003 \t Western Europe
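Writing this format with Python's standard `csv` module might look like the sketch below. The `write_events` helper is hypothetical, and emitting a header row is an assumption, since the format description does not say whether one is expected.

```python
import csv
import io

COLUMNS = ["Title", "Category", "Start date", "End date", "Location(s)"]


def write_events(rows, out):
    """Write event dicts to `out` as tab-separated lines.

    A header row is written first (an assumption, not required by the format).
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for row in rows:
        writer.writerow([row[column] for column in COLUMNS])


# Example row taken from the format description above.
EXAMPLE = {
    "Title": "2003 European heat wave",
    "Category": "Heat Wave",
    "Start date": "1 June 2003",
    "End date": "31 August 2003",
    "Location(s)": "Western Europe",
}

buffer = io.StringIO()
write_events([EXAMPLE], buffer)
```

Using the `csv` module rather than joining strings by hand means a field that happens to contain a tab or a quote is still written safely.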

Useful links

  • Wikihadoop: provides a streaming interface to the wiki dump (works on the compressed file, low memory use). How to use it (too outdated to be useful; use XMLInputFormat instead)
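For quick local tests on a small export, the same streaming idea can be sketched in plain Python instead of Hadoop. This is not Wikihadoop or XMLInputFormat, just a standard-library stand-in: `iter_pages` and `iter_dump` are hypothetical names, and the code assumes each `<page>` element contains a `<title>` and a `<text>` element, as in MediaWiki XML exports.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each <page> read from a binary XML stream."""
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # drop the page's children to keep memory low


def iter_dump(path):
    """Stream pages straight from a .bz2 dump without unpacking it to disk."""
    with bz2.open(path, "rb") as stream:
        yield from iter_pages(stream)
```

Pages are produced one at a time, so the uncompressed dump never has to fit in memory or on disk.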

Todo

  • List relevant categories. This page is a good starting point (tip: type $('span[title="expand"]').click(); in the JavaScript console to quickly expand categories). The list could be more generic, i.e. only keywords that should appear in the article's category (e.g. "Storm" will automatically include "European windstorms", etc.).
  • Extract information from the title when possible (e.g. 2003 European heat wave), from the article body when necessary (e.g. List of European tornadoes in 2012), or from the infobox if available (e.g. Hurricane Katrina).
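The title and infobox cases could start from something like the sketch below. Both helpers are hypothetical and deliberately naive: `year_from_title` only looks for a four-digit year between 1500 and 2099, and `infobox_fields` reads every `| key = value` line in the wikitext, so nested templates or multi-line values will confuse it. The field names in the sample are illustrative and not checked against any real infobox template.

```python
import re

# Four-digit years from 1500 to 2099, standing alone as a word.
YEAR = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")

# One "| key = value" template parameter per line.
FIELD = re.compile(
    r"^[ \t]*\|[ \t]*([^=|\n]+?)[ \t]*=[ \t]*(.*?)[ \t]*$", re.MULTILINE
)


def year_from_title(title):
    """Return the first year mentioned in an article title, or None."""
    match = YEAR.search(title)
    return int(match.group(1)) if match else None


def infobox_fields(wikitext):
    """Return a dict of lower-cased parameter names to raw string values."""
    return {key.lower(): value for key, value in FIELD.findall(wikitext)}


# Illustrative sample; the parameter names are assumptions.
SAMPLE_INFOBOX = """{{Infobox hurricane
| Name = Example storm
| Formed = August 23, 2005
| Dissipated=August 30, 2005
| Damages =
}}"""
```

Titles such as "2003 European heat wave" and "List of European tornadoes in 2012" yield a year directly, while "Hurricane Katrina" returns `None` and has to fall back to the infobox or the article body.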