RollerEtAl_EMNLP2012

jasonbaldridge edited this page Jul 16, 2012 · 6 revisions

Supervised Text-based Geolocation Using Language Models on an Adaptive Grid

This page explains the process of replicating the results of:

Stephen Roller, Michael Speriosu, Sarat Rallapalli, Benjamin Wing and Jason Baldridge. Supervised Text-based Geolocation Using Language Models on an Adaptive Grid. EMNLP 2012. Jeju, Korea.

Getting the code

The first step is to get the code. Check out or download it from:

 https://github.com/utcompling/textgrounder/commits/emnlp-release-candidate-same-results

Setting things up

You'll need to set up your environment as per the directions (steps 2-3 in README.txt). Specifically, you must set the $TEXTGROUNDER_DIR variable to the root of the textgrounder source code, and add $TEXTGROUNDER_DIR/bin to your $PATH variable.
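As a minimal sketch of the two settings described above, assuming a Bourne-style shell and assuming the source was unpacked to $HOME/textgrounder (a hypothetical location; substitute the path of your own checkout):

```shell
# Assumption: the textgrounder source lives in $HOME/textgrounder.
# Replace this with the actual root of your checkout.
export TEXTGROUNDER_DIR="$HOME/textgrounder"

# Put the textgrounder launcher scripts on the search path.
export PATH="$TEXTGROUNDER_DIR/bin:$PATH"
```

Adding these lines to your shell startup file (for example ~/.bashrc) keeps the settings across sessions.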

Getting the data

Next you'll need the data. For Geotext and Wikipedia, follow step 4 in README.txt.

For the UtGeo data set, follow the README.txt in

http://www.cs.utexas.edu/~roller/research/kd/corpus/

As that README suggests, you are strongly encouraged to contact the first author ([email protected]) when you begin this process, as obtaining the full data set may be difficult.

Compiling

Run textgrounder build-all from the $TEXTGROUNDER_DIR directory.
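Before building, it can help to confirm that the environment from the setup step is in place. The following is only a sketch: check_setup is a hypothetical helper written for this page, not part of textgrounder, and the build is attempted only if the check passes.

```shell
# Hypothetical helper (not part of textgrounder): verify the environment
# before attempting the build.
check_setup() {
  if [ -z "${TEXTGROUNDER_DIR:-}" ]; then
    echo "TEXTGROUNDER_DIR is not set" >&2
    return 1
  fi
  if ! command -v textgrounder >/dev/null 2>&1; then
    echo "textgrounder is not on PATH" >&2
    return 1
  fi
  return 0
}

# Build from the source root only when the check succeeds.
if check_setup; then
  (cd "$TEXTGROUNDER_DIR" && textgrounder build-all)
fi
```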

Running

To run the program, use a command of the following form:

$ textgrounder -memory 30g geolocate-document --corpus $PATH_TO_CORPUS/$CORPUS_NAME --kd --kdbs $BUCKET_SIZE --kdsm (median|halfway) --cm (center|centroid) --eval-set (dev|test)

where --kdbs sets the KD tree bucket size, and the --kdsm values median and halfway correspond to the Friedman and Midpoint methods of splitting, respectively.

For example, to run on UtGeo large and evaluate on the dev set, using a KD tree with a bucket size of 500, Friedman splitting, and centroid cell prediction, I use:

$ textgrounder -memory 30g geolocate-document --corpus $SCRATCH/corpora/utgeo-large --kd --kdbs 500 --kdsm median --cm centroid --eval-set dev

The exact settings will depend on your setup and on which method you wish to test.
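If you want to compare several configurations, the command can be generated in a loop. The sketch below is a dry run that only prints the commands, so you can inspect them before executing anything. The helper build_cmd, the CORPUS variable, the placeholder corpus path, and the bucket sizes other than 500 are assumptions made for illustration; they are not settings prescribed by the paper.

```shell
# Assumption: CORPUS points at your local copy of the corpus.
# The default below is a placeholder path, not a real location.
CORPUS="${CORPUS:-/path/to/corpora/utgeo-large}"

# Hypothetical helper: print the command for one configuration.
# $1 = bucket size, $2 = split method (median|halfway),
# $3 = cell prediction method (center|centroid)
build_cmd() {
  echo "textgrounder -memory 30g geolocate-document --corpus $CORPUS --kd --kdbs $1 --kdsm $2 --cm $3 --eval-set dev"
}

# Dry run: bucket sizes 100 and 1000 are illustrative values only.
for kdbs in 100 500 1000; do
  for kdsm in median halfway; do
    build_cmd "$kdbs" "$kdsm" centroid
  done
done
```

Once the printed commands look right, they can be run one at a time; each run uses the 30g memory setting from the example above, so running them sequentially is the safer choice.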

Support

Please contact Stephen Roller ([email protected]) with any questions about replicating the results. This program can take some effort to get up and running, so please feel free to ask for help.