This repository has been archived by the owner on Apr 16, 2024. It is now read-only.

process geonames dump with logstash
panchicore committed Sep 1, 2017
1 parent 915318b commit 70c2da6
Showing 7 changed files with 1,149 additions and 0 deletions.
6 changes: 6 additions & 0 deletions .gitignore
@@ -0,0 +1,6 @@
allCountries.zip
allCountries.txt
.idea
*.pyc
*.DS_Store
environment_variables.sh
5 changes: 5 additions & 0 deletions README.md
@@ -24,3 +24,8 @@ index on your host machine.

This Geonames index is meant to be used with OEDA's [full text geoparse,
Mordecai](https://github.com/openeventdata/mordecai).

ES-Geonames with Logstash
=========================

This does the same work as the Python script, but lets the developer build the index from the pipeline defined in the file `logstash/logstash-pipeline.conf`. See the `logstash` folder.
75 changes: 75 additions & 0 deletions logstash/README.md
@@ -0,0 +1,75 @@
# ES-GEONAMES transform and load with logstash
This does the same work as the Python script, but lets the developer build the index from the pipeline defined in the file `logstash-pipeline.conf`. Besides sending the data to ES, this approach can also convert the Geonames CSV data to any [output](https://www.elastic.co/guide/en/logstash/current/output-plugins.html) we need. A JSON output sample is at the end of this readme. This method requires [Logstash to be installed](https://www.elastic.co/guide/en/logstash/current/installing-logstash.html).

Some field names were changed; see the grok filter:

`
%{INT:GeonamesId} %{DATA:Name} %{DATA:ASCIIName} %{DATA:AlternateNames} %{DATA:Latitude} %{DATA:Longitude} %{DATA:FeatureClass} %{DATA:FeatureCode} %{DATA:CountryCode} %{DATA:CountryCode2} %{DATA:Admin1Code} %{DATA:Admin2Code} %{DATA:Admin3Code} %{DATA:Admin4Code} %{DATA:Population} %{DATA:Elevation} %{DATA:DEM} %{DATA:Timezone} %{GREEDYDATA:ModificationDate}
`

The pipeline also improves the type of some fields:

- `AlternateNames`: transformed into a list from "alternatenames: comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)"
- `CountryCode2`: likewise transformed from "cc2 : alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters"
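
As a rough illustration of what these conversions amount to, the Python sketch below splits one Geonames line into the field names used by the grok filter and turns the two comma-separated fields into lists. It assumes the dump is tab-delimited with 19 columns, and that both `AlternateNames` and `CountryCode2` end up as lists. `parse_geonames_line` is a hypothetical helper written for this readme; it is not part of the repository or of the logstash pipeline.

```python
# Field names in the same order as the grok filter above.
FIELD_NAMES = [
    "GeonamesId", "Name", "ASCIIName", "AlternateNames", "Latitude",
    "Longitude", "FeatureClass", "FeatureCode", "CountryCode",
    "CountryCode2", "Admin1Code", "Admin2Code", "Admin3Code", "Admin4Code",
    "Population", "Elevation", "DEM", "Timezone", "ModificationDate",
]


def parse_geonames_line(line):
    """Hypothetical helper: map one tab-separated line to a typed dict."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELD_NAMES):
        raise ValueError(
            "expected %d columns, got %d" % (len(FIELD_NAMES), len(values))
        )
    record = dict(zip(FIELD_NAMES, values))

    # Comma-separated strings become lists; empty strings become empty lists.
    for key in ("AlternateNames", "CountryCode2"):
        record[key] = [item for item in record[key].split(",") if item]

    # Numeric fields that the index mapping declares as integer or double.
    record["GeonamesId"] = int(record["GeonamesId"])
    record["Latitude"] = float(record["Latitude"])
    record["Longitude"] = float(record["Longitude"])
    return record
```

The sketch leaves every other column as a string, since the readme does not say how the pipeline types them.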


It also adds some new fields. The first two are based on the codes list at http://www.geonames.org/export/codes.html:

- `FeatureClassName`: the description of the `FeatureClass`
- `FeatureCodeName`: the description of the `FeatureCode`
- `CountryCode3`: based on the dict from the script `geonames_elasticsearch_loader.py`
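
These derived fields are essentially dictionary lookups. The Python sketch below shows the idea under loud assumptions: the three tables are tiny hypothetical excerpts holding only the values visible in the JSON output sample at the end of this readme (where the derived fields appear as `FeatureClassName`, `FeatureCodeName` and `CountryCode3`), and `add_lookup_fields` is an illustrative helper, not the pipeline's actual filter.

```python
# Hypothetical excerpts; the real tables cover every code and country.
FEATURE_CLASS_NAMES = {"T": "mountain, hill, rock, etc"}
FEATURE_CODE_NAMES = {"MT": "mountain"}
COUNTRY_CODE3 = {"AF": "AFG"}

# (source field, derived field, lookup table)
LOOKUPS = (
    ("FeatureClass", "FeatureClassName", FEATURE_CLASS_NAMES),
    ("FeatureCode", "FeatureCodeName", FEATURE_CODE_NAMES),
    ("CountryCode", "CountryCode3", COUNTRY_CODE3),
)


def add_lookup_fields(record):
    """Return a copy of record with derived fields added when a code is known."""
    enriched = dict(record)
    for source, target, table in LOOKUPS:
        value = table.get(record.get(source))
        if value is not None:
            enriched[target] = value
    return enriched
```

Unknown codes are simply skipped here, so a record never gets an empty derived field; whether the pipeline behaves the same way is not stated in this readme.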

### Usage

In the logstash folder:
- Copy `environment_variables.sample.sh` to `environment_variables.sh` and set your environment vars in it.
- Run the index creator: `sh create_index.sh`
- Start logstash: `logstash -f logstash-pipeline.conf`. Startup takes roughly 5 to 10 minutes because the pipeline is big _(it should take less anyway; elastic should be improving this)_, but once it starts, the ingest finishes in just a few minutes (about 2K per second on my i7 8G machine). The output log will look like `2017-09-01T12:09:14.354Z %{host} %{message}`, which serves as info output so you know when it finishes.
- Done, enjoy


### Todo
- Fine tune the mappings
- Load [premium data](http://www.geonames.org/products/premium-data-polygons.html); for example, places as polygons are very interesting data to load as [geo-shapes](https://www.elastic.co/guide/en/elasticsearch/reference/5.2/geo-shape.html).
- Dockerize
- Publish some Kibana stats


### JSON output sample
If you are looking for JSON output, just uncomment the `# codec => json` line under output > stdout. Sample:
```javascript
{
"Timezone":"Asia/Kabul",
"ASCIIName":"Pushtah-ye Amir Kushtah'i",
"Latitude":35.24667,
"FeatureCode":"MT",
"type":"place",
"AlternateNames":[
"Poshteh-ye Amirkoshteh'i",
"Poshteh-ye Amīrkoshteh’ī",
"Pushtah-ye Amir Kushtah'i",
"Pushtah-ye Amīr Kushtah’ī",
"Pusta-i-Amirkusta'i",
"Pusta-i-Amīṟkusta’i",
"پشتۀ امیر کشته ئی"
],
"Longitude":64.67254,
"FeatureClass":"T",
"DEM":"1847",
"Name":"Pushtah-ye Amīr Kushtah’ī",
"GeonamesId":1424592,
"@timestamp":"2017-09-01T09:17:31.824Z",
"FeatureClassName":"mountain, hill, rock, etc",
"ModificationDate":"2012-01-19T00:00:00.000Z",
"FeatureCodeName":"mountain",
"@version":"1",
"Admin1Code":"07",
"Population":0,
"location":{
"lon":"64.67254",
"lat":"35.24667"
},
"CountryCode":"AF",
"CountryCode3":"AFG"
}
```
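
The `location` object in the sample is what the `geo_point` mapping indexes. A minimal Python sketch of how such an object can be derived from `Latitude` and `Longitude` follows; `with_location` is a hypothetical helper, and keeping the coordinates as strings is only an assumption made to mirror the sample above.

```python
def with_location(record):
    """Return a copy of record with a lat/lon object built from its coordinates."""
    enriched = dict(record)
    enriched["location"] = {
        "lon": str(record["Longitude"]),
        "lat": str(record["Latitude"]),
    }
    return enriched
```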
17 changes: 17 additions & 0 deletions logstash/create_index.sh
@@ -0,0 +1,17 @@
echo "Setting env vars..."
# Load the user's copy of the sample file; "." also works when run with plain sh
. ./environment_variables.sh

echo "Downloading Geonames gazetteer..."
#wget http://download.geonames.org/export/dump/allCountries.zip
echo "Unpacking Geonames gazetteer..."
#unzip allCountries.zip

#echo "Starting Docker container and data volume..."
#sudo docker run -d -p 127.0.0.1:9200:9200 -v $PWD/geonames_index/:/usr/share/elasticsearch/data elasticsearch:5.1.2
#sleep 10s

echo "Creating mappings for the fields in the Geonames index:"
echo "${ES_GEONAMES_HOST}${ES_GEONAMES_INDEX}"
curl -XPUT "${ES_GEONAMES_HOST}${ES_GEONAMES_INDEX}" -H 'Content-Type: application/json' -d @geonames_mapping.json

echo "Done"
9 changes: 9 additions & 0 deletions logstash/environment_variables.sample.sh
@@ -0,0 +1,9 @@
#!/usr/bin/env bash

export ES_GEONAMES_HOST=http://localhost:9200/
export ES_GEONAMES_USER=elastic
export ES_GEONAMES_PASSWORD=changeme
export ES_GEONAMES_INDEX=geonames2

# Don't modify this if you download the files to the default location:
export ES_GEONAMES_FILE="$(pwd)/allCountries.txt"
24 changes: 24 additions & 0 deletions logstash/geonames_mapping.json
@@ -0,0 +1,24 @@
{
"mappings": {
"place": {
"properties": {
"ModificationDate": {
"type": "date",
"format": "yyyy-MM-dd'T'HH:mm:ss.SSSZ"
},
"Latitude": {
"type": "double"
},
"Longitude": {
"type": "double"
},
"location": {
"type": "geo_point"
},
"GeonamesId": {
"type": "integer"
}
}
}
}
}