Skip to content

Commit

Permalink
add first version of mkdocs-based documentation
Browse files Browse the repository at this point in the history
  • Loading branch information
lonvia committed Sep 15, 2024
1 parent 2ed2b2b commit a6f7182
Show file tree
Hide file tree
Showing 38 changed files with 3,860 additions and 24 deletions.
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -4,3 +4,5 @@ contrib
osmium.egg-info
*.pyc
.ycm_extra_conf.py
.ipynb_checkpoints
docs/data/out
13 changes: 13 additions & 0 deletions docs/cookbooks.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
# Overview

This section presents practical examples and use cases, where pyosmium can
come in handy. Each cookbook comes with three sections:

* __Task__ explains what should be achieved
* __Quick Solution__ gives you a ready to use script that solves the task
* __Background__ explains in detail what happens in the script and why it
is done the way it is done. This should help you to write similar scripts
on your own.

All cookbooks are available as Jupiter notebooks in the
[pyosmium source tree](https://github.com/osmcode/pyosmium/tree/master/docs/cookbooks/)
283 changes: 283 additions & 0 deletions docs/cookbooks/Adding-Node-Infos-To-Boundaries.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,283 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "403b95b7-e1db-4fec-8f7b-9e384a84a621",
"metadata": {},
"source": [
"# Adding Place Node Information to Boundary Relations\n",
"\n",
"How to merge information from different OSM objects.\n",
"\n",
"## Task\n",
"\n",
"Administrative areas often represented with two different objects in OSM: a node describes the central point and a relation that contains all the ways that make up the boundary. The task is to find all administrative boundaries and their matching place nodes and output both togther in a geojson file. Relations and place nodes should be matched when they have the same wikidata tag.\n",
"\n",
"## Quick solution"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "46ba251f-d3ba-41c1-98a1-92f7a56a8128",
"metadata": {},
"outputs": [],
"source": [
"import osmium\n",
"from dataclasses import dataclass\n",
"import json"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "ac6ecdca-5ba6-485f-9253-88b8f73f5041",
"metadata": {},
"outputs": [],
"source": [
"@dataclass\n",
"class PlaceInfo:\n",
" id: int\n",
" tags: dict[str, str]\n",
" coords: tuple[float, float]\n",
"\n",
"geojsonfab = osmium.geom.GeoJSONFactory()\n",
"\n",
"class BoundaryHandler(osmium.SimpleHandler):\n",
" def __init__(self, outfile):\n",
" self.places = {}\n",
" self.outfile = outfile\n",
" # write the header of the geojson file\n",
" self.outfile.write('{\"type\": \"FeatureCollection\", \"features\": [')\n",
" # This is just to make sure, we place the commas on the right place.\n",
" self.delim = ''\n",
"\n",
" def finish(self):\n",
" self.outfile.write(']}')\n",
"\n",
" def node(self, n):\n",
" self.places[n.tags['wikidata']] = PlaceInfo(n.id, dict(n.tags), (n.location.lon, n.location.lat))\n",
" \n",
" def area(self, a):\n",
" # Find the corresponding place node\n",
" place = self.places.get(a.tags.get('wikidata', 'not found'), None)\n",
" # Geojsonfab creates a string with the geojson geometry.\n",
" # Convert to a Python object to make it easier to add data.\n",
" geom = json.loads(geojsonfab.create_multipolygon(a))\n",
" if geom:\n",
" # print the array delimiter, if necessary\n",
" self.outfile.write(self.delim)\n",
" self.delim = ','\n",
"\n",
" tags = dict(a.tags)\n",
" # add the place information to the propoerties\n",
" if place is not None:\n",
" tags['place_node:id'] = str(place.id)\n",
" tags['place_node:lat'] = str(place.coords[1])\n",
" tags['place_node:lon'] = str(place.coords[0])\n",
" for k, v in place.tags.items():\n",
" tags['place_node:tags:' + k] = v\n",
" # And wrap everything in proper GeoJSON.\n",
" feature = {'type': 'Feature', 'geometry': geom, 'properties': dict(tags)}\n",
" self.outfile.write(json.dumps(feature))\n",
"\n",
"# We are interested in boundary relations that make up areas and not in ways at all.\n",
"filters = [osmium.filter.KeyFilter('place').enable_for(osmium.osm.NODE),\n",
" osmium.filter.KeyFilter('wikidata').enable_for(osmium.osm.NODE),\n",
" osmium.filter.EntityFilter(~osmium.osm.WAY),\n",
" osmium.filter.TagFilter(('boundary', 'administrative')).enable_for(osmium.osm.AREA | osmium.osm.RELATION)]\n",
"\n",
"with open('../data/out/boundaries.geojson', 'w') as outf:\n",
" handler = BoundaryHandler(outf)\n",
" handler.apply_file('../data/liechtenstein.osm.pbf', filters=filters)\n",
" handler.finish()"
]
},
{
"cell_type": "markdown",
"id": "8f8b0c34-a744-4f02-adcb-04ce5ba8526f",
"metadata": {},
"source": [
"## Background\n",
"\n",
"Whenever you want to look at more than one OSM object at the time, you need to cache objects. Before starting such a task, it is always worth taking a closer look at the objects of interest. Find out how many candidates there are for you to look at and save and how large these objects are. There are always multiple ways to cache your data. Sometimes, when the number of candidates is really large, it is even more sensible to reread the file instead of caching the information.\n",
"\n",
"For the boundary problem, the calculation is relatively straightforward. Boundary relations are huge, so we do not want to cache them if it can somehow be avoided. That means we need to cache the place nodes. A quick look at [TagInfo](https://taginfo.openstreetmap.org/keys/place) tells us that there are about 7 million place nodes in the OSM planet. That is not a lot in the grand scheme of things. We could just read them all into memory and be done with it. It is still worth to take a closer look. The place nodes are later matched up by their `wikidata` tag. Looking into the [TagInfo combinations table](https://taginfo.openstreetmap.org/keys/place#combinations), only 10% of the place nodes have such a tag. That leaves 850.000 nodes to cache. Much better!\n",
"\n",
"Next we need to consider what information actually needs caching. In our case we want it all: the ID, the tags and the coordinates of the node. This information needs to be copied out of the node. You cannot just cache the entire node. Pyosmium won't let you do this because it wants to get rid of it as soon as the handler has seen it. Lets create a dataclass to receive the information we need:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "2e2c1a5d-f978-4626-96ce-6cb271ce5ccc",
"metadata": {},
"outputs": [],
"source": [
"@dataclass\n",
"class PlaceInfo:\n",
" id: int\n",
" tags: dict[str, str]\n",
" coords: tuple[float, float]"
]
},
{
"cell_type": "markdown",
"id": "17f2ad09-f865-4c43-9aec-43887a27338e",
"metadata": {},
"source": [
"This class can now be filled from the OSM file:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "b1ab37d1-5e1a-4627-9ad7-09c4f47ba3d0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"29 places cached.\n"
]
}
],
"source": [
"class PlaceNodeReader:\n",
"\n",
" def __init__(self):\n",
" self.places = {}\n",
"\n",
" def node(self, n):\n",
" self.places[n.tags['wikidata']] = PlaceInfo(n.id, dict(n.tags), (n.location.lon, n.location.lat))\n",
"\n",
"reader = PlaceNodeReader()\n",
"\n",
"osmium.apply('../data/liechtenstein.osm.pbf',\n",
" osmium.filter.KeyFilter('place').enable_for(osmium.osm.NODE),\n",
" osmium.filter.KeyFilter('wikidata').enable_for(osmium.osm.NODE),\n",
" reader)\n",
"\n",
"print(f\"{len(reader.places)} places cached.\")"
]
},
{
"cell_type": "markdown",
"id": "8060aea2-5417-4c20-864a-da289250be08",
"metadata": {},
"source": [
"We use the `osmium.apply()` function here with a handler instead of a FileProcessor. The two approaches are equivalent. Which one you choose, depends on your personal taste. FileProcessor loops are less verbose and quicker to write. Handlers tend to yield more readable code when you want to do very different things with the different kinds of objects.\n",
"\n",
"As you can see in the code, it is entirely possible to use filter functions with the apply() functions. In our case, the filters make sure that only objects pass which have a `place` tag _and_ a `wikidata` tag. This leaves exactly the objects we need already, so no further processing needed in the handler callback.\n",
"\n",
"Next the relations need to be read. Relations can be huge, so we don't want to cache them but write them directly out into a file. If we want to create a geojson file, then we need the geometry of the relation in geojson format. Getting geojson format itself is easy. Pyosmium has a converter built-in for this, the GeoJSONFactory:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "2f911347-8b21-4dbe-bd3a-b62b1224680b",
"metadata": {},
"outputs": [],
"source": [
"geojsonfab = osmium.geom.GeoJSONFactory()"
]
},
{
"cell_type": "markdown",
"id": "690af49a-de2e-4568-aecc-c64576112398",
"metadata": {},
"source": [
"The factory only needs to be instantiated once and can then be used globally.\n",
"\n",
"To get the polygon from a relation, the special area handler is needed. It is easiest to invoke by writing a SimpleHandler class with an `area()` callback. When `apply_file()` is called on the handler, it will take the necessary steps in the background to build the polygon geometries."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "5de0aef9-9662-4219-97cf-ca515cd16120",
"metadata": {},
"outputs": [],
"source": [
"class BoundaryHandler(osmium.SimpleHandler):\n",
" def __init__(self, places, outfile):\n",
" self.places = places\n",
" self.outfile = outfile\n",
" # write the header of the geojson file\n",
" self.outfile.write('{\"type\": \"FeatureCollection\", \"features\": [')\n",
" # This is just to make sure, we place the commas on the right place.\n",
" self.delim = ''\n",
"\n",
" def finish(self):\n",
" self.outfile.write(']}')\n",
"\n",
" def area(self, a):\n",
" # Find the corresponding place node\n",
" place = self.places.get(a.tags.get('wikidata', 'not found'), None)\n",
" # Geojsonfab creates a string with the geojson geometry.\n",
" # Convert to a Python object to make it easier to add data.\n",
" geom = json.loads(geojsonfab.create_multipolygon(a))\n",
" if geom:\n",
" # print the array delimiter, if necessary\n",
" self.outfile.write(self.delim)\n",
" self.delim = ','\n",
"\n",
" tags = dict(a.tags)\n",
" # add the place information to the propoerties\n",
" if place is not None:\n",
" tags['place_node:id'] = str(place.id)\n",
" tags['place_node:lat'] = str(place.coords[1])\n",
" tags['place_node:lon'] = str(place.coords[0])\n",
" for k, v in place.tags.items():\n",
" tags['place_node:tags:' + k] = v\n",
" # And wrap everything in proper GeoJSON.\n",
" feature = {'type': 'Feature', 'geometry': geom, 'properties': dict(tags)}\n",
" self.outfile.write(json.dumps(feature))\n",
"\n",
"# We are interested in boundary relations that make up areas and not in ways at all.\n",
"filters = [osmium.filter.EntityFilter(osmium.osm.RELATION | osmium.osm.AREA),\n",
" osmium.filter.TagFilter(('boundary', 'administrative'))]\n",
"\n",
"with open('../data/out/boundaries.geojson', 'w') as outf:\n",
" handler = BoundaryHandler(reader.places, outf)\n",
" handler.apply_file('../data/liechtenstein.osm.pbf', filters=filters)\n",
" handler.finish()"
]
},
{
"cell_type": "markdown",
"id": "28e12d41-ead3-47e8-8671-1fc46b764803",
"metadata": {},
"source": [
"There are two things you should keep in mind, when working with areas:\n",
"1. When the area handler is invoked, the input file is always read twice. The first pass checks the relations and find out which ways it contains. The second pass assembles all necessary ways and builds the geometries.\n",
"2. The area handler automatically enables caching of node locations. You don't need to worry about this when working with small files like our Liechtenstein example. For larger files of continent- or planet-size, the node cache can become quite large. You should read up about the alternative implementations that can write out the node cache on disk to save RAM.\n",
"\n",
"This is already it. In the long version, we have read the input file twice, once to get the nodes and in the second pass to get the relations. This is not really necessary because the nodes come always before the relations in the file. The quick solution shows how to combine both handlers to create the geojson file in a single pass. The only part to pay attention to is the use of filters. Given that we have very different filters for nodes and relations, it is important to call `enable_for()` with the correct OSM type."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Loading

0 comments on commit a6f7182

Please sign in to comment.