Add osm tag file parsing #23

newsch · 2023-08-03T13:57:54Z

Parse wikipedia and wikidata tags from a tsv file of OSM tags,
compatible with the --csv output of osmconvert.

Closes #19.

Notes from parsing all planet wikipedia/wikidata tags:

6 UTF-8 TSV parsing errors
585 tag parsing errors
- 511 titles
  - no lang
  - ;/း instead of :
  - qid instead of title
- 74 qids
  - wikipedia title instead of qid
  - Q123;Q124
  - (Q123)
  - Warfstermolen (Q1866088)
  - yandex urls
  - amenity
  - coordinates

There are 50 wikipedia entries with url escaping, some are urls instead of titles and not handled correctly.

Remaining work:

add url checks in title parsing, match generator url handling (and handle mobile website urls)
~~serialize parse errors to disk for changes~~ Add structured parse errors (see Triage OSM tag errors #25 for rest), log summary
Use osmpbf crate to parse planet file to fix osmconvert truncation problem
Merge Add script for running with map generator #21 and update run.sh to use new method

biodranik · 2023-08-03T17:49:41Z

Is there an invalid UTF in OpenStreetMap.org? Any examples? Can it be fixed?

It would be great to fix errors in OpenStreetMap.org directly.

On the other hand, it is important to log errors with OpenStreetMap.org data and report them to osmers or fix ourselves, while producing robust output.

newsch · 2023-08-03T18:23:28Z

I'm not sure yet if it's in the OSM db or something that got messed up by osmium/osmconvert, and I can't copy/paste them here, but I attached the relevant lines here:
planet-wikipedia-utf8-errors.csv

CSV parse error: record 602681 (line 602682, field: 6, byte: 48946539): invalid utf-8: invalid UTF-8 in field 6 near byte index 25
CSV parse error: record 1326111 (line 1326112, field: 4, byte: 102037202): invalid utf-8: invalid UTF-8 in field 4 near byte index 25
CSV parse error: record 1610617 (line 1610618, field: 6, byte: 118698840): invalid utf-8: invalid UTF-8 in field 6 near byte index 25
CSV parse error: record 1778686 (line 1778687, field: 6, byte: 128580834): invalid utf-8: invalid UTF-8 in field 6 near byte index 25
CSV parse error: record 2279654 (line 2279713, field: 6, byte: 158513245): invalid utf-8: invalid UTF-8 in field 6 near byte index 25
CSV parse error: record 2723754 (line 2723813, field: 4, byte: 186764436): invalid utf-8: invalid UTF-8 in field 4 near byte index 254

I'll add an option to output the errors in a structured way that we can deal with. Some are straightforward fixes, others might require someone with local knowledge that we can leave notes on.

biodranik · 2023-08-03T18:35:57Z

Thanks! Could it be some local locale issue? What is your terminal locale/code page? Does it support non-US ones? Or it could be some osmium issue.

There is nothing wrong with the output:
https://www.openstreetmap.org/api/0.6/node/1648348163
https://www.openstreetmap.org/node/1648348163

https://www.openstreetmap.org/api/0.6/way/53580264

https://www.openstreetmap.org/api/0.6/relation/7032757

newsch · 2023-08-03T18:55:23Z

My terminal is set to en_US.UTF-8, but it went straight from osmium -> osmconvert -> file.
I have an outdated planet file, I'll extract the full objects from there and see.

newsch · 2023-08-03T19:17:18Z

osmium outputs it fine, but osmconvert doesn't.
I think it's a truncation issue, they're all around 257 bytes long, the error is at the end, and osmconvert has a fixed buffer of length 256 for keys/values.

It looks like these aren't titles but notes? or copied text from the articles? Wikipedia has a limit of 255 characters in titles.

biodranik · 2023-08-03T19:38:35Z

Good catch! You can patch osmconvert to increase the size, or try native rust approach.

biodranik · 2023-08-03T19:47:48Z

The link above says 255 urf-8 bytes in title, not 255 characters.

You may try to osmupdate your planet (don't forget to drop authors and history).

newsch · 2023-08-04T19:13:15Z

Sorry, I misspoke - yes, 255 bytes of utf-8. I'll try the rust parser you used.

biodranik

Such a big PR! It would be easier to do it in parts.

How does simplification output look now?

Can we run and test it?

benches/id_parsing.rs

newsch · 2023-08-09T19:15:15Z

Such a big PR! It would be easier to do it in parts.

I will try to break it up into better commits and let you know - some of it is refactoring that isn't helpful to see together.

How does simplification output look now?

This doesn't touch simplification, I am doing that next. I'll add my updates to #4 and open a PR with the changes.

Can we run and test it?

Sure! I've been testing it as I go, and run.sh is updated to extract the tags from a pbf file. The new format is documented in the script and the README, the TL;DR is that now you pass a pbf file as well:

./run.sh /tmp/output/ ~/Downloads/planet-latest.osm.pbf ~/Downloads/enwiki-NS0-20230401-ENTERPRISE-HTML.json.tar.gz ...

Parse wikipedia and wikidata tags from a tsv file of OSM tags, compatible with the "--csv" output of `osmconvert`. Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>

- Use CLI subcommands (e.g. `om-wikiparser get-articles`) - Move article processing into a separate module - Convert simplify helper from separate binary to subcommand Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>

- Use rayon and osmpbf crates, output intermediate TSV file in the same format as osmconvert, for use with the new `--osm-tags` flag. - Number of threads spawned can be configured with `--procs` flag. - Replace all wikidata id references with QID. - Update script and documentation to use new subcommands. - run.sh now expects a pbf file to extract tags from. Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>

- Check for urls in osm tags - Handle mobile urls Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>

- Move Qid and Title to separate modules - Reformat benchmark Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>

- Add custom error types with `thiserror` crate in preparation for #25. - Parsing errors are captured instead of logged to `warn` by default. - All parsing errors are still logged to `debug` level. - If >= 0.02% of tags can't be parsed, an error is logged. - TSV line errors are always logged as errors. - I/O errors will fail instead of be logged. Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>

newsch · 2023-08-09T20:38:19Z

I think it will be easiest to review the remaining commits individually. They're reasonably separated and each has a meaningful commit message.

These contain meaningful changes:

Add new option to parse osm tag file (+114/-11)
Extract tags in parallel in rust (+559/-97)
Improve url handling (+26/-2)
Structure parse errors and only log warning if above threshold (+155/-40)

These move things around and don't change meaningful functionality:

Refactor into subcommands (+305/-277)
Refactor and rename title/qid wrappers (+211/-207)

biodranik

Thanks for clear commits and their descriptions!

biodranik · 2023-08-09T21:29:57Z

src/wm/title.rs

+            .ok_or_else(|| anyhow!("Expected subdomain"))?;
+        let host = host.strip_prefix("m.").unwrap_or(host);
+        if host != "wikipedia.org" {
+            bail!("Expected wikipedia.org for domain")


Does it make sense to print wrong hosts in a log to fix/support them?

They are caught at a higher level and logged/saved with the full string

src/wm/title.rs

biodranik · 2023-08-09T21:33:40Z

src/get_articles.rs

+        if !line_errors.is_empty() {
+            let error_count = line_errors.len();
+            let new_items = wikidata_qids.len() + wikipedia_titles.len() - original_items;
+            let expected_threshold = 0.02;


What is the benefit of hiding errors under a threshold? Isn't it beneficial to see all errors and be able to estimate/compare the quality of the dump, and to easily grep/find what is most important, or feed the whole log to contributors for fixes?

The threshold only determines if the message is info vs error level.
When you use the run.sh script with multiple languages it prints a copy of the hundreds of errors for each language.
I think writing the parse errors to a file separately will be easier to read and deal with.

I'm open to other ideas.

newsch force-pushed the osm-tags branch from 7aede06 to 45f4548 Compare August 8, 2023 17:43

newsch mentioned this pull request Aug 9, 2023

Triage OSM tag errors #25

Open

6 tasks

newsch marked this pull request as ready for review August 9, 2023 18:22

newsch requested a review from biodranik August 9, 2023 18:31

newsch added this to the v0.2 milestone Aug 9, 2023

biodranik reviewed Aug 9, 2023

View reviewed changes

benches/id_parsing.rs Outdated Show resolved Hide resolved

benches/id_parsing.rs Outdated Show resolved Hide resolved

benches/id_parsing.rs Outdated Show resolved Hide resolved

newsch force-pushed the osm-tags branch 2 times, most recently from 9997f98 to 48aae6a Compare August 9, 2023 19:53

newsch added 6 commits August 9, 2023 16:01

Add new option to parse osm tag file

a2c113a

Parse wikipedia and wikidata tags from a tsv file of OSM tags, compatible with the "--csv" output of `osmconvert`. Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>

Refactor into subcommands

0ac935c

- Use CLI subcommands (e.g. `om-wikiparser get-articles`) - Move article processing into a separate module - Convert simplify helper from separate binary to subcommand Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>

Improve url handling

2532d13

- Check for urls in osm tags - Handle mobile urls Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>

Refactor and rename title/qid wrappers

29cdbe2

- Move Qid and Title to separate modules - Reformat benchmark Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>

newsch force-pushed the osm-tags branch from 48aae6a to b250dd4 Compare August 9, 2023 20:16

biodranik approved these changes Aug 9, 2023

View reviewed changes

newsch merged commit 941d2b1 into main Aug 10, 2023

newsch deleted the osm-tags branch August 10, 2023 13:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add osm tag file parsing #23

Add osm tag file parsing #23

newsch commented Aug 3, 2023 •

edited

Loading

biodranik commented Aug 3, 2023

newsch commented Aug 3, 2023

biodranik commented Aug 3, 2023

newsch commented Aug 3, 2023

newsch commented Aug 3, 2023

biodranik commented Aug 3, 2023

biodranik commented Aug 3, 2023

newsch commented Aug 4, 2023

biodranik left a comment

newsch commented Aug 9, 2023

newsch commented Aug 9, 2023 •

edited

Loading

biodranik left a comment

biodranik Aug 9, 2023

newsch Aug 9, 2023

biodranik Aug 9, 2023

newsch Aug 9, 2023

Add osm tag file parsing #23

Add osm tag file parsing #23

Conversation

newsch commented Aug 3, 2023 • edited Loading

biodranik commented Aug 3, 2023

newsch commented Aug 3, 2023

biodranik commented Aug 3, 2023

newsch commented Aug 3, 2023

newsch commented Aug 3, 2023

biodranik commented Aug 3, 2023

biodranik commented Aug 3, 2023

newsch commented Aug 4, 2023

biodranik left a comment

Choose a reason for hiding this comment

newsch commented Aug 9, 2023

newsch commented Aug 9, 2023 • edited Loading

biodranik left a comment

Choose a reason for hiding this comment

biodranik Aug 9, 2023

Choose a reason for hiding this comment

newsch Aug 9, 2023

Choose a reason for hiding this comment

biodranik Aug 9, 2023

Choose a reason for hiding this comment

newsch Aug 9, 2023

Choose a reason for hiding this comment

newsch commented Aug 3, 2023 •

edited

Loading

newsch commented Aug 9, 2023 •

edited

Loading