Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add osm tag file parsing #23

Merged
merged 6 commits into from
Aug 10, 2023
Merged

Add osm tag file parsing #23

merged 6 commits into from
Aug 10, 2023

Conversation

newsch
Copy link
Collaborator

@newsch newsch commented Aug 3, 2023

Parse wikipedia and wikidata tags from a tsv file of OSM tags,
compatible with the --csv output of osmconvert.

Closes #19.

Notes from parsing all planet wikipedia/wikidata tags:

  • 6 UTF-8 TSV parsing errors
  • 585 tag parsing errors
    • 511 titles
      • no lang
      • ;/ instead of :
      • qid instead of title
    • 74 qids
      • wikipedia title instead of qid
      • Q123;Q124
      • (Q123)
      • Warfstermolen (Q1866088)
      • yandex urls
      • amenity
      • coordinates

There are 50 wikipedia entries with url escaping, some are urls instead of titles and not handled correctly.

Remaining work:

  • add url checks in title parsing, match generator url handling (and handle mobile website urls)
  • serialize parse errors to disk for changes Add structured parse errors (see Triage OSM tag errors #25 for rest), log summary
  • Use osmpbf crate to parse planet file to fix osmconvert truncation problem
  • Merge Add script for running with map generator #21 and update run.sh to use new method

@biodranik
Copy link
Member

Is there an invalid UTF in OpenStreetMap.org? Any examples? Can it be fixed?

It would be great to fix errors in OpenStreetMap.org directly.

On the other hand, it is important to log errors with OpenStreetMap.org data and report them to osmers or fix ourselves, while producing robust output.

@newsch
Copy link
Collaborator Author

newsch commented Aug 3, 2023

I'm not sure yet if it's in the OSM db or something that got messed up by osmium/osmconvert, and I can't copy/paste them here, but I attached the relevant lines here:
planet-wikipedia-utf8-errors.csv

CSV parse error: record 602681 (line 602682, field: 6, byte: 48946539): invalid utf-8: invalid UTF-8 in field 6 near byte index 25
CSV parse error: record 1326111 (line 1326112, field: 4, byte: 102037202): invalid utf-8: invalid UTF-8 in field 4 near byte index 25
CSV parse error: record 1610617 (line 1610618, field: 6, byte: 118698840): invalid utf-8: invalid UTF-8 in field 6 near byte index 25
CSV parse error: record 1778686 (line 1778687, field: 6, byte: 128580834): invalid utf-8: invalid UTF-8 in field 6 near byte index 25
CSV parse error: record 2279654 (line 2279713, field: 6, byte: 158513245): invalid utf-8: invalid UTF-8 in field 6 near byte index 25
CSV parse error: record 2723754 (line 2723813, field: 4, byte: 186764436): invalid utf-8: invalid UTF-8 in field 4 near byte index 254

I'll add an option to output the errors in a structured way that we can deal with. Some are straightforward fixes, others might require someone with local knowledge that we can leave notes on.

@biodranik
Copy link
Member

Thanks! Could it be some local locale issue? What is your terminal locale/code page? Does it support non-US ones? Or it could be some osmium issue.

There is nothing wrong with the output:
https://www.openstreetmap.org/api/0.6/node/1648348163
https://www.openstreetmap.org/node/1648348163

https://www.openstreetmap.org/api/0.6/way/53580264

https://www.openstreetmap.org/api/0.6/relation/7032757

@newsch
Copy link
Collaborator Author

newsch commented Aug 3, 2023

My terminal is set to en_US.UTF-8, but it went straight from osmium -> osmconvert -> file.
I have an outdated planet file, I'll extract the full objects from there and see.

@newsch
Copy link
Collaborator Author

newsch commented Aug 3, 2023

osmium outputs it fine, but osmconvert doesn't.
I think it's a truncation issue, they're all around 257 bytes long, the error is at the end, and osmconvert has a fixed buffer of length 256 for keys/values.

It looks like these aren't titles but notes? or copied text from the articles? Wikipedia has a limit of 255 characters in titles.

@biodranik
Copy link
Member

Good catch! You can patch osmconvert to increase the size, or try native rust approach.

@biodranik
Copy link
Member

The link above says 255 urf-8 bytes in title, not 255 characters.

You may try to osmupdate your planet (don't forget to drop authors and history).

@newsch
Copy link
Collaborator Author

newsch commented Aug 4, 2023

Sorry, I misspoke - yes, 255 bytes of utf-8. I'll try the rust parser you used.

@newsch newsch mentioned this pull request Aug 9, 2023
6 tasks
@newsch newsch marked this pull request as ready for review August 9, 2023 18:22
@newsch newsch requested a review from biodranik August 9, 2023 18:31
@newsch newsch added this to the v0.2 milestone Aug 9, 2023
Copy link
Member

@biodranik biodranik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Such a big PR! It would be easier to do it in parts.

How does simplification output look now?

Can we run and test it?

benches/id_parsing.rs Outdated Show resolved Hide resolved
benches/id_parsing.rs Outdated Show resolved Hide resolved
benches/id_parsing.rs Outdated Show resolved Hide resolved
@newsch
Copy link
Collaborator Author

newsch commented Aug 9, 2023

Such a big PR! It would be easier to do it in parts.

I will try to break it up into better commits and let you know - some of it is refactoring that isn't helpful to see together.

How does simplification output look now?

This doesn't touch simplification, I am doing that next. I'll add my updates to #4 and open a PR with the changes.

Can we run and test it?

Sure! I've been testing it as I go, and run.sh is updated to extract the tags from a pbf file. The new format is documented in the script and the README, the TL;DR is that now you pass a pbf file as well:

./run.sh /tmp/output/ ~/Downloads/planet-latest.osm.pbf ~/Downloads/enwiki-NS0-20230401-ENTERPRISE-HTML.json.tar.gz ...

@newsch newsch force-pushed the osm-tags branch 2 times, most recently from 9997f98 to 48aae6a Compare August 9, 2023 19:53
newsch added 6 commits August 9, 2023 16:01
Parse wikipedia and wikidata tags from a tsv file of OSM tags,
compatible with the "--csv" output of `osmconvert`.

Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>
- Use CLI subcommands (e.g. `om-wikiparser get-articles`)
- Move article processing into a separate module
- Convert simplify helper from separate binary to subcommand

Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>
- Use rayon and osmpbf crates, output intermediate TSV file in the same
  format as osmconvert, for use with the new `--osm-tags` flag.
- Number of threads spawned can be configured with `--procs` flag.
- Replace all wikidata id references with QID.
- Update script and documentation to use new subcommands.
- run.sh now expects a pbf file to extract tags from.

Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>
- Check for urls in osm tags
- Handle mobile urls

Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>
- Move Qid and Title to separate modules
- Reformat benchmark

Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>
- Add custom error types with `thiserror` crate in preparation for #25.
- Parsing errors are captured instead of logged to `warn` by default.
    - All parsing errors are still logged to `debug` level.
    - If >= 0.02% of tags can't be parsed, an error is logged.
    - TSV line errors are always logged as errors.
    - I/O errors will fail instead of be logged.

Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>
@newsch
Copy link
Collaborator Author

newsch commented Aug 9, 2023

I think it will be easiest to review the remaining commits individually. They're reasonably separated and each has a meaningful commit message.

These contain meaningful changes:

These move things around and don't change meaningful functionality:

Copy link
Member

@biodranik biodranik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clear commits and their descriptions!

src/wm/title.rs Outdated
.ok_or_else(|| anyhow!("Expected subdomain"))?;
let host = host.strip_prefix("m.").unwrap_or(host);
if host != "wikipedia.org" {
bail!("Expected wikipedia.org for domain")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to print wrong hosts in a log to fix/support them?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are caught at a higher level and logged/saved with the full string

src/wm/title.rs Outdated Show resolved Hide resolved
if !line_errors.is_empty() {
let error_count = line_errors.len();
let new_items = wikidata_qids.len() + wikipedia_titles.len() - original_items;
let expected_threshold = 0.02;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the benefit of hiding errors under a threshold? Isn't it beneficial to see all errors and be able to estimate/compare the quality of the dump, and to easily grep/find what is most important, or feed the whole log to contributors for fixes?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The threshold only determines if the message is info vs error level.
When you use the run.sh script with multiple languages it prints a copy of the hundreds of errors for each language.
I think writing the parse errors to a file separately will be easier to read and deal with.

I'm open to other ideas.

@newsch newsch merged commit 941d2b1 into main Aug 10, 2023
@newsch newsch deleted the osm-tags branch August 10, 2023 13:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Investigate using osmfilter/osmium for generating inputs
2 participants