Add osm tag file parsing #23
Conversation
Is there invalid UTF-8 in OpenStreetMap.org data? Any examples? Can it be fixed? It would be great to fix errors in OpenStreetMap.org directly. On the other hand, it is important to log errors in OpenStreetMap.org data and report them to osmers or fix them ourselves, while still producing robust output.
I'm not sure yet if it's in the OSM db or something that got messed up by
I'll add an option to output the errors in a structured way that we can deal with. Some are straightforward fixes; others might require someone with local knowledge, for whom we can leave notes.
Thanks! Could it be a local locale issue? What is your terminal locale/code page? Does it support non-US ones? Or it could be an osmium issue. There is nothing wrong with the output:
My terminal is set to
It looks like these aren't titles but notes, or text copied from the articles? Wikipedia has a limit of 255 characters in titles.
Good catch! You can patch osmconvert to increase the size, or try a native Rust approach.
The link above says 255 UTF-8 bytes in the title, not 255 characters. You may try to osmupdate your planet (don't forget to drop authors and history).
Sorry, I misspoke - yes, 255 bytes of UTF-8. I'll try the Rust parser you used.
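The byte-vs-character distinction above matters for multi-byte scripts: a title well under 255 characters can still blow past 255 bytes. A minimal sketch of the check (`title_fits` is a hypothetical helper, not part of this PR):

```rust
/// Wikipedia limits page titles to 255 bytes of UTF-8, not 255 characters.
/// `str::len` in Rust counts bytes, which is exactly the limit we need.
fn title_fits(title: &str) -> bool {
    title.len() <= 255
}

fn main() {
    assert!(title_fits("Warfstermolen"));

    // 100 copies of a 3-byte Burmese character: only 100 chars, but 300 bytes.
    let long = "က".repeat(100);
    assert_eq!(long.chars().count(), 100);
    assert!(!title_fits(&long));
}
```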
Such a big PR! It would be easier to do it in parts.
How does simplification output look now?
Can we run and test it?
I will try to break it up into better commits and let you know - some of it is refactoring that isn't helpful to see together.
This doesn't touch simplification, I am doing that next. I'll add my updates to #4 and open a PR with the changes.
Sure! I've been testing it as I go, and
Compare 9997f98 to 48aae6a
Parse wikipedia and wikidata tags from a TSV file of OSM tags, compatible with the `--csv` output of `osmconvert`.

Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>
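A minimal sketch of reading one line of such a TSV. The column layout (`@id`, `wikidata`, `wikipedia`, tab-separated) is an assumption for illustration; the PR's actual schema may differ:

```rust
// Parse one osmconvert-style TSV line into (id, wikidata, wikipedia).
// Column order is assumed, not taken from the PR.
fn parse_line(line: &str) -> Option<(u64, &str, &str)> {
    let mut cols = line.split('\t');
    let id = cols.next()?.parse().ok()?;
    let wikidata = cols.next()?;
    let wikipedia = cols.next()?;
    Some((id, wikidata, wikipedia))
}

fn main() {
    let (id, qid, title) = parse_line("123\tQ1866088\tnl:Warfstermolen").unwrap();
    assert_eq!(id, 123);
    assert_eq!(qid, "Q1866088");
    assert_eq!(title, "nl:Warfstermolen");

    // Malformed lines (missing columns) yield None instead of panicking.
    assert!(parse_line("123\tQ1866088").is_none());
}
```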
- Use CLI subcommands (e.g. `om-wikiparser get-articles`)
- Move article processing into a separate module
- Convert simplify helper from separate binary to subcommand

Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>
- Use rayon and osmpbf crates, output intermediate TSV file in the same format as osmconvert, for use with the new `--osm-tags` flag.
- Number of threads spawned can be configured with the `--procs` flag.
- Replace all wikidata id references with QID.
- Update script and documentation to use new subcommands.
- run.sh now expects a pbf file to extract tags from.

Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>
- Check for urls in osm tags
- Handle mobile urls

Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>
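The mobile-URL handling this commit describes can be sketched as follows. This is an illustrative standalone version, not the PR's actual parser (which uses `anyhow`/`bail!` inside a larger type); function name and return shape are assumptions:

```rust
// Extract (lang, title) from a wikipedia=* tag that holds a full URL,
// accepting mobile hosts like en.m.wikipedia.org. Illustrative only.
fn parse_wikipedia_url(url: &str) -> Option<(String, String)> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    let (host, path) = rest.split_once('/')?;
    // Split the language subdomain off the host, then drop a mobile "m." if present.
    let (lang, host) = host.split_once('.')?;
    let host = host.strip_prefix("m.").unwrap_or(host);
    if host != "wikipedia.org" {
        return None;
    }
    let title = path.strip_prefix("wiki/")?;
    Some((lang.to_string(), title.replace('_', " ")))
}

fn main() {
    assert_eq!(
        parse_wikipedia_url("https://en.m.wikipedia.org/wiki/Coffee"),
        Some(("en".to_string(), "Coffee".to_string()))
    );
    // Non-wikipedia hosts are rejected rather than mis-parsed.
    assert_eq!(parse_wikipedia_url("https://example.com/wiki/Coffee"), None);
}
```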
- Move Qid and Title to separate modules
- Reformat benchmark

Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>
- Add custom error types with the `thiserror` crate in preparation for #25.
- Parsing errors are captured instead of logged to `warn` by default.
- All parsing errors are still logged at `debug` level.
- If >= 0.02% of tags can't be parsed, an error is logged.
- TSV line errors are always logged as errors.
- I/O errors will fail instead of being logged.

Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>
I think it will be easiest to review the remaining commits individually. They're reasonably separated and each has a meaningful commit message. These contain meaningful changes:
These move things around and don't change meaningful functionality:
Thanks for clear commits and their descriptions!
src/wm/title.rs (outdated)
```rust
    .ok_or_else(|| anyhow!("Expected subdomain"))?;
let host = host.strip_prefix("m.").unwrap_or(host);
if host != "wikipedia.org" {
    bail!("Expected wikipedia.org for domain")
```
Does it make sense to print the wrong hosts in a log, so we can fix/support them?
They are caught at a higher level and logged/saved with the full string.
```rust
if !line_errors.is_empty() {
    let error_count = line_errors.len();
    let new_items = wikidata_qids.len() + wikipedia_titles.len() - original_items;
    let expected_threshold = 0.02;
```
What is the benefit of hiding errors under a threshold? Isn't it beneficial to see all errors and be able to estimate/compare the quality of the dump, and to easily grep/find what is most important, or feed the whole log to contributors for fixes?
The threshold only determines whether the message is logged at `info` vs `error` level. When you use the `run.sh` script with multiple languages, it prints a copy of the hundreds of errors for each language. I think writing the parse errors to a separate file will be easier to read and deal with. I'm open to other ideas.
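The threshold logic under discussion can be sketched as a pure function: errors are always retained, and the rate only decides the log level of the summary. Names are illustrative, not the PR's; whether the code's `0.02` is interpreted as a percentage or a ratio is a detail of the actual implementation:

```rust
// Decide whether the parse-error summary should be logged at error level.
// The individual errors are kept regardless; only the level changes.
fn summary_is_error(error_count: usize, new_items: usize, threshold: f64) -> bool {
    if new_items == 0 {
        // No successfully parsed items at all: any error is worth an error log.
        return error_count > 0;
    }
    (error_count as f64 / new_items as f64) >= threshold
}

fn main() {
    // 1 error in 10_000 items = 0.01%, below a 0.02% (0.0002) threshold: info.
    assert!(!summary_is_error(1, 10_000, 0.0002));
    // 5 errors in 10_000 items = 0.05%, at or above the threshold: error.
    assert!(summary_is_error(5, 10_000, 0.0002));
}
```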
Parse wikipedia and wikidata tags from a TSV file of OSM tags, compatible with the `--csv` output of `osmconvert`. Closes #19.
Notes from parsing all planet wikipedia/wikidata tags:
- `;` or `း` used instead of `:`
- `Q123;Q124`
- `(Q123)`
- `Warfstermolen (Q1866088)`
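Pulling QIDs out of malformed values like those listed above (multiple ids joined with `;`, or a label with the id in parentheses) could be sketched as a token scan. This is an illustrative approach, not the PR's actual parser:

```rust
// Extract all well-formed QIDs ("Q" followed by digits) from a messy
// wikidata tag value, ignoring labels and separators around them.
fn extract_qids(value: &str) -> Vec<&str> {
    value
        .split(|c: char| !c.is_ascii_alphanumeric())
        .filter(|tok| {
            tok.len() > 1
                && tok.starts_with('Q')
                && tok[1..].chars().all(|c| c.is_ascii_digit())
        })
        .collect()
}

fn main() {
    assert_eq!(extract_qids("Q123;Q124"), vec!["Q123", "Q124"]);
    assert_eq!(extract_qids("(Q123)"), vec!["Q123"]);
    assert_eq!(extract_qids("Warfstermolen (Q1866088)"), vec!["Q1866088"]);
}
```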
There are 50 wikipedia entries with url escaping; some are urls instead of titles and are not handled correctly.
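Handling those url-escaped entries means percent-decoding them back into titles (e.g. `Caf%C3%A9` to `Café`). Real code would likely use a crate such as `percent-encoding`; this hand-rolled version is a sketch for illustration only:

```rust
// Decode %XX escapes into bytes, then validate the result as UTF-8.
// Returns None on truncated escapes, bad hex, or invalid UTF-8.
fn percent_decode(s: &str) -> Option<String> {
    let mut bytes = Vec::with_capacity(s.len());
    let mut iter = s.bytes();
    while let Some(b) = iter.next() {
        if b == b'%' {
            let hex = [iter.next()?, iter.next()?];
            let hex = std::str::from_utf8(&hex).ok()?;
            bytes.push(u8::from_str_radix(hex, 16).ok()?);
        } else {
            bytes.push(b);
        }
    }
    String::from_utf8(bytes).ok()
}

fn main() {
    assert_eq!(percent_decode("Caf%C3%A9").as_deref(), Some("Café"));
    assert_eq!(percent_decode("plain_title").as_deref(), Some("plain_title"));
    // A truncated escape is rejected rather than guessed at.
    assert_eq!(percent_decode("bad%C"), None);
}
```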
Remaining work:
- Serialize parse errors to disk for changes
- Add structured parse errors (see Triage OSM tag errors #25 for the rest), log summary
- Use the `osmpbf` crate to parse the planet file, to fix the `osmconvert` truncation problem
- Update `run.sh` to use the new method