Write OSM tag errors to file #28
Thanks, these are great improvements!
// Wikipedia titles must be less than 256 bytes of UTF-8.
// See: https://en.wikipedia.org/wiki/Wikipedia:Naming_conventions_(technical_restrictions)#Title_length
if title.len() >= 256 {
    return Err(ParseTitleError::TitleLong);
Can larger titles be supported anyway? What will break if this return is skipped?
It would show a broken Wikipedia link to the user.
But there are other invalid "titles" that are shorter than 256 bytes.
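The limit discussed above is in bytes of UTF-8, not characters, which matters for non-Latin titles. A minimal sketch of such a check follows; the `check_title` function, the `TitleEmpty` variant, and the `MAX_TITLE_BYTES` constant are assumptions made for illustration, not the project's actual code:

```rust
/// Hypothetical error type for this sketch; only `TitleLong` appears in the diff.
#[derive(Debug, PartialEq)]
enum ParseTitleError {
    TitleEmpty,
    TitleLong,
}

/// Titles must be shorter than 256 bytes of UTF-8, so 255 is the largest valid size.
const MAX_TITLE_BYTES: usize = 255;

/// Trim the title and reject it if it is empty or too long.
fn check_title(title: &str) -> Result<&str, ParseTitleError> {
    let title = title.trim();
    if title.is_empty() {
        return Err(ParseTitleError::TitleEmpty);
    }
    // `str::len` returns the length in bytes, not in characters.
    if title.len() > MAX_TITLE_BYTES {
        return Err(ParseTitleError::TitleLong);
    }
    Ok(title)
}
```

Because `str::len` counts bytes, a title of 200 Cyrillic letters (two bytes each) is rejected even though it has fewer than 256 characters.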
src/wm/title.rs
Outdated
if lang.is_empty() {
    return Err(ParseTitleError::NoLang);
}
// Longest Wiki name is 'zh-classical' (12 chars).
// See: https://meta.wikimedia.org/wiki/List_of_Wikipedias
if lang.len() > 12 {
What if new, longer names are added?
It will break, unless we dynamically check with Wikipedia somehow.
I've removed that check; we can add warnings later if it becomes a problem.
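If the length check were ever brought back as a warning rather than a hard error, it might look roughly like the sketch below. The `check_lang` function, the `LangError` type, and the `LONGEST_KNOWN_LANG` constant are hypothetical names used only for illustration:

```rust
/// Length of 'zh-classical', the longest Wikipedia name noted in the diff above.
const LONGEST_KNOWN_LANG: usize = 12;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum LangError {
    NoLang,
}

/// Accept any non-empty language code, but attach a warning when it is
/// longer than any Wikipedia name known when the constant was written.
fn check_lang(lang: &str) -> Result<(&str, Option<String>), LangError> {
    let lang = lang.trim();
    if lang.is_empty() {
        return Err(LangError::NoLang);
    }
    let warning = (lang.len() > LONGEST_KNOWN_LANG)
        .then(|| format!("unusually long language code: {lang:?}"));
    Ok((lang, warning))
}
```

This keeps parsing forward-compatible with new, longer wiki names while still surfacing suspicious values to whoever reads the warnings.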
README.md
Outdated
@@ -36,6 +36,7 @@ Usage: om-wikiparser <COMMAND>
Commands:
  get-articles  Extract, filter, and simplify article HTML from Wikipedia Enterprise HTML dumps
  get-tags      Extract wikidata/wikipedia tags from an OpenStreetMap PBF dump
  tag-errors    Write errors found while parsing an OSM tag file to stdout in TSV format
Should it be run after get-articles or get-tags? Does it make sense to clarify the use cases here?
It takes the output of `get-tags` or `osmconvert`; I'll clarify that.
It would be great to see the main use cases in the "how to prepare data for OM generator" context documented in either the README or the help output.
- Only embed commit on release builds.
- Add CI and scripts to excluded cargo files.

Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>
A special 'bench' profile can be used instead. Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>
- Write a TSV file with the line number, error, and input text.
- Include OSM object id if available in tag file.
- Update run script to write file once before extracting.

Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>
Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>
- Add @otype and @version columns to the `get-tags` output.
- Parse @otype, @oname, and @version columns in osm tagfiles.
- Attach and output available metadata in the `tag-errors` command.

OSM ids are not shared across nodes, ways, and relations, so the object type should be saved as well. Including the edit version will make it easier to see if a mis-tagged object is outdated.

Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>
Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>
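The commit messages above describe a TSV error report with the line number, error, input text, and optional OSM metadata. A rough sketch of writing one such row is shown below; the `TagError` struct, the `OsmType` enum, the `write_row` function, and the column order are all assumptions for illustration and may not match the real `tag-errors` output:

```rust
use std::io::{self, Write};

/// OSM ids are not shared across object types, so the type is kept with the id.
#[derive(Debug, Clone, Copy)]
enum OsmType {
    Node,
    Way,
    Relation,
}

/// Hypothetical record for one tag-parsing error.
struct TagError<'a> {
    line: usize,
    otype: Option<OsmType>,
    id: Option<u64>,
    version: Option<u32>,
    error: &'a str,
    input: &'a str,
}

/// Write one tab-separated row; missing metadata becomes an empty field.
fn write_row<W: Write>(out: &mut W, e: &TagError) -> io::Result<()> {
    let otype = match e.otype {
        Some(OsmType::Node) => "node",
        Some(OsmType::Way) => "way",
        Some(OsmType::Relation) => "relation",
        None => "",
    };
    let id = e.id.map(|i| i.to_string()).unwrap_or_default();
    let version = e.version.map(|v| v.to_string()).unwrap_or_default();
    // Tabs and newlines inside free text would break the row, so replace them.
    let clean = |s: &str| s.replace(|c: char| matches!(c, '\t' | '\n' | '\r'), " ");
    writeln!(
        out,
        "{}\t{}\t{}\t{}\t{}\t{}",
        e.line,
        otype,
        id,
        version,
        clean(e.error),
        clean(e.input)
    )
}
```

Writing to any `Write` implementor means the same function can target stdout or a file, which fits the run script writing the report once before extracting.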
Force-pushed from 9acee62 to e439256.
This takes care of the remaining changes needed by #25.
It will be easier to review if you ignore ddb818e; it is only refactoring.