Write OSM tag errors to file #28

Merged
merged 6 commits into from
Sep 29, 2023

Conversation

newsch
Collaborator

@newsch newsch commented Aug 25, 2023

This takes care of the remaining changes needed by #25.

  • Write a TSV file with the original line number, OSM object url, OSM edit version, error, and input text.
  • Add additional checks for title and language validity
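A minimal sketch of what writing one such TSV row could look like, assuming a hypothetical `TagError` record and column names; the actual types and header used in this PR may differ. Tabs and newlines inside a field are escaped so that each error stays on one row.

```rust
use std::io::{self, Write};

/// Hypothetical record for one tag that failed to parse.
/// Field names are assumptions for illustration, not the PR's actual types.
struct TagError {
    line: usize,              // original line number in the tag file
    osm_url: Option<String>,  // OSM object url, if the metadata was available
    osm_version: Option<u32>, // OSM edit version, if available
    error: String,            // human-readable parse error
    input: String,            // the raw input text that failed to parse
}

/// Escape characters that would otherwise break a TSV row.
fn tsv_escape(field: &str) -> String {
    field
        .replace('\t', "\\t")
        .replace('\n', "\\n")
        .replace('\r', "\\r")
}

/// Write the header row (column names are assumed).
fn write_header<W: Write>(mut out: W) -> io::Result<()> {
    writeln!(out, "line\tosm_url\tosm_version\terror\tinput")
}

/// Write one error as a single TSV row; missing metadata becomes an empty field.
fn write_row<W: Write>(mut out: W, e: &TagError) -> io::Result<()> {
    let url = e.osm_url.as_deref().unwrap_or("");
    let version = e.osm_version.map(|v| v.to_string()).unwrap_or_default();
    writeln!(
        out,
        "{}\t{}\t{}\t{}\t{}",
        e.line,
        tsv_escape(url),
        version,
        tsv_escape(&e.error),
        tsv_escape(&e.input)
    )
}
```

Both functions accept any `Write` implementor, so the same code can target a file, stdout, or an in-memory buffer in tests.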

It will be easier to ignore ddb818e when reviewing; it is only refactoring.

Member

@biodranik biodranik left a comment

Thanks, these are great improvements!

// Wikipedia titles must be less than 256 bytes of UTF-8.
// See: https://en.wikipedia.org/wiki/Wikipedia:Naming_conventions_(technical_restrictions)#Title_length
if title.len() >= 256 {
    return Err(ParseTitleError::TitleLong);
Member

Can larger titles be supported anyway? What will break if this return is skipped?

Collaborator Author

It would show a broken Wikipedia link to the user.
But there are other invalid "titles" that are less than 255 characters.
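For reference, a minimal sketch of a title check under the 256-byte limit discussed above. `ParseTitleError::TitleLong` comes from the diff; the `TitleEmpty` variant and the `check_title` function are assumptions for illustration only. Note that `str::len()` in Rust returns the length in bytes, not characters, which matches the UTF-8 byte limit.

```rust
#[derive(Debug, PartialEq)]
enum ParseTitleError {
    /// Hypothetical variant, added here only for illustration.
    TitleEmpty,
    /// Variant named in the diff under review.
    TitleLong,
}

/// Hypothetical helper: trim the title and reject empty or over-long values.
fn check_title(title: &str) -> Result<&str, ParseTitleError> {
    let title = title.trim();
    if title.is_empty() {
        return Err(ParseTitleError::TitleEmpty);
    }
    // `str::len()` counts UTF-8 bytes, so multi-byte characters
    // reach the limit sooner than their character count suggests.
    if title.len() >= 256 {
        return Err(ParseTitleError::TitleLong);
    }
    Ok(title)
}
```

With this sketch, a title of 128 two-byte characters such as `"é".repeat(128)` is rejected even though it is well under 256 characters.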

src/wm/title.rs Outdated
if lang.is_empty() {
    return Err(ParseTitleError::NoLang);
}
// Longest Wiki name is 'zh-classical' (12 chars).
// See: https://meta.wikimedia.org/wiki/List_of_Wikipedias
if lang.len() > 12 {
Member

What if new, longer names are added?

Collaborator Author

It will break, unless we dynamically check with Wikipedia somehow.

Collaborator Author

I've removed that check; we can add warnings later if it becomes a problem.

README.md Outdated
@@ -36,6 +36,7 @@ Usage: om-wikiparser <COMMAND>
Commands:
  get-articles  Extract, filter, and simplify article HTML from Wikipedia Enterprise HTML dumps
  get-tags      Extract wikidata/wikipedia tags from an OpenStreetMap PBF dump
  tag-errors    Write errors found while parsing an OSM tag file to stdout in TSV format
Member

Should it be run after get-articles or get-tags? Does it make sense to clarify the use cases here?

Collaborator Author

It takes the output of get-tags or osmconvert; I'll clarify that.

Member

It would be great to see the main use cases in the "how to prepare data for OM generator" context documented either in the readme or in the help output.

- Only embed commit on release builds.
- Add CI and scripts to excluded cargo files.

Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>
A special 'bench' profile can be used instead.

Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>
- Write a TSV file with the line number, error, and input text.
- Include OSM object id if available in tag file.
- Update run script to write file once before extracting.

Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>
Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>
- Add @otype and @version columns to the `get-tags` output.
- Parse @otype, @oname, and @version columns in osm tagfiles.
- Attach and output available metadata in the `tag-errors` command.

OSM ids are not shared across nodes, ways, and relations, so the object
type should be saved as well. Including the edit version will make it
easier to see if a mis-tagged object is outdated.
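To illustrate why the object type has to travel with the id, here is a minimal sketch that builds an OSM object url from both. The `OsmType` enum and the function names are hypothetical, and the accepted field values (numeric `0`/`1`/`2` for @otype, names for @oname) are an assumption based on osmconvert's CSV output rather than something confirmed by this PR.

```rust
/// Hypothetical representation of the three OSM object types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum OsmType {
    Node,
    Way,
    Relation,
}

impl OsmType {
    /// Parse either a numeric @otype value or an @oname value.
    /// The accepted spellings are an assumption, see the note above.
    fn from_field(s: &str) -> Option<OsmType> {
        match s.trim() {
            "0" | "node" => Some(OsmType::Node),
            "1" | "way" => Some(OsmType::Way),
            "2" | "relation" => Some(OsmType::Relation),
            _ => None,
        }
    }

    /// Path segment used in openstreetmap.org object urls.
    fn as_str(self) -> &'static str {
        match self {
            OsmType::Node => "node",
            OsmType::Way => "way",
            OsmType::Relation => "relation",
        }
    }
}

/// Build the url of an OSM object from its type and id.
fn osm_url(otype: OsmType, id: u64) -> String {
    format!("https://www.openstreetmap.org/{}/{}", otype.as_str(), id)
}
```

The same numeric id yields different urls for a node and a way, which is the reason an id alone cannot identify the mis-tagged object.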

Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>
Signed-off-by: Evan Lloyd New-Schmidt <[email protected]>
@newsch newsch merged commit 29d9037 into main Sep 29, 2023
@newsch newsch deleted the export-errors branch September 29, 2023 20:11