Allow non ASCII but printable characters in ID field #918

Open
niyalist opened this issue Jun 24, 2021 · 3 comments
Labels
community rules This is used for Out of Spec / Out of Best Practice rules that we'd like to include in the validator enhancement New feature request or improvement on an existing feature GTFS Reference Used for Adding or changing rules that belong in the GTFS reference status: Blocked Can't work on it currently because of an external factor.

@niyalist

Feature request

Please allow non ASCII but printable characters in ID field.
We, Japanese GTFS engineers, often use local characters in IDs to make it easier for non-technical staff at a transit agency to check the GTFS data. In particular, we use the Japanese names for day types, such as weekday or holiday, in service_id, for example "平日" and "土休日". Since an ID is an internal identifier that only machines refer to, I think non-ASCII characters should be usable as well.

Is your feature request related to a problem? Please describe.
#712 is discussing this issue.
A discussion in the GTFS specification repository also relates to this issue.

Proposed solution
Remove this rule, or set the category of this rule to "INFO" rather than "WARNING".

Describe alternatives you've considered
One possible solution is to enable this rule referring to the given country code.

@isabelle-dr isabelle-dr added the GTFS Reference Used for Adding or changing rules that belong in the GTFS reference label Jun 29, 2021
@isabelle-dr
Contributor

isabelle-dr commented Jul 5, 2021

Thank you for opening this issue.
The current validator behaviour (a warning) seems to be in line with the specification (a recommendation):

An ID field value is an internal ID, not intended to be shown to riders, and is a sequence of any UTF-8 characters. Using only printable ASCII characters is recommended.

A warning doesn't make a dataset invalid, but I see that it can cause a noisy output if having non-ASCII characters in id fields is part of your process.
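
As a rough illustration of what this recommendation checks, here is a minimal Python sketch. It is not the validator's actual implementation: the function names are hypothetical, and treating the range U+0020 to U+007E (including the space character) as "printable ASCII" is an assumption.

```python
def is_printable_ascii(value: str) -> bool:
    """Return True if every character is in U+0020..U+007E (assumed range)."""
    return all(0x20 <= ord(ch) <= 0x7E for ch in value)


def check_id(field: str, value: str) -> list[str]:
    """Return warning messages for one ID value (hypothetical helper).

    An empty list means the value follows the recommendation. A non-empty
    list is only a warning: the value is still a valid UTF-8 ID.
    """
    if is_printable_ascii(value):
        return []
    return [f"WARNING: {field}={value!r} is not printable ASCII"]


if __name__ == "__main__":
    for service_id in ["weekday", "平日", "土休日"]:
        print(service_id, check_id("service_id", service_id))
```

With a check like this, every row whose ID contains Japanese characters produces one warning, which is why feeds built around such IDs generate very long reports.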
We are currently in discussion on how to best approach discrepancies between the GTFS Reference (the specification) and the way GTFS is used today, within the validator.
Are there id fields other than service_id where you are facing a similar issue?

@takohei

takohei commented Jul 10, 2021

@isabelle-dr Thank you for your comment.

Some popular GTFS generation tools in Japan use non-ASCII characters in fields other than service_id.

Tool1. Sono Sujiya

Tool: http://www.sinjidai.com/sujiya/
Feed example: Uno Bus in Japan

  • route_id: ネオ瀬戸線_A -- [route name]_[route sub pattern]
  • trip_id: 平日_05時40分_系統1642 -- [service_id]_[departure time]_[stop pattern code]
  • stop_id: 2_親 -- [original stop id]_[parent]

The number of warnings is 68,028! (screen shot on Google Transit)

Tool2. Mieruka Format

Tool: https://www.rosenzu.com/net/mieru/fm/
Feed example: Toyota City Bus

  • trip_id: 2平日_06時24分_系統101001 -- [service code][service_id]_[departure time]_[stop pattern code]

Tool3. Nishizawa Tool

Tool: https://home.csis.u-tokyo.ac.jp/~nishizawa/gtfs/
Feed example: Kobe City Shiokaze Bus

  • trip_id: 1_1+1+平土+1 -- [route_id]_[stop pattern serial]+[direction_id]+[service_id]+[trip serial]

Explanation

Each of these tools has a share of about 20% in Japan.

These IDs are natural composite keys.
This is not good practice in software engineering, but it is pragmatic for operators who create GTFS feeds, for the following reasons.

  • Most validation tools indicate error or warning records by ID rather than by name.
  • Many of the operators have low IT literacy. Therefore they are unfamiliar with searching records using ID.
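
To illustrate what "natural composite key" means here, the following Python sketch splits the Sono Sujiya example trip_id 平日_05時40分_系統1642 into its parts. It assumes the three parts are separated by underscores, as they appear in that example; the `TripKey` type and `parse_trip_id` function are hypothetical names, not part of any of the tools above.

```python
from typing import NamedTuple


class TripKey(NamedTuple):
    """Parts of a composite trip_id (hypothetical structure)."""
    service_id: str
    departure_time: str
    stop_pattern_code: str


def parse_trip_id(trip_id: str) -> TripKey:
    """Split an underscore-separated trip_id into its three parts."""
    parts = trip_id.split("_")
    if len(parts) != 3:
        raise ValueError(f"expected 3 parts, got {len(parts)}: {trip_id!r}")
    return TripKey(*parts)


if __name__ == "__main__":
    key = parse_trip_id("平日_05時40分_系統1642")
    print(key.service_id)         # 平日
    print(key.departure_time)     # 05時40分
    print(key.stop_pattern_code)  # 系統1642
```

Because each part is readable on its own, an operator can tell from the ID alone which service day, departure, and stop pattern a reported record belongs to.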

@isabelle-dr
Contributor

Hello,

Thank you for this precise answer. I understand better how the id fields are built in Japan (as natural composite keys). Having 68,028 warnings isn't a nice experience and it must be hard to know what is a "real" problem with the datasets.

After talking with the specification team, here is what I can add: the GTFS community decided to recommend using only printable ASCII characters in id fields in order to prevent any parsing issues.
The validator aims at representing the canonical set of rules described in the official specification, which means that:
1- this rule will be kept at a WARNING level in the validator
2- it cannot be linked to a certain country code

If you are interested in using a different system for the id fields, and maybe changing the way the data is being checked by GTFS operators, I’ll be happy to help.
There are a few examples of feeds that use only numbers and letters (from the latin alphabet) in the id fields:

Also, some GTFS providers modified the official GTFS schema in order to add information about the service:

  • Additional field holiday_name in the calendar.txt file
  • Additional file calendar_attributes.txt with fields service_id and service_description

An example of a feed doing this is Central Oregon Breeze GTFS.

  • calendar.txt file:

[Screenshot of the calendar.txt file, not preserved]

  • calendar_attributes.txt file:

[Screenshot of the calendar_attributes.txt file, not preserved]
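
As a rough sketch of how this could look for a Japanese feed, the Python snippet below writes a calendar_attributes.txt with ASCII service_id values and the Japanese day-type names in service_description. The two field names come from the description above; the ASCII IDs ("weekday", "sat_holiday") and the function name are made up for illustration, and calendar_attributes.txt is an extension rather than part of the official GTFS reference.

```python
import csv
import io

# Hypothetical ASCII service_ids mapped to the labels agency staff already use.
SERVICE_DESCRIPTIONS = {
    "weekday": "平日",
    "sat_holiday": "土休日",
}


def build_calendar_attributes(descriptions: dict[str, str]) -> str:
    """Return the text of a calendar_attributes.txt file (UTF-8 content)."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["service_id", "service_description"])
    for service_id, description in descriptions.items():
        writer.writerow([service_id, description])
    return buffer.getvalue()


if __name__ == "__main__":
    print(build_calendar_attributes(SERVICE_DESCRIPTIONS))
```

The IDs stay printable ASCII, so the warning is not triggered, while the human-readable names remain available to staff and to tools that choose to read the extra file.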

Again, if you’re interested in adding this to your GTFS schema, I’ll be happy to help.

The last thing I could recommend is maybe opening a discussion with the GTFS community and proposing a change in the specification. The process is described here.

Let me know if I can be of any other assistance to you.

@isabelle-dr isabelle-dr added the community rules This is used for Out of Spec / Out of Best Practice rules that we'd like to include in the validator label Aug 26, 2021
@isabelle-dr isabelle-dr added this to the v3.1.0 - custom validation milestone Jan 9, 2022
@isabelle-dr isabelle-dr removed this from the v3.2.0 - custom validation milestone Oct 3, 2022
@isabelle-dr isabelle-dr added status: Blocked Can't work on it currently because of an external factor. enhancement New feature request or improvement on an existing feature labels Oct 3, 2022
@isabelle-dr isabelle-dr added this to the Future work milestone Oct 3, 2022