Redundant representation of Tokens/Terminals #9

Open
rljacobson opened this issue Oct 1, 2022 · 0 comments

As far as I can tell, the tokens appear in the following ways:

  1. as constants on the SymbolKind struct
  2. wrapped in SymbolKind in the array SymbolKind::VALUES_ (which seems to just emulate an enum…?)
  3. as constants on the Lexer

In the calculator example in the docs/test, the lexer and parser only needed to use the constants on Lexer, but more sophisticated projects might need tokens enumerated in the Value or Token types, which is another list.

I think there is a simpler way. Suppose we instead have a Symbol enumeration that has all of the tokens as variants as in the following:

// Use, e.g., the `enum-primitive-derive` crate for i32<->enum conversion.
#[derive(Copy, Clone, Eq, PartialEq, Ord, PartialOrd, Debug, Primitive)]
#[repr(i32)]
pub enum Symbol {
    YYEmpty,
    YYEOF,
    YYError,
    YYUndef,
    //    ⋮     Whatever other "utility" variants are needed,
    //    ⋮     so long as there are a statically known number of them.
    UserTerminalToken1,
    UserTerminalToken2,
    UserTerminalToken3,
    //    ⋮     All other terminal tokens the user declared in the spec file.
    UserTerminalTokenN,
    UserNonterminalSymbol1,
    UserNonterminalSymbol2,
    UserNonterminalSymbol3,
    //    ⋮     All other nonterminal symbols from the spec file.
    UserNonterminalSymbolM
}
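To make the i32<->enum conversion concrete without depending on a derive crate, here is a minimal hand-written sketch. The calculator-style variants (`Num`, `Plus`, `Minus`, `Expr`), the `ALL` table, and `is_nonterminal` are hypothetical names chosen for illustration; a generator would emit whatever the spec file declares.

```rust
use std::convert::TryFrom;

/// Hypothetical symbol set for a tiny calculator grammar.
#[derive(Copy, Clone, Eq, PartialEq, Ord, PartialOrd, Debug)]
#[repr(i32)]
pub enum Symbol {
    YYEmpty,
    YYEOF,
    YYError,
    YYUndef,
    Num,   // first user terminal
    Plus,
    Minus, // last user terminal
    Expr,  // first (and only) nonterminal
}

impl Symbol {
    /// Every variant in declaration order, so the index equals the discriminant.
    pub const ALL: [Symbol; 8] = [
        Symbol::YYEmpty,
        Symbol::YYEOF,
        Symbol::YYError,
        Symbol::YYUndef,
        Symbol::Num,
        Symbol::Plus,
        Symbol::Minus,
        Symbol::Expr,
    ];

    /// Nonterminals are declared after all terminals, so one comparison suffices.
    pub fn is_nonterminal(self) -> bool {
        (self as i32) >= (Symbol::Expr as i32)
    }
}

impl TryFrom<i32> for Symbol {
    /// The rejected value is handed back on failure.
    type Error = i32;

    fn try_from(value: i32) -> Result<Self, Self::Error> {
        usize::try_from(value)
            .ok()
            .and_then(|index| Symbol::ALL.get(index).copied())
            .ok_or(value)
    }
}
```

Because terminals and nonterminals occupy contiguous ranges, classifying a symbol is a single integer comparison rather than a table lookup.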

This enum is generated but can be used by the lexer or whatever other code might need it. Also, simple translation/conversion functions would be generated as in the following:

use num_traits::FromPrimitive; // Brings `Symbol::from_i32` into scope.

impl Symbol {
    /// The value that `yychar` would hold for this symbol. Only meaningful for terminals.
    pub fn yychar_value(&self) -> i32 {
        match self {
            Symbol::YYEmpty => -2,
            Symbol::YYEOF => 0,
            //    ⋮    Whatever other "special" values there are.
            Symbol::YYError => 256,
            Symbol::YYUndef => 257,
            // User terminals are numbered consecutively from 258, the first token
            // value for yychar. This constant should be statically known.
            other => (*other as i32) - (Symbol::UserTerminalToken1 as i32) + 258,
        }
    }

    pub fn yytoken_value(&self) -> i32 {
        *self as i32
    }

    /// The inverse of the `Symbol::yychar_value()` function.
    pub fn from_yychar(yychar: i32) -> Symbol {
        match yychar {
            -2 => Symbol::YYEmpty,
            0 => Symbol::YYEOF,
            //    ⋮    Whatever other "special" values there are.
            i if i < 256 => Symbol::YYUndef,
            256 => Symbol::YYError,
            257 => Symbol::YYUndef,
            i if i <= 256 + YYNTOKENS_ => Symbol::from_i32( i - 258 + (Symbol::UserTerminalToken1 as i32) ).unwrap(),
            _ => Symbol::YYUndef
        }
    }

    pub fn name(&self) -> &'static str {
        yynames_[*self as usize]
    }
}
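As a sanity check on the mapping, here is a small self-contained sketch of the same round trip. The variants and the constants `FIRST_USER_YYCHAR`, `FIRST_TERMINAL`, and `FIRST_NONTERMINAL` are hypothetical. Unlike the functions above, this version returns `Option<i32>` so that nonterminals visibly have no `yychar` value, and it bounds the reverse lookup by the first nonterminal rather than by `YYNTOKENS_`. That is one possible design, not necessarily what the generator should emit.

```rust
#[derive(Copy, Clone, Eq, PartialEq, Debug)]
#[repr(i32)]
pub enum Symbol {
    YYEmpty,
    YYEOF,
    YYError,
    YYUndef,
    Num,
    Plus,
    Minus,
    Expr,
}

/// Variants in declaration order, so the index equals the discriminant.
const ALL: [Symbol; 8] = [
    Symbol::YYEmpty,
    Symbol::YYEOF,
    Symbol::YYError,
    Symbol::YYUndef,
    Symbol::Num,
    Symbol::Plus,
    Symbol::Minus,
    Symbol::Expr,
];

const FIRST_USER_YYCHAR: i32 = 258;
const FIRST_TERMINAL: i32 = Symbol::Num as i32;
const FIRST_NONTERMINAL: i32 = Symbol::Expr as i32;

impl Symbol {
    /// `Some(yychar)` for terminals, `None` for nonterminals.
    pub fn yychar_value(self) -> Option<i32> {
        match self {
            Symbol::YYEmpty => Some(-2),
            Symbol::YYEOF => Some(0),
            Symbol::YYError => Some(256),
            Symbol::YYUndef => Some(257),
            other if (other as i32) < FIRST_NONTERMINAL => {
                Some(other as i32 - FIRST_TERMINAL + FIRST_USER_YYCHAR)
            }
            _ => None,
        }
    }

    /// Inverse of `yychar_value`; anything unrecognized becomes `YYUndef`.
    pub fn from_yychar(yychar: i32) -> Symbol {
        match yychar {
            -2 => Symbol::YYEmpty,
            0 => Symbol::YYEOF,
            256 => Symbol::YYError,
            i if i >= FIRST_USER_YYCHAR => {
                let index = i - FIRST_USER_YYCHAR + FIRST_TERMINAL;
                if index < FIRST_NONTERMINAL {
                    ALL[index as usize]
                } else {
                    Symbol::YYUndef
                }
            }
            _ => Symbol::YYUndef,
        }
    }
}
```

With this layout, `Symbol::Num` maps to 258 and back, while a value just past the last terminal falls through to `YYUndef` instead of being misread as a nonterminal.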

This has the advantages of:

  1. moving the consts in Lexer to a dedicated enum
  2. moving the consts in SymbolKind to a dedicated enum, eliminating the need for SymbolKind and all the SymbolKind::get() calls
  3. making the yychar variable redundant altogether, since each read of yychar can be replaced with a call such as yytoken.yychar_value()
  4. making yytranslate_() and yytranslate_table_ unnecessary
  5. eliminating Lexer::TOKEN_NAMES (which I think is redundant anyway...?)
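
To illustrate how these points might look from the lexer's side, here is a rough sketch of a hand-written lexer that returns `Symbol` values directly, so it needs neither its own token constants nor a separate names table. The `Lexer` struct, its `yylex` method, and the variants are hypothetical and much simpler than what the generator actually produces.

```rust
use std::iter::Peekable;
use std::str::Chars;

#[derive(Copy, Clone, Eq, PartialEq, Debug)]
#[repr(i32)]
pub enum Symbol {
    YYEOF,
    YYUndef,
    Num,
    Plus,
    Minus,
}

impl Symbol {
    /// Display names live on the enum, so no `TOKEN_NAMES` table is needed.
    pub fn name(self) -> &'static str {
        match self {
            Symbol::YYEOF => "end of file",
            Symbol::YYUndef => "invalid token",
            Symbol::Num => "Num",
            Symbol::Plus => "'+'",
            Symbol::Minus => "'-'",
        }
    }
}

/// Hypothetical lexer that yields `Symbol` values instead of raw integers.
pub struct Lexer<'a> {
    chars: Peekable<Chars<'a>>,
}

impl<'a> Lexer<'a> {
    pub fn new(input: &'a str) -> Self {
        Lexer { chars: input.chars().peekable() }
    }

    pub fn yylex(&mut self) -> Symbol {
        // Skip leading whitespace.
        while self.chars.peek().map_or(false, |c| c.is_whitespace()) {
            self.chars.next();
        }
        match self.chars.next() {
            None => Symbol::YYEOF,
            Some('+') => Symbol::Plus,
            Some('-') => Symbol::Minus,
            Some(c) if c.is_ascii_digit() => {
                // Consume the remaining digits of the number.
                while self.chars.peek().map_or(false, |d| d.is_ascii_digit()) {
                    self.chars.next();
                }
                Symbol::Num
            }
            Some(_) => Symbol::YYUndef,
        }
    }
}
```

A parser driving this lexer would match on `Symbol` variants and call `name()` for diagnostics, with no translation table between the two.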

I am not sure I have all the details correct in the code above, but it seems to me that something like this should work.
