Attribute single quote mark errors as "Ambiguous syntax" #11

Open
radonnachie opened this issue Jun 10, 2020 · 7 comments

@radonnachie

Quite excited to see what I can make of the pyVHDLparser. In using it on a simple enough file, I encounter the following:

================================================================================
                        pyVHDLParser - Test Application
================================================================================
FATAL: An unknown or unhandled exception reached the topmost exception handler!
  Exception type:      TokenizerException
  Exception message:   (line:  15, col: 37): Ambiguous syntax detected. buffer: ''l'
  Caused in:           GetVHDLTokenizer in file '/mnt/d/Development/OpenSource/pyVHDLParser/pyVHDLParser/Token/Parser.py' at line 368
--------------------------------------------------------------------------------
  ...Token/Parser.py", line 368, in GetVHDLTokenizer
    raise TokenizerException("Ambiguous syntax detected. buffer: '{buffer}'".format(buffer=buffer), start)
--------------------------------------------------------------------------------
Please report this bug at GitHub: https://github.com/VLSI-EDA/pyIPCMI/issues
--------------------------------------------------------------------------------

This seems to be triggered by the access of 'length... I'll be diving into the Parser.py file to see if I can rectify.

Thanks!

@radonnachie
Author

radonnachie commented Jun 10, 2020

pyVHDLParser/Token/Parser.py : 363-365

if ((buffer[0] in __ALPHA_CHARACTERS__) and (buffer[1] in __ALPHA_CHARACTERS__)):
  tokenKind =     cls.TokenKind.AlphaChars
elif ((buffer[0] in __WHITESPACE_CHARACTERS__) and (buffer[1] in __WHITESPACE_CHARACTERS__)):
...

So buffer[0] is the single quote... doesn't seem like any attribute access will survive this. Am I missing something?

@radonnachie
Author

I changed line 362 to

buffer =          buffer[1:3]

from

buffer =          buffer[:2]
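
A minimal sketch of why the two slices behave differently. It assumes the buffer starts with the apostrophe and that __ALPHA_CHARACTERS__ is just the ASCII letters (the real constant in Parser.py may differ); is_alpha_pair is a hypothetical helper, not part of pyVHDLParser:

```python
import string

# Assumption: stand-in for pyVHDLParser's __ALPHA_CHARACTERS__.
ALPHA_CHARACTERS = string.ascii_letters

def is_alpha_pair(chars):
    """Return True if chars holds exactly two alphabetic characters."""
    return len(chars) == 2 and all(c in ALPHA_CHARACTERS for c in chars)

# Assumed buffer content: apostrophe followed by the start of 'length.
buffer = "'le"

original_check = is_alpha_pair(buffer[:2])   # inspects "'l", apostrophe included
patched_check = is_alpha_pair(buffer[1:3])   # inspects "le", apostrophe skipped
print(original_check, patched_check)
```

With the apostrophe at index 0, the first slice can never be two alpha characters, while the second one looks only at the characters after it.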

@Paebbels Paebbels transferred this issue from Paebbels/pyIPCMI Jun 11, 2020
@Paebbels
Owner

Hi, sorry for the wrong issue link in the error message. It's a pyVHDLParser error, not a pyIPCMI error :).

Parsing attributes is very complex in VHDL and by nature ambiguous. Some parsers (to be exact, lexers/tokenizers) protect themselves by allowing only attribute names longer than 1 character; otherwise an attribute might get mixed up with a character literal such as 'c'.

Examples:

  • a'b'c => character literal 'b' between a and c
  • aa'bb'cc => chain of attribute names: cc applied to bb, applied to aa
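
The two examples can be told apart with a short lookahead and no regular expression. The following is only a sketch of that heuristic, not pyVHDLParser's actual logic; classify_apostrophe is a hypothetical name:

```python
def classify_apostrophe(text, pos):
    """Classify the apostrophe at text[pos] using a two-character lookahead.

    Heuristic (an assumption, not pyVHDLParser's logic): an apostrophe
    followed by exactly one character and another apostrophe starts a
    character literal; anything else is treated as an attribute tick.
    """
    if text[pos] != "'":
        raise ValueError("text[pos] must be an apostrophe")
    if pos + 2 < len(text) and text[pos + 2] == "'":
        return "character literal"
    return "attribute tick"

print(classify_apostrophe("a'b'c", 1))
print(classify_apostrophe("aa'bb'cc", 2))
```

This is deliberately naive: it would misread a qualified expression such as t'('1'), where the first apostrophe is followed by ( and another apostrophe, so a real tokenizer needs more context than two characters.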

@radonnachie
Author

radonnachie commented Jun 13, 2020

Okay, nice! Only character literals use single quotes (I believe longer literals require double quotes), right? If not, one could check exhaustively against the attribute list. Otherwise, perhaps the following is helpful in achieving that definition?

re.search(r"('\w(\w+))+", buffer)
>>> import re
>>> l = "a'b'c"
>>> a = "a'bb'cc"
>>> al = "a'bb'c"
>>> print(re.search(r"('\w(\w+))+", l))
None
>>> print(re.search(r"('\w(\w+))+", a))
<re.Match object; span=(1, 7), match="'bb'cc">
>>> print(re.search(r"('\w(\w+))+", al))
<re.Match object; span=(1, 4), match="'bb">

It would catch a chain of attributes as a single compound-attribute, which may be quite neat. The chain will extend until there is a character literal. It'd have to be used recursively if one wants the full tree of literal-attribute accesses.

It doesn't care if there are literals before an attribute:

>>> la = "a'c'bb"
>>> print(re.search(r"('\w(\w+))+", la))
<re.Match object; span=(3, 6), match="'bb">

Is ☝️ an issue?

@Paebbels
Owner

The list of attributes is not limited. Users can define their own attributes, so comparing against such a list doesn't work.

Moreover, a tokenizer doesn't know these details. It just splits the input file into a stream of tokens and tries its best to figure out what kind of token each one is. The Tokenizer in pyVHDLParser already creates more token types than any other parser I know.

A literal is a class of base element in a language:

  • keywords (VHDL calls them reserved words)
  • identifiers
  • extended identifier
  • literals (leaf elements in an expression tree)
    • integer number
    • floating point number
    • character
    • string
    • bitstrings
  • operators
  • delimiter
  • whitespace
  • comment

Thus, 125, 45.975, 'c', "hello world", x"0110" are literals.
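
As a rough illustration of how the five example literals fall into the classes listed above, here is a sketch that classifies an already-isolated lexeme by its shape. The function literal_kind is hypothetical, and the rules are heavily simplified: they ignore exponents, underscores, based literals and everything else the VHDL standard allows.

```python
def literal_kind(lexeme):
    """Guess the literal class of a single lexeme (simplified rules)."""
    # Character literal: exactly one character between apostrophes.
    if len(lexeme) == 3 and lexeme[0] == "'" and lexeme[2] == "'":
        return "character"
    # String literal: enclosed in double quotes.
    if len(lexeme) >= 2 and lexeme[0] == '"' and lexeme[-1] == '"':
        return "string"
    # Bit string: one base-specifier letter, then a double-quoted body.
    if (len(lexeme) >= 3 and lexeme[0].isalpha()
            and lexeme[1] == '"' and lexeme[-1] == '"'):
        return "bit string"
    # Integer: digits only.
    if lexeme.isdigit():
        return "integer"
    # Floating point: digits, a dot, digits.
    whole, dot, frac = lexeme.partition(".")
    if dot and whole.isdigit() and frac.isdigit():
        return "floating point"
    return "unknown"

for lexeme in ["125", "45.975", "'c'", '"hello world"', 'x"0110"']:
    print(lexeme, "->", literal_kind(lexeme))
```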

@radonnachie
Author

I extensively edited my last comment, primarily because I thought to research literals. Thanks for exhaustively listing them. I think, though, that you weren't answering the meat of what I was querying...

I think that, from the get-go, none of the conversation has moved towards a solution:

  1. I believe that isAlpha(buffer[0:2]) will always fail because buffer[0] is the apostrophe.
  2. I believe that the original intention was to ensure that there are at least two alpha characters after the apostrophe (isAlpha(buffer[1:3])), to make sure one is not dealing with a character literal.
     • It still has not been confirmed that the character literal is the only literal that uses an apostrophe (single quote), which would justify it being the only non-attribute ambiguity to guard against.
  3. I proposed a superfluous regex to accomplish finding attributes in general, perhaps undermining whatever pre-processing occurred on buffer.

I am trying to share that the tokeniser failed on my file, which uses attributes, and I am looking to commit the fix. I think point 2 may be the more immediate fix, and I wanted to check whether that was your original intention.

I feel that this repo is on hold, so I don't mind if your head is not in the right space to fully consider any fix whatsoever. Just say.

@Paebbels
Owner

@RocketRoss yes development here is currently very slow due to other activities.

The repo now has >250 test cases and reaches 48% branch coverage. I'm working on improving test coverage and also documentation. While doing so, some bugs were discovered and fixed. Your issue has not been investigated yet.

For a regexp solution: The Tokenizer works without regexp to ensure a high performance.

My plan is to work more on this project over the Christmas holidays.
