This repository has been archived by the owner on Sep 20, 2021. It is now read-only.

Fix error handling for non-UTF-8 string in Lexer #93

Closed
wants to merge 3 commits into from
18 changes: 18 additions & 0 deletions Exception/InternalError.php
@@ -0,0 +1,18 @@
<?php

namespace Hoa\Compiler\Exception;

use LogicException;

/**
* It probably points to some internal issue of the Hoa Compiler library.
* Regardless of the source of the bug, please report this exception to the library maintainers.
* Even if the bug is yours, this exception must not happen.
*/
final class InternalError extends LogicException
Member

final is not required here. I think you can remove it.

Member

All exceptions must extend Hoa\Exception; see the sibling exception classes in the same directory.

Member

I would also name the class RegularExpression or PCRE, something less generic than InternalError.

Author

final is not required here. I think you can remove it.

Let's look at it from another point of view: suppose there were no final keyword, only an open one. You are suggesting that the class be opened for inheritance. Why? It's a well-known idea that composition is preferable to inheritance, for objective reasons. So making classes open to inheritance by default may simply be a language design mistake; some modern languages, for example, have fixed it. So what is the real reason to remove final and open the class for inheritance?

I would also name the class RegularExpression or PCRE, something less generic than InternalError.

I don't get this one either: RegularExpression is a special language or a pattern, depending on context, but not an error. It's a rather odd name for an exception class, IMO. You probably rely on namespaces, but I

I understand and respect your coding style, but I'm curious how you explain it.

Member

We don't use final in the existing code base. I would prefer to address that in another PR, if you don't mind :-). You are pointing to an interesting approach, and I like it, but let's keep things a little separate :-).

About the class/exception name: maybe RegularExpression is not appropriate. But I don't find InternalError more appropriate; it's too abstract.

The exception represents an error in the lexer. We can (i) reuse the Hoa\Compiler\Exception\Lexer exception class (https://github.com/hoaproject/Compiler/blob/master/Exception/Lexer.php), or (ii) create another one that reflects the origin of the exception more precisely, hence a better name than InternalError.

If you decide to reuse Exception\Lexer, I'm fine with that. I would even suggest doing that.

Author

Hoa\Compiler\Exception\Lexer doesn't look like an internal error of the Lexer. In other words, an internal error is the library's failure; the user did nothing wrong. At least that is how I interpret it, and that's why I reused Exception\Lexer for 'Text is not valid utf-8 string, you probably need to switch "lexer.unicode" setting off.', an obvious misconfiguration by the user. It's like comparing 4xx HTTP error codes to 5xx.

I don't like the idea of mixing semantically different errors into one exception.

InternalError is definitely a generic name. I consider this exception an unchecked one; users shouldn't be aware of it at all. Thus, they shouldn't catch this type of exception directly (only as \Exception or \Throwable, in specific parts of an application where the failure of a subprogram can be handled).

The thing either works or it is broken. Classifying hypothetical errors is a waste of time for me, as most of them really are hypothetical: you don't expect them, otherwise you'd just have fixed them; you just make a "save point" for yourself to make sure that it works internally as you expected, without surprises. It is similar to Assertion::blahBlah(): you don't care about the exception class. The assertion just must not fail. If it does, your program is bugged, as if you had passed an object as the argument to an integer parameter.

And regarding names: practice shows that verbs are far more important in programming than nouns. The stack trace says the Lexer is broken and the Lexer throws a Lexer exception. I want to know what went wrong: LexerEncounteredInvalidUtf8Sequence, TokenIsEmptyString, etc.

I was also confused when I found a class Lexer in the unit tests that tried to mimic a test of the real Lexer. :)

But I can reuse Hoa\Compiler\Exception\Lexer.

Author

@unkind Nov 13, 2018

Forgot to add: that's why there is no unit test for InternalError. It's impossible to trigger via the public API of the class unless the code is actually bugged.

By default, PhpStorm also doesn't inspect the @throws tag for exceptions that inherit from \LogicException or \RuntimeException; that's why this one extends \LogicException.

But at the moment PhpStorm just goes nuts with exceptions.

{
public function __construct($message, Exception $previous = null)
{
parent::__construct($message, 0, $previous);
}
}
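
The distinction drawn in the review thread above (a caller's misconfiguration versus a bug inside the library, compared there to 4xx versus 5xx HTTP codes) can be sketched roughly as follows. This is only an illustration: InputError, InvariantViolation and classify() are hypothetical names that do not exist in Hoa\Compiler, and the nullable type hint assumes PHP 7.1 or later.

```php
<?php

// Hypothetical classes for illustration only; they are not part of Hoa\Compiler.

/** Raised when the caller supplied bad input or configuration (the "4xx" case). */
class InputError extends RuntimeException
{
}

/** Raised when one of the library's own invariants is broken (the "5xx" case). */
final class InvariantViolation extends LogicException
{
    public function __construct($message, ?Throwable $previous = null)
    {
        parent::__construct($message, 0, $previous);
    }
}

/**
 * Runs $task at an application boundary: caller errors are caught by their
 * specific class, anything else is only caught generically as \Throwable.
 */
function classify(callable $task)
{
    try {
        $task();

        return 'ok';
    } catch (InputError $e) {
        return 'caller error: ' . $e->getMessage();
    } catch (Throwable $e) {
        return 'internal failure: ' . get_class($e);
    }
}
```

Under this sketch, calling code never names InvariantViolation directly, which matches the author's argument that such an exception is effectively unchecked.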
32 changes: 29 additions & 3 deletions Llk/Lexer.php
@@ -37,6 +37,7 @@
namespace Hoa\Compiler\Llk;

use Hoa\Compiler;
use Hoa\Compiler\Exception\InternalError;

/**
* Class \Hoa\Compiler\Llk\Lexer.
@@ -110,6 +111,8 @@ public function __construct(array $pragmas = [])
*/
public function lexMe($text, array $tokens)
{
$this->validateInputInUnicodeMode($text);

$this->_text = $text;
$this->_tokens = $tokens;
$this->_nsStack = null;
@@ -272,9 +275,9 @@ protected function nextToken($offset)
*/
protected function matchLexeme($lexeme, $regex, $offset)
{
- $_regex = str_replace('#', '\#', $regex);
- $preg = preg_match(
- '#\G(?|' . $_regex . ')#' . $this->_pcreOptions,
+ $_regex = '#\G(?|' . str_replace('#', '\#', $regex) . ')#' . $this->_pcreOptions;
+ $preg = @preg_match(
+ $_regex,
$this->_text,
$matches,
0,
@@ -285,6 +288,16 @@ protected function matchLexeme($lexeme, $regex, $offset)
return null;
}

if (false === $preg) {
throw new Compiler\Exception\InternalError(
sprintf(
'Lexer encountered a PCRE error (code: %d), full regex: "%s".',
preg_last_error(),
$_regex
)
);
}

if ('' === $matches[0]) {
throw new Compiler\Exception\Lexer(
'A lexeme must not match an empty value, which is the ' .
@@ -300,4 +313,17 @@ protected function matchLexeme($lexeme, $regex, $offset)
'length' => mb_strlen($matches[0])
];
}

/**
* @param string $text
* @return void
*/
private function validateInputInUnicodeMode($text)
{
if (strpos($this->_pcreOptions, 'u') !== false && preg_match('##u', $text) === false) {
throw new Compiler\Exception\Lexer(
'Text is not valid utf-8 string, you probably need to switch "lexer.unicode" setting off.'
);
}
}
}
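
The validation added in this file relies on a PCRE behaviour that is easy to miss: with the "u" modifier, preg_match() returns false (rather than 0) when the subject is not well-formed UTF-8, and preg_last_error() then reports PREG_BAD_UTF8_ERROR. A minimal standalone sketch of that check, with isValidUtf8() as a hypothetical helper name that is not part of the PR:

```php
<?php

/**
 * Hypothetical helper: reports whether $text is well-formed UTF-8 by
 * matching an empty pattern in "u" (unicode) mode, as the PR does.
 */
function isValidUtf8($text)
{
    // The empty pattern matches any valid subject (result 1). In "u" mode
    // PCRE validates the subject first and yields false when it is malformed.
    return false !== preg_match('##u', $text);
}

// After a failed call, the error code distinguishes bad input from other
// PCRE failures such as backtrack-limit errors.
if (!isValidUtf8("\xFF") && PREG_BAD_UTF8_ERROR === preg_last_error()) {
    echo 'subject is not valid UTF-8', PHP_EOL;
}
```

This is why the PR treats a false result in matchLexeme() separately from the 0 (no match) result: the two mean different things.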
23 changes: 23 additions & 0 deletions Test/Unit/Llk/Lexer.php
@@ -496,4 +496,27 @@ public function case_unicode_disabled()
' ↑'
);
}

public function case_invalid_utf8_with_unicode_mode()
{
$this
->given(
$lexer = new SUT(['lexer.unicode' => true]),
$datum = "\xFF",
$tokens = [
'default' => [
'foo' => "\xFF"
]
]
)
->when($result = $lexer->lexMe($datum, $tokens))
->then
->exception(function () use ($result) {
$result->next();
})
->isInstanceOf(LUT\Exception\Lexer::class)
->hasMessage(
'Text is not valid utf-8 string, you probably need to switch "lexer.unicode" setting off.'
);
}
}