Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade generated parse_line() to accept constant SBuf #1840

Open
wants to merge 17 commits into
base: master
Choose a base branch
from

Conversation

rousskov
Copy link
Contributor

@rousskov rousskov commented Jun 11, 2024

Also applied recent commit 314e430 "only SP and HT" logic when parsing
filenames in preprocessor include directives and operands in
preprocessor if statements. Prior to this change, in these contexts,
Squid also accepted ASCII VT and FF characters as well as any other
locale-specific characters classified as space by isspace(3).

Other changes attempt to preserve existing squid.conf grammar
idiosyncrasies. A lot of squid.conf parsing code needs to be upgraded;
these changes focus on parse_line() because upcoming reconfiguration
improvements require that particular function to accept SBuf.

Also fixed ConfigParser SetCfgLine() method and CfgLine data member
descriptions. These members are not about the whole squid.conf line, and
not every squid.conf line goes through these members: These members are
dedicated to tokens located after a directive name on a directive line.
A lot of related code needs to be polished, but we fixed these docs
because this change affects all SetCfgLine() callers and CfgLine users.

These changes attempt to preserve existing squid.conf grammar
idiosyncrasies. A lot of squid.conf parsing code needs to be upgraded;
these changes focus on parse_line() because upcoming reconfiguration
improvements require that particular function to accept SBuf.

Also fixed ConfigParser SetCfgLine() method and CfgLine data member
descriptions. These members are not about the whole squid.conf line, and
not every squid.conf line goes through these members: These members are
dedicated to tokens located after a directive name on a directive line.
A lot of related code needs to be polished, but we fixed these docs
because this change affects all SetCfgLine() callers and CfgLine users.
@rousskov rousskov added the S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box label Jun 11, 2024
Copy link
Contributor Author

@rousskov rousskov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This review annotates PR changes but does not request any adjustments.

src/base/CharacterSet.cc Outdated Show resolved Hide resolved
}

std::optional<SBuf>
NextWordWhileRemovingDoubleQuotesAndBackslashesInsideThem(Parser::Tokenizer &tk)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am using this very long name because I assume that the underlying syntax here does not have a well-known name (it is effectively defined by existing strwordtok() implementation that this code is based on) and to automatically flag all callers for extra scrutiny: If the caller should use a well-known syntax instead, they should not call this function.

Just like with libcSpace(): We may (and probably should!) change callers grammar to some well-known alternative, but let's not do that in this behavior-preserving PR.

If my assumption about the lack of a well-known name for this syntax is wrong, please let me know what the correct name is, and I will rename this function in his PR, of course.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Er, this is called "un-escaping a quoted string"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Er, this is called "un-escaping a quoted string"

Yes, it is. A lot of things can be (and are!) called that. This function name should be more precise IMO, for the reasons detailed earlier.

src/cache_cf.cc Outdated Show resolved Hide resolved
@rousskov rousskov added the S-waiting-for-PR Closure of other PR(s), current or future, is expected (and usually required) label Jun 13, 2024
    if 0=1

Our preprocessor grammar is not based on space-separated tokens!
Condition "00=000" will need to be replaced with "0=000"
when int64() bug is fixed.
Better than nothing. We could add a stronger test that validates
equality, but it would make this configuration file inapplicable to SMP
Squids. IMO, we should not do that without also developing the
corresponding instruction (and adding that to the instructions file).
Copy link
Contributor

@eduard-bagdasaryan eduard-bagdasaryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rousskov rousskov removed the S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box label Jun 18, 2024
Copy link
Contributor

@yadij yadij left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I believe you are dong the refactor "wrong", ie. in a way which makes the change steps more difficult and complex than other approaches.

There are a lot of "Also, ..." changes in src/cache_cf.cc which are not documented in the PR description.

src/ConfigParser.cc Outdated Show resolved Hide resolved
src/ConfigParser.cc Outdated Show resolved Hide resolved
src/ConfigParser.cc Show resolved Hide resolved
src/base/CharacterSet.cc Outdated Show resolved Hide resolved
src/cache_cf.cc Outdated Show resolved Hide resolved
Copy link
Contributor Author

@rousskov rousskov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yadij, I hope I have addressed all change requests. Please revisit the two change requests related to character sets ASAP. I hope those requests were based on the wrong assumption about the current official code, and the corresponding changes should be implemented in a followup PR instead. If you insist, I will address them in this PR, but I do not want this PR wait another ten days for that decision. Thank you.

src/ConfigParser.cc Show resolved Hide resolved
src/base/CharacterSet.cc Outdated Show resolved Hide resolved
src/cache_cf.cc Outdated Show resolved Hide resolved
src/ConfigParser.cc Outdated Show resolved Hide resolved
src/ConfigParser.cc Outdated Show resolved Hide resolved
@rousskov rousskov requested a review from yadij June 21, 2024 17:45
@rousskov rousskov added the S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box label Jun 21, 2024
This change should have been done in master/v7 commit 314e430, but I
missed it because the caller does not use isspace() directly.

This change removes the last use of branch-added
CharacterSet::libcSpace().
Comment on lines +379 to 385
/// extracts all leading space characters (if any)
/// \returns whether at least one character was extracted
static bool
SkipOptionalSpace(Parser::Tokenizer &tk)
{
SubstituteMacro(line, len, "${service_name}", service_name.c_str());
SubstituteMacro(line, len, "${process_name}", TheKidName.c_str());
SubstituteMacro(line, len, "${process_number}", xitoa(KidIdentifier));
return tk.skipAll(CharacterSet::WSP);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please inline this one-line function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function should not be inlined because inlining it will duplicate its code and obscure its high-level intent/logic in many callers.

In most cases, we should not inline, regardless of the number of lines a function has. There are exceptions to that rule of thumb, of course, but this particular function is not exceptional.

Copy link
Contributor

@yadij yadij Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling a one-line function is just as much duplicate code as writing its contents in full.

This function should not be inlined because inlining it will duplicate its code

"duplicate code" arguments are only relevant to lines which are complex (eg math equations).

and obscure its high-level intent/logic in many callers.

The "high level intent" is clear from the syntax and parameters of skipAll(CharacterSet::WSP);.
This being a separate function is redundant (as is the new documentation for it).

In most cases, we should not inline, regardless of the number of lines a function has. There are exceptions to that rule of thumb, of course, but this particular function is not exceptional.

I Disagree, one-line functions are all special-case for that rule of thumb. This one is also an exceptional in addition to that one-line nature:

FYI: One-line functions/methods that simply call another function/method (such as this one does) are called "trampolines" and best practice is to avoid use of trampolines whenever possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling a one-line function is just as much duplicate code as writing its contents in full.

That assertion is false for at least two independent reasons:

  • That encapsulated code line contains more details than a function name. It contains the choice of delimiters to use when skipping space.
  • Using the same context-specific wrapper function (name) in multiple places carries more information than calling a low level skipAll() method would. Knowing that all these places are related and all of them skip optional space (rather than, say, required space) helps read and maintain code.

This function should not be inlined because inlining it will duplicate its code

"duplicate code" arguments are only relevant to lines which are complex (eg math equations).

That assertion is false because complexity is not the only thing we do not want to duplicate. Duplicating simple decisions is also a negative.

The "high level intent" is clear from the syntax and parameters of skipAll(CharacterSet::WSP);.

Does that call skip optional or required space? What other places should be updated if I want to update WSP in one particular place? Encapsulating that call helps answer these and other questions. Yes, those answers can usually be derived without encapsulation, but at a higher cost (sometimes significantly higher cost!).

In most cases, we should not inline, regardless of the number of lines a function has. There are exceptions to that rule of thumb, of course, but this particular function is not exceptional.

I Disagree, one-line functions are all special-case for that rule of thumb.

I see no good reason to make one-line functions a special case. The onus is on you to prove such an exception is needed (just stating that it is needed is not enough).

N.B. Absence of an exception to "do not inline unless you have a very good reason" rule of thumb does not mean that one-line functions are always better than duplicated inlined code, of course.

This one is also an exceptional in addition to that one-line nature:

FYI: One-line functions/methods that simply call another function/method (such as this one does) are called "trampolines" and best practice is to avoid use of trampolines whenever possible.

The term "trampoline function" means quite different things in various programming languages/contexts, but it is not used to describe "a function calling another function" in any relevant context I can find. The term is usually related to functions that define/contain (not just call!) other functions or receive other functions as parameters/state. None of that applies here, of course.

/// \returns std::nullopt if input does not look like an `include` statement
/// \returns `include` parameters if input is an `include` statement
static std::optional<SBuf>
IsIncludeLine(Parser::Tokenizer tk)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing reference leaves the keyword "include " in callers Tokenizer to be interpreted as a filename:

Suggested change
IsIncludeLine(Parser::Tokenizer tk)
IsIncludeLine(Parser::Tokenizer &tk)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing reference leaves the keyword "include " in callers Tokenizer to be interpreted as a filename

That reference is not missing here and adding it would introduce problems: This IsX() function is a test function. It does not and should not change its input/Tokenizer. The caller is fine because it gets filenames from the function return value. The caller also does not want any input consumed because it is running other tests against the same input line and does not want to worry about one test affecting the other.

char* lvalue = (char*)xmalloc(equation - expr + 1);
xstrncpy(lvalue, expr, equation - expr + 1);
trim_trailing_ws(lvalue);
// grammar: space* keyword space* END
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This grammar does not match the parser. The second space* is prohibited from being >0 by RejectTrailingGarbage() below:

Suggested change
// grammar: space* keyword space* END
// grammar: space* keyword 0#space

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This grammar does not match the parser. The second space* is prohibited from being >0 by RejectTrailingGarbage() below

The grammar matches: The second space* is allowed/consumed right before RejectTrailingGarbage() is called:

if (SkipOptionalSpace(tk)) { // second `space*` is allowed/consumed
    RejectTrailingGarbage("preprocessor keyword", keyword, tk); // throws if there is trailing garbage
    return true;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. right. The wrapper function naming and code structure makes it seem like any skipping done causes a rejection.

The problem this is the opposite. A line *without spaces after the found token is not rejected as garbage. For example;

  • the old parser (as already mentioned) accepted "else if" and treated it as an else-statement with a nested if-statement
  • this logic will accept "elseif" and do the same.
    .. both of which are unintentional and broken/undefined behaviour.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A line *without spaces after the found token is not rejected as garbage.

That statement is misleading in this context because such a line is not accepted either: A line with garbage after letters matching a given keyword K results in IsIfStatementLine(K) returning false. That is correct behavior for that function and the grammar it defines/supports -- preprocessor keyword K was not found. After IsIfStatementLine(K) returns false, other parsing code (possibly including other IsIfStatementLine(X) calls) deals with that line, rejecting it as needed.

For example; ... this logic will accept "elseif" and [treat it as an else-statement with a nested if-statement]

That example is incorrect or misleading. PR code does not accept "elseif". It correctly rejects "elseif", first as not a preprocessor line and then as an unknown squid.conf directive:

2024/11/05 10:51:40| ERROR: unrecognized directive near 'elseif'
2024/11/05 10:51:41| FATAL: configuration failure: Found 1 unrecognized directive(s)

If we add a directive named like elsewhere_services ... or iffy_clients_enabled ..., PR preprocessor code will not need to be modified for Squid to correctly recognize that new directive despite its name starting with the same letters as the preprocessor keyword "else" or "if". Needless to say, the chances of us adding those exact directives are pretty much zero, but I am using them to illustrate that PR code structure is correct and can support future changes such as addition of new directives (or new preprocessor keywords that are more likely to match some directive name prefixes!).

Comment on lines +502 to +506
if (SkipOptionalSpace(tk)) {
RejectTrailingGarbage("preprocessor keyword", keyword, tk);
return true;
}
// e.g., "elseif"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, "elseif" is not supported.

However, despite cf.data.pre stating that "NOTE: An else-if condition is not supported." the configuration grammar:

   // grammar: "else" space "if" space condition space* END

does actually work in current Squid. This new logic will break that. Please fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, "elseif" is not supported.

Yes, this comment provides an example of invalid (but realistic) input that would allow function code to reach this particular spot on a negative/"return false" code path.

However, despite cf.data.pre stating that "NOTE: An else-if condition is not supported." the configuration grammar:

   // grammar: "else" space "if" space condition space* END

does actually work in current Squid.

No, else if does not work in the current official code. Our documentation is correct -- that and similar else-if expressions are not supported: Official code does not bother to check for leftovers and misinterprets else if ... as unconditional else! There are other problems related to if/else/endif handling in official code (that this PR does not make worse); please do not ask this "SBuf upgrade" PR to add proper else-if support to Squid.

This new logic will break that. Please fix.

PR code does not break non-existent else if support (so there is nothing to "fix" in this change request scope), but this PR does fix detection of that still-unsupported case and provides useful diagnostics:

FATAL: configuration failure: found trailing garbage after parsing preprocessor keyword else: if false

@yadij yadij added S-waiting-for-author author action is expected (and usually required) and removed S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box labels Jul 7, 2024
Copy link
Contributor Author

@rousskov rousskov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yadij, I believe I have addressed all change requests. Please unblock this PR.

Comment on lines +379 to 385
/// extracts all leading space characters (if any)
/// \returns whether at least one character was extracted
static bool
SkipOptionalSpace(Parser::Tokenizer &tk)
{
SubstituteMacro(line, len, "${service_name}", service_name.c_str());
SubstituteMacro(line, len, "${process_name}", TheKidName.c_str());
SubstituteMacro(line, len, "${process_number}", xitoa(KidIdentifier));
return tk.skipAll(CharacterSet::WSP);
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function should not be inlined because inlining it will duplicate its code and obscure its high-level intent/logic in many callers.

In most cases, we should not inline, regardless of the number of lines a function has. There are exceptions to that rule of thumb, of course, but this particular function is not exceptional.

/// \returns std::nullopt if input does not look like an `include` statement
/// \returns `include` parameters if input is an `include` statement
static std::optional<SBuf>
IsIncludeLine(Parser::Tokenizer tk)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing reference leaves the keyword "include " in callers Tokenizer to be interpreted as a filename

That reference is not missing here and adding it would introduce problems: This IsX() function is a test function. It does not and should not change its input/Tokenizer. The caller is fine because it gets filenames from the function return value. The caller also does not want any input consumed because it is running other tests against the same input line and does not want to worry about one test affecting the other.

char* lvalue = (char*)xmalloc(equation - expr + 1);
xstrncpy(lvalue, expr, equation - expr + 1);
trim_trailing_ws(lvalue);
// grammar: space* keyword space* END
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This grammar does not match the parser. The second space* is prohibited from being >0 by RejectTrailingGarbage() below

The grammar matches: The second space* is allowed/consumed right before RejectTrailingGarbage() is called:

if (SkipOptionalSpace(tk)) { // second `space*` is allowed/consumed
    RejectTrailingGarbage("preprocessor keyword", keyword, tk); // throws if there is trailing garbage
    return true;
}

Comment on lines +502 to +506
if (SkipOptionalSpace(tk)) {
RejectTrailingGarbage("preprocessor keyword", keyword, tk);
return true;
}
// e.g., "elseif"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, "elseif" is not supported.

Yes, this comment provides an example of invalid (but realistic) input that would allow function code to reach this particular spot on a negative/"return false" code path.

However, despite cf.data.pre stating that "NOTE: An else-if condition is not supported." the configuration grammar:

   // grammar: "else" space "if" space condition space* END

does actually work in current Squid.

No, else if does not work in the current official code. Our documentation is correct -- that and similar else-if expressions are not supported: Official code does not bother to check for leftovers and misinterprets else if ... as unconditional else! There are other problems related to if/else/endif handling in official code (that this PR does not make worse); please do not ask this "SBuf upgrade" PR to add proper else-if support to Squid.

This new logic will break that. Please fix.

PR code does not break non-existent else if support (so there is nothing to "fix" in this change request scope), but this PR does fix detection of that still-unsupported case and provides useful diagnostics:

FATAL: configuration failure: found trailing garbage after parsing preprocessor keyword else: if false

test-suite/squidconf/pp-ifs.conf Outdated Show resolved Hide resolved
@rousskov rousskov added S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box and removed S-waiting-for-author author action is expected (and usually required) S-waiting-for-PR Closure of other PR(s), current or future, is expected (and usually required) labels Jul 8, 2024
@rousskov rousskov requested a review from yadij July 8, 2024 20:29
@squid-anubis squid-anubis added M-failed-other https://github.com/measurement-factory/anubis#pull-request-labels and removed M-failed-other https://github.com/measurement-factory/anubis#pull-request-labels labels Sep 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants