Upgrade generated parse_line() to accept constant SBuf #1840

rousskov · 2024-06-11T17:50:55Z

Also applied recent commit 314e430 "only SP and HT" logic when parsing
filenames in preprocessor include directives and operands in
preprocessor if statements. Prior to this change, in these contexts,
Squid also accepted ASCII VT and FF characters as well as any other
locale-specific characters classified as space by isspace(3).

Other changes attempt to preserve existing squid.conf grammar
idiosyncrasies. A lot of squid.conf parsing code needs to be upgraded;
these changes focus on parse_line() because upcoming reconfiguration
improvements require that particular function to accept SBuf.

Also fixed ConfigParser SetCfgLine() method and CfgLine data member
descriptions. These members are not about the whole squid.conf line, and
not every squid.conf line goes through these members: These members are
dedicated to tokens located after a directive name on a directive line.
A lot of related code needs to be polished, but we fixed these docs
because this change affects all SetCfgLine() callers and CfgLine users.

These changes attempt to preserve existing squid.conf grammar idiosyncrasies. A lot of squid.conf parsing code needs to be upgraded; these changes focus on parse_line() because upcoming reconfiguration improvements require that particular function to accept SBuf. Also fixed ConfigParser SetCfgLine() method and CfgLine data member descriptions. These members are not about the whole squid.conf line, and not every squid.conf line goes through these members: These members are dedicated to tokens located after a directive name on a directive line. A lot of related code needs to be polished, but we fixed these docs because this change affects all SetCfgLine() callers and CfgLine users.

rousskov

This review annotates PR changes but does not request any adjustments.

src/base/CharacterSet.cc

rousskov · 2024-06-11T18:23:53Z

src/parser/Tokenizer.cc

+}
+
+std::optional<SBuf>
+NextWordWhileRemovingDoubleQuotesAndBackslashesInsideThem(Parser::Tokenizer &tk)


I am using this very long name because I assume that the underlying syntax here does not have a well-known name (it is effectively defined by existing strwordtok() implementation that this code is based on) and to automatically flag all callers for extra scrutiny: If the caller should use a well-known syntax instead, they should not call this function.

Just like with libcSpace(): We may (and probably should!) change callers grammar to some well-known alternative, but let's not do that in this behavior-preserving PR.

If my assumption about the lack of a well-known name for this syntax is wrong, please let me know what the correct name is, and I will rename this function in his PR, of course.

Er, this is called "un-escaping a quoted string"

Er, this is called "un-escaping a quoted string"

Yes, it is. A lot of things can be (and are!) called that. This function name should be more precise IMO, for the reasons detailed earlier.

src/cache_cf.cc

if 0=1 Our preprocessor grammar is not based on space-separated tokens!

Condition "00=000" will need to be replaced with "0=000" when int64() bug is fixed.

Better than nothing. We could add a stronger test that validates equality, but it would make this configuration file inapplicable to SMP Squids. IMO, we should not do that without also developing the corresponding instruction (and adding that to the instructions file).

test-suite/squidconf/pp-ifs.conf

src/base/CharacterSet.cc

eduard-bagdasaryan

LGTM

Also fix #include order for STL headers. Context: squid-cache#1840 (review)

yadij

FWIW, I believe you are dong the refactor "wrong", ie. in a way which makes the change steps more difficult and complex than other approaches.

There are a lot of "Also, ..." changes in src/cache_cf.cc which are not documented in the PR description.

src/ConfigParser.cc

src/base/CharacterSet.cc

src/cache_cf.cc

Context: squid-cache#1840 (comment)

rousskov

@yadij, I hope I have addressed all change requests. Please revisit the two change requests related to character sets ASAP. I hope those requests were based on the wrong assumption about the current official code, and the corresponding changes should be implemented in a followup PR instead. If you insist, I will address them in this PR, but I do not want this PR wait another ten days for that decision. Thank you.

src/ConfigParser.cc

src/base/CharacterSet.cc

src/cache_cf.cc

src/ConfigParser.cc

See master/v7 commit 314e430.

This change should have been done in master/v7 commit 314e430, but I missed it because the caller does not use isspace() directly. This change removes the last use of branch-added CharacterSet::libcSpace().

yadij · 2024-07-07T03:35:16Z

src/cache_cf.cc

+/// extracts all leading space characters (if any)
+/// \returns whether at least one character was extracted
+static bool
+SkipOptionalSpace(Parser::Tokenizer &tk)
 {
-    SubstituteMacro(line, len, "${service_name}", service_name.c_str());
-    SubstituteMacro(line, len, "${process_name}", TheKidName.c_str());
-    SubstituteMacro(line, len, "${process_number}", xitoa(KidIdentifier));
+    return tk.skipAll(CharacterSet::WSP);
 }


Please inline this one-line function.

This function should not be inlined because inlining it will duplicate its code and obscure its high-level intent/logic in many callers.

In most cases, we should not inline, regardless of the number of lines a function has. There are exceptions to that rule of thumb, of course, but this particular function is not exceptional.

Calling a one-line function is just as much duplicate code as writing its contents in full.

This function should not be inlined because inlining it will duplicate its code

"duplicate code" arguments are only relevant to lines which are complex (eg math equations).

and obscure its high-level intent/logic in many callers.

The "high level intent" is clear from the syntax and parameters of skipAll(CharacterSet::WSP);.
This being a separate function is redundant (as is the new documentation for it).

In most cases, we should not inline, regardless of the number of lines a function has. There are exceptions to that rule of thumb, of course, but this particular function is not exceptional.

I Disagree, one-line functions are all special-case for that rule of thumb. This one is also an exceptional in addition to that one-line nature:

FYI: One-line functions/methods that simply call another function/method (such as this one does) are called "trampolines" and best practice is to avoid use of trampolines whenever possible.

Calling a one-line function is just as much duplicate code as writing its contents in full.

That assertion is false for at least two independent reasons:

That encapsulated code line contains more details than a function name. It contains the choice of delimiters to use when skipping space.

Using the same context-specific wrapper function (name) in multiple places carries more information than calling a low level skipAll() method would. Knowing that all these places are related and all of them skip optional space (rather than, say, required space) helps read and maintain code.

This function should not be inlined because inlining it will duplicate its code

"duplicate code" arguments are only relevant to lines which are complex (eg math equations).

That assertion is false because complexity is not the only thing we do not want to duplicate. Duplicating simple decisions is also a negative.

The "high level intent" is clear from the syntax and parameters of skipAll(CharacterSet::WSP);.

Does that call skip optional or required space? What other places should be updated if I want to update WSP in one particular place? Encapsulating that call helps answer these and other questions. Yes, those answers can usually be derived without encapsulation, but at a higher cost (sometimes significantly higher cost!).

In most cases, we should not inline, regardless of the number of lines a function has. There are exceptions to that rule of thumb, of course, but this particular function is not exceptional.

I Disagree, one-line functions are all special-case for that rule of thumb.

I see no good reason to make one-line functions a special case. The onus is on you to prove such an exception is needed (just stating that it is needed is not enough).

N.B. Absence of an exception to "do not inline unless you have a very good reason" rule of thumb does not mean that one-line functions are always better than duplicated inlined code, of course.

This one is also an exceptional in addition to that one-line nature:

FYI: One-line functions/methods that simply call another function/method (such as this one does) are called "trampolines" and best practice is to avoid use of trampolines whenever possible.

The term "trampoline function" means quite different things in various programming languages/contexts, but it is not used to describe "a function calling another function" in any relevant context I can find. The term is usually related to functions that define/contain (not just call!) other functions or receive other functions as parameters/state. None of that applies here, of course.

yadij · 2024-07-07T03:38:55Z

src/cache_cf.cc

+/// \returns std::nullopt if input does not look like an `include` statement
+/// \returns `include` parameters if input is an `include` statement
+static std::optional<SBuf>
+IsIncludeLine(Parser::Tokenizer tk)


Missing reference leaves the keyword "include " in callers Tokenizer to be interpreted as a filename:

Suggested change

IsIncludeLine(Parser::Tokenizer tk)

IsIncludeLine(Parser::Tokenizer &tk)

Missing reference leaves the keyword "include " in callers Tokenizer to be interpreted as a filename

That reference is not missing here and adding it would introduce problems: This IsX() function is a test function. It does not and should not change its input/Tokenizer. The caller is fine because it gets filenames from the function return value. The caller also does not want any input consumed because it is running other tests against the same input line and does not want to worry about one test affecting the other.

yadij · 2024-07-07T03:54:20Z

src/cache_cf.cc

-        char* lvalue = (char*)xmalloc(equation - expr + 1);
-        xstrncpy(lvalue, expr, equation - expr + 1);
-        trim_trailing_ws(lvalue);
+    // grammar: space* keyword space* END


This grammar does not match the parser. The second space* is prohibited from being >0 by RejectTrailingGarbage() below:

Suggested change

// grammar: space* keyword space* END

// grammar: space* keyword 0#space

This grammar does not match the parser. The second space* is prohibited from being >0 by RejectTrailingGarbage() below

The grammar matches: The second space* is allowed/consumed right before RejectTrailingGarbage() is called:

if (SkipOptionalSpace(tk)) { // second `space*` is allowed/consumed RejectTrailingGarbage("preprocessor keyword", keyword, tk); // throws if there is trailing garbage return true; }

Hmm. right. The wrapper function naming and code structure makes it seem like any skipping done causes a rejection.

The problem this is the opposite. A line *without spaces after the found token is not rejected as garbage. For example;

the old parser (as already mentioned) accepted "else if" and treated it as an else-statement with a nested if-statement

this logic will accept "elseif" and do the same.
.. both of which are unintentional and broken/undefined behaviour.

A line *without spaces after the found token is not rejected as garbage.

That statement is misleading in this context because such a line is not accepted either: A line with garbage after letters matching a given keyword K results in IsIfStatementLine(K) returning false. That is correct behavior for that function and the grammar it defines/supports -- preprocessor keyword K was not found. After IsIfStatementLine(K) returns false, other parsing code (possibly including other IsIfStatementLine(X) calls) deals with that line, rejecting it as needed.

For example; ... this logic will accept "elseif" and [treat it as an else-statement with a nested if-statement]

That example is incorrect or misleading. PR code does not accept "elseif". It correctly rejects "elseif", first as not a preprocessor line and then as an unknown squid.conf directive:

2024/11/05 10:51:40| ERROR: unrecognized directive near 'elseif' 2024/11/05 10:51:41| FATAL: configuration failure: Found 1 unrecognized directive(s)

If we add a directive named like elsewhere_services ... or iffy_clients_enabled ..., PR preprocessor code will not need to be modified for Squid to correctly recognize that new directive despite its name starting with the same letters as the preprocessor keyword "else" or "if". Needless to say, the chances of us adding those exact directives are pretty much zero, but I am using them to illustrate that PR code structure is correct and can support future changes such as addition of new directives (or new preprocessor keywords that are more likely to match some directive name prefixes!).

yadij · 2024-07-07T04:15:41Z

src/cache_cf.cc

+        if (SkipOptionalSpace(tk)) {
+            RejectTrailingGarbage("preprocessor keyword", keyword, tk);
+            return true;
+        }
+        // e.g., "elseif"


Ah, "elseif" is not supported.

However, despite cf.data.pre stating that "NOTE: An else-if condition is not supported." the configuration grammar:

// grammar: "else" space "if" space condition space* END

does actually work in current Squid. This new logic will break that. Please fix.

Ah, "elseif" is not supported.

Yes, this comment provides an example of invalid (but realistic) input that would allow function code to reach this particular spot on a negative/"return false" code path.

However, despite cf.data.pre stating that "NOTE: An else-if condition is not supported." the configuration grammar:

// grammar: "else" space "if" space condition space* END

does actually work in current Squid.

No, else if does not work in the current official code. Our documentation is correct -- that and similar else-if expressions are not supported: Official code does not bother to check for leftovers and misinterprets else if ... as unconditional else! There are other problems related to if/else/endif handling in official code (that this PR does not make worse); please do not ask this "SBuf upgrade" PR to add proper else-if support to Squid.

This new logic will break that. Please fix.

PR code does not break non-existent else if support (so there is nothing to "fix" in this change request scope), but this PR does fix detection of that still-unsupported case and provides useful diagnostics:

FATAL: configuration failure: found trailing garbage after parsing preprocessor keyword else: if false

rousskov

@yadij, I believe I have addressed all change requests. Please unblock this PR.

rousskov · 2024-07-08T16:31:39Z

src/cache_cf.cc

+/// extracts all leading space characters (if any)
+/// \returns whether at least one character was extracted
+static bool
+SkipOptionalSpace(Parser::Tokenizer &tk)
 {
-    SubstituteMacro(line, len, "${service_name}", service_name.c_str());
-    SubstituteMacro(line, len, "${process_name}", TheKidName.c_str());
-    SubstituteMacro(line, len, "${process_number}", xitoa(KidIdentifier));
+    return tk.skipAll(CharacterSet::WSP);
 }


This function should not be inlined because inlining it will duplicate its code and obscure its high-level intent/logic in many callers.

In most cases, we should not inline, regardless of the number of lines a function has. There are exceptions to that rule of thumb, of course, but this particular function is not exceptional.

rousskov · 2024-07-08T16:38:21Z

src/cache_cf.cc

+/// \returns std::nullopt if input does not look like an `include` statement
+/// \returns `include` parameters if input is an `include` statement
+static std::optional<SBuf>
+IsIncludeLine(Parser::Tokenizer tk)


Missing reference leaves the keyword "include " in callers Tokenizer to be interpreted as a filename

That reference is not missing here and adding it would introduce problems: This IsX() function is a test function. It does not and should not change its input/Tokenizer. The caller is fine because it gets filenames from the function return value. The caller also does not want any input consumed because it is running other tests against the same input line and does not want to worry about one test affecting the other.

rousskov · 2024-07-08T16:43:56Z

src/cache_cf.cc

-        char* lvalue = (char*)xmalloc(equation - expr + 1);
-        xstrncpy(lvalue, expr, equation - expr + 1);
-        trim_trailing_ws(lvalue);
+    // grammar: space* keyword space* END


This grammar does not match the parser. The second space* is prohibited from being >0 by RejectTrailingGarbage() below

The grammar matches: The second space* is allowed/consumed right before RejectTrailingGarbage() is called:

if (SkipOptionalSpace(tk)) { // second `space*` is allowed/consumed RejectTrailingGarbage("preprocessor keyword", keyword, tk); // throws if there is trailing garbage return true; }

rousskov · 2024-07-08T17:37:11Z

src/cache_cf.cc

+        if (SkipOptionalSpace(tk)) {
+            RejectTrailingGarbage("preprocessor keyword", keyword, tk);
+            return true;
+        }
+        // e.g., "elseif"


Ah, "elseif" is not supported.

Yes, this comment provides an example of invalid (but realistic) input that would allow function code to reach this particular spot on a negative/"return false" code path.

However, despite cf.data.pre stating that "NOTE: An else-if condition is not supported." the configuration grammar:

// grammar: "else" space "if" space condition space* END

does actually work in current Squid.

No, else if does not work in the current official code. Our documentation is correct -- that and similar else-if expressions are not supported: Official code does not bother to check for leftovers and misinterprets else if ... as unconditional else! There are other problems related to if/else/endif handling in official code (that this PR does not make worse); please do not ask this "SBuf upgrade" PR to add proper else-if support to Squid.

This new logic will break that. Please fix.

PR code does not break non-existent else if support (so there is nothing to "fix" in this change request scope), but this PR does fix detection of that still-unsupported case and provides useful diagnostics:

FATAL: configuration failure: found trailing garbage after parsing preprocessor keyword else: if false

test-suite/squidconf/pp-ifs.conf

rousskov requested a review from eduard-bagdasaryan June 11, 2024 17:51

rousskov added the S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box label Jun 11, 2024

rousskov commented Jun 11, 2024

View reviewed changes

eduard-bagdasaryan reviewed Jun 12, 2024

View reviewed changes

src/cache_cf.cc Show resolved Hide resolved

eduard-bagdasaryan reviewed Jun 13, 2024

View reviewed changes

src/cache_cf.cc Show resolved Hide resolved

eduard-bagdasaryan reviewed Jun 13, 2024

View reviewed changes

src/cache_cf.cc Outdated Show resolved Hide resolved

rousskov added the S-waiting-for-PR Closure of other PR(s), current or future, is expected (and usually required) label Jun 13, 2024

rousskov added 5 commits June 13, 2024 17:38

fixup: Do not insist on space around pp condition operator

f0fe5ba

if 0=1 Our preprocessor grammar is not based on space-separated tokens!

fixup: Added a preprocessor test

49084f9

Condition "00=000" will need to be replaced with "0=000" when int64() bug is fixed.

fixup: Simplified and polished preprocessor parsing code

6d897ce

fixup: More temporary workarounds for Tokenizer::int64() bugs

02a4059

rousskov commented Jun 14, 2024

View reviewed changes

test-suite/squidconf/pp-ifs.conf Outdated Show resolved Hide resolved

rousskov mentioned this pull request Jun 17, 2024

Fix Tokenizer::int64() parsing of "0" when guessing base #1842

Closed

eduard-bagdasaryan reviewed Jun 17, 2024

View reviewed changes

src/base/CharacterSet.cc Outdated Show resolved Hide resolved

eduard-bagdasaryan approved these changes Jun 17, 2024

View reviewed changes

fixup: Prefer std::numeric_limits to unnamed constants

4a565ed

Also fix #include order for STL headers. Context: squid-cache#1840 (review)

rousskov removed the S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box label Jun 18, 2024

yadij requested changes Jun 21, 2024

View reviewed changes

src/ConfigParser.cc Outdated Show resolved Hide resolved

src/ConfigParser.cc Outdated Show resolved Hide resolved

src/ConfigParser.cc Show resolved Hide resolved

src/base/CharacterSet.cc Outdated Show resolved Hide resolved

src/cache_cf.cc Outdated Show resolved Hide resolved

rousskov added 2 commits June 21, 2024 13:30

fixup: Undo error text improvement

d0e01dd

Context: squid-cache#1840 (comment)

fixup: Detail a TODO

896e2f5

Context: squid-cache#1840 (comment)

rousskov commented Jun 21, 2024

View reviewed changes

src/ConfigParser.cc Show resolved Hide resolved

src/base/CharacterSet.cc Outdated Show resolved Hide resolved

src/cache_cf.cc Outdated Show resolved Hide resolved

src/ConfigParser.cc Outdated Show resolved Hide resolved

src/ConfigParser.cc Outdated Show resolved Hide resolved

rousskov requested a review from yadij June 21, 2024 17:45

rousskov added the S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box label Jun 21, 2024

Merged master to resolve a seemingly trivial merge conflict

67eefe6

rousskov mentioned this pull request Jul 2, 2024

Restrict squid.conf preprocessor space characters to SP and HT #1858

Closed

rousskov added 3 commits July 5, 2024 11:53

Merge master to resolve conflicts with "only SP/HT" changes

caa6dd4

See master/v7 commit 314e430.

Switch to "only SP/HT" grammar when parsing include filenames

2e4d700

This change should have been done in master/v7 commit 314e430, but I missed it because the caller does not use isspace() directly. This change removes the last use of branch-added CharacterSet::libcSpace().

Removed no-longer-used branch-added CharacterSet::libcSpace()

7e1af18

yadij requested changes Jul 7, 2024

View reviewed changes

yadij added S-waiting-for-author author action is expected (and usually required) and removed S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box labels Jul 7, 2024

rousskov added 3 commits July 8, 2024 16:04

Merge master to get int64() parsing fixes for our test cases

7e0a31d

fixup: Removed distracting workarounds for int64() zero-parsing bugs

2176dd8

fixup: Fixed (or at least improved) test case wording

2a6a41f

rousskov commented Jul 8, 2024

View reviewed changes

rousskov requested a review from yadij July 8, 2024 20:29

rousskov mentioned this pull request Aug 28, 2024

Reject config with unknown directives before committing to it #1897

Closed

Merge branch 'master' into SQUID-984-parse_line-SBuf

9e68829

squid-anubis added M-failed-other https://github.com/measurement-factory/anubis#pull-request-labels and removed M-failed-other https://github.com/measurement-factory/anubis#pull-request-labels labels Sep 22, 2024

rousskov mentioned this pull request Oct 2, 2024

Upgrade HTTP header field-line parser #1908

Draft

rousskov mentioned this pull request Oct 28, 2024

Replace MEM_DREAD_CTRL pool with MEMPROXY class pool #1924

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Upgrade generated parse_line() to accept constant SBuf #1840

Upgrade generated parse_line() to accept constant SBuf #1840

rousskov commented Jun 11, 2024 •

edited

Loading

rousskov left a comment

rousskov Jun 11, 2024

yadij Jun 28, 2024

rousskov Jun 29, 2024

eduard-bagdasaryan left a comment

yadij left a comment

rousskov left a comment

yadij Jul 7, 2024

rousskov Jul 8, 2024

yadij Nov 5, 2024 •

edited

Loading

rousskov Nov 5, 2024

yadij Jul 7, 2024

rousskov Jul 8, 2024

yadij Jul 7, 2024

rousskov Jul 8, 2024

yadij Nov 5, 2024

rousskov Nov 5, 2024

yadij Jul 7, 2024

rousskov Jul 8, 2024

rousskov left a comment

rousskov Jul 8, 2024

rousskov Jul 8, 2024

rousskov Jul 8, 2024

rousskov Jul 8, 2024

	IsIncludeLine(Parser::Tokenizer tk)
	IsIncludeLine(Parser::Tokenizer &tk)

	// grammar: space* keyword space* END
	// grammar: space* keyword 0#space

Upgrade generated parse_line() to accept constant SBuf #1840

Are you sure you want to change the base?

Upgrade generated parse_line() to accept constant SBuf #1840

Conversation

rousskov commented Jun 11, 2024 • edited Loading

rousskov left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

eduard-bagdasaryan left a comment

Choose a reason for hiding this comment

yadij left a comment

Choose a reason for hiding this comment

rousskov left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yadij Nov 5, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rousskov left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rousskov commented Jun 11, 2024 •

edited

Loading

yadij Nov 5, 2024 •

edited

Loading