Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metarefresh fails when source URL responds without HTTP version and response code #42

Open
aaronhark opened this issue Dec 11, 2024 · 4 comments

Comments

@aaronhark
Copy link

Describe the bug
The metarefresh module can fetch metadata from a list of provided URLs and compile it in to the SimpleSAMLphp flatfile format. If one of those URLs responds but does not include a header with its HTTP version and response code (e.g. HTTP/1.1 200 OK), the entire metarefresh process fails due to an uncaught exception. The exception occurs because the logic in MetaLoader.php that handles the response does not account for the possibility of a null value for this header.

Specifics of Environment
Using SimpleSAMLphp 2.3.5 as an SP on Linux with PHP 8.3.10

To Reproduce

  1. Install simplesamlphp ^2.3.5
  2. Install module metarefresh ^1.2
  3. Configure module_metarefresh.php to use these three example URL-based sources and output as flatfile:
  1. Trigger metarefresh via the cron module

Expected Behavior
Fetch the XML metadata from each specified URL, and compile it into the SimpleSAMLphp flatfile format.

Actual Behavior
Because the web server of the second link (https://amidp.drew.edu/nidp/saml2/metadata) returns the XML without an HTTP version and response code, nothing gets loaded into the [0] key of the response array and an exception is subsequently thrown. We can debate whether the web server owner ought to fix its response behavior, but the fact of the matter is there's nothing wrong with the actual XML metadata it serves up. The only reason I know it's not returning an HTTP version and response code is from debugging this issue.

PHP Warning:  Undefined array key 0 in /vendor/simplesamlphp/simplesamlphp/modules/metarefresh/src/MetaLoader.php on line 125
PHP Fatal error:  Uncaught TypeError: preg_match(): Argument #2 ($subject) must be of type string, null given in /vendor/simplesamlphp/simplesamlphp/modules/metarefresh/src/MetaLoader.php:125
Stack trace:
#0 /vendor/simplesamlphp/simplesamlphp/modules/metarefresh/src/MetaLoader.php(125): preg_match('@^HTTP/1\\.[01]\\...', NULL)
#1 /vendor/simplesamlphp/simplesamlphp/modules/metarefresh/src/MetaRefresh.php(129): SimpleSAML\Module\metarefresh\MetaLoader->loadSource(Array)
#2 /vendor/simplesamlphp/simplesamlphp/modules/metarefresh/hooks/hook_cron.php(27): SimpleSAML\Module\metarefresh\MetaRefresh->runRefresh('hourly')
#3 /vendor/simplesamlphp/simplesamlphp/src/SimpleSAML/Module.php(597): metarefresh_hook_cron(Array)
#4 /vendor/simplesamlphp/simplesamlphp/modules/cron/src/Cron.php(55): SimpleSAML\Module::callHooks('cron', Array)
#5 /vendor/simplesamlphp/simplesamlphp/modules/cron/bin/cron.php(45): SimpleSAML\Module\cron\Cron->runTag('hourly')
#6 {main}
  thrown in /vendor/simplesamlphp/simplesamlphp/modules/metarefresh/src/MetaLoader.php on line 125

Solution
There are several ways to resolve this; I'm just not sure what the preferred approach would be for the maintainers of this project. Therefore, I'm presenting two potential solutions inline rather than doing a PR. If one of these is preferred, I'm happy to fork and submit a PR. The most straightforward place to effect this change is at lines 118-134 of src/MetaLoader.php.

Option 1: If for some reason it is critical to this module that the HTTP version and response code header be present, more gracefully handle the lack of one by modifying the code to read as follows. With this option, the reproduction steps I provided above would result in the metadata being loaded successfully from URLs 1 and 3, and URL 2 throwing a warning to the logs.

// We have response headers, so the request succeeded
if (!isset($responseHeaders)) {
    // No response headers, this means the request failed in some way, so re-use old data
    Logger::warning('No response from ' . $source['src'] . ' - attempting to re-use cached metadata');
    $this->addCachedMetadata($source);
    return;
} elseif (!isset($responseHeaders[0])) {
    // Failed to receive an HTTP version and response code in the [0] array value. Treat as an invalid file.
    Logger::warning('Received no HTTP header in response - attempting to re-use cached metadata');
    $this->addCachedMetadata($source);
    return;
} elseif (preg_match('@^HTTP/1\.[01]\s304\s@', $responseHeaders[0])) {
    // 304 response
    Logger::debug('Received HTTP 304 (Not Modified) - attempting to re-use cached metadata');
    $this->addCachedMetadata($source);
    return;
} elseif (!preg_match('@^HTTP/1\.[01]\s200\s@', $responseHeaders[0])) {
    // Other error
    Logger::info('Error from ' . $source['src'] . ' - attempting to re-use cached metadata');
    $this->addCachedMetadata($source);
    return;
}

Option 2: Rather than testing for the HTTP version and response code -- which is irrelevant to the validity of the content served up -- it seems it might make more sense to look at the content-type header to ensure we actually received text/xml. With this option, the reproduction steps I provided above would result in the metadata being loaded successfully from all three URLs.

// We have response headers, so the request succeeded
if (!isset($responseHeaders['content-type']) || substr($responseHeaders['content-type'], 0, 8) != "text/xml") {
    // Invalid response headers. This means the request failed in some way, so re-use old data
    Logger::info('Invalid response headers from ' . $source['src'] . ' - attempting to re-use cached metadata');
    $this->addCachedMetadata($source);
    return;
}
@thijskh
Copy link
Member

thijskh commented Dec 11, 2024

Thanks for reporting. Indeed this is uncessesarily unwieldy and therefore not robust.

I'm thinking in a different route to resolve this.

I think the module should not be concerning itself with details like parsing out HTTP version codes. It already outsources a part of this to SSP's HTTP utility class. But still that class does similair things.

There are many excellent libs that can perform a HTTP request, handle all eventualities and return something sane. Something like curl is highly competent in this. I think everyone is much better off if we replace the custom fiddling with a dependency on PHP's curl module. This is not an unworldly dependency for something like SSP to have, in my opinion, and buys us the shared code used by millions of projects. A project like SSP should focus on being good at the SAML part, fetchting something over HTTP has been solved by others.

Nonetheless, as for your options:

  1. The module does need some handling of error codes to properly deal with either 304's or with failure (so it can load cached data). However, if no code is present at all but the request has content, it might indeed be possible to assume there's valid XML and see if it's parsable.

I do note that I just get a HTTP version reponse back from the URL you give:

< HTTP/1.1 200
  1. This would not make it more robust, as there will be other URLs that do not send a content-type header at all. In fact, the 'correct' content-type header for SAML metadata is actually application/samlmetadata+xml.

@tvdijen
Copy link
Member

tvdijen commented Dec 11, 2024

I also see a HTTP version, so that's not the problem. I suspect @aaronhark is for some reason receiving a HTTP/2 response which doesn't pass the regexp.

I think given our efforts to move towards PSR-7 in the next major release, it also makes sense to offload this to a proper PSR-18 compatible HTTP-client lib like guzzle instead of using curl directly. Or maybe even better; the http-client from Symfony.

@thijskh
Copy link
Member

thijskh commented Dec 11, 2024

In any case, remove code where SimpleSAMLphp itself is parsing HTTP responses with regexes...

@aaronhark
Copy link
Author

@thijskh Just for my understanding, is it not doing the same load of cached metadata in each part of this if-statement, regardless of what HTTP code is returned?

$this->addCachedMetadata($source);

I agree completely with the idea of not fiddling at the HTTP response level directly in this module. In fact, that's actually how we solved this issue: moved all of our fetching of metadata to a separate cURL-based script. I was just offering these code modifications if the desire was to keep with the approach as it currently exists.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants