Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Email not displaying/being parsed properly #6713

Closed
MrBMT opened this issue Apr 5, 2019 · 10 comments
Closed

Email not displaying/being parsed properly #6713

MrBMT opened this issue Apr 5, 2019 · 10 comments

Comments

@MrBMT
Copy link

MrBMT commented Apr 5, 2019

Note: This is with the current git master (4d1b3e2) being used.

I sent an email to myself today from another system, however the text displayed via Roundcube is incorrect - the first line is completely missing.

The message sent was:

Hello, This is a test.
Does it work this time?

However it displays via Roundcube:

Does it work this time?

As shown in this screenshot:

Screenshot 2019-04-05 at 11 08 13

Viewing the HTML being produced, it is completely ignoring the first line of the email:

<div id="messagebody"><div class="message-htmlpart" id="message-htmlpart1"><div class="rcmBody"><!-- html ignored --><!-- head ignored --><!-- meta ignored --><!-- html ignored --><br />Does it work this time?</div></div>

For your reference, an anonymised version of the raw message source is below:

Return-Path: <[email protected]>
Delivered-To: [email protected]
Received: by otheremailserver.uk (Postfix, from userid 5001)
	id 27773B9DB; Fri,  5 Apr 2019 10:50:39 +0100 (BST)
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	london.thewebhostco.uk
X-Spam-Level: 
X-Spam-Status: No, score=-2.0 required=5.0 tests=BAYES_00,DKIM_SIGNED,
	DKIM_VALID,DKIM_VALID_AU,HTML_MESSAGE autolearn=ham autolearn_force=no
	version=3.4.0
Received: from myemailserver.uk (myemailserver.uk [111.111.111.111])
	by otheremailserver.uk (Postfix) with ESMTPS id CA0961F66
	for <[email protected]>; Fri,  5 Apr 2019 10:50:38 +0100 (BST)
DKIM-Filter: OpenDKIM Filter v2.11.0 otheremailserver.uk CA0961F66
Authentication-Results: otheremailserver.uk;
	dkim=pass (2048-bit key) header.d=myemailserver.uk [email protected] header.b="removed"
Received: from localhost (localhost [127.0.0.1]) (Authenticated sender: [email protected])
	by myemailserver.uk (Postcow) with ESMTPA id 1A95B47003220;
	Fri,  5 Apr 2019 10:50:38 +0100 (BST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=myemailserver.uk;
	s=dkim; t=1554457838; h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:mime-version:mime-version:
	 content-type:content-type:content-transfer-encoding:in-reply-to:
	 references; bh=removed=;
	b=removed==
Content-Type: multipart/alternative; boundary="----=_=-_OpenGroupware_org_NGMime-141-1554457837.947829-1------"
To: "MrBMT" <[email protected]>, [email protected]
User-Agent: SOGoMail 4.0.7
MIME-Version: 1.0
Date: Fri, 05 Apr 2019 10:50:37 +0100
Subject: Hello
Message-ID: <8d-5ca72500-3-3216a400@200831193>
From: "MrBMT" <[email protected]>
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=myemailserver.uk;
	s=dkim; t=1554457838; h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:mime-version:mime-version:
	 content-type:content-type:content-transfer-encoding:in-reply-to:
	 references; bh=removed=;
	b=removed==
ARC-Seal: i=1; s=dkim; d=myemailserver.uk; t=1554457838; a=rsa-sha256;
	cv=none;
	b=removed==
ARC-Authentication-Results: i=1;
	myemailserver.uk;
	auth=pass [email protected] [email protected]

------=_=-_OpenGroupware_org_NGMime-141-1554457837.947829-1------
Content-Type: text/plain; charset=utf-8
Content-Length: 47


Hello, This is a test.
Does it work this time?

------=_=-_OpenGroupware_org_NGMime-141-1554457837.947829-1------
Content-Type: text/html; charset=utf-8
Content-Length: 64

<html>Hello, This is a test.<br />Does it work this time?</html>

------=_=-_OpenGroupware_org_NGMime-141-1554457837.947829-1--------
@MrBMT MrBMT changed the title Email not displaying properly Email not displaying/being parsed properly Apr 5, 2019
@dsoares
Copy link
Contributor

dsoares commented Apr 5, 2019

I have a similar problem, the first line of the message is missing. I'm also using the current git master (4d1b3e2). The message is something like:

...
h e a d e r s
...
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="--_com.android.email_525547547556700"
Message-Id: <[email protected]>

----_com.android.email_525547547556700
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: base64

TsOjbywgbsOjbywgZXJhIG1lc21vIHPDsyBwYXJhIGxoZSBkaXplciBxdWUgZm9pIHVtIGVycm8g
bWV1LiBPIHByb2JsZW1hIGRvIG92ZXJmaXR0aW5nIG1hbnTDqW0tc2UuPGRpdiBjbGFzcz0icXVv
...

----_com.android.email_525547547556700--

If we decode the base64, the message is something like:

Não, não, era mesmo só para lhe dizer que foi um erro meu. O problema do overfitting 
mantém-se.<div class="quote" style="line-height: 1.5"><br><br>
-------- Mensagem original --------
<br>Assunto: Re: Re.: Gráficos
<br>...

I found that Masterminds\HTML5 parser ignores everything before the first HTML tag it encounters (maybe because it states that it does not support plaintext?), so the message is shown like this, without the first line:

-------- Mensagem original --------
Assunto: Re: Re.: Gráficos
...

@alecpl
Copy link
Member

alecpl commented Apr 5, 2019

The @MrBMT's sample contains <html> tag, so maybe it requires <body>?

@dsoares
Copy link
Contributor

dsoares commented Apr 5, 2019

It's a Masterminds\HTML5 thing. For cases like:

$html = "<html>Hello, This is a test.<br />Does it work this time?</html>";
$html = "Hello, This is a test.<br />Does it work this time?";

it still thinks it is in insertMode <= static::IM_BEFORE_HEAD (see DOMTreeBuilder.php#551) so it removes the text with ParseError "Unexpected text. Ignoring....
For messages like the examples above, with a wrong HTML structure, it only moves the insertMode here:

// This is sort of a last-ditch attempt to correct for cases where no head/body
// elements are provided.
if ($this->insertMode <= static::IM_BEFORE_HEAD && 'head' !== $name && 'html' !== $name) {
    $this->insertMode = static::IM_IN_BODY;
}

so, insertMode only changes when it sees a tag not named head nor html (that's why it worked when i added any other HTML tag to the beginning of the message; in the examples above, it ignores everything before <br />).
I removed my PRs because neither was the right way to fix this.

@alecpl
Copy link
Member

alecpl commented Apr 6, 2019

I agree that we need a better fix. I created a ticket for Mastermind Masterminds/html5-php#166.

@alecpl alecpl modified the milestones: 1.4.0, 1.5-beta Apr 6, 2019
alecpl added a commit that referenced this issue Apr 14, 2019
@alecpl
Copy link
Member

alecpl commented Apr 14, 2019

Implemented workaround. Fixed.

@alecpl alecpl closed this as completed Apr 14, 2019
@MrBMT
Copy link
Author

MrBMT commented Apr 14, 2019

@alecpl I updated to the latest git master and it still wasn't being processed correctly. I ran what was being passed through to the fix_html5() function through htmlspecialchars() then var_dumped the result, which turned out to be:
string(202) "<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /></head><html>Hello, This is a test.<br />Does it work this time?</html>"

So for some reason, roundcube is adding <head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /></head> and because of this, your fix wasn't working for me.

In addition, the $html = substr_replace($html, '<body>', $pos, 0); was adding the body tag in slightly the wrong place (resulting in <html<body>> instead of <html><body>) as $pos there should have been $pos+1.

In order to fix things, I had to amend the function as follows:

    /**
     * Cleanup and workarounds on input to Masterminds/HTML5
     */
    protected function fix_html5($html)
    {
        $html = str_replace('<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /></head>', "", $html);

        // HTML5 requires <head> or <body> (#6713)
        // https://github.com/Masterminds/html5-php/issues/166
        if (!preg_match('/<(head|body)/i', $html)) {
            $pos = stripos($html, '<html');

            if ($pos === false) {
                $html = '<html><body>' . $html;
            }
            else {
                $pos  = strpos($html, '>', $pos);
                $html = substr_replace($html, '<body>', $pos+1, 0);
            }
        }

        return $html;
    }

I haven't had a chance this evening to look in to where the <head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /></head> was being added, so there may well be a better way of approaching this but just thought I should let you know about this for the moment.

Thanks for looking in to this!

@alecpl
Copy link
Member

alecpl commented Apr 15, 2019

I forgot we add that meta tag in another place. Should be fixed now.

@alecpl alecpl closed this as completed Apr 15, 2019
@dsoares
Copy link
Contributor

dsoares commented Apr 15, 2019

@alecpl unfortunately, this fix doesn't solve my issue. My problematic message skips the added fix_html5() method because it does contain an <head> or <body> in the middle of the message.
The message code (the $html parameter to the fix_html5() method) is:

Não, não, era mesmo só para lhe dizer que foi um erro meu. O problema do overfitting mantém-se.
<div class="quote" style="line-height: 1.5"><br>
<br>-------- Mensagem original --------
<br>Assunto: Re: Re.: Gráficos
<br>De: (...)
<br>Para: (...)<br><br>
<blockquote (...)>
<html>
<head><meta charset="UTF-8" /></head>
<body dir="auto">Mas acha que deveríamos repensar e (...)

For my case, the following PHPUnit test case must work:

        $html   = 'First line<br /><html><body>Second line';
        $washed = $washer->wash($html);

        $this->assertContains('First line', $washed);

While it's not fixed, i must patch the fix_html5() method like this:

diff --git a/program/lib/Roundcube/rcube_washtml.php b/program/lib/Roundcube/rcube_washtml.php
index 8e4edde..459d9cf 100644
--- a/program/lib/Roundcube/rcube_washtml.php
+++ b/program/lib/Roundcube/rcube_washtml.php
@@ -783,6 +783,14 @@ class rcube_washtml
      */
     protected function fix_html5($html)
     {
+        // work only with the first part of the message, before any HTML tag
+        if ($pos = strpos($html, '<')) {
+            $html_after = substr($html, $pos);
+            $html = substr($html, 0, $pos);
+        } else {
+            $html_after = '';
+        }
+
         // HTML5 requires <head> or <body> (#6713)
         // https://github.com/Masterminds/html5-php/issues/166
         if (!preg_match('/<(head|body)/i', $html)) {
@@ -797,6 +805,8 @@ class rcube_washtml
             }
         }
 
+        $html .= $html_after;
+
         return $html;
     }

@alecpl
Copy link
Member

alecpl commented Apr 15, 2019

Problem with this sample is that it is not really a valid HTML. DOMDocument-based code didn't indeed skip that text, so maybe we should support that case too. What a mess.

@alecpl
Copy link
Member

alecpl commented Apr 28, 2019

Fixed.

@alecpl alecpl closed this as completed Apr 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants