This repository has been archived by the owner on Mar 29, 2022. It is now read-only.

Update and simplify HAR files processing
- Drop everything about Insomnia, whose HAR export is
  useless for HTTPolice: see comments in
  Kong/insomnia#416
  as well as Kong/insomnia#840

- Drop the big test files that used to contain entire palettes
  of exchanges manually exported from a range of browsers.
  These are all obsolete by now and I can't be bothered to run them all
  again. Instead, have much smaller and more easily reproducible files
  illustrating specific behaviors.

- Trying to guess the real HTTP version from a browser's HAR export
  has proved to be a fool's errand. Just consider it unknown.
  Fix notice 1029 in this situation.

- Drop workarounds for problems that I can't reproduce with the current
  browser versions.

- Add new workarounds, too.
vfaronov committed Jun 27, 2019
1 parent ee31471 commit d90b7bd
Showing 32 changed files with 2,010 additions and 11,567 deletions.
14 changes: 13 additions & 1 deletion CHANGELOG.rst
@@ -4,13 +4,25 @@ History of changes

Unreleased
~~~~~~~~~~

Added
-----
- Basic checks for most of the headers defined by `WHATWG Fetch`_,
such as ``Access-Control-Allow-Origin``.
- Updated workarounds for HAR files exported from Chrome and Firefox.
More checks are now skipped on such files, which means
fewer false positives due to missing or mangled data.
- Notice `1282`_ is now reported on ``application/text``.

Fixed
-----
- Notice `1276`_ is now a comment, not an error.
- Notice `1277`_ is no longer reported on ``X-Real-IP``.
- Notice `1282`_ is now reported on ``application/text``.
- Notice `1029`_ (``TE`` requires ``Connection: TE``)
is now only reported on HTTP/1.1 requests.

.. _WHATWG Fetch: https://fetch.spec.whatwg.org/
.. _1029: https://httpolice.readthedocs.io/page/notices.html#1029


0.8.0 - 2019-03-03
6 changes: 3 additions & 3 deletions doc/har.rst
@@ -15,10 +15,10 @@ HTTPolice can analyze HAR files with the ``-i har`` option::

However, please note that HAR support in exporters is **erratic**.
HTTPolice tries to do a reasonable job on files exported from
major Web browsers and some other HTTP tools,
but some information is simply lost.
major Web browsers and some other HTTP tools, but some information is lost
and some checks are skipped to avoid false positives.

If HTTPolice fails on your HAR files,
If HTTPolice gives unexpected results on your HAR files,
feel free to `submit an issue`__ (don’t forget to attach the files),
and I’ll see what can be done about it.

8 changes: 2 additions & 6 deletions doc/quickstart.rst
@@ -40,17 +40,13 @@ Then feed this HAR file to HTTPolice::
------------ response: 200 OK
E 1000 Syntax error in Server header
E 1013 Multiple Date headers are forbidden
------------ request: GET /assets/searchstyle.css
E 1029 TE header requires "Connection: TE"
------------ request: GET /search/jquery-1.9.1.min.js
E 1029 TE header requires "Connection: TE"
------------ request: GET /search/oktavia-jquery-ui.js
E 1029 TE header requires "Connection: TE"
------------ request: GET /search/oktavia-english-search.js
------------ response: 200 OK
E 1000 Syntax error in Server header
E 1013 Multiple Date headers are forbidden
C 1277 Obsolete 'X-' prefix in headers
------------ request: GET /assets/8mbps100msec-nginx195-h2o150.png
C 1276 Accept: */* is as good as image/webp
[...and so on...]


164 changes: 35 additions & 129 deletions httpolice/inputs/har.py
@@ -3,23 +3,24 @@
import base64
import io
import json
import re
from urllib.parse import urlparse

from httpolice import framing1
from httpolice.exchange import Exchange
from httpolice.helpers import pop_pseudo_headers
from httpolice.inputs.common import InputError
from httpolice.known import h, m, media, st
from httpolice.parse import ParseError
from httpolice.known import h, m, st
from httpolice.request import Request
from httpolice.response import Response
from httpolice.stream import Stream
from httpolice.structure import (FieldName, StatusCode, Unavailable, http2,
http11)
from httpolice.structure import FieldName, StatusCode, Unavailable
from httpolice.util.text import decode_path


FIDDLER = [u'Fiddler']
CHROME = [u'WebInspector']
FIREFOX = [u'Firefox']
EDGE = [u'F12 Developer Tools']


def har_input(paths):
for path in paths:
# According to the spec, HAR files are UTF-8 with an optional BOM.
@@ -30,7 +31,7 @@ def har_input(paths):
except ValueError as exc:
raise InputError('%s: bad HAR file: %s' % (path, exc)) from exc
try:
creator = CreatorInfo(data['log']['creator'])
creator = data['log']['creator']['name']
for entry in data['log']['entries']:
yield _process_entry(entry, creator, path)
except (TypeError, KeyError) as exc:
@@ -45,43 +46,14 @@ def _process_entry(data, creator, path):


def _process_request(data, creator, path):
(version, header_entries, pseudo_headers) = _process_message(data, creator)
if creator.is_chrome and version == http11 and u':host' in pseudo_headers:
# SPDY exported from Chrome.
version = None

# Firefox exports "Connection: keep-alive" on HTTP/2 requests
# (which triggers notice 1244)
# even though it does not actually send it
# (this can be verified with SSLKEYLOGFILE + Wireshark).
if creator.is_firefox and version == http2:
header_entries = [
(name, value)
for (name, value) in header_entries
if (name, value) != (h.connection, u'keep-alive')
]

version, header_entries = _process_message(data, creator)
method = data['method']
header_names = {name for (name, _) in header_entries}

parsed = urlparse(data['url'])
scheme = parsed.scheme

if creator.is_insomnia:
# https://github.com/getinsomnia/insomnia/issues/840
if h.host not in header_names:
header_entries.insert(0, (h.host, parsed.netloc))
if h.user_agent not in header_names:
# The actual version can probably be extracted from
ua_string = u'insomnia/%s' % creator.reconstruct_insomnia_version()
header_entries.append((h.user_agent, ua_string))
if h.accept not in header_names:
header_entries.append((h.accept, u'*/*'))
header_names = {name for (name, _) in header_entries}

if method == m.CONNECT:
target = parsed.netloc
elif h.host in header_names:
elif any(name == h.host for (name, _) in header_entries):
# With HAR, we can't tell if the request was to a proxy or to a server.
# So we force most requests into the "origin form" of the target,
target = parsed.path
@@ -111,22 +83,7 @@ def _process_request(data, creator, path):
post = data.get('postData')
if post and post.get('text'):
text = post['text']

if creator.is_firefox and \
post['mimeType'] == media.application_x_www_form_urlencoded \
and u'\r\n' in text:
# Yes, Firefox actually outputs this stuff. Go figure.
(wtf, actual_text) = text.rsplit(u'\r\n', 1)
try:
buf = io.BufferedReader(io.BytesIO(wtf.encode('iso-8859-1')))
more_entries = framing1.parse_header_fields(Stream(buf))
except (UnicodeError, ParseError): # pragma: no cover
pass
else:
header_entries.extend(more_entries)
text = actual_text

if creator.is_fiddler and method == m.CONNECT and u'Fiddler' in text:
if creator in FIDDLER and method == m.CONNECT and u'Fiddler' in text:
# Fiddler's HAR export adds a body with debug information
# to CONNECT requests.
text = None
@@ -143,49 +100,32 @@ def _process_request(data, creator, path):
def _process_response(data, req, creator, path):
if data['status'] == 0: # Indicates error in Chrome.
return None
(version, header_entries, _) = _process_message(data, creator)
version, header_entries = _process_message(data, creator)
status = StatusCode(data['status'])
reason = data['statusText']

if creator.is_firefox:
# Firefox joins all ``Set-Cookie`` response fields with newlines.
# (It also joins other fields with commas,
# but that is permitted by RFC 7230 Section 3.2.2.)
header_entries = [
(name, value)
for (name, joined_value) in header_entries
for value in (joined_value.split(u'\n') if name == h.set_cookie
else [joined_value])
]

if creator.is_fiddler and req.method == m.CONNECT and status.successful:
if creator in FIDDLER and req.method == m.CONNECT and status.successful:
# Fiddler's HAR export adds extra debug headers to CONNECT responses
# after the tunnel is closed.
header_entries = [(name, value)
for (name, value) in header_entries
if name not in [u'EndTime', u'ClientToServerBytes',
u'ServerToClientBytes']]

# The logic for body is similar to that for requests (see above),
# except that
# (1) Firefox also includes a body with 304 responses;
# (2) browsers may set ``bodySize = -1`` even when ``content.size >= 0``.
# The logic for body is mostly like that for requests (see above).
if data['bodySize'] == 0 or data['content']['size'] == 0 or \
status == st.not_modified:
status == st.not_modified: # Firefox also includes body on 304
body = b''
elif creator in FIREFOX:
# Firefox seems to export bogus bodySize:
# see test/har_data/firefox_gif.har
body = None
# Browsers may set ``bodySize = -1`` even when ``content.size >= 0``.
elif data['bodySize'] > 0 or data['content']['size'] > 0:
body = Unavailable()
else:
body = None

if version == http11 and creator.is_firefox and \
any(name == u'x-firefox-spdy' for (name, _) in header_entries):
# Helps with SPDY in Firefox.
version = None
if creator.is_chrome and version != req.version:
# Helps with SPDY in Chrome.
version = None

resp = Response(version, status, reason, header_entries, body=body,
remark=u'from %s' % path)
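
The body-selection rules in the hunk above can be restated as a small standalone function. This is only a sketch for illustration: `classify_response_body` and the `UNAVAILABLE` sentinel are hypothetical names that stand in for the inline logic and for HTTPolice's `Unavailable()` object, and the status is passed as a plain integer rather than a `StatusCode`.

```python
# Hypothetical stand-in for httpolice.structure.Unavailable():
# "there is a body, but its content is not available".
UNAVAILABLE = object()

FIREFOX = ['Firefox']


def classify_response_body(body_size, content_size, status, creator):
    """Mirror the if/elif chain from _process_response (sketch only)."""
    if body_size == 0 or content_size == 0 or status == 304:
        # Firefox also includes a body on 304, so 304 is forced to empty.
        return b''
    if creator in FIREFOX:
        # Firefox's bodySize is not trusted, so the body stays unknown.
        return None
    if body_size > 0 or content_size > 0:
        # Browsers may set bodySize = -1 even when content.size >= 0.
        return UNAVAILABLE
    return None


if __name__ == '__main__':
    print(classify_response_body(0, 0, 200, 'demo'))
    print(classify_response_body(-1, 14, 200, 'WebInspector') is UNAVAILABLE)
    print(classify_response_body(14, 14, 200, 'Firefox'))
```

The order of the branches matters: the empty-body check runs before the Firefox check, so a Firefox export with `bodySize` of 0 still yields an empty body rather than an unknown one.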

@@ -194,12 +134,9 @@ def _process_response(data, req, creator, path):
try:
decoded_body = base64.b64decode(data['content']['text'])
except ValueError:
# Firefox sometimes marks normal, unencoded text as "base64"
# (see ``test/har_data/firefox_gif.har``).
# But let's not try to guess.
pass
else:
if creator.is_fiddler and req.method == m.CONNECT and \
if creator in FIDDLER and req.method == m.CONNECT and \
status.successful and b'Fiddler' in decoded_body:
# Fiddler's HAR export adds a body with debug information
# to CONNECT responses.
@@ -216,47 +153,16 @@ def _process_response(data, req, creator, path):
def _process_message(data, creator):
header_entries = [(FieldName(d['name']), d['value'])
for d in data['headers']]
pseudo_headers = pop_pseudo_headers(header_entries)
if creator.is_edge: # Edge exports HTTP/2 messages as HTTP/1.1.
version = None
elif creator.is_insomnia: # Insomnia's HAR export hardcodes HTTP/1.1.
version = None
elif data['httpVersion'] == u'unknown': # Used by Chrome.
version = None
else:
version = data['httpVersion'].upper()
if version == u'HTTP/2.0': # Used by Firefox, Chrome, ...
version = http2
return (version, header_entries, pseudo_headers)


class CreatorInfo(dict):

__slots__ = []

@property
def is_chrome(self):
return self['name'] == u'WebInspector'

@property
def is_firefox(self):
# Not sure if "Iceweasel" is actually used, but it won't hurt.
return self['name'] in [u'Firefox', u'Iceweasel']

@property
def is_edge(self):
return self['name'] == u'F12 Developer Tools'

@property
def is_fiddler(self):
return self['name'] == u'Fiddler'

@property
def is_insomnia(self):
return self['name'] == u'Insomnia REST Client'

def reconstruct_insomnia_version(self):
match = re.search(r'v([0-9]+\.[0-9]+\.[0-9]+)', self['version'])
if not match: # pragma: no cover
return u'x.x.x'
return match.groups(1)
pop_pseudo_headers(header_entries)

# Web browsers' HAR export poorly reflects the actual traffic on the wire.
# Their httpVersion can't be trusted, and they often mangle lower-level
# parts of the protocol, e.g. at the time of writing Chrome sometimes omits
# the Host header from HTTP/1.1 requests. Just consider their HTTP version
# to be always unknown, and a lot of this pain goes away.
version = None
if data['httpVersion'].startswith(u'HTTP/') and \
creator not in CHROME + FIREFOX + EDGE:
version = data['httpVersion']

return version, header_entries
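
To make the new version rule easier to see in isolation, here is a minimal sketch of it as a pure function. The creator lists are copied from the diff above; the function name `effective_http_version` is hypothetical and does not exist in HTTPolice, where the same logic lives inline in `_process_message`.

```python
CHROME = ['WebInspector']
FIREFOX = ['Firefox']
EDGE = ['F12 Developer Tools']


def effective_http_version(http_version, creator):
    """Return the exported httpVersion, or None when it can't be trusted."""
    if creator in CHROME + FIREFOX + EDGE:
        # Browser exports: always treat the version as unknown.
        return None
    if not http_version.startswith('HTTP/'):
        # Values such as "unknown" carry no usable version.
        return None
    return http_version


if __name__ == '__main__':
    for creator in ['Fiddler', 'Firefox', 'WebInspector']:
        print(creator, effective_http_version('HTTP/1.1', creator))
    print('demo', effective_http_version('unknown', 'demo'))
```

With this rule, only non-browser tools (Fiddler, in the lists above) keep the version they report; every check that depends on a known HTTP version is skipped for browser exports.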
10 changes: 4 additions & 6 deletions httpolice/request.py
@@ -236,12 +236,10 @@ def check_request(req):
if tc.chunked in headers.te:
complain(1028)

if version == http2:
if headers.te and headers.te != [u'trailers']:
complain(1244, header=headers.te)
else:
if headers.te and u'TE' not in headers.connection:
complain(1029)
if version == http2 and headers.te and headers.te != [u'trailers']:
complain(1244, header=headers.te)
if version == http11 and headers.te and u'TE' not in headers.connection:
complain(1029)

if version == http11 and headers.host.is_absent:
complain(1031)
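
The effect of replacing the `if`/`else` with two independent version checks can be sketched outside HTTPolice. The function `te_notices` and the plain-string version constants below are hypothetical, and header values are modeled as simple lists of strings; the case-insensitive match on `Connection` is an assumption of this sketch, not something shown in the diff.

```python
HTTP11 = 'HTTP/1.1'
HTTP2 = 'HTTP/2'


def te_notices(version, te, connection):
    """Return the notice numbers the two TE checks would report (sketch)."""
    notices = []
    if version == HTTP2 and te and te != ['trailers']:
        notices.append(1244)
    # Assumption: connection options are compared case-insensitively.
    if version == HTTP11 and te and \
            'te' not in [option.lower() for option in connection]:
        notices.append(1029)
    return notices


if __name__ == '__main__':
    # Unknown version (e.g. a browser HAR export): nothing is reported.
    print(te_notices(None, ['gzip'], []))
    print(te_notices(HTTP11, ['gzip'], []))
    print(te_notices(HTTP11, ['gzip'], ['TE']))
```

Before this change, the `else` branch ran for any version other than HTTP/2, including an unknown one, which is how notice 1029 ended up on browser exports once their version was set to `None`.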
37 changes: 37 additions & 0 deletions test/har_data/bad_base64.har
@@ -0,0 +1,37 @@
{
"_warning": "This is not a real HAR file! It only contains the keys that are interesting to HTTPolice. Please do not use this as an example of a valid HAR file.",
"_expected": [],
"log": {
"creator": {"name": "demo"},
"entries": [
{
"request": {
"method": "GET",
"url": "http://example.com/",
"httpVersion": "HTTP/1.1",
"headers": [
{"name": "Host", "value": "example.com"},
{"name": "User-Agent", "value": "demo"}
],
"bodySize": 0
},
"response": {
"httpVersion": "HTTP/1.1",
"status": 200,
"statusText": "OK",
"headers": [
{"name": "Date", "value": "Thu, 31 Dec 2015 18:26:56 GMT"},
{"name": "Content-Type", "value": "text/plain"},
{"name": "Content-Length", "value": "14"}
],
"bodySize": 14,
"content": {
"size": 14,
"encoding": "base64",
"text": "Hello world!\r\n"
}
}
}
]
}
}
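
The test file above marks plain text as `"encoding": "base64"`. A minimal sketch of why the `except ValueError` branch in `_process_response` covers this case, using only the standard library; the helper name `try_b64decode` is hypothetical:

```python
import base64
import binascii


def try_b64decode(text):
    """Decode base64 text, returning None if it is not valid base64."""
    try:
        return base64.b64decode(text)
    except ValueError:
        # binascii.Error, raised on bad input, is a subclass of ValueError.
        return None


if __name__ == '__main__':
    assert issubclass(binascii.Error, ValueError)
    # The "text" value from bad_base64.har is not valid base64.
    print(try_b64decode("Hello world!\r\n"))
    # A properly encoded copy of the same bytes round-trips.
    encoded = base64.b64encode(b"Hello world!\r\n").decode("ascii")
    print(try_b64decode(encoded))
```

In its default non-validating mode, `base64.b64decode` discards characters outside the base64 alphabet, so the failure here comes from the remaining characters not forming a correctly padded string.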