A concise API, honed over multiple years, for querying HTML. There are just 5 main functions: all/2, find/2 and find!/2 for finding things, plus attr/2 and text/1 for extracting information. There are also a handful of other useful functions, referenced below and described in detail in the module docs. HTML parsing is handled by Floki.
The input can be:
- A string of HTML.
- IO data containing HTML.
- A Floki html_node or html_tree data structure. HtmlQuery uses Floki internally and can accept its data structure as input, and some HtmlQuery functions return its data structure as output.
- Anything that implements the String.Chars protocol. See Implementing String.Chars below.
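As a rough illustration of the input kinds listed above, the sketch below passes the same markup first as a string and then as an already-parsed tree. It is written as a standalone script, so it uses Mix.install without pinning a version (an assumption made for convenience; inside a Mix project the dependency entry from the installation section applies instead), and the sample markup is made up for the example.

```elixir
# Script-style setup; in a Mix project, declare the dependency in mix.exs instead.
Mix.install([:html_query])

alias HtmlQuery, as: Hq

# Sample markup invented for this example.
markup = ~s(<p id="greeting">Hello</p>)

# 1. A plain string of HTML.
from_string = Hq.find(markup, "#greeting")

# 2. A tree that was parsed ahead of time.
from_tree = markup |> Hq.parse() |> Hq.find("#greeting")

# Both calls are expected to locate the same <p> element.
IO.inspect(from_string == from_tree)
```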
We created a related library called XmlQuery which has the same API but is used for querying XML. You can read more about them in Querying HTML and XML in Elixir with HtmlQuery and XmlQuery.
This library is MIT licensed and is part of a growing number of Elixir open source libraries published at github.com/synchronal.
This library is tested against the most recent 3 versions of Elixir and Erlang.
def deps do
[
{:html_query, "~> 1.4"}
]
end
Detailed docs are in the HtmlQuery module docs; a quick usage overview follows.
We typically alias HtmlQuery to Hq:
alias HtmlQuery, as: Hq
The rest of these examples use the following HTML:
html = """
<h1>Please update your profile</h1>
<form id="profile" test-role="profile">
<label>Name <input name="name" type="text" value="Fido"> </label>
<label>Age <input name="age" type="text" value="10"> </label>
<label>Bio <textarea name="bio">Fido likes long walks and playing fetch.</textarea> </label>
</form>
"""
Query functions use CSS selector strings for finding nodes. The MDN CSS Selectors guide is a helpful CSS reference.
Hq.find(html, "form#profile label textarea")
We’ve found that reserving CSS classes and IDs for styling instead of using them for testing reduces the chance of styling changes breaking tests, and so we often add attributes that start with test- into our HTML; a query for test-role would look like:
Hq.find(html, "form[test-role=profile]")
For simple queries, HtmlQuery.Css provides a shorthand using keyword lists. For complicated queries, it’s usually clearer to use a CSS string.
Hq.find(html, test_role: "profile")
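To make the relationship between the two selector styles concrete, here is a small sketch that runs the CSS string form and the keyword list form against the same markup and compares the results. It assumes, as the example above suggests, that underscores in keyword keys are turned into dashes; the markup and the unpinned Mix.install call are conveniences for a standalone script.

```elixir
# Script-style setup; in a Mix project, declare the dependency in mix.exs instead.
Mix.install([:html_query])

alias HtmlQuery, as: Hq

# Sample markup invented for this example.
html = ~s(<form id="profile" test-role="profile"><input name="age" value="10"></form>)

# The same query written as a CSS string and as a keyword list.
by_string = Hq.find(html, "form[test-role=profile]")
by_keyword = Hq.find(html, test_role: "profile")

# Both are expected to return the same <form> node.
IO.inspect(by_string == by_keyword)
```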
all/2 finds all elements matching the query, find/2 returns the first element that matches the selector or nil if none was found, and find!/2 is like find/2 but raises unless exactly one element is found.
Hq.all(html, "input") # returns a list of all the <input> elements
Hq.find(html, "input[name=age]") # returns the <input> with `name=age`
Hq.find!(html, "input[name=foo]") # raises because no such element exists
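The three finders differ mainly in how they report "nothing" or "too many", so calling code usually handles each one differently. The following is a minimal sketch of that handling; the markup is made up, the exception is rescued generically rather than by name, and the unpinned Mix.install call is only there so the snippet can run as a standalone script.

```elixir
# Script-style setup; in a Mix project, declare the dependency in mix.exs instead.
Mix.install([:html_query])

alias HtmlQuery, as: Hq

# Sample markup invented for this example.
html = ~s(<form><input name="name" value="Fido"><input name="age" value="10"></form>)

# all/2 returns a list, so it can be counted or enumerated directly.
input_count = html |> Hq.all("input") |> length()

# find/2 returns nil when nothing matches, so match on the result.
missing =
  case Hq.find(html, "input[name=foo]") do
    nil -> :not_found
    _node -> :found
  end

# find!/2 raises unless exactly one element matches; two inputs match here.
ambiguous =
  try do
    Hq.find!(html, "input")
    :found
  rescue
    _error -> :raised
  end

IO.inspect({input_count, missing, ambiguous})
```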
See the module docs for more details.
text/1 is the simplest extraction function:
html |> Hq.find(:h1) |> Hq.text() # returns "Please update your profile"
attr/2 returns the value of an attribute:
html |> Hq.find("input[name=age]") |> Hq.attr(:value) # returns "10"
To extract data from multiple HTML nodes, we found that it is clearer to compose multiple functions rather than to have a more complicated API:
html |> Hq.all("label") |> Enum.map(&Hq.text/1) # returns a list with the text of each <label>
html |> Hq.all("input[type=text]") |> Enum.map(&Hq.attr(&1, "value")) # returns ["Fido", "10"]
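The same composition approach extends to building other shapes of data. As a sketch, the snippet below pairs each text input's name with its value using Map.new/2 from the standard library; the markup is a trimmed copy of the example form, and the unpinned Mix.install call is only there so the snippet can run as a standalone script.

```elixir
# Script-style setup; in a Mix project, declare the dependency in mix.exs instead.
Mix.install([:html_query])

alias HtmlQuery, as: Hq

# A trimmed copy of the example form.
html = """
<form id="profile" test-role="profile">
<label>Name <input name="name" type="text" value="Fido"> </label>
<label>Age <input name="age" type="text" value="10"> </label>
</form>
"""

# Compose all/2 and attr/2 with Map.new/2 to build a name => value map.
fields =
  html
  |> Hq.all("input[type=text]")
  |> Map.new(fn input -> {Hq.attr(input, "name"), Hq.attr(input, "value")} end)

IO.inspect(fields)
```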
There are also functions for extracting form fields as a map, meta tags as a list, and table contents as a list of lists or a list of maps. See the module docs for more details.
parse/1 and parse_doc/1 delegate to Floki’s parse_fragment!/1 and parse_document!/1 functions. These functions are rarely needed since all the HtmlQuery functions will parse HTML if needed. See the module docs for more details.
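One situation where explicit parsing can still be useful is running several queries against the same document: parsing once and reusing the tree avoids re-parsing the string on every call. The sketch below shows that pattern under the assumption that the parsed tree is accepted anywhere HTML is; the markup is made up, and the unpinned Mix.install call is only there so the snippet can run as a standalone script.

```elixir
# Script-style setup; in a Mix project, declare the dependency in mix.exs instead.
Mix.install([:html_query])

alias HtmlQuery, as: Hq

# Sample markup invented for this example.
html = ~s(<form><input name="name" value="Fido"><input name="age" value="10"></form>)

# Parse once, then reuse the tree for every query.
tree = Hq.parse(html)

name = tree |> Hq.find!("input[name=name]") |> Hq.attr("value")
age = tree |> Hq.find!("input[name=age]") |> Hq.attr("value")

IO.inspect({name, age})
```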
inspect_html/2 prints prettified HTML with a label, normalize/1 parses and re-stringifies HTML, which can be handy when trying to compare two strings of HTML, pretty/1 formats HTML in a human-friendly format, and reject/2 removes nodes that match the given selector. See the module docs for more details.
HtmlQuery functions that accept HTML will convert any value whose type implements String.Chars. For example, our Pages testing library implements String.Chars for controller output like this:
defimpl String.Chars, for: Pages.Driver.Conn do
def to_string(%Pages.Driver.Conn{conn: %Plug.Conn{status: 200} = conn}),
do: Phoenix.ConnTest.html_response(conn, 200)
end
and implements String.Chars for LiveView output like this:
defimpl String.Chars, for: Pages.Driver.LiveView do
def to_string(%Pages.Driver.LiveView{rendered: rendered}) when not is_nil(rendered),
do: rendered
def to_string(%Pages.Driver.LiveView{live: live}) when not is_nil(live),
do: live |> Phoenix.LiveViewTest.render()
end
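For a struct of your own, the same idea can be sketched in a few lines. MyApp.Page and its body field are hypothetical names used only for illustration; the snippet relies on nothing beyond the Elixir standard library and simply shows that to_string/1 yields the HTML that a query function would then receive.

```elixir
defmodule MyApp.Page do
  # Hypothetical struct that carries rendered HTML; not part of any library.
  defstruct [:body]
end

defimpl String.Chars, for: MyApp.Page do
  # Return the HTML string held by the struct.
  def to_string(%MyApp.Page{body: body}), do: body
end

page = %MyApp.Page{body: "<h1>Welcome</h1>"}

# Kernel.to_string/1 dispatches through the String.Chars protocol.
rendered = to_string(page)
IO.puts(rendered)
```

With an implementation like this in place, a call such as Hq.find(page, "h1") would be expected to query the string returned by to_string/1, following the conversion described above.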
brew bundle
bin/dev/doctor
bin/dev/update
bin/dev/audit
bin/dev/shipit