HtmlQuery

A concise API, honed over multiple years, for querying HTML. There are just 5 main functions: all/2, find/2 and find!/2 for finding things, plus attr/2 and text/1 for extracting information. There are also a handful of other useful functions, referenced below and described in detail in the module docs. HTML parsing is handled by Floki.

The input can be:

  • A string of HTML.
  • IO data containing HTML.
  • A Floki html_node or html_tree data structure. HtmlQuery uses Floki internally and can accept its data structure as input, and some HtmlQuery functions return its data structure as output.
  • Anything that implements the String.Chars protocol. See Implementing String.Chars below.

We created a related library called XmlQuery which has the same API but is used for querying XML. You can read more about them in Querying HTML and XML in Elixir with HtmlQuery and XmlQuery.

This library is MIT licensed and is part of a growing number of Elixir open source libraries published at github.com/synchronal.

This library is tested against the most recent 3 versions of Elixir and Erlang.

Installation

def deps do
  [
    {:html_query, "~> 1.4"}
  ]
end

Usage

Detailed docs are in the HtmlQuery module docs; a quick usage overview follows.

We typically alias HtmlQuery to Hq:

alias HtmlQuery, as: Hq

The rest of these examples use the following HTML:

html = """
  <h1>Please update your profile</h1>
  <form id="profile" test-role="profile">
    <label>Name <input name="name" type="text" value="Fido"> </label>
    <label>Age <input name="age" type="text" value="10"> </label>
    <label>Bio <textarea name="bio">Fido likes long walks and playing fetch.</textarea> </label>
  </form>
"""

Querying

Query functions use CSS selector strings for finding nodes. The MDN CSS Selectors guide is a helpful CSS reference.

Hq.find(html, "form#profile label textarea")

We’ve found that reserving CSS classes and IDs for styling, rather than also using them in tests, reduces the chance that a styling change breaks a test. We therefore often add attributes that start with test- to our HTML; a query for test-role looks like:

Hq.find(html, "form[test-role=profile]")

For simple queries, HtmlQuery.Css provides a shorthand using keyword lists. For complicated queries, it’s usually clearer to use a CSS string.

Hq.find(html, test_role: "profile")

Finding

all/2 finds all elements matching the query, find/2 returns the first element that matches the selector or nil if none was found, and find!/2 is like find/2 but raises unless exactly one element is found.

Hq.all(html, "input") # returns a list of all the <input> elements
Hq.find(html, "input[name=age]") # returns the <input> with `name=age`
Hq.find!(html, "input[name=foo]") # raises because no such element exists

See the module docs for more details.

Extracting

text/1 is the simplest extraction function:

html |> Hq.find(:h1) |> Hq.text() # returns "Please update your profile"

attr/2 returns the value of an attribute:

html |> Hq.find("input[name=age]") |> Hq.attr(:value) # returns "10"

To extract data from multiple HTML nodes, we found that it is clearer to compose multiple functions rather than to have a more complicated API:

html |> Hq.all("label") |> Enum.map(&Hq.text/1) # returns the text of each <label>
html |> Hq.all("input[type=text]") |> Enum.map(&Hq.attr(&1, "value")) # returns ["Fido", "10"]

There are also functions for extracting form fields as a map, meta tags as a list, and table contents as a list of lists or a list of maps. See the module docs for more details.
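As a rough sketch of the form-field extraction mentioned above, the script below selects a form and passes it to form_fields/1. The trimmed-down HTML and the variable names are illustrative assumptions, and the exact shape of the returned map (key types, nesting) is best confirmed in the module docs, so the result is only printed here.

```elixir
# Version requirement copied from the Installation section of this README.
Mix.install([{:html_query, "~> 1.4"}])

alias HtmlQuery, as: Hq

# A trimmed-down copy of the example form (illustrative only).
html = """
<form id="profile">
  <label>Name <input name="name" type="text" value="Fido"> </label>
  <label>Age <input name="age" type="text" value="10"> </label>
</form>
"""

# Select the form, then collect its fields into a map.
fields = html |> Hq.find("form#profile") |> Hq.form_fields()

# Print the map rather than assuming its exact shape.
IO.inspect(fields, label: "profile fields")
```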

Parsing

parse/1 and parse_doc/1 delegate to Floki’s parse_fragment!/1 and parse_document!/1 functions. These functions are rarely needed since all the HtmlQuery functions will parse HTML if needed. See the module docs for more details.
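If parsing once up front is ever useful (for example, to run several queries against the same tree), a minimal sketch might look like the following. The variable names are assumptions made for illustration.

```elixir
# Version requirement copied from the Installation section of this README.
Mix.install([{:html_query, "~> 1.4"}])

alias HtmlQuery, as: Hq

# Parse the HTML fragment once; the result is a Floki data structure.
tree = Hq.parse("<h1>Please update your profile</h1>")

# The parsed tree can be passed to the query functions like any other input.
heading = tree |> Hq.find("h1") |> Hq.text()

IO.puts(heading)
```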

Utilities

inspect_html/2 prints prettified HTML with a label, normalize/1 parses and re-stringifies HTML (handy when comparing two strings of HTML), pretty/1 formats HTML in a human-friendly format, and reject/2 removes nodes that match the given selector. See the module docs for more details.
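A small sketch of the comparison use case for normalize/1: two strings that differ only in attribute quoting are not equal as strings, but should compare equal once both have been parsed and re-stringified. The sample strings are made up for illustration.

```elixir
# Version requirement copied from the Installation section of this README.
Mix.install([{:html_query, "~> 1.4"}])

alias HtmlQuery, as: Hq

# The same markup written with different attribute quoting.
a = ~s|<p id='color'>green</p>|
b = ~s|<p id="color">green</p>|

# The raw strings differ...
raw_equal? = a == b

# ...but normalizing both sides removes the incidental difference.
normalized_equal? = Hq.normalize(a) == Hq.normalize(b)

IO.inspect({raw_equal?, normalized_equal?})
```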

Implementing String.Chars

HtmlQuery functions that accept HTML will convert any module that implements String.Chars. For example, our Pages testing library implements String.Chars for controller output like this:

defimpl String.Chars, for: Pages.Driver.Conn do
  def to_string(%Pages.Driver.Conn{conn: %Plug.Conn{status: 200} = conn}),
    do: Phoenix.ConnTest.html_response(conn, 200)
end

and implements String.Chars for LiveView output like this:

defimpl String.Chars, for: Pages.Driver.LiveView do
  def to_string(%Pages.Driver.LiveView{rendered: rendered}) when not is_nil(rendered),
    do: rendered

  def to_string(%Pages.Driver.LiveView{live: live}) when not is_nil(live),
    do: live |> Phoenix.LiveViewTest.render()
end
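The same idea can be tried outside of Pages with a self-contained script. Example.Page below is a hypothetical struct invented for this sketch; it is not part of HtmlQuery or Pages.

```elixir
# Version requirement copied from the Installation section of this README.
Mix.install([{:html_query, "~> 1.4"}])

defmodule Example.Page do
  # Hypothetical struct that wraps a string of rendered HTML.
  defstruct [:body]
end

defimpl String.Chars, for: Example.Page do
  # Converting the struct to a string yields its HTML body.
  def to_string(%Example.Page{body: body}), do: body
end

alias HtmlQuery, as: Hq

page = %Example.Page{body: "<h1>Please update your profile</h1>"}

# The struct is passed directly; it is converted via String.Chars before parsing.
heading = page |> Hq.find("h1") |> Hq.text()

IO.puts(heading)
```

Keeping the conversion in a protocol implementation means test code can pass its own page or connection structs straight into the query functions.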

Development

brew bundle

bin/dev/doctor
bin/dev/update
bin/dev/audit
bin/dev/shipit