Merge pull request #14 from facelessuser/whitespace
Whitespace fixes and remove document flags
facelessuser authored Dec 14, 2018
2 parents 9096f2d + a724995 commit ca16ffb
Showing 18 changed files with 344 additions and 282 deletions.
1 change: 1 addition & 0 deletions docs/src/dictionary/en-custom.txt
@@ -2,6 +2,7 @@ API
Accessors
Aspell
BeautifulSoup
CDATA
CSS
CSS's
Changelog
7 changes: 7 additions & 0 deletions docs/src/markdown/about/changelog.md
@@ -1,5 +1,12 @@
# Changelog

## 1.0.0b2

- **NEW**: Drop document flags. Document type can be detected from the Beautiful Soup object directly.
- **FIX**: CSS selectors should be evaluated with CSS whitespace rules.
- **FIX**: Processing instructions, CDATA, and declarations should all be ignored in `:contains` and child considerations for `:empty`.
- **FIX**: In Beautiful Soup, the document itself is the first tag. Do not match the "document" tag by returning false for any tag that doesn't have a parent.

## 1.0.0b1

- **NEW**: Add support for non-standard `:contains()` selector.
3 changes: 2 additions & 1 deletion docs/src/markdown/about/development.md
@@ -220,7 +220,7 @@ class SelectorTag:
class SelectorAttribute:
"""Selector attribute rule."""

def __init__(self, attribute, prefix, pattern):
def __init__(self, attribute, prefix, pattern, xml_type_pattern):
"""Initialize."""
```

@@ -229,6 +229,7 @@ class SelectorAttribute:
`attribute` | Contains the attribute name to match.
`prefix` | Contains the attribute namespace prefix to match if any.
`pattern` | Contains a `re` regular expression object that matches the desired attribute value.
`xml_type_pattern` | The default `type` pattern is case insensitive. When the attribute name is `type` and case sensitivity has not been explicitly defined, a secondary, case sensitive `type` pattern is compiled for use with XML documents when one is detected.
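
As a rough illustration of the `xml_type_pattern` row, the sketch below compiles both patterns for a `[type=...]` selector and picks one based on the document type. The helper names `compile_type_patterns` and `pick_pattern`, and the `is_xml` argument, are hypothetical and are not part of Soup Sieve's API. The real selection logic lives in `css_match.py` and may differ in detail.

```python
import re


def compile_type_patterns(value):
    """Hypothetical helper: build the two patterns for `[type=value]`."""
    escaped = re.escape(value)
    # Default pattern: case insensitive, as HTML treats `type` values.
    default_pattern = re.compile(r'^%s$' % escaped, re.I)
    # Secondary pattern: case sensitive, intended for XML documents.
    xml_type_pattern = re.compile(r'^%s$' % escaped)
    return default_pattern, xml_type_pattern


def pick_pattern(default_pattern, xml_type_pattern, is_xml):
    """Hypothetical helper: use the XML pattern only for XML documents."""
    if is_xml and xml_type_pattern is not None:
        return xml_type_pattern
    return default_pattern
```

With these assumptions, `[type="submit"]` would match `type="SUBMIT"` in an HTML document but not in an XML one.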

### `SelectorNth`

49 changes: 13 additions & 36 deletions docs/src/markdown/api.md
@@ -1,39 +1,12 @@
# API

## `soupsieve.HTML5`
Soup Sieve will detect the document type being used from the Beautiful Soup object that is given to it. For all HTML document types, it will treat tag names and attribute names without case sensitivity like most browsers do (even with XHTML). For HTML5, XHTML and XML, it will consider namespaces per the document's support (provided by the parser). To get namespace support in HTML5, it is recommended to use `html5lib` as the parser. Some additional configuration is required when using namespaces; see [Namespaces](#namespaces) for more information.
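
The detection described above can be sketched without Beautiful Soup. The `Node` class below is a hypothetical stand-in for a Beautiful Soup tag (only `parent`, `children`, `namespace`, and `is_xml` are modelled), and `detect` is an illustrative name, not a Soup Sieve function. The sketch also assumes every child is a tag, whereas real documents mix in strings and comments.

```python
NS_XHTML = 'http://www.w3.org/1999/xhtml'


class Node:
    """Hypothetical stand-in for a Beautiful Soup tag (not the real API)."""

    def __init__(self, name, namespace=None, is_xml=False):
        self.name = name
        self.namespace = namespace
        self.is_xml = is_xml
        self.parent = None
        self.children = []

    def append(self, child):
        """Attach a child and return it so calls can be chained."""
        child.parent = self
        self.children.append(child)
        return child


def detect(el):
    """Return `(is_xml, html_namespace)` for the document that owns `el`."""
    # Walk up to the top-level document object.
    doc = el
    while doc.parent is not None:
        doc = doc.parent
    # The first child tag is treated as the document root.
    root = doc.children[0] if doc.children else None
    html_namespace = root is not None and root.namespace == NS_XHTML
    # A document parsed as XML whose root is in the XHTML namespace is
    # handled as XHTML rather than generic XML.
    is_xml = bool(doc.is_xml) and not html_namespace
    return is_xml, html_namespace
```

Under these assumptions, an XHTML document parsed in XML mode is reported as `(False, True)`, a generic XML document as `(True, False)`, and a plain HTML document without namespaces as `(False, False)`.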

`HTML5` is a flag that instructs Soup Sieve to use HTML5 logic. When the `HTML5` flag is used, Soup Sieve will take into account namespaces for known embedded HTML5 namespaces such as SVG. `HTML5` will also not compare tag names and attribute names with case sensitivity.
While attribute values are generally treated as case sensitive, HTML5, XHTML, and HTML treat the `type` attribute specially: its value is always case insensitive. This is how most browsers treat `type`. If you need `type` to be case sensitive, you can use the `s` flag: `#!css [type="submit" s]`.

!!! tip
While attribute values are always treated as case sensitive, HTML5, XHTML, and HTML treat the `type` attribute special, `type`'s value is always case insensitive. This is generally how most browsers treat `type`.
## Flags

If you need `type` to be sensitive, you can use the `s` flag: `#!css [type="submit" s]`.

Keep in mind, that Soup Sieve itself is not responsible for deciding what tag has or does not have a namespace. This is actually determined by the parser used in Beautiful Soup. This flag only tells Soup Sieve that the parser should be calculating namespaces, so it is okay to look at them. The user is responsible for using an appropriate parser for HTML5. If using the [lxml][lxml] or [html5lib][html5lib] with Beautiful Soup, HTML5 namespaces *should* be accounted for in the parsing. If you are using Python's builtin HTML parser, this may not be the case.

## `soupsieve.HTML`

`HTML` is a flag that instructs Soup Sieve to use pre HTML5 logic. When the `HTML` flag is used, Soup Sieve will not consider namespaces when evaluating elements. `HTML` will also not compare tag names and attribute names with case sensitivity.

!!! tip
While attribute values are always treated as case sensitive, HTML5, XHTML, and HTML treat the `type` attribute special, `type`'s value is always case insensitive. This is generally how most browsers treat `type`.

If you need `type` to be sensitive, you can use the `s` flag: `#!css [type="submit" s]`.

## `soupsieve.XML`

`XML` is a flag that instructs Soup Sieve to use XML logic. `XML` will cause Soup Sieve to take namespaces into considerations, and it will evaluate tag names and attribute names with case sensitivity. It will also relax what it considers valid tag name and attribute characters. It will also disable `.class` and `#id` selectors this is more an HTML concept.

## `soupsieve.XHTML`

`XHTML` is a flag that instructs Soup Sieve to use XHTML logic. This will cause Soup Sieve to take namespaces into considerations, and evaluate tag names and attributes names with no case sensitivity as this is how most browsers deal with XHTML tags. `.class` and `#id` are perfectly valid in XHTML.

!!! tip
While attribute values are always treated as case sensitive, HTML5, XHTML, and HTML treat the `type` attribute special, `type`'s value is always case insensitive. This is generally how most browsers treat `type`.

If you need `type` to be sensitive, you can use the `s` flag: `#!css [type="submit" s]`.

It is recommend to use the `xml` mode in Beautiful Soup when parsing XHTML documents.
There are no flags at this time, but the parameter is provided for potential future use.

## `soupsieve.select()`

@@ -44,7 +17,7 @@ def select(select, node, namespaces=None, limit=0, flags=0):

`select`, given a tag, will select all tags that match the provided CSS selector string. You can give `limit` a positive integer to return a specific number of tags (0 means to return all tags).

`select` accepts a CSS selector string, a `node` or element, an optional [namespace](#namespaces) dictionary, a `limit`, and `flags`. If no flags are specified, HTML5 mode will be assumed.
`select` accepts a CSS selector string, a `node` or element, an optional [namespace](#namespaces) dictionary, a `limit`, and `flags`.

```pycon3
>>> import soupsieve as sv
@@ -64,13 +37,13 @@ def iselect(select, node, namespaces=None, limit=0, flags=0):
## `soupsieve.match()`

```py3
def match(select, node, namespaces=None, mode=0):
def match(select, node, namespaces=None, flags=0):
"""Match node."""
```

`match` matches a given node/element with a given CSS selector.

`match` accepts a CSS selector string, a `node` or element, an optional [namespace](#namespaces) dictionary, and flags. If no flags are specified, HTML5 mode will be assumed.
`match` accepts a CSS selector string, a `node` or element, an optional [namespace](#namespaces) dictionary, and flags.

```pycon3
>>> nodes = sv.select('p:is(.a, .b, .c)', soup)
@@ -89,7 +62,7 @@ def filter(select, nodes, namespaces=None, flags=0):

`filter` takes an iterable containing HTML nodes and will filter them based on the provided CSS selector string. If given a Beautiful Soup tag, it will iterate the children that are tags.

`filter` accepts a CSS selector string, an iterable containing tags, an optional [namespace](#namespaces) dictionary, and flags. If no flags are specified, HTML5 mode will be assumed.
`filter` accepts a CSS selector string, an iterable containing tags, an optional [namespace](#namespaces) dictionary, and flags.

```pycon3
>>> sv.filter('p:not(.b)', soup.div)
@@ -105,7 +78,7 @@ def comments(node, limit=0, flags=0):

`comments` is useful for extracting all comments from a document or document tag. It will extract from the given tag down through all of its children. You can limit how many comments are returned with `limit`.

`comments` accepts a `node` or element, a `limit`, and a flags. If no flags are specified, HTML5 mode will be assumed.
`comments` accepts a `node` or element, a `limit`, and flags.

## `soupsieve.icomments()`

@@ -173,3 +146,7 @@ namespace = {
```

Tags do not necessarily have to have a prefix for Soup Sieve to recognize them. For instance, in HTML5, SVG *should* automatically get the SVG namespace. Depending on how namespaces were defined in the document, tags may inherit namespaces in some conditions. Namespace assignment is mainly handled by the parser and exposed through the Beautiful Soup API. Soup Sieve then uses the Beautiful Soup API to compare namespaces when the document type supports namespaces.

--8<--
refs.txt
--8<--
8 changes: 5 additions & 3 deletions docs/src/markdown/selectors.md
@@ -54,9 +54,7 @@ Selector | Example | Descript
`:empty` | `#!css p:empty` | Selects every `#!html <p>` element that has no children and no text. Whitespace and comments are ignored.
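
To make "whitespace is ignored" concrete: CSS whitespace is limited to space, tab, carriage return, line feed, and form feed, which is narrower than what Python's `str.isspace()` accepts. The sketch below reuses the `RE_NOT_EMPTY` pattern from this commit's `css_match.py`. The `is_css_blank` helper is a hypothetical name added for illustration only.

```python
import re

# Matches any character that is NOT CSS whitespace
# (space, tab, carriage return, line feed, form feed).
RE_NOT_EMPTY = re.compile('[^ \t\r\n\f]')


def is_css_blank(text):
    """Hypothetical helper: True if `text` holds only CSS whitespace."""
    return RE_NOT_EMPTY.search(text) is None
```

Under this rule a string containing only a non-breaking space or a vertical tab counts as content, so an element holding such text would not be treated as empty.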

!!! warning "Experimental Selectors"
`:has()` implementation is experimental and may change. There are currently no reference implementation available in any browsers, not to mention the CSS4 specifications have not been finalized, so current implementation is based on our best interpretation.

Recent addition of `:nth-*`, `:first-*`, `:last-*`, and `:only-*` is experimental. It has been implemented to the best of our understanding, especially `of S` support. Any issues with should be reported.
`:has()` and `of S` support (in `:nth-child(an+b [of S]?)`) are experimental and may change. There are currently no reference implementations available in any browsers, and the CSS4 specifications have not been finalized, so the current implementation is based on our best interpretation. Any issues should be reported.

## Custom Selectors

@@ -67,3 +65,7 @@ Just because we include selectors from one source, does not mean we have intenti
Selector | Example | Description
------------------------------- | ----------------------------------- | -----------
`:contains(text)` | `#!css p:contains(text)` | Select all `#!html <p>` elements that contain "text" in their content, either directly in themselves or indirectly in their descendants.

--8<--
refs.txt
--8<--
2 changes: 1 addition & 1 deletion soupsieve/__init__.py
@@ -40,7 +40,7 @@
SoupSieve = cm.SoupSieve


def compile(pattern, namespaces=None, flags=HTML5): # noqa: A001
def compile(pattern, namespaces=None, flags=0): # noqa: A001
"""Compile CSS pattern."""

if namespaces is None:
2 changes: 1 addition & 1 deletion soupsieve/__meta__.py
@@ -186,5 +186,5 @@ def parse_version(ver, pre=False):
return Version(major, minor, micro, release, pre, post, dev)


__version_info__ = Version(1, 0, 0, "beta", 1)
__version_info__ = Version(1, 0, 0, "beta", 2)
__version__ = __version_info__._get_canonical()
84 changes: 46 additions & 38 deletions soupsieve/css_match.py
@@ -5,7 +5,7 @@
from .util import deprecated

# Empty tag pattern (whitespace okay)
RE_NOT_EMPTY = re.compile('[^ \t\r\n]')
RE_NOT_EMPTY = re.compile('[^ \t\r\n\f]')

# Relationships
REL_PARENT = ' '
@@ -19,6 +19,8 @@
REL_HAS_SIBLING = ':~'
REL_HAS_CLOSE_SIBLING = ':+'

NS_XHTML = 'http://www.w3.org/1999/xhtml'


class CSSMatch:
"""Perform CSS matching."""
@@ -29,9 +31,6 @@ def __init__(self, selectors, namespaces, flags):
self.selectors = selectors
self.namespaces = namespaces
self.flags = flags
self.mode = flags & util.MODE_MSK
if self.mode == 0:
self.mode == util.DEFAULT_MODE

def get_namespace(self, el):
"""Get the namespace for the element."""
@@ -45,18 +44,12 @@ def get_namespace(self, el):
def supports_namespaces(self):
"""Check if namespaces are supported in the HTML type."""

return self.mode in (util.HTML5, util.XHTML, util.XML)

def is_xml(self):
"""Check if document is an XML type."""

return self.mode in (util.XHTML, util.XML)
return self.is_xml or self.html_namespace

def get_attribute(self, el, attr, prefix):
"""Get attribute from element if it exists."""

value = None
is_xml = self.is_xml()
if self.supports_namespaces():
value = None
# If we have not defined namespaces, we can't very well find them, so don't bother trying.
@@ -81,7 +74,7 @@ def get_attribute(self, el, attr, prefix):
# We can't match our desired prefix attribute as the attribute doesn't have a prefix
if prefix and not p and prefix != '*':
continue
if is_xml:
if self.is_xml:
# The prefix doesn't match
if prefix and p and prefix != '*' and prefix != p:
continue
@@ -140,17 +133,15 @@ def match_attributes(self, el, attributes):
if attributes:
for a in attributes:
value = self.get_attribute(el, a.attribute, a.prefix)
pattern = a.xml_type_pattern if not self.html_namespace and a.xml_type_pattern else a.pattern
if isinstance(value, list):
value = ' '.join(value)
if a.pattern is None and value is None:
match = False
break
elif a.pattern is not None and value is None:
if value is None:
match = False
break
elif a.pattern is None:
elif pattern is None:
continue
elif value is None or a.pattern.match(value) is None:
elif pattern.match(value) is None:
match = False
break
return match
@@ -160,7 +151,7 @@ def match_tagname(self, el, tag):

return not (
tag.name and
tag.name not in ((util.lower(el.name) if not self.is_xml() else el.name), '*')
tag.name not in ((util.lower(el.name) if not self.is_xml else el.name), '*')
)

def match_tag(self, el, tag):
@@ -284,7 +275,7 @@ def match_nth_tag_type(self, el, child):
"""Match tag type for `nth` matches."""

return(
(child.name == (util.lower(el.name) if not self.is_xml() else el.name)) and
(child.name == (util.lower(el.name) if not self.is_xml else el.name)) and
(not self.supports_namespaces() or self.get_namespace(child) == self.get_namespace(el))
)

@@ -295,8 +286,6 @@ def match_nth(self, el, nth):

for n in nth:
matched = False
if not el.parent:
break
if n.selectors and not self.match_selectors(el, n.selectors):
break
parent = el.parent
@@ -390,20 +379,22 @@ def match_nth(self, el, nth):
break
return matched

def has_child(self, el):
"""Check if element has child."""

found_child = False
for child in el.children:
if isinstance(child, util.CHILD):
found_child = True
break
return found_child

def match_empty(self, el, empty):
"""Check if element is empty (if requested)."""

return not empty or (RE_NOT_EMPTY.search(el.text) is None and not self.has_child(el))
is_empty = True
if empty:
for child in el.children:
if isinstance(child, util.TAG):
is_empty = False
break
elif (
(isinstance(child, util.NAV_STRINGS) and not isinstance(child, util.NON_CONTENT_STRINGS)) and
RE_NOT_EMPTY.search(child)
):
is_empty = False
break
return is_empty

def match_subselectors(self, el, selectors):
"""Match selectors."""
@@ -417,9 +408,10 @@ def match_subselectors(self, el, selectors):
def match_contains(self, el, contains):
"""Match element if it contains text."""

types = (util.NAV_STRINGS,) if not self.is_xml else (util.NAV_STRINGS, util.CDATA)
match = True
for c in contains:
if c not in el.get_text():
if c not in el.get_text(types=types):
match = False
break
return match
@@ -428,7 +420,6 @@ def match_selectors(self, el, selectors):
"""Check if element matches one of the selectors."""

match = False
is_html = self.mode != util.XML
is_not = selectors.is_not
for selector in selectors:
match = is_not
@@ -441,10 +432,10 @@ def match_selectors(self, el, selectors):
if not self.match_empty(el, selector.empty):
continue
# Verify id matches
if is_html and selector.ids and not self.match_id(el, selector.ids):
if selector.ids and not self.match_id(el, selector.ids):
continue
# Verify classes match
if is_html and selector.classes and not self.match_classes(el, selector.classes):
if selector.classes and not self.match_classes(el, selector.classes):
continue
# Verify attribute(s) match
if not self.match_attributes(el, selector.attributes):
@@ -464,10 +455,27 @@ def match_selectors(self, el, selectors):

return match

def is_html_ns(self, el):
"""Check if in HTML namespace."""

ns = getattr(el, 'namespace') if el else None
return ns and ns == NS_XHTML

def match(self, el):
"""Match."""

return isinstance(el, util.TAG) and self.match_selectors(el, self.selectors)
doc = el
while doc.parent:
doc = doc.parent
root = None
for child in doc.children:
if isinstance(child, util.TAG):
root = child
break
self.html_namespace = self.is_html_ns(root)
self.is_xml = doc.is_xml and not self.html_namespace

return isinstance(el, util.TAG) and el.parent and self.match_selectors(el, self.selectors)


class SoupSieve(util.Immutable):