Merge pull request #14 from facelessuser/whitespace

Whitespace fixes and remove document flags
facelessuser · Dec 14, 2018 · ca16ffb
2 parents 9096f2d + a724995
commit ca16ffb

Showing 18 changed files with 344 additions and 282 deletions.
diff --git a/docs/src/dictionary/en-custom.txt b/docs/src/dictionary/en-custom.txt
@@ -2,6 +2,7 @@ API
 Accessors
 Aspell
 BeautifulSoup
+CDATA
 CSS
 CSS's
 Changelog

diff --git a/docs/src/markdown/about/changelog.md b/docs/src/markdown/about/changelog.md
@@ -1,5 +1,12 @@
 # Changelog
 
+## 1.0.0b2
+
+- **NEW**: Drop document flags. Document type can be detected from the Beautiful Soup object directly.
+- **FIX**: CSS selectors should be evaluated with CSS whitespace rules.
+- **FIX**: Processing instructions, CDATA, and declarations should all be ignored in `:contains` and child considerations for `:empty`.
+- **FIX**: In Beautiful Soup, the document itself is the first tag. Do not match the "document" tag by returning false for any tag that doesn't have a parent.
+
 ## 1.0.0b1
 
 - **NEW**: Add support for non-standard `:contains()` selector.

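The "CSS whitespace rules" changelog entry corresponds to the `RE_NOT_EMPTY` change in `soupsieve/css_match.py`, which adds form feed to the set of ignored characters. A minimal sketch of the difference follows; `is_css_blank` is a hypothetical helper used only for illustration and is not part of Soup Sieve.

```python
import re

# Character class from this change: anything that is not a space, tab,
# carriage return, line feed, or form feed.
RE_NOT_EMPTY = re.compile('[^ \t\r\n\f]')
# The pattern before this change, which did not list form feed.
RE_NOT_EMPTY_OLD = re.compile('[^ \t\r\n]')


def is_css_blank(text):
    """Return True when `text` holds only the whitespace listed above (hypothetical helper)."""
    return RE_NOT_EMPTY.search(text) is None


# A lone form feed is blank under the new pattern but counted as content by the old one.
print(is_css_blank('\f'), RE_NOT_EMPTY_OLD.search('\f') is None)
```

Characters such as the non-breaking space are outside this class, so they still count as content.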
diff --git a/docs/src/markdown/about/development.md b/docs/src/markdown/about/development.md
@@ -220,7 +220,7 @@ class SelectorTag:
 class SelectorAttribute:
     """Selector attribute rule."""
 
-    def __init__(self, attribute, prefix, pattern):
+    def __init__(self, attribute, prefix, pattern, xml_type_pattern):
         """Initialize."""
 ```
 
@@ -229,6 +229,7 @@ class SelectorAttribute:
 `attribute`         | Contains the attribute name to match.
 `prefix`            | Contains the attribute namespace prefix to match if any.
 `pattern`           | Contains a `re` regular expression object that matches the desired attribute value.
+`xml_type_pattern`  | As the default `type` pattern is case insensitive, when the attribute value is `type` and a case sensitivity has not been explicitly defined, a secondary case sensitive `type` pattern is compiled for use with XML documents when detected.
 
 ### `SelectorNth`
 

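The `xml_type_pattern` description above can be sketched as follows. This is an illustration under stated assumptions, not the library's parser: `Attr`, `build_type_attr`, and `pick_pattern` are hypothetical names, and the single `is_xml` argument simplifies the condition that `match_attributes` uses in this commit (`not self.html_namespace`).

```python
import re
from collections import namedtuple

# Hypothetical stand-in mirroring the documented `SelectorAttribute` fields.
Attr = namedtuple('Attr', ['attribute', 'prefix', 'pattern', 'xml_type_pattern'])


def build_type_attr(value):
    """Compile both patterns for `[type="value"]` with no explicit case flag."""
    escaped = re.escape(value)
    return Attr(
        'type',
        None,
        re.compile('^' + escaped + '$', re.I),  # default: case insensitive
        re.compile('^' + escaped + '$')         # secondary: case sensitive, for XML
    )


def pick_pattern(attr, is_xml):
    """Use the case sensitive pattern only for XML and only when one was compiled."""
    return attr.xml_type_pattern if is_xml and attr.xml_type_pattern else attr.pattern


attr = build_type_attr('submit')
print(pick_pattern(attr, False).match('SUBMIT') is not None)
print(pick_pattern(attr, True).match('SUBMIT') is not None)
```

Compiling both patterns up front lets one compiled selector be reused against HTML and XML documents, since the document type is only known at match time.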
diff --git a/docs/src/markdown/api.md b/docs/src/markdown/api.md
@@ -1,39 +1,12 @@
 # API
 
-## `soupsieve.HTML5`
+Soup Sieve will detect the document type being used from the Beautiful Soup object that is given to it. For all HTML document types, it will treat tag names and attribute names without case sensitivity like most browsers do (even with XHTML). For HTML5, XHTML and XML, it will consider namespaces per the document's support (provided by the parser). To get namespaces support in HTML5, it is recommended to use `html5lib` as the parser. Some additional configuration is required when using namespaces, see [Namespace](#namespaces) for more information.
 
-`HTML5` is a flag that instructs Soup Sieve to use HTML5 logic. When the `HTML5` flag is used, Soup Sieve will take into account namespaces for known embedded HTML5 namespaces such as SVG. `HTML5` will also not compare tag names and attribute names with case sensitivity.
+While attribute values are always generally treated as case sensitive, HTML5, XHTML, and HTML treat the `type` attribute special. The `type` attribute's value is always case insensitive. This is generally how most browsers treat `type`. If you need `type` to be sensitive, you can use the `s` flag: `#!css [type="submit" s]`.
 
-!!! tip
-    While attribute values are always treated as case sensitive, HTML5, XHTML, and HTML treat the `type` attribute special, `type`'s value is always case insensitive. This is generally how most browsers treat `type`.
+## Flags
 
-    If you need `type` to be sensitive, you can use the `s` flag: `#!css [type="submit" s]`.
-
-Keep in mind, that Soup Sieve itself is not responsible for deciding what tag has or does not have a namespace.  This is actually determined by the parser used in Beautiful Soup. This flag only tells Soup Sieve that the parser should be calculating namespaces, so it is okay to look at them. The user is responsible for using an appropriate parser for HTML5.  If using the [lxml][lxml] or [html5lib][html5lib] with Beautiful Soup, HTML5 namespaces *should* be accounted for in the parsing. If you are using Python's builtin HTML parser, this may not be the case.
-
-## `soupsieve.HTML`
-
-`HTML` is a flag that instructs Soup Sieve to use pre HTML5 logic. When the `HTML` flag is used, Soup Sieve will not consider namespaces when evaluating elements. `HTML` will also not compare tag names  and attribute names with case sensitivity.
-
-!!! tip
-    While attribute values are always treated as case sensitive, HTML5, XHTML, and HTML treat the `type` attribute special, `type`'s value is always case insensitive. This is generally how most browsers treat `type`.
-
-    If you need `type` to be sensitive, you can use the `s` flag: `#!css [type="submit" s]`.
-
-## `soupsieve.XML`
-
-`XML` is a flag that instructs Soup Sieve to use XML logic. `XML` will cause Soup Sieve to take namespaces into considerations, and it will evaluate tag names and attribute names with case sensitivity. It will also relax what it considers valid tag name and attribute characters. It will also disable `.class` and `#id` selectors this is more an HTML concept.
-
-## `soupsieve.XHTML`
-
-`XHTML` is a flag that instructs Soup Sieve to use XHTML logic. This will cause Soup Sieve to take namespaces into considerations, and evaluate tag names and attributes names with no case sensitivity as this is how most browsers deal with XHTML tags. `.class` and `#id` are perfectly valid in XHTML.
-
-!!! tip
-    While attribute values are always treated as case sensitive, HTML5, XHTML, and HTML treat the `type` attribute special, `type`'s value is always case insensitive. This is generally how most browsers treat `type`.
-
-    If you need `type` to be sensitive, you can use the `s` flag: `#!css [type="submit" s]`.
-
-It is recommend to use the `xml` mode in Beautiful Soup when parsing XHTML documents.
+There are no flags at this time, but the parameter is provided for potential future use.
 
 ## `soupsieve.select()`
 
@@ -44,7 +17,7 @@ def select(select, node, namespaces=None, limit=0, flags=0):
 
 `select` given a tag, will select all tags that match the provided CSS selector string. You can give `limit` a positive integer to return a specific number tags (0 means to return all tags).
 
-`select` accepts a CSS selector string, a `node` or element, an optional [namespace](#namespaces) dictionary, a `limit`, and `flags`. If no flags are specified, HTML5 mode will be assumed.
+`select` accepts a CSS selector string, a `node` or element, an optional [namespace](#namespaces) dictionary, a `limit`, and `flags`.
 
 ```pycon3
 >>> import soupsieve as sv
@@ -64,13 +37,13 @@ def iselect(select, node, namespaces=None, limit=0, flags=0):
 ## `soupsieve.match()`
 
 ```py3
-def match(select, node, namespaces=None, mode=0):
+def match(select, node, namespaces=None, flags=0):
     """Match node."""
 ```
 
 `match` matches a given node/element with a given CSS selector.
 
-`match` accepts a CSS selector string, a `node` or element, an optional [namespace](#namespaces) dictionary, and flags.  If no flags are specified, HTML5 mode will be assumed.
+`match` accepts a CSS selector string, a `node` or element, an optional [namespace](#namespaces) dictionary, and flags.
 
 ```pycon3
 >>> nodes = sv.select('p:is(.a, .b, .c)', soup)
@@ -89,7 +62,7 @@ def filter(select, nodes, namespaces=None, flags=0):
 
 `filter` takes an iterable containing HTML nodes and will filter them based on the provided CSS selector string. If given a Beautiful Soup tag, it will iterate the children that are tags.
 
-`filter` accepts a CSS selector string, an iterable containing tags, an optional [namespace](#namespaces) dictionary, and flags.  If no flags are specified, HTML5 mode will be assumed.
+`filter` accepts a CSS selector string, an iterable containing tags, an optional [namespace](#namespaces) dictionary, and flags.
 
 ```pycon3
 >>> sv.filter('p:not(.b)', soup.div)
@@ -105,7 +78,7 @@ def comments(node, limit=0, flags=0):
 
 `comments` if useful to extract all comments from a document or document tag. It will extract from the given tag down through all of its children.  You can limit how many comments are returned with `limit`.
 
-`comments` accepts a `node` or element, a `limit`, and a flags.  If no flags are specified, HTML5 mode will be assumed.
+`comments` accepts a `node` or element, a `limit`, and flags.
 
 ## `soupsieve.icomments()`
 
@@ -173,3 +146,7 @@ namespace = {
 ```
 
 Tags do not necessarily have to have a prefix for Soup Sieve to recognize them.  For instance, in HTML5, SVG *should* automatically get the SVG namespace. Depending how namespaces were defined in the documentation, tags may inherit namespaces in some conditions.  Namespace assignment is mainly handled by the parser and exposed through the Beautiful Soup API. Soup Sieve uses the Beautiful Soup API to then compare namespaces when the appropriate document that supports namespaces is set.
+
+--8<--
+refs.txt
+--8<--
diff --git a/docs/src/markdown/selectors.md b/docs/src/markdown/selectors.md
@@ -54,9 +54,7 @@ Selector                        | Example                             | Descript
 `:empty`                        | `#!css p:empty`                     | Selects every `#!html <p>` element that has no children and either no text. Whitespace and comments are ignored.
 
 !!! warning "Experimental Selectors"
-    `:has()` implementation is experimental and may change. There are currently no reference implementation available in any browsers, not to mention the CSS4 specifications have not been finalized, so current implementation is based on our best interpretation.
-
-    Recent addition of `:nth-*`, `:first-*`, `:last-*`, and `:only-*` is experimental. It has been implemented to the best of our understanding, especially `of S` support. Any issues with should be reported.
+    `:has()` and `of S` support (in `:nth-child(an+b [of S]?)`) is experimental and may change. There are currently no reference implementations available in any browsers, not to mention the CSS4 specifications have not been finalized, so current implementation is based on our best interpretation. Any issues should be reported.
 
 ## Custom Selectors
 
@@ -67,3 +65,7 @@ Just because we include selectors from one source, does not mean we have intenti
 Selector                        | Example                             | Description
 ------------------------------- | ----------------------------------- | -----------
 `:contains(text)`               | `#!css p:contains(text)`            | Select all `#!html <p>` elements that contain "text" in their content, either directly in themselves or indirectly in their decedents.
+
+--8<--
+refs.txt
+--8<--
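The `:empty` rule described in the table (child tags and non-whitespace text count, comments and similar nodes do not) can be sketched without Beautiful Soup. The `Tag`, `Text`, and `Comment` classes below are hypothetical stand-ins for the node types the real matcher checks through `util.TAG`, `util.NAV_STRINGS`, and `util.NON_CONTENT_STRINGS`.

```python
import re

RE_NOT_EMPTY = re.compile('[^ \t\r\n\f]')


class Tag:
    """Hypothetical stand-in for an element node."""


class Text(str):
    """Hypothetical stand-in for a text node."""


class Comment(Text):
    """Hypothetical stand-in for a non-content string such as a comment."""


def is_empty(children):
    """Return True when no child is a tag or a content string with non-whitespace text."""
    for child in children:
        if isinstance(child, Tag):
            return False
        if (
            isinstance(child, Text) and not isinstance(child, Comment) and
            RE_NOT_EMPTY.search(child)
        ):
            return False
    return True


print(is_empty([Text(' \n'), Comment('a note')]))
print(is_empty([Text('text')]))
```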
diff --git a/soupsieve/__init__.py b/soupsieve/__init__.py
@@ -40,7 +40,7 @@
 SoupSieve = cm.SoupSieve
 
 
-def compile(pattern, namespaces=None, flags=HTML5):  # noqa: A001
+def compile(pattern, namespaces=None, flags=0):  # noqa: A001
     """Compile CSS pattern."""
 
     if namespaces is None:

diff --git a/soupsieve/__meta__.py b/soupsieve/__meta__.py
@@ -186,5 +186,5 @@ def parse_version(ver, pre=False):
     return Version(major, minor, micro, release, pre, post, dev)
 
 
-__version_info__ = Version(1, 0, 0, "beta", 1)
+__version_info__ = Version(1, 0, 0, "beta", 2)
 __version__ = __version_info__._get_canonical()
diff --git a/soupsieve/css_match.py b/soupsieve/css_match.py
@@ -5,7 +5,7 @@
 from .util import deprecated
 
 # Empty tag pattern (whitespace okay)
-RE_NOT_EMPTY = re.compile('[^ \t\r\n]')
+RE_NOT_EMPTY = re.compile('[^ \t\r\n\f]')
 
 # Relationships
 REL_PARENT = ' '
@@ -19,6 +19,8 @@
 REL_HAS_SIBLING = ':~'
 REL_HAS_CLOSE_SIBLING = ':+'
 
+NS_XHTML = 'http://www.w3.org/1999/xhtml'
+
 
 class CSSMatch:
     """Perform CSS matching."""
@@ -29,9 +31,6 @@ def __init__(self, selectors, namespaces, flags):
         self.selectors = selectors
         self.namespaces = namespaces
         self.flags = flags
-        self.mode = flags & util.MODE_MSK
-        if self.mode == 0:
-            self.mode == util.DEFAULT_MODE
 
     def get_namespace(self, el):
         """Get the namespace for the element."""
@@ -45,18 +44,12 @@ def get_namespace(self, el):
     def supports_namespaces(self):
         """Check if namespaces are supported in the HTML type."""
 
-        return self.mode in (util.HTML5, util.XHTML, util.XML)
-
-    def is_xml(self):
-        """Check if document is an XML type."""
-
-        return self.mode in (util.XHTML, util.XML)
+        return self.is_xml or self.html_namespace
 
     def get_attribute(self, el, attr, prefix):
         """Get attribute from element if it exists."""
 
         value = None
-        is_xml = self.is_xml()
         if self.supports_namespaces():
             value = None
             # If we have not defined namespaces, we can't very well find them, so don't bother trying.
@@ -81,7 +74,7 @@ def get_attribute(self, el, attr, prefix):
                 # We can't match our desired prefix attribute as the attribute doesn't have a prefix
                 if prefix and not p and prefix != '*':
                     continue
-                if is_xml:
+                if self.is_xml:
                     # The prefix doesn't match
                     if prefix and p and prefix != '*' and prefix != p:
                         continue
@@ -140,17 +133,15 @@ def match_attributes(self, el, attributes):
         if attributes:
             for a in attributes:
                 value = self.get_attribute(el, a.attribute, a.prefix)
+                pattern = a.xml_type_pattern if not self.html_namespace and a.xml_type_pattern else a.pattern
                 if isinstance(value, list):
                     value = ' '.join(value)
-                if a.pattern is None and value is None:
-                    match = False
-                    break
-                elif a.pattern is not None and value is None:
+                if value is None:
                     match = False
                     break
-                elif a.pattern is None:
+                elif pattern is None:
                     continue
-                elif value is None or a.pattern.match(value) is None:
+                elif pattern.match(value) is None:
                     match = False
                     break
         return match
@@ -160,7 +151,7 @@ def match_tagname(self, el, tag):
 
         return not (
             tag.name and
-            tag.name not in ((util.lower(el.name) if not self.is_xml() else el.name), '*')
+            tag.name not in ((util.lower(el.name) if not self.is_xml else el.name), '*')
         )
 
     def match_tag(self, el, tag):
@@ -284,7 +275,7 @@ def match_nth_tag_type(self, el, child):
         """Match tag type for `nth` matches."""
 
         return(
-            (child.name == (util.lower(el.name) if not self.is_xml() else el.name)) and
+            (child.name == (util.lower(el.name) if not self.is_xml else el.name)) and
             (not self.supports_namespaces() or self.get_namespace(child) == self.get_namespace(el))
         )
 
@@ -295,8 +286,6 @@ def match_nth(self, el, nth):
 
         for n in nth:
             matched = False
-            if not el.parent:
-                break
             if n.selectors and not self.match_selectors(el, n.selectors):
                 break
             parent = el.parent
@@ -390,20 +379,22 @@ def match_nth(self, el, nth):
                 break
         return matched
 
-    def has_child(self, el):
-        """Check if element has child."""
-
-        found_child = False
-        for child in el.children:
-            if isinstance(child, util.CHILD):
-                found_child = True
-                break
-        return found_child
-
     def match_empty(self, el, empty):
         """Check if element is empty (if requested)."""
 
-        return not empty or (RE_NOT_EMPTY.search(el.text) is None and not self.has_child(el))
+        is_empty = True
+        if empty:
+            for child in el.children:
+                if isinstance(child, util.TAG):
+                    is_empty = False
+                    break
+                elif (
+                    (isinstance(child, util.NAV_STRINGS) and not isinstance(child, util.NON_CONTENT_STRINGS)) and
+                    RE_NOT_EMPTY.search(child)
+                ):
+                    is_empty = False
+                    break
+        return is_empty
 
     def match_subselectors(self, el, selectors):
         """Match selectors."""
@@ -417,9 +408,10 @@ def match_subselectors(self, el, selectors):
     def match_contains(self, el, contains):
         """Match element if it contains text."""
 
+        types = (util.NAV_STRINGS,) if not self.is_xml else (util.NAV_STRINGS, util.CDATA)
         match = True
         for c in contains:
-            if c not in el.get_text():
+            if c not in el.get_text(types=types):
                 match = False
                 break
         return match
@@ -428,7 +420,6 @@ def match_selectors(self, el, selectors):
         """Check if element matches one of the selectors."""
 
         match = False
-        is_html = self.mode != util.XML
         is_not = selectors.is_not
         for selector in selectors:
             match = is_not
@@ -441,10 +432,10 @@ def match_selectors(self, el, selectors):
             if not self.match_empty(el, selector.empty):
                 continue
             # Verify id matches
-            if is_html and selector.ids and not self.match_id(el, selector.ids):
+            if selector.ids and not self.match_id(el, selector.ids):
                 continue
             # Verify classes match
-            if is_html and selector.classes and not self.match_classes(el, selector.classes):
+            if selector.classes and not self.match_classes(el, selector.classes):
                 continue
             # Verify attribute(s) match
             if not self.match_attributes(el, selector.attributes):
@@ -464,10 +455,27 @@ def match_selectors(self, el, selectors):
 
         return match
 
+    def is_html_ns(self, el):
+        """Check if in HTML namespace."""
+
+        ns = getattr(el, 'namespace') if el else None
+        return ns and ns == NS_XHTML
+
     def match(self, el):
         """Match."""
 
-        return isinstance(el, util.TAG) and self.match_selectors(el, self.selectors)
+        doc = el
+        while doc.parent:
+            doc = doc.parent
+        root = None
+        for child in doc.children:
+            if isinstance(child, util.TAG):
+                root = child
+                break
+        self.html_namespace = self.is_html_ns(root)
+        self.is_xml = doc.is_xml and not self.html_namespace
+
+        return isinstance(el, util.TAG) and el.parent and self.match_selectors(el, self.selectors)
 
 
 class SoupSieve(util.Immutable):
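
The detection logic added to `match` above can be sketched outside Beautiful Soup roughly as follows. `Node` and `detect` are hypothetical stand-ins; the real code reads `parent`, `children`, `namespace`, and `is_xml` from Beautiful Soup objects and skips non-tag children when looking for the root, which this sketch does not model.

```python
NS_XHTML = 'http://www.w3.org/1999/xhtml'


class Node:
    """Hypothetical stand-in for a Beautiful Soup tag or document object."""

    def __init__(self, name, namespace=None, is_xml=False, parent=None):
        self.name = name
        self.namespace = namespace
        self.is_xml = is_xml
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def detect(el):
    """Return `(html_namespace, is_xml)` for the document that contains `el`."""
    # Walk up to the document object, which has no parent.
    doc = el
    while doc.parent:
        doc = doc.parent
    # Simplification: treat the first child as the root tag.
    root = doc.children[0] if doc.children else None
    html_namespace = bool(root is not None and root.namespace == NS_XHTML)
    # An XML parse whose root is in the XHTML namespace is handled as HTML.
    is_xml = bool(doc.is_xml and not html_namespace)
    return html_namespace, is_xml


doc = Node('[document]', is_xml=True)
html = Node('html', NS_XHTML, parent=doc)
p = Node('p', NS_XHTML, parent=html)
print(detect(p))
```

Because the document object itself has no parent, the `el.parent` check in `match` is what keeps it from being returned as a match.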