Fix breadcrumb extraction when we have comments in the HTML #15

lopuhin · 2024-04-15T16:21:12Z

Also might fix other extractions where comments are passed. See the breadcrumbs test: without the fix, it fails with the following:

tests/test_breadcrumbs.py:69:                                                                                                                                                                 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
zyte_parsers/breadcrumbs.py:136: in extract_breadcrumbs                                                                                                                                       
    extract_breadcrumbs_rec(                                                                                                                                                                  
zyte_parsers/breadcrumbs.py:112: in extract_breadcrumbs_rec                                                                                                                                   
    extract_breadcrumbs_rec(                                                                                                                                                                  
zyte_parsers/breadcrumbs.py:85: in extract_breadcrumbs_rec                                                                                                                                    
    extract_text(node),                                                                                                                                                                       
zyte_parsers/utils.py:93: in extract_text                                                                                                                                                     
    node = input_to_element(node)                                                                                                                                                             
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
                                                                                                                                                                                              
node = <!---->                                                                                                                                                                                
                                                                                                                                                                                              
    def input_to_element(node: SelectorOrElement) -> HtmlElement:                                                                                                                             
        """Convert a supported input object to a HtmlElement."""                                                                                                                              
        if isinstance(node, HtmlElement):                                                                                                                                                     
            return node                                                                                                                                                                       
>       return node.root                                                                                                                                                                      
E       AttributeError: 'HtmlComment' object has no attribute 'root'                                                                                                                          
                                                                                                                                                                                              
zyte_parsers/api.py:20: AttributeError                                                                                                                                                        
================================================================================== short test summary info ===================================================================================
FAILED tests/test_breadcrumbs.py::test_extract_breadcrumbs[[comments1.html] - https://www.hurriyet.com.tr/yerel-haberler/ankara/abbden-balciya-ulus-yaniti-41547223] - AttributeError: 'Htm...
========================================================================== 1 failed, 301 passed, 9 xfailed in 0.56s ==========================================================================
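The traceback shows `input_to_element` falling through to `node.root` because an lxml `HtmlComment` is not an instance of `HtmlElement`. Below is a minimal sketch of one way to guard against that, not necessarily the exact change in this PR. The `FakeSelector` class and the sample markup are hypothetical stand-ins for illustration; only `HtmlElement`, `HtmlComment` and `fromstring` come from `lxml.html`.

```python
from lxml.html import HtmlComment, HtmlElement, fromstring


class FakeSelector:
    """Hypothetical stand-in for a selector-like wrapper exposing `.root`."""

    def __init__(self, root):
        self.root = root


def input_to_element(node):
    """Return the underlying lxml node for an element, comment, or wrapper."""
    # HtmlComment does not subclass HtmlElement, so it is checked explicitly;
    # otherwise a comment would reach `node.root` and raise AttributeError.
    if isinstance(node, (HtmlElement, HtmlComment)):
        return node
    # Assumption: anything else is a selector-like object with a `.root`.
    return node.root


# Sample markup (made up for this sketch) with a comment between list items.
tree = fromstring("<ul><li>Home</li><!-- separator --><li>News</li></ul>")
children = list(tree)
comment = children[1]

# The comment node is returned unchanged instead of raising.
resolved_comment = input_to_element(comment)
# A wrapper is still unwrapped through `.root`.
resolved_wrapped = input_to_element(FakeSelector(tree))
```

Callers that walk child nodes may still want to skip comments when collecting text; the sketch only covers the type check that raised in the traceback.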


lopuhin · 2024-04-16T10:59:38Z

Thanks for the review, @Gallaecio and @kmike.

kmike · 2024-04-16T11:11:11Z

A great catch, @lopuhin!

fix breadcrumb extraction when we have comments

0300f11


Gallaecio approved these changes Apr 15, 2024


kmike merged commit 8b161cf into zytedata:main Apr 16, 2024
9 checks passed

lopuhin deleted the fix-breadcrumbs-on-comments branch April 16, 2024 10:59
