Fix breadcrumb extraction when we have comments in the HTML #15

Merged
kmike merged 1 commit into zytedata:main from fix-breadcrumbs-on-comments
Apr 16, 2024

Conversation

lopuhin
Contributor

@lopuhin lopuhin commented Apr 15, 2024

This might also fix other extractions where comments are passed. See the breadcrumbs test, which fails with the following error without the fix:

tests/test_breadcrumbs.py:69:                                                                                                                                                                 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
zyte_parsers/breadcrumbs.py:136: in extract_breadcrumbs                                                                                                                                       
    extract_breadcrumbs_rec(                                                                                                                                                                  
zyte_parsers/breadcrumbs.py:112: in extract_breadcrumbs_rec                                                                                                                                   
    extract_breadcrumbs_rec(                                                                                                                                                                  
zyte_parsers/breadcrumbs.py:85: in extract_breadcrumbs_rec                                                                                                                                    
    extract_text(node),                                                                                                                                                                       
zyte_parsers/utils.py:93: in extract_text                                                                                                                                                     
    node = input_to_element(node)                                                                                                                                                             
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
                                                                                                                                                                                              
node = <!---->                                                                                                                                                                                
                                                                                                                                                                                              
    def input_to_element(node: SelectorOrElement) -> HtmlElement:                                                                                                                             
        """Convert a supported input object to a HtmlElement."""                                                                                                                              
        if isinstance(node, HtmlElement):                                                                                                                                                     
            return node                                                                                                                                                                       
>       return node.root                                                                                                                                                                      
E       AttributeError: 'HtmlComment' object has no attribute 'root'                                                                                                                          
                                                                                                                                                                                              
zyte_parsers/api.py:20: AttributeError                                                                                                                                                        
================================================================================== short test summary info ===================================================================================
FAILED tests/test_breadcrumbs.py::test_extract_breadcrumbs[[comments1.html] - https://www.hurriyet.com.tr/yerel-haberler/ankara/abbden-balciya-ulus-yaniti-41547223] - AttributeError: 'Htm...
========================================================================== 1 failed, 301 passed, 9 xfailed in 0.56s ==========================================================================
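To illustrate the failure mode in the traceback above, here is a minimal sketch using `lxml.html` directly. It is not the patch from this PR: the markup and the `element_children` helper are hypothetical, and the only thing taken from the traceback is that a comment node is an `HtmlComment`, fails the `isinstance(node, HtmlElement)` check, and has no `root` attribute.

```python
from typing import List

from lxml.html import HtmlComment, HtmlElement, fromstring

# Hypothetical markup: an empty comment sits next to the breadcrumb items.
root = fromstring(
    "<ul><!----><li><a href='/'>Home</a></li><li>News</li></ul>"
)

# Iterating an lxml element yields comment nodes as well as elements.
children = list(root)
comment = children[0]

# The comment is an HtmlComment, not an HtmlElement, so a check like
# isinstance(node, HtmlElement) is False and code falls through to `.root`.
is_comment = isinstance(comment, HtmlComment)
is_element = isinstance(comment, HtmlElement)
has_root = hasattr(comment, "root")


def element_children(node: HtmlElement) -> List[HtmlElement]:
    """Return only element children, skipping comments (hypothetical helper)."""
    return [child for child in node if isinstance(child, HtmlElement)]


texts = [child.text_content() for child in element_children(root)]
print(is_comment, is_element, has_root, texts)
```

Filtering on `HtmlElement` is only one possible guard; the actual change may handle comments at a different layer, for example inside `input_to_element` or `extract_text`.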

@kmike kmike merged commit 8b161cf into zytedata:main Apr 16, 2024
9 checks passed
@lopuhin lopuhin deleted the fix-breadcrumbs-on-comments branch April 16, 2024 10:59
@lopuhin
Contributor Author

lopuhin commented Apr 16, 2024

Thanks for the review, @Gallaecio and @kmike.

@kmike
Contributor

kmike commented Apr 16, 2024

a great catch @lopuhin!

3 participants