Add search query support to the job posting spider. #115

Open: wRAR wants to merge 3 commits into main

wRAR
Member

@wRAR wRAR commented Dec 30, 2024

This worked for Indeed but didn't work for Glassdoor, and we should understand why.

codecov bot commented Dec 30, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.09%. Comparing base (eff912f) to head (4a5be97).
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #115      +/-   ##
==========================================
+ Coverage   96.06%   96.09%   +0.02%     
==========================================
  Files          26       26              
  Lines        2595     2613      +18     
==========================================
+ Hits         2493     2511      +18     
  Misses        102      102              
Files with missing lines Coverage Δ
zyte_spider_templates/spiders/job_posting.py 95.87% <100.00%> (+0.93%) ⬆️

@wRAR wRAR marked this pull request as draft December 30, 2024 15:38
@Gallaecio
Contributor

Gallaecio commented Dec 31, 2024

> didn't work for Glassdoor and we should understand why.

I don’t see SearchAction metadata in the HTML, and Formasaurus seems to fail:

import asyncio
from base64 import b64decode

from form2request import form2request
from formasaurus import build_submission
from parsel import Selector
from zyte_api import AsyncZyteAPI


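async def main():
    client = AsyncZyteAPI()
    url = "https://www.glassdoor.com/Job/index.htm"
    # Fetch the raw page body through Zyte API; it comes back base64-encoded.
    result = await client.get({"url": url, "httpResponseBody": True})
    html = b64decode(result["httpResponseBody"]).decode()
    selector = Selector(text=html, base_url=url)
    # Ask Formasaurus for a search form and fill its query field with "foo".
    form, data, submit_button = build_submission(
        selector, "search", {"search query": "foo"}
    )
    # Turn the filled form into request data, as if the submit button were clicked.
    request_data = form2request(form, data, click=submit_button)
    print(request_data)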


asyncio.run(main())

$ python test.py
Request(url='https://www.glassdoor.com/Job/index.htm', method='GET', headers=[], body=b'')

So I would say it “fails as expected”.
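
For reference, a minimal sketch of how the "no SearchAction metadata" check could be reproduced on a downloaded page. It assumes the metadata, when present, is embedded as schema.org JSON-LD in a script tag; SearchAction can also be expressed as microdata or RDFa, which this sketch does not cover. The helper names (`find_search_action_templates` and friends) are hypothetical and not part of zyte-spider-templates, Formasaurus, or form2request.

```python
import json
from html.parser import HTMLParser


class _JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        # Script content may arrive in several pieces, so accumulate it.
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._chunks))
            self._in_jsonld = False


def _walk(node):
    """Yield every dict nested anywhere inside a decoded JSON value."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from _walk(value)
    elif isinstance(node, list):
        for item in node:
            yield from _walk(item)


def find_search_action_templates(html):
    """Return the URL templates of all JSON-LD SearchAction entries in html."""
    collector = _JsonLdCollector()
    collector.feed(html)
    collector.close()
    templates = []
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # Ignore malformed JSON-LD blocks.
        for item in _walk(data):
            types = item.get("@type")
            types = types if isinstance(types, list) else [types]
            if "SearchAction" not in types:
                continue
            # "target" is either a URL template string or an EntryPoint object.
            target = item.get("target")
            if isinstance(target, dict):
                target = target.get("urlTemplate")
            if isinstance(target, str):
                templates.append(target)
    return templates
```

Calling `find_search_action_templates(html)` on the decoded response body from the script above returns a list of URL templates. An empty list would be consistent with the page exposing no SearchAction metadata, at least in JSON-LD form.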

@wRAR
Member Author

wRAR commented Dec 31, 2024

That is unfortunate. I wonder what other steps we can take to support it.

@wRAR wRAR marked this pull request as ready for review January 6, 2025 12:41