kaimo455/qichacha-crawler

Note: this repository was archived by its owner on Jan 28, 2023 and is now read-only.

This repo provides some functions (in utils.py) for you to crawl company information from qichacha.com.

Instruction

This crawler is designed for a specific use case, so you will find that only certain fields are parsed. However, the code's logic is easy to read, and you can customize it to meet your own needs.

  • This crawler uses cookies to authenticate the login
  • See the functions in utils.py, which contains all the core code
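The cookie-based login boils down to carrying a session cookie in every request header. A minimal sketch, assuming you copy the cookie string from a logged-in browser session (the cookie name and user agent below are placeholders, not real credentials):

```python
# Minimal sketch of cookie-based authentication. The cookie value is a
# placeholder -- paste your own qichacha session cookie copied from a
# logged-in browser session.
def build_header(cookie: str) -> dict:
    """Build a request header dict that carries the login cookie."""
    return {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Cookie": cookie,
    }

header_uids = build_header("QCCSESSID=<your-session-id>")
```

The same dict can be passed as the `headers=` argument of `requests.get` in every function below.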

Function details in utils.py

get_firm_uid(header_uids: dict, name_list: list) -> List[str]

This function takes:

# @header_uids: a header dictionary for requests
# @name_list: a list of company names

# @return: a list of company uids

The uid acts as a unique identifier for a company; qichacha assigns one to every company, and we need that uid to build the URL of each company's page.
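A sketch of how such a lookup could work. The search URL pattern and the `/firm_<uid>.html` markup are assumptions for illustration; the real qichacha endpoint and result markup may differ:

```python
from urllib.parse import quote

# Hypothetical search endpoint -- the real qichacha URL may differ.
SEARCH_URL = "https://www.qichacha.com/search?key={}"

def build_search_url(name: str) -> str:
    """URL-encode a (possibly Chinese) company name into the search URL."""
    return SEARCH_URL.format(quote(name))

def get_firm_uid(header_uids: dict, name_list: list) -> list:
    """For each company name, request the search page and scrape its uid."""
    import requests
    from bs4 import BeautifulSoup

    uids = []
    for name in name_list:
        resp = requests.get(build_search_url(name), headers=header_uids)
        soup = BeautifulSoup(resp.text, "html.parser")
        # Assumed markup: the first search result links to /firm_<uid>.html
        link = soup.select_one("a[href^='/firm_']")
        uids.append(link["href"].split("_")[1].split(".")[0] if link else None)
    return uids
```

Note that Chinese company names must be percent-encoded before going into the query string, which `urllib.parse.quote` handles.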

get_basic_info_soup(header_basic_info: dict, uid_list: list) -> List[BeautifulSoup]

# @header_basic_info: a header dictionary for requesting basic info
# @uid_list: a list of company uids

# @return: a list of BeautifulSoup instances containing the basic info page html contents

This function fetches the basic information for a company. Specifically, there are two main pages I want to parse: the basic information page (基本信息 in Chinese) and the development page (企业发展 in Chinese). This function fetches only the basic information page's html content.
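The fetch step can be sketched as a loop of `requests.get` calls parsed into soups. The `firm_<uid>.html` URL pattern is an assumption, not the confirmed qichacha path:

```python
import requests
from bs4 import BeautifulSoup

# Assumed URL pattern for a company page, keyed by uid; the real path on
# qichacha.com may differ.
BASE_URL = "https://www.qichacha.com/firm_{}.html"

def get_basic_info_soup(header_basic_info: dict, uid_list: list) -> list:
    """Fetch the basic info page for every uid and parse it into soup."""
    soups = []
    for uid in uid_list:
        resp = requests.get(BASE_URL.format(uid), headers=header_basic_info)
        soups.append(BeautifulSoup(resp.text, "html.parser"))
    return soups
```

`get_dev_info_soup` below would look the same with a different URL (or query parameter) for the development tab.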

get_dev_info_soup(header_dev_info: dict, uid_list: list) -> List[BeautifulSoup]

# @header_dev_info: a header dictionary for requesting development info
# @uid_list: a list of company uids

# @return: a list of BeautifulSoup instances containing the development info page html contents

This function fetches the development information for a company, as mentioned above.

parse_basic_info(basic_soup: BeautifulSoup) -> dict

# @basic_soup: a BeautifulSoup instance containing the basic info page html content

# @return: a dictionary of selected basic information, where the key is the info name and the value is that info's value, e.g. "website": "www.example.com"
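The parsing step could look like the sketch below. The table markup here is invented purely to show the label/value-pair idea; the real qichacha page structure differs:

```python
from bs4 import BeautifulSoup

# Invented markup for illustration -- the real qichacha page differs.
SAMPLE_HTML = """
<table>
  <tr><td class="tb">website</td><td>www.example.com</td></tr>
  <tr><td class="tb">phone</td><td>010-12345678</td></tr>
</table>
"""

def parse_basic_info(basic_soup: BeautifulSoup) -> dict:
    """Walk label/value cell pairs into a plain dict."""
    info = {}
    for row in basic_soup.select("tr"):
        cells = row.find_all("td")
        if len(cells) == 2:
            info[cells[0].get_text(strip=True)] = cells[1].get_text(strip=True)
    return info

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
parse_basic_info(soup)  # {'website': 'www.example.com', 'phone': '010-12345678'}
```

`parse_dev_info` below follows the same pattern with different selectors for the development fields.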

parse_dev_info(dev_soup: BeautifulSoup) -> dict

# @dev_soup: a BeautifulSoup instance containing the development info page html content

# @return: a dictionary of selected development information, where the key is the info name and the value is that info's value, e.g. "total_profits": "xxx"

fill_excel(path_to_sample_file: str, output_dir: str, all_df: pd.DataFrame) -> None

# @path_to_sample_file: path to the sample Excel form file you want to fill
# @output_dir: output directory
# @all_df: a pandas.DataFrame containing all the information; each row corresponds to a company and each column to an info field

# @return: None
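The expected DataFrame shape, and one way such a fill step could be sketched (the column names and per-company output layout are assumptions; writing .xlsx files also requires openpyxl):

```python
import pandas as pd

# Hypothetical columns -- the actual fields depend on what you parsed.
all_df = pd.DataFrame(
    [
        {"name": "Company A", "website": "www.a.com", "total_profits": "100"},
        {"name": "Company B", "website": "www.b.com", "total_profits": "200"},
    ]
)

def fill_excel(path_to_sample_file: str, output_dir: str, all_df: pd.DataFrame) -> None:
    """Sketch: write one filled Excel file per company row.

    The real function fills a sample form template; this simplified version
    just dumps each row. Writing .xlsx requires openpyxl to be installed.
    """
    for _, row in all_df.iterrows():
        out_path = f"{output_dir}/{row['name']}.xlsx"
        row.to_frame().T.to_excel(out_path, index=False)
```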

To do

In the future I may write a CLI wrapper for this crawler; let's see.
