GitHub - kaimo455/qichacha-crawler: This repo provides some functions (in utils.py) for crawling company information from qichacha.com.

Instruction

This crawler was designed for a specific use case, so it parses only a selected set of fields. However, the code's logic is easy to read, and you can customize it to meet your own needs.

This crawler uses cookies to authenticate the login.
Please check the functions in utils.py, which contains all the core code.
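
Since every fetch function below accepts a header dictionary, the login cookie can travel inside it. The following is a minimal sketch, assuming the cookie string is copied from a logged-in browser session. `build_headers` and the default User-Agent value are hypothetical illustrations, not names taken from utils.py.

```python
def build_headers(cookie: str, user_agent: str = "Mozilla/5.0") -> dict:
    """Return a header dict carrying the login cookie (hypothetical helper)."""
    cookie = cookie.strip()  # drop whitespace picked up when copy-pasting
    if not cookie:
        raise ValueError("an empty cookie cannot authenticate the session")
    return {
        "User-Agent": user_agent,
        "Cookie": cookie,
    }
```

The same dictionary could then be passed as `header_uids`, `header_basic_info`, or `header_dev_info`, provided one cookie is valid for all three kinds of page.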

Function details in utils.py

get_firm_uid(header_uids: dict, name_list: list) -> List[str]

This function takes:

# @header_uids: a header dictionary for requests
# @name_list: a list of company names

# @return: a list of company uids

The uid acts as a unique identifier for a company. Qichacha assigns one to every company, and we need it to reach the specific page for each company.
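
To illustrate the name-to-uid step, the sketch below pulls a uid out of search-result HTML. It assumes, hypothetically, that company links have the form `/firm/<uid>.html`. This README does not confirm that pattern, and the parsing in utils.py may differ. `extract_first_uid` is an illustrative name.

```python
import re
from typing import Optional

# Assumed link shape: href="/firm/<uid>.html" (not confirmed by this README).
FIRM_LINK = re.compile(r'href="/firm/([0-9A-Za-z]+)\.html"')


def extract_first_uid(search_html: str) -> Optional[str]:
    """Return the uid of the first company link, or None if there is none."""
    match = FIRM_LINK.search(search_html)
    return match.group(1) if match else None
```

Returning `None` for an empty result page lets the caller decide whether to skip the company or retry, rather than failing on the first unknown name.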

get_basic_info_soup(header_basic_info: dict, uid_list: list) -> List[BeautifulSoup]

# @header_basic_info: a header dictionary for requesting basic info
# @uid_list: a list of company uids

# @return: a list of BeautifulSoup instances, which contain the HTML contents of the basic info pages

This function fetches the basic information for each company. Specifically, there are two main pages I want to parse: the basic information page (基本信息 in Chinese) and the development page (企业发展 in Chinese). This function fetches only the HTML content of the basic information page.
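
The fetch loop described above can be sketched as follows. This is not the code in utils.py: it uses `urllib` from the standard library to stay dependency-free, and `fetch_pages`, `http_get`, the `url_template` parameter, and the delay between requests are all assumptions made for illustration. The README does not state the real page URLs.

```python
import time
import urllib.request
from typing import Callable, List


def http_get(url: str, headers: dict) -> str:
    """Fetch a URL with the given headers and return the decoded body."""
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_pages(headers: dict, uid_list: List[str], url_template: str,
                fetch: Callable[[str, dict], str] = http_get,
                delay: float = 1.0) -> List[str]:
    """Return one HTML string per uid, in the same order as uid_list."""
    pages = []
    for i, uid in enumerate(uid_list):
        if i > 0:
            time.sleep(delay)  # pause between requests
        pages.append(fetch(url_template.format(uid=uid), headers))
    return pages
```

Each returned string can be wrapped with `BeautifulSoup(html, "html.parser")` to get the soup objects that the parsing functions expect. The `fetch` parameter exists so the loop can be exercised with a stub instead of live requests.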

get_dev_info_soup(header_dev_info: dict, uid_list: list) -> List[BeautifulSoup]

# @header_dev_info: a header dictionary for requesting development info
# @uid_list: a list of company uids

# @return: a list of BeautifulSoup instances, which contain the HTML contents of the development info pages

This function fetches the development information page for each company, as mentioned above.

parse_basic_info(basic_soup: BeautifulSoup) -> dict

# @basic_soup: a BeautifulSoup instance, which contains the HTML contents of the basic info page

# @return: a dictionary that contains the selected basic information. Each key is the name of a piece of info and each value is its content, e.g. "website": "www.example.com"
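
As an illustration of turning a page into such a dictionary, the sketch below collects table cells and keeps only the wanted labels. It swaps in the standard-library `html.parser` in place of BeautifulSoup so it runs without extra packages. It also assumes, hypothetically, that the page lays out information as alternating label and value `<td>` cells. The real page structure and the logic in utils.py may differ, and `CellCollector` and `parse_label_value_cells` are illustrative names.

```python
from html.parser import HTMLParser
from typing import Dict, List


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.cells: List[str] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data  # also captures text of nested tags


def parse_label_value_cells(html: str, wanted: Dict[str, str]) -> Dict[str, str]:
    """Map page labels to output keys, assuming label/value cells alternate."""
    collector = CellCollector()
    collector.feed(html)
    collector.close()
    cells = [c.strip() for c in collector.cells]
    info = {}
    for label, value in zip(cells[0::2], cells[1::2]):
        if label in wanted:
            info[wanted[label]] = value
    return info
```

The `wanted` mapping keeps the choice of fields in one place, so customizing which information is parsed means editing a dictionary, not the parsing loop.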

parse_dev_info(dev_soup: BeautifulSoup) -> dict

# @dev_soup: a BeautifulSoup instance, which contains the HTML contents of the development info page

# @return: a dictionary that contains the selected development information. Each key is the name of a piece of info and each value is its content, e.g. "total_profits": "xxx"

fill_excel(path_to_sample_file: str, output_dir: str, all_df: pd.DataFrame) -> None

# @path_to_sample_file: path to the sample Excel form file that you want to fill
# @output_dir: output directory
# @all_df: the pandas.DataFrame that contains all information; each row corresponds to one company and each column to one piece of info
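
One way to arrive at a table with one row per company is to merge the dictionaries from the two parsing functions into flat records first. This is a sketch under that assumption. `build_records` is a hypothetical helper, and the README does not describe how utils.py actually assembles `all_df`.

```python
from typing import Dict, List


def build_records(names: List[str], basic_infos: List[Dict[str, str]],
                  dev_infos: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Merge the per-company dicts into one flat record per company."""
    if not (len(names) == len(basic_infos) == len(dev_infos)):
        raise ValueError("inputs must have one entry per company")
    records = []
    for name, basic, dev in zip(names, basic_infos, dev_infos):
        record = {"name": name}
        record.update(basic)
        record.update(dev)  # dev keys win if a key appears in both dicts
        records.append(record)
    return records
```

Passing the result to `pd.DataFrame(records)` yields one row per company, with fields missing for a company filled as NaN.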

To do

In the future, I may write a CLI wrapper for this crawler. Let's see.
