Mccranky83/aistudy-docs-crawler

AiStudy Documents Crawler

This project is a web crawler that collects educational resources from the Shanghai Primary and Secondary School Digital Teaching System (上海市中小学数字教学系统).

The crawler uses Puppeteer, a Node.js library, to drive a Chromium browser with human-like interactions and extract the download links. It then uses the curl command-line utility to download those resources recursively to the local system.
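
The two stages described above can be sketched roughly as follows. This is a minimal illustration, not the code in this repository: the function names (`buildCurlArgs`, `collectLinks`, `download`), the `a[href]` selector, and the choice of curl flags are all assumptions made for the example.

```javascript
// Sketch only: names, selector, and curl flags are assumptions,
// not taken from this repository.
const { execFile } = require("node:child_process");
const path = require("node:path");

// Build the curl argument list for one download link.
// -f: fail on HTTP errors, -L: follow redirects, -sS: quiet but show errors,
// --create-dirs: create the output directory if missing, -o: output file.
function buildCurlArgs(url, outDir) {
  const fileName = decodeURIComponent(path.posix.basename(new URL(url).pathname));
  return ["-fL", "-sS", "--create-dirs", "-o", path.join(outDir, fileName), url];
}

// Stage 1: open a page with Puppeteer and collect the href of every match.
async function collectLinks(pageUrl, selector = "a[href]") {
  const puppeteer = require("puppeteer"); // loaded lazily; only this stage needs it
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(pageUrl, { waitUntil: "networkidle2" });
    return await page.$$eval(selector, (anchors) => anchors.map((a) => a.href));
  } finally {
    await browser.close();
  }
}

// Stage 2: download one link by spawning curl; resolves when curl exits cleanly.
function download(url, outDir) {
  return new Promise((resolve, reject) => {
    execFile("curl", buildCurlArgs(url, outDir), (err) => (err ? reject(err) : resolve()));
  });
}
```

Passing the arguments to `execFile` as an array, rather than building a shell string, avoids quoting problems when file names contain spaces or non-ASCII characters.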

README_demo.mov

Installation

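# Clone the repository and enter it
git clone https://github.com/Mccranky83/aistudy-docs-crawler.git
cd aistudy-docs-crawler
npm i # Installs project dependencies, including a compatible Chrome build
npm run start # Runs the start script, which executes `app/start.js`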

Examples

Crawl first

prompt> npm run start

Directly download or crawl first? (d/C) 
Run in headless mode? (Y/n) 
subjectIndex [1-17]: 1
Crawl documents or answer sheets? (D/a) 
subjectIndex [1-2]: 1
Startup grade [Default: 0]: 
Offset [Default: 100]: 
Startup semester [Default: 0]: 
Offset [Default: 100]: 
Startup unit [Default: 0]: 
Offset [Default: 100]: 
Startup course [Default: 0]: 
Offset [Default: 100]: 
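
The meaning of the startup and offset prompts is not spelled out in this README. One plausible reading, stated here purely as an assumption, is that the startup value is a zero-based index of the first item to visit and the offset is the maximum number of items to visit from there. Under that assumption, the prompt handling could be sketched like this (`parseIndex` and `selectWindow` are hypothetical helper names):

```javascript
// Assumption: `startup` is a zero-based start index and `offset` is the
// maximum number of items to visit. This is a guess at the semantics,
// not behavior confirmed by the repository.

// Turn a prompt answer into a non-negative integer, using the default
// when the user just presses Enter or types something invalid.
function parseIndex(answer, fallback) {
  const text = answer.trim();
  if (text === "") return fallback;
  const value = Number.parseInt(text, 10);
  return Number.isNaN(value) || value < 0 ? fallback : value;
}

// Pick the items to visit at one level (grades, semesters, units, or courses).
function selectWindow(items, startup = 0, offset = 100) {
  return items.slice(startup, startup + offset);
}
```

With the defaults of 0 and 100, any level with at most 100 entries is visited in full, which would match accepting every prompt with Enter as in the session above.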

Direct download

prompt> npm run start

Directly download or crawl first? (d/C) d
劳动 - 6.json
sitemapName:

No available linkmaps

prompt> npm run start

Directly download or crawl first? (d/C) d
No linkmaps available!
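
The two direct-download sessions above differ only in whether saved linkmap files exist. A minimal sketch of that check, assuming linkmaps are stored as `.json` files in a single directory (the directory layout and the helper names `listLinkmaps` and `describeLinkmaps` are assumptions for illustration):

```javascript
// Sketch only: the storage layout and function names are assumptions.
const fs = require("node:fs");

// Return the names of saved linkmap files (*.json) in `dir`, sorted.
// A missing directory is treated the same as an empty one.
function listLinkmaps(dir) {
  if (!fs.existsSync(dir)) return [];
  return fs.readdirSync(dir).filter((name) => name.endsWith(".json")).sort();
}

// Produce the text shown after the user answers "d":
// either the available file names or the "nothing to download" message.
function describeLinkmaps(dir) {
  const names = listLinkmaps(dir);
  return names.length > 0 ? names.join("\n") : "No linkmaps available!";
}
```

In this sketch, crawling first is what would populate the directory, so a fresh checkout would always hit the second case.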