.TH progscrape 1 "September 2010"
.SH NAME
progscrape \- \fBXarn\fR's Shiichan webscraper
.SH SYNOPSIS
\fBprogscrape\fR [\fIOPTION\fR]... [\fIDB\fR]
.SH DESCRIPTION
This is a webscraper for Shiichan textboards, originally designed for world4ch's /prog/.
.br
\fBprogscrape\fR determines what to scrape based on Shiichan's subject.txt, and can scrape over either the usual HTML interface or world4ch's JSON interface. The content scraped is placed in an SQLite 3 database (\fBBOARD.db\fR by default).
.SH OPTIONS
.TP
\fB\-\-json\fR
Use the JSON interface, if possible. (default)
.TP
\fB\-\-html\fR, \fB\-\-no\-json\fR
Use the HTML interface.
.TP
\fB\-\-verify\-trips\fR
When using the JSON interface, fall back to the HTML interface to verify whether ambiguous tripcodes are legitimate. (default)
.TP
\fB\-\-no\-verify\-trips\fR
When using the JSON interface, do not try to verify ambiguous tripcodes.
.TP
\fB\-\-aborn\fR
When using the JSON interface, keep deleted posts. They will exist in the database with author "SILENT!ABORN", content "SILENT", and timestamp 1234. (default)
.TP
\fB\-\-no\-aborn\fR
When using the JSON interface, ignore deleted posts.
.TP
\fB\-\-no\-html\fR
Equivalent to \fB\-\-json \-\-no\-verify\-trips\fR.
.TP
\fB\-\-progress\-bar\fR
Display an animated progress bar during scraping. (default)
.TP
\fB\-\-no\-progress\-bar\fR
Display the traditional progress report instead.
.TP
\fB\-\-base\-url\fR=\fIURL\fR
Specify base URL of the board to scrape. (default: \fBhttp://dis.4chan.org\fR)
.TP
\fB\-\-port\fR=\fIPORT\fR
Specify the port the webserver is running on. (default: \fB80\fR)
.TP
\fB\-\-board\fR=\fIBOARD\fR
Specify board to scrape. (default: \fB/prog/\fR)
.TP
\fB\-\-charset\fR=\fICHARSET\fR
Specify the character encoding the board uses. (default: \fButf\-8\fR)
.TP
\fB\-\-partial\fR
Read a list of thread IDs on standard input and only scrape those (provided they're valid IDs and need updating).
.TP
\fB\-\-threads\fR=\fITHREADS\fR
Number of scraper threads to use. If this is set to \fBauto\fR, \fBprogscrape\fR will try to determine a sensible number based on how many board threads it has to scrape. (default: \fBauto\fR)
.TP
\fB\-\-dry\-run\fR
Calculate how many threads and posts would need to be fetched to bring the database up to date, but don't actually fetch the posts.
.TP
\fB\-\-no\-dry\-run\fR
Turn off dry run mode. (default)
.TP
\fB\-\-index\fR=\fIIDIR\fR
Also index scraped content in a Whoosh index, for easier full-text search with \fBprogsearch\fR. If this option is omitted, nothing is indexed.
.TP
\fB\-h\fR, \fB\-\-help\fR
Display help message and exit.
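.SH EXAMPLES
The invocations below are illustrative sketches built only from the options documented above. The file names \fBprog.db\fR and \fBids.txt\fR and the directory \fBprog\-index\fR are placeholders, and the one-ID-per-line layout of \fBids.txt\fR is an assumption, since the input format for \fB\-\-partial\fR is not specified here.
.PP
Scrape the default board over the HTML interface into a named database:
.PP
.RS
.nf
progscrape \-\-html prog.db
.fi
.RE
.PP
Report how many threads and posts would be fetched, without fetching them:
.PP
.RS
.nf
progscrape \-\-dry\-run prog.db
.fi
.RE
.PP
Use the JSON interface only, never falling back to HTML to verify tripcodes:
.PP
.RS
.nf
progscrape \-\-no\-html prog.db
.fi
.RE
.PP
Scrape only the threads whose IDs are listed in a file (assumed here to hold one ID per line):
.PP
.RS
.nf
progscrape \-\-partial prog.db < ids.txt
.fi
.RE
.PP
Scrape and also index the scraped content for \fBprogsearch\fR:
.PP
.RS
.nf
progscrape \-\-index=prog\-index prog.db
.fi
.RE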
.SH "REPORTING BUGS"
Report \fBprogscrape\fR bugs to [email protected], or on the GitHub bug tracker: <http://github.com/Cairnarvon/progscrape/issues>.
.SH COPYRIGHT
Copyright \(co 2008, 2009, 2010 Koen Crolla.
.br
Licensed as free software under the MIT license.
.SH "SEE ALSO"
\fBsqlite3\fR(1)