The entire thing requires Python 3.5 or newer.
The "Extracting titles and date information from the EGS website" described below requires GNU wget.
Well, firstly, you need to find a way to keep NewFiles.txt (and possibly also Date2Id.txt) up to date. These are not designed to be updated manually, but the way I use only works on my system, as it reads from a carefully organised and filenamed offline mirror of EGS present on my system; distributing such a mirror here would be both large and a blatant violation of Dan's copyright. A dedicated JSON editor, if such a thing exists (I haven't checked), might be helpful.
Reddit Titles simply contains index pages saved (as "HTML only") from /r/elgoonishshive at different times. Some overlap is fine.
910 Raw DBs was for the old, sadly deceased, pre-crash 910 Forum. No equivalent system for the present forum exists at the moment.
titlebank.dat["modes"] needs to be updated every time a new storyline starts but is designed to be updated manually in a text editor. It was originally a Python expression; it is currently designed to be parsed as YAML (JSON is not suitable due to the use of integers as dictionary keys) but remains a valid Python expression for the time being.
The following takes place in "(Misc Source Material)/Spiders" unless otherwise specified.
You need GNU wget or a faithful clone (busybox wget will not work), and you may need to edit the scripts to point at the correct path to it.
To update the databases on official titles and date anomalies (a scripted sketch of these steps follows the list):
- Delete a.txt
- Rename metadataegs4.txt to a.txt (this is used by the script as the starting point).
- Run htmlonlyyoslug.py
- Press RETURN (Enter) after the script has finished.
- Force copy metadataegs4.txt into the repository root.
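For reference, a minimal scripted version of the steps above, assumed to be run from "(Misc Source Material)/Spiders"; the "../.." repository-root path, and the assumption that htmlonlyyoslug.py will take its final RETURN from stdin, are mine rather than anything the scripts guarantee.

```python
# Minimal sketch of the steps above, assumed to be run from
# "(Misc Source Material)/Spiders"; the "../.." repository-root path is a guess.
import os
import shutil
import subprocess
import sys

if os.path.exists("a.txt"):
    os.remove("a.txt")                        # delete a.txt
os.rename("metadataegs4.txt", "a.txt")        # previous output becomes the starting point

# htmlonlyyoslug.py waits for RETURN once it has finished; feed it a newline.
subprocess.run([sys.executable, "htmlonlyyoslug.py"], input=b"\n", check=True)

shutil.copy("metadataegs4.txt", "../..")      # force copy into the (assumed) repository root
```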
To check for any anomalous date lookup successes (this should not be done too frequently; a scripted sketch follows the list):
- Delete b.txt
- Rename dateswork.txt to b.txt (this is used by the script as the starting point).
- Run datecheckyo.py
- Press RETURN (Enter) after the script has finished.
- Force copy dateswork.txt to "(Misc Source Material)" (the parent directory)
- In that directory, run slurp_dateswork.py
- Force copy DatesWorkProcessed.txt into the repository root.
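Again for reference, a minimal scripted version of the above, with the same caveats (the relative paths are assumptions based on the layout described in this README, as is feeding RETURN via stdin):

```python
# Minimal sketch of the steps above, assumed to be run from
# "(Misc Source Material)/Spiders"; the relative paths are guesses.
import os
import shutil
import subprocess
import sys

if os.path.exists("b.txt"):
    os.remove("b.txt")                        # delete b.txt
os.rename("dateswork.txt", "b.txt")           # previous output becomes the starting point

# datecheckyo.py waits for RETURN once it has finished; feed it a newline.
subprocess.run([sys.executable, "datecheckyo.py"], input=b"\n", check=True)

shutil.copy("dateswork.txt", "..")            # force copy to "(Misc Source Material)"
subprocess.run([sys.executable, "slurp_dateswork.py"], cwd="..", check=True)
shutil.copy("../DatesWorkProcessed.txt", "../..")  # force copy into the (assumed) repository root
```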
The simplest and fastest way is to run rebuild.py in the repository root, under Python 3.5 or newer.
The chain is composed of a sequence of modular operations, which rebuild.py runs in order while keeping the database in memory between them.
The output will appear in the "out" directory.
file(s) | description |
---|---|
rebuild.bat | loads rebuild.py (sometimes means less typing on Windows, and makes it easy to specify a path to Python by editing it) |
rebuild.py | runs the process modules, in order, keeping the database in memory (a rough sketch of the shape of this follows the table).
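Very roughly, the driver has the shape sketched below. The module list is abbreviated and the entry-point name is an assumption; the real sequence and calling convention live in rebuild.py itself.

```python
# Hypothetical sketch only -- the real module list and calling convention are
# in rebuild.py.  Assumes each process module exposes a function that takes
# (and mutates) the in-memory database.
import extract_reddit_info, megadb_generate_initial, export_json  # ...and the rest, in order

db = {}  # the whole database stays in memory between stages
for stage in (extract_reddit_info, megadb_generate_initial, export_json):
    stage.run(db)  # "run" is an assumed entry-point name
```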
file(s) | description |
---|---|
databases.py | access to the various database files; also defines a Unicode error handler for use in UTF-8 / WinLatin-1 detwingling (see the sketch after this table).
utility.py | assorted code useful for multiple processes; sorted into more detailed headings in the module itself.
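For the unfamiliar, a Unicode error handler of the sort databases.py registers might look roughly like this; the registered name and the exact fallback behaviour are assumptions here, not the actual implementation.

```python
# Sketch of a codecs error handler for UTF-8 / WinLatin-1 detwingling:
# bytes that are not valid UTF-8 get re-decoded as Windows-1252 instead.
# The registered name "detwingle" is hypothetical.
import codecs

def _detwingle(error):
    if isinstance(error, UnicodeDecodeError):
        bad = error.object[error.start:error.end]
        return bad.decode("cp1252", errors="replace"), error.end
    raise error

codecs.register_error("detwingle", _detwingle)

# Usage: raw_bytes.decode("utf-8", errors="detwingle")
```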
file(s) | description |
---|---|
"910 Raw DBs/" and "Classics 910/" | rest in peace for now. |
"(Misc Source Material)/" | source material for some of the other databases, notably the code for updating metadataegs3.txt. |
"Reddit Titles/" | various index pages saved (as "HTML only") from /r/elgoonishshive at different times - for titles and reaction links. |
alldates*.txt | output of test_get_all_dates.py, used by extract_threads_new910.py for anomaly detection
BgNames.txt and BgDescriptions.txt | metadata of legacy backgrounds. this will not change in the foreseeable future. |
Date2Id.txt | mapping of dates to IDs. not entirely reliable in the event of multi-SB days. still used, although there is no longer any need for it. see also NewFiles.txt
DatesWorkProcessed.txt | data about what can and cannot be looked up using a date-scheme URL. |
HayloList.html | data from Haylo's fan-site regarding strip titles and reaction links. no longer accessible at the original site, I don't think.
NewFiles.txt | date-ID and filename-title data for what is, in the current version, actually all comics.
Megathread.dat | data from Reddit about the assigned titles, title assigner and discussion URL from the megathread for the 17-SB day. |
metadataegs3.txt | metadata obtained from the website itself - do not attempt to edit this directly, see "Extracting titles and date information from the EGS website" above. |
Ookii.dat | the Ookii database (by strip, not by character) saved using the internal AJAX-JSON API. stored as an uncompressed tar archive to reduce disk footprint (many, many files significantly below 4 KiB is a worst-case scenario for size-to-footprint ratio); see the reading sketch after this table.
suddenlaunch.dat | URLs for reaction threads on the briefly-used Suddenlaunch forum. |
titlebank.dat | assorted titles, as well as storyline boundary information. human-readable and designed to be edited in a text editor. |
titleharjit.py | titles by HarJIT. |
Transcripts.dat | another uncompressed tarfile; extract this to the parent folder (it contains a directory called "Transcripts") if you want to do anything with it.
zorua_db.dat | Zorua's EGS-NP titles and appearance data, obtained from the now-dead pre-crash 910 forum. |
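Both tar-based databases can be read in place with the standard library; a minimal sketch follows. That Ookii.dat's members are individual JSON files is inferred from the AJAX-JSON description above, and the member layout is otherwise an assumption.

```python
# Minimal sketch: read Ookii.dat in place (an uncompressed tar archive whose
# members are assumed here to be individual JSON files).
import json
import tarfile

with tarfile.open("Ookii.dat", "r:") as tar:          # "r:" = plain, uncompressed tar
    for member in tar.getmembers():
        if member.isfile():
            record = json.loads(tar.extractfile(member).read().decode("utf-8"))
            # ... do something with each strip's record ...

# Transcripts.dat is handled the same way; extracting it recreates the
# "Transcripts" directory that the transcript-using scripts expect.
with tarfile.open("Transcripts.dat", "r:") as tar:
    tar.extractall("..")  # into the parent folder, per the note above
```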
file(s) | description |
---|---|
extract_date2id.py, extract_newfiles.py, extract_bg_title_db.py | these only do anything on my system. |
extract_classics_910.py | extract information from the "Classics 910" directory, for what use it is to anyone now. |
extract_reddit_info.py | extract information from the "Reddit Titles" directory. |
extract_threads_new910.py | extract information from the "910 Raw DBs" directory, for what use it is to anyone anymore. |
extract_haylo_list.py, extract_haylo_hierarchy.py | process HayloList.html |
file(s) | description |
---|---|
megadb_generate_initial.py | generate the portion of the database covered by the Ookii database, using that as a framework, but adding information from other sources. |
megadb_fetch_haylonew.py | further generate database entries for those Story comics covered by HayloList.html but not Ookii.dat |
megadb_fetch_newfiles.py | add the remaining database entries using NewFiles.txt and building upon it. |
megadb_fetch_tss.py | fetch transcripts and obtain titles for some strips (titlebank.dat and titleharjit.py and others) |
megadb_fetch_zorua.py | add Zorua information from zorua_db.dat |
megadb_indextransforms.py | reorganise Story database to the arc-line hierarchy used by EGS, and reorganise SB by year. |
megadb_pull_bg.py | add entries for legacy backgrounds. |
file(s) | description |
---|---|
export_json.py | store the entire database as a JSON file (AllMegaDb.txt). |
export_html.py | generate an HTML index of EGS strips (index.html).
export_titles_template.py | generate a MediaWiki template containing all titles (titles.txt). |
export_titles_template_lite.py | generate a MediaWiki template containing official titles (titles_lite2.txt). |
export_numberdatemaps.py | generate MediaWiki tables describing sequential numbering, lookup IDs and date information (*map.txt). |
These are not connected with rebuild.py, and tend not to be part of a typical build sequence.
file(s) | description |
---|---|
test_get_all_dates.py | generate alldates*.txt allowing up-to-date meaningful warning output from extract_threads_new910.py (this should not be necessary in the foreseeable future).
test_d2a.py | tool for looking for possible errors in date-to-ID mapping (a hypothetical sketch of such a check follows this table).
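Purely for illustration (this is not test_d2a.py's actual logic), a check of that sort might flag IDs reachable from more than one date, assuming Date2Id.txt is a flat JSON mapping of date strings to single IDs:

```python
# Hypothetical consistency check, assuming Date2Id.txt is a flat JSON object
# mapping date strings to single lookup IDs (which the notes above say is not
# entirely reliable on multi-SB days).
import json
from collections import Counter

with open("Date2Id.txt", encoding="utf-8") as f:
    date2id = json.load(f)

id_counts = Counter(date2id.values())
for date, strip_id in sorted(date2id.items()):
    if id_counts[strip_id] > 1:
        print("possible error: ID {} is reachable from more than one date ({})".format(strip_id, date))
```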
file(s) | description |
---|---|
AllMegaDb.txt | JSON of the entire database (see the loading sketch after this table).
index.html | HTML index of EGS strips. |
titles.txt | MediaWiki template database of titles. |
titles_lite2.txt | MediaWiki template database of titles - only official titles to save space. |
*map.txt | various MediaWiki tables describing sequential numbering, lookup IDs and date information.
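If all you want is programmatic access to the result, the JSON output is the easiest starting point; the sketch below only assumes that out/AllMegaDb.txt is well-formed JSON, not any particular top-level shape.

```python
# Minimal sketch: load the exported database and inspect its top level before
# relying on any particular structure (which is not documented here).
import json

with open("out/AllMegaDb.txt", encoding="utf-8") as f:
    megadb = json.load(f)

print(type(megadb).__name__, len(megadb))
```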