
readme sample txt file #6

Open
headwinds opened this issue Jun 9, 2019 · 1 comment

Comments

This line in the readme:

python train.py file.xt

Is that supposed to be file.txt, and if so, can you provide a sample file? Perhaps the one used for the 1000-word model. I just want to know how to set up the raw text: is it just a dump of text, or does it need to be formatted in any way?

ippo615 (Contributor) commented Jul 12, 2020

Almost any plain text file will work. Basically, you'll have a bunch of sentences in a text file without formatting (i.e. you'll want to avoid Markdown, HTML, etc.). For example, the following is a good sample (note: you'll need a lot more text for good training):

Wikipedia is a multilingual online encyclopedia created and maintained as an open collaboration project by a community of volunteer editors using a wiki-based editing system. It is the largest and most popular general reference work on the World Wide Web. It is also one of the 15 most popular websites ranked by Alexa, as of June 2020. It features exclusively free content and no commercial ads and is owned and supported by the Wikimedia Foundation, a non-profit organization funded primarily through donations.

Wikipedia was launched on January 15, 2001, and was created by Jimmy Wales and Larry Sanger. Sanger coined its name as a portmanteau of the words "wiki" (Hawaiian for "quick") and "encyclopedia". Initially an English-language encyclopedia, versions of Wikipedia in other languages were quickly developed. With 6.1 million articles, the English Wikipedia is the largest of the more than 300 Wikipedia encyclopedias. Overall, Wikipedia comprises more than 54 million articles attracting 1.5 billion unique visitors per month.

The script will handle tokenizing that text into sentences and words. Documents do not matter for this type of training; only sentences and words do, so you can put all of the text you want into one giant file. Since one giant file is hard to manage, you could instead keep a bunch of plain text files in a directory, and the script will work with all the files in that directory (or with a list of files).
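To make the expected preprocessing concrete, here is a rough sketch of the kind of tokenization described above: read a plain text file, split it into sentences, then split each sentence into lowercase words. This is an illustration only; `train.py` may use a different tokenizer internally, and the regexes here are assumptions, not the project's actual rules.

```python
import re


def tokenize_file(path):
    """Split a plain text file into sentences, then into lowercase words.

    Returns a list of sentences, each a list of word tokens.
    This is a naive sketch of the preprocessing, not the script's code.
    """
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # Naive sentence split: break after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # Keep runs of letters, digits, and apostrophes as word tokens.
    return [
        re.findall(r"[a-z0-9']+", s.lower())
        for s in sentences
        if s.strip()
    ]
```

Run on a file containing the Wikipedia sample above, this would yield one token list per sentence, which is the "sentences and words" shape the training cares about.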
