Update 05-preprocessing-dataset.md #9

Open: wants to merge 1 commit into base `gh-pages`.
8 changes: 4 additions & 4 deletions _episodes/05-preprocessing-dataset.md

## Data Preparation

Text data comes in different forms. You might want to analyse a document in one file or an entire collection of documents (a corpus) stored in multiple files. In this part of the lesson, we will show you how to load a single document and how to load the text of an entire corpus into Python for further analysis.

### Download some data

Firstly, please download a dataset and make a note of where it is saved on your computer. We need the path to the dataset in order to load and read it for further processing.

We will use the [Medical History of British India](https://data.nls.uk/data/digitised-collections/a-medical-history-of-british-india/) collection provided by the [National Library of Scotland](https://www.nls.uk) as an example:

<img src="../fig/mhbi.png" width="700">

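The single-document case described above can be sketched as follows. This is a minimal sketch, not the lesson's exact code: the file name is a placeholder (here a tiny stand-in file is created so the sketch is self-contained), and NLTK's `wordpunct_tokenize` is used so that no tokeniser model download is needed.

```python
from nltk.tokenize import wordpunct_tokenize

# Placeholder: create a tiny stand-in file so the sketch is self-contained;
# in practice, point this at one file from the downloaded collection.
with open('india.txt', 'w', encoding='latin1') as f:
    f.write('Medical History of British India.')

# Load the document text and tokenise it into a list of words
with open('india.txt', encoding='latin1') as f:
    text = f.read()

india_tokens = wordpunct_tokenize(text)

# Lowercase every token, as in the lesson
lower_india_tokens = [word.lower() for word in india_tokens]
print(lower_india_tokens[0:10])
# → ['medical', 'history', 'of', 'british', 'india', '.']
```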

We can do the same for an entire collection of documents (a corpus). Here we choose a collection of raw text documents in a given directory. We will use the entire Medical History of British India collection as our dataset.

To read the text files in this collection we can use the ```PlaintextCorpusReader``` class provided in the ```corpus``` package of NLTK. You need to specify the collection directory name and a wildcard for which files to read in the directory (e.g., ```.*``` for all files) and the text encoding of the files (in this case ```latin1```). Using the ```words()``` method provided by NLTK, the text is automatically tokenised and stored in a list of words. As before, we can then lowercase the words in the list.

```python
from nltk.corpus import PlaintextCorpusReader
# Assumption: corpus_root is the directory holding the downloaded collection
corpus_root = 'Medical_History_of_British_India'
wordlists = PlaintextCorpusReader(corpus_root, '.*', encoding='latin1')
corpus_tokens = wordlists.words()
lower_corpus_tokens = [word.lower() for word in corpus_tokens]
lower_corpus_tokens[0:10]
```

> ## Task 1: Print slice of tokens in list
>
> Print out a larger slice of the list of tokens in the Medical History of British India collection, e.g., the first 30 tokens.
>
> > ## Answer
> > ~~~python
> > print(lower_corpus_tokens[0:30])
> > ~~~