---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
# readtextgrid <img src="man/figures/logo.png" width = "150" align="right" />
<!-- badges: start -->
[CRAN status](https://CRAN.R-project.org/package=readtextgrid)
[R-CMD-check](https://github.com/tjmahr/readtextgrid/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->
readtextgrid parses Praat textgrids into R dataframes.
## Installation
Install from CRAN:
``` r
install.packages("readtextgrid")
```
Install the development version from GitHub:
``` r
install.packages("remotes")
remotes::install_github("tjmahr/readtextgrid")
```
## Basic example
Here is an example textgrid. It was created in Praat using
`New -> Create TextGrid...` with the default settings.
<img src="man/figures/demo-textgrid.png" width="600" />
This textgrid is bundled with this R package. We can locate the file with
`example_textgrid()`. We read in the textgrid with `read_textgrid()`.
```{r example, R.options = list(tibble.width = 100)}
library(readtextgrid)
# Locates path to an example textgrid bundled with this package
tg <- example_textgrid()
read_textgrid(path = tg)
```
The dataframe contains one row per annotation: one row for each interval on an
interval tier and one row for each point on a point tier. If a point tier has no
points, it is represented with a single row with `NA` values.
The columns encode the following information:
- `file` filename of the textgrid. By default this column uses the filename in
`path`. A user can override this value by setting the `file` argument in
`read_textgrid(path, file)`, which can be useful if textgrids are stored in
speaker-specific folders.
- `tier_num` the number of the tier (as in the left margin of the textgrid
editor)
- `tier_name` the name of the tier (as in the right margin of the textgrid
editor)
- `tier_type` the type of the tier. `"IntervalTier"` for interval tiers and
`"TextTier"` for point tiers (this is the terminology used inside of the
textgrid file format).
- `tier_xmin`, `tier_xmax` start and end times of the tier in seconds
- `xmin`, `xmax` start and end times of the annotation (an interval or a point)
in seconds
- `text` the text in the annotation
- `annotation_num` the number of the annotation in that tier (1 for the first
annotation, etc.)
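
As a rough illustration of how these columns are typically used, the sketch below builds a small hand-made dataframe with the same column names and then pulls out the annotations on one tier using base R. The values and the tier names `words` and `events` are invented for illustration; they are not output from the package.

```r
# Toy stand-in for the output of read_textgrid(); all values are made up
tg_data <- data.frame(
  file = "example.TextGrid",
  tier_num = c(1, 1, 1, 2),
  tier_name = c("words", "words", "words", "events"),
  tier_type = c("IntervalTier", "IntervalTier", "IntervalTier", "TextTier"),
  tier_xmin = 0,
  tier_xmax = 1.5,
  xmin = c(0, 0.4, 1.1, NA),
  xmax = c(0.4, 1.1, 1.5, NA),
  text = c("", "hello", "", NA),
  annotation_num = c(1, 2, 3, NA)
)

# Keep the labelled (non-empty) intervals on the "words" tier
words <- subset(tg_data, tier_name == "words" & !is.na(text) & text != "")
words[, c("tier_name", "xmin", "xmax", "text")]

# An empty point tier shows up as a single placeholder row with NA values
empty_tiers <- subset(tg_data, is.na(annotation_num))
empty_tiers$tier_name
```

Filtering on `tier_name` rather than `tier_num` tends to be more robust when tiers are not in the same order in every file.
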
## Reading in directories of textgrids
Suppose you have data on multiple speakers with one folder of textgrids per
speaker. As an example, this package has a folder called `speaker-data` bundled
with it containing 5 textgrids from each of 2 speakers.
```
speaker-data
+-- speaker001
|   +-- s2T01.TextGrid
|   +-- s2T02.TextGrid
|   +-- s2T03.TextGrid
|   +-- s2T04.TextGrid
|   \-- s2T05.TextGrid
\-- speaker002
    +-- s2T01.TextGrid
    +-- s2T02.TextGrid
    +-- s2T03.TextGrid
    +-- s2T04.TextGrid
    \-- s2T05.TextGrid
```
First, we create a vector of file-paths to read into R.
```{r}
# Get the path of the folder bundled with the package
data_dir <- system.file(package = "readtextgrid", "speaker-data")
# Get the full paths to all the textgrids
paths <- list.files(
  path = data_dir,
  pattern = "TextGrid$",
  full.names = TRUE,
  recursive = TRUE
)
```
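We can use `purrr::map_dfr()` to read all these textgrids into R: it *maps* the
`read_textgrid` function over the `paths` and combines the resulting dataframes
by row (`_dfr`). Note, however, that this approach loses the speaker
information, because the `file` column holds only the filename and both
speakers use the same filenames.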
```{r, R.options = list(tibble.width = 100)}
library(purrr)
map_dfr(paths, read_textgrid)
```
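
To see why the speaker information is lost, it helps to look at how the paths decompose. The sketch below uses two invented paths that mimic the bundled folder layout (they are placeholders, not the real locations on disk). The filenames collide across speakers, while the name of the parent folder tells them apart.

```r
# Hypothetical paths mimicking the speaker-data layout
demo_paths <- c(
  "speaker-data/speaker001/s2T01.TextGrid",
  "speaker-data/speaker002/s2T01.TextGrid"
)

# By default the `file` column is based on the filename alone,
# so the two speakers' files become indistinguishable
basename(demo_paths)
anyDuplicated(basename(demo_paths)) > 0

# The name of the parent folder identifies the speaker
basename(dirname(demo_paths))
```
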
We can use `purrr::map2_dfr()` and some dataframe manipulation to add the
speaker information.
```{r, R.options = list(tibble.width = 100), message = FALSE, warning = FALSE}
library(dplyr)
# This tells read_textgrid() to set the file column to the full path
data <- map2_dfr(paths, paths, read_textgrid) |>
  mutate(
    # basename() removes the folder part from a path,
    # dirname() removes the file part from a path
    speaker = basename(dirname(file)),
    file = basename(file),
  ) |>
  select(
    speaker, everything()
  )
data
```
Another strategy would be to read the textgrid dataframes into a list column and
`unnest()` them.
```{r}
# Read dataframes into a list column
data_nested <- tibble(
  speaker = basename(dirname(paths)),
  data = map(paths, read_textgrid)
)
# We have one row per textgrid dataframe because `data` is a list column
data_nested
# promote the nested dataframes into the main dataframe
tidyr::unnest(data_nested, "data")
```
## Other tips
### Speeding things up
Do you have thousands of textgrids to read? Reading them in parallel can speed
things up. We use the future package to manage the parallel computation and the
furrr package to get future-friendly versions of the purrr functions. We tell
future to use a `multisession` `plan`, which does the extra computation in
separate R sessions in the background. Everything else stays the same: just
replace `map()` with `future_map()`.
```{r, warning = FALSE}
library(future)
library(furrr)
plan(multisession)
data_nested <- tibble(
  speaker = basename(dirname(paths)),
  data = future_map(paths, read_textgrid)
)
```
### Helpful columns
The following columns are often helpful:
- `duration` of an interval
- `xmid` midpoint of an interval
- `total_annotations` total number of annotations on a tier
Here is how to create them:
```{r}
data |>
  # grouping needed for counting annotations per tier per file per speaker
  group_by(speaker, file, tier_num) |>
  mutate(
    duration = xmax - xmin,
    xmid = xmin + (xmax - xmin) / 2,
    total_annotations = sum(!is.na(annotation_num))
  ) |>
  ungroup() |>
  glimpse()
```
### Launching Praat
*This tip is written from the perspective of a Windows user who uses git bash
for a terminal*.
To open textgrids in Praat, you can tell R to call Praat from
the command line. You have to know the location of the Praat binary,
though. I like to keep a copy in my project directories. So, assuming that
Praat.exe is in my working folder, the following would open the 10 textgrids in
`paths` in Praat.
```{r, eval = FALSE}
system2(
  command = "./Praat.exe",
  args = c("--open", paths),
  wait = FALSE
)
```
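
If Praat is installed system-wide rather than copied into the project, one option is to look it up on the search path first. The following is a minimal sketch, not part of this package: `find_praat()` is a hypothetical helper, and the candidate names `praat` and `Praat` are assumptions about how the binary is named on your system.

```r
# Hypothetical helper: return the first candidate that exists as a file or
# can be found on the PATH, or "" if none is found
find_praat <- function(candidates = c("./Praat.exe", "praat", "Praat")) {
  for (candidate in candidates) {
    # A copy of the binary at the given (relative or absolute) location
    if (file.exists(candidate) && !dir.exists(candidate)) {
      return(candidate)
    }
    # Otherwise ask the system where the program lives
    on_path <- unname(Sys.which(candidate))
    if (nzchar(on_path)) {
      return(on_path)
    }
  }
  ""
}

praat <- find_praat()
if (nzchar(praat)) {
  message("Found Praat at: ", praat)
  # system2(command = praat, args = c("--open", paths), wait = FALSE)
} else {
  message("Praat not found; set the path to the binary by hand.")
}
```

The `system2()` call is left commented out so that running the sketch does not launch Praat.
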
## Limitations
readtextgrid supports textgrids created by Praat by using `Save as text
file...`. It uses a parsing strategy based on regular expressions targeting
indentation patterns and text flags in the file format. The [formal
specification of the textgrid
format](http://www.fon.hum.uva.nl/praat/manual/TextGrid_file_formats.html),
however, is much more flexible. As a result, not every textgrid that Praat can
open---especially the minimal "short text" files---is compatible with this
package.
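
One way to guard against this is to check a file before reading it. The sketch below is a heuristic, not something provided by this package: the hypothetical helper `looks_like_long_textgrid()` assumes that long-format files label their tiers with `item [n]:` lines, which the short format omits. It also assumes the file can be read as plain text by `readLines()`, so files saved in other encodings (such as UTF-16) may need extra handling.

```r
# Hypothetical helper: TRUE if the file contains the `item [n]:` tier labels
# written by the long text format
looks_like_long_textgrid <- function(path) {
  lines <- readLines(path, warn = FALSE)
  any(grepl("^[[:space:]]*item \\[[0-9]+\\]:", lines))
}

# Two tiny hand-written files: one in the long format, one in the short format
long_file <- tempfile(fileext = ".TextGrid")
writeLines(c(
  'File type = "ooTextFile"',
  'Object class = "TextGrid"',
  '',
  'xmin = 0',
  'xmax = 1',
  'tiers? <exists>',
  'size = 1',
  'item []:',
  '    item [1]:',
  '        class = "IntervalTier"',
  '        name = "words"',
  '        xmin = 0',
  '        xmax = 1',
  '        intervals: size = 1',
  '        intervals [1]:',
  '            xmin = 0',
  '            xmax = 1',
  '            text = ""'
), long_file)

short_file <- tempfile(fileext = ".TextGrid")
writeLines(c(
  'File type = "ooTextFile"',
  'Object class = "TextGrid"',
  '',
  '0', '1', '<exists>', '1',
  '"IntervalTier"', '"words"', '0', '1', '1', '0', '1', '""'
), short_file)

looks_like_long_textgrid(long_file)
looks_like_long_textgrid(short_file)
```

Files that fail the check can be opened in Praat and saved again with `Save as text file...`.
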
## Acknowledgments
readtextgrid was created to process data from the [WISC Lab
project](https://kidspeech.wisc.edu/). Thus, development of this package was
supported by NIH R01DC009411 and NIH R01DC015653.
***
Please note that the 'readtextgrid' project is released with a
[Contributor Code of Conduct](https://www.contributor-covenant.org/version/1/0/0/code-of-conduct.html).
By contributing to this project, you agree to abide by its terms.