Add library usage explanation
Change-Id: I4401ebd7218a3e699efb73917aa5de8baf6f17f8
Akron committed Sep 6, 2023
1 parent 8e80393 commit 78d270d
Showing 1 changed file with 36 additions and 1 deletion: Readme.md
@@ -9,7 +9,7 @@
high-performance large-scale natural language tokenization,
based on a finite state
transducer generated with [Foma](https://fomafst.github.io/).

-The library contains precompiled tokenizer models for
+The repository currently contains precompiled tokenizer models for

- [german](testdata/tokenizer_de.matok)
- [english](testdata/tokenizer_en.matok)
@@ -18,6 +18,8 @@
The focus of development is on the tokenization of
[DeReKo](https://www.ids-mannheim.de/digspra/kl/projekte/korpora),
the german reference corpus.

Datok can be used as a standalone tool or as a library in Go.
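As a standalone tool, Datok can sit in a shell pipeline. A minimal sketch of such an invocation — the `tokenize` subcommand, the `-t` model flag, and the trailing `-` for STDIN are assumptions for illustration, not confirmed by this diff:

```shell
# Hypothetical: tokenize German text from STDIN with the
# precompiled model listed above (flag names are assumed).
echo "Der alte Mann ging nach Hause." \
  | datok tokenize -t testdata/tokenizer_de.matok -
```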

## Performance

![Speed comparison of german tokenizers](https://raw.githubusercontent.com/KorAP/Datok/master/misc/benchmarks.svg)
@@ -54,6 +56,7 @@
The special `END OF TRANSMISSION` character (`\x04`) can be used to mark the end
> *Caution*: When experimenting with STDIN and echo,
> you may need to disable [history expansion](https://www.gnu.org/software/bash/manual/html_node/History-Interaction.html).
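To experiment with the EOT marker, the byte can be produced with standard shell tools. This is a datok-independent sketch; `\004` is the octal spelling of `\x04`, which plain `printf` understands portably:

```shell
# Two texts in one stream, each terminated by the EOT byte
# (octal \004 == hex \x04); `od -An -c` dumps the raw bytes,
# so the 004 markers are visible between the texts.
printf 'Erster Text.\004Zweiter Text.\004' | od -An -c
```

Such a stream could then be piped into the tokenizer to have both texts segmented in a single run.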

## Conversion

```
@@ -68,6 +71,38 @@
Flags:
representation
```

## Library

```go
package main

import (
	"os"
	"strings"

	"github.com/KorAP/datok"
)

func main() {

	// Load the transducer binary
	dat := datok.LoadTokenizerFile("tokenizer_de.matok")
	if dat == nil {
		panic("Can't load tokenizer")
	}

	// Create a new TokenWriter object
	tw := datok.NewTokenWriter(os.Stdout, datok.TOKENS|datok.SENTENCES)
	defer tw.Flush()

	// Create an io.Reader object referring to the data to tokenize
	r := strings.NewReader("Das ist <em>interessant</em>!")

	// TransduceTokenWriter accepts an io.Reader and a
	// TokenWriter object to transduce the input
	dat.TransduceTokenWriter(r, tw)
}
```

## Conventions

The FST generated by [Foma](https://fomafst.github.io/) must adhere to
