RubyTokenizer

RubyTokenizer is a simple language-processing command-line tool, modeled loosely after Apache Solr's Classic Tokenizer. It performs low-level tokenization through word segmentation, filtering out whitespace, punctuation marks, parentheses, and other special characters, and returns the 10 most frequent words in a body of text. At the moment it only supports English text in UTF-8 (Unicode 6.3). All results are case-insensitive.
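
The gem's internal implementation is not reproduced in this README, but the general approach can be sketched in plain Ruby; the segment and top_words methods below are illustrative names and an assumption, not the gem's API:

# Illustrative sketch only -- not the gem's implementation or API.
def segment(text)
  # Downcase, then keep runs of letters/digits plus the ".", "@" and "-"
  # characters needed for the special cases described later in this README.
  text.downcase.scan(/[[:alnum:]][[:alnum:].@\-]*/)
end

def top_words(text, limit = 10)
  segment(text)
    .tally                                 # e.g. { "the" => 1782, ... }
    .sort_by { |_word, count| -count }
    .first(limit)
end

p top_words(File.read("lib/samples/frankenstein.txt"))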

Installation

To use this tool, you need to have Ruby installed. You can find more detailed instructions here: http://www.ruby-lang.org/en/downloads/

Add this line to your application's Gemfile:

gem 'ruby_tokenizer'

And then execute:

$ bundle

Or install it yourself as:

$ gem install ruby_tokenizer

Usage

To tokenize a text, enter the tokenizer command followed by the path(s) to the target file(s):

$ tokenizer /file1/path/here.txt /file2/path/here.txt

To use the sample files that come bundled with this gem, clone the repo and run the following from its root directory:

$ tokenizer lib/samples/frankenstein.txt lib/samples/war_of_the_worlds.txt

If you are in the folder containing the targeted text files:

$ tokenizer file1.txt file2.txt

The output is a list of the 10 most frequent words and their counts, for example:

$ [["the", 1782],
  ["and", 855],
  ["to", 790],
  ["a", 672],
  ["of", 610],
  ["she", 533],
  ["it", 463],
  ["said", 457],
  ["in", 416],
  ["alice", 384]]

If the tokenizer command is entered without any file paths, the user is prompted to input text directly:

$ tokenizer
$ "--- Please input your text below ----"
$

If the file path cannot be found or the file has a format that cannot be read, a LoadError will be displayed:

$ `read_file': File not found: Please try again. (LoadError)
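
The gem's actual read_file method is not shown here; assuming it guards the path before reading, a minimal sketch might look like:

def read_file(path)
  # Assumed guard: raise if the path does not point to an existing file.
  raise LoadError, "File not found: Please try again." unless File.exist?(path)
  File.read(path)
end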

Special Cases

RubyTokenizer accounts for email addresses, URLs, hyphenated words, certain abbreviations, and numbers as follows (the email address shown is a placeholder example; a rough sketch of possible matching rules follows this list):

Email addresses:
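
$ ["user@example.com"]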

URLs:

$ ["www.frankestein.com"]

Hyphenated words:

$ ["Chicago-based"]

Abbreviations:

$ ["U.S.A"]

Numbers (phone numbers and comma-formatted numbers are not tokenized):

$ ["3.50"]

Development

The following development dependencies are required: Bundler, Rake, RSpec, and Pry. To install them manually:

$ gem install bundler rake rspec pry

To run the test suite, fork the repo, clone it to a local directory, and run the following command from the root directory:

$ bundle exec rspec

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/irinarenteria/ruby_tokenizer.

License

The gem is available as open source under the terms of the MIT License.
