RubyTokenizer

RubyTokenizer is a simple language-processing command-line tool, modeled loosely after Apache Solr's Classic Tokenizer. It performs low-level tokenization through word segmentation, filtering out whitespace, punctuation marks, parentheses, and other special characters, and returns the 10 most frequent words in a body of text. At the moment it only supports English text in UTF-8 (Unicode 6.3). All results are case-insensitive.
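
The gem's internal implementation is not reproduced in this README, but the general approach can be sketched in plain Ruby; the segment and top_words methods below are illustrative names and an assumption, not the gem's API:

# Illustrative sketch only -- not the gem's implementation or API.
def segment(text)
  # Downcase, then keep runs of letters/digits plus the ".", "@" and "-"
  # characters needed for the special cases described later in this README.
  text.downcase.scan(/[[:alnum:]][[:alnum:].@\-]*/)
end

def top_words(text, limit = 10)
  segment(text)
    .tally                                 # e.g. { "the" => 1782, ... }
    .sort_by { |_word, count| -count }
    .first(limit)
end

p top_words(File.read("lib/samples/frankenstein.txt"))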

Installation

To use this tool, you need to have Ruby installed. You can find more detailed instructions here: http://www.ruby-lang.org/en/downloads/

Add this line to your application's Gemfile:

gem 'ruby_tokenizer'

And then execute:

$ bundle

Or install it yourself as:

$ gem install ruby_tokenizer

Usage

To tokenize a text, enter the tokenizer command followed by the path(s) to the target file(s):

$ tokenizer /file1/path/here.txt /file2/path/here.txt

To use the sample files that come bundled with this gem, clone the repo and run the following from its root directory:

$ tokenizer lib/samples/frankenstein.txt lib/samples/war_of_the_worlds.txt

If you are in the folder containing the targeted text files:

$ tokenizer file1.txt file2.txt

The output is a list of the 10 most frequent words and their counts, for example:

$ [["the", 1782],
  ["and", 855],
  ["to", 790],
  ["a", 672],
  ["of", 610],
  ["she", 533],
  ["it", 463],
  ["said", 457],
  ["in", 416],
  ["alice", 384]]

If the tokenizer command is entered without any file paths, the user is prompted to input text directly:

$ tokenizer
$ "--- Please input your text below ----"
$

If the file path cannot be found or the file has a format that cannot be read, a LoadError will be displayed:

$ `read_file': File not found: Please try again. (LoadError)
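
The gem's actual read_file method is not shown here; assuming it guards the path before reading, a minimal sketch might look like:

def read_file(path)
  # Assumed guard: raise if the path does not point to an existing file.
  raise LoadError, "File not found: Please try again." unless File.exist?(path)
  File.read(path)
end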

Special Cases

RubyTokenizer accounts for email addresses, URLs, hyphenated words, certain abbreviations, and numbers as follows (the email address shown is a placeholder example; a rough sketch of possible matching rules follows this list):

Email addresses:
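
$ ["user@example.com"]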

URLs:

$ ["www.frankestein.com"]

Hyphenated words:

$ ["Chicago-based"]

Abbreviations:

$ ["U.S.A"]

Numbers (phone numbers and comma-formatted numbers are not tokenized):

$ ["3.50"]

Development

The following development dependencies are required: Bundler, Rake, RSpec, and Pry. To install them manually:

$ gem install bundler rake rspec pry

To run the test suite, fork the repo, clone it to a local directory, and run the following command from the root directory:

$ bundle exec rspec

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/irinarenteria/ruby_tokenizer.

License

The gem is available as open source under the terms of the MIT License.
