mnarayan01/word2vec-ruby

word2vec-ruby

Minor rewrite of, and Ruby C bindings for, the distance program from word2vec. The same result could almost certainly be achieved with e.g. the rb-libsvm gem, but this gem does it directly.

N.B.: This does not currently include bindings for the main word2vec program; model creation must be done with the C program. See the original or a fork on GitHub for how to do so.

Installation

Add this line to your application's Gemfile:

gem 'word2vec-ruby'

And then execute:

bundle install

Or install it yourself as:

gem install word2vec-ruby

Usage

We assume here that you already have a model file generated by word2vec (e.g. vector.bin); if you do not, create one with the original C program first.

Assuming the model file is at data/vector.bin, the following shows some basic usage patterns:

require "word2vec/native_model"

# Load the model file.
model = Word2Vec::NativeModel.parse_file("data/vector.bin")

# Get the index of some word in the model's vocabulary:
model.index("cat")
# => 1980

# Get the nearest neighbors for a word:
model.nearest_neighbors(%w(cat), neighbors_count: 3)
# => { "dog" => 0.7418528199195862, "cats" => 0.711361825466156, "puppy" => 0.6765584349632263 }
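
The original distance program ranks words by the cosine similarity of their vectors. Assuming this gem keeps that behavior, the scores above can be read as cosine similarities. The following pure-Ruby sketch illustrates the idea on made-up three-dimensional vectors; `cosine_similarity` and `toy_nearest_neighbors` are hypothetical helpers written for this example and are not part of the gem.

```ruby
# Toy vectors for illustration only; a real model has many more words and dimensions.
vectors = {
  "cat"   => [0.9, 0.1, 0.3],
  "dog"   => [0.8, 0.2, 0.4],
  "puppy" => [0.7, 0.3, 0.5],
  "car"   => [0.1, 0.9, 0.2],
}

# Cosine similarity: the dot product divided by the product of the vector lengths.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  length_a = Math.sqrt(a.sum { |x| x * x })
  length_b = Math.sqrt(b.sum { |x| x * x })
  dot / (length_a * length_b)
end

# Hypothetical helper: score every other word against `word` and keep the best matches.
def toy_nearest_neighbors(vectors, word, neighbors_count: 3)
  target = vectors.fetch(word)
  vectors
    .reject { |other, _| other == word }
    .map { |other, vector| [other, cosine_similarity(target, vector)] }
    .max_by(neighbors_count) { |_, score| score }
    .to_h
end

neighbors = toy_nearest_neighbors(vectors, "cat", neighbors_count: 2)
neighbors.keys
# => ["dog", "puppy"]
```

The native implementation exists precisely because a Ruby loop like this is slow on a real vocabulary; the sketch is only meant to show what the returned numbers mean.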

Caveats

String encoding

In the native C code, we use rb_utf8_str_new_cstr rather than rb_str_new_cstr to create Ruby strings from C strings. This means that any strings coming out of (and, to some extent, going into) Word2Vec::NativeModel should be marked as having Encoding::UTF_8. We do this, rather than using Encoding::ASCII_8BIT, because UTF-8 strings are generally more convenient to work with.
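
To see why the encoding tag matters, the sketch below compares two Ruby strings that hold identical bytes but carry different encodings. It uses only core String methods and does not touch the gem; the word "café" is just an example of a non-ASCII vocabulary entry.

```ruby
utf8   = "caf\u00E9"   # tagged Encoding::UTF_8
binary = utf8.b        # same bytes, tagged Encoding::ASCII_8BIT

same_bytes = utf8.bytes == binary.bytes   # true: the raw bytes are identical
equal      = utf8 == binary               # false: non-ASCII strings with different encodings never compare equal

# Hash lookups fail for the same reason.
counts = { utf8 => 1 }
lookup = counts[binary]                   # nil

# Mixing the two in one string raises an error.
concat_error =
  begin
    utf8 + binary
    nil
  rescue Encoding::CompatibilityError => e
    e.class
  end
```

With UTF-8 tagged strings, words returned by the model can be compared with, and used alongside, ordinary string literals in your own code without these surprises.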

If the underlying word2vec model file contains strings which are not UTF-8 encoded, then you should (hopefully?) be able to use String#force_encoding to mark them with the appropriate encoding when they come out of Word2Vec::NativeModel. If this becomes an issue, it would be fairly straightforward to add an #encoding attribute to Word2Vec::Model, which would default to Encoding::UTF_8 but could be set to anything else.
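
A minimal sketch of that workaround follows. It assumes, purely for illustration, that the model was trained on ISO-8859-1 text; the byte sequence below simulates a word as it might come back from the model and is not real gem output.

```ruby
# Simulated return value: the ISO-8859-1 bytes for "café", wrongly tagged as UTF-8.
word = [0x63, 0x61, 0x66, 0xE9].pack("C*").force_encoding(Encoding::UTF_8)
valid_before = word.valid_encoding?   # false: 0xE9 alone is not valid UTF-8

# Re-tag the same bytes with the encoding the model really uses (an assumption here).
# force_encoding changes only the tag, never the bytes.
retagged = word.dup.force_encoding(Encoding::ISO_8859_1)
valid_after = retagged.valid_encoding?   # true

# Optionally transcode to UTF-8 so the word mixes cleanly with other strings.
converted = retagged.encode(Encoding::UTF_8)
matches = converted == "caf\u00E9"       # true
```

Note that String#force_encoding only relabels the existing bytes, while String#encode actually converts them, so the second step is needed whenever the rest of your program expects UTF-8.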

word2vec references

Some useful references on word2vec:

  1. The original. Contains links to relevant academic papers.
  2. A well-written high level summary of word vectors.
  3. Wikipedia.
  4. Deeplearning4j's description of word2vec. Java-specific, but still a good reference.
  5. A fork of the original on GitHub.
