Minor rewrite of, and Ruby C bindings for, the distance program from word2vec.
This could almost certainly be done with e.g. the rb-libsvm gem, but this
gem does so directly.
N.B.: This does not currently include bindings for the main word2vec program; model creation must be done with
the C program. See the original or a
fork on GitHub for how to do so.
Add this line to your application's Gemfile:
gem 'word2vec-ruby'
And then execute:
bundle install
Or install it yourself as:
gem install word2vec-ruby

Here we assume you already have a model file generated by word2vec (e.g. vector.bin); if this is not the case, you
should probably start here.
Assuming the model file is at data/vector.bin, the following shows some basic usage patterns:
require "word2vec/native_model"
# Load the model file.
model = Word2Vec::NativeModel.parse_file("data/vector.bin")
# Get the index of some word in the model's vocabulary:
model.index("cat")
# => 1980
# Get the nearest neighbors for a word:
model.nearest_neighbors(%w(cat), neighbors_count: 3)
# => { "dog" => 0.7418528199195862, "cats" => 0.711361825466156, "puppy" => 0.6765584349632263 }

In the native C code, we use rb_utf8_str_new_cstr
rather than rb_str_new_cstr to create ruby strings
from C strings (e.g. here
and here). This means that
any strings coming out of (and, to some extent, going into) Word2Vec::NativeModel will (should) be marked as having
Encoding::UTF_8. We do this, rather than using Encoding::ASCII_8BIT, as it is generally more convenient.
If the underlying word2vec model file contains strings which are not UTF-8 encoded, then you should (hopefully?) be
able to use String#force_encoding to mark them as the appropriate encoding when they come out of
Word2Vec::NativeModel. If this becomes an issue, then it would be fairly straightforward to add an #encoding
attribute to Word2Vec::Model, which would default to Encoding::UTF_8 but could be set to anything else.
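As a rough sketch of the String#force_encoding workaround described above: the helper name `reinterpret`, the sample word, and the choice of ISO-8859-1 are all assumptions made for illustration, and none of them are part of this gem. The example uses a hand-built string as a stand-in for a word coming out of the model, so it runs without a model file.

```ruby
# Re-tag a string whose bytes are really in `actual_encoding`, then transcode
# it to UTF-8. `reinterpret` is a hypothetical helper, not part of the gem.
def reinterpret(word, actual_encoding)
  # dup first so the caller's (possibly frozen) string is left untouched
  word.dup.force_encoding(actual_encoding).encode(Encoding::UTF_8)
end

# Stand-in for a word as it might come out of the model: the ISO-8859-1
# bytes for "café", marked as UTF-8 and therefore not valid UTF-8.
raw = "caf\xE9".b.force_encoding(Encoding::UTF_8)
raw.valid_encoding?
# => false

fixed = reinterpret(raw, Encoding::ISO_8859_1)
fixed.valid_encoding?
# => true

# The same helper applied to the keys of a made-up neighbors hash.
neighbors = { raw => 0.71 }
clean = neighbors.transform_keys { |w| reinterpret(w, Encoding::ISO_8859_1) }
```

If transcoding is not wanted, calling force_encoding alone is enough to re-tag the bytes; the extra encode step here only converts the result into real UTF-8 so it compares cleanly with other Ruby strings.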
Some useful references on word2vec:
- The original. Contains links to relevant academic papers.
- A well-written high level summary of word vectors.
- Wikipedia.
- Deeplearning4j's description of word2vec. Java-specific, but still a good reference.
- A fork of the original on GitHub.