Semantic Features

About

semantic-features is an extensible, easy-to-use library for performing semantic analysis on words in context.

Usage

Embedding Extraction

To extract and save embeddings from a new model, use extract_embs_main.py, such as:

python3 extract_embs_main.py --model=bert-base-uncased --dataset=bnc/less_token_lists --save_dir=saved_embs_test --gpu=3

Parameters:

Parameter	Required?	Value(s)	Note
model	required	str, name of LM from HuggingFace	autoregressive LLMs not recommended
dataset	required	str, path to dataset	BNC-like formatting and directory structure expected
save_dir	required	str, path to directory to save embeddings to
gpu	optional	int, which cuda to use	highly recommended

After extracting the embeddings, you can use check_savings.py to make sure that the saved embeddings are of the correct dimensions and format:

python3 check_savings.py --embs=saved_embs_test/bert-base-uncased --dataset=bnc/less_token_lists

Parameters:

Parameter	Required?	Values
embs	required	str, path to embeddings extracted using `extract_embs_main.py`
dataset	required	str, path to dataset used to extract embeddings

Model Training and Optimization

To train a feature norm predictor model, use model.py, such as:

python model.py --norm=mcrae --norm_file=feature-norms/binder/WordSet1_Ratings.csv --embedding_dir=saved_embeddings/bert-base-uncased --lm_layer=10 --num_layers=2 --hidden_size=128 --dropout=0.1 --save_dir=models --save_model_name=test_name

To enable optimization, use the optimize and prune arguments. If optimizing, can also use gpu to accelerate training:

python model.py --norm=buchanan --norm_file=feature-norms/buchanan/cue_feature_words.csv --embedding_dir=saved_embeddings/albert-xxlarge-v2 --save_dir=saved_models/new/albert_models_all --optimize --gpu=2 --num_layers=2 --dropout=0.5 --num_epochs=100 --weight_decay=0 --early_stopping=6 --lm_layer=5 --save_model_name=albert_to_buchanan_layer5 --prune --normalized_buchanan

Parameters:

Parameter	Required?	Values	Note
norm	required	"binder", "buchanan", or "mcrae"
norm_file	required	str, path to csv containing norm data
embedding_dir	required	str, path to directory in which `extract_embs_main.py` saved embeddings
lm_layer	required	int, layer of embeddings to use as source
save_model_name	required	str, desired name of MLP	if optimizing, only best model saved
save_dir	required	str, path to save MLP to
optimize	optional	present/not present	if present, hyperparameter tuning used
prune	optional	present/not present	only used if `--optimize` flag present, prunes unpromising trials
gpu	optional	int, which cuda to use	only used if `--optimize` flag present
num_layers	optional	int, default=2	number of layers in MLP
hidden_size	optional	int, default=100	hidden size of MLP, ignored if optimizing
dropout	optional	float, default=0.1	dropout rate of MLP
num_epochs	optional	int, default=10	number of epochs to train for, maximum if using early stopping
batch_size	optional	int, default=32	batch size, ignored if optimizing
learning_rate	optional	float, default=0.001	learning rate, ignored if optimizing
weight_decay	optional	float, default=0.0	weight decay
early_stopping	optional	int, no default	number of epochs without progress to wait before early stopping
raw_buchanan	optional	present/not present	if present, indicates not to use translated values for Buchanan norms, ignored for other norms
normal_buchanan	optional	present/not present	if present, indicates to use normalized values for Buchanan norms, ignored for other norms

To train a classifier for each layer of the LM, use the train_all_layers.sh script by running ./train_all_layers.sh and following the prompts. All hyperparameters will be the same for each model except for the layer of the LM.

Name		Name	Last commit message	Last commit date
Latest commit History 53 Commits
datives		datives
feature-norms		feature-norms
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
check_savings.py		check_savings.py
extract_embeddings.py		extract_embeddings.py
extract_embs_main.py		extract_embs_main.py
get_dataset_dictionaries.py		get_dataset_dictionaries.py
model.py		model.py
test_model.py		test_model.py
train_all_layers.sh		train_all_layers.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Semantic Features

About

Usage

Embedding Extraction

Model Training and Optimization

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

License

jwalanthi/semantic-features

Folders and files

Latest commit

History

Repository files navigation

Semantic Features

About

Usage

Embedding Extraction

Model Training and Optimization

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages