LiJiefei/TWTM

Tag-Weighted Topic Model for Mining Semi-Structured Documents

Code for the paper at http://dl.acm.org/citation.cfm?id=2540540
Authors: Shuangyin Li, Jiefei Li, Rong Pan
Sun Yat-sen University

For any questions about the code, please contact us by email at lijiefei AT mail2.sysu.edu.cn.

License

Copyright 2013 Shuangyin Li, Jiefei Li, Rong Pan
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Install

cd src/ && make

Usage

### Input file format:
DocNumLabels label1 label2 ... @ DocNumWords word1 word2 ...
DocNumLabels label1 label2 ... @ DocNumWords word1 word2 ...
DocNumLabels label1 label2 ... @ DocNumWords word1 word2 ...

Each row represents one document with its labels. DocNumLabels is the number of labels in the document, and DocNumWords is the number of words in the document. Each label is an integer that identifies one label, and each word is an integer that identifies one word.
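The row format above can be illustrated with a short Python sketch. The helper names `format_document` and `parse_document` are hypothetical and are not part of the TWTM code. The sketch also assumes that DocNumWords counts the word IDs that follow it on the line, which is one reading of the description above.

```python
def format_document(labels, words):
    """Serialize one document as 'NumLabels l1 l2 ... @ NumWords w1 w2 ...'."""
    fields = [len(labels), *labels, "@", len(words), *words]
    return " ".join(str(field) for field in fields)


def parse_document(line):
    """Parse one input row back into (labels, words) lists of integers."""
    label_part, word_part = line.split("@")
    label_fields = [int(token) for token in label_part.split()]
    word_fields = [int(token) for token in word_part.split()]
    if not label_fields or not word_fields:
        raise ValueError("missing label count or word count")
    num_labels, labels = label_fields[0], label_fields[1:]
    num_words, words = word_fields[0], word_fields[1:]
    # Assumption: each count equals the number of IDs that follow it.
    if num_labels != len(labels) or num_words != len(words):
        raise ValueError("declared counts do not match the IDs on the line")
    return labels, words


if __name__ == "__main__":
    row = format_document([2, 5], [10, 11, 10])
    print(row)
    print(parse_document(row))
```

A check like the one in `parse_document` can help catch malformed rows before they are passed to training.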

demo/twtm.demo.input is a simple demo input file.
demo/label.txt is the label dictionary file. The word in row 1 corresponds to label 0.
demo/words.dic is the word dictionary file.
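As a rough sketch of how the dictionary files can be used, the Python below maps integer IDs back to their text. The helper names `load_dictionary` and `decode_ids` are hypothetical. The row-to-ID rule (row 1 is ID 0) is stated above only for the label file; applying the same rule to demo/words.dic, and assuming one entry per row, are assumptions.

```python
def load_dictionary(path):
    """Read a dictionary file with one entry per row; row 1 gets ID 0."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle]


def decode_ids(ids, dictionary):
    """Translate a list of integer IDs into their dictionary entries."""
    return [dictionary[i] for i in ids]


if __name__ == "__main__":
    # Paths are taken from the demo directory described above.
    labels = load_dictionary("demo/label.txt")
    print(decode_ids([0], labels))
```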


### Training:

./twtm est <input data file> <setting.txt> <num_topics> <model save dir>

Example:

./src/twtm est demo/twtm.demo.input src/setting.txt 10 demo/model

Some model training parameters are set in the file "setting.txt".

### Inference:
There are two ways to infer the topic distribution of a new document.
The first still uses the labels of the new document:

./twtm inf <input data file> <setting.txt> <model dir> <prefix> <output dir>

Example:

./src/twtm inf demo/twtm.demo.input src/setting.txt demo/model/ final demo/output/

The doc-topics-dis.txt file is written to the output dir. It contains the topic distribution of each document in the input data file. Apply exp(.) to the values in the file to obtain the actual probabilities.
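The exp(.) step can be sketched in Python as follows. The layout of doc-topics-dis.txt is not documented here, so this sketch assumes one document per line with whitespace-separated values, one per topic. The helper names `row_to_probabilities` and `convert_lines` are hypothetical.

```python
import math


def row_to_probabilities(line):
    """Apply exp(.) to every value on one line of the output file."""
    return [math.exp(float(value)) for value in line.split()]


def convert_lines(lines):
    """Convert an iterable of lines (e.g. an open file), skipping blank ones."""
    return [row_to_probabilities(line) for line in lines if line.strip()]


if __name__ == "__main__":
    # Assumed location: the output dir used in the example command above.
    with open("demo/output/doc-topics-dis.txt", encoding="utf-8") as handle:
        for doc_id, probs in enumerate(convert_lines(handle)):
            print(doc_id, probs)
```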

The second uses only the words of the new document. With a trained TWTM model, we can therefore infer topics for new documents that have no labels, just as with an LDA model.

./twtm lda-inf <input data file> <setting.txt> <model dir> <prefix> <output dir>

Example:

./src/twtm lda-inf demo/twtm.demo.input src/setting.txt demo/model/ final demo/output/
