The runnable jar can be invoked in two modes: training and testing.
To train, use:
java -jar sbt1.0.0.jar train <input_file> <model_output_file> <configuration_file>
To test, use:
java -jar sbt1.0.0.jar test <model_file> <test_file> <configuration_file> <num_docs_in_test_file> <num_docs_to_test>
To get started, you can try the following simple example, which uses the sample files below (subsets of the Reuters RCV1 corpus):
java -jar sbt1.0.0.jar train smalltrain.dat smalltrain.model smalltrain.config
This trains a small (18-topic) model and saves it to smalltrain.model. Training should take roughly five minutes on a mid-range 2015 workstation. You can then test the output model using:
java -jar sbt1.0.0.jar test smalltrain.model test.dat test.config 1000 100
which outputs the total log likelihood of the model on the first 100 documents of the test corpus, along with the number of tokens tested. Our run resulted in a log likelihood of -120405, which over 16660 tokens corresponds to a perplexity of 1376.
To obtain a more accurate model, you can try training with smalltrain-paper.config, which produces models more like those tested in the paper. Training with those settings yielded a log likelihood of -770989 over 121376 test tokens (i.e. using all 1000 docs in test.dat), a perplexity of 574, better than any of the corresponding results in the paper (see Table 3). The improvement over the paper is due to the use of expansion (which also increases the total number of sampling passes) plus the improved gradient-ascent hyperparameter tuning in the latest codebase.
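The perplexity figures above follow the standard per-token definition, perplexity = exp(-log likelihood / token count), assuming the reported log likelihoods are natural logs. A quick Python check of both reported runs:

```python
import math

def perplexity(log_likelihood, num_tokens):
    """Per-token perplexity from a total (natural-log) likelihood."""
    return math.exp(-log_likelihood / num_tokens)

print(round(perplexity(-120405, 16660)))    # smalltrain.config run   -> 1376
print(round(perplexity(-770989, 121376)))   # smalltrain-paper run    -> 574
```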
Training and test files must be in the following SVMlight-like format:
<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value>
where
<target>, <feature>, and <value> are positive integers
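As a concrete illustration of the format above, here is a minimal Python sketch of a line reader (not part of the distribution; the feature IDs in the example line are made up):

```python
def parse_svmlight_line(line):
    """Parse one SVMlight-style line into (target, {feature: value})."""
    parts = line.split()
    target = int(parts[0])
    features = {}
    for pair in parts[1:]:
        feature, value = pair.split(":")
        features[int(feature)] = int(value)
    return target, features

# A document with target 1, feature 4 occurring twice, feature 17 once:
parse_svmlight_line("1 4:2 17:1")  # -> (1, {4: 2, 17: 1})
```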
Configuration files are tab-delimited lines:
<line> .=. <variable>\t<value(s)>
<variable> is a settable configuration variable
<value(s)> is an integer or a space-delimited set of integers
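A minimal sketch of a reader for one such tab-delimited line, in the same spirit (the variable names in the examples are hypothetical, not actual settings from the configuration files):

```python
def parse_config_line(line):
    """Parse a tab-delimited config line into (variable, [int values])."""
    variable, values = line.rstrip("\n").split("\t", 1)
    return variable, [int(v) for v in values.split(" ")]

parse_config_line("numTopics\t18")  # -> ("numTopics", [18])
parse_config_line("levels\t2 3 3")  # -> ("levels", [2, 3, 3])
```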
The example configuration files above include comments explaining the meaning of each parameter.
BibTeX:
@inproceedings{downey2015efficient,
title={Efficient Methods for Inferring Large Sparse Topic Hierarchies},
author={Doug Downey and Chandra Sekhar Bhagavatula and Yi Yang},
booktitle={Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing},
year={2015}
}
This page was last updated on August 12, 2015.