The runnable jar can be invoked in two modes: training and testing.
To train, use:
java -jar sbt1.0.0.jar train <input_file> <model_output_file> <configuration_file>
To test, use:
java -jar sbt1.0.0.jar test <model_file> <test_file> <configuration_file> <num_docs_in_test_file> <num_docs_to_test>
To get started, you can try the following simple example, which uses the sample files below (subsets of the Reuters RCV1 corpus):
java -jar sbt1.0.0.jar train smalltrain.dat smalltrain.model smalltrain.config
This trains a small (18-topic) model and saves it to smalltrain.model. Training should take roughly five minutes on a mid-range 2015 workstation. You can then test the output model using:
java -jar sbt1.0.0.jar test smalltrain.model test.dat test.config 1000 100
which outputs the total log likelihood of the model on the first 100 documents of the test corpus, along with the number of tokens tested. Our run resulted in a log likelihood of -120405, which over 16660 tokens corresponds to a perplexity of 1376.
To obtain a more accurate model, you can try training with smalltrain-paper.config, which produces models more like those tested in the paper. Training with those settings yielded a log likelihood of -770989 over 121376 test tokens (i.e. using all 1000 docs in test.dat), a perplexity of 574, better than any of the corresponding results in the paper (see Table 3). The improvement over the paper is due to the use of expansion (which also increases the total number of sampling passes) plus the improved gradient-ascent hyperparameter tuning in the latest codebase.
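The perplexity figures above follow the standard per-token definition, perplexity = exp(-log likelihood / token count), assuming the reported log likelihoods are natural logs. A quick Python check of both reported runs:

```python
import math

def perplexity(log_likelihood, num_tokens):
    """Per-token perplexity from a total (natural-log) likelihood."""
    return math.exp(-log_likelihood / num_tokens)

print(round(perplexity(-120405, 16660)))    # smalltrain.config run   -> 1376
print(round(perplexity(-770989, 121376)))   # smalltrain-paper run    -> 574
```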
Training and test files must be in the following SVMlight-like format:
<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value>
where
<target>, <feature>, and <value> are positive integers
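As a concrete illustration of the format above, here is a minimal Python sketch of a line reader (not part of the distribution; the feature IDs in the example line are made up):

```python
def parse_svmlight_line(line):
    """Parse one SVMlight-style line into (target, {feature: value})."""
    parts = line.split()
    target = int(parts[0])
    features = {}
    for pair in parts[1:]:
        feature, value = pair.split(":")
        features[int(feature)] = int(value)
    return target, features

# A document with target 1, feature 4 occurring twice, feature 17 once:
parse_svmlight_line("1 4:2 17:1")  # -> (1, {4: 2, 17: 1})
```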
Configuration files are tab-delimited lines:
<line> .=. <variable>\t<value(s)>
<variable> is a settable configuration variable
<value(s)> is an integer or a space-delimited set of integers
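A minimal sketch of a reader for one such tab-delimited line, in the same spirit (the variable names in the examples are hypothetical, not actual settings from the configuration files):

```python
def parse_config_line(line):
    """Parse a tab-delimited config line into (variable, [int values])."""
    variable, values = line.rstrip("\n").split("\t", 1)
    return variable, [int(v) for v in values.split(" ")]

parse_config_line("numTopics\t18")  # -> ("numTopics", [18])
parse_config_line("levels\t2 3 3")  # -> ("levels", [2, 3, 3])
```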
The example configuration files above include comments explaining the meaning of each parameter.
BibTeX:
@inproceedings{downey2015efficient,
title={Efficient Methods for Inferring Large Sparse Topic Hierarchies},
author={Doug Downey and Chandra Sekhar Bhagavatula and Yi Yang},
booktitle={Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing},
year={2015}
}
This page was last updated on August 12, 2015.