Web SAIL

What is 3W?

Wikipedia’s links are very valuable for many NLP tasks, but only a fraction of the text is annotated with hyperlinks. Our goal is to add links to the articles at high precision so that other NLP systems can use them. 3W is a system that identifies phrases in Wikipedia articles and links them to their referent concepts. 3W leverages the rich information present in Wikipedia articles to achieve high precision, yet yields radically more new links than the baseline.

Links

We provide two versions of the new links: Baseline and 3W. Please refer to our Publication for further explanation of both versions.

  • Baseline (1.1 GB)
  • 3W (1.3 GB) (from ~2.6 million articles)

Both versions use the following format:

<source article id>\t<start offset>\t<end offset>\t<link target article id>\t<confidence>

The offsets are computed from the parsed Wikipedia articles provided in the Wikipedia Resources section.
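As an illustration of the format above, here is a minimal Python sketch that parses one line of a link file. The names `Link` and `parse_link_line` are hypothetical and not part of the release. The sketch assumes that article IDs can be kept as strings and that offsets are integers; whether the end offset is inclusive, and whether offsets count characters or bytes, is not stated here and should be checked against the parsed articles.

```python
from typing import NamedTuple


class Link(NamedTuple):
    """One record of a link file (hypothetical container, for illustration)."""
    source_id: str      # <source article id>
    start: int          # <start offset>
    end: int            # <end offset>
    target_id: str      # <link target article id>
    confidence: float   # <confidence>


def parse_link_line(line: str) -> Link:
    """Split a tab-separated link line into its five fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    source_id, start, end, target_id, confidence = fields
    return Link(source_id, int(start), int(end), target_id, float(confidence))
```

Since the link files are large, it is probably best to stream them line by line, for example `for line in open(path, encoding="utf-8"): link = parse_link_line(line)`, rather than loading them fully into memory.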

Publication

The paper that describes this project is Adding High-Precision Links to Wikipedia.

  • Paper
  • Poster
  • BibTeX:
    @InProceedings{noraset-bhagavatula-downey:2014:EMNLP2014,
    	author    = {Noraset, Thanapon  and  Bhagavatula, Chandra  and  Downey, Doug},
    	title     = {Adding High-Precision Links to Wikipedia},
    	booktitle = {Proceedings of the 2014 Conference on Empirical Methods 
    			in Natural Language Processing (EMNLP)},
    	month     = {October},
    	year      = {2014},
    	address   = {Doha, Qatar},
    	publisher = {Association for Computational Linguistics},
    	pages     = {651--656},
    	url       = {http://www.aclweb.org/anthology/D14-1072}
    }

Experimental Data

We provide data generated in Adding High-Precision Links to Wikipedia. The data include:

The format of the mention and link files is described in the Links section. Note that <confidence> is not present in Extracted mentions and Hand-labeled links, and <link target article id> is not present in Extracted mentions. The link files may differ slightly from those reported in the paper because we re-ran the experiments. The confidence threshold is 0.934 for 3W and 0.90 for Baseline.
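To show how the thresholds above might be applied, the following Python sketch keeps only the link records whose <confidence> reaches the threshold for a given system. The names `THRESHOLDS` and `keep_confident` are hypothetical. Two assumptions are made: the comparison is inclusive (`>=`), which is not specified here, and lines without a fifth field (such as mention or hand-labeled files) are simply skipped.

```python
from typing import Iterable, Iterator, List

# Thresholds as stated in this section.
THRESHOLDS = {"3W": 0.934, "Baseline": 0.90}


def keep_confident(lines: Iterable[str], system: str) -> Iterator[List[str]]:
    """Yield the fields of each link line at or above the system's threshold."""
    threshold = THRESHOLDS[system]
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 5:
            # No <confidence> column (assumption: skip rather than fail).
            continue
        if float(fields[4]) >= threshold:
            yield fields
```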

Wikipedia Resources

In addition to links and experimental data, we think that it will be useful to provide Wikipedia-related data that we preprocess and use in many of our projects. Warning: these files are very large.

  • Parsed articles (4.8 GB): All articles, parsed with a custom-made Sweble parser. Each file is one article, named by its article ID.
  • Article ID Map (108 MB): <article id>\t<title>
  • Dependency-parsed articles (15 GB): Dependency parses of all articles, produced with the Stanford Dependency Parser. There are two files: a dependency file (each line is an article) and a position file (<article id>\t<start byte offset>).
  • Wikipedia Links (0.8 GB): All links in the articles, in the link file format described in the Links section but without <confidence>.
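The position file makes it possible to fetch one article's dependency parse without scanning the whole 15 GB dependency file. The Python sketch below illustrates the idea. The function names are hypothetical, and it assumes that each byte offset points to the start of that article's line in the dependency file, which should be verified against the actual data.

```python
from typing import BinaryIO, Dict, Iterable


def load_positions(position_lines: Iterable[str]) -> Dict[str, int]:
    """Map each article ID to its start byte offset in the dependency file."""
    positions = {}
    for line in position_lines:
        article_id, offset = line.rstrip("\n").split("\t")
        positions[article_id] = int(offset)
    return positions


def read_article_parse(dep_file: BinaryIO, positions: Dict[str, int],
                       article_id: str) -> bytes:
    """Seek to the article's offset and return its line as raw bytes."""
    dep_file.seek(positions[article_id])
    return dep_file.readline().rstrip(b"\n")
```

The dependency file should be opened in binary mode (`open(path, "rb")`) so that `seek` operates on byte offsets. The returned bytes can then be decoded, for example as UTF-8 if that matches the file's encoding.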

All resources are built using English Wikipedia as of September 2013. The resources do not include information about templates and tables in the articles. For table-related data, please check out WikiTables.

Team

Also visit our group website for other projects.

Acknowledgement

This work was supported in part by DARPA contract D11AP00268 and the Allen Institute for Artificial Intelligence.