Abstract

Web tables form a valuable source of relational data. The Web contains an estimated 154 million HTML tables of relational data, with Wikipedia alone containing 1.6 million high-quality relational tables. Extracting the semantics of Web tables to produce machine-understandable knowledge has become an active area of research.

A key step in extracting the semantics of Web content is entity linking (EL): the task of mapping a phrase in text to its referent entity in a knowledge base (KB). In this paper we present TabEL, a new EL system for Web tables. TabEL differs from previous work by weakening the assumption that the semantics of a table can be mapped to pre-defined types and relations found in the target KB. Instead, TabEL enforces soft constraints in the form of a graphical model that assigns higher likelihood to sets of entities that tend to co-occur in Wikipedia documents and tables. In experiments, TabEL significantly reduces error when compared to current state-of-the-art table EL systems, including over 75% error reduction on Wikipedia tables and 60% error reduction on Web tables. We also make the Wikipedia table corpus and all test datasets publicly available for future work.

Paper

TabEL: Entity Linking in Web Tables

Datasets and APIs

- Web_Manual Dataset
- Wiki_Links-Random
  - Test tables
    This file contains one entry per line with two tab-separated fields: the page ID and the table ID. These IDs refer to tables in the 1.6M tables dataset, where the corresponding JSON field names are "pgId" and "tableId".
- TabEL_35K
  - Test Mentions
    This file contains the IDs of 35,000 mentions. The IDs refer to entries in the "Existing Links" dataset, where the ID is stored in the field "_id.$oid".
  - Training Mentions
  - Error Analysis
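
The Wiki_Links-Random test-table file can be used to pull the matching tables out of the 1.6M tables dataset. The following is a minimal sketch, not part of any released code: the function names `read_test_table_ids` and `select_test_tables` are hypothetical, and it assumes that the tables dataset stores one JSON object per line, which this page does not state. IDs are compared as strings because the JSON may store them as numbers.

```python
import json
from typing import Dict, Iterable, Iterator, Set, Tuple


def read_test_table_ids(lines: Iterable[str]) -> Set[Tuple[str, str]]:
    """Parse 'pageID<TAB>tableID' lines into a set of (pgId, tableId) pairs."""
    ids: Set[Tuple[str, str]] = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        pg_id, table_id = line.split("\t")[:2]
        ids.add((pg_id.strip(), table_id.strip()))
    return ids


def select_test_tables(
    json_lines: Iterable[str], test_ids: Set[Tuple[str, str]]
) -> Iterator[Dict]:
    """Yield tables whose ("pgId", "tableId") pair is in test_ids.

    Assumption: each non-empty input line is one JSON object.
    """
    for line in json_lines:
        if not line.strip():
            continue
        table = json.loads(line)
        # Normalize to strings so 12 and "12" compare equal.
        key = (str(table.get("pgId")), str(table.get("tableId")))
        if key in test_ids:
            yield table
```

Both functions accept any iterable of lines, so open file handles can be passed directly and the large tables file is streamed rather than loaded into memory.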

Download

- A dataset of 1.6M Wikipedia Tables in JSON format (link)
- A dataset of existing links in Wikipedia tables as Mentions in JSON format (link)
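
The TabEL_35K mention IDs refer to the "_id.$oid" field of the existing-links dataset, which reads as a nested object of the form `{"_id": {"$oid": ...}}`. The sketch below selects the referenced mentions under two unconfirmed assumptions: the ID file holds one mention ID per line, and the mentions dataset holds one JSON object per line. The function names `mention_oid`, `read_mention_ids`, and `select_mentions` are hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator, Optional, Set


def mention_oid(mention: Dict) -> Optional[str]:
    """Return the value at the nested path "_id" -> "$oid", or None if absent."""
    raw_id = mention.get("_id")
    if isinstance(raw_id, dict):
        return raw_id.get("$oid")
    return None


def read_mention_ids(lines: Iterable[str]) -> Set[str]:
    """Collect mention IDs, assuming one ID per non-empty line."""
    return {line.strip() for line in lines if line.strip()}


def select_mentions(json_lines: Iterable[str], wanted: Set[str]) -> Iterator[Dict]:
    """Yield mentions whose "_id.$oid" value is in the wanted set.

    Assumption: each non-empty input line is one JSON object.
    """
    for line in json_lines:
        if not line.strip():
            continue
        mention = json.loads(line)
        if mention_oid(mention) in wanted:
            yield mention
```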

Code

The code will be released later.

Our Team

We are from Northwestern University. Our team members are:

License

This work is licensed under a Creative Commons Attribution 4.0 International License.