Knowledge bases extracted automatically from the Web present new opportunities for data mining and exploration. Given a large, heterogeneous set of extracted relations, new tools are needed for searching the knowledge and uncovering relationships of interest. We present WikiTables, a Web application that enables users to interactively explore tabular knowledge extracted from Wikipedia.
In experiments, we show that WikiTables substantially outperforms baselines on the novel task of automatically joining together disparate tables to uncover "interesting" relationships between table columns. We find that a "Semantic Relatedness" measure that leverages the Wikipedia link structure accounts for a majority of this improvement. Further, on the task of keyword search for tables, we show that WikiTables performs comparably to Google Fusion Tables despite using an order of magnitude fewer tables. Our work also includes the release of a number of public resources, including over 15 million tuples of extracted tabular data, manually annotated evaluation sets, and public APIs.
The paper related to this project is Methods for Exploring and Mining Tables on Wikipedia.
If you use the dataset or APIs, please cite the paper.
There are currently two demo applications (listed below). Both show how useful tables on the Web (here, Wikipedia) can be when they are mined and preprocessed well. The first application joins many tables together, letting users see columns from multiple tables side by side and perhaps find an interesting correlation in the data. The second is a search engine that exploits additional information from the tables to improve search results.
Here are a few examples of interesting correlations we found using the Relevant Join application.
Datasets and APIs
- - Table Extraction Labeled Dataset
- - Random articles used for evaluation on the Relevant Join task
- Semantic Relatedness API:
- where <pageID> is the Wikipedia page ID of an article.
- Example: http://downey-n2.cs.northwestern.edu:8080/wikisr/sr/sID/19908980/langID/1
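A minimal sketch of calling the Semantic Relatedness API from Python, based only on the example URL above. The path layout (`/sID/<pageID>/langID/<n>`), the meaning of `langID`, and the helper names `build_sr_url` and `fetch_sr` are assumptions for illustration. The response format is not documented on this page, so the sketch returns the raw body without parsing it.

```python
import urllib.request

# Base URL copied from the example on this page. The path layout below is
# inferred from that single example and is an assumption, not a documented spec.
BASE_URL = "http://downey-n2.cs.northwestern.edu:8080/wikisr/sr"


def build_sr_url(page_id: int, lang_id: int = 1) -> str:
    """Build a request URL that follows the pattern of the example URL."""
    return f"{BASE_URL}/sID/{int(page_id)}/langID/{int(lang_id)}"


def fetch_sr(page_id: int, lang_id: int = 1, timeout: float = 10.0) -> str:
    """Fetch the raw response body as text (hypothetical helper).

    The response format is not described here, so no parsing is attempted.
    """
    url = build_sr_url(page_id, lang_id)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Prints the URL for the page ID used in the example above.
    print(build_sr_url(19908980))
```

Calling `fetch_sr(19908980)` would issue the request itself. Whether the server is still reachable is not something this sketch can guarantee.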
Extracted tables and cell data are available in this section.
Please read the README file before downloading.
The dataset can be downloaded.
(IMPORTANT: We have a new version of the dataset. Please visit this page for the latest dataset.)
The source code of this project (not including the UI) is available on Bitbucket. To get it, run the command: git clone email@example.com:csbhagav/kgrid-server.git
We are from Northwestern University. Our team members include:
- - Doug Downey
- - Chandra Sekhar Bhagavatula
- - Thanapon Noraset