WebIsADb is a publicly available database containing more than 400 million hypernymy relations we extracted from the CommonCrawl web corpus. This collection of relations represents a rich source of knowledge and may be useful for many researchers. We offer the tuple dataset for public download and an application programming interface to help other researchers programmatically query the database.

News

2016-05-30: WebIsADb was involved in the Open Knowledge Extraction Challenge (OKE 2016) at ESWC 2016.
Nominated as ESWC 2016 Best Challenges Paper:
S. Faralli and S. P. Ponzetto. DWS at the 2016 Open Knowledge Extraction Challenge: A Hearst-like Pattern-Based Approach to Hypernym Extraction and Class Induction.
2016-01-01: A paper about the WebIsADb was accepted at LREC 2016.
Julian Seitner, Christian Bizer, Kai Eckert, Stefano Faralli, Robert Meusel, Heiko Paulheim and Simone Paolo Ponzetto, 2016. A Large Database of Hypernymy Relations Extracted from the Web. Proceedings of the 10th edition of the Language Resources and Evaluation Conference. Portorož, Slovenia.

1. WebIsADb

Our approach to "isa" relation extraction can be divided into three main steps:

Table 1: The list of patterns used for tuple extraction (NP_t indicates the hyponym and NP_h the hypernym). For each pattern we also report the estimated precision and the number of resulting matches.

Pattern	Precision	# match
NP_t and any other NP_h	0.76	975,735
NP_t and other NP_h	0.70	45,900,092
NP_t or other NP_h	0.70	13,392,348
NP_t is adj_sup NP_h	0.63	6,150,245
NP_t is adj_sup most NP_h	0.63	2,286,478
NP_h such as NP_t	0.58	70,337,543
such NP_h as NP_t	0.58	5,755,389
NP_t are a NP_h	0.57	15,141,131
NP_t and some other NP_h	0.54	296,524
NP_t which is called NP_h	0.50	119,317
NP_t are adj_sup most NP_h	0.49	860,770
examples of NP_h are NP_t	0.45	267,764
NP_t, kinds of NP_h	0.45	4,618,873
NP_h including NP_t	0.44	80,640,885
NP_t is a NP_h	0.44	187,644,160
NP_h other than NP_t	0.44	7,175,087
NP_t were a NP_h	0.42	3,206,238
NP_t are adj_sup NP_h	0.41	1,393,484
NP_t was a NP_h	0.39	39,585,428
NP_t, one of the NP_h	0.38	4,200,376
NP_t is example of NP_h	0.36	292,706
examples of NP_h is NP_t	0.33	267,021
NP_h e.g. NP_t	0.33	1,973,022
NP_t, forms of NP_h	0.33	3,326,957
NP_t like other NP_h	0.31	402,388
NP_h for example NP_t	0.31	2,356,522
adj_sup most NP_h is NP_t	0.31	2,999,877
NP_t or the many NP_h	0.31	15,192
NP_h i.e. NP_t	0.29	2,114,793
NP_h which is similar to NP_t	0.29	63,713
NP_h notably NP_t	0.28	1,154,745
NP_h which are similar to NP_t	0.28	17,304
NP_t which is named NP_h	0.26	19,122
NP_h principally NP_t	0.26	455,578
adj_sup NP_h is NP_t	0.25	10,360,953
NP_h in particular NP_t	0.25	2,354,596
NP_h example of this is NP_t	0.25	14,237
NP_h among them NP_t	0.23	524,784
NP_h mainly NP_t	0.22	4,792,792
NP_h except NP_t	0.22	9,648,662
adj_sup most NP_h are NP_t	0.21	2,357,968
NP_t are examples of NP_h	0.20	2,205,089
NP_h especially NP_t	0.19	20,872,227
adj_sup NP_h are NP_t	0.19	3,755,893
NP_h particularly NP_t	0.19	11,656,254
NP_t, a kind of NP_h	0.18	1,452,822
NP_t, a form of NP_h	0.18	1,127,173
NP_t which sound like NP_h	0.18	32,730
NP_h examples of this are NP_t	0.18	1,515
NP_t sort of NP_h	0.18	7,884,398
NP_h types NP_t	0.17	11,080,276
NP_h compared to NP_t	0.17	346,525
NP_h mostly NP_t	0.16	8,383,063
compare NP_t with NP_h	0.16	340,636
NP_t, one of those NP_h	0.15	99,241
NP_t, one of these NP_h	0.13	53,235
NP_t which look like NP_h	0.13	68,945
NP_h whether NP_t or	0.13	2,800,349
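
The precision estimates in Table 1 can also serve as a simple quality filter. The following is a minimal sketch, assuming a client knows which pattern produced a given match; the class name, the method names and the 0.5 cut-off are hypothetical and not part of the WebIsADb API. Only a handful of rows from Table 1 are copied in.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PatternPrecisionFilter {

    // A few (pattern, estimated precision) rows copied from Table 1.
    static Map<String, Double> samplePrecisions() {
        Map<String, Double> precisions = new LinkedHashMap<>();
        precisions.put("NP_t and any other NP_h", 0.76);
        precisions.put("NP_h such as NP_t", 0.58);
        precisions.put("NP_t is a NP_h", 0.44);
        precisions.put("NP_h especially NP_t", 0.19);
        precisions.put("NP_h whether NP_t or", 0.13);
        return precisions;
    }

    // Keep only the patterns whose estimated precision reaches the threshold.
    static List<String> patternsAtLeast(Map<String, Double> precisions, double minPrecision) {
        return precisions.entrySet().stream()
                .filter(entry -> entry.getValue() >= minPrecision)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // 0.5 is an arbitrary illustrative cut-off, not a recommended value.
        for (String pattern : patternsAtLeast(samplePrecisions(), 0.5)) {
            System.out.println(pattern);
        }
    }
}
```

Raising the threshold trades coverage for precision; the right value depends on the task.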

WebDataCommons framework

To extract a large collection of "isa" relations from the Web, we rely on the largest publicly available Web corpus, i.e. the crawl corpora provided by the CommonCrawl Foundation. The original corpus contains over 2.1 billion crawled web pages, stored in over 38,000 Web ARChive (WARC, ISO 28500:2009) files with a total compressed size of 168 TB. To extract "isa" relations from the crawl corpora efficiently, our relation extraction implementation integrates directly with the framework of the WebDataCommons project.
Extraction and filtering

The extraction of "isa" relations from text is based on Hearst-like patterns. We collected a set of 59 patterns (see Table 1 for the full list) and translated them into regular expressions, which we match against the incoming text. To test the quality of these regular expressions, we extracted a random 1% portion of the entire corpus and analyzed 100 matches per pattern. From that evaluation, we estimated the precision of each pattern, as shown in Table 1. Patterns with a very low precision were excluded before performing the subsequent steps.

Since both the Web and the extraction phase are sources of noise, some post-processing and filtering is required. To strike a sensible trade-off between coverage and precision, we remove only the obvious noise and keep as much coverage as possible. This strategy follows from the idea that some tasks may need "less precise" but "more covering" data. To support use cases where more precision is required, we provide metadata for each tuple, which allows for additional filtering on the client side. As basic filtering techniques, we i) remove duplicates, i.e. tuples that occur more than once under the same pay-level domain; ii) transform all capital letters to lower case and remove all leading and trailing punctuation; iii) remove all quotation marks and apostrophes, since apostrophes are frequently used as a replacement for quotation marks.

The extraction and filtering of the tuples took around 2,200 computing hours and was run on 100 servers in parallel, finishing in less than 24 hours (about 22 hours per server on average).
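
The extraction and basic filtering steps described above can be sketched as follows. This is an illustrative simplification, not the released extractor: the class and method names are hypothetical, only two of the Table 1 patterns are covered, and single words stand in for real noun phrases, which the actual pipeline would have to detect properly.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HearstSketch {

    // Simplified stand-ins for two Table 1 patterns (single words instead of noun phrases).
    static final Pattern SUCH_AS = Pattern.compile("(\\w+) such as (\\w+)"); // NP_h such as NP_t
    static final Pattern IS_A = Pattern.compile("(\\w+) is an? (\\w+)");      // NP_t is a NP_h

    // Filtering steps ii) and iii): lower-case, strip leading and trailing
    // punctuation, then remove quotation marks and apostrophes.
    static String normalize(String s) {
        String t = s.trim().toLowerCase(Locale.ROOT);
        t = t.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
        return t.replaceAll("[\"']", "");
    }

    // Returns normalized {hyponym, hypernym} pairs found in one sentence.
    static List<String[]> extract(String sentence) {
        List<String[]> pairs = new ArrayList<>();
        Matcher m = SUCH_AS.matcher(sentence);
        while (m.find()) {
            pairs.add(new String[] {normalize(m.group(2)), normalize(m.group(1))});
        }
        m = IS_A.matcher(sentence);
        while (m.find()) {
            pairs.add(new String[] {normalize(m.group(1)), normalize(m.group(2))});
        }
        return pairs;
    }

    // Filtering step i): keep a tuple only once per pay-level domain.
    // Each input row is {payLevelDomain, sentence}; each output key is
    // "payLevelDomain<TAB>hyponym<TAB>hypernym".
    static Set<String> dedupe(List<String[]> rows) {
        Set<String> unique = new LinkedHashSet<>();
        for (String[] row : rows) {
            for (String[] pair : extract(row[1])) {
                unique.add(row[0] + "\t" + pair[0] + "\t" + pair[1]);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"example.org", "Fruits such as apples are healthy."});
        rows.add(new String[] {"example.org", "We sell fruits such as Apples."});
        rows.add(new String[] {"example.com", "Rome is a city."});
        for (String key : dedupe(rows)) {
            System.out.println(key);
        }
    }
}
```

In this toy input the two example.org sentences yield the same normalized tuple, so only one copy is kept alongside the tuple from example.com.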
Indexing

To store and access all the extracted relations we created a MongoDB database.

2. Online Demo

To have a direct look at the WebIsA database, a demo Web application is also available here (see Figure 1).

3. Resources

We offer both data and tools to re-generate/access the WebIsADb:

Figure 1: A screenshot of the Web application.

Data

The collection of relations can be downloaded in two formats:
1. MongoDB database:
  
  The database consists of two MongoDB database instances: the first holds the "isa" tuples (without the metadata describing the extraction contexts) and the second holds the extraction contexts (i.e. the pay-level domains and the sentences from which a tuple was extracted). The two databases can be downloaded separately and instantiated on the same machine or on two different machines.
  - Download the database dumps:
    1. Tuples: tuples-webisadb-april-2016.tar.gz (204309024768 bytes when uncompressed)
    2. Contexts: contexts-webisadb-april-2016.tar.gz (117844946944 bytes when uncompressed)
  - Install MongoDB and restore the Tuples and Contexts dumps:
    Download and install the MongoDB server on your machine(s) by following the instructions in the official guide Install MongoDB (we recommend version v3.0.7). To instantiate our dumps on your target MongoDB servers, follow the instructions in the official guide Restore a MongoDB database. We suggest using the "mongorestore" tool, passing as the <path to the backup> argument the path to the folders "tuples-webisadb-april-2016" and "contexts-webisadb-april-2016" of the uncompressed dumps for tuples and contexts respectively.
2. Comma-separated values files:
Java API

The following archive contains the Java API to programmatically query the MongoDB instances of the WebIsADb:
WebIsADb-Java_API-src-maven_project-april-2016.tar.gz
The above package includes a "readme.txt" with the instructions to configure the API, and the "apidocs". Examples are also included in the main method of the file "src/de/unima/webtuples/WebIsADb.java".
Relations extractor

The following archive contains the Java source code of the classes under the namespace "org.webdatacoomons.isadb":
webdatacommons_webisadb-tuple_extractor-src-april_2016.tar.gz
The above package requires the CommonCrawl framework and can be used to re-build a new WebIsADb from fresh CommonCrawl dumps.

4. Citing the WebIsA Database

Feel free to cite one or more of the following papers depending on what you are using.

If you use the database:
Julian Seitner, Christian Bizer, Kai Eckert, Stefano Faralli, Robert Meusel, Heiko Paulheim and Simone Paolo Ponzetto, 2016. A Large Database of Hypernymy Relations Extracted from the Web. Proceedings of the 10th edition of the Language Resources and Evaluation Conference. Portorož, Slovenia;
If you use the web application:
Stefano Faralli, Christian Bizer, Kai Eckert, Robert Meusel and Simone Paolo Ponzetto. A Repository of Taxonomic Relations from the Web. Proceedings of the 15th International Semantic Web Conference (Posters & Demos) 2016.

The WebIsADb and the API are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.

Acknowledgements

This work was partially funded by the Deutsche Forschungsgemeinschaft within the JOIN-T project (research grant PO 1900/1-1). Part of the computational resources used for this work was provided by an Amazon AWS in Education Grant award.

Link	Description
tuplesdb.1.tar.gz	all the tuples grouped by the instance string value
tuplesdb.2.tar.gz	all the tuples grouped by the class string value

Web Data Commons - WebIsA Database
