Generating a "IsA" Database out of the Web
Christian Bizer
Kai Eckert
Stefano Faralli
Robert Meusel
Heiko Paulheim
Simone Paolo Ponzetto

WebIsADb is a publicly available database containing more than 400 million hypernymy relations we extracted from the CommonCrawl web corpus. This collection of relations represents a rich source of knowledge and may be useful for many researchers. We offer the tuple dataset for public download and an application programming interface to help other researchers programmatically query the database.

1. WebIsADb

Our approach to "isa" relation extraction can be divided into three main steps:

Table 1: The list of patterns used for the tuples extraction (NPt indicates the hyponym and NPh the hypernym). For each pattern we also report the estimated precision and the number of resulting matches.

Pattern | Precision | # matches
NPt and any other NPh | 0.76 | 975,735
NPt and other NPh | 0.70 | 45,900,092
NPt or other NPh | 0.70 | 13,392,348
NPt is adjsup NPh | 0.63 | 6,150,245
NPt is adjsup most NPh | 0.63 | 2,286,478
NPh such as NPt | 0.58 | 70,337,543
such NPh as NPt | 0.58 | 5,755,389
NPt are a NPh | 0.57 | 15,141,131
NPt and some other NPh | 0.54 | 296,524
NPt which is called NPh | 0.50 | 119,317
NPt are adjsup most NPh | 0.49 | 860,770
examples of NPh are NPt | 0.45 | 267,764
NPt, kinds of NPh | 0.45 | 4,618,873
NPh including NPt | 0.44 | 80,640,885
NPt is a NPh | 0.44 | 187,644,160
NPh other than NPt | 0.44 | 7,175,087
NPt were a NPh | 0.42 | 3,206,238
NPt are adjsup NPh | 0.41 | 1,393,484
NPt was a NPh | 0.39 | 39,585,428
NPt, one of the NPh | 0.38 | 4,200,376
NPt is example of NPh | 0.36 | 292,706
examples of NPh is NPt | 0.33 | 267,021
NPh e.g. NPt | 0.33 | 1,973,022
NPt, forms of NPh | 0.33 | 3,326,957
NPt like other NPh | 0.31 | 402,388
NPh for example NPt | 0.31 | 2,356,522
adjsup most NPh is NPt | 0.31 | 2,999,877
NPt or the many NPh | 0.31 | 15,192
NPh i.e. NPt | 0.29 | 2,114,793
NPh which is similar to NPt | 0.29 | 63,713
NPh notably NPt | 0.28 | 1,154,745
NPh which are similar to NPt | 0.28 | 17,304
NPt which is named NPh | 0.26 | 19,122
NPh principally NPt | 0.26 | 455,578
adjsup NPh is NPt | 0.25 | 10,360,953
NPh in particular NPt | 0.25 | 2,354,596
NPh example of this is NPt | 0.25 | 14,237
NPh among them NPt | 0.23 | 524,784
NPh mainly NPt | 0.22 | 4,792,792
NPh except NPt | 0.22 | 9,648,662
adjsup most NPh are NPt | 0.21 | 2,357,968
NPt are examples of NPh | 0.20 | 2,205,089
NPh especially NPt | 0.19 | 20,872,227
adjsup NPh are NPt | 0.19 | 3,755,893
NPh particularly NPt | 0.19 | 11,656,254
NPt, a kind of NPh | 0.18 | 1,452,822
NPt, a form of NPh | 0.18 | 1,127,173
NPt which sound like NPh | 0.18 | 32,730
NPh examples of this are NPt | 0.18 | 1,515
NPt sort of NPh | 0.18 | 7,884,398
NPh types NPt | 0.17 | 11,080,276
NPh compared to NPt | 0.17 | 346,525
NPh mostly NPt | 0.16 | 8,383,063
compare NPt with NPh | 0.16 | 340,636
NPt, one of those NPh | 0.15 | 99,241
NPt, one of these NPh | 0.13 | 53,235
NPt which look like NPh | 0.13 | 68,945
NPh whether NPt or | 0.13 | 2,800,349

  1. WebDataCommons framework


    To extract a large collection of "isa" relations from the Web, we rely on the largest publicly available Web corpus, i.e. the crawl corpora provided by the CommonCrawl Foundation. The original corpus contains over 2.1 billion crawled web pages, stored in over 38,000 Web ARChive (WARC, ISO 28500:2009) files with a total packed size of 168 TB. To extract "isa" relations from the crawl corpora efficiently, our relation extraction implementation integrates directly with the framework of the WebDataCommons project.
  2. Extraction and filtering


    The extraction of "isa" relations from text is based on Hearst-like patterns. We collected a set of 59 patterns (see Table 1 for the full list). The patterns are translated into regular expressions, which we match against the incoming text. To test the quality of these regular expressions, we extracted a random 1% portion of the entire corpus and analyzed 100 matches per pattern. From that evaluation we estimated the precision of each pattern, as shown in Table 1. Patterns with a very low precision were excluded before performing the subsequent steps.

    Since both the Web and the extraction phase are sources of noise, some post-processing and filtering is required. To strike a sensible trade-off between coverage and precision, we try to remove only the obvious noise while keeping as much coverage as possible. This strategy comes from the idea that some tasks may need "less precise" but "more covering" data. To support use cases where more precision is required, we provide metadata for each tuple, which allows for additional filtering on the client side. As basic filtering techniques, we:
    i) remove duplicates, i.e. tuples that occur more than once under the same pay-level domain;
    ii) transform all capital letters to lower case and remove all leading and trailing punctuation;
    iii) remove all quotation marks and apostrophes, since apostrophes are frequently used as a replacement for quotation marks.

    The extraction and filtering of the tuples took around 2,200 computing hours; run on 100 servers in parallel, it completed in less than 24 hours.
  3. Indexing


    To store and access all the extracted relations we created a MongoDB database.
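
The extraction and filtering step described above can be illustrated with a minimal Python sketch. It is not the project's extractor: it covers only three of the patterns from Table 1, reduces each noun phrase to a single word token, and applies the three basic filters in an order chosen for convenience. The function names and the simplified regular expressions are assumptions made for illustration.

```python
import re
import string

# Simplified stand-ins for three Table 1 patterns. Assumption: each noun
# phrase (NPt = hyponym, NPh = hypernym) is reduced to one word token.
PATTERNS = {
    "NPh such as NPt": re.compile(r"(?P<h>\w+) such as (?P<t>\w+)"),
    "NPt and other NPh": re.compile(r"(?P<t>\w+) and other (?P<h>\w+)"),
    "NPt is a NPh": re.compile(r"(?P<t>\w+) is an? (?P<h>\w+)"),
}

# Straight and curly quotation marks and apostrophes to drop (filter iii).
QUOTES = "\"'`\u2018\u2019\u201c\u201d"


def normalize(term):
    """Lower-case, drop quotes/apostrophes, strip surrounding punctuation."""
    term = term.lower()
    for quote in QUOTES:
        term = term.replace(quote, "")
    return term.strip(string.punctuation + string.whitespace)


def extract(sentence, pld):
    """Return (instance, class, pattern label, pay-level domain) matches."""
    found = []
    for label, regex in PATTERNS.items():
        for match in regex.finditer(sentence):
            instance = normalize(match.group("t"))
            cls = normalize(match.group("h"))
            found.append((instance, cls, label, pld))
    return found


def deduplicate(tuples):
    """Keep one occurrence of each (instance, class) per pay-level domain."""
    seen = set()
    kept = []
    for instance, cls, label, pld in tuples:
        key = (instance, cls, pld)
        if key not in seen:
            seen.add(key)
            kept.append((instance, cls, label, pld))
    return kept


if __name__ == "__main__":
    raw = (
        extract("Fruits such as apples are healthy.", "example.com")
        + extract("We sell fruits such as apples.", "example.com")
        + extract("Apples and other fruits are sold here.", "example.org")
    )
    for row in deduplicate(raw):
        print(row)
```

The two sentences from the same (made-up) pay-level domain yield the same tuple and are collapsed into one, while the match from the second domain is kept, which mirrors the per-domain duplicate removal described above.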

2. Online Demo

A demo Web application for browsing the WebIsA database directly is also available here (see Figure 1).

3. Resources

We offer both data and tools to re-generate/access the WebIsADb:

    Figure 1: A screenshot of the Web application.

  1. Data


    The collection of relations can be downloaded in two formats:
    1. MongoDB database:


      The database consists of two MongoDB database instances: the first holds the "isa" tuples (without the metadata on the extraction contexts) and the second holds the extraction contexts (i.e. the pay-level domains and the sentences from which a tuple was extracted). The two databases can be downloaded separately and instantiated on the same machine or on two different machines.
      • Download the database dumps:
        1. Tuples: tuples-webisadb-april-2016.tar.gz (204309024768 bytes when uncompressed)
        2. Contexts: contexts-webisadb-april-2016.tar.gz (117844946944 bytes when uncompressed)
      • Install MongoDB and restore the Tuples and Contexts dumps:
        Download and install the MongoDB server (we recommend version v3.0.7) on your machine or machines, following the instructions in the official guide Install MongoDB. To instantiate our dumps on your target MongoDB servers, follow the instructions in the official guide Restore a MongoDB database. We suggest using the "mongorestore" tool, passing as the <path to the backup> argument the path to the folders "tuples-webisadb-april-2016" and "contexts-webisadb-april-2016" of the uncompressed dumps for tuples and contexts respectively.
    2. Comma-separated values files:


      • Tuples CSV files:
        link | description
        tuplesdb.1.tar.gz | all the tuples grouped by the instance string value
        tuplesdb.2.tar.gz | all the tuples grouped by the class string value
      • The two archives above each contain a set of tar files. Each tar file contains a set of "csv" files with the following format:
        the first line of each csv file contains a comma-separated list of the field names, i.e. "_id,instance,class,frequency,pidspread,pldspread,modifications".
        The following lines respect the schema of the header line, e.g.:
        "286980418,aang,character,41,11,32,modification"
        where:
        • "286980418": the record identifier of the tuple as in the mongodb instance;
        • "aang": the instance string value;
        • "character": the class string value;
        • "41": number of matches;
        • "11": number of matching patterns;
        • "32": number of matching pay-level domains;
        • modification: a placeholder for the value of the "modifications" field, which is a JSON representation of the list of all the variants of the relation "(aang,character)". Each item of the JSON representation includes:
          • ipremod: the pre-modifier of the "instance";
          • ipostmod: the post-modifier of the "instance";
          • cpremod: the pre-modifier of the "class";
          • cpostmod: the post-modifier of the "class";
          • frequency: the frequency of the "isa" relation "(ipremod +"aang"+ipostmod, cpremod+"character"+cpostmod)";
          • pidspread: the number of matching patterns of the relation "(ipremod +"aang"+ipostmod,cpremod+"character"+cpostmod)";
          • pldspread: the number of pay level domains where we extracted the relation "(ipremod +"aang"+ipostmod,cpremod+"character"+cpostmod)";
          • pids: a semicolon separated list of the matching pattern labels (e.g. "p1,p25,p10,p8a,p3a" corresponding to the following patterns: "NPt and other NPh", "NPh except NPt", "such NPh as NPt", "NPt is a NPh" and "NPh including NPt" respectively);
          • plds: a semicolon separated list of the corresponding pay-level domains (e.g. including "appszoom.com");
          • provids: a semicolon separated list of context ids, which can be used to retrieve the whole context of the extractions (e.g. including the context identifier "383416952", which can be used to search for the corresponding context in the resource described in the next paragraph).

      • Contexts CSV files:
        link | description
        contextsdb.tar.gz | all the contexts of the extracted tuples
        The above archive contains a set of tar files. Each tar file contains a set of "csv" files with the following format:
        the first line of each csv file contains a comma-separated list of the field names, i.e. "_id,provid,sentence,pld".
        The following lines respect the schema of the header line, e.g.:
        "93459608,383416952,"This application has this option included too. Movie and TV series feature such characters as Aang, Prince Zuko, Katara, Sokka, Uncle Iroh, Commander Zhao, Fire Lord Ozai, Princess Yue, Katara's Grandma, Master Pakku, Monk Gyatso, Azula, Old Man in Temple, Zhao's Assistant, Earthbending Father.",appszoom.com"
        where:
        • "93459608": the record identifier of the context as in the mongodb instance;
        • "383416952": the context id that may be included in the "provids" field of a tuple;
        • "This application has this option included too. Movie and TV series feature such characters as Aang, Prince Zuko, Katara, Sokka, Uncle Iroh, Commander Zhao, Fire Lord Ozai, Princess Yue, Katara's Grandma, Master Pakku, Monk Gyatso, Azula, Old Man in Temple, Zhao's Assistant, Earthbending Father.": the sentence of the context;
        • "appszoom.com": the pay-level domain.
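
As a rough illustration of how the two CSV resources fit together, the following Python sketch reads a tuples file, applies a simple client-side filter on the "pidspread" and "pldspread" metadata, and looks up the contexts referenced by each tuple through "provids". Several things here are assumptions rather than documented facts: that the "modifications" column is a quoted JSON list of objects using the field names listed above, that the CSV quoting follows the conventions of Python's csv module, and all helper names. The sample rows are built in memory from the example values above; the sentence is abridged and the per-variant counts are illustrative only.

```python
import csv
import io
import json


def make_sample_files():
    """Build tiny in-memory stand-ins for one tuples and one contexts file."""
    modifications = [{
        "ipremod": "", "ipostmod": "", "cpremod": "", "cpostmod": "",
        "frequency": 1, "pidspread": 1, "pldspread": 1,  # illustrative values
        "provids": "383416952",  # other per-variant fields omitted
    }]
    tuples = io.StringIO()
    writer = csv.writer(tuples)
    writer.writerow(["_id", "instance", "class", "frequency",
                     "pidspread", "pldspread", "modifications"])
    writer.writerow(["286980418", "aang", "character", 41, 11, 32,
                     json.dumps(modifications)])
    contexts = io.StringIO()
    writer = csv.writer(contexts)
    writer.writerow(["_id", "provid", "sentence", "pld"])
    writer.writerow(["93459608", "383416952",
                     "Movie and TV series feature such characters as Aang, "
                     "Prince Zuko, Katara.",  # abridged example sentence
                     "appszoom.com"])
    tuples.seek(0)
    contexts.seek(0)
    return tuples, contexts


def load_contexts(handle):
    """Index context rows by their 'provid' value."""
    return {row["provid"]: row for row in csv.DictReader(handle)}


def load_tuples(handle, min_pidspread=1, min_pldspread=1):
    """Yield tuple rows that meet the pattern and domain spread thresholds."""
    for row in csv.DictReader(handle):
        for field in ("frequency", "pidspread", "pldspread"):
            row[field] = int(row[field])
        if row["pidspread"] < min_pidspread or row["pldspread"] < min_pldspread:
            continue
        # Assumption: the column holds a JSON list of variant objects.
        row["modifications"] = json.loads(row["modifications"])
        yield row


def contexts_for(tuple_row, contexts):
    """Collect the context rows referenced by any variant of the tuple."""
    found = []
    for variant in tuple_row["modifications"]:
        for provid in filter(None, variant.get("provids", "").split(";")):
            if provid in contexts:
                found.append(contexts[provid])
    return found


if __name__ == "__main__":
    tuples_file, contexts_file = make_sample_files()
    contexts = load_contexts(contexts_file)
    for row in load_tuples(tuples_file, min_pidspread=2, min_pldspread=2):
        for ctx in contexts_for(row, contexts):
            print(row["instance"], row["class"], ctx["pld"], ctx["sentence"])
```

Raising the two thresholds trades coverage for precision on the client side; suitable values depend on the task and are not prescribed here.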
  2. Java API


    The following archive contains the Java API to programmatically query the MongoDB instances of the WebIsADb:
    WebIsADb-Java_API-src-maven_project-april-2016.tar.gz
    The above package includes a "readme.txt" with the instructions to configure the API, and the "apidocs". Examples are also included in the main method of the file "src/de/unima/webtuples/WebIsADb.java".
  3. Relations extractor


    The following archive contains the Java source code of the classes under the namespace "org.webdatacoomons.isadb":
    webdatacommons_webisadb-tuple_extractor-src-april_2016.tar.gz
    The above package requires the CommonCrawl framework and can be used to re-build a new WebIsADb from fresh CommonCrawl dumps.

4. Citing the WebIsA Database


Feel free to cite one or more of the following papers depending on what you are using.

Acknowledgements

This work was partially funded by the Deutsche Forschungsgemeinschaft within the JOIN-T project (research grant PO 1900/1-1). Part of the computational resources used for this work were provided by an Amazon AWS in Education Grant award.