This page offers the WDC Block benchmark for download. WDC Block is a benchmark for comparing the performance of
blocking methods that are used as part of entity resolution pipelines. WDC Block features a maximum Cartesian
product of 200 billion pairs as well as training sets of different sizes for evaluating supervised blockers.
We use WDC Block to evaluate several state-of-the-art blocking systems,
including CTT, Auto, JedAI, Sudowoodo, SBERT, BM25, and SC-Block.
- 2023-06-22: We have released WDC Block, a benchmark for comparing the performance of blocking
methods. WDC Block features a maximum Cartesian product of 200 billion pairs of product offers which were
extracted from 3,259 e-shops.
- 2023-02-15: We have released an initial version of WDC Block.
Entity resolution aims to identify records in two datasets (A and B) that describe the same real-world entity.
Since comparing all record pairs between two datasets can be computationally expensive, entity resolution is
approached in two steps: blocking and matching.
Blocking applies a computationally cheap method to remove non-matching record pairs and produces a smaller set
of candidate record pairs, reducing the workload of the matcher.
During matching, a more expensive pair-wise matcher produces a final set of matching record pairs [2, 8].
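The two-step pipeline described above can be sketched with a toy example (all record values, the token-overlap blocker, and the Jaccard matcher below are illustrative, not the actual benchmark methods):

```python
# Toy two-step entity resolution pipeline: a cheap blocker first prunes
# the Cartesian product A x B, then a more expensive matcher scores the
# surviving candidate pairs.
A = [{"id": 1, "title": "nikon af-s nikkor 50mm lens"},
     {"id": 2, "title": "crucial 4gb ddr3 sodimm module"}]
B = [{"id": 10, "title": "nikon nikkor 50mm f1.4g lens"},
     {"id": 11, "title": "kingston 8gb ddr4 dimm"}]

def block(a_records, b_records):
    """Cheap blocker: keep pairs sharing at least one title token."""
    candidates = []
    for a in a_records:
        for b in b_records:
            if set(a["title"].split()) & set(b["title"].split()):
                candidates.append((a["id"], b["id"]))
    return candidates

def match(a_records, b_records, candidates, threshold=0.5):
    """More expensive matcher (here: Jaccard similarity on title tokens)."""
    by_id_a = {r["id"]: r for r in a_records}
    by_id_b = {r["id"]: r for r in b_records}
    matches = []
    for a_id, b_id in candidates:
        ta = set(by_id_a[a_id]["title"].split())
        tb = set(by_id_b[b_id]["title"].split())
        if len(ta & tb) / len(ta | tb) >= threshold:
            matches.append((a_id, b_id))
    return matches

candidates = block(A, B)   # far fewer pairs than the full A x B
matches = match(A, B, candidates)
```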
Existing benchmark datasets for blocking and matching are rather small with respect to the Cartesian product AxB
for comparing all records and the vocabulary size.
If blockers are evaluated only on these small datasets, effects resulting from a high number of records or
from a large vocabulary size (large number of unique tokens that need to be indexed) may be missed.
The Web Data Commons Block (WDC-Block) is a new blocking benchmark that provides much larger datasets and thus
requires blockers that address these scalability challenges.
Additionally, we provide three development sets with different sizes (~1K pairs, ~5K pairs & ~20K pairs) to
experiment with different amounts of training data for the blockers.
WDC Block is based on product data that has been extracted
in 2020 from 3,259 e-shops that mark up product offers within their
HTML pages using the schema.org vocabulary. The largest variant of WDC Block uses offers from 2 million
different products. Multiple offers referring to the same product are identified based on GTIN and MPN numbers
provided by the e-shops.
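Grouping offers into product clusters via shared identifiers such as GTIN can be sketched as follows (the field names and values are illustrative, not the actual corpus schema):

```python
# Offers that carry the same GTIN are assigned to the same product
# cluster, i.e. they are treated as matches.
from collections import defaultdict

offers = [
    {"offer_id": 1, "gtin": "0012345678905"},
    {"offer_id": 2, "gtin": "0012345678905"},  # same product, other shop
    {"offer_id": 3, "gtin": "4006381333931"},
]

clusters = defaultdict(list)
for offer in offers:
    clusters[offer["gtin"]].append(offer["offer_id"])
```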
2 Benchmark Creation
The WDC Block benchmark was created in four steps: (i) we select a difficult variant of the WDC Products
entity matching benchmark as the seed dataset for WDC Block,
(ii) we split the dataset into two separate datasets A and B, (iii) we enlarge the datasets by adding offers for
additional non-matching products from the WDC Product Data Corpus
V2020, and (iv) we prepare three development sets (~1K pairs, ~5K pairs & ~20K pairs). This section
gives an overview of the four steps.
- We choose the large, pairwise training set with 80% corner cases and 20% random pairs, together with the
50%-unseen test set, from the WDC Products
entity matching benchmark as the seed dataset for our benchmark.
We selected the large version to start with a large number of product offers that are difficult to match,
and the 50% seen / 50% unseen test set to have a
trade-off between product offers that are part of the
training set and product offers that the blockers did not see during training.
- We split the original record pairs into two datasets A and B to follow the common setup of entity
matching datasets in the related work. Both
datasets A and B are deduplicated to obtain clean datasets.
- Depending on the specific benchmark size (small, medium, large), we populate the datasets A and B with additional
randomly selected offers from the WDC Product Data Corpus V2020.
We make sure that the randomly selected records do not match any of the existing records in the datasets to
avoid introducing additional matching pairs.
- We derive the train, validation & test splits and convert them into the
format of the entity matching datasets used in the related work.
For the large development set (~20k pairs), the initially derived pairs are retained.
For the medium development set (~5k pairs) & small development set (~1k pairs), the train & validation sets
are down-sampled such that their distributions are comparable to those of the large development set.
The test set remains the same for all versions of this benchmark to make the results comparable.
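The enlargement step (iii) above can be sketched with pandas (the column names and the use of `cluster_id` as the match criterion are illustrative assumptions):

```python
# Sketch of step (iii): add randomly selected filler offers from the
# corpus while making sure they introduce no new matching pairs. Here
# two records match iff they share a cluster_id.
import pandas as pd

table_a = pd.DataFrame({"id": [1, 2], "cluster_id": [100, 101]})
corpus = pd.DataFrame({"id": [3, 4, 5], "cluster_id": [100, 200, 201]})

# Keep only corpus offers whose cluster does not yet occur in table A,
# so no additional matching pairs are introduced.
fillers = corpus[~corpus["cluster_id"].isin(table_a["cluster_id"])]
enlarged_a = pd.concat([table_a, fillers], ignore_index=True)
```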
3 Benchmark Profiling
The WDC Block benchmark consists of a total of 2,073,224 real-world products across all subsets which
are described by 2,100,000 product offers.
Each product offer in WDC Block has five attributes: brand, title, description, price, and priceCurrency.
WDC Block comes with nine configurations, which are derived from the two dimensions dataset size (small, medium,
large) described in Table 1 and development set size (~1k, ~5k, ~20k) described in Table 2.
Table 1 shows the number of records in Table A and Table B, the number of positive and negative pairs in the
test set as well as the vocabulary size and the Cartesian product of the different dataset sizes.
The vocabulary size represents the number of unique tokens after concatenating the attribute values of the
product offers in Table A and Table B and tokenizing the concatenated attribute values by whitespace.
The Cartesian product is the maximum number of record comparisons between Table A and Table B (AxB).
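The two dataset-size statistics can be computed as sketched below (the record texts are toy values; in the benchmark the texts are the concatenated attribute values of the offers):

```python
# Vocabulary size = number of unique whitespace tokens over the
# concatenated attribute values of Tables A and B.
# Cartesian product = |A| * |B|, the maximum number of comparisons.
table_a = ["nikon 50mm lens", "crucial 4gb module"]
table_b = ["nikon nikkor lens"]

vocabulary = set()
for record_text in table_a + table_b:
    vocabulary.update(record_text.split())

vocabulary_size = len(vocabulary)                # unique tokens in A and B
cartesian_product = len(table_a) * len(table_b)  # maximum comparisons AxB
```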
Table 1: Benchmark Statistics (records in Table A and Table B, positive and negative test pairs, vocabulary size, and Cartesian product for WDC Block small, medium, and large).
Table 2: Development Set Statistics (positive and negative pairs in the train and validation sets for the ~1k, ~5k, and ~20k development sets).
4 Experiments
To demonstrate the usefulness of WDC Block, we evaluate the blocking methods BM25, BM25 with tri-grams,
Barlow Twins, and SC-Block on WDC Block.
Detailed explanations of the experiments can be found in the corresponding paper.
4.1 Blocking Systems
BM25: BM25 is a sparse bag-of-words retrieval model.
It is an unsupervised nearest neighbour blocker that uses a vector space model and the BM25 term weighting
scheme to compute a similarity score for the nearest neighbour search.
Recent work shows the effectiveness of BM25 for blocking.
We evaluate BM25 with whitespace tokenization, referred to as BM25, and BM25 with tri-grams, referred to as BM25-3.
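The BM25 term weighting used for the nearest neighbour search can be sketched from scratch as below; this is the standard Okapi BM25 formula with its usual k1 and b parameters, not the optimized implementation used in the experiments:

```python
# Minimal Okapi BM25 scorer for nearest neighbour blocking.
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score every tokenized document in `docs` against the query tokens."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n          # average document length
    df = Counter()                                 # document frequencies
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)                            # term frequencies
        score = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

# Blocking step: for each query record, keep the highest-scoring records.
docs = [t.split() for t in ["nikon nikkor 50mm lens", "crucial ddr3 module"]]
scores = bm25_scores("nikon 50mm lens camera".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
```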
CTT: Cross Tuple Training (CTT) uses fasttext to embed the tokens of the record texts
and aggregates these token embeddings into a single embedding.
For CTT, the embeddings are sent through a Siamese summarizer. Then, a classifier learns to detect matches
based on the element-wise difference of the created embeddings.
CTT is trained on synthetically produced training data derived from the two blocked datasets.
The trained encoder embeds the record texts for the nearest neighbour search.
Auto: Autoencoder (Auto) uses fasttext to embed the tokens of the record texts and
aggregates these token embeddings into a single embedding.
For Auto, the embeddings are sent through an autoencoder. Auto is self-supervised and requires no labelled data.
The trained encoder embeds the record texts for the nearest neighbour search.
Barlow Twins: The use of Barlow Twins (BT) for blocking is inspired by Sudowoodo.
During self-supervised training a batch of record texts is built. Each record is duplicated, and both the
original record text and the duplicate are augmented by dropping a random token of the record text.
The record texts are embedded through a pre-trained RoBERTa language model and mean-pooled. For BT, a linear
layer projects the embeddings to 4096 dimensions.
During training, the empirical cross-correlation of the augmented originals and the augmented duplicates is
measured and moved close to the identity matrix.
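The Barlow Twins objective described above can be sketched with NumPy; this is an illustrative version of the loss (the paper's setup projects RoBERTa embeddings to 4096 dimensions, which is omitted here):

```python
# Barlow Twins loss sketch: push the empirical cross-correlation matrix
# of the two augmented views towards the identity matrix.
import numpy as np

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """z_a, z_b: (batch, dim) embeddings of the two augmented views."""
    # Standardize each embedding dimension over the batch.
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-8)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-8)
    c = z_a.T @ z_b / z_a.shape[0]                  # cross-correlation matrix
    on_diag = ((1.0 - np.diag(c)) ** 2).sum()       # pull diagonal towards 1
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()  # push rest towards 0
    return on_diag + lam * off_diag

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 4))
# Identical views yield a (near-)identity correlation matrix, hence a low loss;
# unrelated views yield a much higher loss.
low = barlow_twins_loss(z, z)
high = barlow_twins_loss(z, rng.normal(size=(8, 4)))
```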
SimCLR: The use of SimCLR for blocking is inspired by Sudowoodo.
During self-supervised training of both models, a batch of record texts is built.
Each record is duplicated, and both the original record text and the duplicate are augmented by dropping a
random token of the record text.
The record texts are embedded through a pre-trained RoBERTa language model and mean-pooled.
SimCLR maximizes the agreement of embeddings representing the two previously augmented records and minimizes the
agreement of embeddings representing distinct records within a batch of records.
JedAI: JedAI defines a blocking key value (BKV) and blocks records that share six-grams of their BKVs;
it then prunes the resulting blocks by removing pairs with a low matching likelihood to obtain the candidate set.
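The six-gram blocking idea can be sketched as follows (the record texts are toy values, and the pruning step is omitted for brevity):

```python
# q-gram blocking (here q = 6): records that share at least one
# six-gram of their blocking key value end up in the same block.
from collections import defaultdict

def six_grams(text):
    return {text[i:i + 6] for i in range(len(text) - 5)}

records = {1: "nikon nikkor 50mm", 2: "nikkor 50mm lens", 3: "ddr3 sodimm"}

blocks = defaultdict(set)
for rid, text in records.items():
    for gram in six_grams(text):
        blocks[gram].add(rid)

# Records 1 and 2 share e.g. the six-gram "nikkor" and are blocked together.
shared = any(ids >= {1, 2} for ids in blocks.values())
```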
SBERT: Sentence-Bert (SBERT) is a sentence embedding framework. Using SBERT's
framework we train a supervised nearest neighbour blocker. The blocker requires labelled pairs of matching and
non-matching records for training. It embeds record text pairs in a siamese fashion using a pre-trained RoBERTa
language model and mean-pools the embeddings.
SC-Block: SC-Block uses supervised contrastive learning to position record embeddings describing the
same real-world entity close to each other in an embedding space.
During the supervised training of the model, a batch of record texts is built.
SC-Block maximizes the agreement of embeddings representing the same real-world entity according to the training
set and minimizes the agreement of embeddings representing distinct records within a batch of records.
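A simplified sketch of the supervised contrastive objective behind SC-Block, written with NumPy (this is an illustrative version of the loss, not the paper's implementation; the temperature value is the common default):

```python
# Supervised contrastive loss sketch: embeddings of records with the
# same label (same real-world entity) are pulled together, all other
# records in the batch are pushed apart.
import numpy as np

def supcon_loss(z, labels, temperature=0.07):
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit vectors
    sim = z @ z.T / temperature                        # scaled similarities
    n = len(labels)
    loss = 0.0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        others = [j for j in range(n) if j != i]
        # log of the softmax denominator over all other records in the batch
        denom = np.log(np.sum(np.exp(sim[i, others])))
        loss += -sum(sim[i, j] - denom for j in positives) / len(positives)
    return loss / n

labels = [0, 0, 1, 1]
# Same-entity embeddings close together -> low loss.
z_close = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]])
# Same-entity embeddings far apart -> high loss.
z_mixed = np.array([[1.0, 0.0], [0.0, 1.0], [0.99, 0.1], [0.1, 0.99]])
loss_aligned = supcon_loss(z_close, labels)
loss_mixed = supcon_loss(z_mixed, labels)
```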
4.2 Benchmark Results for nearest neighbour blockers with 𝑘 = 5
We first analyze all nearest neighbour blockers with a fixed number of nearest neighbours 𝑘 = 5.
We use recall and precision to evaluate the candidate sets with respect to the test sets of the datasets.
By fixing the hyperparameter 𝑘, differences in recall and precision become visible that are not visible if 𝑘 is
tuned individually for each blocker.
The recall and precision results in Table 3 show that 𝑘 = 5 is not sufficient for the benchmarked blockers to achieve
a recall above 75%.
The results also show that both increasing the dataset size and decreasing the development set size increase the
difficulty of the benchmark.
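Evaluating a candidate set against labelled test pairs amounts to the following computation (the pairs are toy values):

```python
# Recall = fraction of true matches that survive blocking.
# Precision = fraction of candidate pairs that are true matches.
true_matches = {(1, 10), (2, 11), (3, 12)}
candidates = {(1, 10), (2, 11), (2, 12), (4, 13)}

found = true_matches & candidates
recall = len(found) / len(true_matches)    # 2 of 3 matches survive
precision = len(found) / len(candidates)   # 2 of 4 candidates are matches
```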
Table 3: Recall (R) and Precision (P) of the candidate sets generated by all nearest neighbour
blockers with 𝑘 = 5 on the test sets
of the datasets. The highest Recall and Precision values are marked in bold. ’timeout’ indicates a timeout
after 48h and ’OOM’ indicates an out-of-memory error.
4.3 Benchmark Results for nearest neighbour blockers with recall >99.5% on validation set
We analyze how tuning the hyperparameter 𝑘 affects the recall of the nearest neighbour blockers on WDC Block.
Increasing 𝑘 increases the probability of finding a matching pair, resulting in a higher
recall. Higher values of 𝑘 also produce larger candidate sets because the matcher has more candidate pairs to
compare, which increases the workload of the matching phase of the entity resolution pipeline. Our main goal in
tuning 𝑘 is to add all matching pairs to the candidate set
while keeping the candidate set as small as possible. To achieve this goal, we evaluate each nearest neighbour
blocker with increasing
values of 𝑘, starting with 𝑘 = 1, on the validation set. Once the recall of the candidate set exceeds 99.5% on the
validation set, 𝑘 is found.
To limit the search space, we set a maximum value of 𝑘 = 50 on WDC Block small, 𝑘 = 100 on WDC Block medium, and
𝑘 = 200 on WDC Block large.
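The tuning procedure can be sketched as a simple loop (the `nearest_neighbours` blocker interface and the toy data below are hypothetical):

```python
# Grow k until the candidate set reaches 99.5% recall on the validation
# set, or the per-dataset cap k_max is hit.
def tune_k(nearest_neighbours, queries, val_matches, k_max=50):
    for k in range(1, k_max + 1):
        candidates = {(q, n) for q in queries
                      for n in nearest_neighbours(q, k)}
        recall = len(candidates & val_matches) / len(val_matches)
        if recall > 0.995:
            return k, recall
    return k_max, recall

# Toy blocker: the neighbours of query q are q, q + 1, q + 2, ...
def toy_neighbours(q, k):
    return list(range(q, q + k))

k, recall = tune_k(toy_neighbours, queries=[1, 2],
                   val_matches={(1, 2), (2, 3)}, k_max=5)
```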
The results in Table 4 show how challenging WDC Block is for the blockers: except for SC-Block with the large
development set, all blockers fail to generate candidate sets that exceed the 99.5% recall threshold on the
validation set.
WDC Block large is the most challenging benchmark dataset due to the high number of records and the large
vocabulary of unique tokens.
The blockers BM25-3, JedAI, Auto, and CTT fail to generate candidate sets on WDC Block large due
to time and memory constraints.
Table 4: 𝑘 per blocker and dataset after tuning 𝑘 for recall 99.5% on the respective validation
set. Recall (R) and candidate set
size (|C|) of all blockers on the test set. The highest recall, as well as the lowest 𝑘 and |C| values per
dataset, are marked in bold.
’timeout’ indicates a timeout after 48h and ’OOM’ indicates an out-of-memory error.
We offer the WDC Block benchmark for public download. The benchmark is available as
a single zip file for each configuration. Each zip file contains the two datasets A and B as well as the
train, validation and test splits.
The code used for the creation of the corpus can be found on GitHub.
The files are represented using the CSV format and can, for example, be easily processed using the pandas Python library.
import pandas as pd

# Load any of the benchmark CSV files, e.g. dataset A, B, or a split
df = pd.read_csv('file_name.csv')
id: An integer id that uniquely identifies an offer.
brand: The brand attribute, usually a short string with a median of 1 word, e.g.
"Nikon", "Crucial", ...
title: The title attribute, a string with a median of 8 words, e.g. "Nikon AF-S
NIKKOR 50mm f1.4G Lens", "Crucial 4GB (1x4GB) DDR3l PC3-12800 1600MHz SODIMM Module", ...
description: The description attribute, a longer string with a median length of 271 characters.
price: The price attribute, a string containing a number, e.g. "749", "20.57", ...
priceCurrency: The priceCurrency attribute, usually containing a three character
string, e.g. "AUD", "GBP"
cluster_id: The integer cluster_id referring to the cluster an offer belongs to. All
offers in a cluster refer to the same real-world entity.
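Given the cluster_id column, the matching pairs between datasets A and B can be derived with a join (the toy DataFrames below stand in for the downloaded CSV files):

```python
# Offers in the same cluster refer to the same real-world product, so
# an inner join on cluster_id yields all matching (A, B) offer pairs.
import pandas as pd

table_a = pd.DataFrame({"id": [1, 2], "cluster_id": [100, 101]})
table_b = pd.DataFrame({"id": [10, 11], "cluster_id": [101, 300]})

matches = table_a.merge(table_b, on="cluster_id", suffixes=("_a", "_b"))
pairs = list(zip(matches["id_a"], matches["id_b"]))
```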
Please send questions and feedback to the Web Data
Commons Google Group.
More information about Web Data Commons can be found here.
References
[1] A. Brinkmann, R. Shraga, and C. Bizer, 2023, "SC-Block: Supervised Contrastive Blocking within Entity Resolution Pipelines", arXiv:2303.03132 [cs].
[2] P. Christen, 2012, "Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection", Springer, Berlin, Heidelberg.
[3] V. Christophides, V. Efthymiou, T. Palpanas, G. Papadakis, and K. Stefanidis, 2021, "An Overview of End-to-End Entity Resolution for Big Data", in ACM Computing Surveys, vol. 53, no. 6, pp. 1–42.
[4] S. Thirumuruganathan, H. Li, N. Tang, M. Ouzzani, Y. Govind, D. Paulsen, G. Fung, and A. Doan, 2021, "Deep learning for blocking in entity matching: a design space exploration", in Proceedings of the VLDB Endowment, vol. 14, no. 1, pp. 2459–2472.
[5] N. Reimers and I. Gurevych, 2019, "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", in Proceedings of Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992.
[6] R. Wang, Y. Li, and J. Wang, 2022, "Sudowoodo: Contrastive Self-supervised Learning for Multi-purpose Data Integration and Preparation".
[7] H. Köpcke, A. Thor, and E. Rahm, 2010, "Evaluation of entity resolution approaches on real-world match problems", in Proceedings of the VLDB Endowment, vol. 3, no. 1, pp. 484–493.
[8] G. Papadakis, D. Skoutas, E. Thanos, and T. Palpanas, 2021, "Blocking and Filtering Techniques for Entity Resolution: A Survey", in ACM Computing Surveys, vol. 53, no. 2, pp. 1–42.
[9] V. Efthymiou, G. Papadakis, G. Papastefanatos, K. Stefanidis, and T. Palpanas, 2017, "Parallel meta-blocking for scaling entity resolution over big heterogeneous data", in Information Systems, vol. 53, pp. 137–157.
[10] D. Paulsen, Y. Govind, and A. Doan, "Sparkly: A Simple yet Surprisingly Strong TF/IDF Blocker for Entity Matching", in Proceedings of the VLDB Endowment, vol. 16, pp. 1507–1519.