This page offers the WDC Block benchmark for download. WDC Block is a benchmark for comparing the performance of blocking methods that are used as part of entity resolution pipelines. WDC Block features a maximum Cartesian product of 200 billion pairs as well as training sets of different sizes for evaluating supervised blockers. We use WDC Block to evaluate several state-of-the-art blocking systems, including CTT, Auto, JedAI, Sudowoodo, SBERT, BM25 and SC-Block.

News

2023-06-22: We have released WDC Block, a benchmark for comparing the performance of blocking methods. WDC Block features a maximal Cartesian product of 200 billion pairs of product offers, which were extracted from 3,259 e-shops.
2023-02-15: We have released an initial version of WDC Block.

1 Introduction
2 Benchmark Creation
3 Benchmark Profiling
4 Experiments
5 Downloads
6 Feedback
7 References

1 Introduction

Entity resolution aims to identify records in two datasets (A and B) that describe the same real-world entity [2,3,4]. Since comparing all record pairs between two datasets can be computationally expensive, entity resolution is approached in two steps: blocking and matching. Blocking applies a computationally cheap method to remove non-matching record pairs and produces a smaller set of candidate record pairs, reducing the workload of the matcher. During matching, a more expensive pair-wise matcher produces the final set of matching record pairs [2,8]. Existing benchmark datasets for blocking and matching are rather small with respect to both the Cartesian product AxB of all record comparisons and the vocabulary size [7]. If blockers are evaluated only on these small datasets, effects resulting from a high number of records or from a large vocabulary (a large number of unique tokens that need to be indexed) may be missed. Web Data Commons Block (WDC Block) is a new blocking benchmark that provides much larger datasets and thus requires blockers that address these scalability challenges. Additionally, we provide three development sets of different sizes (~1k pairs, ~5k pairs & ~20k pairs) to experiment with different amounts of training data for the blockers.
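To make the role of blocking concrete, the following minimal sketch contrasts the size of the Cartesian product with the candidate set of a very simple token-overlap blocker. It is an illustration only and not one of the blockers evaluated on this page; the function names and the toy records are assumptions made for this example.

```python
from collections import defaultdict


def tokenize(text):
    """Lower-case a string and split it on whitespace."""
    return set(text.lower().split())


def token_blocking(table_a, table_b):
    """Return all pairs (id_a, id_b) whose records share at least one token."""
    # Inverted index: token -> ids of records in table B containing it.
    index = defaultdict(set)
    for id_b, text in table_b.items():
        for token in tokenize(text):
            index[token].add(id_b)

    candidates = set()
    for id_a, text in table_a.items():
        for token in tokenize(text):
            for id_b in index.get(token, ()):
                candidates.add((id_a, id_b))
    return candidates


# Hypothetical toy records (not taken from WDC Block).
table_a = {"a1": "Acme USB-C cable 2m", "a2": "Zenix wireless mouse"}
table_b = {
    "b1": "acme usb-c charging cable",
    "b2": "Orion desk lamp",
    "b3": "zenix mouse pad",
}

all_pairs = len(table_a) * len(table_b)       # size of the Cartesian product AxB
candidates = token_blocking(table_a, table_b)  # pairs passed on to the matcher
print(f"Cartesian product: {all_pairs}, candidate pairs: {len(candidates)}")
```

The matcher would only need to compare the pairs in `candidates` instead of all pairs in AxB. On large datasets with a large vocabulary, the inverted index itself becomes expensive to build and query, which is the scalability effect the benchmark is meant to expose.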

WDC Block is based on product data that was extracted in 2020 from 3,259 e-shops that mark up product offers within their HTML pages using the schema.org vocabulary. The largest variant of WDC Block uses offers from 2 million different products. Multiple offers referring to the same product are identified based on GTIN and MPN numbers provided by the e-shops.

2 Benchmark Creation

The WDC Block benchmark was created in four steps: (i) we select a difficult variant of the WDC Products entity matching benchmark as the seed dataset for WDC Block, (ii) we split the dataset into two separate datasets A and B, (iii) we enlarge the datasets by adding offers for additional non-matching products from the WDC Product Data Corpus V2020, (iv) we prepare three development sets (~1k pairs, ~5k pairs & ~20k pairs). This section gives an overview of the four steps.

We choose the large, pairwise training set with 80% corner cases and 20% random pairs and the 50%-unseen test set from the WDC Products entity matching benchmark as the seed dataset for our benchmark. We selected the large version to start with a large number of product offers that are difficult to match, and the 50% seen / 50% unseen test set to have a trade-off between product offers that are part of the training set and product offers that the blockers did not see during training.
We split the original record pairs into two datasets A and B to follow the common setup of entity matching datasets in the related work. Both datasets A and B are deduplicated to obtain clean datasets.
Depending on the specific benchmark (small, medium, large), we populate the datasets A and B with additional randomly selected offers from the WDC Product Data Corpus V2020. We make sure that the randomly selected records do not match any of the existing records in the datasets to avoid introducing additional matching pairs.
We derive the train, validation & test splits and transfer them to the format of the entity matching datasets in the related work. For the large development set (~20k pairs), the initially derived pairs are retained. For the medium development set (~5k pairs) & small development set (~1k pairs), the train & validation sets are down-sampled such that their distributions remain comparable to those of the large development set. The test set remains the same for all versions of this benchmark to make the results comparable.
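The exact down-sampling procedure is not described on this page. One plausible reading is a sampling that is stratified by label, so that the ratio of positive to negative pairs stays roughly the same. The sketch below illustrates that idea under this assumption; the function name `downsample`, the pair format and the toy data are hypothetical.

```python
import random


def downsample(pairs, target_size, seed=42):
    """Down-sample labelled pairs while keeping the label ratio.

    pairs: list of (id_a, id_b, label) tuples with label 1 (match) or 0.
    Assumption: "comparable distributions" means a comparable label ratio.
    """
    rng = random.Random(seed)
    by_label = {}
    for pair in pairs:
        by_label.setdefault(pair[2], []).append(pair)

    fraction = target_size / len(pairs)
    sample = []
    for label, group in sorted(by_label.items()):
        # Draw the same fraction from every label group.
        sample.extend(rng.sample(group, round(len(group) * fraction)))
    return sample


# Hypothetical toy development set: 40 matches and 60 non-matches.
pairs = [(f"a{i}", f"b{i}", 1) for i in range(40)]
pairs += [(f"a{i}", f"b{i + 1}", 0) for i in range(60)]

small = downsample(pairs, target_size=20)
n_pos = sum(1 for p in small if p[2] == 1)
n_neg = sum(1 for p in small if p[2] == 0)
print(n_pos, n_neg)
```

The fixed seed only serves to make the toy example repeatable. A real down-sampling might additionally stratify by other properties, such as whether a pair is a corner case.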

3 Benchmark Profiling

The WDC Block benchmark consists of a total of 2,073,224 real-world products across all subsets, which are described by 2,100,000 product offers. Each product offer in WDC Block has five attributes: title, description, price, priceCurrency and brand. WDC Block comes with nine configurations, which are derived from the two dimensions dataset size (small, medium, large), described in Table 1, and development set size (~1k, ~5k, ~20k), described in Table 2. Table 1 shows the number of records in Table A and Table B, the number of positive and negative pairs in the test set, as well as the vocabulary size and the Cartesian product of the different dataset sizes. The vocabulary size represents the number of unique tokens after concatenating the attribute values of the product offers in Table A and Table B and tokenizing the concatenated attribute values by whitespace. The Cartesian product is the maximum number of record comparisons between Table A and Table B (AxB).

**Table 1: Benchmark Statistics.**
Dataset	Table A	Table B	Pos. Pairs Test	Neg. Pairs Test	Vocabulary Size	Cartesian Product
WDC Block small	5,000	5,000	500	4,000	67,294	25M
WDC Block medium	5,000	200,000	500	4,000	1,174,280	1,000M
WDC Block large	100,000	2,000,000	500	4,000	6,880,107	200,000M
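The two derived columns of Table 1 can be sketched as follows. The code follows the description above (concatenate the attribute values, split on whitespace, count unique tokens) but is not the script used to produce the table. The record format (a list of dictionaries) and the function names are assumptions, and no lower-casing is applied because none is mentioned.

```python
ATTRIBUTES = ["title", "description", "price", "priceCurrency", "brand"]


def vocabulary_size(table_a, table_b, attributes=ATTRIBUTES):
    """Count unique whitespace-separated tokens over both tables."""
    vocabulary = set()
    for record in list(table_a) + list(table_b):
        # Concatenate the attribute values of one offer, skipping missing ones.
        text = " ".join(
            str(record[a]) for a in attributes if record.get(a) is not None
        )
        vocabulary.update(text.split())
    return len(vocabulary)


def cartesian_product(size_a, size_b):
    """Maximum number of record comparisons between Table A and Table B."""
    return size_a * size_b


# Hypothetical toy offers (not taken from WDC Block).
table_a = [{"title": "usb cable 2m", "brand": "acme"}]
table_b = [{"title": "usb charging cable", "brand": "acme"}]
toy_vocabulary = vocabulary_size(table_a, table_b)

# Table sizes as listed in Table 1.
table_1_sizes = {
    "small": (5_000, 5_000),
    "medium": (5_000, 200_000),
    "large": (100_000, 2_000_000),
}
products = {
    name: cartesian_product(a, b) for name, (a, b) in table_1_sizes.items()
}
print(toy_vocabulary, products)
```

For the table sizes listed in Table 1, this yields 25 million, 1 billion and 200 billion pairs, matching the Cartesian product column.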

**Table 2: Development Set Statistics.**
Development Set Size	Pos. Pairs Train	Neg. Pairs Train	Pos. Pairs Val.	Neg. Pairs Val.	Total Pairs
Small ~1k	266	408	133	203	1,011
Medium ~5k	1,559	1,880	779	939	5,192
Large ~20k	6,454	10,502	3,226	5,250	21,932

4 Experiments

To demonstrate the usefulness of WDC Block, we evaluate the blocking methods BM25, BM25 with trigrams (BM25-3), CTT [4], Auto [4], JedAI [3], SBERT [5], Barlow Twins [6], SimCLR [6], and SC-Block [1] on WDC Block. Detailed explanations of the experiments can be found in the corresponding paper [1].

4.1 Blocking Systems

4.2 Benchmark Results for nearest neighbour blockers with 𝑘 = 5

We first analyze all nearest neighbour blockers with a fixed number of nearest neighbours 𝑘 = 5. We use recall and precision to evaluate the candidate sets with respect to the test sets of the datasets. By fixing the hyperparameter 𝑘, differences in recall and precision become visible that are not visible if 𝑘 is tuned. The recall and precision results in Table 3 show that 𝑘 = 5 is not sufficient for any of the benchmarked blockers to achieve a recall above 75%. The results also show that both increasing the dataset size and decreasing the development set size increase the difficulty of the benchmark.
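The following sketch shows one way to compute recall and precision of a candidate set with respect to a labelled test set. It rests on an assumption that is not spelled out on this page: precision is computed only over candidate pairs that carry a label in the test set, and unlabelled candidate pairs are ignored. The function name and the toy data are hypothetical.

```python
def recall_precision(candidates, labelled_pairs):
    """Evaluate a candidate set against labelled test pairs.

    candidates: set of (id_a, id_b) pairs produced by a blocker.
    labelled_pairs: dict mapping (id_a, id_b) to 1 (match) or 0 (non-match).
    Assumption: candidate pairs without a label are ignored.
    """
    positives = {p for p, label in labelled_pairs.items() if label == 1}
    negatives = {p for p, label in labelled_pairs.items() if label == 0}

    true_positives = len(candidates & positives)
    false_positives = len(candidates & negatives)

    recall = true_positives / len(positives) if positives else 0.0
    labelled_candidates = true_positives + false_positives
    precision = true_positives / labelled_candidates if labelled_candidates else 0.0
    return recall, precision


# Hypothetical toy test set and candidate set.
labelled_pairs = {
    ("a1", "b1"): 1,
    ("a2", "b2"): 1,
    ("a1", "b2"): 0,
    ("a3", "b3"): 0,
}
candidates = {("a1", "b1"), ("a1", "b2"), ("a9", "b9")}

recall, precision = recall_precision(candidates, labelled_pairs)
print(f"recall={recall:.2f}, precision={precision:.2f}")
```

In the toy example, one of two matches is found and one labelled non-match is kept, so both values are 0.5. The pair `("a9", "b9")` has no label and therefore does not affect either value.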

Table 3: Recall (R) and Precision (P) of the candidate sets generated by all nearest neighbour blockers with 𝑘 = 5 on the test sets of the datasets. The highest Recall and Precision values are marked in bold. ’timeout’ indicates a timeout after 48h and ’OOM’ indicates an out-of-memory error.

Blocker	WDC Block_small R	WDC Block_small P	WDC Block_medium R	WDC Block_medium P	WDC Block_large R	WDC Block_large P
BM25	59.42%	39.73%	53.81%	45.28%	41.70%	54.39%
BM25₃	52.47%	38.87%	46.64%	43.88%	timeout	timeout
SC Block_{Large Dev. Set}	71.52%	57.27%	66.37%	63.52%	56.73%	74.41%
SC Block_{Medium Dev. Set}	58.52%	51.58%	50.67%	58.25%	37.00%	62.74%
SC Block_{Small Dev. Set}	42.15%	46.88%	26.46%	46.83%	17.49%	47.56%
Barlow	31.61%	34.73%	21.30%	35.58%	12.56%	33.14%
SimCLR	34.75%	39.24%	21.08%	38.52%	2.69%	33.33%
SBERT_{Large Dev. Set}	45.29%	48.91%	34.98%	55.12%	24.44%	56.77%
Auto	43.20%	39.48%	35.80%	40.68%	OOM	OOM
CTT	42.60%	38.10%	34.80%	40.56%	OOM	OOM

4.3 Benchmark Results for nearest neighbour blockers with recall >99.5% on validation set

We analyze how tuning the hyperparameter 𝑘 affects the recall of the nearest neighbour blockers on WDC Block. Increasing 𝑘 increases the probability of finding a matching pair, resulting in a higher recall. However, higher values of 𝑘 also produce larger candidate sets, so the matcher has more candidate pairs to compare, which prolongs the matching phase of the entity resolution pipeline. Our main goal in tuning 𝑘 is to add all matching pairs to the candidate set while keeping the candidate set as small as possible. To achieve this goal, we evaluate each nearest neighbour blocker on the validation set with increasing values of 𝑘, starting with 𝑘 = 1. Once the recall of the candidate set exceeds 99.5% on the validation set, 𝑘 is fixed. To limit the search space, we set a maximum value of 𝑘 = 50 on WDC Block_small, 𝑘 = 100 on WDC Block_medium and 𝑘 = 200 on WDC Block_large.
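The tuning procedure described above can be sketched as a simple loop. This is an illustration under assumptions, not the code used for the experiments: `retrieve` stands for any nearest neighbour blocker that returns the candidate pairs for a given 𝑘, and the ranked neighbour lists in the toy example are made up.

```python
def tune_k(retrieve, val_positives, max_k, target_recall=0.995):
    """Find the smallest k whose candidate set exceeds the target recall.

    retrieve: function k -> set of (id_a, id_b) candidate pairs (assumed).
    val_positives: set of matching pairs in the validation set.
    Returns (k, recall); k equals max_k if the target is never exceeded.
    """
    if not val_positives:
        raise ValueError("validation set contains no matching pairs")
    recall = 0.0
    for k in range(1, max_k + 1):
        candidates = retrieve(k)
        recall = len(candidates & val_positives) / len(val_positives)
        if recall > target_recall:
            return k, recall
    return max_k, recall


# Hypothetical ranked nearest neighbours per record of Table A.
ranked = {"a1": ["b1", "b2", "b3"], "a2": ["b9", "b2", "b7"]}


def retrieve(k):
    """Candidate pairs formed by the top-k neighbours of every record."""
    return {(a, b) for a, neighbours in ranked.items() for b in neighbours[:k]}


val_positives = {("a1", "b1"), ("a2", "b2")}
best_k, best_recall = tune_k(retrieve, val_positives, max_k=3)
print(best_k, best_recall)
```

In the toy example, 𝑘 = 1 finds only one of the two matches, while 𝑘 = 2 finds both, so the loop stops at 𝑘 = 2. If no 𝑘 up to the maximum exceeds the threshold, the maximum value is used, which corresponds to the capped search space described above.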

The results in Table 4 show how challenging WDC Block is for the blockers: except for SC Block_{Large Dev. Set}, all blockers fail to generate candidate sets that exceed the 99.5% recall threshold on the validation set. WDC Block_large is the most challenging benchmark dataset due to the high number of records and the large vocabulary of unique tokens. The blockers BM25₃, JedAI, Auto and CTT fail to generate candidate sets on WDC Block_large due to time and memory constraints.

Table 4: 𝑘 per blocker and dataset after tuning 𝑘 for recall 99.5% on the respective validation set. Recall (R) and candidate set size (|C|) of all blockers on the test set. The highest recall, as well as the lowest 𝑘 and |C| values per dataset, are marked in bold. ’timeout’ indicates a timeout after 48h and ’OOM’ indicates an out-of-memory error.

Blocker	WDC Block_small R	WDC Block_small |C|	WDC Block_medium 𝑘	WDC Block_medium R	WDC Block_medium |C|	WDC Block_large R	WDC Block_large |C|
BM25	96.86%	250k	100	97.76%	500k	83.18%	20M
BM25₃	94.17%	250k	100	93.95%	500k	timeout	timeout
SC Block_{Large Dev. Set}	93.50%	70k		91.93%	100k	89.46%	
SC Block_{Medium Dev. Set}	92.60%	250k	100	86.55%	500k	77.80%	500k
SC Block_{Small Dev. Set}	71.75%	250k	100	52.02%	500k	42.20%	500k
Barlow	66.59%	250k	100	42.60%	500k	27.80%	500k
SimCLR	69.51%	250k	100	45.96%	500k	36.10%	500k
JedAI	55.40%	51k		80.60%	561k	timeout	timeout
SBERT_{Large Dev. Set}	78.48%	250k	100	63.39%	500k	58.74%	500k
Auto	85.20%	250k	100	80%	500k	OOM	OOM
CTT	83%	250k	100	78%	500k	OOM	OOM

5 Downloads

We offer the WDC Block benchmark for public download. Each configuration is available as a single zip file. Each zip file contains the two datasets A and B as well as the train, validation and test splits.

Dataset Size	Small development set (~1k)	File Size	Medium development set (~5k)	File Size	Large development set (~20k)	File Size
Small	Size S, Train S	2.2MB	Size S, Train M	2.2MB	Size S, Train L	2.5MB
Medium	DS M, Train S	38MB	DS M, Train M	38MB	DS M, Train L	39MB
Large	DS L, Train S	385MB	DS L, Train M	385MB	DS L, Train L	385MB

7 References

[1] A. Brinkmann, R. Shraga, and C. Bizer, 2023, "SC-Block: Supervised Contrastive Blocking within Entity Resolution Pipelines", arXiv:2303.03132 [cs].

[2] P. Christen, 2012, "Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection". Springer, Berlin, Heidelberg.

[3] V. Christophides, V. Efthymiou, T. Palpanas, G. Papadakis, and K. Stefanidis, 2021, "An Overview of End-to-End Entity Resolution for Big Data" in ACM Computing Surveys, vol. 53, no. 6, pp. 1–42.

[4] S. Thirumuruganathan, H. Li, N. Tang, M. Ouzzani, Y. Govind, D. Paulsen, G. Fung, and A. Doan, 2021, "Deep learning for blocking in entity matching: a design space exploration" in Proceedings of the VLDB Endowment, vol. 14, no. 1, pp. 2459–2472.

[5] N. Reimers and I. Gurevych, 2019, "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks" in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992.

[6] R. Wang, Y. Li, and J. Wang, 2022, "Sudowoodo: Contrastive Self-supervised Learning for Multi-purpose Data Integration and Preparation" on arXiv.

[7] H. Köpcke, A. Thor, and E. Rahm, 2010, "Evaluation of entity resolution approaches on real-world match problems" in Proceedings of the VLDB Endowment, vol. 3, no. 1, pp. 484–493.

[8] G. Papadakis, D. Skoutas, E. Thanos, and T. Palpanas, 2021, "Blocking and Filtering Techniques for Entity Resolution: A Survey" in ACM Computing Surveys, vol. 53, no. 2, pp. 1–42.

[9] V. Efthymiou, G. Papadakis, G. Papastefanatos, K. Stefanidis, and T. Palpanas, 2017, "Parallel meta-blocking for scaling entity resolution over big heterogeneous data" in Information Systems, vol. 53, pp. 137–157.

[10] D. Paulsen, Y. Govind, and A. Doan, 2023, "Sparkly: A Simple yet Surprisingly Strong TF/IDF Blocker for Entity Matching" in Proceedings of the VLDB Endowment, vol. 16, pp. 1507–1519.

WDC Block: A Blocking Benchmark
