GDC-SM is a schema matching benchmark derived from a real-world biomedical data harmonization scenario in cancer genomics. It is based on the study by Li et al. (2023), in which datasets from ten independent tumor-related studies were integrated and mapped to the Genomic Data Commons (GDC) standard used by the U.S. National Cancer Institute. To support schema matching research, the original graph-based GDC model was transformed into a simplified relational schema containing only column names and value domains, producing a unified target table with 736 attributes.
The benchmark includes 10 source-to-target table pairs with heterogeneous schemas; each source table has 16–179 columns and 93–225 rows. Ground-truth matches were manually curated by at least three biomedical experts per dataset, who combined automated candidate generation with expert validation and consensus-based decision-making. This curation process reflects the inherent ambiguity and difficulty of real biomedical schema alignment tasks.
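As a rough illustration of how curated ground truth of this kind might be used, the sketch below scores a matcher's predicted column correspondences for one source-to-target pair. It assumes matches are represented as (source column, target attribute) pairs and that precision, recall, and F1 are the metrics of interest; neither the representation nor the metrics are specified by the benchmark description, and the column names and the `score_matches` helper are hypothetical.

```python
from typing import Dict, Set, Tuple

# Assumed representation: a match is a (source_column, target_attribute) pair.
Match = Tuple[str, str]


def score_matches(predicted: Set[Match], truth: Set[Match]) -> Dict[str, float]:
    """Compare predicted matches with ground-truth matches for one table pair."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical column names, for illustration only (not taken from the benchmark).
truth = {
    ("Stage", "tumor_stage"),
    ("Gender", "gender"),
    ("Age", "age_at_diagnosis"),
}
predicted = {
    ("Stage", "tumor_stage"),
    ("Gender", "gender"),
    ("Age", "days_to_birth"),   # wrong target attribute
    ("Site", "primary_site"),   # not in the ground truth
}

scores = score_matches(predicted, truth)
print(scores)
```

Exact set intersection treats every correspondence as either right or wrong. Because the target table has 736 attributes, a ranked evaluation (for example, checking whether the correct attribute appears among the top few candidates) may be a useful complement, though that is a design choice outside what is described here.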