The Biomedical Data Fabric (BDF) Toolbox project, led by the Advanced Research Projects Agency for Health (ARPA-H), aims to improve the integration and usability of biomedical data collected across thousands of labs, hospitals, and research centers. By addressing incompatible data formats, siloed systems, and limited interoperability, the project seeks to enable large-scale, privacy-preserving biomedical data analysis.

Figure: The NYU BDF harmonization infrastructure. BDI-Kit provides a shared library of composable primitives that power BDI-Viz (interactive schema curation) and Harmonia (conversational harmonization agent), and is designed to be extended by contributors and integrated with external AI agents via MCP.

At the VIDA Lab at NYU, we develop novel AI-powered methods and open-source systems to support biomedical data harmonization and discovery. Our research advances the state of the art in dataset discovery, schema matching, agentic data integration, and AI-powered scientific reasoning, combining large language models, interactive visualization, and conversational interfaces.

Novel Methods

Our research contributions include: cost-effective and accurate schema matching that outperforms state-of-the-art approaches on biomedical benchmarks; agentic harmonization that enables domain experts to construct reusable data pipelines through natural language; and LLM-powered dataset discovery in scientific publications.

Open-Source Systems

We have released five open-source tools — BDI-Kit, BDI-Viz, Harmonia, Data Gatherer, and Discovera — adopted by multiple BDF program teams and freely available to the research community.

Benchmarks

We have created three new benchmarks — two for biomedical schema and value matching and one for dataset discovery in scientific publications — available to the community.

News