The Biomedical Data Fabric (BDF) Toolbox project, led by the Advanced Research Projects Agency for Health (ARPA-H), aims to improve the integration and usability of biomedical data collected across thousands of labs, hospitals, and research centers. The project addresses challenges caused by incompatible data formats, siloed systems, and limited interoperability, enabling large-scale and privacy-preserving biomedical data analysis.
At the VIDA Lab at NYU, we develop novel AI-powered methods and open-source systems to support biomedical data harmonization and discovery. Our research advances the state of the art in dataset discovery, schema matching, agentic data integration, and AI-powered scientific reasoning by combining large language models, interactive visualization, and conversational interfaces.
Our research contributions include: cost-effective and accurate schema matching that outperforms state-of-the-art approaches on biomedical benchmarks; agentic harmonization that enables domain experts to construct reusable data pipelines through natural language; and LLM-powered dataset discovery in scientific publications.
We have released five open-source tools (BDI-Kit, BDI-Viz, Harmonia, Data Gatherer, and Discovera) that have been adopted by multiple BDF program teams and are freely available to the research community.
We have created three new benchmarks, two for biomedical schema and value matching and one for dataset discovery in scientific publications, all of which are available to the community.