Research


Our research addresses two fundamental challenges in automating machine learning workflows: (i) the synthesis, analysis, and understanding of machine learning pipelines, and (ii) dataset search and discovery. A central goal is to make machine learning accessible to users without extensive expertise in computer science or artificial intelligence, so that domain experts can develop and deploy ML solutions more effectively. As machine learning systems grow more complex, users need tools that not only automate the construction of ML workflows but also help them understand, refine, and trust the generated solutions. To meet this need, we develop systems that combine AutoML, visual analytics, reinforcement learning, and human-in-the-loop interaction to support the end-to-end machine learning lifecycle.

Our work on pipeline synthesis and model understanding includes systems such as Visus, PipelineProfiler, AlphaD3M, and Alpha-AutoML, which support automated pipeline generation, interactive exploration of ML solutions, and integration of state-of-the-art machine learning techniques into extensible AutoML frameworks. These systems are designed to lower the barrier to developing ML applications while still providing transparency and control over the generated models and pipelines. Complementing these efforts, our work on dataset search and discovery is represented by Auctus, a dataset search engine that enables large-scale discovery and augmentation of structured data. Together, these systems aim to make machine learning more accessible, interpretable, and effective by supporting both the automated generation of ML workflows and the discovery of high-quality data needed to build them.

Visus


Visus is a system designed to support the construction, curation, and refinement of machine learning (ML) pipelines generated by AutoML systems. As AutoML approaches increasingly enable the automatic synthesis of end-to-end ML workflows, Visus addresses the need for effective human oversight by providing intuitive interactive interfaces tailored for domain experts with limited ML expertise. The system integrates visual analytics techniques to guide users throughout the model-building process, enabling interactive data augmentation, pipeline exploration, and visual model selection. Through these capabilities, Visus facilitates human-in-the-loop curation of AutoML-generated pipelines, helping users better understand, evaluate, and refine ML solutions.

Repository: gitlab.com/ViDA-NYU/d3m/ta3

Demo video: youtube.com/watch?v=lZQbh3ctq6Q

PipelineProfiler


PipelineProfiler is an interactive visualization tool designed to support the exploration, analysis, and comparison of machine learning (ML) pipelines generated by AutoML systems. As modern AutoML approaches can produce large and complex solution spaces, understanding, debugging, and evaluating the generated pipelines remains challenging for both developers and ML practitioners. PipelineProfiler addresses these challenges by providing visual analytics capabilities that help users inspect pipeline structures, compare model performance, and better understand the behavior of AutoML algorithms. Integrated with Jupyter Notebook and compatible with common data science tools, the system enables rich interactive analyses of AutoML-generated pipelines and supports informed decision-making when selecting or improving AutoML solutions. Its effectiveness has been demonstrated through real-world use cases and user studies involving data scientists developing and evaluating AutoML systems.

Repository: github.com/VIDA-NYU/PipelineVis

Demo video: youtube.com/watch?v=2WSYoaxLLJ8

Auctus


Auctus is a dataset search engine designed to automatically discover and index structured datasets available across the Web, including Web tables, open-data portals, and enterprise repositories. Unlike traditional dataset search systems, Auctus infers consistent metadata for indexing and supports advanced search capabilities such as join and union queries. A join query finds datasets that share a key with the user's data and can therefore contribute new columns; a union query finds datasets with a compatible schema that can contribute new rows. By addressing the challenges of structured data discovery, Auctus facilitates data exploration, augmentation, and integration at scale. The system has been deployed in real-world environments, where it has been used to improve machine learning model performance and enrich analytical workflows through automated dataset discovery and augmentation.
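The difference between join and union augmentation can be sketched in a few lines of plain Python. This is only an illustration of the two operations, not the Auctus API: the function names, the tables, and their contents are all hypothetical.

```python
# Illustrative only: plain-Python stand-ins for the two augmentation styles.
# Nothing here calls Auctus; the tables and function names are made up.

def join_augment(base, candidate, key):
    """Left join: add the candidate's columns to each base row sharing `key`."""
    lookup = {row[key]: row for row in candidate}
    joined = []
    for row in base:
        match = lookup.get(row[key], {})
        extra = {k: v for k, v in match.items() if k != key}
        # Base rows without a match are kept unchanged (no new columns).
        joined.append({**row, **extra})
    return joined


def union_augment(base, candidate):
    """Union: append candidate rows whose columns match the base table."""
    columns = set(base[0])
    return base + [row for row in candidate if set(row) == columns]


taxi = [{"zip": "10001", "trips": 120}, {"zip": "10002", "trips": 95}]
weather = [{"zip": "10001", "rain_mm": 3.2}]   # joinable on "zip"
more_taxi = [{"zip": "10003", "trips": 40}]    # same schema as `taxi`

wider = join_augment(taxi, weather, "zip")     # gains a "rain_mm" column
longer = union_augment(taxi, more_taxi)        # gains one more row
```

A join makes the table wider, which can give a model new features, while a union makes it longer, which can give a model more training examples.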

Repository: gitlab.com/ViDA-NYU/auctus/auctus

Demo video: youtube.com/watch?v=EUn1qwXVFHs

AlphaD3M


AlphaD3M is an open-source AutoML system designed to support a wide range of machine learning tasks across diverse data types and application domains. The system automatically searches for models and constructs end-to-end ML pipelines that perform data ingestion, preprocessing, feature engineering, and model training. To explore the large search space of possible pipelines efficiently, AlphaD3M combines deep reinforcement learning with meta-learning, which lets the system adapt to new tasks and improve over time through incremental learning. Beyond automated pipeline generation, AlphaD3M is integrated into a broader ecosystem of tools that support human-in-the-loop interactions, including pipeline selection, solution analysis, and the development of complex ML workflows. Experimental evaluations show that AlphaD3M produces high-quality pipelines whose performance is comparable to or better than that of state-of-the-art AutoML systems across a diverse set of problems.

Repository: gitlab.com/ViDA-NYU/d3m/alphad3m

Demo video: youtube.com/watch?v=9qJvOUOh2zM

Alpha-AutoML


Alpha-AutoML is an extensible open-source AutoML system designed to integrate recent advances in machine learning into automated pipeline generation. Building on the reinforcement learning and neural network components developed for AlphaD3M, Alpha-AutoML adopts a standard open-source infrastructure for defining, executing, and managing ML pipelines. By leveraging the scikit-learn pipeline ecosystem, the system is compatible with widely used ML libraries and frameworks such as scikit-learn, XGBoost, Hugging Face, Keras, and PyTorch. In addition, Alpha-AutoML supports the dynamic integration of new primitives through the standard scikit-learn fit/predict API, so the system can quickly incorporate emerging ML techniques as the field evolves.
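As a rough illustration of what the fit/predict convention asks of a primitive, the sketch below defines a toy classifier in plain Python. The class name and data are hypothetical, and the example deliberately does not show how Alpha-AutoML registers primitives, since that mechanism is not described here. In practice a primitive would typically also subclass scikit-learn's BaseEstimator and ClassifierMixin.

```python
from collections import Counter


class MajorityClassClassifier:
    """Hypothetical primitive: always predicts the most frequent training label."""

    def fit(self, X, y):
        # Learned state goes in attributes with a trailing underscore,
        # following the scikit-learn naming convention.
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self  # returning self allows chained calls such as fit(...).predict(...)

    def predict(self, X):
        # One prediction per input row, regardless of the feature values.
        return [self.majority_ for _ in X]


X_train = [[0.1], [0.4], [0.9]]
y_train = ["spam", "ham", "spam"]

model = MajorityClassClassifier().fit(X_train, y_train)
predictions = model.predict([[0.2], [0.7]])
```

Because the contract is just these two methods, any component that follows it can in principle be placed in a pipeline alongside existing estimators.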

Repository: github.com/VIDA-NYU/alpha-automl