Here you will find the full schedule of the Spring 2023 VIDA Dataset Search and Discovery seminar.

Schedule

| Date | Speaker | Talk | Location |
| --- | --- | --- | --- |
| 2/15 11:00am | Aécio Santos (New York University) | Sketching Methods for Efficient Correlated Dataset Search [Details](internal) | |
| 3/01 11:00am | Natasha Noy (Google Research) | Google Dataset Search: Building an open ecosystem for dataset discovery [Details] | 370 Jay St, Room 1201 |
| 3/17 11:00am | Raul Castro Fernandez (University of Chicago) | System foundations of data markets and their connection to data discovery [Details] | 370 Jay St, Room 1114 |
| 3/24 12:15pm | Fatemeh Nargesian (University of Rochester) | Lakes of Data: From Semantic and Syntactic Dataset Discovery to Approximate Query Answering [Details] | 370 Jay St, Room 1113 |
| 4/7 11:00am | Laura Koesten (University of Vienna) | Data Discovery and Reuse: A Human-Centred View [Details] | Online via Zoom; NYU: 370 Jay St, 1113 |
| 4/14 11:30am | Ziawasch Abedjan (Leibniz Universität Hannover, L3S Research Center) | Data Discovery with Advanced Index Structures [Details] | Online via Zoom; NYU: 370 Jay St, 1113 |
| 4/28 1:30pm | Asterios Katsifodimos & Christos Koutras (Delft University of Technology) | Matching for Dataset Discovery: Algorithms, Datasets and Benchmarks [Details] | Online via Zoom; NYU: 370 Jay St, 1113 |

Past Talks

Asterios Katsifodimos & Christos Koutras (Delft University of Technology)

Time: April 28th, 2023. 1:30pm.

Location: Online via Zoom: https://nyu.zoom.us/j/99198491462 (We will also convene at NYU: 370 Jay St, Room 1113).

Title: Matching for Dataset Discovery: Algorithms, Datasets and Benchmarks.

Abstract: Data scientists today search large data lakes to discover and integrate datasets. In order to bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. This process has traditionally been handled with schema matching techniques. After 20 years of research in schema matching, we are still missing a benchmark for schema matching, as well as proper datasets and proper evaluation metrics! In this talk I will first present an overview of matching techniques for dataset discovery, and then present Valentine: an extensible open-source experiment suite to execute and organize large-scale automated matching experiments on tabular data. Valentine now includes implementations of 7 seminal schema matching methods that we either implemented from scratch (due to the absence of open-source code) or imported from open repositories. Finally, Valentine offers a data fabrication toolbox for constructing testing datasets with ground truth. I will conclude my talk with insights from a large set of experiments we have been performing at TU Delft, focusing on the strengths and weaknesses of existing techniques, which can serve as a guide for employing schema matching in future dataset discovery methods.
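
As a rough illustration of what "establishing correspondences between datasets" can mean in practice, the sketch below pairs columns of two tables by the Jaccard overlap of their value sets. This is a generic instance-based baseline chosen for illustration; it is not one of the methods discussed in the talk and it does not use Valentine's API. The names `jaccard`, `match_columns`, and the `threshold` parameter are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between the distinct values of two columns."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_columns(source, target, threshold=0.5):
    """Propose column correspondences between two tables.

    source, target: dict mapping column name -> list of values.
    Returns (source_column, target_column, score) tuples whose score
    is at least `threshold`, sorted from best to worst.
    NOTE: illustrative baseline only, not a method from the talk.
    """
    matches = []
    for s_name, s_vals in source.items():
        for t_name, t_vals in target.items():
            score = jaccard(s_vals, t_vals)
            if score >= threshold:
                matches.append((s_name, t_name, score))
    matches.sort(key=lambda m: m[2], reverse=True)
    return matches


# Hypothetical toy tables with differently named but overlapping columns.
source = {"country": ["NL", "US", "DE"], "year": [2020, 2021]}
target = {"nation": ["NL", "US", "FR"], "yr": [2020, 2021, 2022]}
matches = match_columns(source, target)
```

On the toy tables, `year` pairs with `yr` and `country` with `nation`, even though the column names differ. A value-overlap matcher like this fails when related columns share no literal values, which is one reason a benchmark covering several matching strategies is useful.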

Bios:

Asterios: Asterios Katsifodimos is an Associate Professor at the Delft University of Technology, and a Visiting Academic at Amazon Web Services (AWS) - AI. Before that, Asterios worked at the SAP Innovation Center (Berlin), and at the Technical University (TU) of Berlin. Asterios obtained his PhD from INRIA Saclay & University Paris 11. Asterios’ research work spans the areas of parallel (streaming-) data processing & Cloud computing, optimization of ML-systems, and data integration. His research on fault tolerance, aggregation methods and benchmarking has influenced the design of open-source stream processing engines, while his research group develops and maintains the dataset discovery system Valentine. Asterios received the ACM SIGMOD Research Highlights Award in 2016, the best paper award at EDBT 2019 and ACM DEBS 2021, and the best demo paper award at EDBT 2023. He is the instructor of the online MOOC “Taming Massive Data Streams” and is invited regularly to give talks at industry and research venues. Asterios serves as an associate editor or a program committee member for data management conferences such as VLDB, ICDE, SIGMOD and EDBT.

Christos Koutras: Christos is a PhD Candidate at the Delft University of Technology, supervised by Asterios Katsifodimos. His research focuses on Data Integration, Dataset Discovery, and spatial data management. Christos leads the efforts around Valentine, a matching framework for dataset discovery. He holds a Master of Philosophy (MPhil) in Computer Science from HKUST, where he was supervised by Prof. Dimitris Papadias. Prior to that, he obtained his 5-year Diploma in Electrical and Computer Engineering from National Technical University of Athens.

Ziawasch Abedjan (Leibniz Universität Hannover, L3S Research Center)

Time: April 14th, 2023. 11:30am.

Location: Online via Zoom: https://nyu.zoom.us/j/97420294195 (We will also convene at NYU: 370 Jay St, Room 1113).

Title: Data Discovery with Advanced Index Structures

Abstract: Data marketplaces and data lake management are becoming more and more relevant in large organisations. One of the common use cases for such data repositories is to obtain training data or features for a downstream machine learning (ML) task. The goal is to enrich a given table with additional columns obtained from related tables that reside inside the data lake. Existing methods rely on the discovery of related tables that join on single attributes of a given table. However, many candidate tables that are discovered this way will only loosely be relevant for the given input dataset or might not contain additional interesting features. To further restrict the set of candidates, complex filters based on multi-column matches and correlation calculation can be applied, but these are time-consuming with state-of-the-art data structures and indexes. We have recently introduced two index structures that support multi-column join discovery and correlation calculation. In this talk, I will present these data structures and discuss the achieved performance on very large data lakes.

Bio: Ziawasch Abedjan is Professor for “Databases and Information Systems” at Leibniz Universität Hannover and Visiting Academic at Amazon Search. He is a Junior Fellow of the German Computer Science Society, a Fellow of the Berlin Institute for the Foundations of Learning and Data, and a member of the L3S Research Center. He has published more than 60 peer-reviewed papers in the area of data integration and data analytics. Ziawasch Abedjan received his PhD at the Hasso-Plattner-Institute in Potsdam and received the best dissertation award of the University of Potsdam in 2014. After his PhD, he was a postdoctoral associate at MIT and Junior Professor at the TU Berlin. He is also a recipient of the SIGMOD 2019 most reproducible paper award, the SIGMOD 2015 best demonstration award, and the CIKM 2014 best student paper award. His research is funded by the German Research Foundation (DFG) and the German Ministry of Research and Education (BMBF).

Laura Koesten (University of Vienna)

Time: April 7th, 2023. 11am.

Location: Online via Zoom: https://nyu.zoom.us/j/97184908181?pwd=R3NRWmI4YlZBQUJ4ZVZxV0wzbXJOQT09

Title: Data Discovery and Reuse: A Human-Centred View

Abstract: The web provides access to millions of datasets. These data can have additional impact when used beyond the context for which they were originally created. This is particularly relevant given the ever-increasing amounts of data being produced and made available and the creation of data-specific discovery tools and systems. Still, finding as well as reusing data remains challenging. While information-seeking in various settings has been well-researched within computer and information science, less is known about human-centered data discovery, or, in other words, how people discover, understand, and interact with data that others create. This talk will give an overview of research on how people evaluate and make sense of data they find and how data search systems and data repositories can be better designed to meet people’s needs.

Bio: Laura Koesten is a postdoctoral researcher at the University of Vienna in the Research Group for Visualization and Data Analysis and external researcher at King’s College London, UK. In her work, she is looking at ways to improve human-data interaction by studying sensemaking with data and visualisations, data discovery and reuse, as well as ethical and collaborative aspects of data-centric work. That means she researches how data is used, understood and presented by different user groups. She is PI of the Talking Charts project (https://talking-charts.vda.univie.ac.at/) and obtained her Ph.D. at the Open Data Institute + the University of Southampton, UK.

Fatemeh Nargesian (University of Rochester)

Time: March 24th, 2023. 11am.

Location: 370 Jay Street, Room 1113.

Title: Lakes of Data: From Semantic and Syntactic Dataset Discovery to Approximate Query Answering

Abstract: The number and variety of structured data sources available on open data portals, the web, and data markets have been increasing rapidly, making secondary data analysis much more attractive. In fact, data scientists on many occasions may be spoiled for choice. In this talk, first, I will describe how we can push down syntactic and semantic join operations into dataset search to discover and infer queries for integrating data. Next, we will broaden the scope to query answering and see how to obtain an IID sample over the union of join queries, to perform approximate query answering. Finally, I will conclude by discussing the future directions in the distribution-aware aspects of integrating secondary data.

Bio: Fatemeh Nargesian is an assistant professor of computer science at the University of Rochester. She obtained her PhD at the University of Toronto. Her research interests are in data management for AI-based data analytics and scientific time-series management. Her work has appeared at top-tier venues including VLDB, SIGMOD, and ICDE, and has won the best demo award of VLDB 2017.

Raul Castro Fernandez (University of Chicago)

Time: March 17th, 2023. 11am.

Location: 370 Jay Street, Room 1114

Title: System foundations of data markets and their connection to data discovery

Abstract: A data market is an environment where agents exchange data. From this “lens”, many familiar scenarios can be considered data markets. When individuals give their data to online services in exchange for their services (think Google or Meta), they participate in a data market. Online marketplaces where companies trade data are a data market. And when employees of an organization exchange data with each other, they are participating in a data market. In this talk, I will first discuss the benefits of applying this “data market lens” to data environments. I will then present the work our group is doing in advancing this agenda, which roughly differentiates between a data market design and data market platform implementation. I will then concentrate on a novel platform we are building to support the implementation of data markets: a data escrow, a trustworthy third party that supports delegated and auditable computation, two ingredients necessary to implement data markets. Before concluding, I will present my view on how data discovery and data markets will lead to interesting new scenarios and applications.

Bio: I am interested in understanding the economics and value of data, including the potential of data markets to unlock that value. Our group builds systems to share, discover, prepare, integrate, and process data. We use techniques from data management, statistics, and machine learning. I am an assistant professor in the Computer Science department at the University of Chicago. Before UChicago, I did a postdoc at MIT with Sam Madden and Mike Stonebraker. And before that, I completed my PhD at Imperial College London with Peter Pietzuch.

Natasha Noy (Google Research)

Time: March 1st, 2023. 11am.

Location: 370 Jay Street, Room #1201.

Title: Google Dataset Search: Building an open ecosystem for dataset discovery

Abstract: There are thousands of data repositories on the Web, providing access to tens of millions of datasets. National and regional governments, scientific publishers and consortia, commercial data providers, and others publish data for fields ranging from social science to life science to high-energy physics to climate science and more. Access to this data is critical to facilitating reproducibility of research results, enabling scientists to build on others’ work, and providing data journalists easier access to information and its provenance. The talk will discuss Dataset Search by Google, which provides search capabilities over potentially all dataset repositories on the Web. I will talk about the open ecosystem for describing datasets that we hope to encourage. I will discuss what we have learned by analyzing the corpus of more than 45 million dataset descriptions. Finally, I will discuss research challenges in building a vibrant, heterogeneous, and open ecosystem where data becomes a first-class citizen.

Bio: Natasha Noy is a scientist at Google Research where she works on making structured data accessible and useful. She leads the team building Dataset Search, a search engine for all the datasets on the Web. Prior to joining Google, she worked at Stanford Center for Biomedical Informatics Research where she made major contributions in the areas of ontology development and alignment, and collaborative ontology engineering. Dr. Noy is a Fellow of the Association for the Advancement of Artificial Intelligence (AAAI). She has served as the President of the Semantic Web Science Association from 2011 to 2017.

Aécio Santos (New York University)

Time: February 15th, 2023. 11am.

Title: Sketching Methods for Efficient Correlated Dataset Search

Abstract: Dataset search is emerging as a critical capability in both research and industry: it has spurred many novel applications such as data augmentation for enriching data analyses and improving machine learning models. In this talk, we present our recent work that explores a new class of dataset search queries that uncovers data relationships in large table collections. In particular, we focus on join-correlation queries: given an input query table, find the top-k tables that are both joinable with it and contain columns strongly correlated with a column in the query table. Unfortunately, a naïve approach to evaluate these queries, which first finds joinable tables and then explicitly materializes joins and computes correlations between the query and all discovered tables, is prohibitively expensive. To solve this problem, we 1) present novel data sketching methods that enable the construction of an index for a large number of tables and that provide accurate estimates for join-correlation queries, and 2) explore different indexing and scoring strategies that effectively retrieve and rank the query results based on how well the columns are correlated with the query.
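
To make the cost of the naïve approach concrete, here is a minimal sketch of that baseline: for every candidate table it materializes the join with the query on the key and then computes a Pearson correlation over the joined rows. It illustrates only the expensive strategy the abstract argues against, not the sketching or indexing methods from the talk. The single-column table layout (a dict from join key to numeric value) and the names `pearson` and `naive_join_correlation` are assumptions made for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n < 2:
        return 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / sqrt(vx * vy)


def naive_join_correlation(query, candidates, k=1):
    """Naive top-k join-correlation search (illustrative baseline only).

    query: dict mapping join key -> numeric value.
    candidates: dict mapping table name -> dict of join key -> numeric value.
    Every candidate is joined in full, so the cost grows with the
    total size of the collection.
    """
    results = []
    for name, table in candidates.items():
        # Explicitly materialize the join on the shared keys.
        shared = [key for key in query if key in table]
        if len(shared) < 2:
            continue  # not joinable enough to estimate a correlation
        r = pearson([query[key] for key in shared],
                    [table[key] for key in shared])
        results.append((name, abs(r)))
    results.sort(key=lambda item: item[1], reverse=True)
    return results[:k]


# Hypothetical toy collection.
query = {"a": 1, "b": 2, "c": 3, "d": 4}
candidates = {
    "t1": {"a": 2, "b": 4, "c": 6, "z": 0},
    "t2": {"a": 5, "b": 1, "c": 4, "d": 2},
    "t3": {"x": 1},
}
top = naive_join_correlation(query, candidates, k=1)
```

Here `t1` ranks first because its values are perfectly correlated with the query on the three shared keys, while `t3` is skipped as not joinable. Every query scans and joins all candidate tables, which is the cost that a sketch-based index is meant to avoid.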

Bio: Aécio Santos is a Research Engineer in the Visualization, Imaging, and Data Analysis (VIDA) group at New York University (NYU). He received a Master’s Degree in Computer Science from the Federal University of Minas Gerais (in Brazil) and is currently a part-time Ph.D. candidate at New York University under the supervision of Prof. Juliana Freire. Over the years, he has worked, both in academia and industry, on a wide range of problems related to web crawling, news recommendation and classification, automated machine learning, and dataset search. He has served as a reviewer and has published papers at multiple premier data management and information retrieval conferences and journals such as VLDB, SIGMOD, WWW, WSDM, SIGIR, and CIKM.