Seminar: Quantifying Data Value in Modern AI Published: 2025-11-06

Title: Quantifying Data Value in Modern AI

Speaker: Shuran Zheng (郑舒冉), Assistant Professor, Tsinghua University

Host: Dake Zhang (张大可), Assistant Professor, Antai College of Economics and Management, Shanghai Jiao Tong University

Time: Wednesday, November 12, 2025, 13:30-15:00

Venue: Room 306, Antai Haoran Building (安泰浩然楼), Xuhui Campus, Shanghai Jiao Tong University

 

Abstract:

Data lies at the heart of modern artificial intelligence, and data quality is increasingly recognized as a crucial determinant of model performance. Data evaluation is also a key focus of the finance research community, whether the data serve as samples for financial machine-learning models or are themselves assets to be valued. The first part of this talk will focus on recent advances in quantifying the value of individual data points, highlighting approaches based on influence functions and Shapley values.

The second part will discuss our work on evaluating the effectiveness of data curation methods. We argue that conventional evaluations—those based solely on model performance on fixed benchmarks—can be misleading, as they may incentivize curators to make training data similar to the test sets. This issue exemplifies Goodhart’s law: when a measure becomes a target, it ceases to be a good measure. To address this, we propose an information-theoretic framework for evaluating data curation methods, where dataset quality is measured by its informativeness about the true model parameters using the Blackwell ordering. We compare informativeness by the Shannon mutual information of the evaluated data and the test data, and we propose a novel method for estimating the mutual information of datasets by training Bayesian models on embedded data and computing the mutual information from the model’s parameter posteriors. Experiments on real-world data demonstrate that our mutual information-based evaluation assigns appropriately lower scores to data curation strategies that reduce dataset informativeness, while traditional test score-based evaluation methods may favor data curation strategies that overfit to the test set but compromise the training data's informativeness.

 

About the Speaker

Shuran Zheng is a tenure-track Assistant Professor in the Institute for Interdisciplinary Information Sciences at Tsinghua University. She obtained her Ph.D. in Computer Science from Harvard University, and was a postdoctoral researcher at Carnegie Mellon University and a Student Researcher in the Market Algorithms Group at Google Research NYC.

Her research lies at the intersection of Computer Science and Economics, and she is particularly interested in understanding the value of data and information. She explores various areas including data valuation, data markets, information elicitation, information aggregation, and information design.

 

 

All faculty and students are welcome to attend!