The National Science Foundation has awarded a $1.6 million BIGDATA grant to Drexel University, in collaboration with the University of Washington, the University of Michigan, and the University of Massachusetts Amherst, to research and develop responsible data science methods targeting the early stages of the data life cycle.
Julia Stoyanovich, Assistant Professor of Computer Science at Drexel University, is the Principal Investigator (PI) for the grant, which provides funding from September 2017 through August 2021. Stoyanovich will work with Co-PIs Bill Howe, Director of Urbanalytics and Associate Professor in the Information School at UW; H. V. Jagadish, the Bernard A. Galler Collegiate Professor of Electrical Engineering and Computer Science at UM; and Gerome Miklau, Professor in the College of Information and Computer Sciences at UMass Amherst.
The project, Foundations of Responsible Data Management, develops techniques and practices to reduce the introduction of algorithmic bias and privacy leaks, while supporting transparency, in the pre-processing of big data. In contrast to existing work on these topics, which has focused on data mining, modeling, analysis and machine learning, this project addresses upstream processes that generate input data for analysis, from discovery and acquisition to querying, ranking and the generation of synthetic data. This work is part of Data, Responsibly, based at Drexel University, which builds tools to embed legal and ethical norms into data sharing, collection and analysis.
This work establishes the properties of fairness, representativeness and diversity as necessary components in data management systems through three aims:
Aim 1: Responsible Data Discovery, Profiling, and Integration: This aim addresses algorithmic bias that can be introduced in the selection of non-representative samples, the manipulation of data prior to aggregation and sharing, the combination of disparate records with limited metadata, and the presence of heterogeneous or weakly structured datasets. Tasks will include attaching metadata to datasets to facilitate search and downstream analysis, providing scalable curation through automatic annotation for sources, such as open data portals, that often rely on free-text descriptions, tagging or incomplete metadata.
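To make the automatic-annotation idea concrete, here is a minimal sketch of the kind of column profiling that could drive metadata generation for a weakly documented dataset. The function name, the inferred fields and the sample data are all illustrative, not part of the project's actual tooling:

```python
import csv
import io

def _is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

def profile_columns(csv_text):
    """Derive simple metadata for each column of a CSV: an inferred type,
    a null count, and the number of distinct values. A toy stand-in for
    the scalable automatic annotation described above."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    profile = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        non_null = [v for v in values if v != ""]
        is_numeric = bool(non_null) and all(_is_number(v) for v in non_null)
        profile[col] = {
            "type": "numeric" if is_numeric else "text",
            "nulls": len(values) - len(non_null),
            "distinct": len(set(non_null)),
        }
    return profile

# Hypothetical input: a small table with a missing value.
data = "age,city\n34,Philadelphia\n29,\n41,Seattle\n"
print(profile_columns(data))
```

A real annotator would go further, for example inferring semantic types and linking columns across datasets, but even profiles this simple make free-text-described data searchable.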
Aim 2: Responsible Query Processing: This aim addresses tasks such as querying and ranking that take place after data acquisition and integration into a relational view, and in preparation for modeling and analysis. The goal is to prevent under-representation or over-representation of protected groups, which can unintentionally reinforce discrimination. The team will develop techniques to refine queries automatically or semi-automatically to mitigate prior assumptions in the criteria used to categorize data subjects. This will be accomplished by generating “bias factors” that are specific to each group of data subjects defined by the user. The PIs will also develop a comprehensive methodology for assessing the bias in rankers and supporting transparency in ranking schemes. The techniques developed will require each demographic category to be represented in the output of a query to support diversity. For example, altering a query upstream can support statistical parity (ensuring that the demographics of a sample represent the whole), which can influence research outputs.
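The statistical-parity check described above can be illustrated with a toy diagnostic: compare each group's share of a query's output against its share of the full population. A ratio well below 1.0 flags under-representation, which the project's query-refinement techniques would then mitigate. The function and data here are a sketch, not the PIs' implementation:

```python
from collections import Counter

def parity_ratios(selected, population, group_of):
    """For each demographic group, divide its share of the selected rows
    (the query output) by its share of the population. Values near 1.0
    indicate statistical parity; values far below 1.0 indicate
    under-representation of that group."""
    sel = Counter(group_of(r) for r in selected)
    pop = Counter(group_of(r) for r in population)
    n_sel, n_pop = len(selected), len(population)
    return {g: (sel[g] / n_sel) / (pop[g] / n_pop) for g in pop}

# Hypothetical example: group B is half the population but only a
# fifth of the query output.
population = [{"group": "A"}] * 50 + [{"group": "B"}] * 50
selected = [{"group": "A"}] * 40 + [{"group": "B"}] * 10
print(parity_ratios(selected, population, lambda r: r["group"]))
# → {'A': 1.6, 'B': 0.4}
```

A downstream refinement step could use such ratios as the "bias factors" mentioned above, relaxing or tightening query predicates until each group's ratio falls within an acceptable band.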
Aim 3: Data Protection: The team will incorporate data protection into the data life cycle by developing techniques to facilitate sharing of sensitive data, such as access and usage controls, while balancing the trade-offs between privacy, provenance and transparency. This aim will use differential privacy methods that rely on a third party to manage sensitive data and provide synthetic data with a limited scope to outside analysts in response to queries. This will include studying the relationship between privacy protection, the evaluation of bias factors and query refinement, as fairness and diversity are harder to evaluate in narrow synthetic datasets; and developing a method for aggregate provenance recording that protects privacy following data manipulation (removing outliers and incomplete data, adding missing values or combining data sources). This aim builds on DataSynthesizer, created by a research team led by Howe and Stoyanovich to generate synthetic data that is structurally and statistically similar to a sensitive dataset without compromising privacy.
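At the core of the differential privacy methods mentioned above is the idea of answering queries with calibrated noise. Below is a minimal sketch of the standard Laplace mechanism for a count query; it is an illustration of the general technique, not code from this project or from DataSynthesizer:

```python
import random

def dp_count(records, predicate, epsilon):
    """Answer a count query under epsilon-differential privacy.
    A count has sensitivity 1 (adding or removing one person changes it
    by at most 1), so Laplace noise with scale 1/epsilon suffices.
    The Laplace sample is drawn as the difference of two exponentials."""
    true_count = sum(1 for r in records if predicate(r))
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# Hypothetical sensitive records: how many data subjects are minors?
records = [{"age": 16}, {"age": 22}, {"age": 17}, {"age": 15}]
answer = dp_count(records, lambda r: r["age"] < 18, epsilon=1.0)
```

Smaller epsilon means more noise and stronger privacy; a curator answering many such queries must also track the cumulative privacy budget spent.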
An example of responsible data techniques in practice uses the scenario of improving accuracy in a model of urban homelessness to prevent disparate impact in the resulting interventions and policies. To account for low accuracy in data on young adults, researchers could expand the selection criteria for homelessness status, change the cutoff age for the young adult sub-group, or incorporate mental health service data. If adding this data would endanger the privacy of minors, a synthetic data set could be used.
The PIs will design qualitative and quantitative engagement studies with user communities, partnering with Impact Lab in Chicago to explore the use cases of workforce retraining programs, energy conservation and interventions for the homeless; and with the eScience Institute at UW to study transportation behavior and the relationship between housing stability and education; in addition to working with the US Census Bureau and MetroLab Network partners.
In the paper Fides: Towards a Platform for Responsible Data Science, Stoyanovich, Howe, Miklau and co-authors Serge Abiteboul, Arnaud Sahuguet and Gerhard Weikum discuss the idea of a responsible data sharing and collaborative analytics platform called Fides that would incorporate fairness, accountability and transparency properties as upstream database system issues. This software-as-a-service platform would utilize techniques developed through this grant to improve metadata, curation, annotation of sensitive data, access and version controls, and bias detection in statistical results.
The algorithms, techniques and implementation methods developed through this grant will be tested using open data portals, the publicly available SQLShare workload and synthetically generated datasets based on sensitive data from project partners. SQLShare, a cloud-based database service developed at UW with funding from NSF and the Gordon and Betty Moore Foundation to simplify the process of uploading and querying data, will also be used to manage internal partner data, incorporate results and disseminate algorithmic techniques. The project outcomes will be incorporated into courses for graduate and undergraduate students at all four universities. Data and algorithm implementations will be released open source.