Urban Institute
To build and test a prototype validation server that enables privacy-preserving research on administrative tax data
The grant funds a project by Claire Bowen, Lead Data Scientist at the Urban Institute, to facilitate more research on IRS data. Such data is extremely valuable for probing pressing questions in social science, but it is also extremely sensitive. Strict privacy protection laws inhibit access to this data, such that hardly anyone outside the IRS has ever laid eyes on it. Bowen and her team propose to draw on recent mathematical advances in the theory of “differential privacy” to create tools that can increase researcher access to IRS data without fear of violating the privacy of the American taxpayer.
The project is divided into two parts. In the first, Bowen will create a high-quality synthetic dataset from original IRS data. Mathematical theory shows how to do this safely by reconstructing microdata details from statistical tables to which small amounts of noise have been added. For many research questions, queries of this “noisy” synthetic dataset will provably yield the same answer that the same query would yield of the original IRS data, without the danger of exposing the identity of any of the individuals in the data. For some more complicated research questions (nonlinear calculations such as correlations or regression coefficients, for instance), there is no guarantee that queries of the noisy synthetic dataset will yield the same results as similar queries of the original. Without a means for further testing, researchers cannot be certain whether a relationship they find in the synthetic data is real or an artifact.
The second part of Bowen’s project will address this concern through the construction of a “verification server.” The server, which would have access to the original IRS data, can verify whether a result reached through analysis of the synthetic dataset is consistent with the original data, guaranteeing the fidelity of research results without allowing researchers to see the sensitive data.
If successfully constructed, this two-pronged system—synthetic dataset plus verification server—promises to provide researchers with reliable but privacy-protecting access to one of the most valuable datasets in social science.
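As a rough illustration of the two parts described above, the sketch below adds Laplace noise (a standard mechanism in differential privacy) to a small table of counts, expands the noisy table into synthetic records, and then checks whether a statistic computed on the synthetic records is close to the same statistic on the original. It is not the project's actual method. The income bins, counts, `epsilon` value, tolerance, and the function names `noisy_histogram`, `synthesize`, and `is_consistent` are all hypothetical, chosen only for illustration.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_histogram(counts, epsilon, rng):
    # Each record falls in exactly one bin, so adding or removing one
    # record changes one count by 1 (sensitivity 1). Laplace noise with
    # scale 1/epsilon then gives epsilon-differential privacy.
    scale = 1.0 / epsilon
    return {label: n + laplace_noise(scale, rng) for label, n in counts.items()}


def synthesize(noisy_counts):
    # Round each noisy count to a non-negative integer and emit that
    # many synthetic records carrying only the bin label.
    records = []
    for label, n in noisy_counts.items():
        records.extend([label] * max(0, round(n)))
    return records


def is_consistent(statistic, synthetic, confidential, tolerance):
    # Toy stand-in for the verification step: report only whether the
    # two answers agree within `tolerance`, never the confidential value.
    return abs(statistic(synthetic) - statistic(confidential)) <= tolerance


def share_over_100k(records):
    # Hypothetical statistic: fraction of records in the top bin.
    return records.count("over_100k") / len(records)


if __name__ == "__main__":
    rng = random.Random(0)
    # Made-up counts standing in for a confidential table.
    true_counts = {"under_25k": 400, "25k_to_100k": 450, "over_100k": 150}
    confidential = [lab for lab, n in true_counts.items() for _ in range(n)]
    synthetic = synthesize(noisy_histogram(true_counts, epsilon=1.0, rng=rng))
    print(is_consistent(share_over_100k, synthetic, confidential, tolerance=0.05))
```

In this toy version the yes/no answer is computed directly from the confidential records, so repeated queries could themselves leak information. A real server would presumably need to protect and budget its own responses as well, which this sketch does not attempt.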