Dimension Reduction for Large-Scale Federated Data: Statistical Rate and Asymptotic Inference

Principal component analysis (PCA) is one of the most popular methods for dimension reduction. Large-scale data increasingly reside in federated ecosystems, such as those of banks and healthcare systems, where data cannot leave individual warehouses. In such settings, traditional PCA is often not applicable because of privacy protection requirements and heavy computational burden. Fast PCA algorithms have been proposed to lower the computational cost for large-scale data, but they cannot handle federated data. Distributed PCA algorithms handle federated data by applying traditional PCA to the data at each site and aggregating the site-specific PCA results. However, they are neither computationally efficient nor scalable when the data at each site are large, with many samples and variables, as in biobanks. In this paper, we propose the FAst DIstributed (FADI) PCA method, which performs PCA on large federated data with high computational efficiency and low statistical error, without sharing data across sites. Specifically, FADI applies fast PCA to site-specific data using multiple random sketches and aggregates the fast PCA results across sites. We perform a non-asymptotic theoretical study showing that, under some regularity conditions, FADI attains the same error rate as traditional full-sample PCA at a computational cost of significantly smaller order than that of existing methods. We conduct extensive simulation studies comparing the finite-sample performance of FADI with existing algorithms, and show that FADI substantially outperforms the other methods in computational efficiency without sacrificing statistical accuracy. We apply FADI to the 1000 Genomes data to study population structure.

Shuting Shen, Junwei Lu, and Xihong Lin

12/29/2021
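
The sketch-and-aggregate idea described in the abstract can be illustrated with a short NumPy example. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give algorithmic details, so the function names (`site_sketch`, `sketched_distributed_pca`), the parameters (`K` components, sketch width `p`, `L` repeated sketches), the Gaussian test matrices, and the averaging of projection matrices across sketches are all illustrative choices. It also assumes every site holds the same `d` variables and that the data have already been centered.

```python
import numpy as np


def site_sketch(X_s, omega):
    """Computed locally at one site (hypothetical helper).

    X_s   : (n_s, d) centered data held by the site.
    omega : (d, p) random test matrix shared by all sites.
    Returns the d x p product X_s^T (X_s omega); only this small
    matrix leaves the site, never the raw rows of X_s.
    """
    return X_s.T @ (X_s @ omega)


def sketched_distributed_pca(sites, K, p, L, seed=0):
    """Estimate the top-K principal subspace from per-site sketches.

    sites : list of (n_s, d) arrays, one per site (assumed centered).
    K     : number of principal components to recover.
    p     : sketch width, assumed to satisfy K <= p <= d.
    L     : number of independent random sketches to average over.
    """
    rng = np.random.default_rng(seed)
    d = sites[0].shape[1]
    n = sum(X.shape[0] for X in sites)
    proj_avg = np.zeros((d, d))

    for _ in range(L):
        # One Gaussian test matrix per sketch, shared across sites.
        omega = rng.standard_normal((d, p))
        # Summing the site sketches gives (pooled covariance) @ omega.
        Y = sum(site_sketch(X, omega) for X in sites) / n
        # Fast PCA step: top-K left singular vectors of the d x p sketch.
        U, _, _ = np.linalg.svd(Y, full_matrices=False)
        V_l = U[:, :K]
        # Accumulate the projection matrix so that the arbitrary
        # rotation of each V_l does not matter when averaging.
        proj_avg += (V_l @ V_l.T) / L

    # Final estimate: top-K eigenvectors of the averaged projection.
    _, eigvecs = np.linalg.eigh(proj_avg)  # eigenvalues in ascending order
    return eigvecs[:, -K:][:, ::-1]
```

In this illustration each site communicates only `L` matrices of size `d x p`, and the central step never forms the `d x d` sample covariance of the pooled data; the `d x d` averaged projection is kept only for readability and could be replaced by an SVD of the stacked `V_l` matrices.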