The International Conference for High Performance Computing, Networking, Storage and Analysis
Enabling Scalable Data Analysis of Computational Structural Biology Datasets on Distributed Memory Systems Supported by the MapReduce Paradigm.
Student: Boyu Zhang (University of Delaware)
Advisor: Michela Taufer (University of Delaware)
Abstract: In this research project, we focus on scalable and accurate classification and clustering analyses of large computational structural biology datasets on large distributed memory systems. We propose a transformative data analysis method that comprises two general steps. The first step extracts concise properties or features of each data record in parallel and represents them as metadata. The second step performs the analysis (i.e., classification or clustering) on the extracted properties. Our method fits naturally into the MapReduce paradigm; we adapt it for different MapReduce frameworks (i.e., Hadoop, MapReduce-MPI, and DataMPI). We apply the frameworks to three scientific datasets of RNA secondary structures, ligand conformations, and folding proteins. The evaluation results show that our method can perform scalable classification and clustering analyses on large-scale datasets that are generated and stored in a distributed manner. Moreover, our method achieves better accuracy than traditional approaches.
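
The two-step structure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the feature used here (the centroid of a record's 3D coordinates), the grid-cell grouping that stands in for the clustering step, and the names `extract_features`, `cluster_by_cell`, and `analyze` are all assumptions made for illustration. A local thread pool stands in for the distributed map phase that a framework such as Hadoop or MapReduce-MPI would provide.

```python
from collections import defaultdict
from multiprocessing.dummy import Pool  # thread pool; stands in for distributed mappers


def extract_features(record):
    """Step 1 (map): reduce one record to concise metadata.

    Hypothetical feature: the centroid of the record's 3D coordinates.
    """
    record_id, coords = record
    n = len(coords)
    centroid = tuple(sum(c[i] for c in coords) / n for i in range(3))
    return record_id, centroid


def cluster_by_cell(metadata, cell_size=1.0):
    """Step 2 (reduce): group records whose metadata fall in the same grid cell.

    A simple stand-in for clustering; it operates only on the extracted
    metadata, never on the original records.
    """
    clusters = defaultdict(list)
    for record_id, centroid in metadata:
        key = tuple(int(x // cell_size) for x in centroid)
        clusters[key].append(record_id)
    return dict(clusters)


def analyze(records, cell_size=1.0, workers=4):
    """Run both steps: parallel feature extraction, then analysis of the metadata."""
    with Pool(workers) as pool:
        metadata = pool.map(extract_features, records)  # preserves input order
    return cluster_by_cell(metadata, cell_size)


if __name__ == "__main__":
    # Toy records: (id, list of 3D points). Purely illustrative data.
    records = [
        ("a", [(0.1, 0.1, 0.1), (0.3, 0.3, 0.3)]),
        ("b", [(0.4, 0.4, 0.4)]),
        ("c", [(5.2, 5.0, 5.9)]),
    ]
    print(analyze(records))
```

The point of the split is that the second step touches only the small metadata tuples, so the full records never need to be moved between nodes; only the per-record features are gathered for analysis.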