The International Conference for High Performance Computing, Networking, Storage and Analysis
Multi-Level Hashing Dedup in HPC Storage Systems.
Student: Eric E. Valenzuela (Texas Tech University)
Supervisor: Yong Chen (Texas Tech University)
Abstract: High data deduplication ratios are achievable in High Performance Computing (HPC) storage. Prior work demonstrates the magnitude of the reduction possible: on average, 15 to 30 percent of data is redundant and can be removed using deduplication techniques. The objective of this research study is to design and evaluate a dedup system that provides 100% data integrity, with no possibility of losing data, while reducing the need for costly byte-by-byte comparisons. Because data deduplication relies on hashing algorithms, hash collisions can occur. Prior systems skip the byte-by-byte comparisons needed to handle collisions, citing their low probability. Our research investigates a multi-level dedup method that reduces byte-by-byte comparisons while still providing 100% data integrity, and implements multi-level hash functions that take advantage of the Xeon Phi many-core architecture to compute cryptographic fingerprints concurrently. Our current proof-of-concept evaluations with a deduplication file system, Lessfs, show promising results.
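As an illustration only, the multi-level idea described in the abstract might be sketched as follows. This is a minimal in-memory sketch under stated assumptions, not the authors' implementation and not how Lessfs works: the class name MultiLevelDedupStore, the helper fingerprint_concurrently, the choice of CRC32 as the first level and SHA-256 as the second level, and the thread pool standing in for a many-core processor are all hypothetical. A chunk is compared byte by byte only when both hash levels match, so first-level collisions are resolved by the second hash alone, and a stored chunk is never reused without a verified match.

```python
import hashlib
import zlib
from concurrent.futures import ThreadPoolExecutor


def crc32_fingerprint(chunk):
    """Assumed level-1 hash: cheap, but collisions are expected."""
    return zlib.crc32(chunk)


def sha256_fingerprint(chunk):
    """Assumed level-2 hash: cryptographic and far more selective."""
    return hashlib.sha256(chunk).digest()


class MultiLevelDedupStore:
    """Hypothetical chunk store: two hash levels, then a final byte check."""

    def __init__(self, level1=crc32_fingerprint, level2=sha256_fingerprint):
        self.level1 = level1
        self.level2 = level2
        # level-1 fingerprint -> list of (level-2 fingerprint, chunk bytes)
        self.buckets = {}
        self.byte_comparisons = 0
        self.stored_chunks = 0

    def put(self, chunk):
        """Store a chunk. Return True if it duplicated a stored chunk."""
        h1 = self.level1(chunk)
        h2 = self.level2(chunk)
        bucket = self.buckets.setdefault(h1, [])
        for stored_h2, stored_chunk in bucket:
            if stored_h2 != h2:
                # Level-1 collision resolved by level 2; no byte comparison.
                continue
            # Both levels match: confirm with a byte-by-byte comparison so a
            # collision can never cause data loss.
            self.byte_comparisons += 1
            if stored_chunk == chunk:
                return True
        bucket.append((h2, chunk))
        self.stored_chunks += 1
        return False


def fingerprint_concurrently(chunks, workers=4):
    """Compute SHA-256 fingerprints in parallel, preserving input order.

    A thread pool is only a stand-in here for many-core hardware.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(sha256_fingerprint, chunks))
```

For example, constructing the store with a deliberately weak first level, such as MultiLevelDedupStore(level1=len), forces every equal-length chunk into the same bucket; distinct chunks are then told apart by the second-level hash, and the byte_comparisons counter increases only when a true duplicate is confirmed.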