Article

GSTF Journal on Computing (JoC)

, 3:22

First online:

Open Access This content is freely available online to anyone, anywhere at any time.

Multiple Buffering for Parallel Approximate Sequence Matching using Disk-based Suffix Tree on Multi-core CPU

  • Keiichi TamuraAffiliated withDepartment of Intelligent Systems, Graduate School of Information Sciences, Kyushu UniversityInformation Science, Hiroshima City University Email author 
  • , Yousuke WatanukiAffiliated withDepartment of Intelligent Systems, Graduate School of Information Sciences, Kyushu University
  • , Hajime KitakamiAffiliated withDepartment of Intelligent Systems, Graduate School of Information Sciences, Kyushu UniversityTohoku University
  • , Yoshifumi TakahashiAffiliated withMaster of Information Engineering degree, Graduate School of Information Science, Hiroshima City University

Abstract

Suffix trees, which are trie structures that present the suffixes of sequences (e.g., strings), are widely used for sequence search in different application domains such as, text data mining, bioinformatics and computational biology. In particular, suffix trees are useful in bioinformatics applications, because they can search similar sub-sequences and extract frequent sequence patterns efficiently. In recent years, efficient construction of a suffix tree that allows faster sequence searches has become one of the most important challenges, because the number and size of the data that are stored in sequence databases have been increasing exponentially. This paper proposes a novel parallelization model for approximate sequence matching that uses disk-based suffix trees, which are built on hard disks not on memory, on a multi-core CPU. In the proposed parallelization model, we divide an entire sequence database into two or more sub-databases called partitions. For each partition, we build a disk-based suffix tree and define a task as an approximate sequence matching on one disk-based suffix tree. Moreover, the proposed parallelization model involves a multiple buffering management system to avoid conflicts among CPU-cores. We evaluated the proposed parallelization model using an actual amino acid sequence database on a PC. The experimental results show a substantial improvement in computation performance.

Index Terms

parallel processing suffix tree multi-core CPU buffer management approximate sequence matching