Discovering motifs using the parallel Mprefixspan method



INTRODUCTION
A motif is a featured pattern in amino acid sequences. It is related to a function of the protein that has been preserved during the evolutionary process of an organism. The amino acid sequence is composed of an alphabet of 20 letters. The featured pattern, which may include wild cards, is discovered from frequent patterns extracted from the sequences. A protein sequence motif, signature or consensus pattern is a short sequence that is embedded within the sequences of the same protein family (Bork and Koonin, 1996). By identifying protein sequence motifs, an unknown sequence can be quickly classified into its computationally predicted protein family or families for further biological analysis.
In past years, many algorithms for finding protein sequence motifs have been proposed. Sequence motif discovery algorithms can generally be categorized into three types: (1) string alignment algorithms, (2) exhaustive enumeration algorithms, and (3) heuristic methods. String alignment algorithms (Waterman et al., 1984; Delcoigne and Hansen, 1975; Needleman and Wunsch, 1970) find sequence motifs by minimizing a cost function related to the edit distances between sequences. Multiple alignment of sequences is an NP-hard problem, and its computational time increases exponentially with the sequence size. Local search algorithms such as Gibbs sampling (Lawrence et al., 1993) and expectation maximization (Bailey and Elkan, 1995) may end up in a local optimum instead of finding the best motif (Buhler and Tompa, 2001). Exhaustive enumeration algorithms (Blanchette et al., 2000; Brazma et al., 1998) are guaranteed to find the optimal motif, but run in exponential time with respect to the length of the motif. Heuristic methods (Jonassen et al., 1995; Sagot and Viari, 1996) can have better performance but are usually less flexible. Other existing pattern extraction algorithms, including multiple alignment and statistical methods, present some problems (Kitakami et al., 2002; Wang and Parthasarathy, 2004; Syed et al., 2010; Wang et al., 2004; Chang et al., 2006): these algorithms are neither functional nor fast enough for the discovery of motifs from large-scale amino-acid sequences.

*Corresponding author. E-mail: hossein.shirgahi@gmail.com.
To solve the functional problem, we apply some changes to the prefixspan method (Chang et al., 2006), introducing a mechanism that limits the number of wild cards. These modifications succeeded in deleting extra patterns and in reducing the computational time for about 1200 amino acid sequences, each of which ranges in length from 20 to 3800 characters. To achieve faster pattern extraction from large-scale sequence databases, the Pspan method was parallelized over multiple computers connected in a local area network, as explained in this paper. To achieve better parallelization, we used a dynamic load balancing method (Cong et al., 2004; Di Fatta and Berthold, 2006). One computer, which manages the parallel processing, divides the entire extraction into multiple k-length frequent-pattern extractions. These k-length frequent-pattern extractions are distributed to the computers that perform the parallel processing. A feature of the Pspan method is that the workload of each k-length frequent-pattern extraction differs. After a k-length frequent pattern is extracted by a computer, the step is repeated: the other computers continue processing, one after another, while a designated computer extracts a k-length frequent pattern.

Previous studies on motif discovery
Here, related work on motif discovery methods in the field of molecular biology is described. Let L be the average sequence length for the set of sequences {S1, S2, ..., SN} defined over some alphabet Σ. Multiple alignment requires a large amount of time to find only one optimal solution, since the time complexity of multiple alignment is O(L^N). Moreover, multiple alignment does not reveal the other motifs included in the set of sequences. To address this problem, Bailey and Elkan (1994) developed a system named MEME, which uses the expectation maximization algorithm, a statistical method, to extract frequent patterns. The time complexity of MEME, O((NM)^2 W), is less than that of multiple alignment. It is difficult, however, to relate the extracted frequent patterns to the motifs stored in motif databases (Sonnhammer et al., 1997).
The Pratt system proposed by Jonassen et al. (1995) is another algorithm for motif discovery. Pratt recursively generates (k+1)-length frequent patterns from k-length frequent patterns. In this algorithm, if the maximum length that a useful pattern can have is not predictable for users, an error occurs. Another motif discovery algorithm is TEIRESIAS, which solves this problem of the Pratt system (Rigoutsos and Floratos, 1998). However, TEIRESIAS does not give the user the ability to directly specify the maximum size V of a wild-card region between the alphabet characters included in an elementary frequent pattern (the pattern length cannot be specified for frequent-pattern discovery). Moreover, it does not follow any principle when the user tries to enter the average length of the sequences. The main objective of TEIRESIAS is to find the maximal frequent patterns, not all of the frequent patterns.
Other algorithms have been designed to solve motif discovery problems. Most of these, including ALIGNACE (Hughes et al., 2000), MEME (Bailey and Elkan, 1995), PATTERN BRANCHING (Price et al., 2003) and YMF (Sinha, 2003), are not specifically designed for discriminative motif discovery. Some algorithms, such as WEEDER (Pavesi et al., 2004), do make use of a set of negative sequences in scoring candidate motifs. A few algorithms have been developed specifically for discriminative motif discovery, including ALSE (Leung and Chin, 2006), DIPS (Sinha, 2006) and DME (Smith et al., 2005). There are many algorithms for solving this problem, but none of them is as useful as ours (Wang and Parthasarathy, 2004; Chang and Halgamuge, 2002; De Amo et al., 2008; Li and Wang, 2008): the method presented here is more suitable for motif extraction than the methods described above. The modifications applied to prefixspan make it appropriate for this task. The algorithm can also find all of the frequent patterns, and it is not necessary to specify the average length of the sequences.

MPrefixspan method
Here, we present a brief description of this algorithm. Let DB be a sequence dataset (Table 1). The algorithm starts with a scan of DB to identify the frequent 1-sequences. Then, a second scan of DB constructs the projected datasets for the frequent 1-sequences. Let i be a sequence; the projection of DB along i, denoted P(i, DB), is the set of subsequences made up of the sequences in DB containing i, after deleting the events appearing before the first occurrence of i within each sequence. For instance, Table 1 shows a sequence dataset; with the support threshold set to 2, the projected dataset for the sequence AB is P(AB, DB) = {C, CB, C, BCA}: the routes up to AB are deleted, and the remaining subsequences constitute this set (Han and Kamber, 2006; Pei et al., 2001). Performance studies (Pei et al., 2007; Wang and Han, 2004) have shown that the prefixspan algorithm is more efficient than the other algorithms (Zaki, 2001; Yan et al., 2003; Agrawal and Srikant, 1995).
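The projection step can be sketched in a few lines of Python. The following helper (our illustration, not code from the paper) builds P(i, DB) by keeping, for each sequence that contains the prefix as a subsequence, the non-empty suffix after the first occurrence; the sample dataset is hypothetical, since Table 1 is not reproduced here.

```python
def project(db, prefix):
    """Build the projected dataset P(prefix, db): for each sequence in db
    that contains `prefix` as a subsequence, keep the (non-empty) suffix
    that follows the first occurrence of `prefix`."""
    projected = []
    for seq in db:
        pos = 0
        for ch in prefix:
            idx = seq.find(ch, pos)
            if idx == -1:
                pos = None  # prefix not contained in this sequence
                break
            pos = idx + 1
        if pos is not None and seq[pos:]:
            projected.append(seq[pos:])
    return projected

# hypothetical dataset for illustration
print(project(["ABCB", "ABC", "ACB"], "AB"))  # ['CB', 'C']
```

Each projected dataset depends only on its own prefix, which is the property the parallelization below exploits.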
After the projected datasets are built, the algorithm searches each projected dataset and selects the sequential patterns. The existing prefixspan method cannot generate frequent patterns that include wild cards. For instance, consider the protein sequence P = ATT***TTC*GGATT**T*T*CCC*GG, where * indicates one wild-card symbol that can stand for any character. The common prefixspan method treats TT***T and TT**T as the same pattern, TTT, but our suggested algorithm distinguishes them as TT3T and TT2T, respectively. The prefixspan method generates (k+1)-length frequent patterns from each k-length frequent pattern in a set of sequences; the last character of each (k+1)-length frequent pattern is one of the characters occurring between the position following a k-length frequent pattern and the last position in the sequences. First, the Mprefixspan method extracts the 1-length frequent patterns that satisfy min_sup. It then extracts the 2-length frequent patterns from each 1-length frequent pattern and, in general, the (k+1)-length frequent patterns from each k-length frequent pattern. For instance, Figure 1 shows the frequent patterns extracted from the two sequences of Table 2 when the maximum number of wild cards is 3, and illustrates how the presented algorithm works. The method extracts the 1-length frequent patterns "K, L, M, N, P, R, S, T", which satisfy min_sup. Next, it extracts the 2-length frequent patterns from the 1-length frequent patterns: when the 1-length frequent pattern is "K", the candidate 2-length pattern is "K*L"; because the number of wild cards agrees between the two sequences, the 2-length frequent pattern "K1L" is extracted. Next, the method extracts the 3-length frequent patterns from the 2-length frequent pattern "K*L"; the extracted 3-length frequent pattern is "K*LR". When the 1-length frequent pattern is "M", the candidate 2-length pattern is "MN"; because the number of wild cards differs between the two sequences, the 2-length frequent pattern "MN" is not extracted. The 2-length pattern "MS" is likewise not extracted, for the same reason. Thus, the algorithm uses Mprefixspan to extract the wild cards and their number.
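The gap-aware extension step can be sketched as follows. This is an illustrative Python fragment under our assumptions (the function name and the occurrence representation are not from the paper): an extension (gap, char) is kept only when enough sequences continue the pattern with exactly the same number of wild cards, which is why "K1L" survives while "MN" does not.

```python
from collections import defaultdict

def extend_with_gaps(occurrences, max_gap, min_sup):
    """occurrences: list of (sequence, end) pairs, where `end` is the
    position just after an occurrence of the current k-length pattern.
    Returns the frequent (gap, char) extensions: `gap` wild cards
    followed by `char`, supported by at least min_sup sequences."""
    support = defaultdict(set)
    for seq_id, (seq, end) in enumerate(occurrences):
        for gap in range(max_gap + 1):
            pos = end + gap
            if pos < len(seq):
                support[(gap, seq[pos])].add(seq_id)
    return {ext: len(ids) for ext, ids in support.items() if len(ids) >= min_sup}

# pattern "K" ends at position 1 in both hypothetical sequences;
# both continue with exactly one wild card and then L, so "K1L" is frequent
print(extend_with_gaps([("KALX", 1), ("KBLY", 1)], max_gap=3, min_sup=2))
```

Extensions whose gap counts differ across sequences (here, the characters after the gap of length 0) never reach min_sup and are discarded.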

Parallel frequent pattern extraction in Mprefixspan method
Our algorithm follows three steps. Step 1: identify the frequent 1-sequences.
The projected datasets of the frequent 1-sequences are independent. Given a 1-sequence, say i, only the suffixes that follow the first occurrence of i in each sequence form the projection of the dataset along i. Therefore, the closed sequential patterns mined from the projection along i1 all start with the prefix i1, while the patterns discovered from i2's projection all start with i2. A partition strategy like this is convenient for task decomposition. Since the projected datasets are independent, they can be assigned to different threads, and each thread can mine its assigned projected datasets independently; no inter-processor communication is needed during the local mining. The work procedure of the algorithm when min_sup = 2 and the thread count is 4 is shown in Figure 2 (the nodes of each subtree are called jobs). Amino acids correspond to the letters "A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y" of the alphabet.
In Figure 2, because min_sup = 2, the combinations of amino-acid letters yield 400 jobs (20 letters times 20 letters). When the jobs are distributed over, for instance, four computers, each computer performs 100 jobs on average. Our strategy for the parallel mining of patterns is as follows: (1) Each thread counts the occurrences of 1-sequences in a different part of the dataset.
(2) For each frequent 1-sequence, a very compact representation of the dataset projections, called pseudo-projections, is built. This is done in parallel by assigning a different part of the dataset to each thread; the construction of the pseudo-projections and the partitioning of the dataset across threads happen at the same time.
(3) A dynamic scheduler distributes the projections across the threads for processing. This algorithm is shown in Figure 3.
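Step (1) can be sketched with Python threads (an illustration under our assumptions, not the paper's implementation): each thread counts 1-sequence supports in its own part of the dataset, and the partial counts are merged and filtered by min_sup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_chunk(chunk):
    """Count each letter at most once per sequence (support counting)."""
    counts = Counter()
    for seq in chunk:
        counts.update(set(seq))
    return counts

def frequent_1_sequences(db, min_sup, n_threads=4):
    """Each thread handles a different part of the dataset; the partial
    counts are merged and filtered by min_sup."""
    chunks = [db[i::n_threads] for i in range(n_threads)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for partial in pool.map(count_chunk, chunks):
            total.update(partial)
    return {ch for ch, sup in total.items() if sup >= min_sup}

print(sorted(frequent_1_sequences(["KLM", "KLN", "KPM"], min_sup=2)))
```

Because the counting is a pure reduction, the merge order does not matter and no synchronization is needed beyond joining the threads.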

Thread scheduling
Next, we discuss the mechanism used to assign projections to threads. To reduce load imbalance among threads, our implementation of Mprefixspan uses dynamic scheduling. A master thread maintains a queue of pseudo-projection identifiers. Each of the other threads is initially assigned a projection. After a thread completes the mining of a projection, it sends a request to the master thread for another projection. The master thread replies with the index of the next projection in the queue and removes it from the queue. This process continues until the queue of projections is empty. The procedure is summarized by the following algorithm:

Algorithm GapPPS(I, DB, min_sup, M, N)
Input: (1) I is the processor ID, (2) DB is the portion of the dataset assigned to processor I, (3) min_sup is the minimum support threshold, (4) M and N are the parameters of a gap constraint.
Output: the set of protein sequence patterns with wild cards.
1:  find the set of length-1 frequent sequential patterns, L1;
2:  PSP = projection(L1, DB);
3:  GPSP = broadcast(PSP);
4:  S = selective sampling(L1, I, DB, min_sup, M, N);
5:  L2 = partition(L1, S);  // the new set of projections to L2
6:  if (I == 0) then
7:      accept requests from slave nodes and reply to each request with a new identifier from the set L2;
8:  else
9:      send a request for a projection identifier to the master node;
10:     stop if all projections have been assigned;
11:     apply the algorithm to the element of GPSP assigned by the master processor;
12:     enter the patterns into the output set and send a new request;
13: end if

The requests and replies are short messages and, therefore, the request and reply time is usually negligible relative to the mining time. Dynamic scheduling is quite effective when the subtasks are of similar size and the number of threads equals the number of processors. For the datasets used in our experiments, however, the cost of mining the projections may vary greatly; the relatively large mining time of some projected datasets may result in an extremely imbalanced workload.
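The master/worker exchange above can be emulated in a few lines of Python. In this sketch (ours, not the paper's code) a thread-safe queue plays the role of the master's list of pseudo-projection identifiers, and `mine` is a placeholder for the per-projection mining:

```python
import queue
import threading

def dynamic_schedule(projections, mine, n_workers):
    """Each worker repeatedly takes the next projection identifier from
    the shared queue and mines it, so a fast worker automatically picks
    up more projections than a slow one (dynamic load balancing)."""
    jobs = queue.Queue()
    for pid in range(len(projections)):
        jobs.put(pid)
    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                pid = jobs.get_nowait()
            except queue.Empty:
                return  # queue empty: no more projections to mine
            patterns = mine(projections[pid])
            with lock:
                results.extend(patterns)

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results
```

Pulling identifiers one at a time is what makes the schedule dynamic; a static split of the queue would reintroduce the imbalance described above when projection sizes vary.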

Implementation
The process that generates the jobs and manages the other processes is called the master process. It uses multithreading to manage the jobs dynamically and generates as many threads as there are other processes. The other processes are called slave processes. A slave process extracts frequent patterns by the Mprefixspan method, and each slave process communicates with one thread of the master process. The management of multiple computers uses the MPI library. The selection of a job by the master process and the communication between the master and slave processes are carried out in parallel. Figure 4 shows the processing flow of the parallel Mprefixspan method. The processing steps in Figure 4 are completed in the following order: (1) The master process performs the frequent-pattern extraction at the threshold.
(2) The multiple jobs generated in (1) are inserted into the global job pool, which stores the jobs.
(3) The thread count equals the number of slave processes. Each thread takes jobs from the global job pool and sends them to its slave process. (4) Each slave process extracts frequent patterns by the Mprefixspan method.
(5) When the termination of a slave process is confirmed, the steps are repeated from step (2).
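Steps (1) and (2) can be illustrated with a small Python sketch (hypothetical; for simplicity it assumes every 2-letter prefix clears the threshold, which gives the 400 jobs of the earlier example):

```python
import queue

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino-acid letters

def build_job_pool():
    """The master inserts one job per 2-letter prefix into the global
    job pool (20 x 20 = 400 jobs when every prefix is frequent)."""
    pool = queue.Queue()
    for a in AMINO_ACIDS:
        for b in AMINO_ACIDS:
            pool.put(a + b)
    return pool

print(build_job_pool().qsize())  # 400
```

In the real system the dispatch threads would drain this pool and forward each job to a slave process over MPI; the pool itself is ordinary shared state on the master.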

Evaluation
We implemented the suggested algorithm on a real machine; the cluster consists of 4 processors. Table 3 shows the details of the datasets, which were obtained from NCBI. Figure 5 shows the execution time and the performance ratio when the number of processors is increased from two to eight. The parameters of each experiment are as follows: (1) Zinc finger dataset, min_sup = 2, minimum support ratio 40, and 2-7 wild cards.
Figure 5 shows the total execution time and the speedup for each dataset; execution times are measured in seconds. When the number of computers is N, there are N+1 processes in our experimental environment, so all processors extract frequent patterns from the jobs. Figures 5A and B show the number of computers on the x-axis and the performance ratio on the y-axis. From the result in Figure 5B, the performance ratio is about 2 when 2 computers are used, 2.6 when 3 computers are used, and 3.5 when the maximum of 4 computers is used. Efficient parallel processing shortens the execution time to 1/N if the parallelization scales with N computers; in this experiment, the execution time was about 1/7 at best, and the cause of the difference is the overhead of communication. When the parameters (the support ratio, the number of wild cards, and the threshold) were changed, the differences between the execution times of the slave processes grew; this can be attributed to the jobs, whose individual execution times differ. If we could estimate the size of the jobs, we would expect better performance. We tested the influence of changing the minimum support threshold on the performance of our algorithm; the results, shown in Figure 6, indicate stable performance across different support thresholds. We also tested the influence of changing the number of wild cards; as Figure 7 shows, our algorithm again performs stably as the number of wild cards changes.
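The reported performance ratios translate into parallel efficiency as follows. This is a small worked calculation (the ratios are those quoted above; the helper function itself is ours):

```python
def efficiency(ratio, n_computers):
    """Parallel efficiency: measured speedup divided by the ideal
    speedup of N on N computers (execution time 1/N)."""
    return ratio / n_computers

# reported performance ratios: about 2x with 2 computers,
# 2.6x with 3 and 3.5x with 4
for ratio, n in [(2.0, 2), (2.6, 3), (3.5, 4)]:
    print(n, round(efficiency(ratio, n), 2))
```

Efficiencies near 0.87-1.0 suggest that most of the shortfall from the ideal 1/N time comes from communication overhead rather than from the mining itself.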

CONCLUSIONS
In this paper, we proposed a parallel mining algorithm that can mine motif sequences while taking wild cards into account. The PC cluster appeared to be effective, according to the results of the verification experiment.
The amino acid sequences used in the verification experiment were small-scale. It will be necessary to verify the results using a greater variety of amino acid sequences in the future.

Figure 1. Extraction of frequent patterns by the Mprefixspan method.
Figure 4. Processing flow of the parallel modified prefixspan.
Figure 5. (A) Execution time and (B) performance ratio.
Figure 7. Influence of changing the number of wild cards.

Table 1. An example of a sequence dataset.
Table 2. Dataset for the example.
Table 3. Details of the used datasets.