论文部分内容阅读
针对现有DNA重复体频率统计算法效率低、灵活性差等不足,基于字符串多模式匹配的有限状态自动机,构造DNA子序列比对自动机,利用KMP算法对自动机进行状态转移优化,由此提出一种高效的重复体频率统计算法。该算法通过对DNA数据库的线性扫描,得到每个DNA子序列在全局数据库中重叠与非重叠的重复体频率统计信息以及指定DNA序列集合的最长公共子序列信息。实验结果表明,该算法具有效率高、匹配精确、信息获取方式灵活、支持在线操作等优势。
In order to overcome the shortcomings of low efficiency and poor flexibility of existing DNA repeat frequency statistics algorithms, a finite state automaton based on string multi-pattern matching is constructed to construct DNA sequence alignment automata. KMP algorithm is used to optimize the state transition of automaton. This proposed an efficient repetition frequency statistics algorithm. The algorithm scans the DNA database linearly to obtain the frequency statistics of overlapping and non-overlapping repetitions of each DNA sub-sequence in the global database and the longest common sub-sequence information of the specified DNA sequence sets. Experimental results show that the proposed algorithm has the advantages of high efficiency, accurate matching, flexible information acquisition and support for online operations.