Hardware and Software Acceleration of Genomic Searching Through Novel Algorithms
Research and development of a new and sensitive genomic search algorithm which uses fragment assembly and indexing techniques to operate up to an order of magnitude faster than existing algorithms. The APAC National Facility makes it possible to provide the benchmark data necessary to verify the speed and sensitivity characteristics compared to the existing algorithms.
|
Principal Investigator Greg KnowlesSchool of Informatics & Engineering Flinders University |
Project f80 |
|
Co-Investigator Paul Gardner-StephenSchool of Informatics & Engineering Flinders University |
RFCD Codes 270299, 280103, 280204, 280207, 280399, 280506 |
Significant Achievements, Anticipated Outcomes and Future Work
We have significantly completed the first round of benchmark data collation using an implementation (seqaln) of the Smith-Waterman algorithm for a group of 83 sequences. This information is being combined with the output of our novel algorithm, Dash, to provide insight and direction in the analysis and refinement of that algorithm. Of particular value has been the removal of "artifact chasing" which has resulted from benchmarking against other approximate algorithms. Each Smith-Waterman run required of the order of 100,000 CPU seconds. This contrasts to the existing approximate algorithms which we have compared to in the past, which typically require 10-100 CPU seconds.
Future work consists in the immediate term of comparing these "true" results against the approximate results of successive revisions of our algorithms. In the longer term we indend to obtain results for a larger corpus of searches to provide more statistically significant characterisation of our algorithms.
Computational Techniques Used
The Smith-Waterman algorithm is a dynamic programming algorithm which is employed to find the best possible alignment of two DNA or protein sequences against each other, given certain scoring parameters. It has complexity of the order of the product of the lengths of the sequences being aligned. Hence it becomes extremely expensive computationally when attempting alignment of sequences against entire genomes or databases, possibly consisting of tens of billions of residues. In return for this excessive cost, it does guarantee finding the optimal alignment in each case. This is what provides the value for our project where we consider algorithms which are some 10,000 to 1,000,000 times faster, but in return do not guarantee finding the optimal alignments. By having the optimal alignments available it becomes possible to objectively compare the performance of two approximate algorithms.
Publications, Awards and External Funding
External Funding and Awards
CRC for Sensor, Signal and Information Processing project "Firmware for Genomic Searching"
Publications
P. Gardner-Stephen and G. Knowles, "DASH: A New High Speed Genomic Search and Alignment Tool", 4th International Conference on Mathematics and Computers in Biology and Chemistry (MCBC'03),1, 2003, 121-127.