A high-performance algorithm for detecting low-complexity regions in long genomic sequences | Vestnik Tomskogo gosudarstvennogo universiteta. Upravlenie, vychislitelnaja tehnika i informatika – Tomsk State University Journal of Control and Computer Science. 2025. № 72. DOI: 10.17223/19988605/72/7

A high-performance algorithm for detecting low-complexity regions in long genomic sequences

Detection of low-complexity regions (LCRs) in genomic sequences is a crucial task for numerous bioinformatics applications, including sequence alignment, probe design, and variant calling. This study introduces DUSTSCAN, a modification of the DUST algorithm (which scores a sequence by the frequency distribution of its unique triplets) that uses parallel computing to significantly accelerate LCR identification. The research presents a comparative analysis of DUSTSCAN against other versions of the DUST algorithm. The results demonstrate a significant improvement in detection speed, making the new approach particularly valuable for large-scale genomic data processing. The developed tool can be applied in various bioinformatics pipelines to speed up tasks that require LCR identification in genomic sequences.

Contribution of the authors: the authors contributed equally to this article. The authors declare no conflicts of interest.
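The triplet-based score mentioned in the abstract can be illustrated with a minimal sketch. It follows the commonly cited DUST formulation, in which each triplet occurring c times contributes c(c-1)/2 and the sum is divided by the number of triplets minus one. This is not the DUSTSCAN implementation: the function names, the 64-base window, the step of 32, and the threshold of 2.0 are assumptions chosen for illustration only.

```python
from collections import Counter


def dust_score(seq: str) -> float:
    """DUST-style complexity score of one window (higher = less complex)."""
    seq = seq.upper()
    n_triplets = len(seq) - 2
    if n_triplets < 2:
        # Too short to contain a repeated triplet; treat as complex.
        return 0.0
    # Count every overlapping triplet in the window.
    counts = Counter(seq[i:i + 3] for i in range(n_triplets))
    # Each triplet seen c times contributes c * (c - 1) / 2.
    repeats = sum(c * (c - 1) / 2 for c in counts.values())
    return repeats / (n_triplets - 1)


def low_complexity_windows(seq, window=64, step=32, threshold=2.0):
    """Return (start, end, score) for windows scoring above the threshold.

    window, step and threshold are illustrative assumptions,
    not values taken from the article.
    """
    hits = []
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        chunk = seq[start:start + window]
        score = dust_score(chunk)
        if score > threshold:
            hits.append((start, start + len(chunk), score))
    return hits


if __name__ == "__main__":
    print(dust_score("A" * 64))   # homopolymer: maximal repetition
    print(dust_score("ACGT"))     # no repeated triplet
    print(low_complexity_windows("A" * 64))
```

In this simplified fixed-window form, each window is scored independently of the others, which is the property that makes the computation a natural candidate for parallel execution.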

Keywords

algorithm, parallel computing, low complexity regions

Authors

Name | Organization | E-mail
Vorobev Rostislav S. | Cancer Research Institute, Tomsk National Research Medical Center, Russian Academy of Sciences; National Research Tomsk State University | tsu@rvorobev.ru
Zamyatin Alexander V. | National Research Tomsk State University | zamyatin@mail.tsu.ru
Gerashchenko Tatiana S. | Cancer Research Institute, Tomsk National Research Medical Center, Russian Academy of Sciences | t_gerashchenko@oncology.tomsk.ru
Korobeynikova Anastasia A. | Cancer Research Institute, Tomsk National Research Medical Center, Russian Academy of Sciences | shegolmay@gmail.com
Denisov Evgeny V. | Cancer Research Institute, Tomsk National Research Medical Center, Russian Academy of Sciences | d_evgeniy@oncology.tomsk.ru
