SeQuiLa: an elastic, fast and scalable SQL-oriented solution for processing and querying genomic intervals

Abstract:

Efficient processing of large-scale genomic datasets has recently become possible due to the application of ‘big data’ technologies in bioinformatics pipelines. We present SeQuiLa, a distributed, ANSI SQL-compliant solution for speedy querying and processing of genomic intervals that is available as an Apache Spark package. The proposed range join strategy is significantly (∼22×) faster than the default Apache Spark implementation and outperforms other state-of-the-art tools for genomic interval processing. The project is available at http://biodatageeks.org/sequila/. Supplementary data are available at Bioinformatics online.
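
To make the notion of a range join on genomic intervals concrete, the following is a minimal sketch of the kind of SQL query involved. It is not SeQuiLa's API or its join strategy: it uses Python's built-in `sqlite3` module on a single machine, and the table names (`reads`, `targets`), the column names (`pos_start`, `pos_end`) and the function `overlapping_pairs` are hypothetical names chosen for illustration. Intervals are assumed to be closed, so two intervals on the same chromosome overlap when each one starts no later than the other ends.

```python
import sqlite3


def overlapping_pairs(reads, targets):
    """Return (read_id, target_id) pairs whose closed intervals overlap.

    Each input row is (id, chrom, pos_start, pos_end). All names here are
    illustrative assumptions, not part of SeQuiLa.
    """
    conn = sqlite3.connect(":memory:")
    schema = "(id TEXT, chrom TEXT, pos_start INTEGER, pos_end INTEGER)"
    conn.execute("CREATE TABLE reads " + schema)
    conn.execute("CREATE TABLE targets " + schema)
    conn.executemany("INSERT INTO reads VALUES (?, ?, ?, ?)", reads)
    conn.executemany("INSERT INTO targets VALUES (?, ?, ?, ?)", targets)

    # Range join: equality on chromosome plus two inequalities that together
    # express "the closed intervals share at least one position".
    rows = conn.execute(
        """
        SELECT r.id, t.id
        FROM reads AS r
        JOIN targets AS t
          ON r.chrom = t.chrom
         AND r.pos_start <= t.pos_end
         AND t.pos_start <= r.pos_end
        ORDER BY r.id, t.id
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    example_reads = [
        ("r1", "chr1", 100, 200),
        ("r2", "chr1", 300, 400),
        ("r3", "chr2", 100, 200),
    ]
    example_targets = [
        ("t1", "chr1", 150, 350),
        ("t2", "chr2", 500, 600),
    ]
    for pair in overlapping_pairs(example_reads, example_targets):
        print(pair)
```

A general-purpose engine may evaluate the inequality part of such a join by comparing many candidate pairs, which is the cost that a dedicated range join strategy, as described in the abstract, aims to reduce.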

Downloads:

BibTeX:

 {%raw%}@article{10.1093/bioinformatics/bty940,
  author = {Szmurło, Agnieszka and Wiewiórka, Marek and Gambin, Tomasz and Leśniewska, Anna and Stępień, Kacper and Borowiak, Mateusz and Okoniewski, Michał},
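  title = {SeQuiLa: an elastic, fast and scalable SQL-oriented solution for processing and querying genomic intervals},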
  year = {2018},
  month = nov,
  doi = {10.1093/bioinformatics/bty940},
  url = {https://doi.org/10.1093/bioinformatics/bty940},
  eprint = {http://oup.prod.sis.lan/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/bty940/27037165/bty940.pdf},
  public = {yes}
}
{%endraw%}