
5. Benchmarking

5.1. Interval joins

5.1.1. Performance tests description

In order to evaluate our range join strategy we have run a number of tests using both single-node and Hadoop cluster installations. In that way we were able to analyze both vertical scalability (adding computing resources such as CPU/RAM on one machine) and horizontal scalability (adding resources across multiple machines).

The main idea of the test was to compare the performance of SeQuiLa with other tools, like featureCounts and GenAp, that can be used to compute the number of reads intersecting predefined genomic intervals. This is one of the most commonly used operations in both DNA-seq and RNA-seq data processing, most notably in gene differential expression and copy number variation calling. featureCounts performance results have been treated as a baseline. To show the difference between the naive approach using the default range join strategy available in Spark SQL and SeQuiLa's interval-tree one, we have included the former in the single-node test.
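
For reference, switching between the two approaches requires no query changes: the default Spark SQL plan is used when a query runs through a plain SparkSession, whereas SeQuiLa's interval-tree strategy is enabled by wrapping the session in a SequilaSession and registering the extra strategies (the same pattern is used in the coverage tests later in this chapter). A minimal sketch:

import org.apache.spark.sql.SequilaSession
import org.biodatageeks.sequila.utils.SequilaRegister

// Queries run through the plain `spark` session use the default Spark SQL range join.
// Wrapping it in a SequilaSession and registering the extensions enables the
// interval-tree strategy for subsequent ss.sql(...) calls.
val ss = SequilaSession(spark)
SequilaRegister.register(ss)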

5.1.1.1. Test environment setup

5.1.1.2. Infrastructure

Our tests have been run on a 6-node Hadoop cluster:

+-------+-----------------+-----------+----------+
| Role  | Number of nodes | CPU cores | RAM [GB] |
+-------+-----------------+-----------+----------+
| EN    | 1               | 24        | 64       |
| NN/RM | 1               | 24        | 64       |
| DN/NM | 4               | 24        | 64       |
+-------+-----------------+-----------+----------+

EN - edge node where, in the case of cluster tests, only Application Masters and Spark Drivers have been launched. In the case of single-node tests (Apache Spark local mode), all computations have been performed on the edge node.

NN - HDFS NameNode

RM - YARN ResourceManager

DN - HDFS DataNode

NM - YARN NodeManager

5.1.1.3. Software

All tests have been run using the following software components:

+---------------+---------+
| Software      | Version |
+---------------+---------+
| CDH           | 5.12    |
| Apache Hadoop | 2.6.0   |
| Apache Spark  | 2.2.1   |
| ADAM          | 0.22    |
| featureCounts | 1.6.0   |
| Oracle JDK    | 1.8     |
| Scala         | 2.11.8  |
+---------------+---------+

5.1.2. Datasets

Two NGS datasets have been used in all the tests: a WES (whole exome sequencing) dataset for vertical scalability evaluation and a WGS (whole genome sequencing) dataset for horizontal scalability evaluation. Both of them came from sequencing of the NA12878 sample, which is widely used in many benchmarks. The table below presents basic dataset information:

+-----------+--------+-----------+------------+
| Test name | Format | Size [GB] | Row count  |
+-----------+--------+-----------+------------+
| WES-SN    | BAM    | 17        | 161544693  |
| WES-SN    | ADAM   | 14        | 161544693  |
| WES-SN    | BED    | 0.0045    | 193557     |
| WGS-CL    | BAM    | 273       | 2617420313 |
| WGS-CL    | ADAM   | 215       | 2617420313 |
| WGS-CL    | BED    | 0.0016    | 68191      |
+-----------+--------+-----------+------------+

WES-SN - tests performed on a single node using WES dataset

WGS-CL - tests performed on a cluster using WGS dataset

5.1.3. Test procedure

To achieve reliable results, each test case has been run 3 times. Before each run, disk caches on all nodes have been purged.

5.1.3.1. File-dataframe mapping

The first step of the testing procedure was to prepare a mapping between the input datasets (in BAM, ADAM and BED formats) and their corresponding dataframe/table abstractions. For the alignment files our custom data sources have been used; for the BED file, Spark's built-in data source dedicated for CSV data access.

BAM

val bamPath = "NA12878*.bam"
spark.sql(
  s"""
     |CREATE TABLE reads
     |USING org.biodatageeks.sequila.datasources.BAM.BAMDataSource
     |OPTIONS(path "${bamPath}")
     |
  """.stripMargin)
spark.sql(s"SELECT contigName,start,end FROM reads LIMIT 1").show()

+----------+-----+---+
|contigName|start|end|
+----------+-----+---+
|      chr1|   34|109|
+----------+-----+---+

ADAM

val adamPath = "NA12878*.adam"
spark.sql(
  s"""
     |CREATE TABLE reads
     |USING org.biodatageeks.sequila.datasources.ADAM.ADAMDataSource
     |OPTIONS(path "${adamPath}")
     |
  """.stripMargin)
spark.sql(s"SELECT contigName,start,end FROM reads LIMIT 1").show()

+----------+-----+---+
|contigName|start|end|
+----------+-----+---+
|      chr1|   34|109|
+----------+-----+---+

BED

val  bedPath="tgp_exome_hg18.bed"
spark.sql(s"""
    |CREATE TABLE targets(contigName String,start Integer,end Integer)
    |USING csv
    |OPTIONS (path "file:///${bedPath}", delimiter "\t")""".stripMargin)
spark.sql("SELECT * FROM targets LIMIT 1").show

+----------+-----+----+
|contigName|start| end|
+----------+-----+----+
|      chr1| 4806|4926|
+----------+-----+----+

5.1.3.2. SQL query for counting features

For counting reads overlapping predefined feature regions the following SQL query has been used:

SELECT targets.contigName,targets.start,targets.end,count(*) FROM reads JOIN targets
     ON (targets.contigName=reads.contigName
     AND
     CAST(reads.end AS INTEGER)>=CAST(targets.start AS INTEGER)
     AND
     CAST(reads.start AS INTEGER)<=CAST(targets.end AS INTEGER)
     )
     GROUP BY targets.contigName,targets.start,targets.end

Exactly the same query has been used for both single node and cluster tests.
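
As a sketch of how this query was wired into Spark (assuming the SequilaSession ss from the snippet in the previous section and the reads and targets tables created above; the output path is illustrative), the counting step and a plan check can be expressed as:

// Sketch: execute the feature-counting query through SeQuiLa
// (assumes `ss`, `reads` and `targets` from the snippets above; output path is illustrative).
val countsQuery =
  """SELECT targets.contigName, targets.start, targets.end, count(*)
    |FROM reads JOIN targets
    |ON (targets.contigName = reads.contigName
    |AND CAST(reads.end AS INTEGER) >= CAST(targets.start AS INTEGER)
    |AND CAST(reads.start AS INTEGER) <= CAST(targets.end AS INTEGER))
    |GROUP BY targets.contigName, targets.start, targets.end""".stripMargin

val counts = ss.sql(countsQuery)
counts.explain()                                   // inspect the physical plan to see which join strategy was chosen
counts.write.format("parquet").save("/tmp/feature_counts.parquet")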

5.1.3.3. Apache Spark settings

+-----------------+--------+
| Parameter       | Values |
+-----------------+--------+
| driver-memory   | 8g     |
| executor-memory | 4-8g   |
| executor-cores  | 2-4    |
| num-executors   | 1-15   |
+-----------------+--------+

5.1.4. Results

SeQuiLa, when run in parallel, outperforms the selected competing tools in terms of speed on a single node (1.7-22.1x) and on a cluster (3.2-4.7x). The SeQuiLa strategy that broadcasts the interval forest together with all data columns (SeQuiLa_it_all) performs best in most cases (no network shuffling required), whereas broadcasting intervals with identifiers only (SeQuiLa_it_int) performs comparably to, or better than, GenAp. All algorithms favour the columnar (ADAM) over the row-oriented (BAM) file format due to column pruning and the resulting reduction of disk I/O operations.

5.1.4.1. Local mode

../_images/local.png

5.1.4.2. Hadoop cluster

../_images/cluster.png

5.1.4.3. Limitations

SeQuiLa is slower than featureCounts in single-threaded runs due to the less performant Java BAM reader (mainly BGZF decompression) available in the htsjdk library. We will try to investigate and resolve this bottleneck in the next major release.
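
One mitigation already exposed by SeQuiLa (and used in the coverage tests in section 5.2) is switching BGZF decompression to the Intel GKL inflater via a session option; whether this helps for a particular workload should be verified experimentally:

// Enable the Intel GKL based inflater for BGZF decompression
// (same option as used in the coverage benchmark commands in section 5.2).
ss.sqlContext.setConf("spark.biodatageeks.bam.useGKLInflate", "true")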

5.1.5. Discussion

The results show that SeQuiLa significantly accelerates genomic interval queries. We are aware that the paradigm of distributed computing is currently not fully embraced by bioinformaticians; therefore we have put additional effort into making SeQuiLa easy to integrate into existing applications and pipelines.

5.2. Depth of coverage

5.2.1. Performance tests description

The main goal of our tests was to compare SeQuiLa-cov performance and scalability with other state-of-the-art coverage solutions (samtools depth, bedtools genomecov, GATK DepthOfCoverage, sambamba depth and mosdepth). The tests were performed on the aligned WES and WGS reads from the NA12878 sample and aimed at calculating blocks and window coverage whenever this functionality was available. Additionally, we performed a quality check, verifying that the results generated by SeQuiLa-cov are identical to those returned by samtools depth, and we evaluated the impact of the Intel GKL on overall performance.

5.2.1.1. Test environment setup

5.2.1.2. Infrastructure

Our tests have been run on a Hadoop cluster with 24 worker nodes:

+-------+-----------------+-----------+----------+
| Role  | Number of nodes | CPU cores | RAM [GB] |
+-------+-----------------+-----------+----------+
| EN    | 1               | 28/56     | 512      |
| NN/RM | 3               | 28/56     | 512      |
| DN/NM | 24              | 28/56     | 512      |
+-------+-----------------+-----------+----------+

EN - edge node where, in the case of cluster tests, only Application Masters and Spark Drivers have been launched. In the case of single-node tests (Apache Spark local mode), all computations have been performed on the edge node.

NN - HDFS NameNode

RM - YARN ResourceManager

DN - HDFS DataNode

NM - YARN NodeManager

5.2.1.3. Software

All tests have been run using the following software components:

+---------------+---------+--------------------------+
| Software      | Version | Notes                    |
+---------------+---------+--------------------------+
| HDP           | 3.0.1   |                          |
| Apache Hadoop | 3.1.1   |                          |
| Apache Spark  | 2.3.1   |                          |
| Oracle JDK    | 1.8     |                          |
| Scala         | 2.12    |                          |
| samtools      | 1.9     |                          |
| bedtools      | 2.27    |                          |
| GATK          | 3.8     |                          |
| sambamba      | 0.6.8   |                          |
| mosdepth      | 0.2.3   |                          |
| mosdepth      | 0.2.4   | using --fast-mode option |
+---------------+---------+--------------------------+

5.2.2. Datasets for coverage tests

Two NGS datasets have been used in all the tests: a WES (whole exome sequencing) dataset for vertical scalability evaluation and a WGS (whole genome sequencing) dataset for horizontal scalability evaluation. Both of them came from sequencing of the NA12878 sample, which is widely used in many benchmarks.

In order to remove malformed reads (in particular, reads with CIGAR and sequence-length inconsistencies), we have processed the original BAM files with GATK's PrintReads tool.

# also set compression to BAM default level (5) -Dsamjdk.compression_level=5
gatk --java-options "-Dsamjdk.compression_level=5" PrintReads -I /data/NA12878.hiseq.wgs.bwa.recal.bam -O /data/proper.NA12878.bam

The table below presents basic datasets information:

+-------+--------+-----------+------------+---------------+--------------------+
| Data  | Format | Size [GB] | Row count  | Test data URL | Original source    |
+-------+--------+-----------+------------+---------------+--------------------+
| WES   | BAM    | 17        | 161544693  | WES BAM       | original WES BAM   |
| WGS   | BAM    | 273       | 2617420313 | WGS BAM       | original WGS BAM   |
| WGS-L | BAM    | 127       | 22314075   | WGS-L BAM     | original WGS-L BAM |
+-------+--------+-----------+------------+---------------+--------------------+

WES - tests performed on a single node using WES dataset

WGS - tests performed on a cluster using WGS dataset

WGS-L - tests performed on a single node using WGS dataset (long reads)

5.2.3. Test procedure

To achieve reliable results, each test case has been run 3 times. Before each run, disk caches on all nodes have been purged.

5.2.3.1. File-dataframe mapping

The first step of the testing procedure was to prepare a mapping between the input dataset in BAM format and its dataframe/table abstraction through our custom data source.

BAM

val bamPath = "NA12878*.bam"
spark.sql(
  s"""
     |CREATE TABLE reads
     |USING org.biodatageeks.sequila.datasources.BAM.BAMDataSource
     |OPTIONS(path "${bamPath}")
     |
  """.stripMargin)
spark.sql(s"SELECT contigName,start,end FROM reads LIMIT 1").show()

+----------+-----+---+
|contigName|start|end|
+----------+-----+---+
|      chr1|   34|109|
+----------+-----+---+

5.2.3.2. Coverage calculations

For calculating the coverage the following commands have been used:

### SAMTOOLS 1.9
#exome - bases 1 core
{ time samtools depth NA12878.proper.wes.bam > samtools/NA12878.proper.wes.bamdepth ; } 2>> samtools/wes_time.txt
# genome - bases 1 core
{ time samtools depth NA12878.proper.wgs.bam > samtools/NA12878.proper.wgs.bam.depth ; } 2>> samtools/wgs_time.txt
### BEDTOOLS
# exome blocks 1 core
time docker run --rm -v /data/samples/NA12878:/data/samples/NA12878 -w /data/samples/NA12878 biocontainers/bedtools:v2.27.0_cv2 bedtools genomecov -ibam /data/samples/NA12878/WES/NA12878.proper.wes.bam -bga > /data/samples/NA12878/bedtools_genomecov_block_coverage_wes.txt
# genome blocks 1 core
time docker run --rm -v /data/samples/NA12878:/data/samples/NA12878 -w /data/samples/NA12878 biocontainers/bedtools:v2.27.0_cv2 bedtools genomecov -ibam /data/samples/NA12878/WGS/NA12878.proper.wgs.bam -bga > /data/samples/NA12878/bedtools_genomecov_block_coverage_wgs.txt
### GATK
# exome 1,5,10 cores
{ time docker run -it -v /data/samples/NA12878/WES:/data/ -v /data/samples/hg_builds/:/ref/ broadinstitute/gatk3:3.8-1 java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R /ref/Homo_sapiens_assembly18.fasta -o /data/gatk_doc_test.txt -I /data/NA12878.proper.wes.bam -omitIntervals -nt 1 ; } 2>> gatk_wes_time_1.txt
{ time docker run -it -v /data/samples/NA12878/WES:/data/ -v /data/samples/hg_builds/:/ref/ broadinstitute/gatk3:3.8-1 java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R /ref/Homo_sapiens_assembly18.fasta -o /data/gatk_doc_test.txt -I /data/NA12878.proper.wes.bam -omitIntervals -nt 5 ; } 2>> gatk_wes_time_5.txt
{ time docker run -it -v /data/samples/NA12878/WES:/data/ -v /data/samples/hg_builds/:/ref/ broadinstitute/gatk3:3.8-1 java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R /ref/Homo_sapiens_assembly18.fasta -o /data/gatk_doc_test.txt -I /data/NA12878.proper.wes.bam -omitIntervals -nt 10 ; } 2>> gatk_wes_time_10.txt
# genome 1,5,10 cores
{ time docker run -it -v /data/samples/NA12878/WGS:/data/ -v /data/samples/hg_builds/:/ref/ broadinstitute/gatk3:3.8-1 java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R /ref/Homo_sapiens_assembly18.fasta -o /data/gatk_doc_test.txt -I /data/NA12878.proper.wgs.bam -omitIntervals -nt 1 ; } 2>> gatk_wgs_time_1.txt
{ time docker run -it -v /data/samples/NA12878/WGS:/data/ -v /data/samples/hg_builds/:/ref/ broadinstitute/gatk3:3.8-1 java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R /ref/Homo_sapiens_assembly18.fasta -o /data/gatk_doc_test.txt -I /data/NA12878.proper.wgs.bam -omitIntervals -nt 5 ; } 2>> gatk_wgs_time_5.txt
{ time docker run -it -v /data/samples/NA12878/WGS:/data/ -v /data/samples/hg_builds/:/ref/ broadinstitute/gatk3:3.8-1 java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R /ref/Homo_sapiens_assembly18.fasta -o /data/gatk_doc_test.txt -I /data/NA12878.proper.wgs.bam -omitIntervals -nt 10 ; } 2>> gatk_wgs_time_10.txt
### SAMBAMBA
# exome - blocks 1,5,10 cores
time docker run --rm -v /data/samples/NA12878:/data/samples/NA12878 -w /data/samples/NA12878 wkusmirek/sambamba /opt/sambamba-0.6.8-linux-static depth base --output-file=sambamba_base_coverage.txt --nthreads=1 /data/samples/NA12878/WES/NA12878.proper.wes.bam
time docker run --rm -v /data/samples/NA12878:/data/samples/NA12878 -w /data/samples/NA12878 wkusmirek/sambamba /opt/sambamba-0.6.8-linux-static depth base --output-file=sambamba_base_coverage.txt --nthreads=5 /data/samples/NA12878/WES/NA12878.proper.wes.bam
time docker run --rm -v /data/samples/NA12878:/data/samples/NA12878 -w /data/samples/NA12878 wkusmirek/sambamba /opt/sambamba-0.6.8-linux-static depth base --output-file=sambamba_base_coverage.txt --nthreads=10 /data/samples/NA12878/WES/NA12878.proper.wes.bam

# exome - windows 1,5,10 cores
time docker run --rm -v /data/samples/NA12878:/data/samples/NA12878 -w /data/samples/NA12878 wkusmirek/sambamba /opt/sambamba-0.6.8-linux-static depth window --output-file=sambamba_window_coverage.txt --nthreads=1 --window-size=500 /data/samples/NA12878/WES/NA12878.proper.wes.bam
time docker run --rm -v /data/samples/NA12878:/data/samples/NA12878 -w /data/samples/NA12878 wkusmirek/sambamba /opt/sambamba-0.6.8-linux-static depth window --output-file=sambamba_window_coverage.txt --nthreads=5 --window-size=500 /data/samples/NA12878/WES/NA12878.proper.wes.bam
time docker run --rm -v /data/samples/NA12878:/data/samples/NA12878 -w /data/samples/NA12878 wkusmirek/sambamba /opt/sambamba-0.6.8-linux-static depth window --output-file=sambamba_window_coverage.txt --nthreads=10 --window-size=500 /data/samples/NA12878/WES/NA12878.proper.wes.bam

# genome - blocks 1,5,10 cores
time docker run --rm -v /data/samples/NA12878:/data/samples/NA12878 -w /data/samples/NA12878 wkusmirek/sambamba /opt/sambamba-0.6.8-linux-static depth base --output-file=sambamba_base_coverage.txt --nthreads=1 /data/samples/NA12878/WGS/NA12878.proper.wgs.bam
time docker run --rm -v /data/samples/NA12878:/data/samples/NA12878 -w /data/samples/NA12878 wkusmirek/sambamba /opt/sambamba-0.6.8-linux-static depth base --output-file=sambamba_base_coverage.txt --nthreads=5 /data/samples/NA12878/WGS/NA12878.proper.wgs.bam
time docker run --rm -v /data/samples/NA12878:/data/samples/NA12878 -w /data/samples/NA12878 wkusmirek/sambamba /opt/sambamba-0.6.8-linux-static depth base --output-file=sambamba_base_coverage.txt --nthreads=10 /data/samples/NA12878/WGS/NA12878.proper.wgs.bam

# genome - windows
time docker run --rm -v /data/samples/NA12878:/data/samples/NA12878 -w /data/samples/NA12878 wkusmirek/sambamba /opt/sambamba-0.6.8-linux-static depth window --output-file=sambamba_window_coverage.txt --nthreads=1 --window-size=500 /data/samples/NA12878/WGS/NA12878.proper.wgs.bam
time docker run --rm -v /data/samples/NA12878:/data/samples/NA12878 -w /data/samples/NA12878 wkusmirek/sambamba /opt/sambamba-0.6.8-linux-static depth window --output-file=sambamba_window_coverage.txt --nthreads=5 --window-size=500 /data/samples/NA12878/WGS/NA12878.proper.wgs.bam
time docker run --rm -v /data/samples/NA12878:/data/samples/NA12878 -w /data/samples/NA12878 wkusmirek/sambamba /opt/sambamba-0.6.8-linux-static depth window --output-file=sambamba_window_coverage.txt --nthreads=10 --window-size=500 /data/samples/NA12878/WGS/NA12878.proper.wgs.bam
### MOSDEPTH v 0.2.3
# exome blocks 1,5,10 cores
{ time mos/mosdepth prefix NA12878.proper.wes.bam ; } 2>> mos_wes_time_1.txt
{ time mos/mosdepth -t 4 prefix NA12878.proper.wes.bam ; } 2>> mos_wes_time_5.txt
{ time mos/mosdepth -t 9 prefix NA12878.proper.wes.bam ; } 2>> mos_wes_time_10.txt

# genome blocks 1,5,10 cores
{ time mos/mosdepth prefix NA12878.proper.wgs.bam ; } 2>> wgs_time_1.txt
{ time mos/mosdepth -t 4 prefix NA12878.proper.wgs.bam ; } 2>> wgs_time_5.txt
{ time mos/mosdepth -t 9 prefix NA12878.proper.wgs.bam ; } 2>> wgs_time_9.txt
### MOSDEPTH v 0.2.4 fast
# exome blocks 1,5,10 cores
{ time mos/mosdepth --fast-mode prefix NA12878.proper.wes.bam ; } 2>> mos_wes_time_1.txt
{ time mos/mosdepth --fast-mode -t 4 prefix NA12878.proper.wes.bam ; } 2>> mos_wes_time_5.txt
{ time mos/mosdepth --fast-mode -t 9 prefix NA12878.proper.wes.bam ; } 2>> mos_wes_time_10.txt

# genome blocks 1,5,10 cores
{ time mos/mosdepth --fast-mode prefix NA12878.proper.wgs.bam ; } 2>> wgs_time_1.txt
{ time mos/mosdepth --fast-mode -t 4 prefix NA12878.proper.wgs.bam ; } 2>> wgs_time_5.txt
{ time mos/mosdepth --fast-mode -t 9 prefix NA12878.proper.wgs.bam ; } 2>> wgs_time_9.txt
### SEQUILA-COV
# spark shell started with 1,5,10 cores
spark-shell  --conf "spark.sql.catalogImplementation=in-memory" --conf spark.dynamicAllocation.enabled=false  --master=yarn-client --driver-memory=4g --executor-memory=4g --num-executors=1 --packages org.biodatageeks:bdg-sequila_2.11:0.4.1-SNAPSHOT --repositories https://zsibio.ii.pw.edu.pl/nexus/repository/maven-snapshots/ -v
spark-shell  --conf "spark.sql.catalogImplementation=in-memory" --conf spark.dynamicAllocation.enabled=false  --master=yarn-client --driver-memory=4g --executor-memory=4g --num-executors=5 --packages org.biodatageeks:bdg-sequila_2.11:0.4.1-SNAPSHOT --repositories https://zsibio.ii.pw.edu.pl/nexus/repository/maven-snapshots/ -v
spark-shell  --conf "spark.sql.catalogImplementation=in-memory" --conf spark.dynamicAllocation.enabled=false  --master=yarn-client --driver-memory=4g --executor-memory=4g --num-executors=10 --packages org.biodatageeks:bdg-sequila_2.11:0.4.1-SNAPSHOT --repositories https://zsibio.ii.pw.edu.pl/nexus/repository/maven-snapshots/ -v
// inside spark-shell for SeQuiLa-cov
import org.apache.spark.sql.SequilaSession
import org.biodatageeks.sequila.utils.{SequilaRegister, UDFRegister, BDGInternalParams}

spark.sqlContext.setConf(BDGInternalParams.InputSplitSize, "134217728")
val ss = SequilaSession(spark)
SequilaRegister.register(ss)
ss.sqlContext.setConf("spark.biodatageeks.bam.useGKLInflate", "true")
ss.sqlContext.setConf("spark.biodatageeks.bam.useSparkBAM", "false")

/* WES - bases/blocks */
ss.sql("""
CREATE TABLE IF NOT EXISTS reads_exome USING org.biodatageeks.sequila.datasources.BAM.BAMDataSource OPTIONS(path '/data/samples/NA12878/WES/NA12878*.bam')""")
spark.time {
  ss.sql("SELECT * FROM coverage('reads_exome','NA12878', 'blocks')").write.format("parquet").save("/data/samples/NA12878/output_tmp/wes_1_9.parquet")
}

/* WGS - bases/blocks */
import org.apache.spark.sql.SequilaSession
import org.biodatageeks.sequila.utils.{SequilaRegister, UDFRegister}
val ss = SequilaSession(spark)
SequilaRegister.register(ss)
ss.sqlContext.setConf("spark.biodatageeks.bam.useGKLInflate", "true")
ss.sqlContext.setConf("spark.biodatageeks.bam.useSparkBAM", "false")
ss.sql("""
CREATE TABLE IF NOT EXISTS reads_genome USING org.biodatageeks.sequila.datasources.BAM.BAMDataSource OPTIONS(path '/data/samples/NA12878/NA12878*.bam')""")
spark.time {
  ss.sql("SELECT * FROM coverage('reads_genome','NA12878', 'blocks')").write.format("parquet").save("/data/samples/NA12878/output_tmp/wgs_1_1.parquet")
}

/* windows - 500 */
import org.apache.spark.sql.SequilaSession
import org.biodatageeks.sequila.utils.{SequilaRegister, UDFRegister}
val ss = SequilaSession(spark)
SequilaRegister.register(ss)
ss.sqlContext.setConf("spark.biodatageeks.bam.useGKLInflate", "true")
ss.sqlContext.setConf("spark.biodatageeks.bam.useSparkBAM", "false")
ss.sql("""
CREATE TABLE IF NOT EXISTS reads_exome USING org.biodatageeks.sequila.datasources.BAM.BAMDataSource OPTIONS(path '/tmp/data/exome/*.bam')""")
spark.time {
  ss.sql("SELECT * FROM coverage('reads_exome','NA12878', 'blocks', '500')").write.format("parquet").save("/tmp/data/32MB_w500_3.parquet")
}

/* long reads */
ss.sql("""
CREATE TABLE IF NOT EXISTS qreads
USING org.biodatageeks.sequila.datasources.BAM.BAMDataSource
OPTIONS(path '/data/granges/nanopore/guppy.bam')""")
ss.sql("SELECT contigName, start, end, coverage FROM coverage('qreads','NA12878', 'blocks')").write.mode("overwrite").option("delimiter", "\t").csv("/data/granges/nanopore/guppy_cov.bed")

5.2.3.3. Apache Spark settings

+-----------------+--------+
| Parameter       | Values |
+-----------------+--------+
| driver-memory   | 8g     |
| executor-memory | 4g     |
| executor-cores  | 1      |
| num-executors   | 1-500  |
+-----------------+--------+

5.2.4. Results

Detailed results are shown in the table below:

+------+----------------+-------+------------+-------------+--------------+------------+------------+---------------+-------------+
| data | operation type | cores | samtools   | bedtools    | GATK         | sambamba   | mosdepth   | mosdepth fast | SeQuiLa-cov |
+------+----------------+-------+------------+-------------+--------------+------------+------------+---------------+-------------+
| WGS  | blocks         | 1     | 2h 14m 58s | 10h 41m 27s | 128w *       | 2h 44m 0s  | 1h 46m 27s | 1h 38m 06s    | 1h 47m 05s  |
|      |                | 5     |            |             | 2d 23h 18m * | 2h 47m 53s | 36m 13s    | 33m 50s       | 26m 59s     |
|      |                | 10    |            |             | 2d 17h 6m *  | 2h 50m 47s | 34m 34s    | 33m 16s       | 13m 54s     |
|      | windows        | 1     |            |             |              | 1h 46m 50s | 1h 22m 49s |               | 1h 24m 08s  |
|      |                | 5     |            |             |              | 1h 41m 23s | 20m 3s     |               | 18m 43s     |
|      |                | 10    |            |             |              | 1h 50m 35s | 17m 49s    |               | 9m 14s      |
+------+----------------+-------+------------+-------------+--------------+------------+------------+---------------+-------------+
| WES  | blocks         | 1     | 12m 26s    | 23m 25s     | 1d 5h 6m     | 25m 42s    | 6m 43s     | 6m 12s        | 6m 54s      |
|      |                | 5     |            |             | 3d 0h 24m    | 25m 46s    | 2m 25s     | 2m 21s        | 1m 47s      |
|      |                | 10    |            |             | 2d 22h 30m   | 25m 49s    | 2m 20s     | 2m 06s        | 1m 04s      |
|      | windows        | 1     |            |             |              | 14m 36s    | 6m 11s     |               | 6m 29s      |
|      |                | 5     |            |             |              | 14m 54s    | 2m 08s     |               | 1m 42s      |
|      |                | 10    |            |             |              | 14m 40s    | 2m 14s     |               | 1m 1s       |
+------+----------------+-------+------------+-------------+--------------+------------+------------+---------------+-------------+

(*) estimated time

The figure below presents a performance and scalability comparison of samtools, mosdepth and SeQuiLa-cov.

../_images/coverage.png

5.2.4.1. Base level coverage performance comparison for WES dataset with samtools

The wall-time on a single core is comparable to samtools. We re-confirmed the scalability of SeQuiLa-cov with the base-level output, with a significant time reduction when executed in a distributed environment.

+-------+----------------+-----------------+----------+
| cores | sequila(bases) | sequila(blocks) | samtools |
+-------+----------------+-----------------+----------+
| 1     | 17m 13s        | 6m 54s          | 12m 26s  |
| 5     | 4m 17s         | 1m 47s          |          |
| 10    | 2m 21s         | 1m 04s          |          |
+-------+----------------+-----------------+----------+
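
For reference, the base-level figures above were obtained with the same coverage table function as the block runs; the sketch below assumes 'bases' as the mode argument (by analogy with the 'blocks' calls shown in the test procedure) and an illustrative output path:

// Sketch: base-level coverage for the WES table defined in the test procedure.
// 'bases' as the mode argument and the output path are assumptions made for illustration.
spark.time {
  ss.sql("SELECT * FROM coverage('reads_exome','NA12878', 'bases')")
    .write.format("parquet")
    .save("/data/samples/NA12878/output_tmp/wes_bases.parquet")
}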

5.2.4.2. CRAM versus BAM performance comparison for WES dataset (blocks)

We observed that timings for processing CRAM files are ~2.5-4 times higher than for BAM files. Importantly, the processing times of the coverage calculation algorithm itself are equal for BAMs and CRAMs; however, the reading stage is significantly slower in the case of the latter. Further speedup of CRAM processing in SeQuiLa will require a significant reimplementation of the data access layer, which is an important direction of our future work.

+-------+---------------+--------------+
| cores | sequila(CRAM) | sequila(BAM) |
+-------+---------------+--------------+
| 1     | 26m 27s       | 6m 54s       |
| 5     | 4m 35s        | 1m 47s       |
| 10    | 2m 54s        | 1m 04s       |
| 25    | 1m 44s        | 0m 28s       |
| 50    | 1m 15s        | 0m 20s       |
+-------+---------------+--------------+
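
The CRAM runs followed the same table-mapping pattern as the BAM ones; the sketch below mirrors the BAM example, but note that the CRAM data source class name and the reference-path option are assumptions not confirmed by this document:

// Hypothetical sketch only: the CRAM data source class and the refPath option are assumptions;
// the BAMDataSource mapping shown earlier is the confirmed pattern.
ss.sql("""
CREATE TABLE IF NOT EXISTS reads_exome_cram
USING org.biodatageeks.sequila.datasources.BAM.CRAMDataSource
OPTIONS(path '/data/samples/NA12878/WES/NA12878.proper.wes.cram',
        refPath '/data/samples/hg_builds/Homo_sapiens_assembly18.fasta')""")
spark.time {
  ss.sql("SELECT * FROM coverage('reads_exome_cram','NA12878', 'blocks')")
    .write.format("parquet")
    .save("/data/samples/NA12878/output_tmp/wes_cram.parquet")
}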

5.2.4.3. Performance of saving coverage results as a single BED file

In order to get coverage results as a single file, we need to explicitly use the coalesce method to merge records from all the partitions before writing them to storage. Such an approach causes performance degradation, as the data cannot be written in a distributed fashion. Likewise, because only one thread is used for writing, scalability is impaired as well.

import org.apache.spark.sql.SequilaSession
import org.biodatageeks.sequila.utils.{SequilaRegister, UDFRegister,BDGInternalParams}

val ss = SequilaSession(spark)
SequilaRegister.register(ss)
ss.sqlContext.setConf("spark.biodatageeks.bam.useGKLInflate","true")
ss.sql("""CREATE TABLE IF NOT EXISTS reads_exome USING org.biodatageeks.sequila.datasources.BAM.BAMDataSource OPTIONS(path '/data/exome/NA12878.*.bam')""")
spark.time { ss.sql(s"SELECT * FROM coverage('reads_exome','NA12878', 'blocks')")
        .coalesce(1)
        .write.mode("overwrite")
        .option("delimiter", "\t")
        .csv("/data/granges/exome/coverage.bed")}

+-------+--------------------------+------------------------+
| cores | sequila(BED,coalesce(1)) | sequila(Parquet,split) |
+-------+--------------------------+------------------------+
| 1     | 11m 59s                  | 6m 54s                 |
| 5     | 4m 56s                   | 1m 47s                 |
| 10    | 4m 05s                   | 1m 04s                 |
+-------+--------------------------+------------------------+

5.2.4.4. Long reads support

We have tested coverage calculation for long reads in terms of wall-time and quality of the results. When run in 'bases' mode, SeQuiLa-cov produced results identical to those of samtools.

+-------+----------+-----------------+
| cores | samtools | sequila(blocks) |
+-------+----------+-----------------+
| 1     | 95m 8s   | 83m 3s          |
| 5     |          | 17m 25s         |
| 10    |          | 9m 10s          |
+-------+----------+-----------------+
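
The equivalence check against samtools can be sketched in Spark as well: load the samtools depth output (a tab-separated file of contig, position and depth), compute per-base coverage with SeQuiLa-cov, and count mismatching rows. File paths, the 'bases' mode argument and the assumption that both outputs use the same position convention are illustrative, not confirmed details:

// Illustrative sketch of the samtools vs SeQuiLa-cov equivalence check.
// Paths, the 'bases' argument and the position convention are assumptions.
val samtoolsDepth = spark.read
  .option("delimiter", "\t")
  .csv("/data/granges/nanopore/guppy.depth")
  .toDF("contigName", "pos", "depth")

val sequilaBases = ss.sql(
  "SELECT contigName, start, coverage FROM coverage('qreads','NA12878', 'bases')")

val mismatches = samtoolsDepth
  .join(sequilaBases,
    samtoolsDepth("contigName") === sequilaBases("contigName") &&
    samtoolsDepth("pos").cast("int") === sequilaBases("start"))
  .filter(samtoolsDepth("depth").cast("int") =!= sequilaBases("coverage"))

println(s"Mismatching positions: ${mismatches.count()}")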

5.2.4.5. Discussion

Both samtools and bedtools calculate coverage using only a single thread; however, their run times differ significantly, with samtools being around twice as fast. Sambamba positions itself as a multithreaded solution, although our tests revealed that its execution time is nearly constant regardless of the number of CPU cores used, and is roughly twice as long as that of samtools.

Mosdepth achieved a speedup over samtools in blocks coverage and over sambamba in windows coverage calculations; however, its scalability reaches its limit at 5 CPU cores.

Finally, SeQuiLa-cov achieves nearly identical performance to mosdepth on a single core, but its execution time decreases substantially with a greater number of available computing resources, which makes it the fastest solution when run on multiple cores and nodes.

Our results show that when utilizing additional resources (i.e. more than 10 CPU cores), SeQuiLa-cov is able to reduce the total computation time to 15 seconds for WES and less than one minute for WGS data. The scalability limit is reached at around 200 and ~500 CPU cores for WES and WGS data, respectively.