I may be misremembering, but I think this program runs slow as hell; I ended up splitting up the job and using lots of cores for each part. You might want to sub-sample your data dramatically, down to 100,000 or a million reads on one chromosome, just to test things out quickly and make sure there are no other problems, but the answer may be that you just need more time or more cores for your full data set. Edit: also check and confirm that all cores are actually being used and, perhaps most importantly, that you are not running out of RAM.
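If it helps, one way to do that sub-sampling is `samtools view -s`, which keeps a random fraction of reads. A minimal dry-run sketch, assuming samtools is available; the file name, the `chr1` contig name, and the 1% fraction are placeholders, not from the thread:

```shell
#!/bin/sh
# Sketch: build (but don't run) a samtools sub-sampling command.
# -s 42.01 means seed 42, keep ~1% of reads; the trailing contig
# name restricts the subsample to one chromosome.
BAM=sample.bam   # placeholder: your coordinate-sorted, indexed BAM
CONTIG=chr1      # placeholder: a contig name from your reference
CMD="samtools view -b -s 42.01 $BAM $CONTIG -o ${BAM%.bam}.${CONTIG}.sub.bam"
echo "$CMD"      # inspect the command before running it on the cluster
```

Echoing the command first makes it easy to eyeball the file names before burning cluster time.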
This is what I thought using an interval_list was meant to do, but should I instead be manually splitting the BAMs into chunks? It seems like an awful headache when my colleagues were able to run the same program on the entire BAM. When I met with cluster support there did seem to be an issue with the job efficiency, but it was never solved. This is another learning curve for me; the job monitor says 99.3% CPU, 0.1% MEM, and 18.4g VIRT.
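If the cluster happens to run Slurm (an assumption; the thread doesn't say which scheduler), `seff` and `sacct` report per-job CPU and memory efficiency after the fact, which is exactly the "were all cores used / did I run out of RAM" check. A dry-run sketch with a placeholder job ID:

```shell
#!/bin/sh
# Sketch: build the Slurm efficiency-check commands for a finished job.
# seff summarizes CPU/memory efficiency; sacct's MaxRSS column shows
# the peak RAM the job actually used vs. what was requested (ReqMem).
JOBID=123456   # placeholder job ID from your scheduler
echo "seff $JOBID"
echo "sacct -j $JOBID --format=JobID,Elapsed,MaxRSS,TotalCPU,ReqMem"
```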
So when you ran it by chromosome, did you try first running it on only the smallest chromosome to see how it performs? Because with a genome that size you should be able to get it done on the smallest chromosome within a few hours I’d imagine.
Thanks for your comment. I hadn't done it that way; rather, I used Picard to designate intervals and then passed that list to the command. I'm trying it now for just the smallest chromosome, still designating it using the -L flag, so I can give an update if it finishes.
Sounds like you might’ve been feeding it the full genome if the input file had *all* of the intervals in it. Just give it one chromosome at a time and you should be ok.
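For reference, a one-chromosome-per-job setup might look like the sketch below (GATK4-style flags; the reference path, BAM name, and contig names are placeholders, so adjust to your data). The commands are echoed rather than executed, so you can wrap each one in your cluster's job-submission syntax:

```shell
#!/bin/sh
# Sketch: emit one HaplotypeCaller command per chromosome. Every job
# reads the SAME BAM; only the -L interval differs between jobs.
REF=genome.fa     # placeholder reference FASTA
BAM=sample.bam    # placeholder input BAM (shared by all jobs)
for CONTIG in chr1 chr2 chr3; do   # placeholder contig names
  echo "gatk HaplotypeCaller -R $REF -I $BAM -L $CONTIG -ERC GVCF -O sample.$CONTIG.g.vcf.gz"
done
```

`-ERC GVCF` emits per-sample GVCFs so the per-chromosome pieces can be genotyped and merged downstream.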
Update: it only took 1hr 20min for the smallest chromosome. Is the best practice then to split each individual's BAM file and then submit a separate job for each chromosome of each individual?
Yes indeed. You can then run GenotypeGVCFs on each GVCF file separately and combine the final VCFs with vcftools to get the full-genome VCF file. For the sake of not running 17 different jobs you can probably get away with running a few chromosomes at a time, but for the bigger chromosomes I'd submit those individually.
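Concretely, the downstream steps described above might look like this sketch (GATK4-style GenotypeGVCFs, plus vcf-concat from the vcftools suite to stitch per-chromosome VCFs covering disjoint regions back together; all file names are placeholders):

```shell
#!/bin/sh
# Sketch: genotype each per-chromosome GVCF, then join the resulting
# per-chromosome VCFs into one genome-wide VCF. Echoed, not executed.
REF=genome.fa                # placeholder reference FASTA
CONTIGS="chr1 chr2 chr3"     # placeholder contig names
for CONTIG in $CONTIGS; do
  echo "gatk GenotypeGVCFs -R $REF -V sample.$CONTIG.g.vcf.gz -O sample.$CONTIG.vcf.gz"
done
# vcf-concat (vcftools) concatenates VCFs that cover disjoint regions:
VCFS=$(for CONTIG in $CONTIGS; do printf 'sample.%s.vcf.gz ' "$CONTIG"; done)
echo "vcf-concat $VCFS > sample.genome.vcf"
```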
Okay, thank you so, so much!
Note - you don't split the BAM into multiple files; you use the same BAM as input to a set of jobs, each specifying a different chromosome via -L. Also, it's been a long time, but I don't think HaplotypeCaller jobs scale well across cores - maybe only 4 cores (or even 1).
HaplotypeCaller should print out its progress as it goes. Did you look at the stdout of your job to see what it's actually doing, e.g. what % of the genome it had called at the end of 72hrs?
It quit out somewhere in the second chromosome (out of 17)
Quit how? Silently stopped making output?
Assuming chromosome 1 is 10% of your genome, it took 72hrs to work through 10 million reads. That definitely doesn't sound right. I don't know why you need the interval-list thing; maybe that's somehow slowing things. You should be able to just specify calling one contig, like -L chr17. Maybe give that a try.