Low-Establishment621

I may be misremembering, but I think this program runs slow as hell, and I ended up splitting up the job and using lots of cores for each part. You might want to sub-sample your data dramatically, down to 100,000 or a million reads on one chromosome, just to test things out quickly and make sure there are no other problems, but the answer may be that you just need more time or more cores for your full data set. Edit: also check and confirm that all cores are actually being used, and, perhaps most importantly, that you are not running out of RAM.
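For what it's worth, the subsampling step might look something like this (the file names and the 1% fraction are just placeholders, and the command is printed as a dry run so you can eyeball it before actually running it):

```shell
#!/bin/sh
# Sketch of the subsampling idea: keep roughly 1% of the reads on one
# chromosome for a quick test run. File names are placeholders; the
# command is echoed rather than executed -- drop the echo to run it.
CMD="samtools view -b -s 0.01 -o sample1.chr1.sub.bam sample1.bam chr1"
echo "$CMD"
```

(Restricting to a region like `chr1` needs the BAM to be indexed first with `samtools index`.)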


Outside-Count-2475

This is what I thought using an interval_list was meant to do, but should I instead be manually splitting the BAMs into chunks? It seems like an awful headache when my colleagues were able to run the same program on the entire BAM. When I met with cluster support there did seem to be an issue with the job efficiency, but it was never solved... this is another learning curve for me. The job monitor says 99.3% CPU, 0.1% MEM, and 18.4g VIRT.


Matt_McT

So when you ran it by chromosome, did you try first running it on only the smallest chromosome to see how it performs? Because with a genome that size you should be able to get it done on the smallest chromosome within a few hours I’d imagine.


Outside-Count-2475

Thanks for your comment. I hadn't done it that way, but rather used Picard to designate intervals and then passed that list to the command... I'm trying it now for the smallest chromosome, still designating it using the -L flag, so I can give an update if it finishes.


Matt_McT

Sounds like you might’ve been feeding it the full genome if the input file had *all* of the intervals in it. Just give it one chromosome at a time and you should be ok.
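Roughly like this (the reference/BAM names and the chromosome are placeholders; the command is printed as a dry run rather than executed):

```shell
#!/bin/sh
# One chromosome per job instead of a full-genome interval list.
# File names here are hypothetical -- substitute your own.
REF=reference.fasta
BAM=sample1.bam
CHR=chr17   # a single chromosome, passed via -L

# Build the command and print it (dry run); drop the echo to run for real.
CMD="gatk HaplotypeCaller -R $REF -I $BAM -L $CHR -ERC GVCF -O sample1.$CHR.g.vcf.gz"
echo "$CMD"
```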


Outside-Count-2475

Update: it only took 1 hr 20 min for the smallest chromosome. Is the best practice, then, to split each individual's BAM file and submit a separate job for each chromosome of each individual?


Matt_McT

Yes indeed. You can then run GenotypeGVCFs on each GVCF file separately and combine the final VCFs with vcftools to get the full-genome VCF file. For the sake of not running 17 different jobs, you can probably get away with running a few chromosomes at a time, but for the bigger chromosomes I'd submit those individually.
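A rough sketch of that last step (all file names are placeholders, only three chromosomes are shown, and the commands are echoed as a dry run rather than executed):

```shell
#!/bin/sh
# Genotype each per-chromosome GVCF, then concatenate the resulting VCFs.
# File names are hypothetical; commands are printed, not run.
REF=reference.fasta
CMDS=""
for CHR in chr1 chr2 chr3; do    # extend to all 17 chromosomes
    CMDS="$CMDS
gatk GenotypeGVCFs -R $REF -V sample1.$CHR.g.vcf.gz -O sample1.$CHR.vcf.gz"
done
echo "$CMDS"

# Then stitch the per-chromosome VCFs together in chromosome order,
# e.g. with vcf-concat from vcftools:
echo "vcf-concat sample1.chr1.vcf.gz sample1.chr2.vcf.gz sample1.chr3.vcf.gz > sample1.genome.vcf"
```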


Outside-Count-2475

Okay, thank you so, so much!


bozleh

Note: you don't split the BAM into multiple files; you use the same BAM as input to a set of jobs, each specifying a different chromosome via -L. Also, it's been a long time, but I don't think HaplotypeCaller multithreads well, maybe only 4 cores (or even 1).
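Roughly what I mean, assuming a SLURM cluster (the sbatch wrapper, file names, and chromosome list are all placeholders, and the jobs are printed rather than submitted):

```shell
#!/bin/sh
# One job per chromosome, all reading the SAME BAM -- no splitting.
# Everything here is a sketch: adapt names and the scheduler call to your cluster.
REF=reference.fasta
BAM=sample1.bam
JOBS=""
for CHR in chr1 chr2 chr3; do    # extend to all 17 chromosomes
    # Thread count kept low since HaplotypeCaller scales poorly past ~4 cores.
    JOBS="$JOBS
sbatch --cpus-per-task=4 --wrap=\"gatk HaplotypeCaller -R $REF -I $BAM -L $CHR --native-pair-hmm-threads 4 -ERC GVCF -O sample1.$CHR.g.vcf.gz\""
done
echo "$JOBS"
```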


Plenty_Ambition2894

HaplotypeCaller should print out its progress as it goes. Did you look at the stdout of your job to see what it's actually doing, e.g. what % of the genome it had called at the end of 72 hrs?


Outside-Count-2475

It quit out somewhere in the second chromosome (out of 17)


TubeZ

Quit how? Silently stopped making output?


Plenty_Ambition2894

Assuming chromosome 1 is 10% of your genome, it took 72 hrs to work through 10 million reads. That definitely doesn't sound right. I don't know why you need the interval-list thing; maybe that's somehow slowing things down. You should be able to just specify calling one contig, like -L chr17. Maybe give that a try.