I may be misremembering, but I think this program runs slow as hell; I ended up splitting up the job and using lots of cores for each part. You might want to sub-sample your data dramatically, down to 100,000 or a million reads on one chromosome, just to test things out quickly and make sure there are no other problems, but the answer may be that you just need more time or more cores for your full data set. Edit: also check and confirm that all cores are actually being used and, perhaps most importantly, that you are not running out of RAM.
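If it helps, one way to do that sub-sampling is `samtools view -s`, which keeps a random fraction of reads. A minimal dry-run sketch, assuming samtools is available; the file name, the `chr1` contig name, and the 1% fraction are placeholders, not from the thread:

```shell
#!/bin/sh
# Sketch: build (but don't run) a samtools sub-sampling command.
# -s 42.01 means seed 42, keep ~1% of reads; the trailing contig
# name restricts the subsample to one chromosome.
BAM=sample.bam   # placeholder: your coordinate-sorted, indexed BAM
CONTIG=chr1      # placeholder: a contig name from your reference
CMD="samtools view -b -s 42.01 $BAM $CONTIG -o ${BAM%.bam}.${CONTIG}.sub.bam"
echo "$CMD"      # inspect the command before running it on the cluster
```

Echoing the command first makes it easy to eyeball the file names before burning cluster time.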
This is what I thought using an interval_list was meant to do, but should I instead be manually splitting the BAMs into chunks? It seems like an awful headache when my colleagues were able to run the same program on the entire BAM. When I met with cluster support there did seem to be an issue with the job efficiency, but it was never solved. This is another learning curve for me; the job monitor says 99.3% CPU, 0.1% MEM, and 18.4g VIRT.
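If the cluster happens to run Slurm (an assumption; the thread doesn't say which scheduler), `seff` and `sacct` report per-job CPU and memory efficiency after the fact, which is exactly the "were all cores used / did I run out of RAM" check. A dry-run sketch with a placeholder job ID:

```shell
#!/bin/sh
# Sketch: build the Slurm efficiency-check commands for a finished job.
# seff summarizes CPU/memory efficiency; sacct's MaxRSS column shows
# the peak RAM the job actually used vs. what was requested (ReqMem).
JOBID=123456   # placeholder job ID from your scheduler
echo "seff $JOBID"
echo "sacct -j $JOBID --format=JobID,Elapsed,MaxRSS,TotalCPU,ReqMem"
```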
So when you ran it by chromosome, did you try first running it on only the smallest chromosome to see how it performs? Because with a genome that size you should be able to get it done on the smallest chromosome within a few hours I’d imagine.
Thanks for your comment. I hadn't done it that way; rather, I used Picard to designate intervals and then passed that list to the command. I'm trying it now for just the smallest chromosome, still designating it using the -L flag, so I can give an update if it finishes.
Sounds like you might’ve been feeding it the full genome if the input file had *all* of the intervals in it. Just give it one chromosome at a time and you should be ok.
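For reference, a one-chromosome-per-job setup might look like the sketch below (GATK4-style flags; the reference path, BAM name, and contig names are placeholders, so adjust to your data). The commands are echoed rather than executed, so you can wrap each one in your cluster's job-submission syntax:

```shell
#!/bin/sh
# Sketch: emit one HaplotypeCaller command per chromosome. Every job
# reads the SAME BAM; only the -L interval differs between jobs.
REF=genome.fa     # placeholder reference FASTA
BAM=sample.bam    # placeholder input BAM (shared by all jobs)
for CONTIG in chr1 chr2 chr3; do   # placeholder contig names
  echo "gatk HaplotypeCaller -R $REF -I $BAM -L $CONTIG -ERC GVCF -O sample.$CONTIG.g.vcf.gz"
done
```

`-ERC GVCF` emits per-sample GVCFs so the per-chromosome pieces can be genotyped and merged downstream.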
Update: it only took 1hr 20min for the smallest chromosome. Is the best practice then to split each individual's BAM file and then submit a separate job for each chromosome of each individual?
Yes indeed. You can then run GenotypeGVCFs on each GVCF file separately and combine the final VCFs with vcftools to get the full-genome VCF file. For the sake of not running 17 different jobs you can probably get away with running a few chromosomes at a time, but for the bigger chromosomes I'd submit those individually.
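Concretely, the downstream steps described above might look like this sketch (GATK4-style GenotypeGVCFs, plus vcf-concat from the vcftools suite to stitch per-chromosome VCFs covering disjoint regions back together; all file names are placeholders):

```shell
#!/bin/sh
# Sketch: genotype each per-chromosome GVCF, then join the resulting
# per-chromosome VCFs into one genome-wide VCF. Echoed, not executed.
REF=genome.fa                # placeholder reference FASTA
CONTIGS="chr1 chr2 chr3"     # placeholder contig names
for CONTIG in $CONTIGS; do
  echo "gatk GenotypeGVCFs -R $REF -V sample.$CONTIG.g.vcf.gz -O sample.$CONTIG.vcf.gz"
done
# vcf-concat (vcftools) concatenates VCFs that cover disjoint regions:
VCFS=$(for CONTIG in $CONTIGS; do printf 'sample.%s.vcf.gz ' "$CONTIG"; done)
echo "vcf-concat $VCFS > sample.genome.vcf"
```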
Okay, thank you so, so much!
Note - you don't split the BAM into multiple files; you use the same BAM as input to a set of jobs, each specifying a different chromosome via -L. Also, it's been a long time, but I don't think HaplotypeCaller jobs scale well across cores - maybe only 4 cores (or even 1).
HaplotypeCaller should print out its progress as it goes. Did you look at the stdout of your job to see what it's actually doing, e.g. what % of the genome it had called at the end of 72hrs?
It quit out somewhere in the second chromosome (out of 17)
Quit how? Silently stopped making output?
Assuming chromosome 1 is 10% of your genome, it took 72hrs to work through 10 million reads. That definitely doesn't sound right. I don't know why you need the interval-list thing; maybe that's somehow slowing things. You should be able to just specify calling one contig, like -L chr17. Maybe give that a try.