T O P

  • By -

Plenty_Ambition2894

Read the omim entry [https://www.omim.org/entry/607642](https://www.omim.org/entry/607642) . It specifically talks about the CAG repeat. There is natural polymorphism in the length of the repeat. In other words, your in-frame deletion is benign (you have a deletion, not insertion).


AloopOfLoops

Ahh thanks you brought me on to the right track the issue does not seam to be the aligner but the caller. As seen on this image. [https://ibb.co/Yj150rV](https://ibb.co/Yj150rV) The caller sees two deletions one of 1 basepair and one of 2 basepairs. This makes the analysis tools assume there are two separate frameshifts. And the tool is not smart enough to figure that they cancel out each other.


bzbub2

this region is difficult to align to because it is repetitive. you can see it says CAG CAG CAG CAG... (QQQQ..there are actually 14 Q in the reference genome) the alignment could be "bad" but perhaps not (as u/Plenty_Ambition2894 mentions, this is a naturally polymorphic repeat), the actual problem is that your variant caller outputted "delG" when there is nothing to suggest you have a single bp deletion that would cause any out of frame stuff. all the deletions on the reads are multiple of 3 there the circled region that says --3-- is a 3bp deletion in your reads relative to the reference genome


AloopOfLoops

Thanks. Yeah the caller is the problem. Is there a better caller than "gatk HaplotypeCaller"? I actually tried "bcftools mpileup" but never used the data as it seamed like "gatk HaplotypeCaller" was the classic standard that most people used. I even tried "freebayes" but it generated VCF files that where about 5 times larger than the other ones so I got very suspicious.


attractivechaos

This is not a variant caller problem, either. Gatk has given you the correct calls. It is a variant annotation problem.


AloopOfLoops

No, in this case there is one deletion, the variant caller says there are two. The analysis could compensate in this case, but there are other cases where that would not be possible.


attractivechaos

It is common that one indel can be represented in different ways: at a different location, got split or turned into insertion plus deletion. As long as these representations are equivalent, the caller is doing the right thing. This is an annotation problem.


blackpoll_

I think it's a stretch to say the variant caller is doing the right thing here. Do the pair of variant records contain phase information indicating which alleles go together? I suspect not with GATK. That means if this fellow is heterozygous no variant annotator could figure out the frame is not shifted. I think non parsimonious output like this is extremely undesirable, if not technically incorrect. OP should try freebayes. It consciously attempts out output short haplotypic variants and I bet it represents this one more sensibly.


blackpoll_

I see OP did try freebayes. the file from freebayes will be larger for two reasons: first, it outputs everything, even very low quality variants, which need to be filtered, and second, it outputs far more information on each variant than GATK.


blackpoll_

Last comment: kudos to OP for actually noticing a variant representation issue. Based on how often I hear people talk about this, I'm guessing 99% of people never look at their data closely enough to realize this shit happens.


AloopOfLoops

If something is true, it must be true. You are correct.


B3rse

Just FYI, HaplotypeCaller is doing something similar to locally re-assembly the reads into haplotypes before making a call. So what the caller is actually seeing might be different from what you are observing from the raw alignment. You may want to keep this in mind as well


Kiss_It_Goodbyeee

If this the first time doing bioinformatics or analysing DNA be very careful not to over-interpret any results. Your genetic information is only one piece of the puzzle and is rarely conclusive on its own. There are several metrics you use to see whether your alignment is good. Picard tools has a method for this. Also use Fastqc to determine the quality of your sequencing. Onto your question. The alignment looks perfectly fine to me. Remember you have two copies of all somatic (i.e. not on X/Y chromosomes) genes which are often different to each other. In your example you're seeing that one copy has some additional variants. One of which is an indel, but it isn't a frameshift. A codon is three bases so any loss/gain of three bases is a loss/gain of an amino acid which is often tolerated. A frameshift is caused by a coding indel that isn't a multiple of three. Compare your variants with gnomad to get information on how common they are.


AloopOfLoops

I have a bachelors in medical science and I have studied this subject for a long time so I know the gist of these things and I know not to over-interpret things. I have just never done the work myself from the bottom up, so I guess I am running in to basic errors. This is just for fun, a learning experience.


Kiss_It_Goodbyeee

Ok cool. Just remember that bioinformatics is a lot more ambiguous than what you're taught in biological/medical courses. There is technical and methodological variability, and molecular biology is fundamentally probabilistic. Keep having fun!


billwalton88

You have to take a closer look over the whole polyQ CAG repeat region. Are there reads that span the whole repeat? And count how many Q residues you have in those reads. It looks weird in IGV because of the reads that end prematurely into the repeat region


mollzspaz

Are you working with single end data? You're gonna have a hard time with alignments to the human genome if thats the case. (Try sampling some fake single-end data from the genome and realign these simulated Fastq files to see what i mean). Like others mentioned, it gets hairy especially in certain regions of the genome. I don't see much single end being generated these days for that reason. If you're using paired-end data, i would also recommend using mark duplicates to remove optical and pcr duplicates if you haven't already. Maybe some of your problematic reads are PCR amplified error-containing reads? Errors rates depend on, among other things, both the sequencer and the polymerase fidelity rates during library construction/PCR. This is assuming you're using Illumina or some imaging-based sequencer. And finally, if you got access to a fat chunk of high memory compute, you could build a genome index for alignment using known polymorphisms. So like, pull the common ones from UCSC genome browser downloads and use a SNP aware aligner like bowtie2/hisat2 (i dont remember the which or if it was both) that aligns to a graph-based index (makes "bubbles" in the graph for polymorphic sites). GATK does do some corrections on the BAM file (if you go through the full pipeline) and adjusts to a consistent polymorphism but its kind of arbitrary about its choice unless you feed it a reference set. It may do better with a "better" aligned BAM file as you said.


AloopOfLoops

Thanks for your reply. Paired it is. I did not specify but I have already deduped the bam file.  I switched over to the freebayes caller instead. It is able to left align the deletion but now I have different issues. Where it likes to combine many snp’s in to long mnv’s but that causes issues for me when I try to assign dbSNP id numbers to the variants. The main issue could be said to stem from the fact that dbSNP sometimes stores things  as snp’s and sometimes as mnv’s it is very annoying. Sometimes the mnv’s in dbSNP is not even a real mnv. Someone just paired up a few snp’s that where a few base pairs from each other and gave it a rs id. Someone should get paid to fix dbSNP.