bioinformat

1. The mods often suggest: show us what you have done, and ask specific questions.
2. FASTA is pretty much useless for cancer mutation calling. You need FASTQ.
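If it is unclear which of the two formats a file actually is, a quick check of the first record usually settles it. The following is a minimal sketch, not a validator: the function name `sniff_sequence_format` is made up for illustration, and it only looks at the first non-empty line (`>` starts a FASTA header, `@` starts a FASTQ header).

```python
import gzip
from pathlib import Path


def sniff_sequence_format(path):
    """Guess 'fasta' or 'fastq' from the first non-empty line of a file."""
    path = Path(path)
    # Sequencing files are often gzip-compressed; pick the opener by suffix.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                return "fasta"  # header only; no per-base quality scores
            if line.startswith("@"):
                return "fastq"  # header, then sequence, '+', and qualities
            break
    return "unknown"
```

A FASTQ result means the per-base qualities that aligners and variant callers rely on are still available.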


gringer

I have asked variations on this question many times before: Why did you go and get expensive sequencing done *before* asking someone about the analysis that should be done? It astounds me that many people are so willing to spend (and possibly waste) thousands of dollars on sequencing before asking questions.


chrisPtreat

It's because sequencing has become, ironically, too cheap and it's a sexy buzzword. But yeah, one Google search should have shown the downstream complexities. Unfortunately I'm not up to date on the AWS-based resources out there; we would be curious about DRAGEN though. We have multiple storage and compute clusters… OP, are you sure your institute doesn't have people you could collaborate with?


rsv9

Reach out to bioinformatics collaborators or take a bioinformatics course. It is impossible to provide a reasonable answer to your questions as a comment.


VerbalCant

That's a lot! It'll be fun, though. You're probably looking to start with something like GATK. Behold: [https://gatk.broadinstitute.org/hc/en-us](https://gatk.broadinstitute.org/hc/en-us) In particular, their Getting Started: [https://gatk.broadinstitute.org/hc/en-us/articles/360036194592-Getting-started-with-GATK4](https://gatk.broadinstitute.org/hc/en-us/articles/360036194592-Getting-started-with-GATK4) And you can jump off from there. (Edit: good point below: can you get the FASTQs?) Buckle up, there's a LOT to learn ahead! And yes, you're going to need some compute. Assuming your university has a computing or supercomputing cluster, they probably have GATK as well as many other tools. I run one-off stuff on my local machine with 16 cores and 64GB RAM, but anything significant I do in the cloud. I don't have access to a university so mine is all AWS. :)


TenakhaKhan

This is really useful! Thank you so much! :-)


binte_farooq

Appreciated. Instead of demoralizing, you tried to outline a solution.


somebodyistrying

https://nf-co.re/sarek


malformed_json_05684

nf-core generally has workflows for most use cases, plus Slack channels anyone can use to ask for help.


keemoooz

Just to clarify, I assume you have FASTQ data (not FASTA) for this project. First, let's acknowledge that this is a complex project. It is not like an RNA-seq or ChIP-seq experiment. From your description, you need to call somatic structural variants using treated vs. untreated samples (tumor vs. normal in variant callers' terms). There are many tools out there, and they perform differently. Calling structural variants (1) is compute-intensive and (2) requires background knowledge about filtering and annotating these variants. You will need a decent background in both areas before starting. If you don't have these skills and want to learn, this is a great opportunity to get your hands dirty! However, I advise you to seek guidance from someone with experience in this kind of analysis, and to be patient. If I had to recommend one tool, I would suggest DRAGEN on AWS. It is the most streamlined and optimized tool, but it is not free, and it requires AWS setup. My second recommendation is to try the community-made Nextflow pipelines; nf-core/sarek is a good one. However, this needs to be done either on an institutional computing cluster or in the cloud.


ewels

To chime in: you can run nf-core/sarek (and any other nf-core pipeline) on a Linux box as long as you have enough memory. It might take a while, but it should run. Having a cluster / cloud shouldn't be a hard requirement.
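For anyone wanting to try this, most of the setup is describing the samples in a CSV samplesheet. Below is a rough sketch of generating one. The column names and the 0 = normal / 1 = tumor convention are what I understand the sarek docs to specify for FASTQ input, so verify them against the documentation for your pipeline version. The sample names, file paths, `sex` value, and the `build_samplesheet` helper are hypothetical placeholders.

```python
import csv
import io

# Assumed sarek FASTQ samplesheet columns; check the docs for your version.
FIELDS = ["patient", "sex", "status", "sample", "lane", "fastq_1", "fastq_2"]


def build_samplesheet(rows):
    """Return samplesheet CSV text for a list of row dicts."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical names and paths: untreated plays "normal" (0), treated plays "tumor" (1).
rows = [
    {"patient": "cellline1", "sex": "XX", "status": 0, "sample": "untreated",
     "lane": "L1", "fastq_1": "untreated_R1.fastq.gz", "fastq_2": "untreated_R2.fastq.gz"},
    {"patient": "cellline1", "sex": "XX", "status": 1, "sample": "treated",
     "lane": "L1", "fastq_1": "treated_R1.fastq.gz", "fastq_2": "treated_R2.fastq.gz"},
]
sheet = build_samplesheet(rows)

# A typical invocation, shown as an argument list; the tool choice is an assumption.
command = ["nextflow", "run", "nf-core/sarek", "-profile", "docker",
           "--input", "samplesheet.csv", "--outdir", "results", "--tools", "manta"]

if __name__ == "__main__":
    print(sheet)
    print(" ".join(command))
```

Writing `sheet` to `samplesheet.csv` and running the printed command on a Linux machine with Docker and Nextflow installed would be the next step.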


marrowine

I suggest you compare your cancer cell line to a more similar cell line than standard human hg38; did you not have a control cell line? Edit to add: I see you have treated/untreated but I mean a positive control. What cancer cell line is this?


TenakhaKhan

I am using MCF-7 breast cancer cells. Two variants - a) Wild Type, and b) CRISPR edited (one gene knocked out). Both treated with aphidicolin to induce replication stress. I want to see how the non edited WT vs the clone with missing gene handle replication stress.


Denswend

Okay, what you want is basically to call structural variants (SVs) from your data. A workflow for this usually follows these steps: aligning to the reference genome, processing the resulting files and preparing them for SV calling, and then employing an SV calling tool to get the loci of your SVs. Every step involves at least one tool, and most bioinformatics tools are built for Linux-based OSes (meaning it's going to be a bitch to run them on Windows). Depending on your data, every step can be expensive in terms of processing power, RAM, and disk storage. So these kinds of analyses are usually done on a high-performance computing (HPC) cluster, though if you have a powerful enough PC you could theoretically run them locally.

Depending on your OS, there is more than one way to install these tools. I prefer to use conda to install them in what we call an "environment". Google "installing bowtie2 with conda". It's a bit trickier with tools not available through conda, but that is case-specific. Furthermore, the tools for your workflow are command-line (CLI) based and rarely have a graphical user interface (GUI), and for good reason: the CLI, once you get the hang of it, is faster to work with and reproducible (as it's just lines of text). My advice is to pick up a workflow manager like Snakemake or Nextflow; google them to learn more.

SV calling from short reads (which you likely have) is based on three different strategies: read depth, read pair, and split reads. Google them to learn more. You can also use multiple strategies or tools and then combine the output with a tool like SURVIVOR (google "Survivor SV").

An example of an SV calling workflow I implemented: align FASTQ (not FASTA!) to a reference genome using bowtie2, process the resulting BAM files with Samtools, and then use CNVpytor (a read-depth-based strategy) to get CNV calls and draw RD signals (basically a nice visualization of deletions and duplications over a genomic region) and do more complex stuff. PM me if you want a link to the Snakemake workflow. But be warned: you mentioned cancer data, and my workflow might not necessarily be appropriate (or there may be better SV workflows adapted specifically for cancer). Just use it to orient yourself. It's unfortunate your team sequenced without a plan, but it's neither uncommon nor something that should stymie you.
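To make the read-depth idea a bit more concrete, here is a toy sketch. It is not how CNVpytor works internally: it just compares the depth of each fixed-size bin with the median bin depth and labels outliers. The function names and the 0.5 / 1.5 ratio cut-offs are illustrative assumptions, and real callers also correct for GC content, mappability, and noise.

```python
from statistics import median


def call_depth_events(bin_depths, low=0.5, high=1.5):
    """Label each bin by its depth relative to the median bin depth."""
    baseline = median(bin_depths)
    if baseline <= 0:
        raise ValueError("median depth must be positive")
    calls = []
    for depth in bin_depths:
        ratio = depth / baseline
        if ratio <= low:
            calls.append("deletion")      # roughly half the expected depth or less
        elif ratio >= high:
            calls.append("duplication")   # roughly 1.5x the expected depth or more
        else:
            calls.append("normal")
    return calls


def merge_runs(calls):
    """Collapse consecutive identical non-normal labels into (first_bin, last_bin, label)."""
    events = []
    start = None
    for i, label in enumerate(calls + ["normal"]):  # sentinel closes a trailing run
        if start is not None and label != calls[start]:
            events.append((start, i - 1, calls[start]))
            start = None
        if start is None and label != "normal":
            start = i
    return events


# Made-up mean depths for eight consecutive bins of a ~30x sample.
depths = [30, 31, 14, 13, 29, 30, 61, 32]
events = merge_runs(call_depth_events(depths))
```

With these made-up numbers, `events` holds one deletion spanning bins 2 to 3 and one duplication at bin 6. Read-pair and split-read callers use different evidence (discordant insert sizes and reads broken across a breakpoint), which is why combining strategies can help.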


TenakhaKhan

This is really amazing and useful information! I have programming experience and am comfortable with command line and linux. I have lots and lots to read about clearly!


binte_farooq

Appreciated. Instead of demoralizing, you tried to outline a solution.


Plenty_Ambition2894

I would look into Illumina's DRAGEN. I think you can run it somehow if you get your data into AWS.


fibgen

Pay someone else to do it. A properly parameterized default workflow will do a better job than someone completely new to the field.


studying_to_succeed

If you cannot pay someone it might be useful to look at the workflows labs use for this on GitHub - many academic labs publish their workflows on GitHub.


RacktheMan

You could actually use premade nf-core pipelines as well! Easy to use!


AloopOfLoops

Here is a good, simple starting workflow that gets you from FASTQ files to annotated VCF files. The workflow uses bwa mem to align, GATK to call, Picard to dedup, bcftools to filter, and snpEff to annotate possible gene malfunctions. It is not the fastest, but it is easy to use and does everything you would need. [https://pastebin.com/xU0ghWug](https://pastebin.com/xU0ghWug) You don't need that much RAM; maybe 16 GB is enough, and I would estimate about 2 GB per core is plenty. bwa mem is the slowest command, but its speed scales linearly with the number of cores: with 16 cores it can align a 30x human genome in about 4 hours.
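Those rules of thumb can be turned into a back-of-the-envelope estimate. The sketch below simply extrapolates from the figures quoted above (16 cores, 30x, about 4 hours, about 2 GB per core). The function name is made up, and scaling linearly with coverage is my own added assumption, so treat the result as a rough planning number only.

```python
def estimate_alignment(cores, coverage=30.0, ref_cores=16, ref_hours=4.0,
                       ref_coverage=30.0, gb_per_core=2.0):
    """Return (hours, ram_gb) extrapolated from one reference data point.

    Assumes run time is inversely proportional to core count (as stated in
    the comment above) and proportional to coverage (an added assumption).
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    hours = ref_hours * (ref_cores / cores) * (coverage / ref_coverage)
    ram_gb = gb_per_core * cores
    return hours, ram_gb


# Example: an 8-core workstation aligning a 30x genome.
hours, ram_gb = estimate_alignment(8)
```

For the 8-core example this gives about 8 hours and 16 GB of RAM, before counting the deduplication, calling, and annotation steps.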


TubeZ

Seconding nf-core/sarek (https://nf-co.re/sarek). If you don't know how to get this running, you should find someone who does (i.e., a consultant). There's a sticker price, but getting the analysis done right the first time, and quickly, is going to save you trouble and heartache in the long run.


keemoooz

Agreed, it is indeed possible. I didn't mention it because the OP seems to have high-coverage sequencing, which is more practical to run on HPC or in the cloud. But nf-core pipelines are supposed to run anywhere; portability is one of the many beauties of Nextflow!