Snakemake, RNA-seq : How can I execute one subpart of a pipeline or another subpart based on the characteristics of the sample that is analysed?

Question

I am using Snakemake to build an RNA-seq analysis pipeline. The pipeline works, but I want to make it as adaptable as possible: it should handle single-end (SE) and paired-end (PE) data within the same run, instead of analysing SE data in one run and PE data in another.

My pipeline is supposed to look like this:

dataset download that gives 1 file (SE data) or 2 files (PE data) -->
set of rules A specific to 1 file OR set of rules B specific to 2 files -->
rule that takes 1 or 2 input files and merges it/them into a single output -->
final set of rules.

Note: all rules of A have 1 input and 1 output, all rules of B have 2 inputs and 2 outputs, and their respective commands look like:

1 input : somecommand -i {input} -o {output}
2 inputs : somecommand -i1 {input1} -i2 {input2} -o1 {output1} -o2 {output2}

Note 2: apart from their inputs and outputs, all rules of sets A and B use the same commands, parameters, etc.

In other words, I want my pipeline to switch between rule set A and rule set B depending on the sample, either by giving it information on the sample in a config file at the start (sample 1 is SE, sample 2 is PE... this is known beforehand) or by asking Snakemake to count the number of files after the dataset download and choose the proper next set of rules for each sample. If you see another way to do that, you're welcome to tell me about it.

I thought about using checkpoints, input functions and if/else statements, but I haven't managed to solve my problem with these.

Do you have any hints, advice or ways to make that "switch" happen?

Answer 1:

If you know the layout beforehand, the easiest way is to store it in a variable, something like this (alternatively, you can read it from a config file into a dictionary):

layouts = {"sample1": "paired", "sample2": "single"}  # ... etc. for the remaining samples
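
As a minimal sketch of the config-file route: the snippet below assumes a hypothetical "layouts" section in the config and a hypothetical helper load_layouts; neither comes from the original pipeline. In a Snakefile the config dictionary would be filled by a configfile: directive; here it is parsed from a JSON string so the example runs on its own.

```python
import json

# Layout names the rest of the workflow knows how to handle.
ALLOWED_LAYOUTS = {"single", "paired"}


def load_layouts(config):
    """Return {sample: layout} from a config dict, rejecting unknown layouts."""
    layouts = dict(config["layouts"])  # "layouts" is an assumed config key
    for sample, layout in layouts.items():
        if layout not in ALLOWED_LAYOUTS:
            raise ValueError(f"{sample}: unknown layout {layout!r}")
    return layouts


# Stand-in for the dictionary Snakemake builds from a config file.
config = json.loads('{"layouts": {"sample1": "paired", "sample2": "single"}}')
layouts = load_layouts(config)
```

Validating the values up front means a typo such as "pair" fails immediately, rather than later as a missing-input error.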

What you can then do is "merge" your rule like this (I am guessing you are talking about trimming and alignment, so that's my example):

# Prefer the paired-end rule when a file name matches both rules.
ruleorder: B > A

# Single-end: one input, one output.
rule A:
    input:
        "{sample}.fastq.gz"
    output:
        "trimmed_{sample}.fastq.gz"
    shell:
        "somecommand -i {input} -o {output}"

# Paired-end: two inputs, two outputs.
rule B:
    input:
        input1="{sample}_R1.fastq.gz",
        input2="{sample}_R2.fastq.gz"
    output:
        output1="trimmed_{sample}_R1.fastq.gz",
        output2="trimmed_{sample}_R2.fastq.gz"
    shell:
        "somecommand -i1 {input.input1} -i2 {input.input2} -o1 {output.output1} -o2 {output.output2}"


def get_fastqs(wildcards):
    # Return the trimmed fastq file(s) for this sample, keyed by input name.
    output = dict()
    if layouts[wildcards.sample] == "single":
        output["input"] = f"trimmed_{wildcards.sample}.fastq.gz"
    elif layouts[wildcards.sample] == "paired":
        output["input1"] = f"trimmed_{wildcards.sample}_R1.fastq.gz"
        output["input2"] = f"trimmed_{wildcards.sample}_R2.fastq.gz"
    return output


rule alignment:
    input:
        unpack(get_fastqs)
    output:
        "somepath/{sample}.bam"
    shell:
        ...

There is a lot of stuff going on here.

First of all, you need a ruleorder so Snakemake knows how to handle ambiguous cases.
Rules A and B both have to exist (unless you do something hacky with the output files).
The alignment rule needs an input function to determine which inputs it requires.
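
To see what the input function hands to the alignment rule, it can be called outside Snakemake. This is only a sketch: SimpleNamespace stands in for the wildcards object Snakemake would pass, the layouts dictionary repeats the example values from above, and the file names are built from the sample wildcard.

```python
from types import SimpleNamespace

# Example layouts, as in the answer above.
layouts = {"sample1": "paired", "sample2": "single"}


def get_fastqs(wildcards):
    """Return the trimmed fastq file(s) of a sample, keyed by input name."""
    sample = wildcards.sample
    if layouts[sample] == "single":
        return {"input": f"trimmed_{sample}.fastq.gz"}
    if layouts[sample] == "paired":
        return {
            "input1": f"trimmed_{sample}_R1.fastq.gz",
            "input2": f"trimmed_{sample}_R2.fastq.gz",
        }
    raise ValueError(f"unknown layout for {sample}")


# SimpleNamespace is a stand-in for Snakemake's wildcards object.
single = get_fastqs(SimpleNamespace(sample="sample2"))
paired = get_fastqs(SimpleNamespace(sample="sample1"))
```

Inside the rule, unpack() turns the returned dictionary into named inputs, so the paired case is available as input.input1 and input.input2.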

Some self-promotion: I made a snakemake pipeline which does many things, including RNA-seq and downloading of samples online and automatically determining their layout (single-end vs paired-end). Please take a look and see if it solves your problem: https://vanheeringen-lab.github.io/seq2science/content/workflows/rna_seq.html

EDIT:

When you say "merging" rules, do you mean rule A, B and alignment?

That was unclear wording on my part. By merging I meant "merging the single-end and paired-end logic together", so you can continue with a single rule (e.g. count table, you name it).

Rule order: why did you choose B > A? To make sure that paired samples don't end up running in the single-end rules?

Exactly! When a rule needs trimmed_sample1_R1.fastq.gz, how would Snakemake know the name of your sample? Is the name of the sample sample1, or is it sample1_R1? It can be either, so Snakemake complains that it does not know how to resolve this. When you add a ruleorder, you tell Snakemake to prefer the rules in that order whenever the choice is ambiguous.
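
The ambiguity can be reproduced with plain regular expressions. The sketch below assumes that a wildcard matches any non-empty string (.+), which is Snakemake's default when no wildcard constraint is set; the pattern names are made up for illustration.

```python
import re

# Output patterns of rule A and rule B (first output), with {sample}
# written as a named group that matches any non-empty string.
PATTERN_A = re.compile(r"trimmed_(?P<sample>.+)\.fastq\.gz")
PATTERN_B = re.compile(r"trimmed_(?P<sample>.+)_R1\.fastq\.gz")

target = "trimmed_sample1_R1.fastq.gz"

# Both patterns match the same file name, with different sample names.
sample_a = PATTERN_A.fullmatch(target).group("sample")
sample_b = PATTERN_B.fullmatch(target).group("sample")

print(sample_a)  # sample1_R1
print(sample_b)  # sample1
```

Both rules can claim the file, so without ruleorder: B > A Snakemake has no basis for choosing between them.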

The command in the alignment rule needs 1 or 2 inputs. I intend to use an if/else in params directive to choose the inputs. Am I correct to think that? (I think you did that as well in your pipeline)

Yes, that's the way we solved it. We did it that way because we want every rule to have its own environment. If you do not use a separate conda environment for alignment, then you can do it cleaner/prettier, like so:

rule alignment:
    input:
        unpack(get_fastqs)
    output:
        "somepath/{sample}.bam"
    run:
        # Pick the command that matches the layout of this sample.
        if layouts[wildcards.sample] == "single":
            shell("single-end command")
        elif layouts[wildcards.sample] == "paired":
            shell("paired-end command")

I feel like this option is much clearer than what we did in the seq2science pipeline. However, in seq2science we support many different aligners, each with its own conda environment, so the run directive cannot be used.
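
If the run directive is not an option, the if/else can live in a params function instead, as discussed above. A rough sketch, using the placeholder flags from the question (-i, -i1, -i2); input_args is a hypothetical helper, not code from seq2science.

```python
# Example layouts, as in the answer above.
layouts = {"sample1": "paired", "sample2": "single"}


def input_args(sample, fastqs):
    """Build the input part of the command line for one sample."""
    if layouts[sample] == "single":
        (fastq,) = fastqs  # exactly one file expected
        return f"-i {fastq}"
    if layouts[sample] == "paired":
        read1, read2 = fastqs  # exactly two files expected
        return f"-i1 {read1} -i2 {read2}"
    raise ValueError(f"unknown layout for {sample}")


single_args = input_args("sample2", ["trimmed_sample2.fastq.gz"])
paired_args = input_args(
    "sample1", ["trimmed_sample1_R1.fastq.gz", "trimmed_sample1_R2.fastq.gz"]
)
```

In a Snakefile this could be wired up as params: args=lambda wildcards, input: input_args(wildcards.sample, input) and used in the shell command as {params.args}, which keeps the rule compatible with a per-rule conda environment.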

Source: https://stackoverflow.com/questions/63323670/snakemake-rna-seq-how-can-i-execute-one-subpart-of-a-pipeline-or-another-subp

Tags

snakemake

rna-seq