snakemake, how to build a loop for two independent parameter

问题

I want to loop snakemake above two different wildcards, which - I think - are somehow independent from each other.

In case that there is already a solved threat for this case, I would be happy for a hint. But so far I'm not sure what the correct terms are to look for what I want to do.

Let's assume my pipeline has three steps. I have a set of samples which I process in each of those three steps. Put in the second step I deploy an extra parameter to every sample. In the third step now I have to iterate through the samples and the associated parameter. Because of this structure, I think it's not possible to solve this with a dictionary structure.

To picture the case I made this simplification of my files and rules: The config file:

config.yaml

samples:
- a
- b
- c
- d
threshhold:
- 0.5
- 0.1
- 0.2

A scheme of my snakefile. It pictures the exact structure and naming of snakemake operations. At least the naming is simplified. (I added the tools I actually use in brackets but I think that's not essential for understanding.)

rule all: 
    input: 
      expand("{sample}.bam", sample=config["samples"]),
      expand("{sample}_{param}.bed", sample=config["samples"], param=config["threshhold"])

rule first: # (samtools view)
    input: 
       "{sample}.sam"
    output:
       "{sample}.bam"
    shell:
       "<somecommand> {input} {output}"

rule second: # ( macs2 callpeaks; Of course, there are multiple outputs but they only vary in their suffix))
    input:
       "{sample}.bam"
    output:
       "{sample}_{param}.bed"
    params:
       out_name="{sample}",
       threshhold="{param}"
    shell:
       "<somecommand> {input} -n {params.names} -q {params.threshhold}"

So now I have a list of files looking like this:

a_0.5.bed
a_0.1.bed
a_0.2.bed
b_0.5.bed
b_0.1.bed
b_0.2.bed
...

In my third rule I want to do intersections of different samples with the same param. Like: a_0.5.bed x b_0.5.bed and c_0.5.bed x d_0.5.bed and get the output like ab_0.5.bed, ab_0.1.bed, cd_0.5.bed...

My first attempt was this:

rule all:
    input:
        expand("ab_{param}.bed", param=config["threshhold"])

rule intersect_2: # (bedtools intersect)
    input:
        a=expand("{sample_a}_{param}_peaks.narrowPeak", sample_a=config["samples"][0], param=config["threshhold"]),
        b=expand("{sample_b}_{param}_peaks.narrowPeak", sample_b=config["samples"][1], param=config["threshhold"])
    output:
        ab="intersect/ab_{param}.bed"
    params:
        threshhold="{param}"
    shell:
        "bedtools intersect -u -a {input.a} -b {input.b} > {output.ab}"

Well this doesn't work because now the input are all the different parameters files at once.

I think I need more and a different loop structure here. Maybe even some extra python loops around the rule or something? But since I have no programming experience at all and I'm just getting started to go into those things step by step, I couldn't figure out by now where to start or which loop is required for this.

Summary: With the given config file I'd like to archive a folder wich is filled with different combinations of the samples with the same parameter. So to end up with a list which looks like this:

ab_0.5.bed
ba_0.5.bed
cb_0.5.bed
ca_0.5.bed
abc_0.5.bed
bca_0.5.bed
cba_0.5.bed

And those combinations too, for all the other parameters.

I would really appreciate any help and every hint that helps to understand what I exactly want to do there and how I can build this.

EDIT: Would maybe a completely restructured config file would help? Where the samples are already pre-combined? Maybe like this: (let's assume that s1, s2 and so on are standing for the real (and long) sample name)

config.yaml
samples_combinations:
- s1 : s2
- s3 : s2
- s3 : s1

I still have to rename it... But I don't really like the idea. My aim is to build something which is easily applicable and straight forward without as much manual refinement, Especially because you can have way more than just three samples in this case which have to be combined in multiple ways.

回答1:

I think you almost got it. You can achieve what you want with simple wildcards. Remember that expand is used to create lists. Keep it simple:

rule all:
    input:
        expand("intersect/ab_{param}.bed", param=config["threshhold"])

rule intersect_2: # (bedtools intersect)
    input:
        a="{sample1}_{param}_peaks.narrowPeak",
        b="{sample2}_{param}_peaks.narrowPeak"
    output:
        ab="intersect/{sample1}{sample2}_{param}.bed"
    shell:
        "bedtools intersect -u -a {input.a} -b {input.b} > {output.ab}"

Just some notes about this:

It is confusing to use wildcard names (or input/output names) that are also real names of your samples.
It is quite dangerous to put two or more wildcards in a row without separators

All put together, I would write something like this:

import itertools

sampleCombinations = [s[0]+"-"+s[1] for s in list(itertools.combinations(config["samples"],2))]

rule all:
    input:
        expand("intersect/{pairOfSamples}_{param}.bed", pairOfSamples=sampleCombinations, param=config["threshhold"])

rule intersect_2: # (bedtools intersect)
    input:
        s1="{sample1}_{param}_peaks.narrowPeak",
        s2="{sample2}_{param}_peaks.narrowPeak"
    output:
        "intersect/{sample1}-{sample2}_{param}.bed"
    shell:
        "bedtools intersect -u -a {input.s1} -b {input.s2} > {output}"

itertools.combinations will make pairs out of your samples defined in config.

The sampleCombinations list will hold all possible pairs of samples separated by a hyphen (ie: "a-b", "b-c", "c-d" etc...). This might break if your sample names contain hyphens, as snakemake will not be able to reconstruct the wildcards well ({sample1}-{sample2}). It that case, choose another separator.

EDIT following OP's comments:

Sorry in my rule all, I had forgotten to put the intersect folder output.
I also forgot the _{param}_ par in the input files

Let's say you have a config file like this:

samples:
- name: very_long_name_a
  shortName: a
- name: very_long_name_b
  shortName: b
- name: very_long_name_c
  shortName: c
threshhold:
- 0.5
- 0.1
- 0.2

then, you can use something like this:

import itertools

configfile: "config.yaml"

longNamesList = [s["name"] for s in config["samples"]]
shortNamesList = [s["shortName"] for s in config["samples"]]
shortToLong = {s["shortName"]:s["name"] for s in config["samples"]}

sampleCombinations = [s[0]+"-"+s[1] for s in list(itertools.combinations(shortNamesList,2))]

rule all:
        input: expand("intersect/{pairOfSamples}_{param}.bed", pairOfSamples=sampleCombinations, param=config["threshhold"])


rule intersect_2: # (bedtools intersect)
    input:
        s1=lambda wildcards: shortToLong[wildcards.sample1]+"_"+wildcards.param+"_peaks.narrowPeak",
        s2=lambda wildcards: shortToLong[wildcards.sample2]+"_"+wildcards.param+"_peaks.narrowPeak"
    output:
        "intersect/{sample1}-{sample2}_{param}.bed"
    shell:
        "bedtools intersect -u -a {input.s1} -b {input.s2} > {output}"

来源：https://stackoverflow.com/questions/56491104/snakemake-how-to-build-a-loop-for-two-independent-parameter

标签

python

loops

snakemake