问题
How to use a list in Snakemake tabular config.
I use Snakemake Tabular (mapping with BWA mem) configuration to describe my sequencing units (libraries sequenced on separate lines). At the next stage of analysis I have to merge sequencing units (mapped .bed files) and take merged .bam files (one for each sample). Now I'm using YAML config for describing of what units belong to what samples. But I wish to use Tabular config for this purpose,
I'm not clear how to write and recall a list (containing sample information) from cell of Tab separated file.
This is how my Tablular config for units looks like:
Unit SampleSM LineID PlatformPL LibraryLB RawFileR1 RawFileR2
sample_001.lane_L1 sample_001 lane_L1 ILLUMINA sample_001 /user/data/sample_001.lane_L1.R1.fastq.gz /user/data/sample_001.lane_L1.R2.fastq.gz
sample_001.lane_L2 sample_001 lane_L2 ILLUMINA sample_001 /user/data/sample_001.lane_L2.R1.fastq.gz /user/data/sample_001.lane_L2.R2.fastq.gz
sample_001.lane_L8 sample_001 lane_L8 ILLUMINA sample_001 /user/data/sample_001.lane_L8.R1.fastq.gz /user/data/sample_001.lane_L8.R2.fastq.gz
sample_002.lane_L1 sample_002 lane_L1 ILLUMINA sample_002 /user/data/sample_002.lane_L1.R1.fastq.gz /user/data/sample_002.lane_L1.R2.fastq.gz
sample_002.lane_L2 sample_002 lane_L2 ILLUMINA sample_002 /user/data/sample_002.lane_L2.R1.fastq.gz /user/data/sample_002.lane_L2.R2.fastq.gz
This is how my YAML config for Samples looks like:
samples:
"sample_001": ["sample_001.lane_L1", "sample_001.lane_L2", "sample_001.lane_L8"]
"sample_002": ["sample_002.lane_L1", "sample_002.lane_L2"]
My Snakemake code:
import pandas as pd
import os
workdir: "/user/data/snakemake/"
configfile: "Samples.yaml"
units_table = pd.read_table("Units.tsv").set_index("Unit", drop=False)
rule all:
input:
expand('map_folder/{unit}.bam', unit=units_table.Unit),
expand('merge_bam_folder/{sample}.bam', sample=config["samples"]),
rule map_paired_end:
input:
r1 = lambda wildcards: expand(units_table.RawFileR1[wildcards.unit]),
r2 = lambda wildcards: expand(units_table.RawFileR2[wildcards.unit])
output:
bam = 'map_folder/{unit}.bam'
params:
bai = 'map_folder/{unit}.bam.bai',
ref='/user/data/human_g1k_v37.fasta.gz',
SampleSM = lambda wildcards: units_table.SampleSM[wildcards.unit],
LineID = lambda wildcards: units_table.LineID[wildcards.unit],
PlatformPL = lambda wildcards: units_table.PlatformPL[wildcards.unit],
LibraryLB = lambda wildcards: units_table.LibraryLB[wildcards.unit]
threads:
16
shell:
r"""
seqtk mergepe {input.r1} {input.r2}\
| bwa mem -M -t {threads} -v 3 \
{params.ref} - \
-R "@RG\tID:{params.LineID}\tSM:{params.SampleSM}\tPL:{params.PlatformPL}\tLB:{params.LibraryLB}"\
| samtools view -u -Sb - \
| samtools sort - -m 4G -o {output.bam}
samtools index {output.bam}
"""
rule samtools_merge_bam:
input:
lambda wildcards: expand('map_folder/{file}.bam', file=config['samples'][wildcards.sample])
output:
bam = 'merge_bam_folder/{sample}.bam'
threads:
1
shell:
r"""
samtools merge {output.bam} {input}
samtools index {output.bam}
"""
回答1:
What about this below?
I have excluded the Samples.yaml as I think it is not necessary given your sample sheet.
In rule samtools_merge_bam
you collect all unit-bam files sharing the same SampleSM. These unit-bam files are created in map_paired_end
where the lambda expression collects the fastq files for each unit.
Note also that I have removed the unit-bam files from the all rule as (I think) these are just intermediate files and they could be marked as temporary using the temp() flag.
import pandas as pd
import os
workdir: "/output/dir"
units_table = pd.read_table("Units.tsv")
samples= list(units_table.SampleSM.unique())
rule all:
input:
expand('merge_bam_folder/{sample}.bam', sample= samples),
rule map_paired_end:
input:
r1 = lambda wildcards: units_table.RawFileR1[units_table.Unit == wildcards.unit],
r2 = lambda wildcards: units_table.RawFileR2[units_table.Unit == wildcards.unit],
output:
bam = 'map_folder/{unit}.bam'
params:
bai = 'map_folder/{unit}.bam.bai',
ref='/user/data/human_g1k_v37.fasta.gz',
SampleSM = lambda wildcards: list(units_table.SampleSM[units_table.Unit == wildcards.unit]),
LineID = lambda wildcards: list(units_table.LineID[units_table.Unit == wildcards.unit]),
PlatformPL = lambda wildcards: list(units_table.PlatformPL[units_table.Unit == wildcards.unit]),
LibraryLB = lambda wildcards: list(units_table.LibraryLB[units_table.Unit == wildcards.unit]),
threads:
16
shell:
r"""
seqtk mergepe {input.r1} {input.r2}\
| bwa mem -M -t {threads} -v 3 \
{params.ref} - \
-R "@RG\tID:{params.LineID}\tSM:{params.SampleSM}\tPL:{params.PlatformPL}\tLB:{params.LibraryLB}"\
| samtools view -u -Sb - \
| samtools sort - -m 4G -o {output.bam}
samtools index {output.bam}
"""
rule samtools_merge_bam:
input:
lambda wildcards: expand('map_folder/{unit}.bam',
unit= units_table.Unit[units_table.SampleSM == wildcards.sample])
output:
bam = 'merge_bam_folder/{sample}.bam'
threads:
1
shell:
r"""
samtools merge {output.bam} {input}
samtools index {output.bam}
"""
回答2:
please take a look at one of the best practice workflows under https:://github.com/snakemake-workflows.
For example, the dna-seq-gatk-variant-calling workflow defines two tsv files, one for units and one for samples. This allows you to (a) add more attributes to samples, and (b) have multiple units per sample.
来源:https://stackoverflow.com/questions/54400171/how-to-use-list-in-snakemake-tabular-configuration-for-describing-of-sequencing