问题
To speed up a certain snakemake step I would like to:
- split my bamfile per chromosome using
bamtools split -in sample.bam --reference
this results in files named assample.REF_{chromosome}.bam
- perform variant calling on each resulting in e.g.
sample.REF_{chromosome}.vcf
- recombine the obtained vcf files using vcf-concat (VCFtools) using
vcf-concat file1.vcf file2.vcf file3.vcf > sample.vcf
The problem is that I don't know a priori which chromosomes may be in my bam file. So I cannot specify accurately the output of bamtools split
. Furthermore, I'm not sure how to make the input of vcf-concat
to take all vcf files.
I thought of using a samples.fofn and do something like
rule split_bam:
input:
bam = "alignment/{sample}.bam",
pattern = "alignment/{sample}.REF_"
output:
alignment/anon.splitbams.fofn
log:
"logs/bamtools_split/{sample}.log"
shell:
"bamtools split -in {input.bam} -reference && \
ls alignment/{input.pattern}*.bam | sed 's/.bam/.vcf/' > {output}"
And use the same fofn
for concatenating the obtained vcf files. But this feels like a very awkward hack and I'd appreciate your suggestions.
EDIT 20180409
As suggested by @jeeyem I tried the dynamic()
functions, but I can't figure it out.
My complete snakefile is on GitHub, the dynamic part is at lines 99-133.
The error I get is:
InputFunctionException in line 44 of /home/wdecoster/DR34/SV-nanopore.smk:
KeyError: 'anon___snakemake_dynamic'
Wildcards:
sample=anon___snakemake_dynamic
(with anon
an anonymized {sample} identifier)
Running with --debug-dag gives (last parts before erroring):
candidate job cat_vcfs
wildcards: sample=anon
candidate job nanosv
wildcards: sample=anon___snakemake_dynamic, chromosome=_
candidate job samtools_index
wildcards: aligner=split_ngmlr, sample=anon___snakemake_dynamic.REF__
candidate job split_bam
wildcards: sample=anon___snakemake_dynamic, chromosome=_
InputFunctionException in line 44 of /home/wdecoster/DR34/SV-nanopore.smk:
KeyError: 'anon___snakemake_dynamic'
Wildcards:
sample=anon___snakemake_dynamic
Which shows that the wildcard is misinterpreted?
Cheers, Wouter
回答1:
You can lookup the chromosome names from the bam header, or a corresponding .fai
file for the used reference. This can be done at the beginning of your Snakefile. Then, you can use expand("alignment/{{sample}}.REF_{chromosome}.bam", chromosome=chromosomes)
to define the output files of that rule. No need to use dynamic.
来源:https://stackoverflow.com/questions/49574969/snakemake-unknown-output-input-files-after-splitting-by-chromosome