问题
I tried to create a snakemake file to run sortmeRNA pipeline:
SAMPLES = ['test']
READS=["R1", "R2"]
rule all:
input: expand("Clean/4.Unmerge/{exp}.non_rRNA_{read}.fastq", exp = SAMPLES, read = READS)
rule unzip:
input:
fq = "trimmed/{exp}.{read}.trimd.fastq.gz"
output:
ofq = "Clean/1.Unzipped/{exp}.{read}.trimd.fastq"
shell: "gzip -dkc < {input.fq} > {output.ofq}"
rule merge_paired:
input:
read1 = "Clean/1.Unzipped/{exp}.R1.trimd.fastq",
read2 = "Clean/1.Unzipped/{exp}.R2.trimd.fastq"
output:
il = "Clean/2.interleaved/{exp}.il.trimd.fastq"
shell: "merge-paired-reads.sh {input.read1} {input.read2} {output.il}"
rule sortmeRNA:
input:
ilfq = "Clean/2.interleaved/{exp}.il.trimd.fastq"
output:
reads_rRNA = "Clean/3.sorted/{exp}_reads_rRNA",
non_rRNA = "Clean/3.sorted/{exp}_reads_nonRNA"
params:
silvabac = "rRNA_databases/silva-bac-16s-id90.fasta,index/silva-bac-16s-db:rRNA_databases/silva-bac-23s-id98.fasta,index/silva-bac-23s-db",
silvaarc = "rRNA_databases/silva-arc-16s-id95.fasta,index/silva-arc-16s-db:rRNA_databases/silva-arc-23s-id98.fasta,index/silva-arc-23s-db",
silvaeuk = "rRNA_databases/silva-euk-18s-id95.fasta,index/silva-euk-18s-db:rRNA_databases/silva-euk-28s-id98.fasta,index/silva-euk-28s-db",
rfam = "rRNA_databases/rfam-5s-database-id98.fasta,index/rfam-5s-db:rRNA_databases/rfam-5.8s-database-id98.fasta,index/rfam-5.8s-db",
acc = "--num_alignments 1 --fastx --log -a 20 -m 64000 --paired_in -v"
log:
"Clean/sortmeRNAlogs/{exp}_sortmeRNA.log"
shell:'''
sortmerna --ref {params.silvabac}:{params.silvaarc}:{params.silvaeuk}:{params.rfam} --reads {input.ilfq} --aligned {output.reads_rRNA} --other {output.non_rRNA} {params.acc}
'''
rule unmerge_paired:
input:
inun = "Clean/3.sorted/{exp}_reads_nonRNA.fastq"
output:
R1 = "Clean/4.Unmerge/{exp}.non_rRNA_R1.fastq",
R2 = "Clean/4.Unmerge/{exp}.non_rRNA_R2.fastq"
shell:"unmerge-paired-reads.sh {input.inun} {output.R1} {output.R2}"
This worked fine !. But for 1 sample it produced an output with size ~53 GB. I have 90 samples to run and cannot afford huge disk space. I tried to make output of rules unzip,merge_paired,sortmeRNA as temp(), but upon executing unmerge_paired raises "Missing input files exception" error. I also tried to add rule_remove to delete all those intermediate directories. But that is not executed as last rule, rather somewhere in the middle raising error again !. Is there any efficient way to do this ?
The error that occurs is:
MissingInputException in line 45 of sortmeRNA_pipeline_memv2.0.snakefile:
Missing input files for rule unmerge_paired:
Clean/3.sorted/test_reads_nonRNA.fastq
Also please note that, rule sortmeRNA requires a string for output and produces string.fastq file, which is then input into rule unmerge_paired ! Thanks.
回答1:
For Snakemake to connect the input of one rule to the output of another, they will need to be identical. The way you describe the output of sortmeRNA
and the input unmerge_paired
do not work together, whether you put temp()
around it or not.
rule sortmeRNA:
input:
ilfq = "Clean/2.interleaved/{exp}.il.trimd.fastq"
output:
reads_rRNA = temp("Clean/3.sorted/{exp}_reads_rRNA.fastq"),
non_rRNA = temp("Clean/3.sorted/{exp}_reads_nonRNA.fastq")
params:
reads_rRNA = "Clean/3.sorted/{exp}_reads_rRNA",
non_rRNA = "Clean/3.sorted/{exp}_reads_nonRNA"
shell:
'''
sortmerna --aligned {params.reads_rRNA} --other {params.non_rRNA} ...
'''
rule unmerge_paired:
input:
inun = "Clean/3.sorted/{exp}_reads_nonRNA.fastq" # or rules.sortmeRNA.output.non_rRNA
output:
R1 = "Clean/4.Unmerge/{exp}.non_rRNA_R1.fastq",
R2 = "Clean/4.Unmerge/{exp}.non_rRNA_R2.fastq"
shell:
"unmerge-paired-reads.sh {input.inun} {output.R1} {output.R2}"
I removed all the stuff which isn't necessary to understand what is going on, you will have to place those back obviously. I changed the output of sortmeRNA
to the actual output of the rule (and made them temp). I also added two params, which are the same as the output but then without the fastq extension.
来源:https://stackoverflow.com/questions/59766049/execute-snakemake-rule-as-last-rule