Calling another pipeline within a Snakefile results in missing output errors

Submitted on 2020-01-25 07:47:05

Question


I am using an assembly pipeline called Canu inside my Snakemake pipeline, but when it reaches the rule calling Canu, Snakemake exits with a MissingOutputException error. Canu submits multiple jobs to the cluster itself, so it seems Snakemake expects the output as soon as the first of those jobs finishes. Is there a way to avoid this? I know I could use a very long --latency-wait value, but that is not very optimal.

Snakefile code:

#!/miniconda/bin/python

workdir: config["path_to_files"]
wildcard_constraints:
    separator = config["separator"],
    sample = '|' .join(config["samples"]),

rule all:
    input:
        expand("assembly-stats/{sample}_stats.txt", sample = config["samples"])

rule short_reads_QC:
    input:
        f"short_reads/{{sample}}_short{config['separator']}*.fq.gz"

    output:
        "fastQC-reports/{sample}.html"

    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"

    shell:
        """
        mkdir -p fastqc-reports
        fastqc -o fastqc-reports {input}
        """

rule quality_trimming:
    input:
        forward = f"short_reads/{{sample}}_short{config['separator']}1.fq.gz",
        reverse = f"short_reads/{{sample}}_short{config['separator']}2.fq.gz",

    output:
        forward = "cleaned_short-reads/{sample}_short_1-clean.fastq",
        reverse = "cleaned_short-reads/{sample}_short_2-clean.fastq"

    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"

    shell:
        "bbduk.sh -Xmx1g in1={input.forward} in2={input.reverse} out1={output.forward} out2={output.reverse}  qtrim=rl trimq=10"

rule long_read_assembly:
    input:
        "long_reads/{sample}_long.fastq.gz"

    output:
        "canu-outputs/{sample}.subreads.contigs.fasta"

    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"

    shell:
        "canu -p {wildcards.sample} -d canu-outputs genomeSize=8m -pacbio-raw {input}"
rule short_read_alignment:
    input:
        short_read_fwd = "cleaned_short-reads/{sample}_short_1-clean.fastq",
        short_read_rvs = "cleaned_short-reads/{sample}_short_2-clean.fastq",
        reference = "canu-outputs/{sample}.subreads.contigs.fasta"

    output:
        "bwa-output/{sample}_short.bam"

    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"

    shell:
        "bwa mem {input.reference} {input.short_read_fwd} {input.short_read_rvs} |  samtools view -S -b > {output}"


rule indexing_and_sorting:
    input:
        "bwa-output/{sample}_short.bam"
    output:
        "bwa-output/{sample}_short_sorted.bam"

    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"

    shell:
        "samtools sort {input} > {output}"

rule polishing:
    input:
        bam_files = "bwa-output/{sample}_short_sorted.bam",
        long_assembly = "canu-outputs/{sample}.subreads.contigs.fasta"

    output:
        "pilon-output/{sample}-improved.fasta"

    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"

    shell:
        "pilon --genome {input.long_assembly} --frags {input.bam_files} --output {output} --outdir pilon-output"

rule assembly_stats:
    input:
        "pilon-output/{sample}-improved.fasta"
    output:
        "assembly-stats/{sample}_stats.txt"

    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"

    shell:
        "stats.sh in={input} gc=assembly-stats/{wildcards.sample}/{wildcards.sample}_gc.csv gchist=assembly-stats/{wildcards.sample}/{wildcards.sample}_gchist.csv shist=assembly-stats/{wildcards.sample}/{wildcards.sample}_shist.csv > assembly-stats/{wildcards.sample}/{wildcards.sample}_stats.txt"

The exact error:

Waiting at most 60 seconds for missing files.
MissingOutputException in line 43 of /faststorage/home/lamma/scripts/hybrid_assembly/bacterial-hybrid-assembly.smk:
Missing files after 60 seconds:
canu-outputs/F19FTSEUHT1027.PSU4_ISF1A.subreads.contigs.fasta
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

The snakemake command being used:

snakemake --latency-wait 60 --rerun-incomplete --keep-going --jobs 99 --cluster-status 'python /home/lamma/faststorage/scripts/slurm-status.py' --cluster 'sbatch  -t {cluster.time} --mem={cluster.mem} --cpus-per-task={cluster.c} --error={cluster.error}  --job-name={cluster.name} --output={cluster.output}' --cluster-config bacterial-hybrid-assembly-config.json --configfile yaml-config-files/test_experiment3.yaml --use-conda --snakefile bacterial-hybrid-assembly.smk

Answer 1:


I surmise that canu is giving you canu-outputs/{sample}.contigs.fasta, not canu-outputs/{sample}.subreads.contigs.fasta. If so, edit the canu command to be

canu -p {wildcards.sample}.subreads ...
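
For context, the long_read_assembly rule might then look something like the sketch below (assuming, as surmised above, that canu names the assembly <prefix>.contigs.fasta inside the directory given by -d):

rule long_read_assembly:
    input:
        "long_reads/{sample}_long.fastq.gz"

    output:
        "canu-outputs/{sample}.subreads.contigs.fasta"

    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"

    # With -p {wildcards.sample}.subreads, canu should write
    # canu-outputs/{sample}.subreads.contigs.fasta, matching the declared output above.
    shell:
        "canu -p {wildcards.sample}.subreads -d canu-outputs genomeSize=8m -pacbio-raw {input}"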

(By the way, I don't think #!/miniconda/bin/python is necessary).



Source: https://stackoverflow.com/questions/59630183/calling-another-pipeline-within-a-snakefile-result-in-mising-output-errors
