Snakemake - How do I use each line of a file as an input?

Posted by 我的梦境 on 2020-12-12 04:02:36

Question


I need to use each line of the file tissuesused.txt as an input for a parallelized rule in snakemake. I think there are about 48 jobs I would like to call in total.

for line in $(cat tissuesused.txt)
do
   echo "Sorting $line.phen_fastqtl.bed to $line/$line.pheno.bed..."
   bedtools sort -header -i $line/$line.phen_fastqtl.bed > $line/$line.pheno.bed 
   echo "bgzipping $line/$line.pheno.bed..."
   bgzip -f $line/$line.pheno.bed
   #figure out where tabix outputs
   echo "Indexing $line/$line.pheno.bed.gz..."
   tabix -p bed $line/$line.pheno.bed.gz
done

How would I go about doing this in Snakemake? I can't find anything on this online. This job occurs halfway through the pipeline, so I don't know how I would define a function at the top of the Snakefile for a file that doesn't exist yet. I would just like to create a list of strings, each containing an abbreviation for a type of human tissue found in tissuesused.txt. I found a section in the Snakemake docs that looks like it might be relevant, but I am not sure how I would apply it to my case. Thank you in advance.

EDIT: Here's what I have so far, not sure if it'll work:

def fileAsList(file):
    with open(file) as f:
        for line in f:
            lis = []
            spl = line.split()
            lis.append(spl[0])
        return lis
...
rule sort_zip_ind_pheno:
    input:
        tis=fileAsList("tissuesused.txt"),
        chk=".make_tis_dirs.chkpnt"
    output:
        touch(".sort_zip_ind_pheno.chkpnt")
    shell:
        "bedtools sort -header -i {input.tis}/{input.tis}.phen_fastqtl.bed > \
        {input.tis}/{input.tis}.pheno.bed;"
        "bgzip -f {input.tis}/{input.tis}.pheno.bed;"
        "tabix -p bed {input.tis}/{input.tis}.pheno.bed.gz"

Please let me know if this makes sense.


Answer 1:


I think what you are looking for are checkpoints in Snakemake. Take a look at this example:

checkpoint get_tissue:
    output:
        "tissuesused.txt"
    run:
        with open(output[0], 'w') as f:  # 'w' so a rerun does not append duplicate lines
            for i in range(9):
                f.write(f"{i}\n")


rule read_tissue:
    output:
        "tissue_{n}.txt"
    shell:
        """
        echo "this is tissue {wildcards.n}" > {output}
        """


def read_tissues_output(wildcards):
    with open(checkpoints.get_tissue.get().output[0]) as f:
        samples = [sample for sample in f.read().split('\n') if len(sample) > 0]  # skip empty lines
        return expand("tissue_{sample}.txt", sample=samples)

rule all:
    input:
        read_tissues_output

and run it with

snakemake --until all

Rule all uses read_tissues_output as its input function (as you suggest in the question). The function requests the output of checkpoint get_tissue; if that file does not exist yet, Snakemake runs the checkpoint first and then re-evaluates the function. Once the file exists, the function reads it and returns the files we want to generate (tissue_0.txt through tissue_8.txt, since the checkpoint writes range(9)). Rule read_tissue can then generate these files, in parallel, for us.
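
As a minimal sketch of the line-to-target mapping that an input function like read_tissues_output performs, the plain-Python helpers below apply it to the file layout from the question. The names file_as_list and pheno_targets are hypothetical, and the sketch assumes that tabix writes its index next to the compressed file with a .tbi suffix. Unlike the helper in the question, the list is created once before the loop, so every tissue is kept rather than only the last one.

```python
def file_as_list(path):
    """Return the first whitespace-separated token of every non-empty line."""
    tissues = []  # created once, before the loop, so entries accumulate
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if fields:  # skip blank lines
                tissues.append(fields[0])
    return tissues


def pheno_targets(path):
    """Build one index path per tissue, e.g. WHLBLD/WHLBLD.pheno.bed.gz.tbi.

    Assumption: the final file of the question's loop is the tabix index,
    written as <tissue>/<tissue>.pheno.bed.gz.tbi.
    """
    return [f"{t}/{t}.pheno.bed.gz.tbi" for t in file_as_list(path)]
```

In a Snakefile, a list like this could be returned from an input function, with the sorting rule keyed on a {tissue} wildcard so that each tissue becomes its own job.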

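Edit: if tissuesused.txt already exists when the Snakefile is parsed, no checkpoint is needed and the file can be read directly.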

tissuesused.txt:

WHLBLD
TESTIS
THYROID

Snakefile:

def read_tissues_output():
    with open('tissuesused.txt') as f:
        samples = [sample for sample in f.read().split('\n') if len(sample) > 0]  # skip empty lines
        return expand("tissue_{sample}.txt", sample=samples)

rule all:
    input:
        read_tissues_output()


rule read_tissue:
    output:
        "tissue_{n}.txt"
    shell:
        """
        echo "this is tissue {wildcards.n}" > {output}
        """


Source: https://stackoverflow.com/questions/57596812/snakemake-how-do-i-use-each-line-of-a-file-as-an-input
