snakemake

snakemake: how to deal with a variable number of outputs from a rule

眉间皱痕 submitted on 2020-01-03 21:12:22
Question: I want to run bcl2fastq to generate fastq files from bcl format. Depending on the sequencing setup, in terms of sequencing mode and how many indexes were used, it can generate either read1, read2, index1 or read1, read2, index1, index2, etc. What I want to do is put the read ID information in the config.yaml file, like this:

readids: ['I1','I2','R1','R2']

and let the rule figure out automatically how many read outputs (fastq.gz files) it should generate. How do I write the output …
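A minimal sketch of one common way to do this (not taken from the thread; the run-folder layout, output naming, and shell command are assumptions): build the rule's output list from config["readids"] with expand, escaping any other wildcard with double braces so it survives the expansion.

configfile: "config.yaml"   # assumed to contain: readids: ['I1','I2','R1','R2']

rule bcl2fastq:
    input:
        rundir="bcl/{run}"   # hypothetical run-folder layout
    output:
        # one fastq.gz per read ID listed in the config;
        # {{run}} is escaped, so it stays an ordinary wildcard
        expand("fastq/{{run}}_{readid}.fastq.gz", readid=config["readids"])
    # placeholder command: real bcl2fastq file names are driven by the sample
    # sheet, so a rename step may be needed to match the declared outputs
    shell:
        "bcl2fastq --runfolder-dir {input.rundir} --output-dir fastq"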

snakemake, how to build a loop for two independent parameters

China☆狼群 submitted on 2020-01-03 03:33:07
Question: I want to loop Snakemake over two different wildcards, which, I think, are somehow independent of each other. In case there is already a solved thread for this, I would be happy about a hint, but so far I'm not sure what the correct terms are to search for what I want to do. Let's assume my pipeline has three steps. I have a set of samples which I process in each of those three steps, but in the second step I apply an extra parameter to every sample. In the third step now I have …
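A minimal sketch of how two independent wildcards are usually combined (sample names, parameter values, and file layout are made up): by default, expand builds the cross product of all wildcard values, so a target rule can request every sample × parameter combination while the processing rules keep both wildcards.

SAMPLES = ["sampleA", "sampleB"]   # hypothetical sample set
PARAMS  = ["p1", "p2"]             # hypothetical extra parameter of step 2

rule all:
    input:
        # cross product: step3/sampleA_p1.txt, step3/sampleA_p2.txt, ...
        expand("step3/{sample}_{param}.txt", sample=SAMPLES, param=PARAMS)

rule step2:
    input:
        "step1/{sample}.txt"   # assumed to be produced by a first step
    output:
        "step2/{sample}_{param}.txt"
    shell:
        "process --param {wildcards.param} {input} > {output}"   # placeholder

rule step3:
    input:
        "step2/{sample}_{param}.txt"
    output:
        "step3/{sample}_{param}.txt"
    shell:
        "cp {input} {output}"   # placeholder for the real third step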

Snakemake rule with Python script, conda and cluster

你。 submitted on 2020-01-01 17:56:05
Question: I would like to have Snakemake run a Python script with a specific conda environment via an SGE cluster. On the cluster I have miniconda installed in my home directory. My home directory is mounted via NFS, so it is accessible to all cluster nodes. Because miniconda is in my home directory, the conda command is not on the operating system path by default; i.e., to use conda I first need to explicitly add it to the path. I have a conda environment specification as a yaml file, which could be used …
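A minimal sketch of the built-in mechanism for this (all file names are assumptions): give the rule a conda: directive pointing at the environment yaml and a script: directive for the Python file, then run with --use-conda so Snakemake creates and activates the environment itself.

rule analyze:
    input:
        "data/input.csv"        # hypothetical input
    output:
        "results/output.csv"
    conda:
        "envs/analysis.yaml"    # the environment specification yaml
    script:
        "scripts/analyze.py"    # sees snakemake.input / snakemake.output

Invoked along the lines of snakemake --use-conda --cluster "qsub -V -cwd". Whether conda must already be on PATH on the execution hosts depends on the Snakemake version, so sourcing it in a shell profile or in a --cluster wrapper script is a common workaround.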

What are snakemake metadata files? When can I erase those?

佐手、 submitted on 2020-01-01 10:02:40
Question: I notice that my backup rsync script spends quite some time copying files with random names from .snakemake/metadata folders. What are those files used for? Can I safely erase them after a snakemake run has completed, or are they necessary for snakemake to correctly perform the next run? More generally, is there some documentation about the files that snakemake creates in the .snakemake folder?
Answer 1: From this comment by Johannes Köster, the creator of Snakemake: [The .snakemake/ directory] is used …

Parallelizing snakemake rule

人盡茶涼 submitted on 2019-12-31 02:01:32
Question: Sorry if this is a naive question, but I'm still trying to wrap my head around the intricacies of Snakemake. I have a directory containing a number of files that I want to apply a rule to in parallel (i.e., I want to submit the same script to the cluster, specifying a different input file for each submission). I first tried using expand for the input files, but this only resulted in one job submission:

CHROMS = [str(c) for c in range(1, 23)] + ["X"]

rule vep:
    input:
        expand("data/split/chr…
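A minimal sketch of the usual fix (the output layout and the vep command line are assumptions, not from the thread): a rule whose input uses expand sees all files at once and therefore runs as a single job; keeping {chrom} as a wildcard in the processing rule and moving expand into a target rule yields one job per chromosome.

CHROMS = [str(c) for c in range(1, 23)] + ["X"]

rule all:
    input:
        # requesting every per-chromosome result spawns 23 independent jobs
        expand("results/vep/chr{chrom}.vcf", chrom=CHROMS)

rule vep:
    input:
        "data/split/chr{chrom}.vcf"    # hypothetical split-file naming
    output:
        "results/vep/chr{chrom}.vcf"
    shell:
        "vep -i {input} -o {output}"   # placeholder VEP invocation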

How to do a partial expand in Snakemake?

感情迁移 submitted on 2019-12-30 10:32:47
Question: I'm trying to first generate 4 files, for the LETTERS x NUMS combinations, then summarize over the NUMS to obtain one file per element in LETTERS:

LETTERS = ["A", "B"]
NUMS = ["1", "2"]

rule all:
    input:
        expand("combined_{letter}.txt", letter=LETTERS)

rule generate_text:
    output:
        "text_{letter}_{num}.txt"
    shell:
        """
        echo "test" > {output}
        """

rule combine_text:
    input:
        expand("text_{letter}_{num}.txt", num=NUMS)
    output:
        "combined_{letter}.txt"
    shell:
        """
        cat {input} > {output}
        """

Executing this …
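A sketch of the usual resolution (phrased as a general pattern, not necessarily the thread's accepted answer): escape the wildcard that should survive the expansion with double braces, so that expand fills in only {num} and {letter} remains a wildcard for Snakemake to resolve.

rule combine_text:
    input:
        # {{letter}} is escaped: expand produces
        # text_{letter}_1.txt and text_{letter}_2.txt
        expand("text_{{letter}}_{num}.txt", num=NUMS)
    output:
        "combined_{letter}.txt"
    shell:
        "cat {input} > {output}"

Newer Snakemake releases also accept expand(..., allow_missing=True), which leaves any wildcard without a supplied value untouched.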

Snakemake passes only the first path in the output to shell command

喜你入骨 submitted on 2019-12-25 00:15:57
Question: I am trying to feed all of the paths at once, in one variable, to a python script in snakemake, like this:

rule neo4j:
    input:
        script = 'python/neo4j.py',
        path_to_cl = 'results/clusters/umap/{sample}_umap_clusters.csv',
        path_to_umap = 'results/umap/{sample}_umap.csv',
        path_to_mtx = 'data_files/normalized/{sample}.csv'
    output:
        'results/neo4j/{sample}/cells.csv',
        'results/neo4j/{sample}/genes.csv',
        'results/neo4j/{sample}/cl_nodes.csv',
        'results/neo4j/{sample}/cl_contains.csv',
        'results/neo4j/…
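A short sketch of the relevant behaviour (the command line below is an assumption about how the script is called): in a shell directive, {output} expands to all declared output paths separated by spaces, whereas {output[0]}, or a script that reads only one argv entry, picks up just the first path, which is the usual cause of this symptom.

rule neo4j:
    input:
        script='python/neo4j.py',
        mtx='data_files/normalized/{sample}.csv'   # other inputs omitted for brevity
    output:
        'results/neo4j/{sample}/cells.csv',
        'results/neo4j/{sample}/genes.csv'
    # {output} -> 'results/neo4j/S/cells.csv results/neo4j/S/genes.csv'
    shell:
        "python {input.script} {input.mtx} {output}"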

Snakemake: “wildcards in input files cannot be determined from output files”

痴心易碎 submitted on 2019-12-24 10:38:32
Question: I use Snakemake to execute some rules, and I have a problem with one:

rule filt_SJ_out:
    input:
        "pass1/{sample}SJ.out.tab"
    output:
        "pass1/SJ.db"
    shell:'''
    gawk '$6==1 || ($6==0 && $7>2)' {input} >> {output};
    '''

Here I just want to merge some files into one general file, but from searching on Google I've seen that wildcards used in inputs must also be used in the output, and I can't find a way to work around this problem. Thanks in advance.
Answer 1: If you know the values of sample prior to running the …
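A sketch of where that answer is heading (the sample list is a made-up placeholder): enumerate the sample values up front and let expand resolve {sample} on the input side; the aggregating rule's output then needs no wildcard at all.

SAMPLES = ["s1", "s2", "s3"]   # hypothetical: known before the run

rule filt_SJ_out:
    input:
        expand("pass1/{sample}SJ.out.tab", sample=SAMPLES)
    output:
        "pass1/SJ.db"
    # one gawk call over all inputs; '>' avoids re-appending on reruns
    shell:
        "gawk '$6==1 || ($6==0 && $7>2)' {input} > {output}"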

How to access cluster_config dict within rule?

南笙酒味 submitted on 2019-12-23 21:42:41
Question: I'm working on writing a benchmarking report as part of a workflow, and one of the things I'd like to include is information about the amount of resources requested for each job. Right now, I can manually require the cluster config file ('cluster.json') as a hardcoded input. Ideally, though, I would like to be able to access the per-rule cluster config information that is passed through the --cluster-config arg. In __init__.py, this is accessed as a dict called cluster_config. Is there any way …
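A sketch of the hardcoded-input workaround the asker mentions (the path and key names are assumptions): parse the same cluster.json that is handed to --cluster-config and look up each rule's entry, falling back to __default__, rather than relying on Snakemake internals.

import json

# Load the same file that is passed via --cluster-config.
with open("cluster.json") as fh:
    cluster = json.load(fh)

def requested(rule_name, key):
    # per-rule entry if present, otherwise the __default__ section
    return cluster.get(rule_name, cluster.get("__default__", {})).get(key, "NA")

rule benchmark_report:
    output:
        "reports/resources.txt"
    params:
        mem=requested("benchmark_report", "mem")
    shell:
        "echo 'requested mem: {params.mem}' > {output}"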