Question
Say I'm following the best-practice workflow structure suggested for snakemake. Now I'd like to know how (i.e. with which version) a given file, say plots/myplot.pdf, was generated. I found this surprisingly hard, if not impossible, with only the result folder at hand.
In more detail, say I generated the results using snakemake --use-conda --conda-prefix ~/.conda/myenvs
which will resolve and download the conda environments specified in the rule below (copied from the documentation):
rule NAME:
    input:
        "table.txt"
    output:
        "plots/myplot.pdf"
    conda:
        "envs/ggplot.yaml"
    script:
        "scripts/plot-stuff.R"
Say the content of envs/ggplot.yaml is the following:
channels:
- conda-forge
dependencies:
- r-ggplot2
After completion, the ggplot environment will have been saved under, say, ~/.conda/myenvs/d2d1d57b (note that the env name d2d1d57b is assigned by snakemake automatically).
The problem is that if I ship the workflow subfolder, e.g. as the result to someone else (or as a supplement to a paper), I don't know what ggplot version was used for that run. All I know is the content of the yaml file (which is also reported when using --report).
Also, since ggplot depends on other software, such as R, I wouldn't know which R version was used for a given rule using this environment, since the yaml file doesn't list indirect dependencies.
Ideally, I'd like to have the complete list of software versions in the environment shipped with the workflow results.
As a workaround one could use conda env export -n name_of_env and copy the output into the result folder, but strangely conda list -n ~/.conda/myenvs/d2d1d57b does not work (it fails with the error Characters not allowed: ('/', ' ', ':', '#')).
Creating an environment manually and inspecting it indeed gives me (among other info):
r-base 4.0.2 he766273_1 conda-forge
r-ggplot2 3.3.2 r40h6115d3f_0 conda-forge
That's exactly what I'm after, but doing this manually would of course be too tedious.
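For what it's worth, conda selects an environment by directory with -p/--prefix, whereas -n/--name only accepts plain names, which would explain the error above. A minimal sketch of automating the inspection under that assumption follows. The function names, the output file naming, and the use of a conda-meta folder to recognise environment directories are illustrative choices, not something snakemake provides.

```python
import subprocess
from pathlib import Path


def build_export_command(env_prefix):
    """Return the conda command that exports the environment stored at a path."""
    # -p/--prefix takes a directory; -n/--name rejects paths containing '/'.
    return ["conda", "env", "export", "-p", str(env_prefix)]


def export_all_envs(conda_prefix, out_dir, run=subprocess.run):
    """Write one <env>.export.yaml per environment directory under conda_prefix."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for env_dir in sorted(Path(conda_prefix).expanduser().iterdir()):
        # Assumption: a conda environment directory contains a conda-meta folder.
        if not (env_dir / "conda-meta").is_dir():
            continue
        result = run(
            build_export_command(env_dir),
            capture_output=True,
            text=True,
            check=True,
        )
        target = out_dir / f"{env_dir.name}.export.yaml"
        target.write_text(result.stdout)
        written.append(target)
    return written
```

Calling export_all_envs("~/.conda/myenvs", "results/envs") after a run would then leave one full export (including indirect dependencies such as r-base) per environment next to the results. The hypothetical run parameter only exists so the function can be exercised without conda installed.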
This is also true when using wrappers as far as I can tell.
In summary, given a workflow, or even a given file within the workflow, how can I trace back which exact software version(s) were used to generate it? Ideally, this information would be shipped automatically with the result of a workflow by default.
Maybe I'm even missing something very obvious, so hopefully someone can shed some light on this.
Answer 1:
Based on our discussion in the comments, you could redirect your environment to a log file:
rule NAME:
    input:
        "table.txt"
    output:
        "plots/myplot.pdf"
    log:
        "mylog.txt"
    conda:
        "envs/ggplot.yaml"
    shell:
        """
        conda env export > {log}
        yourcode
        """
However, as you indicate, this won't work if people do not use --use-conda, and it is tedious to add this to each rule, so you could try something like this (not tested, might not work):
if workflow.use_conda:
    shell.prefix("set -o pipefail; conda env export > {log}; ")
Which adds the export to each shell command!
Now if you use scripts, I am not so sure how to continue. The "easiest" way might be to just call "conda env export" in a shell command inside Python/R.
Edit: the shell prefix trick does not seem to work, so I struck through the text.
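For the script case, here is a rough sketch of what calling "conda env export" from inside a Python script could look like. It assumes that the activated environment is exposed through the CONDA_PREFIX variable and that the conda executable is on the PATH while the script runs. The helper name record_active_env and its run/environ parameters are made up for illustration, and I have not tested this inside a real workflow.

```python
import os
import subprocess


def record_active_env(log_path, run=subprocess.run, environ=os.environ):
    """Write the active conda environment's full package list to log_path."""
    prefix = environ.get("CONDA_PREFIX")
    if prefix is None:
        # No activated conda environment, e.g. a run without --use-conda.
        with open(log_path, "w") as handle:
            handle.write("# no active conda environment detected\n")
        return False
    result = run(
        ["conda", "env", "export", "-p", prefix],
        capture_output=True,
        text=True,
        check=True,
    )
    with open(log_path, "w") as handle:
        handle.write(result.stdout)
    return True


# Inside a Snakemake script, the injected `snakemake` object provides the log path:
# record_active_env(snakemake.log[0])
```

Because the function falls back to a placeholder line when no environment is active, the rule would still produce its log file when the workflow is run without --use-conda.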
Answer 2:
As @Maarten-vd-Sande mentioned, the version should be specified in the conda env file. Just as you may have thought, you will also need to define r-base and its version in the conda env file to ensure that a specific version of R is used. See here for an example from a snakemake-wrapper.
As part of best practices towards reproducible research, it is highly recommended to specify tool versions in conda env files. Snakemake-wrappers typically follow this rule, but you might find some not following this.
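To make the pinning advice easy to check, here is a small sketch that flags dependency entries without a version constraint. It is a simplification: it only looks at the strings listed under dependencies (as already parsed from the yaml file), and the function name unpinned_dependencies is hypothetical.

```python
def unpinned_dependencies(dependencies):
    """Return the dependency specs that carry no version constraint."""
    unpinned = []
    for spec in dependencies:
        if not isinstance(spec, str):
            # Skip nested entries such as a pip sub-list.
            continue
        # Simplification: treat '=', '<' or '>' as evidence of a version constraint.
        if not any(op in spec for op in ("=", "<", ">")):
            unpinned.append(spec)
    return unpinned


# r-base is pinned to the version seen in the question; r-ggplot2 is not pinned.
print(unpinned_dependencies(["r-base=4.0.2", "r-ggplot2"]))  # prints ['r-ggplot2']
```

Note that pinning only the direct dependencies still leaves indirect ones to the solver, so an export of the resolved environment remains useful alongside a pinned env file.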
Source: https://stackoverflow.com/questions/64043879/how-to-trace-back-exact-software-versions-used-to-generate-result-files-in-a-s