chaos game for DNA sequences

只愿长相守 提交于 2019-12-03 03:25:01

Summary of the incremental edits below:

This will give you a considerable speedup in computing the point coordinates by using compiled code (50x excluding computing shifts):

shifts = chars /. {"A" -> {0., 0.}, "T" -> {.5, 0.}, "G" -> {.5, .5}, "C" -> {0, .5}};
fun1d = Compile[{{a, _Real, 1}}, FoldList[#/2 + #2 &, .5, a], CompilationTarget -> "C"]
pts = Transpose[fun1d /@ Transpose[shifts]];

The bottleneck in your code is actually rendering the graphic, we instead of plotting each point, we'll visualize the density of points:

threshold = 1;
With[{size = 300}, 
 Image[1 - UnitStep[BinCounts[pts, 1/size, 1/size] - threshold]]
]

A region will be coloured black if it has at least threshold points. size is the image-dimension. By either choosing a large size or a large threshold you can avoid the "black square problem".


My original answer with more details:

On my rather dated machine, the code is not very slow.

chars = RandomChoice[{"A", "T", "C", "G"}, 800000];

f[x_, "A"] := x/2;
f[x_, "T"] := x/2 + {1/2, 0};
f[x_, "G"] := x/2 + {1/2, 1/2};
f[x_, "C"] := x/2 + {0, 1/2};
Timing[pts = FoldList[f, {0.5, 0.5}, chars];]
Graphics[{PointSize[Tiny], Point[pts]}]

I get a timing of 6.8 seconds, which is usable unless you need to run it lots of times in a loop (if it's not fast enough for your use case and machine, please add a comment, and we'll try to speed it up).

Rendering the graphic unfortunately takes much longer than this (36 seconds), and I don't know if there's anything you can do about it. Disabling antialiasing may help a little bit, depending on your platform, but not much: Style[Graphics[{PointSize[Tiny], Point[pts]}], Antialiasing -> False] (for me it doesn't). This is a long-standing annoyance for many of us.

Regarding the whole graphic being black, you can resize it using your mouse and make it bigger. The next time you evaluate your expression, the output graphic will remember its size. Or just use ImageSize -> 800 as a Graphics option. Considering the pixel density of screens the only other solution that I can think of (that doesn't involve resizing the graphic) would be to represent pixel density using shades of grey, and plot the density.

EDIT:

This is how you can plot the density (this is also much much faster to compute and render than the point-plot!):

With[{resolution = 0.01}, 
 ArrayPlot@BinCounts[pts, resolution, resolution]
]

Play with the resolution to make the plot nice.

For my random-sequence example, this only gives a grey plot. For your genome data it will probably give a more interesting pattern.

EDIT 2:

Here's a simple way to speed up the function using compilation:

First, replace the characters by the shift vectors (has to be done only once for a dataset, then you can save the result):

arr = chars /. {"A" -> {0., 0.}, "T" -> {.5, 0.}, "G" -> {.5, .5}, "C" -> {0, .5}};

Then let's compile our function:

fun = Compile[{{a, _Real, 2}}, FoldList[#/2 + #2 &, {.5, .5}, a], 
 CompilationTarget -> "C"]

Remove CompilationTarget if your version of Mathematica is earlier than 8 or you don't have a C compiler installed.

fun[arr]; // Timing

gives me 0.6 seconds, which is an instant 10x speedup.

EDIT 3:

Another ~5x speedup is possible compared to the above compiled version by avoiding some kernel callbacks in the compiled function (I checked the compilation output using CompilePrint to come up with this version --- otherwise it's not obvious why it's faster):

fun1d = Compile[{{a, _Real, 1}}, FoldList[#/2 + #2 &, .5, a], 
  CompilationTarget -> "C"]

arrt = Transpose[arr];
Timing[result = fun1d /@ arrt;]
pts = Transpose[result];

This runs in 0.11 seconds on my machine. On a more modern machine it should finish in a few seconds even for a 40 MB dataset.

I split off the transpositions into separate inputs because at this point the running time of fun1d starts to get comparable to the running time of Transpose.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!