Delete repeating list elements preserving order of appearance

笑着哭i 提交于 2019-12-17 11:00:58

问题


I am producing flat lists with 10^6 to 10^7 Real numbers, and some of them are repeating.

I need to delete the repeating instances, keeping the first occurrence only, and without modifying the list order.

The key here is efficiency, as I have a lot of lists to process.

Example (fake):

Input:

  {.8, .3 , .8, .5, .3, .6}

Desired Output

  {.8, .3, .5, .6}  

Aside note

Deleting repeating elements with Union (without preserving order) gives in my poor man's laptop:

DiscretePlot[a = RandomReal[10, i]; First@Timing@Union@a, {i, 10^6 Range@10}]


回答1:


You want DeleteDuplicates, which preserves list order:

In[13]:= DeleteDuplicates[{.8, .3, .8, .5, .3, .6}]

Out[13]= {0.8, 0.3, 0.5, 0.6}

It was added in Mathematica 7.0.




回答2:


Not to compete with other answers, but I just could not help sharing a Compile - based solution. The solution is based on building a binary search tree, and then checking for every number in the list, whether its index in the list is the one used in building the b-tree. If yes, it is the original number, if no - it is a duplicate. What makes this solution interesting for me is that it shows a way to emulate "pass-by-reference" with Compile. The point is that, if we inline compiled functions into other Compiled functions (and that can be achieved with an "InlineCompiledFunctions" option), we can refer in inner functions to the variables defined in outer function scope (because of the way inlining works). This is not a true pass-by-reference, but it still allows to combine functions from smaller blocks, without efficiency penalty (this is more in the spirit of macro-expnsion). I don't think this is documented, and have no idea whether this will stay in future versions. Anyways, here is the code:

(* A function to build a binary tree *)
Block[{leftchildren , rightchildren},
makeBSearchTree = 
Compile[{{lst, _Real, 1}},
Module[{len = Length[lst], ctr = 1, currentRoot = 1},
 leftchildren = rightchildren =  Table[0, {Length[lst]}];
 For[ctr = 1, ctr <= len, ctr++,
  For[currentRoot = 1, lst[[ctr]] != lst[[currentRoot]],(* 
   nothing *),
   If[lst[[ctr]] < lst[[currentRoot]],
    If[leftchildren[[currentRoot]] == 0,
     leftchildren[[currentRoot]] = ctr;
     Break[],
     (* else *)
     currentRoot = leftchildren[[currentRoot]] ],
    (* else *)
    If[rightchildren[[currentRoot]] == 0,
     rightchildren[[currentRoot]] = ctr;
     Break[],
     (* else *)
     currentRoot = rightchildren[[currentRoot]]]]]];
 ], {{leftchildren, _Integer, 1}, {rightchildren, _Integer, 1}},
CompilationTarget -> "C", "RuntimeOptions" -> "Speed",
CompilationOptions -> {"ExpressionOptimization" -> True}]];


(* A function to query the binary tree and check for a duplicate *)
Block[{leftchildren , rightchildren, lst},
isDuplicate = 
Compile[{{index, _Integer}},
Module[{currentRoot = 1, result = True},
 While[True,
  Which[
   lst[[index]] == lst[[currentRoot]],
    result = index != currentRoot;
    Break[],
   lst[[index]] < lst[[currentRoot]],
    currentRoot = leftchildren[[currentRoot]],
   True,
    currentRoot = rightchildren[[currentRoot]]
   ]];
 result
 ],
{{leftchildren, _Integer, 1}, {rightchildren, _Integer, 
  1}, {lst, _Real, 1}},
CompilationTarget -> "C", "RuntimeOptions" -> "Speed",
CompilationOptions -> {"ExpressionOptimization" -> True}
]];


(* The main function *)
Clear[deldup];
deldup = 
Compile[{{lst, _Real, 1}},
  Module[{len = Length[lst], leftchildren , rightchildren , 
     nodup = Table[0., {Length[lst]}], ndctr = 0, ctr = 1},
makeBSearchTree[lst]; 
For[ctr = 1, ctr <= len, ctr++,
 If[! isDuplicate [ctr],
  ++ndctr;
   nodup[[ndctr]] =  lst[[ctr]]
  ]];
Take[nodup, ndctr]], CompilationTarget -> "C", 
"RuntimeOptions" -> "Speed",
CompilationOptions -> {"ExpressionOptimization" -> True,
 "InlineCompiledFunctions" -> True, 
 "InlineExternalDefinitions" -> True}];

Here are some tests:

In[61]:= intTst = N@RandomInteger[{0,500000},1000000];

In[62]:= (res1 = deldup[intTst ])//Short//Timing
Out[62]= {1.141,{260172.,421188.,487754.,259397.,<<432546>>,154340.,295707.,197588.,119996.}}

In[63]:= (res2 = Tally[intTst,Equal][[All,1]])//Short//Timing
Out[63]= {0.64,{260172.,421188.,487754.,259397.,<<432546>>,154340.,295707.,197588.,119996.}}

In[64]:= res1==res2
Out[64]= True

Not as fast as the Tally version, but also Equal - based, and as I said, my point was to illustrate an interesting (IMO) technique.




回答3:


For versions of Mathematica before 7, and for general interest, here are several ways of implementing the UnsortedUnion (i.e. DeleteDuplicates) function. These are collected from the help docs and MathGroup. They have been adjusted to accept multiple lists which are then joined, in analogy to Union.

For Mathematica 4 or earlier

UnsortedUnion = Module[{f}, f[y_] := (f[y] = Sequence[]; y); f /@ Join@##] &

For Mathematica 5

UnsortedUnion[x__List] := Reap[Sow[1, Join@x], _, # &][[2]]

For Mathematica 6

UnsortedUnion[x__List] := Tally[Join@x][[All, 1]]

From Leonid Shifrin for Mathematica 3+ (?)

unsortedUnion[x_List] := Extract[x, Sort[Union[x] /. Dispatch[MapIndexed[Rule, x]]]]


来源:https://stackoverflow.com/questions/5246330/delete-repeating-list-elements-preserving-order-of-appearance

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!