distinct-values

How to maintain a Unique List in Java?

心不动则不痛 submitted on 2019-11-27 00:08:55
Question: How do I create a list of unique/distinct objects (no duplicates) in Java? Right now I am using a HashMap<String, Integer> for this: duplicate keys are overwritten, so at the end HashMap.keySet() returns the unique keys. But I am sure there is a better way, since the value part is wasted here.
Answer 1: You can use a Set implementation. Some info from the Javadoc: A collection that contains no duplicate elements. More formally, sets contain no pair of elements e1 and e2
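As a minimal sketch of the Set-based approach the answer points to: the class and method names below (UniqueList, distinctInOrder) are hypothetical and not from the original post, and the choice of LinkedHashSet is an assumption made here so that first-seen order is kept, which a plain HashSet would not guarantee.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class UniqueList {
    // Hypothetical helper: returns the distinct elements of items,
    // keeping the order in which each element first appears.
    public static <T> List<T> distinctInOrder(List<T> items) {
        // LinkedHashSet drops duplicates (per equals/hashCode)
        // and remembers insertion order.
        return new ArrayList<>(new LinkedHashSet<>(items));
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("b", "a", "b", "c", "a");
        System.out.println(distinctInOrder(raw)); // prints [b, a, c]
    }
}
```

If order does not matter, a HashSet can be used directly in place of the list; the elements need consistent equals and hashCode implementations either way.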

Spark DataFrame: count distinct values of every column

元气小坏坏 submitted on 2019-11-26 18:58:32
The question is pretty much in the title: Is there an efficient way to count the distinct values in every column in a DataFrame? The describe method provides only the count but not the distinct count, and I wonder if there is a way to get the distinct count for all (or some selected) columns.
Multiple aggregations would be quite expensive to compute. I suggest that you use approximation methods instead. In this case, approximating the distinct count:
    val df = Seq((1,3,4),(1,2,3),(2,3,4),(2,3,5)).toDF("col1","col2","col3")
    val exprs = df.columns.map((_ -> "approx_count_distinct")).toMap
    df.agg
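For reference, the exact per-column distinct counts that the approximation estimates can be worked out by hand on the four sample rows. The sketch below does this in plain Java with one HashSet per column; it does not use Spark, it is not the answer's method, and the names (ColumnDistinctCounts, distinctCounts) are hypothetical. It only illustrates the quantity being computed and assumes every row has the same number of columns.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ColumnDistinctCounts {
    // Hypothetical helper: exact number of distinct values in each column.
    // Assumes all rows have the same length as the first row.
    public static int[] distinctCounts(int[][] rows) {
        if (rows.length == 0) {
            return new int[0];
        }
        int columns = rows[0].length;
        // One set of seen values per column, filled in a single pass.
        @SuppressWarnings("unchecked")
        Set<Integer>[] seen = new Set[columns];
        for (int c = 0; c < columns; c++) {
            seen[c] = new HashSet<>();
        }
        for (int[] row : rows) {
            for (int c = 0; c < columns; c++) {
                seen[c].add(row[c]);
            }
        }
        int[] counts = new int[columns];
        for (int c = 0; c < columns; c++) {
            counts[c] = seen[c].size();
        }
        return counts;
    }

    public static void main(String[] args) {
        // Same rows as the sample DataFrame: col1, col2, col3.
        int[][] rows = { {1, 3, 4}, {1, 2, 3}, {2, 3, 4}, {2, 3, 5} };
        System.out.println(Arrays.toString(distinctCounts(rows))); // prints [2, 2, 3]
    }
}
```

Keeping a full set per column is what makes exact counting expensive at scale, which is the cost the answer suggests avoiding with an approximate count.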

Spark DataFrame: count distinct values of every column

柔情痞子 submitted on 2019-11-26 05:34:29
Question: The question is pretty much in the title: Is there an efficient way to count the distinct values in every column in a DataFrame? The describe method provides only the count but not the distinct count, and I wonder if there is a way to get the distinct count for all (or some selected) columns.
Answer 1: Multiple aggregations would be quite expensive to compute. I suggest that you use approximation methods instead. In this case, approximating the distinct count:
    val df = Seq((1,3,4),(1,2,3),(2,3,4),(2,3

Counting unique / distinct values by group in a data frame

可紊 submitted on 2019-11-25 22:39:54
Question: Let's say I have the following data frame:
    > myvec
        name order_no
    1    Amy       12
    2   Jack       14
    3   Jack       16
    4   Dave       11
    5    Amy       12
    6   Jack       16
    7    Tom       19
    8  Larry       22
    9    Tom       19
    10  Dave       11
    11  Jack       17
    12   Tom       20
    13   Amy       23
    14  Jack       16
I want to count the number of distinct order_no values for each name. It should produce the following result:
    name  number_of_distinct_orders
    Amy   2
    Jack  3
    Dave  1
    Tom   2
    Larry 1
How can I do that?
Answer 1: This should do the trick:
    ddply(myvec,~name,summarise,number_of_distinct_orders=length
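The grouping logic behind the requested result can be sketched independently of R: collect the order numbers seen for each name into a set, then take each set's size. The Java below is an illustration of that computation only, not a translation of the ddply call in the answer; the class and method names (DistinctOrdersByName, countDistinct) are hypothetical, and a LinkedHashMap is assumed so that names come out in first-seen order, matching the expected result in the question.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DistinctOrdersByName {
    // Hypothetical helper: number of distinct order numbers per name.
    // names.get(i) and orderNos.get(i) together form row i of the data frame.
    public static Map<String, Integer> countDistinct(List<String> names, List<Integer> orderNos) {
        // Group: name -> set of order numbers seen for that name.
        Map<String, Set<Integer>> seen = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            seen.computeIfAbsent(names.get(i), k -> new HashSet<>()).add(orderNos.get(i));
        }
        // Summarise: replace each set by its size.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, Set<Integer>> e : seen.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        // The 14 rows of myvec from the question.
        List<String> names = Arrays.asList("Amy", "Jack", "Jack", "Dave", "Amy", "Jack", "Tom",
                "Larry", "Tom", "Dave", "Jack", "Tom", "Amy", "Jack");
        List<Integer> orderNos = Arrays.asList(12, 14, 16, 11, 12, 16, 19, 22, 19, 11, 17, 20, 23, 16);
        System.out.println(countDistinct(names, orderNos));
        // prints {Amy=2, Jack=3, Dave=1, Tom=2, Larry=1}
    }
}
```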