duplicates

Pandas : remove SOME duplicate values based on conditions

為{幸葍}努か, submitted on 2021-02-05 05:00:26
Question: I have a dataset:

id  url    keep_if_dup
1   A.com  Yes
2   A.com  Yes
3   B.com  No
4   B.com  No
5   C.com  No

I want to remove duplicates, i.e. keep the first occurrence of the "url" field, BUT keep duplicates if the field "keep_if_dup" is Yes. Expected output:

id  url    keep_if_dup
1   A.com  Yes
2   A.com  Yes
3   B.com  No
5   C.com  No

What I tried:

Dataframe = Dataframe.drop_duplicates(subset='url', keep='first')

which of course does not take the "keep_if_dup" field into account. Output is:

id  url    keep_if_dup
1   A.com  Yes
3   B.com
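One way to express this condition (a minimal sketch, assuming the keep_if_dup values are exactly the strings "Yes"/"No") is to combine duplicated() with a boolean mask instead of drop_duplicates:

import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "url": ["A.com", "A.com", "B.com", "B.com", "C.com"],
    "keep_if_dup": ["Yes", "Yes", "No", "No", "No"],
})

# Keep a row if it is the first occurrence of its url, OR if keep_if_dup is "Yes"
mask = ~df.duplicated(subset="url", keep="first") | (df["keep_if_dup"] == "Yes")
print(df[mask])

This keeps ids 1, 2, 3 and 5, matching the expected output above.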

Pandas - Duplicate Row based on condition

大城市里の小女人, submitted on 2021-02-04 17:47:37
Question: I'm trying to create a duplicate row if the row meets a condition. In the table below, I created a cumulative count based on a groupby, then another calculation for the MAX of the groupby.

df['PathID'] = df.groupby('DateCompleted').cumcount() + 1
df['MaxPathID'] = df.groupby('DateCompleted')['PathID'].transform(max)

Date Completed  PathID  MaxPathID
1/31/17         1       3
1/31/17         2       3
1/31/17         3       3
2/1/17          1       1
2/2/17          1       2
2/2/17          2       2

In this case, I want to duplicate only the record for 2/1/17 since there is only
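The question is cut off above, so the exact rule is not fully visible; one hedged reading is that any date whose group contains a single record (MaxPathID == 1) should have its row duplicated, which pd.concat can sketch:

import pandas as pd

df = pd.DataFrame({"DateCompleted": ["1/31/17"] * 3 + ["2/1/17"] + ["2/2/17"] * 2})
df["PathID"] = df.groupby("DateCompleted").cumcount() + 1
df["MaxPathID"] = df.groupby("DateCompleted")["PathID"].transform("max")

# Append a second copy of every row that is alone in its date group
singles = df[df["MaxPathID"] == 1]
result = pd.concat([df, singles], ignore_index=True).sort_values("DateCompleted")
print(result)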

Dedupe in Python

痴心易碎, submitted on 2021-02-04 11:41:34
Question: While going through the examples of the Dedupe library in Python, which is used for record deduplication, I found that it creates a Cluster Id column in the output file, which according to the documentation indicates which records refer to each other. However, I am not able to find any relation between the Cluster Id values and how they help in finding duplicate records. If anyone has an insight into this, please explain it to me. This is the code for deduplication. # This can run
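For context, rows that share the same Cluster Id are the ones Dedupe judged to refer to the same real-world record. A small post-processing sketch (the column name "Cluster ID" is an assumption based on the library's csv_example output) that pulls out only the clustered duplicates:

import pandas as pd

# Hypothetical output file produced by the Dedupe csv example
out = pd.read_csv("csv_example_output.csv")

# Any Cluster ID that appears more than once is a group of matching records
dupes = out.groupby("Cluster ID").filter(lambda g: len(g) > 1)
print(dupes.sort_values("Cluster ID"))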

VBA looping through sheets removing duplicates

女生的网名这么多〃, submitted on 2021-01-29 19:58:53
Question: I have seen similar things, but for my code, which seems to be working, I just want to check for improvements, potential bugs, or unintended consequences. I have been handed spreadsheets that have ended up containing duplicate information. There are a lot of spreadsheets, and some have 100 sheets inside each file. Obviously I don't want to go through each sheet manually to remove the duplicate information. After searching around I think I have a solution and want your opinions on it.
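This is not the asker's VBA, but as a point of comparison, a pandas-based sketch (assuming .xlsx workbooks and that exact whole-row duplicates should be dropped) that deduplicates every sheet of a file could look like this:

import pandas as pd

path = "input.xlsx"                               # hypothetical file name
sheets = pd.read_excel(path, sheet_name=None)     # dict of {sheet name: DataFrame}

with pd.ExcelWriter("deduplicated.xlsx") as writer:
    for name, frame in sheets.items():
        # Drop rows that exactly repeat an earlier row on the same sheet
        frame.drop_duplicates().to_excel(writer, sheet_name=name, index=False)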

How do I find duplicate files by comparing them by size (ie: not hashing) in bash

久未见, submitted on 2021-01-29 10:22:41
Question: How do I find duplicate files by comparing them by size (i.e. not hashing) in bash?

Testbed files:

-rw-r--r-- 1 usern users 68239 May 3 12:29 The W.pdf
-rw-r--r-- 1 usern users 68239 May 3 12:29 W.pdf
-rw-r--r-- 1 usern users     8 May 3 13:43 X.pdf

Yes, files can have spaces (Boo!). I want to check files in the same directory and move the ones which match something else into a 'these are probably duplicates' folder. My probable use-case is going to have humans randomly mis-naming a smaller set of
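A Python sketch of the size-only grouping (not pure bash, and assuming a flat directory where "same size" is treated as "probable duplicate") might look like this:

import os
import shutil
from collections import defaultdict

src = "."                                    # directory to scan (assumption)
dest = "these are probably duplicates"       # folder name from the question
os.makedirs(dest, exist_ok=True)

by_size = defaultdict(list)
for entry in os.scandir(src):
    if entry.is_file():
        by_size[entry.stat().st_size].append(entry.path)

# Any size bucket holding more than one file is a group of probable duplicates;
# keep the first file and move the rest. Spaces in names need no special handling.
for size, paths in by_size.items():
    for path in sorted(paths)[1:]:
        shutil.move(path, dest)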

Run Time Error when running a VBA string to remove unique duplicates

妖精的绣舞, submitted on 2021-01-29 10:20:56
Question: So I initially asked how to remove unique duplicates based on case sensitivity (please refer to the link below: Excel: Removing Duplicates Based On Case Sensitivity) and ultimately I was guided to the following link: How to remove duplicates that are case SENSITIVE in Excel (for 100k records or more)? This time I'm using column Q to test out the formula, and so far the following formula works:

Sub duptest()
    Sheets("Analysis").Select
    Dim x, dict
    Dim lr As Long
    lr = Cells(Rows.Count, 1).End(xlUp)
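Not the asker's VBA, but for comparison: string comparison in pandas is case sensitive by default, so a case-sensitive single-column dedup can be sketched as follows (the column name "Q" is a stand-in for the question's column Q):

import pandas as pd

df = pd.DataFrame({"Q": ["ABC", "abc", "ABC", "AbC"]})

# duplicated() compares the strings exactly, so "ABC" and "abc" are not duplicates
deduped = df[~df["Q"].duplicated(keep="first")]
print(deduped)   # keeps ABC, abc and AbC; drops only the second ABC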

Postgresql group by for multiple lines

℡╲_俬逩灬., submitted on 2021-01-29 09:03:04
Question: I have a table named hr_holidays_by_calendar. I just want to filter out the rows where the same employee has two leaves on the same day.

Table hr_holidays_by_calendar:

Query I tried (wasn't anywhere near solving this):

select hol1.employee_id, hol1.leave_date, hol1.no_of_days, hol1.leave_state
from hr_holidays_by_calendar hol1
inner join (
    select employee_id, leave_date
    from hr_holidays_by_calendar hol1
    group by employee_id, leave_date
    having count(*) > 1
) sub on hol1.employee_id = sub
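The query is cut off above; as a cross-check of the intended logic ("more than one leave per employee per day"), the same filter can be sketched in pandas with duplicated(keep=False), using the column names from the query and invented sample values:

import pandas as pd

# Hypothetical sample mirroring hr_holidays_by_calendar
hol = pd.DataFrame({
    "employee_id": [1, 1, 2, 3],
    "leave_date": ["2021-01-04", "2021-01-04", "2021-01-04", "2021-01-05"],
    "no_of_days": [1, 1, 1, 2],
    "leave_state": ["validate"] * 4,
})

# keep=False marks every member of a duplicated (employee_id, leave_date) pair
dupes = hol[hol.duplicated(subset=["employee_id", "leave_date"], keep=False)]
print(dupes)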

Mark duplicated values MySQL without using GROUP BY

血红的双手。, submitted on 2021-01-29 08:49:27
Question: Can you please help me mark duplicated values in an additional column, without grouping the duplicated values? See my example data (what I have, and what I need to achieve on the right). As you can see, I have Product IDs with suffix E (Power) and G (Gas). Some Product IDs are duplicated: the same Product ID, one with E and the second with G, makes a Dual Product. A Product ID only with E makes a Power_Only_product, a Product ID only with G makes a Gas_Only_product, the same Product ID with E
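The description is truncated, so the exact layout is an assumption; reading it as one id column plus a separate E/G commodity flag, the per-row labelling (without collapsing the duplicates) can be sketched in pandas with transform, which keeps one output row per input row:

import pandas as pd

# Hypothetical shape: an id column and a separate E/G commodity flag
df = pd.DataFrame({
    "product_id": [100, 100, 200, 300],
    "commodity":  ["E", "G", "E", "G"],
})

# For each product_id, collect the set of flags it occurs with ("E", "G" or "EG")
flags = df.groupby("product_id")["commodity"].transform(lambda s: "".join(sorted(set(s))))

df["product_type"] = flags.map({
    "EG": "Dual_product",
    "E": "Power_Only_product",
    "G": "Gas_Only_product",
})
print(df)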