data-cleaning

Finding and removing some characters in a column of data in Excel

谁都会走 submitted on 2019-12-25 02:18:09
Question: I have copied and pasted some debugging information into an Excel sheet. However, some cells in one column, which should contain only integers, also contain "weird" characters. What would be the easiest way to eliminate such characters using VBA? An example is shown in the list below:
1 **'␁'** <- I'm trying to get rid of the part that I have bolded
2 '␂'
3 '␃'
4 '␂'
I want to use the file as a data source in another application. Thanks in advance. Answer 1: Try this (first time

dplyr table reconstructing/data wrangling

﹥>﹥吖頭↗ submitted on 2019-12-25 00:18:02
Question: I'm trying to create a variable that defines true vs. false searches. The original dataset is located here: https://github.com/wikimedia-research/Discovery-Hiring-Analyst-2016/blob/master/events_log.csv.gz The basic scenario is that there are variables that define how many times a user (identified by an ID, either session_id or uuid in the original dataset) performs a true search vs. a false search, such that a visit is always preceded by a search, but a search does not have to be followed by a visit

How to check if an id comes into data on a particular date that it stays until an exit date

旧时模样 submitted on 2019-12-24 17:19:11
Question: I have a data set that looks something like the one below. Basically, I want to check that if a particular id is present at the beginning of the year (in this case Jan 1, 2003), it is present every day until the end of the year (Dec 31, 2003), and then start the check over again at the start of the next year, as people might change from year to year but should not change within a year. If an id is not present on a certain day, I would like to know which day and which id. I first started
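The check described above can be sketched in Python as follows. The excerpt does not show the data set, so the layout here is an assumption: each record is a hypothetical `(id, date)` pair, and the ids "A" and "B" are made-up sample values. The function `missing_days` is an illustrative name, not part of the original question.

```python
from datetime import date, timedelta

def missing_days(records):
    """Return sorted (id, day) pairs where an id present on Jan 1 is absent later that year."""
    present = set(records)
    years = sorted({d.year for _, d in present})
    missing = []
    for year in years:
        start, end = date(year, 1, 1), date(year, 12, 31)
        # Only ids present on the first day of the year are expected every day.
        expected_ids = sorted(i for i, d in present if d == start)
        day = start
        while day <= end:
            for i in expected_ids:
                if (i, day) not in present:
                    missing.append((i, day))
            day += timedelta(days=1)
    return missing

# Hypothetical sample: "A" is present every day of 2003, "B" misses one day.
year_days = [date(2003, 1, 1) + timedelta(days=n) for n in range(365)]
records = [("A", d) for d in year_days]
records += [("B", d) for d in year_days if d != date(2003, 3, 15)]

gaps = missing_days(records)
print(gaps)
```

Because the expected ids are recomputed from each year's first day, an id that joins or leaves between years is not reported as a gap.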

Change value of all strings in column based on condition

白昼怎懂夜的黑 submitted on 2019-12-24 13:39:31
Question: I'm new-ish to R and have a question about data cleaning. I have a column that contains the type of drive a car has: four wheel, all wheel, 2 wheel, etc. The problem is that there is no standardization, so some rows have 4 WHEEL drive, 4wd, 4WD, Four - Wheel - Drive, etc. The first step is easy, which is to uppercase everything, but the step I'm having trouble with is changing each value to a standard, like 4WD, without having to recode each unique drive. Something like For Each value in column, if value
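One way to avoid recoding every unique spelling is to normalize each value first and then match it against a small set of patterns. The sketch below is in Python rather than R, and the pattern table `DRIVE_PATTERNS` and the function `standardize_drive` are hypothetical names chosen for illustration; the sample values are the ones listed in the question.

```python
import re

# Hypothetical table: canonical label plus a pattern for its normalized spellings.
DRIVE_PATTERNS = [
    ("4WD", re.compile(r"^(4|FOUR)WHEEL(DRIVE)?$|^4WD$")),
    ("AWD", re.compile(r"^ALLWHEEL(DRIVE)?$|^AWD$")),
    ("2WD", re.compile(r"^(2|TWO)WHEEL(DRIVE)?$|^2WD$")),
]

def standardize_drive(value):
    # Uppercase and drop everything except letters and digits,
    # so "Four - Wheel - Drive" becomes "FOURWHEELDRIVE".
    key = re.sub(r"[^A-Z0-9]", "", value.upper())
    for label, pattern in DRIVE_PATTERNS:
        if pattern.match(key):
            return label
    return value  # leave unrecognized values untouched for manual review

raw = ["4 WHEEL drive", "4wd", "4WD", "Four - Wheel - Drive", "all wheel", "2 wheel"]
cleaned = [standardize_drive(v) for v in raw]
print(cleaned)
```

Stripping spaces and punctuation before matching keeps the pattern table short, since each pattern only has to cover the normalized form.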

How to find a typo in a data frame and replace it

风流意气都作罢 submitted on 2019-12-24 10:12:59
Question: I have a data frame with names, surnames, birthdays and some random variables. Let's say it looks like this:
   BIRTH NAME  SURNAME    random_value
1  1     Luke  Skywalker  1
2  1     Luke  Skywalker  2
4  2     Leia  Organa     3
5  3     Han   Solo       7
7  1     Ben   Solo       1
8  5     Lando Calrissian 3
9  3     Han   Solo       4
10 3     Ham   Solo       4
11 1     Luke  Wkywalker  9
How can I figure out whether there is a typo in a name or surname, based on BIRTH, NAME and SURNAME, and then replace the typo with the correct name or surname? For example, we see that there
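A possible approach, sketched here in Python with the standard-library `difflib`, is to group rows by BIRTH and treat a rare spelling as a typo when it closely resembles a more frequent full name in the same group. The similarity cutoff `THRESHOLD` and the function name `fix_typos` are assumptions made for this sketch, and the "most frequent spelling wins" rule is a heuristic rather than anything stated in the question.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher

# (BIRTH, NAME, SURNAME) rows from the example data frame.
rows = [
    (1, "Luke", "Skywalker"),
    (1, "Luke", "Skywalker"),
    (2, "Leia", "Organa"),
    (3, "Han", "Solo"),
    (1, "Ben", "Solo"),
    (5, "Lando", "Calrissian"),
    (3, "Han", "Solo"),
    (3, "Ham", "Solo"),
    (1, "Luke", "Wkywalker"),
]

THRESHOLD = 0.8  # assumed similarity cutoff; tune it for real data

def fix_typos(rows, threshold=THRESHOLD):
    # Count how often each full name occurs per birthday.
    counts = defaultdict(Counter)
    for birth, name, surname in rows:
        counts[birth][(name, surname)] += 1

    fixed = []
    for birth, name, surname in rows:
        own = counts[birth][(name, surname)]
        replacement = (name, surname)
        # Try more frequent spellings first; accept the first close match.
        for candidate, n in counts[birth].most_common():
            if n <= own:
                break
            ratio = SequenceMatcher(None, " ".join(candidate), f"{name} {surname}").ratio()
            if ratio >= threshold:
                replacement = candidate
                break
        fixed.append((birth, *replacement))
    return fixed

fixed = fix_typos(rows)
for row in fixed:
    print(row)
```

With this data, "Ham Solo" and "Luke Wkywalker" are replaced, while "Ben Solo" is left alone because it shares a birthday with "Luke Skywalker" but is not similar to it.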

converting object types columns into numeric type using pandas

回眸只為那壹抹淺笑 submitted on 2019-12-24 09:21:01
Question: I am trying to clean data using pandas. When I execute df.dtypes, it shows that the columns are of type object. I wish to convert them into numeric types. I tried various ways of doing so, like: data[['a','b']] = data[['a','b']].apply(pd.to_numeric, errors='ignore') Then: data['c'] = data['c'].infer_objects() But nothing seems to be working. The interpreter does not throw any error, but at the same time it does not perform the desired conversion. Any help will be greatly appreciated.
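One possible explanation, assuming the columns contain at least one string that cannot be parsed as a number, is that `errors='ignore'` returns the input unchanged when parsing fails, so the column silently stays as object. A minimal sketch with `errors='coerce'`, which turns unparseable values into NaN instead, is shown below. The sample frame is hypothetical; only the column names `a`, `b` and `c` come from the question.

```python
import pandas as pd

# Hypothetical object-typed columns; "n/a" cannot be parsed as a number.
data = pd.DataFrame(
    {"a": ["1", "2", "n/a"], "b": ["1.5", "2.5", "3"], "c": ["10", "20", "30"]},
    dtype=object,
)

# errors="coerce" converts what it can and replaces the rest with NaN.
converted = data.apply(pd.to_numeric, errors="coerce")

print(converted.dtypes)
# Rows where coercion produced NaN point at the values that blocked conversion.
print(data.loc[converted["a"].isna(), "a"])
```

Inspecting the rows that became NaN is a quick way to find the offending values before deciding whether to clean or drop them.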

Python - Pandas delete specific rows/columns in excel

試著忘記壹切 submitted on 2019-12-24 09:17:51
Question: I have the following Excel file, and I would like to clean specific rows/columns so that I can further process the file. I have tried this, but I have not managed to remove any of the blank lines; I've only managed to trim the ones containing data. Here, I was trying to save only the data from the third row onward.
xl = pd.ExcelFile("MRD.xlsx")
df = xl.parse("Sheet3")
df2 = df.iloc[3:]
writer4 = pd.ExcelWriter('pandas3.out.no3lines.xlsx', engine='xlsxwriter')
table5 = pd.DataFrame(df2)

Parse Input and structure the output # Keywords from tweets

↘锁芯ラ submitted on 2019-12-24 07:57:44
Question: I am trying to put all the #keywords from the tweetText into a separate column alongside the other columns. I have not mentioned the other columns as they would only create confusion. Rows whose tweetText has no #keywords should be deleted, and for rows that have them, the #keywords should be extracted and put in a different column. I am kind of lost in the part where I need to filter the #keywords from the tweetText.
Input: TweetsID, Tweets (has many more columns)
714602054988275712,I'm at MK Appartaments in
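The filtering step can be sketched in Python with a regular expression. The sample rows below are made up for illustration (the tweet in the excerpt is truncated, so it is not reused), and treating a keyword as `#` followed by word characters is an assumption about what counts as a #keyword.

```python
import re

# Assumed definition of a keyword: '#' followed by letters, digits or underscores.
HASHTAG = re.compile(r"#\w+")

# Hypothetical (TweetsID, Tweets) rows.
rows = [
    ("1", "Coffee time #monday #work"),
    ("2", "No tags in this one"),
    ("3", "Great match tonight #football"),
]

def extract_keywords(rows):
    """Keep only rows with at least one hashtag and add a keywords column."""
    result = []
    for tweet_id, text in rows:
        tags = HASHTAG.findall(text)
        if tags:  # rows without any #keyword are dropped
            result.append((tweet_id, text, " ".join(tags)))
    return result

tagged = extract_keywords(rows)
for row in tagged:
    print(row)
```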

How to split a string into different variables?

a 夏天 submitted on 2019-12-24 07:18:11
Question: I'm trying to analyze a large data set of Airbnb listings. The amenities column lists the amenities that each listing has. For example, {"Wireless Internet","Air conditioning",Kitchen,Heating,"Fire extinguisher",Essentials,Shampoo,Hangers} and {TV,"Wireless Internet","Air conditioning",Kitchen,"Elevator in building",Heating,"Suitable for events","Smoke detector","Carbon monoxide detector","First aid kit",Essentials,Shampoo,"Lock on bedroom door",Hangers,"Hair dryer",Iron,
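Since each amenities value looks like a brace-wrapped, comma-separated list with optional double quotes, one way to split it is to strip the braces and hand the rest to a CSV parser. The sketch below uses Python's standard `csv` module. The first sample string is the complete example from the question; the second is a shortened hypothetical value, and `parse_amenities` is an illustrative name.

```python
import csv

def parse_amenities(raw):
    """Split a '{a,"b c",d}' style string into a list of amenity names."""
    inner = raw.strip().strip("{}")
    if not inner:
        return []
    # csv.reader handles the quoted entries that contain spaces.
    return next(csv.reader([inner]))

listings = [
    '{"Wireless Internet","Air conditioning",Kitchen,Heating,'
    '"Fire extinguisher",Essentials,Shampoo,Hangers}',
    '{TV,"Wireless Internet",Kitchen}',  # hypothetical shortened example
]

parsed = [parse_amenities(s) for s in listings]

# One True/False variable per amenity, per listing.
columns = sorted({a for amenities in parsed for a in amenities})
indicators = [{c: c in amenities for c in columns} for amenities in parsed]

print(parsed[0])
print(indicators[1])
```

Building the column list from the union of all parsed values means no amenity names need to be hard-coded.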

R - Match values from 2 dataframes based on multiple conditions (when the order of lookup IDs is random)

杀马特。学长 韩版系。学妹 submitted on 2019-12-23 12:58:32
Question: Hi, I have two data frames:
df1 = data.frame(PersonId1=c(1,2,3,4,5,6,7,8,9,10,1),PersonId2=c(11,12,13,14,15,16,17,18,19,20,11), Played_together = c(1,0,0,1,1,0,0,0,1,0,1), Event=c(1,1,1,1,2,2,2,2,2,2,2), Utility=c(20,-2,-5,10,30,2,1,.5,50,-1,60))
df2 = data.frame(PersonId1=c(11,15,9,1),PersonId2=c(1,5,19,11), Played_together = c(1,1,1,1), Event=c(1,2,2,2))
Where df1 looks like this:
  PersonId1 PersonId2 Played_together Event Utility
1 1 11 1 1 20.0
2 2 12 0 1 -2.0
3 3 13 0 1 -5.0
4 4 14 1 1 10
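The excerpt is cut off before the question states the exact goal, so the following is only a guess based on the title: look up Utility from df1 for each row of df2, matching on Played_together, Event and the pair of person IDs regardless of their order. The sketch is in Python rather than R and uses an order-insensitive key built with `frozenset`; the helper name `pair_key` is hypothetical.

```python
# Column data copied from the df1 and df2 definitions in the question.
df1 = {
    "PersonId1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1],
    "PersonId2": [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 11],
    "Played_together": [1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1],
    "Event": [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
    "Utility": [20, -2, -5, 10, 30, 2, 1, 0.5, 50, -1, 60],
}
df2 = {
    "PersonId1": [11, 15, 9, 1],
    "PersonId2": [1, 5, 19, 11],
    "Played_together": [1, 1, 1, 1],
    "Event": [1, 2, 2, 2],
}

def pair_key(p1, p2, played, event):
    # frozenset makes (1, 11) and (11, 1) produce the same key.
    return (frozenset((p1, p2)), played, event)

lookup = {
    pair_key(p1, p2, played, event): utility
    for p1, p2, played, event, utility in zip(
        df1["PersonId1"], df1["PersonId2"], df1["Played_together"],
        df1["Event"], df1["Utility"],
    )
}

# None marks df2 rows that have no counterpart in df1.
matched = [
    lookup.get(pair_key(p1, p2, played, event))
    for p1, p2, played, event in zip(
        df2["PersonId1"], df2["PersonId2"], df2["Played_together"], df2["Event"]
    )
]
print(matched)
```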