data-cleaning | 易学教程

Finding and removing some characters in a column of data in Excel

Question: I have copied and pasted some debugging information into an Excel sheet. However, some cells of one column contain "weird" characters, although the column should contain integers only. What would be the easiest way to eliminate such characters using VBA? An example is shown in the list below: 1 **'␁'** <- I'm trying to get rid of the part that I have bolded 2 '␂' 3 '␃' 4 '␂' I want to use the file as a data source in another application. Thanks in advance. Answer 1: Try this (first time

dplyr table reconstructing/data wrangling

Question: I'm trying to create a variable that defines true vs false searches. The original dataset is located here: https://github.com/wikimedia-research/Discovery-Hiring-Analyst-2016/blob/master/events_log.csv.gz The basic scenario is that there are variables that define how many times a user (defined by ID, either session_id or uuid in the original dataset) performs a true search vs a false search, such that a visit is always preceded by a search, but a search does not have to be followed by a visit

How to check if an id comes into data on a particular date that it stays until an exit date

Question: I have a data set that looks something like below. Basically, I am interested in checking that if a particular id is present at the beginning of the year (in this case Jan 1, 2003), it is present every day until the end of the year (Dec 31, 2003), then starting the checking process over again at the start of the next year, as people might change from year to year but should not change within a year. If an id is not present on a certain day, I would like to know which day and which id. I first started

Change value of all strings in column based on condition

Question: New-ish to R, I have a question about data cleaning. I have a column that contains what type of drive a car is: four wheel, all wheel, 2 wheel, etc. The problem is there is no standardization, so some rows have 4 WHEEL drive, 4wd, 4WD, Four - Wheel - Drive, etc. The first step is easy, which is to uppercase everything, but the step I'm having trouble with is changing each value to a standard, like 4WD, without having to recode each unique drive. Something like For Each value in column, if value
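The normalization described in this question can be sketched as follows. The question is about R; this is a Python illustration of the same idea (uppercase, drop separators, then map spelling variants onto one label), not the asker's or any answerer's code. The function name `standardize_drive`, the variant list, and the target labels `4WD`, `2WD`, `AWD` are assumptions made for the example.

```python
import re

# Assumed spelling variants for the leading part of the label.
PREFIX = {"4": "4", "FOUR": "4", "2": "2", "TWO": "2", "ALL": "A", "A": "A"}

# After uppercasing and removing separators, accept e.g. "4WHEELDRIVE",
# "FOURWHEELDRIVE", "4WD", "ALLWHEEL", "2WHEEL".
PATTERN = re.compile(r"^(4|FOUR|2|TWO|ALL|A)(?:WHEEL|W)(?:DRIVE|D)?$")


def standardize_drive(value):
    """Map a free-text drive type to 4WD / 2WD / AWD when it is recognized."""
    compact = re.sub(r"[^A-Z0-9]+", "", value.upper())
    match = PATTERN.match(compact)
    if match is None:
        # Unrecognized values are only trimmed and uppercased.
        return value.strip().upper()
    return PREFIX[match.group(1)] + "WD"


samples = ["4 WHEEL drive", "4wd", "4WD", "Four - Wheel - Drive", "all wheel", "2 wheel"]
for raw in samples:
    print(raw, "->", standardize_drive(raw))
```

Keeping the variants in one pattern and one lookup table means a new spelling is handled by extending the table instead of recoding each unique value by hand.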

How to find a typo in a data frame and replace it

Question: I have a data frame with names, surnames, birthdays and some random variables. Let's say it looks like this:

        BIRTH  NAME   SURNAME     random_value
    1   1      Luke   Skywalker   1
    2   1      Luke   Skywalker   2
    4   2      Leia   Organa      3
    5   3      Han    Solo        7
    7   1      Ben    Solo        1
    8   5      Lando  Calrissian  3
    9   3      Han    Solo        4
    10  3      Ham    Solo        4
    11  1      Luke   Wkywalker   9

How can I figure out whether there is a typo in the name or surname, based on BIRTH, NAME and SURNAME, and then replace the typo with the correct name or surname? For example, we see that there

converting object types columns into numeric type using pandas

Question: I am trying to clean the data using pandas. When I execute df.dtypes it shows that the columns are of type object. I wish to convert them into numeric types. I tried various ways of doing so, like: data[['a','b']] = data[['a','b']].apply(pd.to_numeric, errors ='ignore') Then, data['c'] = data['c'].infer_objects() But nothing seems to be working. The interpreter does not throw any error, but at the same time it does not perform the desired conversion. Any help will be greatly appreciated.
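A possible reason the attempt above appears to do nothing: with `errors='ignore'`, `pd.to_numeric` returns the input unchanged when a value cannot be parsed (and that option is deprecated in recent pandas releases). A minimal sketch using `errors='coerce'` instead is shown below. The sample values and the helper name `to_numeric_columns` are assumptions, since the question does not show the real data.

```python
import pandas as pd

# Hypothetical sample data; the real columns are not shown in the question.
data = pd.DataFrame({
    "a": ["1", "2", "x"],      # "x" cannot be parsed as a number
    "b": ["1.5", "2.5", "3"],
    "c": ["10", "20", "30"],
})


def to_numeric_columns(frame, columns):
    """Return a copy with the given columns converted to numeric dtypes.

    Values that cannot be parsed become NaN instead of blocking the
    conversion of the whole column.
    """
    out = frame.copy()
    for col in columns:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out


converted = to_numeric_columns(data, ["a", "b", "c"])
print(converted.dtypes)
print(converted["a"].isna().sum(), "value(s) in column a could not be parsed")
```

Counting the NaN values after the conversion shows which columns contained unparseable entries, which is usually the next thing to inspect.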

Python - Pandas delete specific rows/columns in excel

Question: I have the following Excel file, and I would like to clean specific rows/columns so that I can further process the file. I have tried this, but I have not managed to remove any of the blank lines; I've only managed to trim those containing data. Here, I was trying to save only the data from the third row on.

    xl = pd.ExcelFile("MRD.xlsx")
    df = xl.parse("Sheet3")
    df2 = df.iloc[3:]
    writer4 = pd.ExcelWriter('pandas3.out.no3lines.xlsx', engine='xlsxwriter')
    table5 = pd.DataFrame(df2)

Parse Input and structure the output # Keywords from tweets

Question: I am trying to put all the #keywords from the tweetText into a separate column along with the other columns. I have not mentioned the other columns, as they would only create confusion. Rows whose tweetText has no #keywords should be deleted; for the rows that have them, the #keywords should be extracted and put in a separate column. I am kind of lost in the part where I need to filter the #keywords from the tweetText. Input: TweetsID, Tweets (has many more columns) 714602054988275712,I'm at MK Appartaments in
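The filtering step described in this question can be sketched with a regular expression. This is a minimal illustration under assumptions: rows are taken to be (id, text) pairs, a keyword is taken to be `#` followed by word characters, and the sample rows and the name `extract_keywords` are made up for the example.

```python
import re

# Assumption: a keyword is "#" followed by letters, digits or underscores.
HASHTAG = re.compile(r"#(\w+)")


def extract_keywords(rows):
    """Drop rows without hashtags; append the list of hashtags to the rest."""
    result = []
    for tweet_id, text in rows:
        tags = HASHTAG.findall(text)
        if tags:  # rows with no #keywords are discarded
            result.append((tweet_id, text, tags))
    return result


# Hypothetical input rows, not taken from the original data set.
rows = [
    ("1", "Coffee time #morning #coffee"),
    ("2", "No keywords here"),
    ("3", "Reading about #pandas"),
]
kept = extract_keywords(rows)
for row in kept:
    print(row)
```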

How to split a string into different variables?

Question: I'm trying to analyze a large data set of listings on Airbnb, and the amenities column lists the amenities that each listing has. For example, {"Wireless Internet","Air conditioning",Kitchen,Heating,"Fire extinguisher",Essentials,Shampoo,Hangers} and {TV,"Wireless Internet","Air conditioning",Kitchen,"Elevator in building",Heating,"Suitable for events","Smoke detector","Carbon monoxide detector","First aid kit",Essentials,Shampoo,"Lock on bedroom door",Hangers,"Hair dryer",Iron,
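Strings in the format shown above (a brace-wrapped, comma-separated list with optional double quotes) can be split with a CSV parser, and the result turned into one 0/1 variable per amenity. The sketch below uses Python's standard `csv` module; the function names `parse_amenities` and `amenity_flags` are hypothetical, and the approach assumes amenity names never contain braces.

```python
import csv


def parse_amenities(raw):
    """Split '{TV,"Wireless Internet",Kitchen}' into a list of names."""
    inner = raw.strip().strip("{}")
    if not inner:
        return []
    # csv.reader handles the quoted entries that contain spaces.
    return next(csv.reader([inner]))


def amenity_flags(raw_values):
    """Return the sorted amenity names and one 0/1 row per listing."""
    parsed = [parse_amenities(raw) for raw in raw_values]
    names = sorted({name for items in parsed for name in items})
    flags = [[int(name in items) for name in names] for items in parsed]
    return names, flags


listings = [
    '{"Wireless Internet","Air conditioning",Kitchen,Heating}',
    '{TV,"Wireless Internet",Kitchen}',
]
names, flags = amenity_flags(listings)
print(names)
for row in flags:
    print(row)
```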

R - Match values from 2 dataframes based on multiple conditions (when the order of lookup IDs is random)

Question: Hi, I have two data frames:

    df1 = data.frame(PersonId1=c(1,2,3,4,5,6,7,8,9,10,1),PersonId2=c(11,12,13,14,15,16,17,18,19,20,11), Played_together = c(1,0,0,1,1,0,0,0,1,0,1), Event=c(1,1,1,1,2,2,2,2,2,2,2), Utility=c(20,-2,-5,10,30,2,1,.5,50,-1,60))
    df2 = data.frame(PersonId1=c(11,15,9,1),PersonId2=c(1,5,19,11), Played_together = c(1,1,1,1), Event=c(1,2,2,2))

Where df1 looks like this:

       PersonId1  PersonId2  Played_together  Event  Utility
    1  1          11         1                1      20.0
    2  2          12         0                1      -2.0
    3  3          13         0                1      -5.0
    4  4          14         1                1      10