data-cleaning

Python Pandas — Forward filling entire rows with value of one previous column

Submitted by ﹥>﹥吖頭↗ on 2019-12-22 08:24:25

Question: I'm new to pandas development. How do I forward-fill a DataFrame with the value contained in one previously seen column? Self-contained example:

    import pandas as pd
    import numpy as np

    O = [1, np.nan, 5, np.nan]
    H = [5, np.nan, 5, np.nan]
    L = [1, np.nan, 2, np.nan]
    C = [5, np.nan, 2, np.nan]
    timestamps = ["2017-07-23 03:13:00", "2017-07-23 03:14:00",
                  "2017-07-23 03:15:00", "2017-07-23 03:16:00"]
    ohlc = {'Open': O, 'High': H, 'Low': L, 'Close': C}
    df = pd.DataFrame(index=timestamps, data=ohlc)
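One way to approach this (a sketch, not necessarily the answer the thread settled on): forward-fill the Close column on its own, then use that series as the fill value for every column, so each all-NaN row inherits the previous row's Close.

```python
import numpy as np
import pandas as pd

O = [1, np.nan, 5, np.nan]
H = [5, np.nan, 5, np.nan]
L = [1, np.nan, 2, np.nan]
C = [5, np.nan, 2, np.nan]
timestamps = ["2017-07-23 03:13:00", "2017-07-23 03:14:00",
              "2017-07-23 03:15:00", "2017-07-23 03:16:00"]
df = pd.DataFrame({'Open': O, 'High': H, 'Low': L, 'Close': C},
                  index=timestamps)

# Forward-fill Close so each NaN row sees the last observed Close,
# then use that series (aligned by index) as the fill value per column.
prev_close = df['Close'].ffill()
filled = df.apply(lambda col: col.fillna(prev_close))
```

After this, the 03:14 row holds 5.0 everywhere (the 03:13 Close) and the 03:16 row holds 2.0 (the 03:15 Close).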

Python Pandas replace multiple columns zero to Nan

Submitted by 僤鯓⒐⒋嵵緔 on 2019-12-20 19:37:11

Question: A list with attributes of persons is loaded into the pandas dataframe df2. For cleanup I want to replace the value zero (0 or '0') with np.nan.

    df2.dtypes
    ID          object
    Name        object
    Weight     float64
    Height     float64
    BootSize    object
    SuitSize    object
    Type        object
    dtype: object

Working code to set zero values to np.nan:

    df2.loc[df2['Weight'] == 0, 'Weight'] = np.nan
    df2.loc[df2['Height'] == 0, 'Height'] = np.nan
    df2.loc[df2['BootSize'] == '0', 'BootSize'] = np.nan
    df2.loc[df2['SuitSize'] == '0', 'SuitSize'] = np.nan
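A compact alternative to the four .loc lines (sketched on made-up data with the same column layout): pass a per-column mapping to DataFrame.replace, which handles numeric 0 and string '0' in one call.

```python
import numpy as np
import pandas as pd

# Invented sample rows; only the dtypes mirror the question.
df2 = pd.DataFrame({
    'Name': ['Ann', 'Bob'],
    'Weight': [60.0, 0.0],
    'Height': [0.0, 1.8],
    'BootSize': ['0', '42'],
    'SuitSize': ['50', '0'],
})

# One call: float columns match 0, object columns match '0'.
df2 = df2.replace({'Weight': 0, 'Height': 0,
                   'BootSize': '0', 'SuitSize': '0'}, np.nan)
```

Scoping the replacement per column keeps an ID like "0" in an untouched column safe, which a blanket `df2.replace(0, np.nan)` would not.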

Turn "2010 Q1" into datetime 2010-03-31

Submitted by 风流意气都作罢 on 2019-12-20 05:13:29

Question: How can I find a smart way to turn Year_Q into a datetime? I tried pd.to_datetime(working_visa_nationality['Year_Q']) but got an error saying the format cannot be recognized. So I fell back on a clumsy workaround:

    working_visa_nationality['Year'] = working_visa_nationality.Year_Q.str.slice(0, 4)
    working_visa_nationality['Quarter'] = working_visa_nationality.Year_Q.str.slice(6, 8)

Now I have a problem: it is true that I can group the data by year, but it is difficult to include the quarter in my line …
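One possible route (sketched on an invented column, not the asker's full frame): parse the strings as quarterly Periods and take the end-of-period timestamp, which for 2010 Q1 is exactly 2010-03-31.

```python
import pandas as pd

year_q = pd.Series(['2010 Q1', '2010 Q2', '2011 Q4'])

# 'Q'-frequency periods understand '2010Q1'-style strings; strip the space first.
periods = pd.PeriodIndex(year_q.str.replace(' ', ''), freq='Q')

# to_timestamp(how='end') lands on the last instant of the quarter;
# normalize() drops the 23:59:59.999999999 time-of-day component.
quarter_end = periods.to_timestamp(how='end').normalize()
```

Keeping the data as a PeriodIndex also makes the groupby easier: grouping on `periods` directly groups by year and quarter at once.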

Avoiding type conflicts with dplyr::case_when

Submitted by 北城以北 on 2019-12-18 01:53:27

Question: I am trying to use dplyr::case_when within dplyr::mutate to create a new variable, setting some values to missing and recoding other values at the same time. However, if I try to set values to NA, I get an error saying that the variable new cannot be created because NAs are logical:

    Error in mutate_impl(.data, dots) :
      Evaluation error: must be type double, not logical.

Is there a way to set values to NA in a non-logical vector in a data frame using this?

    library(dplyr)

    # Create data
    df <- …

How do I clean twitter data in R?

Submitted by 醉酒当歌 on 2019-12-17 22:43:11

Question: I extracted tweets from Twitter using the twitteR package and saved them into a text file. I have carried out the following on the corpus:

    xx <- tm_map(xx, removeNumbers, lazy=TRUE, 'mc.cores=1')
    xx <- tm_map(xx, stripWhitespace, lazy=TRUE, 'mc.cores=1')
    xx <- tm_map(xx, removePunctuation, lazy=TRUE, 'mc.cores=1')
    xx <- tm_map(xx, strip_retweets, lazy=TRUE, 'mc.cores=1')
    xx <- tm_map(xx, removeWords, stopwords("english"), lazy=TRUE, 'mc.cores=1')

(using mc.cores=1 and lazy=TRUE, as otherwise R on macOS runs …
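The thread is about R's tm package, but the cleanup steps themselves (strip retweet prefixes, URLs, mentions, numbers, punctuation, extra whitespace) are language-agnostic. A plain-Python sketch using only the standard library, with a hypothetical clean_tweet helper:

```python
import re

def clean_tweet(text):
    """Rough analogue of the tm_map pipeline above (hypothetical helper)."""
    text = re.sub(r'^RT\s+@\w+:?\s*', '', text)   # strip retweet prefix
    text = re.sub(r'https?://\S+', ' ', text)     # strip URLs
    text = re.sub(r'@\w+', ' ', text)             # strip remaining mentions
    text = re.sub(r'[^A-Za-z\s]', ' ', text)      # strip numbers and punctuation
    return re.sub(r'\s+', ' ', text).strip().lower()

clean_tweet('RT @user: Check this out!! https://t.co/abc123 #wow 2019')
# 'check this out wow'
```

Order matters: URLs must go before the punctuation pass, or `https` survives as a bare word.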

Is there an R function for checking if a specified GeoJSON object(polygon or multi-polygon) contains the specified point?

Submitted by 左心房为你撑大大i on 2019-12-14 01:26:29

Question: I have an array of points:

    { "Sheet1": [
        { "CoM ID": "1040614", "Genus": "Washingtonia", "Year Planted": "1998",
          "Latitude": "-37.81387927", "Longitude": "144.9817733" },
        { "CoM ID": "1663526", "Genus": "Banksia", "Year Planted": "2017",
          "Latitude": "-37.79582801", "Longitude": "144.9160598" },
        { "CoM ID": "1031170", "Genus": "Melaleuca", "Year Planted": "1997",
          "Latitude": "-37.82326441", "Longitude": "144.9305296" } ] }

and also an array of GeoJSON polygons in the same form as shown below:

    { …
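The question asks for an R function, but the check underneath is the standard ray-casting test, sketched here language-agnostically in Python. All names are invented, and this handles only a single simple ring — a GeoJSON MultiPolygon (or a polygon with holes) would need a loop over rings.

```python
def point_in_ring(lon, lat, ring):
    """Ray-casting test: count how many edges a horizontal ray from the
    point crosses; an odd count means the point is inside.

    `ring` is a GeoJSON-style list of [lon, lat] vertex pairs.
    """
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Edge (j -> i) straddles the horizontal line through `lat`?
        if (yi > lat) != (yj > lat):
            # Longitude where the edge crosses that line.
            x_cross = (xj - xi) * (lat - yi) / (yj - yi) + xi
            if lon < x_cross:
                inside = not inside
        j = i
    return inside

# One of the Melbourne trees from the question, against a box around it.
ring = [[144.9, -37.9], [145.0, -37.9], [145.0, -37.7], [144.9, -37.7]]
point_in_ring(144.9817733, -37.81387927, ring)   # True
```

In practice a geometry library (sf in R, shapely in Python) does this, including multipolygons and edge cases; the sketch just shows what such a function computes.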

Efficient String Search and Replace

Submitted by 爱⌒轻易说出口 on 2019-12-13 18:30:05

Question: I am trying to clean about 2 million entries in a database of job titles. Many contain abbreviations that I want to change to a single consistent, more easily searchable form. So far I am simply running through the column with individual mapply(gsub(...)) calls, but I have about 80 changes to make this way, so it takes almost 30 minutes to run. There has to be a better way. I'm new to string searching; I found the *$ trick, which helped. Is there a way to do more …
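Whatever the accepted answer was, the usual fix for this shape of problem is to collapse the ~80 substitutions into one pass: build a single alternation pattern and pick the replacement per match, instead of scanning the column 80 times. Sketched here in Python with an invented abbreviation map:

```python
import re

# Hypothetical abbreviation map; the real one would hold ~80 entries.
abbrev = {'Sr.': 'Senior', 'Jr.': 'Junior', 'Mgr': 'Manager', 'Asst': 'Assistant'}

# Longest keys first so a short key can't pre-empt a longer overlapping one.
pattern = re.compile('|'.join(re.escape(k)
                              for k in sorted(abbrev, key=len, reverse=True)))

def normalize_title(title):
    # One scan over the string; the callback selects the replacement per match.
    return pattern.sub(lambda m: abbrev[m.group(0)], title)

normalize_title('Sr. Mgr, Operations')   # 'Senior Manager, Operations'
```

The same idea exists natively in R as `stringr::str_replace_all(x, replacements_vector)` and in pandas as a dict passed to `Series.replace(..., regex=True)`.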

R - Creating New Column Based off of a Partial String

Submitted by 折月煮酒 on 2019-12-13 09:49:14

Question: I have a large dataset (dataset "A") with a column Description containing values along the lines of "1952 Rolls Royce Silver Wraith" or "1966 Holden". I also have a separate dataset (dataset "B") with a list of every car brand I need (e.g. "Holden", "Rolls Royce", "Porsche"). How can I create a new column in dataset "A" that matches the partial strings in Description to the correct car brand? (This column would hold only the correct car brand with the appropriate matching …
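The question is posed in R, but the common pattern is the same in either language: build one alternation from dataset "B" and extract the first matching brand from Description. A pandas sketch with invented sample rows:

```python
import re
import pandas as pd

df_a = pd.DataFrame({'Description': ['1952 Rolls Royce Silver Wraith',
                                     '1966 Holden',
                                     '1973 Porsche 911']})
brands = ['Holden', 'Rolls Royce', 'Porsche']   # stand-in for dataset "B"

# Longest names first so 'Rolls Royce' wins over any shorter overlap.
pattern = '(' + '|'.join(re.escape(b)
                         for b in sorted(brands, key=len, reverse=True)) + ')'
df_a['Brand'] = df_a['Description'].str.extract(pattern, expand=False)
```

In R the equivalent is `stringr::str_extract(A$Description, paste(B$Brand, collapse = "|"))`. Rows with no matching brand come back as NaN/NA, which makes the misses easy to audit.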

How to locate a structured region of data inside of a not structured data frame in R?

Submitted by 余生长醉 on 2019-12-13 08:45:04

Question: I have a certain kind of data frame that contains a subset of interest. The problem is that this subset is not consistent across the different data frames. Nonetheless, at a more abstract level it follows a general structure: a rectangular region inside the data frame.

    example1 <- data.frame(
      x = c("name", "129-2", NA, NA, "acc", 2, 3, 4, NA, NA),
      y = c(NA, NA, NA, NA, "deb", 3, 2, 5, NA, NA),
      z = c(NA, NA, NA, NA, "asset", 1, 1, 2, NA, NA))

    print(example1)
           x     y     z
    1   name  <NA>  <NA>
    2  129-2  <NA …
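The question is posed in R, but the shape of one possible answer is easy to sketch in pandas: build a not-NA mask and keep only the rows where every column is populated, which on data laid out like example1 isolates exactly the rectangular block.

```python
import numpy as np
import pandas as pd

# Same layout as example1 in the question.
df = pd.DataFrame({
    'x': ['name', '129-2', np.nan, np.nan, 'acc', 2, 3, 4, np.nan, np.nan],
    'y': [np.nan] * 4 + ['deb', 3, 2, 5] + [np.nan] * 2,
    'z': [np.nan] * 4 + ['asset', 1, 1, 2] + [np.nan] * 2,
})

# Rows where every column is filled form the rectangular region:
# the 'acc'/'deb'/'asset' header row plus its three data rows.
block = df[df.notna().all(axis=1)]
```

The R equivalent of the mask is `example1[complete.cases(example1), ]`. If the block can also be narrower than the frame, apply the same idea along `axis=0` to trim all-NA columns first.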

tool to extract data structures from unclean data

Submitted by 坚强是说给别人听的谎言 on 2019-12-13 05:43:44

Question: I have unstructured, generally unclean data in a database field. There are common structures that are consistent in the data, namely:

    field:    name:value
    fieldset: nombre <FieldSet> field, . . . field(n)
    table:    nombre <table> head(1)... head(n) val(1)... val(n) . . .

I was wondering whether there is a tool (preferably in Java) that could learn/understand these data structures, parse the file, and convert it to a Map or object on which I could run validation checks. I am aware of Antlr but …