missing-data

Exporting ints with missing values to csv in Pandas

て烟熏妆下的殇ゞ submitted on 2019-12-04 23:11:46
When saving a pandas DataFrame to CSV, some integers are getting converted to floats. It happens where a column of integers has missing values (np.nan). Is there a simple way to avoid it? (Especially in an automatic way; I often deal with many columns of various data types.) For example:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 2], [3, np.nan], [5, 6]],
                  columns=["a", "b"],
                  index=["i_1", "i_2", "i_3"])
df.to_csv("file.csv")
```

yields

```
,a,b
i_1,1,2.0
i_2,3,
i_3,5,6.0
```

What I would like to get is

```
,a,b
i_1,1,2
i_2,3,
i_3,5,6
```

EDIT: I am fully aware of "Support for integer NA" in the pandas caveats documentation.
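A minimal sketch of one possible approach, assuming pandas >= 0.24: casting to the nullable "Int64" extension dtype lets an integer column hold missing values, so to_csv writes whole numbers. (Another common workaround is to_csv(float_format="%g").)

```python
import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 2], [3, np.nan], [5, 6]],
                  columns=["a", "b"], index=["i_1", "i_2", "i_3"])

# "Int64" (capital I) is pandas' nullable integer dtype; NaN becomes pd.NA
# and serializes as an empty field, while 2.0 / 6.0 round-trip to 2 / 6.
csv_text = df.astype("Int64").to_csv()
print(csv_text)
```

This converts every column, so with mixed dtypes you would restrict the astype call to the integer-like columns.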

Highcharts: Displaying Linechart with missing datapoints

本小妞迷上赌 submitted on 2019-12-04 20:17:25
I am calculating the average value of properties for each week of the year, and I want to display this information in a line chart (the x-axis is the week of the year, the y-axis the average value, and the different lines represent different properties). But for any given property I do not necessarily have a datapoint for every week of the year. Where such a datapoint is missing, I want the line for that property to interpolate between the datapoints I do have. Has anyone else run into a similar issue? Highcharts does not really do interpolation. Sure, if your series has a missing point it will draw the line

Row-by-row fillna with respect to a specific column?

旧巷老猫 submitted on 2019-12-04 16:55:37
I have the following pandas DataFrame and I would like to fill the NaNs in columns A-C in a row-wise fashion with the values from column D. Is there an explicit way to do this where I can define that all the NaNs should depend row-wise on the values in column D? I couldn't find a way to do this explicitly with fillna(). Note that there are additional columns E-Z which have their own NaNs, may have other rules for filling them in, and should be left untouched.

```
     A    B    C    D  E
   158  158  158  177  ...
   158  158  158  177  ...
   NaN  NaN  NaN  177  ...
   158  158  158  177  ...
   NaN  NaN  NaN  177  ...
```

Would like to have this for
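A sketch of one way to do this, assuming the layout above (the sample values are reconstructed from the excerpt): DataFrame.fillna does not broadcast a Series row-wise across a column subset, but applying Series.fillna per column aligns column D on the row index, and only the listed columns are touched.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "A": [158, 158, np.nan, 158, np.nan],
    "B": [158, 158, np.nan, 158, np.nan],
    "C": [158, 158, np.nan, 158, np.nan],
    "D": [177, 177, 177, 177, 177],
    "E": [1.0, np.nan, 3.0, 4.0, 5.0],  # must stay untouched
})

cols = ["A", "B", "C"]
# Series.fillna(df["D"]) aligns on the shared row index, so each NaN in
# A-C is replaced by that row's D value; column E is never visited.
df[cols] = df[cols].apply(lambda col: col.fillna(df["D"]))
print(df)
```

Because the fill is restricted to an explicit column list, columns E-Z can later get their own, different fill rules.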

Multi-Indexed fillna in Pandas

夙愿已清 submitted on 2019-12-04 15:15:59
I have a multi-indexed DataFrame and I'm looking to backfill missing values within a group. The DataFrame I have currently looks like this:

```python
df = pd.DataFrame({
    'group': ['group_a'] * 7 + ['group_b'] * 3 + ['group_c'] * 2,
    'Date': ["2013-06-11", "2013-07-02", "2013-07-09", "2013-07-30",
             "2013-08-06", "2013-09-03", "2013-10-01", "2013-07-09",
             "2013-08-06", "2013-09-03", "2013-07-09", "2013-09-03"],
    'Value': [np.nan, np.nan, np.nan, 9, 4, 40,
              18, np.nan, np.nan, 5, np.nan, 2]})
df.Date = df['Date'].apply(lambda x: pd.to_datetime(x).date())
df = df.set_index(['group', 'Date'])
```

I'm trying to get a
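A sketch of one likely approach for the goal as stated (backfill within a group, without values bleeding across group boundaries): group on the outer index level and backfill per group.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'group': ['group_a'] * 7 + ['group_b'] * 3 + ['group_c'] * 2,
    'Date': ["2013-06-11", "2013-07-02", "2013-07-09", "2013-07-30",
             "2013-08-06", "2013-09-03", "2013-10-01", "2013-07-09",
             "2013-08-06", "2013-09-03", "2013-07-09", "2013-09-03"],
    'Value': [np.nan, np.nan, np.nan, 9, 4, 40,
              18, np.nan, np.nan, 5, np.nan, 2]})
df['Date'] = pd.to_datetime(df['Date']).dt.date
df = df.set_index(['group', 'Date'])

# bfill runs separately inside each group, so group_a's leading NaNs are
# filled from its own first non-missing value (9), not from group_b.
filled = df.groupby(level='group').bfill()
print(filled)
```

Swapping bfill for ffill (or Series.interpolate inside a groupby-apply) gives the forward-fill or interpolated variants of the same idea.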

Replace mean or mode for missing values in R

ε祈祈猫儿з submitted on 2019-12-04 14:39:52
I have a large database made up of mixed data types (numeric, character, factor, ordinal factor) with missing values, and I am trying to write a for loop to substitute the missing values using either the mean of the respective column if it is numeric, or the mode if it is character/factor. This is what I have so far:

```r
# fake array:
age <- c(5, 8, 10, 12, NA)
a <- factor(c("aa", "bb", NA, "cc", "cc"))
b <- c("banana", "apple", "pear", "grape", NA)
df_test <- data.frame(age = age, a = a, b = b)
df_test$b <- as.character(df_test$b)

for (var in 1:ncol(df_test)) {
  if (class(df_test[, var]) == "numeric") {
    df_test[is.na
```

Predict.glm not predicting missing values in response

核能气质少年 submitted on 2019-12-04 11:23:03
Question: For some reason, when I specify glms (and lm's too, it turns out), R is not predicting missing values of the data. Here is an example:

```r
y = round(runif(50))
y = c(y, rep(NA, 50))
x = rnorm(100)
m = glm(y ~ x, family = binomial(link = "logit"))
p = predict(m, na.action = na.pass)
length(p)

y = round(runif(50))
y = c(y, rep(NA, 50))
x = rnorm(100)
m = lm(y ~ x)
p = predict(m)
length(p)
```

The length of p should be 100, but it's 50. The weird thing is that I have other predicts in the same script that do predict

Specify different types of missing values (NAs)

倾然丶 夕夏残阳落幕 submitted on 2019-12-04 10:24:39
Question: I'm interested in specifying types of missing values. My data have different kinds of missing, and I am trying to code these values as missing in R, but I am looking for a solution where I can still distinguish between them. Say I have some data that look like this:

```r
set.seed(667)
df <- data.frame(
  a = sample(c("Don't know/Not sure", "Unknown", "Refused",
               "Blue", "Red", "Green"), 20, rep = TRUE),
  b = sample(c(1, 2, 3, 77, 88, 99), 10, rep = TRUE),
  f = round(rnorm(n = 10, mean = .90, sd = .08), digits =
```

Replace Nulls in DataFrame with Max in Row

北城余情 submitted on 2019-12-04 07:08:19
Is there a way (more efficient than a for loop) to replace all the nulls in a pandas DataFrame with the max value of their respective row?

I guess this is what you are looking for:

```python
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 0], 'b': [3, 0, 10], 'c': [0, 5, 34]})
```

```
   a   b   c
0  1   3   0
1  2   0   5
2  0  10  34
```

You can use apply to iterate over all rows and replace 0 with the maximal number of the row via the replace function, which gives you the expected output:

```python
df.apply(lambda row: row.replace(0, max(row)), axis=1)
```

```
    a   b   c
0   1   3   3
1   2   5   5
2  34  10  34
```

If you want to replace NaN - which seemed to
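Since the question asks about nulls rather than zeros, here is a sketch of the NaN variant of the same apply pattern (the sample frame with np.nan is my own illustration, not from the original answer):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 2, np.nan],
                   'b': [3, np.nan, 10],
                   'c': [np.nan, 5, 34]})

# Series.fillna(row.max()) fills each row's NaNs with that row's maximum;
# Series.max skips NaN by default, so the max is taken over present values.
filled = df.apply(lambda row: row.fillna(row.max()), axis=1)
print(filled)
```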

missing value when calculating running medians?

你说的曾经没有我的故事 submitted on 2019-12-04 03:54:59
Question: I would like to smooth out a time series to avoid spurious jitter/error; in other words, I want to do some very local, robust smoothing. I came across rollmean and rollmedian in the zoo package but ran into a problem because my vector had an NA in it. I then read somewhere that those zoo functions use runmed, and therein lies the problem.

Examples:

```r
median(c(1, 1, 1, 2, 2, 2, 7, NA, 1, 2, 3, 10, 10, 10), na.rm = TRUE)
runmed(c(1, 1, 1, 2, 2, 2, 7, NA, 1, 2, 3, 10, 10, 10), k = 3)
```

The first line returns 2, but would have

Identifying rows in data.frame with only NA values in R

我们两清 submitted on 2019-12-04 02:11:55
I have a data.frame with 15,000 observations of 34 ordinal variables containing NAs. I am performing clustering for a market segmentation study and need the rows containing only NAs removed. After taking out the userID, I got an error message saying to omit 2099 rows with only NAs before clustering. I found a link for removing rows with all NA values, but I need to identify which of the 2099 rows have all NA values. Here is the link for that discussion: Remove Rows with NAs in data.frame. Here's a sample of the first five observations from six variables:

> head(Store2df, n=5)