data-processing

Handling missing/incomplete data in R: is there a function to mask but not remove NAs?

断了今生、忘了曾经 submitted on 2019-11-28 18:40:57
As you would expect from a DSL aimed at data analysis, R handles missing/incomplete data very well. For instance, many R functions have an na.rm flag that, when set to TRUE, removes the NAs before computing the result:
    v = mean(c(5, NA, 6, 12, NA, 87, 9, NA, 43, 67), na.rm=TRUE)
which averages only the non-missing values (5, 6, 12, 87, 9, 43, 67). But if you want to deal with NAs before the function call, you need to do something like this:
to remove each NA from a vector:
    vx = vx[!is.na(vx)]
to replace each NA in a vector with 0:
    ifelse(is.na(vx), 0, vx)
to remove every row that contains an NA from a data frame:
    dfx = dfx[complete
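The same three operations (drop, replace, and mask) can be sketched outside R as well. The following is a minimal Python illustration, not R semantics: `None` stands in for NA, and every function name is hypothetical.

```python
from statistics import mean

# None stands in for R's NA; this is an illustrative convention only.
values = [5, None, 6, 12, None, 87, 9, None, 43, 67]

def drop_missing(xs):
    """Return only the non-missing values, like vx[!is.na(vx)] in R."""
    return [x for x in xs if x is not None]

def fill_missing(xs, fill=0):
    """Replace each missing value, like ifelse(is.na(vx), fill, vx) in R."""
    return [fill if x is None else x for x in xs]

def missing_mask(xs):
    """Mark missing positions without removing them, like is.na(vx) in R."""
    return [x is None for x in xs]

def mean_skipping_missing(xs):
    """Average the non-missing values, like mean(vx, na.rm=TRUE) in R."""
    return mean(drop_missing(xs))

print(drop_missing(values))   # → [5, 6, 12, 87, 9, 43, 67]
print(fill_missing(values))   # → [5, 0, 6, 12, 0, 87, 9, 0, 43, 67]
```

Keeping the mask separate from the data leaves the original vector untouched, which is the "mask but not remove" behaviour the title asks about.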

Large-scale data processing: HBase vs. Cassandra [closed]

别来无恙 submitted on 2019-11-28 13:17:31
Question: I have nearly settled on Cassandra after researching large-scale data storage solutions. But it is generally said that HBase is better

How to read a 4GB file on a 32-bit system

生来就可爱ヽ(ⅴ<●) submitted on 2019-11-28 05:03:21
Question: I have various files; let's assume one of them is a >4GB data file. I want to read that file line by line and process each line. One of my restrictions is that the software has to run on 32-bit MS Windows, or on 64-bit with a small amount of RAM (min 4GB). You can also assume that processing the lines isn't the bottleneck. In my current solution I read the file with ifstream and copy it into a string. Here is a snippet showing how it looks:
    std::ifstream file(filename_xml.c_str());
    uintmax_t m
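The requirement above (read line by line without holding the whole file in memory) can be sketched as a streaming loop. This is a minimal Python illustration rather than the poster's C++ code; `process_lines` and the throwaway demo file are hypothetical.

```python
import os
import tempfile

def process_lines(path, handle_line):
    """Stream a file one line at a time and return the number of lines seen.

    Only the current line is held in memory, so the total file size does not
    need to fit in RAM.
    """
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:  # the file object yields one line per iteration
            handle_line(line.rstrip("\n"))
            count += 1
    return count

# Small demonstration on a throwaway file (it stands in for the >4GB input).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("first\nsecond\nthird\n")

lengths = []
n = process_lines(tmp.name, lambda s: lengths.append(len(s)))
os.remove(tmp.name)
print(n, lengths)  # → 3 [5, 6, 5]
```

The design choice is to pass each line to a callback instead of accumulating lines in a container, so memory use stays bounded by the longest line.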

Algorithm for grouping anagram words

风格不统一 submitted on 2019-11-28 04:24:42
Given a set of words, we need to find the anagrams and display each group separately, using the best algorithm.
input: man car kile arc none like
output:
    man
    car arc
    kile like
    none
The best solution I am developing now is based on a hash table, but I am thinking about an equation to convert each anagram word into an integer value. Example: man => 'm'+'a'+'n', but this will not give unique values. Any suggestions? See the following code in C#:
    string line = Console.ReadLine();
    string[] words = line.Split(' ');
    int[] numbers = GetUniqueInts(words);
    for (int i = 0; i < words.Length; i++) {
        if (table.ContainsKey
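One common way to get a key that is unique per anagram group is to sort the letters of each word and use the result as the hash-table key. A plain letter sum is not unique: for example, "ad" and "bc" both sum to 197 in ASCII. The sketch below is a minimal Python illustration of the sorted-letters idea, not the poster's C# code; `group_anagrams` is a hypothetical name.

```python
from collections import defaultdict

def group_anagrams(words):
    """Group words that are anagrams of each other.

    Two words are anagrams exactly when their sorted letters are equal,
    so the sorted letters serve as a collision-free dictionary key.
    """
    groups = defaultdict(list)
    for word in words:
        key = "".join(sorted(word))
        groups[key].append(word)
    # Dictionaries keep insertion order, so groups appear in order of
    # the first word seen for each key.
    return list(groups.values())

print(group_anagrams("man car kile arc none like".split()))
# → [['man'], ['car', 'arc'], ['kile', 'like'], ['none']]
```

Sorting costs O(k log k) per word of length k; counting letters into a fixed-size tuple would be an alternative key with linear cost per word.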

Hibernate out of memory exception while processing large collection of elements

让人想犯罪 __ submitted on 2019-11-28 01:03:54
Question: I am trying to process a collection of heavyweight elements (images). The size of the collection varies between 8000 and 50000 entries. But for some reason, after processing 1800-1900 entries my program fails with java.lang.OutOfMemoryError: Java heap space. As I understand it, each time I call session.getTransaction().commit() the program should free heap memory, but that never seems to happen. What am I doing wrong? Here is the code:
    private static void loadImages( LoadStrategy loadStrategy ) throws

Ways to read only select columns from a file into R? (A happy medium between `read.table` and `scan`?) [duplicate]

流过昼夜 submitted on 2019-11-27 17:23:48
This question already has an answer here: Only read selected columns (3 answers).
I have some very big delimited data files, and I want to process only certain columns in R without spending the time and memory to create a data.frame for the whole file. The only options I know of are read.table, which is very wasteful when I only want a couple of columns, and scan, which seems too low-level for what I want. Is there a better option, either in pure R or perhaps by calling out to some other shell script to do the column extraction and then using scan or read.table on its output? (Which leads to the
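The "extract only the wanted columns while streaming" idea from the question can be sketched as follows. This is a minimal Python illustration, not an R solution; `read_columns`, the column names, and the sample rows are all hypothetical.

```python
import csv
import io

def read_columns(lines, wanted, delimiter=","):
    """Collect only the named columns from a delimited text stream.

    Rows are read one at a time, and fields outside `wanted` are dropped
    immediately instead of being stored.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)
    indexes = [header.index(name) for name in wanted]
    columns = {name: [] for name in wanted}
    for row in reader:
        for name, i in zip(wanted, indexes):
            columns[name].append(row[i])
    return columns

# Invented sample data standing in for a very large file.
sample = io.StringIO("id,age,city\n1,34,Oslo\n2,51,Lima\n")
print(read_columns(sample, ["id", "city"]))
# → {'id': ['1', '2'], 'city': ['Oslo', 'Lima']}
```

Values come back as strings; converting types per column is left out to keep the sketch short.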

Handling missing/incomplete data in R: is there a function to mask but not remove NAs?

杀马特。学长 韩版系。学妹 submitted on 2019-11-27 11:35:44
Question: As you would expect from a DSL aimed at data analysis, R handles missing/incomplete data very well. For instance, many R functions have an na.rm flag that, when set to TRUE, removes the NAs before computing the result:
    v = mean(c(5, NA, 6, 12, NA, 87, 9, NA, 43, 67), na.rm=TRUE)
which averages only the non-missing values (5, 6, 12, 87, 9, 43, 67). But if you want to deal with NAs before the function call, you need to do something like this:
to remove each NA from a vector:
    vx = vx[!is.na(vx)]
to replace each NA in a vector with 0:

Algorithm for grouping anagram words

左心房为你撑大大i submitted on 2019-11-27 05:20:27
Question: Given a set of words, we need to find the anagrams and display each group separately, using the best algorithm.
input: man car kile arc none like
output:
    man
    car arc
    kile like
    none
The best solution I am developing now is based on a hash table, but I am thinking about an equation to convert each anagram word into an integer value. Example: man => 'm'+'a'+'n', but this will not give unique values. Any suggestions? See the following code in C#:
    string line = Console.ReadLine();
    string[] words = line.Split(' ');

Replacing numbers within a range with a factor

这一生的挚爱 submitted on 2019-11-26 19:11:54
Given a data frame column that is a series of integers (age), I want to convert ranges of integers into ordinal variables. My current code doesn't work; how do I do this?
    df <- read.table("http://dl.dropbox.com/u/822467/df.csv", header = TRUE, sep = ",")
    df[(df >= 0) & (df <= 14)] <- "Age1"
    df[(df >= 15) & (df <= 44)] <- "Age2"
    df[(df >= 45) & (df <= 64)] <- "Age3"
    df[(df > 64)] <- "Age4"
    table(df)
Use cut to do this in one step:
    dfc <- cut(df$x, breaks=c(0, 15, 45, 56, Inf))
    str(dfc)
    Factor w/ 4 levels "(0,15]","(15,45]",..: 3 4 3 2 2 4 2 2 4 4 ...
Once you are satisfied that the breaks are
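The range-to-label mapping described in the question can be sketched with a sorted list of lower bounds and a binary search. This is a minimal Python illustration that uses the age ranges from the question (0-14, 15-44, 45-64, over 64), not the break points from the answer's cut call; `age_band`, `LOWER_BOUNDS`, and `LABELS` are hypothetical names.

```python
from bisect import bisect_right

# Lower bound of each band, taken from the ranges stated in the question.
LOWER_BOUNDS = [0, 15, 45, 65]
LABELS = ["Age1", "Age2", "Age3", "Age4"]

def age_band(age):
    """Return the label of the band whose lower bound is the largest one <= age."""
    if age < LOWER_BOUNDS[0]:
        raise ValueError("age must be non-negative")
    return LABELS[bisect_right(LOWER_BOUNDS, age) - 1]

ages = [3, 14, 15, 44, 45, 64, 65, 90]
print([age_band(a) for a in ages])
# → ['Age1', 'Age1', 'Age2', 'Age2', 'Age3', 'Age3', 'Age4', 'Age4']
```

Here each band is closed on the left, so 15 falls in Age2. R's cut defaults to intervals closed on the right, which is why its levels print as "(0,15]".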

Replacing numbers within a range with a factor

大城市里の小女人 submitted on 2019-11-26 06:49:32
Question: Given a data frame column that is a series of integers (age), I want to convert ranges of integers into ordinal variables. My current code doesn't work; how do I do this?
    df <- read.table("http://dl.dropbox.com/u/822467/df.csv", header = TRUE, sep = ",")
    df[(df >= 0) & (df <= 14)] <- "Age1"
    df[(df >= 15) & (df <= 44)] <- "Age2"
    df[(df >= 45) & (df <= 64)] <- "Age3"
    df[(df > 64)] <- "Age4"
    table(df)
Answer 1: Use cut to do this in one step:
    dfc <- cut(df$x, breaks=c(0, 15, 45, 56, Inf)