data-processing | 易学教程

Handling missing/incomplete data in R--is there function to mask but not remove NAs?

As you would expect from a DSL aimed at data analysis, R handles missing or incomplete data very well. For instance, many R functions have an na.rm flag that, when set to TRUE, removes the NAs before computing:

    > v = mean(c(5, NA, 6, 12, NA, 87, 9, NA, 43, 67), na.rm=TRUE)
    > v
    [1] 32.71429

But if you want to deal with NAs before the function call, you need to do something like this:

to remove each NA from a vector:

    vx = vx[!is.na(vx)]

to replace each NA in a vector with 0:

    ifelse(is.na(vx), 0, vx)

to remove every row that contains an NA from a data frame:

    dfx = dfx[complete

Large scale data processing Hbase vs Cassandra [closed]

Question: (Closed 7 years ago.) After researching large-scale data storage solutions, I have nearly settled on Cassandra. But it's generally said that HBase is better

How to read 4GB file on 32bit system

Question: I have several different files; let's assume one of them is a data file larger than 4 GB. I want to read that file line by line and process each line. One of my restrictions is that the software has to run on 32-bit MS Windows, or on 64-bit with a small amount of RAM (min 4 GB). You can also assume that processing these lines isn't the bottleneck. In the current solution I read the file with ifstream and copy it to a string. Here is a snippet showing how it looks:

    std::ifstream file(filename_xml.c_str());
    uintmax_t m
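The excerpt's own C++ code is cut off, so the following is only a language-neutral illustration, written in Python, of the streaming idea the question is circling: iterate over the file one line at a time so that memory use depends on the longest line, not on the file size. The function name `process_lines`, the `handle_line` callback, and the 1 MiB buffer size are assumptions made for this sketch and do not come from the question.

```python
def process_lines(path, handle_line, buffer_size=1 << 20):
    """Stream a text file line by line and return the number of lines seen.

    Only one line (plus an I/O buffer of roughly buffer_size bytes) is held
    in memory at a time, so the file may be larger than the available RAM.
    """
    count = 0
    # errors="replace" keeps the loop going if a byte sequence is not valid UTF-8.
    with open(path, "r", encoding="utf-8", errors="replace",
              buffering=buffer_size) as f:
        for line in f:  # the file object yields one line per iteration
            handle_line(line.rstrip("\n"))
            count += 1
    return count
```

The same pattern in C++ would be a `std::getline` loop over the stream instead of copying the whole file into one string.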

Algorithm for grouping anagram words

Given a set of words, we need to find the words that are anagrams of each other and display each group on its own line, using the best algorithm.

    input:  man car kile arc none like
    output:
        man
        car arc
        kile like
        none

The best solution I have so far is based on a hash table, but I am thinking about an equation that converts a word into an integer value shared by all of its anagrams. Example: man => 'm'+'a'+'n', but this will not give unique values. Any suggestions? See the following code in C#:

    string line = Console.ReadLine();
    string []words=line.Split(' ');
    int[] numbers = GetUniqueInts(words);
    for (int i = 0; i < words.Length; i++)
    {
        if (table.ContainsKey
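One common way to get a key that is identical for all anagrams of a word, without the collisions of a character sum (for example, "ad" and "bc" both sum to 197), is to sort the word's letters and use the result as the hash-table key. The sketch below is in Python, not the asker's C#, and the function name `group_anagrams` is an assumption for illustration; it is not the truncated `GetUniqueInts` approach from the question.

```python
from collections import defaultdict


def group_anagrams(words):
    """Group words that are anagrams of each other.

    The key for each word is its letters in sorted order, so every anagram
    of a word maps to the same key. Groups are returned in the order in
    which their first member appears in the input.
    """
    groups = defaultdict(list)
    for word in words:
        key = "".join(sorted(word))  # "car" and "arc" both become "acr"
        groups[key].append(word)
    return list(groups.values())


if __name__ == "__main__":
    for group in group_anagrams("man car kile arc none like".split()):
        print(" ".join(group))
```

Sorting costs O(k log k) per word of length k; a letter-count tuple would serve as a key in the same way if the words are long.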

Hibernate out of memory exception while processing large collection of elements

Question: I am trying to process a collection of heavyweight elements (images). The size of the collection varies between 8000 and 50000 entries. But for some reason, after processing 1800-1900 entries my program fails with java.lang.OutOfMemoryError: Java heap space. As I understand it, each time I call session.getTransaction().commit() the program should free heap memory, but it looks like that never happens. What am I doing wrong? Here is the code:

    private static void loadImages( LoadStrategy loadStrategy ) throws

Ways to read only select columns from a file into R? (A happy medium between `read.table` and `scan`?) [duplicate]

This question already has an answer here: Only read selected columns (3 answers).

I have some very big delimited data files and I want to process only certain columns in R without taking the time and memory to create a data.frame for the whole file. The only options I know of are read.table, which is very wasteful when I only want a couple of columns, and scan, which seems too low-level for what I want. Is there a better option, either in pure R or perhaps by calling out to some other shell script to do the column extraction and then using scan or read.table on its output? (Which leads to the
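The "extract the columns first, then load the result" idea mentioned above can be illustrated with a small streaming filter. This is a Python sketch, not an R solution, and it assumes the file has a header row; the function name `read_columns` and its parameters are hypothetical names chosen for this example.

```python
import csv


def read_columns(path, wanted, delimiter=","):
    """Yield one tuple per data row, holding only the wanted columns.

    The file is read row by row, so the unwanted columns are never kept
    in memory. Column names are looked up in the header row.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter=delimiter)
        header = next(reader)
        indices = [header.index(name) for name in wanted]
        for row in reader:
            yield tuple(row[i] for i in indices)
```

The tuples come back in the order the column names were requested, which need not be the order in the file.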

Replacing numbers within a range with a factor

Given a data frame column which is a series of integers (age), I want to convert ranges of integers into ordinal variables. My current code doesn't work; how do I do this?

    df <- read.table("http://dl.dropbox.com/u/822467/df.csv", header = TRUE, sep = ",")
    df[(df >= 0) & (df <= 14)] <- "Age1"
    df[(df >= 15) & (df <= 44)] <- "Age2"
    df[(df >= 45) & (df <= 64)] <- "Age3"
    df[(df > 64)] <- "Age4"
    table(df)

Answer 1: Use cut to do this in one step:

    dfc <- cut(df$x, breaks=c(0, 15, 45, 56, Inf))
    str(dfc)
    Factor w/ 4 levels "(0,15]","(15,45]",..: 3 4 3 2 2 4 2 2 4 4 ...

Once you are satisfied that the breaks are
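The binning idea behind cut can be sketched outside R as a lookup against sorted break points. The Python example below follows the ranges stated in the question (0-14, 15-44, 45-64, over 64), which are closed on the left; note that R's cut uses right-closed intervals by default, so the answer's breaks label boundary ages differently. The names `LOWER_BOUNDS`, `LABELS`, and `age_group` are assumptions made for this sketch.

```python
from bisect import bisect_right

# Smallest age belonging to Age2, Age3 and Age4 respectively (from the question).
LOWER_BOUNDS = [15, 45, 65]
LABELS = ["Age1", "Age2", "Age3", "Age4"]


def age_group(age):
    """Map a non-negative integer age to its ordinal label."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # bisect_right counts how many lower bounds are <= age,
    # which is exactly the index of the matching label.
    return LABELS[bisect_right(LOWER_BOUNDS, age)]


if __name__ == "__main__":
    print([age_group(a) for a in (0, 14, 15, 44, 45, 64, 65)])
```

Because the labels are stored in order, comparing their list positions preserves the ordinal relationship between groups.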
