I run the following code:
traindata = trainData.read_csv('train.tsv', delimiter='\t')
which calls this function:

def read_csv(self, filename, delimiter=','):
    with open(filename) as f:
        rows = [line.rstrip('\n').split(delimiter) for line in f]
    self.data = np.array(rows)
NumPy is creating an array of huge fixed-width strings, each with a length set to the maximum length of any string in the whole array, and you are probably running out of RAM in the middle of this massive memory allocation.
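You can see this fixed-width behaviour on toy data (a quick sketch; the strings below are made up, not from your train.tsv):

import numpy as np

rows = [["short", "a much much longer string value"],
        ["x", "y"]]

arr = np.array(rows)
print(arr.dtype)   # <U31: every cell reserves room for 31 characters
print(arr.nbytes)  # 496: 4 cells * 31 chars * 4 bytes per char

One 31-character string forces every cell to 124 bytes, and a real dataset with millions of rows multiplies that waste accordingly.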
By doing
self.data = np.array(rows, dtype=object)
numpy doesn't need to allocate big chunks of new memory for string objects: dtype=object tells numpy to keep its array contents as references to existing Python objects (the strings already exist in your Python list rows), and these pointers take up much less space than the fixed-width string copies would.
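A quick comparison on the same toy rows (again just a sketch) shows the difference:

import numpy as np

rows = [["short", "a much much longer string value"],
        ["x", "y"]]

fixed = np.array(rows)                # <U31: every cell padded to 31 characters
boxed = np.array(rows, dtype=object)  # object: one 8-byte reference per cell on 64-bit Python

print(fixed.nbytes, boxed.nbytes)     # 496 32
print(boxed[0, 0] is rows[0][0])      # True: the existing str objects are reused, not copied

Only the references are stored, so building the object array adds roughly the size of the pointer table on top of the strings you already hold in rows.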