Question
I'm working on the following code for performing Random Forest Classification on train and test sets:
from sklearn.ensemble import RandomForestClassifier
from numpy import genfromtxt, savetxt

def main():
    dataset = genfromtxt(open('filepath', 'r'), delimiter=' ', dtype='f8')
    target = [x[0] for x in dataset]
    train = [x[1:] for x in dataset]
    test = genfromtxt(open('filepath', 'r'), delimiter=' ', dtype='f8')

    rf = RandomForestClassifier(n_estimators=100)
    rf.fit(train, target)

    predicted_probs = [[index + 1, x[1]] for index, x in enumerate(rf.predict_proba(test))]
    savetxt('filepath', predicted_probs, delimiter=',', fmt='%d,%f',
            header='Id,PredictedProbability', comments='')

if __name__ == "__main__":
    main()
However, I get the following error on execution:
----> dataset = genfromtxt(open('C:/Users/Saurabh/Desktop/pgm/Cora/a_train.csv','r'), delimiter='', dtype='f8')
ValueError: Some errors were detected !
Line #88 (got 1435 columns instead of 1434)
Line #93 (got 1435 columns instead of 1434)
Line #164 (got 1435 columns instead of 1434)
Line #169 (got 1435 columns instead of 1434)
Line #524 (got 1435 columns instead of 1434)
...
...
...
Any suggestions on how to avoid it? Thanks.
Answer 1:
genfromtxt will give this error if the number of columns is unequal.
I can think of 3 ways around it:
1. Use the usecols parameter
np.genfromtxt('yourfile.txt',delimiter=',',usecols=np.arange(0,1434))
However, this may mean that you lose some data (from rows that are longer than 1434 columns); whether or not that matters is up to you.
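Applied to the question's space-delimited file (the path stays a placeholder, and 1434 is taken from the error message), a minimal sketch:

import numpy as np

# Keep only the first 1434 columns; the extra field on the offending lines is ignored.
dataset = np.genfromtxt('filepath', delimiter=' ', dtype='f8',
                        usecols=np.arange(0, 1434))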
2. Adjust your input data file so that it has an equal number of columns.
3. Use something other than genfromtxt to read the file and build the array yourself; one possible approach is sketched below.
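As one illustration (not part of the original answer): a minimal sketch that reads the file line by line, pads or truncates every row to a fixed width, and only then builds the NumPy array. The filename, the expected width of 1434, and zero as the padding value are all assumptions.

import numpy as np

expected_cols = 1434
rows = []
with open('a_train.csv', 'r') as f:                          # placeholder filename
    for line in f:
        fields = [x for x in line.split(' ') if x.strip()]   # space-delimited, as in the question
        if not fields:
            continue
        fields = fields[:expected_cols]                      # drop any extra columns
        fields += ['0'] * (expected_cols - len(fields))      # pad short rows (assumes 0 is a safe filler)
        rows.append([float(x) for x in fields])

dataset = np.array(rows, dtype='f8')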
Answer 2:
You have too many columns in one of your rows. For example:
>>> import numpy as np
>>> from StringIO import StringIO
>>> s = """
... 1 2 3 4
... 1 2 3 4 5
... """
>>> np.genfromtxt(StringIO(s),delimiter=" ")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.6/site-packages/numpy/lib/npyio.py", line 1654, in genfromtxt
raise ValueError(errmsg)
ValueError: Some errors were detected !
Line #2 (got 5 columns instead of 4)
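(The session above is Python 2; on Python 3 the same demonstration would import StringIO from io. A minimal sketch with the same made-up data:)

import numpy as np
from io import StringIO

s = "1 2 3 4\n1 2 3 4 5\n"
np.genfromtxt(StringIO(s), delimiter=" ")   # raises ValueError: ... got 5 columns instead of 4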
Answer 3:
An exception is raised if an inconsistency is detected in the number of columns. A number of causes and solutions are possible.
Add invalid_raise=False to skip the offending lines:
dataset = genfromtxt(open('data.csv','r'), delimiter='', invalid_raise=False)
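Applied to the question's space-delimited file (the filename and delimiter here are placeholders), a minimal sketch; since the offending lines are skipped, it is worth checking how many rows actually survive:

import numpy as np

dataset = np.genfromtxt('a_train.csv', delimiter=' ', dtype='f8',
                        invalid_raise=False)
print(dataset.shape)   # fewer rows than the file has lines if any were skipped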
If your data contains names (i.e. you use the names parameter), make sure the field names don't contain any spaces or invalid characters, and that they do not correspond to the name of a standard attribute (like size or shape), which would confuse the interpreter.
deletechars gives a string combining all the characters that must be deleted from the name. By default, the invalid characters are ~!@#$%^&*()-=+~\|]}[{';: /?.>,<.
excludelist gives a list of names to exclude, such as return, file, print... If one of the input names is part of this list, an underscore character ('_') is appended to it.
case_sensitive controls whether the names should be case-sensitive (case_sensitive=True), converted to upper case (case_sensitive=False or case_sensitive='upper'), or converted to lower case (case_sensitive='lower').
data = np.genfromtxt("data.txt", dtype=None, names=True,\
deletechars="~!@#$%^&*()-=+~\|]}[{';: /?.>,<.", case_sensitive=True)
Reference: numpy.genfromtxt
Answer 4:
I had this error too. The cause was a single entry in my data that contained a space, which made genfromtxt see it as an extra column. Make sure the spacing is consistent throughout the data.
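One way to locate such a stray space is to count the fields on every line before parsing. A minimal sketch (the filename and the space delimiter are assumptions):

from collections import Counter

with open('a_train.csv', 'r') as f:
    counts = [(i + 1, len(line.split(' '))) for i, line in enumerate(f) if line.strip()]

expected = Counter(n for _, n in counts).most_common(1)[0][0]   # the most common width
for lineno, n in counts:
    if n != expected:
        print('Line %d: got %d columns instead of %d' % (lineno, n, expected))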
Answer 5:
It seems like the header that contains the column names has one more column than the data itself (1435 columns in the header vs. 1434 in the data).
You could either:
1) Remove the one header column that has no matching data column
OR
2) Use the skip_header parameter of genfromtxt()
for example, np.genfromtxt('myfile', skip_header=*how many lines to skip*, delimiter=' ')
More information can be found in the documentation.
Answer 6:
In my case, the error arose because of a special symbol in a row.
Error cause: special characters such as
- '#' (hash)
- ',' (given that your delimiter is also ',')
Example csv file:
1,hello,#this,fails
1,hello,',this',fails
-----CODE-----
import numpy as np
data = np.genfromtxt(file, delimiter=',')  # raises the ValueError
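One possible workaround for the quoted-comma case (not part of the original answer) is the standard csv module, which honours a quote character, unlike genfromtxt. A minimal sketch, assuming a file like the example above:

import csv

with open('example.csv', 'r', newline='') as f:             # placeholder filename
    rows = list(csv.reader(f, delimiter=',', quotechar="'"))

# "1,hello,',this',fails" parses as ['1', 'hello', ',this', 'fails'], i.e. four fields,
# and '#' is ordinary text here, so "1,hello,#this,fails" also keeps four fields.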
Environment Note:
OS: Ubuntu
csv editor: LibreOffice
IDE: Pycharm
Answer 7:
I also had this error when trying to load a text dataset with genfromtxt to do text classification with Keras.
The data format was: [some_text]\t[class_label].
My understanding was that some characters in the first column somehow confuse the parser, so the two columns cannot be split properly.
data = np.genfromtxt(my_file, delimiter='\t', usecols=(0,1), dtype=str);
This snippet produced the same ValueError as yours, and my first workaround was to read everything as one column:
data = np.genfromtxt(my_file, delimiter='\t', usecols=(0), dtype=str);
and split the data later by myself.
However, what finally worked properly was to explicitly define the comment parameter in genfromtxt.
data = np.genfromtxt(my_file, delimiter='\t', usecols=(0,1), dtype=str, comments=None);
According to the documentation:
The optional argument comments is used to define a character string that marks the beginning of a comment. By default, genfromtxt assumes comments='#'. The comment marker may occur anywhere on the line. Any character present after the comment marker(s) is simply ignored.
The default character that indicates a comment is '#', so if this character is included in your text column, everything after it is ignored. That is probably why the two columns cannot be recognized by genfromtxt.
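A small self-contained check of this behaviour (the two sample lines are made up):

import numpy as np
from io import StringIO

s = "some #text\t1\nother text\t0\n"

# Default comments='#': everything after the '#' is dropped, the columns no longer
# line up, and genfromtxt raises the "got ... columns instead of ..." ValueError.
try:
    np.genfromtxt(StringIO(s), delimiter='\t', dtype=str)
except ValueError as err:
    print(err)

# comments=None: the '#' is treated as ordinary text and both columns parse.
data = np.genfromtxt(StringIO(s), delimiter='\t', dtype=str, comments=None)
print(data)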
Source: https://stackoverflow.com/questions/23353585/got-1-columns-instead-of-error-in-numpy