I am using split() and split(" ") on the same string. Why does split(" ") return fewer elements than split()?
str.split with the None argument (or no argument) splits on all whitespace characters, not just the space you type with your spacebar.
In [457]: text = 'this\nshould\rhelp\tyou\funderstand'
In [458]: text.split()
Out[458]: ['this', 'should', 'help', 'you', 'understand']
In [459]: text.split(' ')
Out[459]: ['this\nshould\rhelp\tyou\x0cunderstand']
A list of all the whitespace characters that split(None) splits on can be found at "All the Whitespace Characters? Is it language independent?"
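As a rough illustration of the point above, the sketch below checks a handful of characters with str.isspace() and shows how each one is treated by split() versus split(" "). The candidates list is a hypothetical sample chosen for this example, not an exhaustive list of whitespace characters.

```python
# Hypothetical sample of characters; this is NOT an exhaustive whitespace list.
candidates = [" ", "\t", "\n", "\r", "\x0b", "\x0c", "\u00a0", "\u2003"]

for ch in candidates:
    joined = "a" + ch + "b"
    # Columns: the character, whether Python treats it as whitespace,
    # the result of split() and the result of split(" ").
    print(repr(ch), ch.isspace(), joined.split(), joined.split(" "))

# With no argument, every character in the sample acts as a separator.
no_arg_results = [("a" + ch + "b").split() for ch in candidates]

# With " " as the separator, only the plain space does.
space_results = [("a" + ch + "b").split(" ") for ch in candidates]
```

Only the first entry (the plain space) is split by split(" "); the others stay inside a single element, which is why split(" ") can return fewer elements.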
In Python, the split method splits on a specific string if one is given, otherwise on runs of whitespace (and then you can access the result list by index as usual):
s = "Hello world! How are you?"
s.split()
Out[9]: ['Hello', 'world!', 'How', 'are', 'you?']
s.split("!")
Out[10]: ['Hello world', ' How are you?']
s.split("!")[0]
Out[11]: 'Hello world'
In my own experience, most of the confusion comes from how split() treats whitespace differently depending on the separator. Passing a separator like ' ' versus None triggers different behavior in split(). According to the Python documentation:
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.
Below is an example in which the sample string has a trailing space ' ', the same whitespace character that is passed to the second split() call. So the method behaves differently not because of some whitespace character mismatch, but because of how it was designed to work, perhaps for convenience in common scenarios. That can be confusing for people who expect split() to just split.
sample = "a b "
sample.split()     # ['a', 'b']
sample.split(' ')  # ['a', 'b', '']
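The same design also shows up with runs of consecutive spaces in the middle of a string. A minimal sketch, using a made-up sample string with two and then three spaces between the letters:

```python
sample = "a  b   c"  # two spaces after 'a', three spaces after 'b'

# An explicit separator treats every single space as its own delimiter,
# so each extra space in a run produces an empty string.
with_sep = sample.split(" ")

# No separator: a whole run of whitespace counts as one delimiter.
without_sep = sample.split()

print(with_sep)     # ['a', '', 'b', '', '', 'c']
print(without_sep)  # ['a', 'b', 'c']
```

Five separators give six parts with split(" "), while split() collapses each run and returns only the three words.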
The method str.split called without arguments has a somewhat different behaviour.
First it splits by any whitespace character.
'foo bar\nbaz\tmeh'.split() # ['foo', 'bar', 'baz', 'meh']
But it also removes the empty strings from the output list.
' foo bar '.split(' ') # ['', 'foo', 'bar', '']
' foo bar '.split() # ['foo', 'bar']
If you run the help command on the split() function you'll see this:
split(...) S.split([sep [,maxsplit]]) -> list of strings
Return a list of the words in the string S, using sep as the delimiter string. If maxsplit is given, at most maxsplit splits are done. If sep is not specified or is None, any whitespace string is a separator and empty strings are removed from the result.
Therefore the difference between the two is that split() without a delimiter treats any run of whitespace as a separator and removes empty strings from the result, while split() with a delimiter does neither.
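If you want the explicit-separator form without the empty strings, one possible approach is to filter them out afterwards. This is only a sketch: the variable names are made up for the example, and the result matches split() here only because the sample string contains no whitespace other than plain spaces.

```python
text = " foo  bar "  # leading space, double space in the middle, trailing space

# Splitting on ' ' keeps an empty string for every extra space.
parts_with_sep = text.split(" ")

# Dropping the empty strings by hand gives the same list as split() here.
filtered = [part for part in parts_with_sep if part]

print(parts_with_sep)  # ['', 'foo', '', 'bar', '']
print(filtered)        # ['foo', 'bar']
print(text.split())    # ['foo', 'bar']
```

With tabs or newlines in the input the two would differ again, since split(" ") never splits on those characters.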