Question
This seems like it should be pretty trivial, but I am new at Python and want to do it the most Pythonic way.
I want to find the index corresponding to the n'th occurrence of a substring within a string.
There's got to be something equivalent to what I WANT to do which is
mystring.find("substring", 2nd)
How can you achieve this in Python?
Answer 1:
Mark's iterative approach would be the usual way, I think.
Here's an alternative with string-splitting, which can often be useful for finding-related processes:
def findnth(haystack, needle, n):
    # n is zero-based: n=0 finds the first occurrence
    parts = haystack.split(needle, n + 1)
    if len(parts) <= n + 1:
        return -1
    return len(haystack) - len(parts[-1]) - len(needle)
And here's a quick (and somewhat dirty, in that you have to choose some chaff that can't match the needle) one-liner:
'foo bar bar bar'.replace('bar', 'XXX', 1).find('bar')
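As a quick sanity check of the two snippets above, here is a small sketch that runs them side by side. It assumes nothing beyond the code already shown; the variable names (by_split, by_replace, missing) are just for illustration. Note that findnth() counts from zero, so n=1 asks for the second occurrence.

```python
def findnth(haystack, needle, n):
    # n is zero-based here: n=0 is the first occurrence, n=1 the second
    parts = haystack.split(needle, n + 1)
    if len(parts) <= n + 1:
        return -1
    return len(haystack) - len(parts[-1]) - len(needle)

text = 'foo bar bar bar'

by_split = findnth(text, 'bar', 1)                       # second occurrence
by_replace = text.replace('bar', 'XXX', 1).find('bar')   # second occurrence
missing = findnth(text, 'bar', 3)                        # there is no fourth one

print(by_split, by_replace, missing)
```

Both approaches agree on the second occurrence, and the split version reports -1 once the string runs out of matches.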
Answer 2:
Here's a more Pythonic version of the straightforward iterative solution:
def find_nth(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start + len(needle))
        n -= 1
    return start
Example:
>>> find_nth("foofoofoofoo", "foofoo", 2)
6
If you want to find the nth overlapping occurrence of needle, you can increment by 1 instead of len(needle), like this:
def find_nth_overlapping(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start + 1)
        n -= 1
    return start
Example:
>>> find_nth_overlapping("foofoofoofoo", "foofoo", 2)
3
This is easier to read than Mark's version, and it doesn't require the extra memory of the splitting version or importing the regular expression module. It also adheres to a few of the rules in the Zen of Python, unlike the various re approaches:
- Simple is better than complex.
- Flat is better than nested.
- Readability counts.
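If you need both variants, one possible way to fold them together is a generator that yields every match position, with itertools.islice picking the nth one. This is only a sketch: iter_occurrences and the overlapping flag are names I made up, and it assumes a non-empty needle and n >= 1.

```python
from itertools import islice

def iter_occurrences(haystack, needle, overlapping=False):
    """Yield the start index of each occurrence of needle, left to right."""
    # Step past the whole match, or by one character to allow overlaps
    step = 1 if overlapping else len(needle)
    start = haystack.find(needle)
    while start >= 0:
        yield start
        start = haystack.find(needle, start + step)

def find_nth(haystack, needle, n, overlapping=False):
    """Return the index of the nth (1-based) occurrence, or -1 if there is none."""
    hits = iter_occurrences(haystack, needle, overlapping)
    # Skip the first n-1 hits lazily; -1 is the default when hits run out
    return next(islice(hits, n - 1, None), -1)

print(find_nth("foofoofoofoo", "foofoo", 2))
print(find_nth("foofoofoofoo", "foofoo", 2, overlapping=True))
```

The generator stops searching as soon as the nth hit has been produced, so nothing beyond that point in the haystack is scanned.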
Answer 3:
This will find the second occurrence of substring in string.
def find_2nd(string, substring):
    return string.find(substring, string.find(substring) + 1)
Edit: I haven't thought much about the performance, but a quick recursion can help with finding the nth occurrence:
def find_nth(string, substring, n):
    if n == 1:
        return string.find(substring)
    previous = find_nth(string, substring, n - 1)
    if previous == -1:
        # Fewer than n occurrences: stop instead of restarting the search at index 0
        return -1
    return string.find(substring, previous + 1)
Answer 4:
Understanding that regex is not always the best solution, I'd probably use one here:
>>> import re
>>> s = "ababdfegtduab"
>>> [m.start() for m in re.finditer(r"ab",s)]
[0, 2, 11]
>>> [m.start() for m in re.finditer(r"ab",s)][2] #index 2 is third occurrence
11
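Two things to watch with the regex route are needles that contain regex metacharacters and overlapping matches. A rough sketch of how both could be handled, assuming a literal (non-regex) needle; match_starts is a hypothetical helper name, not something from the re module:

```python
import re

def match_starts(haystack, needle, overlapping=False):
    """Return the start index of every literal occurrence of needle."""
    pattern = re.escape(needle)            # treat needle as plain text, not as a regex
    if overlapping:
        # A zero-width lookahead consumes nothing, so matches may overlap
        pattern = '(?=' + pattern + ')'
    return [m.start() for m in re.finditer(pattern, haystack)]

print(match_starts("ababdfegtduab", "ab"))
print(match_starts("a.b.a.b", "."))
print(match_starts("foofoofoofoo", "foofoo", overlapping=True))
```

Without re.escape(), the "." needle in the second call would match every character rather than just the dots.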
Answer 5:
I'm offering some benchmarking results comparing the most prominent approaches presented so far, namely @bobince's findnth() (based on str.split()) vs. @tgamblin's or @Mark Byers' find_nth() (based on str.find()). I will also compare with a C extension (_find_nth.so) to see how fast we can go. Here is find_nth.py:
def findnth(haystack, needle, n):
    parts = haystack.split(needle, n + 1)
    if len(parts) <= n + 1:
        return -1
    return len(haystack) - len(parts[-1]) - len(needle)

def find_nth(s, x, n=0, overlap=False):
    l = 1 if overlap else len(x)
    i = -l
    for c in xrange(n + 1):  # Python 2; use range() on Python 3
        i = s.find(x, i + l)
        if i < 0:
            break
    return i
Of course, performance matters most if the string is large, so suppose we want to find the 1000001st newline ('\n') in a 1.3 GB file called 'bigfile'. To save memory, we would like to work on an mmap.mmap object representation of the file:
In [1]: import _find_nth, find_nth, mmap
In [2]: f = open('bigfile', 'r')
In [3]: mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
This is where findnth() hits its first problem: mmap.mmap objects don't support split(). So we actually have to copy the whole file into memory:
In [4]: %time s = mm[:]
CPU times: user 813 ms, sys: 3.25 s, total: 4.06 s
Wall time: 17.7 s
Ouch! Fortunately s still fits in the 4 GB of memory of my MacBook Air, so let's benchmark findnth():
In [5]: %timeit find_nth.findnth(s, '\n', 1000000)
1 loops, best of 3: 29.9 s per loop
Clearly terrible performance. Let's see how the approach based on str.find() does:
In [6]: %timeit find_nth.find_nth(s, '\n', 1000000)
1 loops, best of 3: 774 ms per loop
Much better! Clearly, findnth()'s problem is that it is forced to copy the string during split(), which is already the second time we copied the 1.3 GB of data around after s = mm[:]. This is where the second advantage of find_nth() comes in: we can use it on mm directly, such that zero copies of the file are required:
In [7]: %timeit find_nth.find_nth(mm, '\n', 1000000)
1 loops, best of 3: 1.21 s per loop
There appears to be a small performance penalty operating on mm vs. s, but this illustrates that find_nth() can get us an answer in 1.2 s compared to findnth's total of 47 s (17.7 s for the copy plus 29.9 s for the search).
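The session above is Python 2. For anyone trying this on Python 3, here is a minimal sketch of the same idea, under the assumption that the file is opened in binary mode and the needle is a bytes object (mmap.find() works on bytes there). The tiny temporary file is a stand-in for 'bigfile', and its content is made up for the example.

```python
import mmap
import os
import tempfile

def find_nth(buf, needle, n=0, overlap=False):
    """Zero-based: n=0 returns the first occurrence; -1 if there are too few."""
    step = 1 if overlap else len(needle)
    i = -step
    for _ in range(n + 1):
        i = buf.find(needle, i + step)
        if i < 0:
            break
    return i

# Small stand-in for 'bigfile' (hypothetical content)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"line0\nline1\nline2\nline3\n")
    path = tmp.name

try:
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The search runs on the mapping itself, without copying the file
            third_newline = find_nth(mm, b'\n', 2)
            print(third_newline)
finally:
    os.remove(path)
```

The same find_nth() also accepts a plain bytes or str object, since it only relies on a find(sub, start) method.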
I found no cases where the str.find() based approach was significantly worse than the str.split() based approach, so at this point, I would argue that @tgamblin's or @Mark Byers' answer should be accepted instead of @bobince's.
In my testing, the version of find_nth() above was the fastest pure Python solution I could come up with (very similar to @Mark Byers' version). Let's see how much better we can do with a C extension module. Here is _find_nthmodule.c:
#include <Python.h>
#include <string.h>

/* Straightforward byte-by-byte scan; n is zero-based. */
off_t _find_nth(const char *buf, size_t l, char c, int n) {
    off_t i;
    for (i = 0; i < l; ++i) {
        if (buf[i] == c && n-- == 0) {
            return i;
        }
    }
    return -1;
}

/* Same search, but delegating the scan to memchr(). */
off_t _find_nth2(const char *buf, size_t l, char c, int n) {
    const char *b = buf - 1;
    do {
        /* Search only the bytes left after the previous hit, not l bytes again. */
        b = memchr(b + 1, c, l - (size_t)(b + 1 - buf));
        if (!b) return -1;
    } while (n--);
    return b - buf;
}

/* mmap_object is private in mmapmodule.c - replicate beginning here */
typedef struct {
    PyObject_HEAD
    char *data;
    size_t size;
} mmap_object;

typedef struct {
    const char *s;
    size_t l;
    char c;
    int n;
} params;

int parse_args(PyObject *args, params *P) {
    PyObject *obj;
    const char *x;
    if (!PyArg_ParseTuple(args, "Osi", &obj, &x, &P->n)) {
        return 1;
    }
    PyTypeObject *type = Py_TYPE(obj);
    if (type == &PyString_Type) {
        P->s = PyString_AS_STRING(obj);
        P->l = PyString_GET_SIZE(obj);
    } else if (!strcmp(type->tp_name, "mmap.mmap")) {
        mmap_object *m_obj = (mmap_object*) obj;
        P->s = m_obj->data;
        P->l = m_obj->size;
    } else {
        PyErr_SetString(PyExc_TypeError, "Cannot obtain char * from argument 0");
        return 1;
    }
    P->c = x[0];
    return 0;
}

static PyObject* py_find_nth(PyObject *self, PyObject *args) {
    params P;
    if (!parse_args(args, &P)) {
        return Py_BuildValue("i", _find_nth(P.s, P.l, P.c, P.n));
    } else {
        return NULL;
    }
}

static PyObject* py_find_nth2(PyObject *self, PyObject *args) {
    params P;
    if (!parse_args(args, &P)) {
        return Py_BuildValue("i", _find_nth2(P.s, P.l, P.c, P.n));
    } else {
        return NULL;
    }
}

static PyMethodDef methods[] = {
    {"find_nth", py_find_nth, METH_VARARGS, ""},
    {"find_nth2", py_find_nth2, METH_VARARGS, ""},
    {0}
};

PyMODINIT_FUNC init_find_nth(void) {
    Py_InitModule("_find_nth", methods);
}
Here is the setup.py file:
from distutils.core import setup, Extension
module = Extension('_find_nth', sources=['_find_nthmodule.c'])
setup(ext_modules=[module])
Install as usual with python setup.py install. The C code has an advantage here since it is limited to finding single characters, but let's see how fast this is:
In [8]: %timeit _find_nth.find_nth(mm, '\n', 1000000)
1 loops, best of 3: 218 ms per loop
In [9]: %timeit _find_nth.find_nth(s, '\n', 1000000)
1 loops, best of 3: 216 ms per loop
In [10]: %timeit _find_nth.find_nth2(mm, '\n', 1000000)
1 loops, best of 3: 307 ms per loop
In [11]: %timeit _find_nth.find_nth2(s, '\n', 1000000)
1 loops, best of 3: 304 ms per loop
Clearly quite a bit faster still. Interestingly, there is no difference on the C level between the in-memory and mmapped cases. It is also interesting to see that _find_nth2(), which is based on string.h's memchr() library function, loses out against the straightforward implementation in _find_nth(): the additional "optimizations" in memchr() are apparently backfiring...
In conclusion, the implementation in findnth() (based on str.split()) is really a bad idea, since (a) it performs terribly for larger strings due to the required copying, and (b) it doesn't work on mmap.mmap objects at all. The implementation in find_nth() (based on str.find()) should be preferred in all circumstances (and therefore be the accepted answer to this question).
There is still quite a bit of room for improvement, since the C extension ran almost a factor of 4 faster than the pure Python code, indicating that there might be a case for a dedicated Python library function.
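If you want to repeat the pure Python part of this comparison yourself, a rough timeit harness along these lines should do. It is a sketch for Python 3 with a synthetic string, not the 1.3 GB file used above, so the absolute numbers will not be comparable; findnth_split is just a renamed copy of findnth() to keep the two apart.

```python
import timeit

def findnth_split(haystack, needle, n):
    # str.split() based version (zero-based n)
    parts = haystack.split(needle, n + 1)
    if len(parts) <= n + 1:
        return -1
    return len(haystack) - len(parts[-1]) - len(needle)

def find_nth(s, x, n=0):
    # str.find() based version (zero-based n)
    i = -len(x)
    for _ in range(n + 1):
        i = s.find(x, i + len(x))
        if i < 0:
            break
    return i

# Synthetic input (an assumption, chosen only to keep the run short)
text = "some line of text\n" * 200_000
n = 100_000

# Both functions must agree before the timings mean anything
assert findnth_split(text, "\n", n) == find_nth(text, "\n", n)

for func in (findnth_split, find_nth):
    seconds = timeit.timeit(lambda: func(text, "\n", n), number=3)
    print(f"{func.__name__}: {seconds / 3:.4f} s per call")
```

Results will depend on the Python version, the needle density and the string size, so treat any single run as indicative only.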
Answer 6:
I'd probably do something like this, using the find function that takes an index parameter:
def find_nth(s, x, n):
    i = -len(x)  # so that the first search starts at index 0
    for _ in range(n):
        i = s.find(x, i + len(x))
        if i == -1:
            break
    return i
print(find_nth('bananabanana', 'an', 3))
It's not particularly Pythonic I guess, but it's simple. You could do it using recursion instead:
def find_nth(s, x, n, i=0):
    i = s.find(x, i)
    if n == 1 or i == -1:
        return i
    else:
        return find_nth(s, x, n - 1, i + len(x))
print(find_nth('bananabanana', 'an', 3))
It's a functional way to solve it, but I don't know if that makes it more Pythonic.
Answer 7:
Simplest way?
text = "This is a test from a test ok"
firstTest = text.find('test')
print(text.find('test', firstTest + 1))
Answer 8:
This will give you an array of the starting indices of the matches in yourstring:
import re
indices = [s.start() for s in re.finditer(':', yourstring)]
Then your nth entry would be:
n = 2
nth_entry = indices[n-1]
Of course you have to be careful with the index bounds. You can get the number of matches in yourstring like this:
num_instances = len(indices)
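To make the bounds check concrete, the lookup could be wrapped roughly as follows. This is a sketch only: nth_match_start is a hypothetical name, n is taken to be 1-based as in the snippet above, and returning -1 for an out-of-range n is my choice to mirror str.find().

```python
import re

def nth_match_start(pattern, text, n):
    """Return the start of the nth (1-based) regex match, or -1 if out of range."""
    indices = [m.start() for m in re.finditer(pattern, text)]
    # Guard the lookup so that n=0, negative n, or too large n cannot index wrongly
    if 1 <= n <= len(indices):
        return indices[n - 1]
    return -1

yourstring = "a:b:c"
print(nth_match_start(':', yourstring, 2))
print(nth_match_start(':', yourstring, 3))
```

The explicit lower bound matters too: without it, n=0 would silently return the last match because of negative indexing.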
Answer 9:
Here is another approach using re.finditer.
The difference is that this only looks into the haystack as far as necessary.
import re
from itertools import dropwhile
needle = 'an'
haystack = 'bananabanana'
n = 2  # zero-based: 2 selects the third match
next(dropwhile(lambda x: x[0] < n, enumerate(re.finditer(re.escape(needle), haystack))))[1].start()
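The dropwhile expression raises StopIteration when there are fewer matches than requested. A possible alternative with the same lazy behaviour uses itertools.islice plus a default for next(). This is a sketch assuming a literal needle and a zero-based n; find_nth_lazy is a made-up name.

```python
import re
from itertools import islice

def find_nth_lazy(haystack, needle, n):
    """Zero-based n; returns -1 when there are not enough matches."""
    matches = re.finditer(re.escape(needle), haystack)
    # islice skips the first n matches lazily; next() falls back to None
    m = next(islice(matches, n, None), None)
    return m.start() if m is not None else -1

print(find_nth_lazy('bananabanana', 'an', 2))
print(find_nth_lazy('bananabanana', 'an', 9))
```

Because finditer is an iterator, the scan still stops at the requested match instead of walking the whole haystack.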
Answer 10:
Here's another re + itertools version that should work when searching for either a str or a RegexpObject. I will freely admit that this is likely over-engineered, but for some reason it entertained me.
import itertools
import re

def find_nth(haystack, needle, n=1):
    """
    Find the starting index of the nth occurrence of ``needle`` in \
    ``haystack``.

    If ``needle`` is a ``str``, this will perform an exact substring
    match; if it is a ``RegexpObject``, this will perform a regex
    search.

    If ``needle`` doesn't appear in ``haystack``, return ``-1``. If
    ``needle`` doesn't appear in ``haystack`` ``n`` times,
    return ``-1``.

    Arguments
    ---------
    * ``haystack`` is a ``str``
    * ``needle`` the substring (or a ``RegexpObject``) to find
    * ``n`` an ``int`` indicating which occurrence to find; defaults to ``1``

    >>> find_nth("foo", "o", 1)
    1
    >>> find_nth("foo", "o", 2)
    2
    >>> find_nth("foo", "o", 3)
    -1
    >>> find_nth("foo", "b")
    -1
    >>> import re
    >>> either_o = re.compile("[oO]")
    >>> find_nth("foo", either_o, 1)
    1
    >>> find_nth("FOO", either_o, 1)
    1
    """
    if hasattr(needle, 'finditer'):
        matches = needle.finditer(haystack)
    else:
        matches = re.finditer(re.escape(needle), haystack)
    start_here = itertools.dropwhile(lambda x: x[0] < n, enumerate(matches, 1))
    try:
        return next(start_here)[1].start()
    except StopIteration:
        return -1
Answer 11:
Building on modle13's answer, but without the re module dependency.
def iter_find(haystack, needle):
    # startswith(needle, i) tests position i without slicing a copy of the haystack
    return [i for i in range(len(haystack)) if haystack.startswith(needle, i)]
I kinda wish this was a builtin string method.
>>> iter_find("http://stackoverflow.com/questions/1883980/", '/')
[5, 6, 24, 34, 42]
Answer 12:
>>> s = "abcdefabcdefababcdef"
>>> j = 0
>>> for n, i in enumerate(s):
...     if s[n:n+2] == "ab":
...         print(n, i)
...         j = j + 1
...         if j == 2: print("2nd occurrence at index position:", n)
...
0 a
6 a
2nd occurrence at index position: 6
12 a
14 a
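Answer 13:
Here is another "tricky" solution, which uses split and join.
In your example, we can use
len("substring".join(ori.split("substring")[:2]))
Note that this only gives the right index if ori contains at least two occurrences; otherwise it returns len(ori).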
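Answer 14:
# return -1 if the nth occurrence of substr (0-indexed) does not exist, else return its index
def find_nth(s, substr, n):
    i = -1  # so that the first search starts at index 0
    while n >= 0:
        n -= 1
        i = s.find(substr, i + 1)
        if i == -1:
            break
    return i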
Answer 15:
Solution without using loops and recursion.
Put the required pattern in the compile call and the desired occurrence in the variable n; the last statement then prints the starting index of that occurrence of the pattern in the given string. The iterator returned by finditer is converted to a list and indexed directly, so n is zero-based (n=2 selects the third match).
import re
n = 2
sampleString = "this is history"
pattern = re.compile("is")
matches = pattern.finditer(sampleString)
print(list(matches)[n].span()[0])
Answer 16:
For the special case where you search for the n'th occurrence of a character (i.e. a substring of length 1), the following function works by building a list of all positions of occurrences of the given character:
def find_char_nth(string, char, n):
    """Find the n'th occurrence of a character within a string."""
    return [i for i, c in enumerate(string) if c == char][n-1]
If there are fewer than n occurrences of the given character, it will give IndexError: list index out of range.
This is derived from @Zv_oDD's answer and simplified for the case of a single character.
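Answer 17:
The replace one-liner is great, but it only works because XXX and bar have the same length.
A good and general def would be:
def findN(s, sub, N, replaceString="XXX"):
    # Replace the first N-1 matches, find the next one, then undo the length shift
    i = s.replace(sub, replaceString, N - 1).find(sub)
    return i if i == -1 else i - (len(replaceString) - len(sub)) * (N - 1)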
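To see where the correction term comes from, here is the arithmetic spelled out on the earlier example string, with a chaff string that is deliberately shorter than the needle. The variable names (chaff, raw_index, shift, corrected) are only for this sketch.

```python
s = 'foo bar bar bar'
sub = 'bar'
N = 3            # look for the third occurrence
chaff = 'X'      # deliberately shorter than sub

# Replacing the first N-1 matches shortens the string: 'foo X X bar'
replaced = s.replace(sub, chaff, N - 1)
raw_index = replaced.find(sub)               # position inside the shortened string

# Each replacement changed the length by len(chaff) - len(sub)
shift = (len(chaff) - len(sub)) * (N - 1)
corrected = raw_index - shift                # position inside the original string

print(replaced, raw_index, shift, corrected)
```

The corrected value lines up with s.find('bar', 9), i.e. the third 'bar' in the original string. The chaff must still not be able to form the needle itself, as the first answer pointed out.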
Answer 18:
This is the answer you really want:
def Find(String, ToFind, Occurence=1):
    index = 0
    count = 0
    # Slide a window of len(ToFind) characters over String
    while index <= len(String) - len(ToFind):
        if String[index:index + len(ToFind)] == ToFind:
            count += 1
            if count == Occurence:
                return index
        index += 1
    # Not found; -1 instead of False, because False == 0 looks like a match at index 0
    return -1
Answer 19:
Here is my solution for finding the nth occurrence of b in string a:
from functools import reduce
def findNth(a, b, n):
    return reduce(lambda x, y: -1 if y > x + 1 else a.find(b, x + 1), range(n), -1)
It is pure Python and iterative. For n = 0 or an n that is too large, it returns -1. It is a one-liner and can be used directly. Here is an example:
>>> reduce(lambda x, y: -1 if y > x + 1 else 'bibarbobaobaotang'.find('b', x + 1), range(4), -1)
7
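If the reduce call is hard to follow, itertools.accumulate can show the intermediate values it steps through. This is just an illustrative sketch: the initial keyword needs Python 3.8 or later, and step and trace are names made up for the example.

```python
from itertools import accumulate

a, b, n = 'bibarbobaobaotang', 'b', 4

def step(x, y):
    # x is the index found so far (-1 before the first search), y counts the rounds.
    # y > x + 1 can only hold after a failed search, which keeps the result at -1.
    return -1 if y > x + 1 else a.find(b, x + 1)

# accumulate() yields every value that reduce() would pass along internally
trace = list(accumulate(range(n), step, initial=-1))
print(trace)
```

The last element of the trace is what reduce() returns; the earlier ones are the positions of the first, second and third 'b'.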
Answer 20:
Def:
def get_first_N_words(mytext, mylen=3):
    mylist = mytext.split()
    if len(mylist) >= mylen:
        return ' '.join(mylist[:mylen])
To use:
get_first_N_words(' One Two Three Four ' , 3)
Output:
'One Two Three'
Answer 21:
This avoids a failure or incorrect output when the requested occurrence is higher than the actual number of occurrences. For example, if you ask for the 3rd occurrence of 'o' in the string 'overflow' (which has only 2), the code below prints the following message and returns None:
Input Occurrence entered has exceeded the actual count of Occurrence.
def check_nth_occurrence(string, substr, n):
    ## Get the index of the first occurrence of substr
    index = string.find(substr)
    ## Step to the nth occurrence (index + 1 also counts overlapping matches)
    while index >= 0 and n > 1:
        index = string.find(substr, index + 1)
        n -= 1
    ## Check whether n has exceeded the actual count of occurrences
    if index < 0:
        print('Input Occurrence entered has exceeded the actual count of Occurrence.')
        return None
    return index
Answer 22:
How about:
import os
c = os.getcwd().split('\\')
print('\\'.join(c[0:-2]))
Source: https://stackoverflow.com/questions/61961236/how-to-get-the-substring-between-two-markers-and-repeat-the-process-in-python