Extract all substrings between two markers

Submitted by 两盒软妹~` on 2020-08-10 23:02:27

Question


I have a string:

mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"

What I want is a list of the substrings between the markers start="&marker1" and end="/\n". Thus, the expected result is:

whatIwant = ["The String that I want", "Another string that I want"]

I've read the answers here:

  1. Find string between two substrings [duplicate]
  2. How to extract the substring between two markers?

I tried the following, without success:

>>> import re
>>> mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"
>>> whatIwant = re.search("&marker1(.*)/\n", mystr)
>>> whatIwant.group(1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'

What could I do to resolve this? Also, I have a very long string:

>>> len(myactualstring)
7792818

Answer 1:


Consider this option using re.findall:

import re
mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"
matches = re.findall(r'&marker1\n(.*?)\s*/\n', mystr)
print(matches)

This prints:

['The String that I want', 'Another string that I want']

Here is an explanation of the regex pattern:

&marker1      match the literal marker
\n            followed by a newline
(.*?)         match AND capture as little content as possible, up to the first occurrence of
\s*           optional whitespace, followed by
/\n           / and a newline

Note that when the pattern contains a single capture group, re.findall returns only what that (...) group matched, which is exactly the text you are trying to extract.
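To illustrate the capture-group behaviour described above, here is a minimal sketch that compares the same pattern with and without the group. The variable names (whole, inner, lazy) are only for illustration. The last part uses re.finditer, which yields matches one at a time and may be a convenient alternative for a very long input string; whether it helps in practice depends on how the results are consumed.

```python
import re

mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"

# No capture group: findall returns each whole match, markers included.
whole = re.findall(r'&marker1\n.*?\s*/\n', mystr)

# One capture group: findall returns only what the group matched.
inner = re.findall(r'&marker1\n(.*?)\s*/\n', mystr)

# finditer yields match objects lazily instead of building the list up front.
pattern = re.compile(r'&marker1\n(.*?)\s*/\n')
lazy = [m.group(1) for m in pattern.finditer(mystr)]

print(whole)
print(inner)
print(lazy)
```

With the sample string, inner and lazy hold the two wanted substrings, while whole still contains the markers and newlines.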




Answer 2:


You asked, "What could I do to resolve this?" I would do:

import re
mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"
found = re.findall(r"\&marker1\n(.*?)/\n", mystr)
print(found)

Output:

['The String that I want ', 'Another string that I want ']
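Each result above keeps the space that precedes the "/" marker, whereas the expected list in the question has none. A minimal sketch of one way to trim it, assuming that stripping all leading and trailing whitespace from each match is acceptable (the name trimmed is hypothetical):

```python
import re

mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"

# Same pattern as above; each captured substring still ends with a space.
found = re.findall(r"\&marker1\n(.*?)/\n", mystr)

# Remove leading and trailing whitespace from every captured substring.
trimmed = [s.strip() for s in found]
print(trimmed)
```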

Note that:

  • & has no special meaning in re patterns, so escaping it (\&) is optional; the escaped form is harmless and still matches a literal &
  • . matches any character except a newline (unless the re.DOTALL flag is set), which is why your original pattern could not get past the newline after the marker
  • findall is a better choice than search if you just want a list of the matched substrings
  • *? is non-greedy; in this case .* would work too, because . does not match a newline, but in other cases you might end up matching more than you wish
  • I used a so-called raw string (r prefix) to make escaping easier

See the re module documentation for a discussion of raw-string usage and the list of characters with special meaning.
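The notes above on . and *? can be checked with a short sketch. It first repeats the attempt from the question, then applies the pattern from this answer with re.DOTALL (a flag that is not used in the original answer and is added here only to show the difference between greedy and non-greedy matching once . is allowed to cross newlines).

```python
import re

mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"

# The attempt from the question: "." cannot match the newline that follows
# the marker, so the search finds nothing and returns None.
first_try = re.search("&marker1(.*)/\n", mystr)
print(first_try)

# With re.DOTALL, "." also matches newlines. Greedy .* then runs up to the
# LAST "/\n" in the string, so everything collapses into one long match.
greedy = re.findall(r"\&marker1\n(.*)/\n", mystr, flags=re.DOTALL)

# Non-greedy .*? stops at the FIRST "/\n" after each marker.
lazy = re.findall(r"\&marker1\n(.*?)/\n", mystr, flags=re.DOTALL)

print(greedy)
print(lazy)
```

Here greedy holds a single element that spans both blocks, while lazy holds the two separate substrings.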



Source: https://stackoverflow.com/questions/62342552/extract-all-substrings-between-two-markers
