How to remove text between <script> and </script> using Python?

眼角桃花

How can I remove the text between <script> and </script> using Python?

9 answers

萌比男神i

2021-02-04 19:59

```
import re

example_text = "This is some text <script> blah blah blah </script> this is some more text."

# Capture the text before, inside, and after the <script> element.
myre = re.compile("(^.*)<script>(.*)</script>(.*$)")
result = myre.match(example_text)
result.groups()
# ('This is some text ', ' blah blah blah ', ' this is some more text.')

# Text between <script> .. </script>
result.group(2)
# ' blah blah blah '

# Text outside of <script> .. </script>
result.group(1) + result.group(3)
# 'This is some text  this is some more text.'
```

轻奢々

2021-02-04 20:02
Building on the answers posted by Pev and wr, why not improve the regular expression, e.g.:
```
import re

pattern = r"(?is)<script[^>]*>(.*?)</script>"
text = """<script>foo bar  
baz bar foo  </script>"""
re.sub(pattern, '', text)
```
(?is) makes the match case-insensitive and lets . match newlines, so the script body can span several lines. This version should also support script tags with attributes.

EDIT: I can't add comments yet, so I'm editing my answer instead. I agree with the comment below: regexps are the wrong tool for parsing HTML in general, and Beautiful Soup or lxml are a lot better. But the question gave just a simple example, and a regexp should be enough for such a simple task. Using Beautiful Soup just to remove a bit of text could be overkill.

BTW I made a mistake, the code should look like this:
```
import re

pattern = r"(?is)(<script[^>]*>)(.*?)(</script>)"
text = """<script>foo bar  
baz bar foo  </script>"""
# The replacement must be a raw string, otherwise '\1\3' is read as control characters.
re.sub(pattern, r'\1\3', text)
```
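
The two patterns in this answer do slightly different things: the first drops the whole element, the second keeps the tags and drops only the text between them. A minimal sketch comparing them follows; the function names and the sample string are made up for illustration and are not from the answer.

```python
import re

# Matches a whole <script>...</script> element.
# (?is): ignore case, and let "." match newlines.
SCRIPT_ELEMENT = re.compile(r"(?is)<script[^>]*>.*?</script>")
# Same element, with the opening and closing tags captured separately.
SCRIPT_PARTS = re.compile(r"(?is)(<script[^>]*>)(.*?)(</script>)")


def remove_script_elements(html):
    """Drop each script element entirely, tags included."""
    return SCRIPT_ELEMENT.sub("", html)


def empty_script_elements(html):
    """Keep the script tags but drop the text between them."""
    return SCRIPT_PARTS.sub(r"\1\3", html)


# Hypothetical input: upper-case tag, an attribute, and a newline in the body.
sample = 'before <SCRIPT type="text/javascript">foo\nbar</SCRIPT> after'
print(repr(remove_script_elements(sample)))
print(repr(empty_script_elements(sample)))
```

Compiling the patterns once is only a convenience here; calling re.sub with the pattern string each time behaves the same way.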

执念已碎

2021-02-04 20:03

If you're removing everything between <script> and </script>, why not just remove the entire node?

Are you expecting a resig-style src and body?


轻奢々

2021-02-04 20:05

You can do this with the HTMLParser module (complicated) or use regular expressions:

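```
import re

content = "asdf <script> bla </script> end"
# re.DOTALL lets "." match newlines, so the script body may span several lines.
x = re.search("<script>.*?</script>", content, re.DOTALL)
span = x.span()  # gives (5, 27)

# Keep everything before and after the matched element.
stripped_content = content[:span[0]] + content[span[1]:]
```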

EDIT: re.DOTALL, thanks to tgray
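
For comparison, here is a rough sketch of the HTMLParser route mentioned at the top of this answer, using html.parser from the Python 3 standard library. The class and function names are made up for illustration. It re-emits the input and skips script elements; comments and doctype declarations are not copied through, so treat it as a starting point rather than a complete solution.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit the input, skipping <script> elements and their content."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &lt; escaped in the output.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        elif not self.in_script:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if tag != "script" and not self.in_script:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        elif not self.in_script:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_script:
            self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_scripts(html):
    """Return html with every script element removed."""
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(repr(strip_scripts("asdf <script> bla </script> end")))
```

Unlike the re.search version, this removes every script element in the input, not just the first one.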

佛祖请我去吃肉

2021-02-04 20:11

I don't know Python well enough to give you a solution. But if you want to use this to sanitize user input, you have to be very, very careful. Removing everything between <script> and </script> just doesn't catch everything. Maybe you can have a look at existing solutions (I assume Django includes something like this).
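
To make the warning concrete, here is a small sketch of two inputs that slip past a regex-based script remover. The pattern is the one used in other answers on this page; the function name and the payload strings are made up for illustration.

```python
import re

SCRIPT_ELEMENT = re.compile(r"(?is)<script[^>]*>.*?</script>")


def naive_strip_scripts(html):
    """Naive sanitizer: remove <script> elements with a single regex pass."""
    return SCRIPT_ELEMENT.sub("", html)


# 1. Script can run from an event-handler attribute; there is no <script> tag to remove.
event_handler = '<img src="x" onerror="alert(1)">'

# 2. Removing the inner elements splices the surrounding fragments into a new <script> tag.
nested = "<scr<script></script>ipt>alert(1)</scr<script></script>ipt>"

for payload in (event_handler, nested):
    print(repr(naive_strip_scripts(payload)))
```

The first payload comes back unchanged, and the second comes back as a working script element, which is why a single removal pass is not a sanitizer.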

忘了有多久

2021-02-04 20:16
Are you trying to prevent XSS? Just eliminating the <script> tags will not stop all possible attacks! Here's a great list of the many ways (some of them very creative) that you could be vulnerable: http://ha.ckers.org/xss.html. After reading that page you should understand why just eliminating the <script> tags using a regular expression is not robust enough. The Python library lxml has a function that will robustly clean your HTML to make it safe to display.

If you are sure that you just want to eliminate the <script> tags this code in lxml should work:
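```
from lxml.html import parse

# filename_or_url is a placeholder for the path or URL of your HTML document.
root = parse(filename_or_url).getroot()
# Collect the matches first so the tree is not modified while iterating over it.
for element in list(root.iter("script")):
    element.drop_tree()
```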
Note: I downvoted all the solutions using regular expressions. For why you shouldn't parse HTML with them, see the question "Using regular expressions to parse HTML: why not?"

Note 2: Another SO question shows HTML that is impossible to parse with regular expressions: "Can you provide some examples of why it is hard to parse XML and HTML with a regex?"
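
As one small illustration of the point in these notes (my own sketch, not taken from the linked questions), the snippet below feeds the same input to a regex and to html.parser from the standard library. The input contains no script element at all, only attribute values that look like script tags. The class name and the sample string are made up for illustration.

```python
import re
from html.parser import HTMLParser

SCRIPT_ELEMENT = re.compile(r"(?is)<script[^>]*>.*?</script>")


class TagCollector(HTMLParser):
    """Records the name of every start tag the parser actually sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# The "tags" here are only attribute values; the document has no script element.
html = '<a title="<script>">one</a> <a title="</script>">two</a>'

collector = TagCollector()
collector.feed(html)
collector.close()
print(collector.tags)

# The regex cannot tell attribute values from markup and cuts across both links.
regex_result = SCRIPT_ELEMENT.sub("", html)
print(regex_result)
```

The parser reports two a tags and nothing else, while the regex deletes the first link's text and merges the two elements into one.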