How to remove all HTML tags from a downloaded page

鱼传尺愫 2020-12-31 17:32

I have downloaded a page using urlopen. How do I remove all HTML tags from it? Is there a regexp that replaces all <*> tags?

7 answers
  •  野趣味 (original poster)
    2020-12-31 18:24

    There are multiple options for filtering HTML tags out of data. You can use a regular expression, or remove_tags from w3lib. Note that w3lib is a third-party package (pip install w3lib), not part of the Python standard library.

    from w3lib.html import remove_tags

    # Sample input; the <p> wrapper is illustrative
    data_to_remove = '<p>hello\t\t, \tworld\n</p>'
    print(remove_tags(data_to_remove))

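    OUTPUT: hello\t\t, \tworld\n, printed with the tabs and newline expanded. remove_tags strips only the markup and leaves whitespace untouched.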

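    Note: remove_tags accepts a string object. If your data is some other type, convert it first, e.g. remove_tags(str(data_to_remove)). For the bytes that urlopen returns in Python 3, decode instead (for example data.decode('utf-8')), since str() on bytes yields the b'...' representation rather than the page text.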
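If installing w3lib is not an option, the regex route mentioned above and a parser-based alternative can both be sketched with the standard library alone. This is a minimal sketch, not the w3lib implementation: the names TextExtractor, strip_tags and strip_tags_regex are hypothetical helpers introduced here for illustration.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects only the text nodes of a document."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        # Called for text between tags; tags themselves are never passed here
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_tags(html):
    """Remove tags by parsing the markup and keeping only the text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def strip_tags_regex(html):
    """Naive regex variant: delete everything from '<' to the next '>'.

    Assumption: no '>' appears inside attribute values or comments;
    entities such as &amp; are left as they are.
    """
    return re.sub(r"<[^>]*>", "", html)


sample = '<p>hello\t\t, \tworld\n</p>'
print(repr(strip_tags(sample)))
print(repr(strip_tags_regex(sample)))
```

For a page fetched with urlopen, decode the bytes first (for example urlopen(url).read().decode('utf-8'), assuming the page is UTF-8) and pass the resulting string to either function. Neither variant drops the contents of script or style elements; only the tags themselves are removed. The parser-based version is usually the safer choice, because a regex can be confused by a '>' inside an attribute value.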
