How to convert HTML text into plain text using PySpark? Replacing HTML tags in a string

Submitted by 空扰寡人 on 2019-12-11 14:26:47

Question


I have a text file with a column 'descn' whose text is in HTML format. I want to convert that HTML text into plain text using PySpark. How can I do this?

file name:

mdcl_insigt.txt

input:

PROTEUSÂ <div><br></div><div>We are struggling with pathology. We don&#39;t control specimens of prostatectomy. The hospital pathology is not cooperating. I am reaching out to another hospital. You have pretty intense manual guidelines on pathology in the [PROTEUS] protocol for managing of RP [specimens]. Please e-mail me with work around options.</div>

It should be converted like this. Output:

PROTEUS We are struggling with pathology. We don't control specimens of prostatectomy. The hospital pathology is not cooperating. I am reaching out to another hospital. You have pretty intense manual guidelines on pathology in the [PROTEUS] protocol for managing of RP [specimens]. Please e-mail me with work around options.

Answer 1:


You can try doing a regexp_replace():

from pyspark.sql.functions import regexp_replace

df = df.withColumn("parsed_descn", regexp_replace("descn", "<[^>]+>", ""))

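The regex is not perfect and can fail, for example on a > inside a quoted attribute value. It also removes only tags; it does not decode HTML entities such as &#39;. Refine it if your data needs that.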
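To make that limitation concrete, here is a minimal pure-Python sketch. It is not part of the original answer: it swaps in the standard-library html.parser module, and the names TextExtractor and strip_html are made up for illustration. It shows the same pattern tripping on a quoted > and a parser-based alternative that keeps only the text nodes.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes only; tags and their attributes are discarded."""

    def __init__(self):
        # convert_charrefs defaults to True, so entities like &#39; are decoded
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(s):
    """Hypothetical helper: return the text content of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(s)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


sample = '<a title="a > b">link</a>'

# The regex stops at the first ">", which sits inside the attribute value
regex_result = re.sub(r"<[^>]+>", "", sample)   # ' b">link'

# The parser understands quoted attribute values
parser_result = strip_html(sample)              # 'link'
```

Whether the extra robustness is worth it depends on the data; for simple fragments like the sample in the question, the regex gives the same text.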

It worked on your sample string when I tried it on RegExr.


PySpark output:

from pyspark.sql import functions as F

df.withColumn("parsed", F.regexp_replace("descn", "<[^>]+>", "")).select("parsed").collect()

[Row(parsed='PROTEUSÂ We are struggling with pathology. We don&#39;t control specimens of prostatectomy. The hospital pathology is not cooperating. I am reaching out to another hospital. You have pretty intense manual guidelines on pathology in the [PROTEUS] protocol for managing of RP [specimens]. Please e-mail me with work around options.')]
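Note that the output above still contains the entity &#39;, because regexp_replace removes only the tags. One possible way to decode entities as well, sketched here as an assumption rather than as part of the original answer, is a Python UDF built on the standard-library html.unescape. The names html_to_text and add_plain_text_column are hypothetical.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # same pattern as the regexp_replace call


def html_to_text(s):
    """Strip tags, then decode entities such as &#39; and &amp;."""
    if s is None:  # Spark passes NULL column values to a UDF as None
        return None
    no_tags = TAG_RE.sub("", s)
    # Decode after stripping, so an escaped "&lt;b&gt;" survives as literal text
    text = html.unescape(no_tags)
    # \s also matches the non-breaking space produced by &nbsp;
    return re.sub(r"\s+", " ", text).strip()


def add_plain_text_column(df, src="descn", dst="parsed_descn"):
    """Hypothetical helper: apply html_to_text to a DataFrame column."""
    # Imported lazily so html_to_text can be tried without a Spark install
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    to_text = udf(html_to_text, StringType())
    return df.withColumn(dst, to_text(df[src]))
```

Usage would be df = add_plain_text_column(df). A Python UDF is generally slower than the built-in regexp_replace, so this trade-off only makes sense when entities need decoding. The stray Â after PROTEUS is not an HTML entity and is left untouched by this sketch; it looks like a character-encoding problem, which is probably better fixed when the file is read.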


Source: https://stackoverflow.com/questions/58797064/how-to-convert-html-text-into-plain-text-using-pyspark-replacing-html-tags-from
