Question
I have a text file with a column 'descn' whose values contain text in HTML format. I want to convert the HTML into plain text using PySpark. How can I do this?
file name:
mdcl_insigt.txt
input:
PROTEUSÂ <div><br></div><div>We are struggling with pathology. We don't control specimens of prostatectomy. The hospital pathology is not cooperating. I am reaching out to another hospital. You have pretty intense manual guidelines on pathology in the [PROTEUS] protocol for managing of RP [specimens]. Please e-mail me with work around options.</div>
Expected output:
PROTEUS We are struggling with pathology. We don't control specimens of prostatectomy. The hospital pathology is not cooperating. I am reaching out to another hospital. You have pretty intense manual guidelines on pathology in the [PROTEUS] protocol for managing of RP [specimens]. Please e-mail me with work around options.
Answer 1:
You can try doing a regexp_replace():
from pyspark.sql.functions import regexp_replace
df = df.withColumn("parsed_descn", regexp_replace("descn", "<[^>]+>", ""))
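The regex is not perfect and might fail: it removes only the tags themselves, so HTML entities (for example &nbsp;) are left untouched, and a stray '<' or '>' in the text can make it delete real content. Please do some more research to make it better.
It worked on your sample string when I tried it on regexr.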
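The same check can be reproduced locally before running the job. The sketch below is only an approximation of what Spark does: it uses Python's re module rather than the Java regex engine behind regexp_replace, on the assumption that this simple pattern behaves the same in both. The helper name strip_tags is made up for illustration.

```python
import re

# Same pattern as in the regexp_replace() call: "<", then one or more
# characters that are not ">", then ">".
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(text):
    """Remove anything that looks like an HTML tag (hypothetical helper)."""
    return TAG_PATTERN.sub("", text)


# Tags are removed, the text between them is kept.
print(strip_tags("<div><br></div><div>We are struggling with pathology.</div>"))

# Entities are not tags, so they are left untouched.
print(strip_tags("PROTEUS&nbsp;<div>We</div>"))
```

The second call shows the limitation mentioned above: the entity &nbsp; survives because the pattern only matches text between angle brackets.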
PySpark output:
from pyspark.sql import functions as F
df.withColumn("parsed", F.regexp_replace("descn", "<[^>]+>", "")).select("parsed").collect()
[Row(parsed='PROTEUSÂ We are struggling with pathology. We don't control specimens of prostatectomy. The hospital pathology is not cooperating. I am reaching out to another hospital. You have pretty intense manual guidelines on pathology in the [PROTEUS] protocol for managing of RP [specimens]. Please e-mail me with work around options.')]
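If the regex turns out to be too fragile, one possible alternative (not part of the original answer) is to let an HTML parser extract the text. The sketch below uses only Python's standard html.parser module. The names _TextExtractor and html_to_text are hypothetical, and collapsing whitespace at the end is an assumption about the desired output.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; and &nbsp;
        # before they reach handle_data().
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the plain text of an HTML fragment, or None for a null input."""
    if html is None:
        return None
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Assumption: runs of whitespace (including non-breaking spaces)
    # should collapse to a single space. Text from adjacent tags is
    # concatenated without inserting a separator.
    return " ".join("".join(parser.parts).split())


print(html_to_text("PROTEUS&nbsp;<div><br></div><div>We are struggling.</div>"))
```

In PySpark this function could be wrapped as a UDF, for example udf(html_to_text, StringType()) with udf imported from pyspark.sql.functions and StringType from pyspark.sql.types, and applied to the descn column. Python UDFs are generally slower than built-in functions such as regexp_replace, so this trade-off is worth measuring on real data.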
Source: https://stackoverflow.com/questions/58797064/how-to-convert-html-text-into-plain-text-using-pyspark-replacing-html-tags-from