How to convert HTML text into plain text using PySpark? Replacing HTML tags from a string

Question

I have a text file with a column 'descn' whose values contain HTML markup. I want to convert that HTML into plain text using PySpark. How can I do this?

file name:

mdcl_insigt.txt

input:

PROTEUSÂ <div><br></div><div>We are struggling with pathology. We don&#39;t control specimens of prostatectomy. The hospital pathology is not cooperating. I am reaching out to another hospital. You have pretty intense manual guidelines on pathology in the [PROTEUS] protocol for managing of RP [specimens]. Please e-mail me with work around options.</div>

Expected output:

PROTEUS We are struggling with pathology. We don't control specimens of prostatectomy. The hospital pathology is not cooperating. I am reaching out to another hospital. You have pretty intense manual guidelines on pathology in the [PROTEUS] protocol for managing of RP [specimens]. Please e-mail me with work around options.

Answer 1:

You can try doing a regexp_replace():

from pyspark.sql.functions import regexp_replace

df = df.withColumn("parsed_descn", regexp_replace("descn", "<[^>]+>", ""))

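The regex is not perfect and might fail: it mishandles tags whose attribute values contain '>', it can delete ordinary text that sits between a literal '<' and a later '>', and it does not decode HTML entities such as &#39;. Please do some more research to make it better.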
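To make the failure modes concrete, here is a small sketch that applies the same pattern with Python's re module instead of Spark (an assumption for easy local testing; the pattern behaves the same way in both for these inputs). The helper name strip_tags is hypothetical.

```python
import re

# Same pattern as in the regexp_replace() call above.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(text):
    """Remove anything that looks like <...> from the string."""
    return TAG_PATTERN.sub("", text)


# Simple markup is handled as expected.
print(strip_tags("<div><br></div><div>Hello</div>"))  # Hello

# A '>' inside an attribute value ends the match too early,
# so part of the tag leaks into the result.
print(repr(strip_tags('<a title="x > y">link</a>')))  # ' y">link'

# Plain text containing '<' and '>' is mistaken for a tag and removed.
print(repr(strip_tags("if a < b and c > d")))  # 'if a  d'

# HTML entities are left untouched.
print(strip_tags("<p>We don&#39;t</p>"))  # We don&#39;t
```

If the column only ever contains simple tags like the sample, the pattern is probably good enough; otherwise an HTML parser is the safer choice.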

It worked on your sample string when I tried it on RegExr.

PySpark output:

df.withColumn("parsed", regexp_replace("descn", "<[^>]+>", "")).select("parsed").collect()

[Row(parsed='PROTEUSÂ We are struggling with pathology. We don&#39;t control specimens of prostatectomy. The hospital pathology is not cooperating. I am reaching out to another hospital. You have pretty intense manual guidelines on pathology in the [PROTEUS] protocol for managing of RP [specimens]. Please e-mail me with work around options.')]
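Note that the result above still contains the entity &#39; where the expected output has an apostrophe. One possible way to handle that, sketched here under assumptions, is to wrap the tag stripping and Python's html.unescape() in a UDF. The names html_to_text and add_plain_text_column are hypothetical, and a UDF is generally slower than built-in column functions. The stray "Â" in the sample looks like an encoding problem from reading the file and is not addressed here.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def html_to_text(value):
    """Strip tags, decode entities, and collapse whitespace."""
    if value is None:
        return None
    # Strip tags first so that escaped text such as &lt;b&gt;
    # is not mistaken for a real tag after decoding.
    text = TAG_RE.sub("", value)
    # Decode entities such as &#39; and &amp;.
    text = html.unescape(text)
    # Collapse runs of whitespace (including non-breaking spaces).
    return WHITESPACE_RE.sub(" ", text).strip()


def add_plain_text_column(df, src="descn", dst="parsed_descn"):
    """Add a column with the cleaned text (requires pyspark)."""
    # Imported here so html_to_text can be tested without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    to_text = udf(html_to_text, StringType())
    return df.withColumn(dst, to_text(df[src]))
```

Usage would then be something like df = add_plain_text_column(df), assuming df is the DataFrame from the question with a 'descn' column.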

Source: https://stackoverflow.com/questions/58797064/how-to-convert-html-text-into-plain-text-using-pyspark-replacing-html-tags-from

Tags

pyspark

pyspark-sql