Pyparsing: extract variable length, variable content, variable whitespace substring

為{幸葍}努か submitted on 2019-11-28 12:58:43

Here is a sample to pull out the patient data and any matching Gleason data.

from pyparsing import Combine, Group, Optional, Word, nums
num = Word(nums)
accessionDate = Combine(num + "/" + num + "/" + num)("accDate")
accessionNumber = Combine("S" + num + "-" + num)("accNum")
patMedicalRecordNum = Combine(num + "/" + num + "-" + num + "-" + num)("patientNum")
gleason = Group("GLEASON" + Optional("SCORE:") + num("left") + "+" + num("right") + "=" + num("total"))
assert 'GLEASON 5+4=9' == gleason
assert 'GLEASON SCORE:  3 + 3 = 6' == gleason

patientData = Group(accessionDate + accessionNumber + patMedicalRecordNum)
assert '01/02/11  S11-4444 20/111-22-3333' == patientData

partMatch = patientData("patientData") | gleason("gleason")

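# `data` is the report text from the question.
# Remember the most recent patient record so each Gleason score can be attached to it.
lastPatientData = None
for match in partMatch.searchString(data):
    if match.patientData:
        lastPatientData = match
    elif match.gleason:
        if lastPatientData is None:
            # A Gleason score appeared before any patient record.
            print("bad!")
            continue
        print("{0.accDate}: {0.accNum} {0.patientNum} Gleason({1.left}+{1.right}={1.total})".format(
            lastPatientData.patientData, match.gleason
        ))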

Prints:

01/01/11: S11-55555 20/444-55-6666 Gleason(5+4=9)
01/02/11: S11-4444 20/111-22-3333 Gleason(3+3=6)

Take a look at the SkipTo parse element in pyparsing. If you define a pyparsing expression for the num+num=num part, you should be able to use SkipTo to skip anything between "Gleason" and that expression. Roughly like this (untested pseudo-pyparsing):

score = num + "+" + num + "=" + num
Gleason = "Gleason" + SkipTo(score) + score

Pyparsing skips whitespace between tokens by default anyway, and with SkipTo you can skip over anything that doesn't match your desired format.
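A runnable version of the SkipTo idea might look like the sketch below. The sample text, the result names (left, right, total, score), and the use of CaselessKeyword are assumptions made for illustration; they are not taken from the question's data.

```python
from pyparsing import CaselessKeyword, Group, SkipTo, Word, nums

num = Word(nums)

# The score itself, e.g. "3 + 4 = 7" or "5+4=9". Whitespace between the
# tokens is skipped by pyparsing automatically.
score = Group(num("left") + "+" + num("right") + "=" + num("total"))("score")

# SkipTo(score) consumes everything between the keyword and the first place
# where `score` matches, so the wording in between can vary freely.
gleason = CaselessKeyword("Gleason") + SkipTo(score) + score

# Hypothetical sample text, invented for this example.
text = (
    "Adenocarcinoma, Gleason score:   3 + 4 = 7 (left apex)\n"
    "GLEASON PATTERN 5+4=9\n"
)

found = [
    (m.score.left, m.score.right, m.score.total)
    for m in gleason.searchString(text)
]
print(found)
```

Note that SkipTo keeps scanning until its target matches, so a "Gleason" mention with no score after it could pick up a score that belongs to a later mention. Whether that matters depends on how the reports are laid out.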

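import re

# Spaces are stripped before matching, so the pattern contains none.
# The "+" is escaped so it is a literal plus sign, not a quantifier.
gleason = re.compile(r"gleason(\d\+\d=\d+)")
scores = set()
for record in records:  # `records` holds the report strings from the question
    for line in record.lower().split("\n"):
        m = gleason.search(line.replace(" ", ""))
        if m:
            scores.add(m.group(1))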

Or something along those lines.
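A variation on the regex approach, sketched under the assumption that no digits appear between the word "Gleason" and the score, is to let the pattern itself absorb the variable text and whitespace instead of stripping spaces first. The sample text and the helper name find_gleason_scores are made up for illustration.

```python
import re

# \D*? lazily skips any non-digit text after "gleason" (labels, colons, spaces),
# and \s* tolerates whitespace around "+" and "=".
GLEASON_RE = re.compile(r"gleason\D*?(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)", re.IGNORECASE)


def find_gleason_scores(text):
    """Return (left, right, total) string tuples for each Gleason score found."""
    return GLEASON_RE.findall(text)


# Hypothetical sample text, invented for this example.
sample = (
    "Adenocarcinoma, Gleason score:   3 + 4 = 7 (left apex)\n"
    "GLEASON PATTERN 5+4=9\n"
)
scores = find_gleason_scores(sample)
print(scores)
```

Because \D also matches newlines, this pattern can reach across line breaks; if each score is known to sit on the same line as its "Gleason" label, applying the pattern line by line is the safer choice.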
