Question
If I have a dictionary:
mydict = {"g18_84pp_2A_MVP1_GoodiesT0-HKJ-DFG_MIX-CMVP1_Y1000-MIX.txt" : 0,
"g18_84pp_2A_MVP2_GoodiesT0-HKJ-DFG_MIX-CMVP2_Y1000-MIX.txt" : 1,
"g18_84pp_2A_MVP3_GoodiesT0-HKJ-DFG_MIX-CMVP3_Y1000-MIX.txt" : 2,
"g18_84pp_2A_MVP4_GoodiesT0-HKJ-DFG_MIX-CMVP4_Y1000-MIX.txt" : 3,
"g18_84pp_2A_MVP5_GoodiesT0-HKJ-DFG_MIX-CMVP5_Y1000-MIX.txt" : 4,
"g18_84pp_2A_MVP6_GoodiesT0-HKJ-DFG_MIX-CMVP6_Y1000-MIX.txt" : 5,
"h18_84pp_3A_MVP1_GoodiesT1-HKJ-DFG-CMVP1_Y1000-FIX.txt" : 6,
"g18_84pp_2A_MVP7_GoodiesT0-HKJ-DFG_MIX-CMVP7_Y1000-MIX.txt" : 7,
"h18_84pp_3A_MVP2_GoodiesT1-HKJ-DFG-CMVP2_Y1000-FIX.txt" : 8,
"h18_84pp_3A_MVP3_GoodiesT1-HKJ-DFG-CMVP3_Y1000-FIX.txt" : 9,
"p18_84pp_2B_MVP1_GoodiesT2-HKJ-DFG-CMVP3_Y1000-FIX.txt" : 10}
I want to extract the common part g18_84pp_2A_MVP_GoodiesT0 that comes before the first -. I also want to append _MIX to g18_84pp_2A_MVP_GoodiesT0 when the word MIX is found in the first group. Assuming that I can classify the keys of mydict into two groups depending on whether they contain MIX or FIX, the final output dictionary would be:
OutputNameDict= {"g18_84pp_2A_MVP_GoodiesT0_MIX" : 0,
"h18_84pp_3A_MVP_GoodiesT1_FIX" : 1,
"p18_84pp_2B_MVP_FIX": 2}
Is there any function I could use to find the common part? How do I pick out the word before or after a particular symbol like -, and how do I find particular words like MIX or FIX?
Answer 1:
You can use split to get the common part:
s = "g18_84pp_2A_MVP1_GoodiesT0-HKJ-DFG_MIX-CMVP1_Y1000-MIX.txt"
n = s.split('-')[0]
In fact, split will give you a list of the tokens delimited by '-', so s.split('-') yields:
['g18_84pp_2A_MVP1_GoodiesT0', 'HKJ', 'DFG_MIX', 'CMVP1_Y1000', 'MIX.txt']
To see if MIX or FIX is in a string, you can use in:
if 'MIX' in s:
    print("then MIX is in the string s")
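One caveat with the in test: it matches anywhere in the string, so a name that contains DFG_MIX in the middle but ends in FIX.txt would still count as MIX. A minimal sketch of a stricter check, assuming every name ends in -MIX.txt or -FIX.txt (the helper name kind_of is hypothetical):

```python
name = "h18_84pp_3A_MVP1_GoodiesT1-HKJ-DFG_MIX-CMVP1_Y1000-FIX.txt"

print('MIX' in name)             # True: matches the 'DFG_MIX' token in the middle
print(name.endswith('MIX.txt'))  # False: the name actually ends in FIX.txt


def kind_of(filename):
    """Classify a filename as 'MIX' or 'FIX' by its ending, or None if neither."""
    for kind in ('MIX', 'FIX'):
        # Assumption: the ending is always '-<kind>.txt'.
        if filename.endswith('-' + kind + '.txt'):
            return kind
    return None


print(kind_of(name))  # FIX
```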
If you want to get rid of the numbers after 'MVP', you can use the re module:
import re
s = 'g18_84pp_2A_MVP1_GoodiesT0'
s = re.sub('MVP[0-9]*','MVP',s)
Here is a sample function to get a list of the common parts:
def foo(mydict):
    return [re.sub('MVP[0-9]*', 'MVP', k.split('-')[0]) for k in mydict]
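The pieces above (split, re.sub, and a check for the ending) could be combined to build a numbered dictionary along the lines the question asks for. This is only a sketch: the function name build_output_names and the shortened sample dictionary are made up for illustration, and it keeps the GoodiesT2 part for the p18 key, which differs from the third entry in the question's expected output.

```python
import re


def build_output_names(mydict):
    """Map each distinct '<common part>_<MIX|FIX>' name to a running index."""
    names = set()
    for key in mydict:
        # Text before the first '-', with the digits after 'MVP' removed.
        common = re.sub(r'MVP[0-9]*', 'MVP', key.split('-')[0])
        # Last dash-delimited token without its extension, e.g. 'MIX'.
        kind = key.split('-')[-1].split('.')[0]
        names.add(common + '_' + kind)
    # Number the distinct names in sorted order.
    return dict((name, i) for i, name in enumerate(sorted(names)))


# Shortened sample taken from the question's dictionary.
sample = {
    "g18_84pp_2A_MVP1_GoodiesT0-HKJ-DFG_MIX-CMVP1_Y1000-MIX.txt": 0,
    "g18_84pp_2A_MVP2_GoodiesT0-HKJ-DFG_MIX-CMVP2_Y1000-MIX.txt": 1,
    "h18_84pp_3A_MVP1_GoodiesT1-HKJ-DFG-CMVP1_Y1000-FIX.txt": 6,
    "p18_84pp_2B_MVP1_GoodiesT2-HKJ-DFG-CMVP3_Y1000-FIX.txt": 10,
}
result = build_output_names(sample)
print(result)
```

Collecting the names in a set first means duplicates (MVP1, MVP2, ... of the same group) collapse into one entry before the indices are assigned.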
Answer 2:
You can use the string's index() method to find the first dash, and then slice the string on either side of that position. For instance:
mydict = {"g18_84pp_2A_MVP1_GoodiesT0-HKJ-DFG_MIX-CMVP1_Y1000-MIX.txt" : 0,
"g18_84pp_2A_MVP2_GoodiesT0-HKJ-DFG_MIX-CMVP2_Y1000-MIX.txt" : 1,
"g18_84pp_2A_MVP3_GoodiesT0-HKJ-DFG_MIX-CMVP3_Y1000-MIX.txt" : 2,
"g18_84pp_2A_MVP4_GoodiesT0-HKJ-DFG_MIX-CMVP4_Y1000-MIX.txt" : 3,
"g18_84pp_2A_MVP5_GoodiesT0-HKJ-DFG_MIX-CMVP5_Y1000-MIX.txt" : 4,
"g18_84pp_2A_MVP6_GoodiesT0-HKJ-DFG_MIX-CMVP6_Y1000-MIX.txt" : 5,
"g18_84pp_2A_MVP7_GoodiesT0-HKJ-DFG_MIX-CMVP7_Y1000-MIX.txt" : 6,
"h18_84pp_3A_MVP1_GoodiesT1-HKJ-DFG_MIX-CMVP1_Y1000-FIX.txt" : 7,
"h18_84pp_3A_MVP2_GoodiesT1-HKJ-DFG_MIX-CMVP2_Y1000-FIX.txt" : 8,
"h18_84pp_3A_MVP2_GoodiesT1-HKJ-DFG_MIX-CMVP3_Y1000-FIX.txt" : 9}
for value in sorted(mydict):
    index = value.index('-')
    extracted = value[index+1:-4]  # Goes past the first occurrence of - and removes .txt from the end
    print(extracted[-3:])  # The last 3 letters of the string
This prints the following:
MIX
MIX
MIX
MIX
MIX
MIX
MIX
FIX
FIX
FIX
Then if statements can be used to do what you would like.
If you want to extract just the common part:
index = value.index('-')
extracted = value[:index] # Will get g18_84pp_2A_MVP1_GoodiesT0
Then, to figure out which ending to use: if you know that every key in mydict always ends in MIX.txt or FIX.txt, you can do this:
for value in sorted(mydict):
    ending = value[-7:-4]
    index = value.index('-')
    extracted = value[:index]
    print("%s_%s" % (extracted, ending))
Which prints:
g18_84pp_2A_MVP1_GoodiesT0_MIX
g18_84pp_2A_MVP2_GoodiesT0_MIX
g18_84pp_2A_MVP3_GoodiesT0_MIX
g18_84pp_2A_MVP4_GoodiesT0_MIX
g18_84pp_2A_MVP5_GoodiesT0_MIX
g18_84pp_2A_MVP6_GoodiesT0_MIX
g18_84pp_2A_MVP7_GoodiesT0_MIX
h18_84pp_3A_MVP1_GoodiesT1_FIX
h18_84pp_3A_MVP2_GoodiesT1_FIX
h18_84pp_3A_MVP2_GoodiesT1_FIX
You can then add each of these names to the output dictionary.
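The slices above hard-code the length of .txt and of the three-letter ending. As a hedged alternative, the same two pieces can be pulled out without counting characters, using os.path.splitext together with str.partition and str.rpartition. The helper name split_name is hypothetical, and the sketch assumes the name contains at least one dash.

```python
import os.path


def split_name(filename):
    """Return (text before the first '-', text after the last '-') without the extension."""
    stem, _ext = os.path.splitext(filename)       # drops '.txt' (or any other extension)
    prefix, _sep, _rest = stem.partition('-')     # everything before the first '-'
    _head, _sep, ending = stem.rpartition('-')    # everything after the last '-'
    return prefix, ending


prefix, ending = split_name("g18_84pp_2A_MVP1_GoodiesT0-HKJ-DFG_MIX-CMVP1_Y1000-MIX.txt")
print("%s_%s" % (prefix, ending))  # g18_84pp_2A_MVP1_GoodiesT0_MIX
```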
Answer 3:
Thanks for the answers. My complete code is as follows. Any suggestions for optimizing it?
import re
mydict = {"g18_84pp_2A_MVP1_GoodiesT0-HKJ-DFG_MIX-CMVP1_Y1000-MIX.txt" : 0,
"g18_84pp_2A_MVP2_GoodiesT0-HKJ-DFG_MIX-CMVP2_Y1000-MIX.txt" : 1,
"g18_84pp_2A_MVP3_GoodiesT0-HKJ-DFG_MIX-CMVP3_Y1000-MIX.txt" : 2,
"g18_84pp_2A_MVP4_GoodiesT0-HKJ-DFG_MIX-CMVP4_Y1000-MIX.txt" : 3,
"g18_84pp_2A_MVP5_GoodiesT0-HKJ-DFG_MIX-CMVP5_Y1000-MIX.txt" : 4,
"g18_84pp_2A_MVP6_GoodiesT0-HKJ-DFG_MIX-CMVP6_Y1000-MIX.txt" : 5,
"h18_84pp_3A_MVP1_GoodiesT1-HKJ-DFG-CMVP1_Y1000-FIX.txt" : 6,
"g18_84pp_2A_MVP7_GoodiesT0-HKJ-DFG_MIX-CMVP7_Y1000-MIX.txt" : 7,
"h18_84pp_3A_MVP2_GoodiesT1-HKJ-DFG-CMVP2_Y1000-FIX.txt" : 8,
"h18_84pp_3A_MVP3_GoodiesT1-HKJ-DFG-CMVP3_Y1000-FIX.txt" : 9,
"p18_84pp_2B_MVP1_GoodiesT2-HKJ-DFG-CMVP3_Y1000-FIX.txt" : 10}
ExtractDict = {}
start = 0
for stringList in sorted(mydict):
    stringList = stringList.split('.')[0]  # drop the .txt extension
    underscore = stringList.split('_')
    Area = re.split('[0-9]+', underscore[3])[0]  # MVP etc.
    CaseNameString = underscore[0] + "_" + underscore[1] + "_" + underscore[2] + "_" + Area  # g18_84pp_2A_MVP etc.
    postfix = stringList.split('-')[4]  # MIX or FIX
    Newstring = CaseNameString + "_" + postfix
    ExtractDict[Newstring] = start
    start += 1
startagain = 0
OutputNameDict = {}
for OutputNameList in sorted(ExtractDict):
    OutputNameDict[OutputNameList] = startagain
    startagain += 1
#OutputNameDict = {'h18_84pp_3A_MVP_FIX': 1, 'p18_84pp_2B_MVP_FIX': 2, 'g18_84pp_2A_MVP_MIX': 0}
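One possible way to tighten this, offered as a sketch rather than a definitive rewrite: parse each key with a single regular expression and number the distinct names with enumerate, which removes the intermediate dictionary and the manual counters. The pattern and the names KEY_PATTERN, short_name and number_names are assumptions made for illustration; the pattern assumes three underscore-separated fields, then letters followed by optional digits, and a final dash-delimited token before the extension.

```python
import re

# Assumed key layout: <a>_<b>_<c>_<letters><digits>_<...>-<...>-<postfix>.<ext>
KEY_PATTERN = re.compile(
    r'^(?P<prefix>[^_]+_[^_]+_[^_]+)_(?P<area>[A-Za-z]+)[0-9]*_.*-(?P<postfix>[^-.]+)\.[^.]+$'
)


def short_name(key):
    """Turn one key into '<prefix>_<area>_<postfix>', e.g. 'g18_84pp_2A_MVP_MIX'."""
    m = KEY_PATTERN.match(key)
    if m is None:
        raise ValueError("unexpected key format: %r" % key)
    return "%s_%s_%s" % (m.group('prefix'), m.group('area'), m.group('postfix'))


def number_names(keys):
    """Number the distinct short names in sorted order."""
    unique = sorted(set(short_name(k) for k in keys))
    return dict((name, i) for i, name in enumerate(unique))


# A few keys from the dictionary above.
keys = [
    "g18_84pp_2A_MVP1_GoodiesT0-HKJ-DFG_MIX-CMVP1_Y1000-MIX.txt",
    "g18_84pp_2A_MVP7_GoodiesT0-HKJ-DFG_MIX-CMVP7_Y1000-MIX.txt",
    "h18_84pp_3A_MVP1_GoodiesT1-HKJ-DFG-CMVP1_Y1000-FIX.txt",
    "p18_84pp_2B_MVP1_GoodiesT2-HKJ-DFG-CMVP3_Y1000-FIX.txt",
]
print(number_names(keys))
```

Raising on a non-matching key is a design choice: it surfaces unexpected filenames early instead of failing later with an IndexError from a positional split.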
Source: https://stackoverflow.com/questions/17701804/how-to-extract-the-common-words-before-particular-symbol-and-find-particular-wor