Question
This is closely related to a previous question, but I am having difficulty adapting it to my use case.
I have a sentence: "Forbes Asia 200 Best Under 500 Billion 2011"
I have tokens like:
oldTokens = [u'Forbes', u'Asia', u'200', u'Best', u'Under', u'500', u'Billion', u'2011']
A previous parser has also worked out the token indices where location or number slots belong:
numberTokenIDs = {(7,): 2011.0, (2,): 200.0, (5,6): 500000000000.00}
locationTokenIDs = {(0, 1): u'Forbes Asia'}
The token IDs correspond to the indices of the tokens that are locations or numbers. The objective is to obtain a new set of tokens like:
newTokens = [u'ForbesAsia', u'200', u'Best', u'Under', u'500Billion', u'2011']
The number and location token IDs would need to be updated as well (to avoid index-out-of-bounds exceptions), perhaps like this:
numberTokenIDs = {(5,): 2011.0, (1,): 200.0, (4,): 500000000000.00}
locationTokenIDs = {(0,): u'Forbes Asia'}
Essentially, I would like to go through the new, reduced set of tokens and replace the token at each token ID with either LOCATION_SLOT or NUMBER_SLOT, ultimately producing the sentence:
"LOCATION_SLOT NUMBER_SLOT Best Under NUMBER_SLOT NUMBER_SLOT"
If I did this with the current set of number and location token IDs, I would get:
"LOCATION_SLOT LOCATION_SLOT NUMBER_SLOT Best Under NUMBER_SLOT NUMBER_SLOT NUMBER_SLOT".
How would I do this?
Another example is:
Location token IDs are: (0, 1)
Number token IDs are: (3, 4)
Old sampleTokens: [u'United', u'Kingdom', u'USD', u'1.240', u'billion']
Here I want to both delete tokens and update the location and number token IDs, so that I can replace tokens like this:
sampleTokens[numberTokenID] = "NUMBER_SLOT"
sampleTokens[locationTokenID] = "LOCATION_SLOT"
so that the resulting tokens are [u'LOCATION_SLOT', u'USD', u'NUMBER_SLOT']
Note that the concatenation should join all the tokens in the tuple when there is more than one (a tuple can also contain more than two elements, e.g. The United States of America).
Answer 1:
This should work (if I understood correctly):
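token_by_index = dict(enumerate(oldTokens))
# list(...) keeps this working on Python 3, where dict.keys() returns a view
groups = list(numberTokenIDs) + list(locationTokenIDs)
for group in groups:
    # merge every token of the group into the slot of its first index
    token_by_index[group[0]] = ''.join(token_by_index.pop(index)
                                       for index in group)
# sorting the (index, token) pairs restores the original token order
newTokens = [token for _, token in sorted(token_by_index.items())]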
To find the new token IDs:
# assumes every token in newTokens is unique, since the lookup goes by token text
new_index_by_token = {token: index for index, token in enumerate(newTokens)}
numberTokenIDs = {(new_index_by_token[token_by_index[group[0]]],): value
                  for group, value in numberTokenIDs.items()}
locationTokenIDs = {(new_index_by_token[token_by_index[group[0]]],): value
                    for group, value in locationTokenIDs.items()}
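As an alternative that tracks positions instead of looking tokens up by their text (which matters when the same token appears twice in a sentence), here is a minimal Python 3 sketch. The names merge_groups and fill_slots are hypothetical helpers introduced for illustration, not part of the answer above. The sketch assumes the index groups do not overlap and that each tuple lists its indices in ascending order.

```python
def merge_groups(tokens, *id_maps):
    """Merge each index group into one token and remap the group keys.

    tokens  -- list of token strings
    id_maps -- dicts mapping tuples of old indices to values
    Assumption: groups do not overlap and each tuple is in ascending order.
    Returns (new_tokens, new_id_maps), where every key is now a 1-tuple.
    """
    # Map the first old index of every group to the whole group.
    group_start = {group[0]: group for id_map in id_maps for group in id_map}
    # Old indices swallowed by a group (all but the group's first index).
    absorbed = {i for group in group_start.values() for i in group[1:]}

    new_tokens = []
    new_position = {}  # first old index of a group -> index in new_tokens
    for old_index, token in enumerate(tokens):
        if old_index in absorbed:
            continue
        if old_index in group_start:
            token = ''.join(tokens[i] for i in group_start[old_index])
            new_position[old_index] = len(new_tokens)
        new_tokens.append(token)

    new_id_maps = [
        {(new_position[group[0]],): value for group, value in id_map.items()}
        for id_map in id_maps
    ]
    return new_tokens, new_id_maps


def fill_slots(tokens, slot_maps):
    """Return a copy of tokens with each remapped index replaced by its slot name."""
    slotted = list(tokens)
    for slot_name, id_map in slot_maps.items():
        for (index,) in id_map:
            slotted[index] = slot_name
    return slotted


old_tokens = ['Forbes', 'Asia', '200', 'Best', 'Under', '500', 'Billion', '2011']
number_ids = {(7,): 2011.0, (2,): 200.0, (5, 6): 500000000000.00}
location_ids = {(0, 1): 'Forbes Asia'}

new_tokens, (new_number_ids, new_location_ids) = merge_groups(
    old_tokens, number_ids, location_ids)
slotted = fill_slots(new_tokens, {'NUMBER_SLOT': new_number_ids,
                                  'LOCATION_SLOT': new_location_ids})
print(new_tokens)
print(' '.join(slotted))
```

Because the new index is recorded while the merged list is being built, a repeated token such as a second '2011' cannot be confused with the first. The same two calls applied to the United Kingdom example should give the three-token result described in the question.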
Source: https://stackoverflow.com/questions/38917962/create-new-tokens-and-tuples-from-existing-ones-based-on-conditions