Disable token breaks on punctuation LUIS.ai

Question

I am working with the Language Understanding Intelligent Service (LUIS.ai) API from Microsoft Cognitive Services.

Whenever LUIS parses text, it inserts whitespace token breaks around punctuation.

This behavior is intentional, according to the documentation.

"English, French, Italian, Spanish: token breaks are inserted at any whitespace, and around any punctuation."

For my project, I need to preserve the original query string without these token breaks, because some entities trained for my model include punctuation. Stripping the extra whitespace from the parsed entities afterwards is annoying and a bit hacky.

Example of this behavior:

Is there a way to disable this? It would save quite a bit of effort.

Thanks!!

Answer 1:

Unfortunately, there's no way to disable that for now. The good news is that the predictions returned refer to the original string, not the tokenized one you see while labeling examples.

The documentation on understanding the JSON response shows that the example output preserves the original "query" string, and that the extracted entities carry zero-based character indices ("startIndex", "endIndex") into the original string. This lets you work with the indices instead of the parsed entity phrases.

{
  "query": "Book me a flight to Boston on May 4",
  "intents": [
    {
      "intent": "BookFlight",
      "score": 0.919818342
    },
    {
      "intent": "None",
      "score": 0.136909246
    },
    {
      "intent": "GetWeather",
      "score": 0.007304534
    }
  ],
  "entities": [
    {
      "entity": "boston",
      "type": "Location::ToLocation",
      "startIndex": 20,
      "endIndex": 25,
      "score": 0.621795356
    },
    {
      "entity": "may 4",
      "type": "builtin.datetime.date",
      "startIndex": 30,
      "endIndex": 34,
      "resolution": {
        "date": "XXXX-05-04"
      }
    }
  ]
}
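
As a rough illustration of working with the indices, here is a minimal Python sketch that slices the original "query" string using each entity's "startIndex" and "endIndex". It assumes the response has the shape shown above and that "endIndex" is inclusive, as it appears to be in this sample (20 to 25 covers "Boston"). The helper name original_entity_text is made up for this example and is not part of any LUIS SDK.

```python
import json

# Trimmed copy of the sample response above; only the fields used here are kept.
RESPONSE_JSON = """
{
  "query": "Book me a flight to Boston on May 4",
  "entities": [
    {"entity": "boston", "type": "Location::ToLocation",
     "startIndex": 20, "endIndex": 25},
    {"entity": "may 4", "type": "builtin.datetime.date",
     "startIndex": 30, "endIndex": 34}
  ]
}
"""


def original_entity_text(query: str, entity: dict) -> str:
    """Return the entity as it was typed in the query (hypothetical helper)."""
    # Assumption: endIndex is inclusive, as in the sample, hence the +1.
    return query[entity["startIndex"]:entity["endIndex"] + 1]


response = json.loads(RESPONSE_JSON)
originals = [
    original_entity_text(response["query"], entity)
    for entity in response["entities"]
]
print(originals)  # ['Boston', 'May 4']
```

Because the slice is taken from the untouched query, any punctuation, casing, and spacing inside an entity come back exactly as the user entered them, so there is no extra whitespace to strip.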

Source: https://stackoverflow.com/questions/38749246/disable-token-breaks-on-punctuation-luis-ai

Tags

microsoft-cognitive

luis