Here is an idea:
We have web applications that expose RESTful APIs which accept JSON. Now, how about using Google speech APIs to take the user's voice input and convert it to tex
This is called "intent analysis". There are libraries for this, for example Rasa.
For example, if your input is "show me chinese restaurants", the output would be:
{
  "text": "show me chinese restaurants",
  "intent": "restaurant_search",
  "entities": [
    {
      "start": 8,
      "end": 15,
      "value": "chinese",
      "entity": "cuisine"
    }
  ]
}
Overall, it is pretty advanced NLU (natural language understanding).
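To make the last step concrete, here is a rough sketch of how a parse result like the one above could be turned into a JSON request for one of the web application's APIs. It is only an illustration: the INTENT_ROUTES table, the /api/restaurants/search path and the build_request function are made-up names, not part of Rasa or of any real API, and the sketch assumes the parse result has already been loaded into a Python dict.

```python
import json

# Hypothetical mapping from intent name to (HTTP method, path).
# These routes are invented for illustration only.
INTENT_ROUTES = {
    "restaurant_search": ("POST", "/api/restaurants/search"),
}


def build_request(nlu_result):
    """Turn an NLU parse result into a (method, path, json_body) tuple."""
    intent = nlu_result["intent"]
    if intent not in INTENT_ROUTES:
        raise ValueError(f"no route for intent: {intent!r}")
    method, path = INTENT_ROUTES[intent]

    text = nlu_result["text"]
    params = {}
    for ent in nlu_result["entities"]:
        # Sanity check: start/end should slice the entity value out of the text.
        if text[ent["start"]:ent["end"]] != ent["value"]:
            raise ValueError(f"bad offsets for entity: {ent!r}")
        # The entity name becomes the JSON key, e.g. {"cuisine": "chinese"}.
        params[ent["entity"]] = ent["value"]

    return method, path, json.dumps(params)


nlu_result = {
    "text": "show me chinese restaurants",
    "intent": "restaurant_search",
    "entities": [
        {"start": 8, "end": 15, "value": "chinese", "entity": "cuisine"}
    ],
}

method, path, body = build_request(nlu_result)
print(method, path, body)
# prints: POST /api/restaurants/search {"cuisine": "chinese"}
```

The point of the lookup table is that the voice front end never needs to know the API details: adding a new voice command would only mean adding a new intent and a new entry in the table.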