I have a JSON file containing text like:

dr. goldberg offers everything.parking is good.he's nice and easy to talk

How can I extract the sentence that contains a given word, such as "parking"?
How about parsing the string and looking at the values?
import json

def sen_or_none(string):
    # Return the fragment if it mentions "parking", otherwise None.
    return string if "parking" in string.lower() else None

def walk(node):
    # Recursively search lists, dicts and strings for a matching fragment.
    if isinstance(node, list):
        for item in node:
            v = walk(item)
            if v:
                return v
    elif isinstance(node, dict):
        for key, item in node.items():
            v = walk(item)
            if v:
                return v
    elif isinstance(node, str):
        for item in node.split("."):
            v = sen_or_none(item)
            if v:
                return v
    return None

with open('data.json') as data_file:
    print(walk(json.load(data_file)))
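To see the traversal in action without a file on disk, here is a self-contained sketch of the same approach run against json.loads; the nested document below is made up for illustration:

```python
import json

def sen_or_none(text):
    # Return the fragment if it mentions "parking", otherwise None.
    return text if "parking" in text.lower() else None

def walk(node):
    # Recurse into lists and dicts; split strings on "." into fragments.
    if isinstance(node, list):
        for item in node:
            v = walk(item)
            if v:
                return v
    elif isinstance(node, dict):
        for item in node.values():
            v = walk(item)
            if v:
                return v
    elif isinstance(node, str):
        for part in node.split("."):
            v = sen_or_none(part)
            if v:
                return v
    return None

doc = json.loads('{"reviews": [{"text": "dr. goldberg offers everything.parking is good"}]}')
print(walk(doc))  # parking is good
```

The walk returns the first match it finds and None when nothing in the structure mentions the word.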
You can use nltk.tokenize:

from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

f = open("test_data.json").read()
sentences = sent_tokenize(f)
# all sentences that contain the target word
my_sentence = [sent for sent in sentences if 'parking' in word_tokenize(sent)]

As a complete solution, you can wrap it in a function:
>>> def sentence_finder(text, word):
...     sentences = sent_tokenize(text)
...     return [sent for sent in sentences if word in word_tokenize(sent)]
>>> s="dr. goldberg offers everything. parking is good. he's nice and easy to talk"
>>> sentence_finder(s,'parking')
['parking is good.']
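If pulling in NLTK feels heavy, a rough stdlib-only equivalent of sentence_finder can be sketched with re. Note this naive splitter has no abbreviation handling, so unlike sent_tokenize it will split "dr." into its own sentence:

```python
import re

def sentence_finder(text, word):
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    # Unlike NLTK's sent_tokenize, this mis-splits abbreviations like "dr.".
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return [sent for sent in sentences
            if word.lower() in sent.lower().split()]

s = "dr. goldberg offers everything. parking is good. he's nice and easy to talk"
print(sentence_finder(s, 'parking'))  # ['parking is good.']
```

Good enough for quick filtering, but NLTK's trained tokenizer is the safer choice for real text.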
You can use the standard library re module:

import re

line = "dr. goldberg offers everything.parking is good.he's nice and easy to talk"
res = re.search(r"\.?([^.]*parking[^.]*)", line)
if res is not None:
    print(res.group(1))
It will print parking is good. The idea is simple: the pattern searches for a sentence starting from an optional dot character (.), then consumes all non-dots, the word parking, and the rest of the non-dots. The question mark handles the case where your sentence is at the start of the line.
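The same pattern also works with re.findall when more than one sentence mentions the word; the made-up line below additionally shows the optional leading dot letting a match start at the beginning of the string:

```python
import re

# Hypothetical input: two sentences mention "parking", and the first one
# sits at the very start of the line, where no leading dot is present.
line = "parking is scarce.dr. goldberg offers everything.parking is good."
matches = re.findall(r"\.?([^.]*parking[^.]*)", line)
print(matches)  # ['parking is scarce', 'parking is good']
```

findall returns only the captured group, so the sentence text comes back without its surrounding dots.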