I have a text file and I want to display all words that contains both z and x characters.
How can I do that ?
Sounds like a job for Regular Expressions. Read that and try it out. If you run into problems, update your question and we can help you with the specifics.
>>> import re
>>> print re.findall('(\w*x\w*z\w*|\w*z\w*x\w*)', 'axbzc azb axb abc axzb')
['axbzc', 'axzb']
I just want to point out how heavy-handed some of these regular expressions can be, in comparison to the simple string methods-based solution provided by Wooble.
Let's do some timings, shall we?
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import timeit
import re
import sys
WORD_RE_COMPILED = re.compile(r'\w+')
Z_RE_COMPILED = re.compile(r'(\b\w*z\w*\b)')
XZ_RE_COMPILED = re.compile(r'\b(\w*z\w*x\w*|\w*x\w*z\w*)\b')
##########################
# Tim Pietzcker's solution
# https://stackoverflow.com/questions/3962846/how-to-display-all-words-that-contain-these-characters/3962876#3962876
#
def xz_re_word_find(text):
for word in re.findall(r'\w+', text):
if 'x' in word and 'z' in word:
print word
# Tim's solution, compiled
def xz_re_word_compiled_find(text):
pattern = re.compile(r'\w+')
for word in pattern.findall(text):
if 'x' in word and 'z' in word:
print word
# Tim's solution, with the RE pre-compiled so compilation doesn't get
# included in the search time
def xz_re_word_precompiled_find(text):
for word in WORD_RE_COMPILED.findall(text):
if 'x' in word and 'z' in word:
print word
################################
# Steven Rumbalski's solution #1
# (provided in the comment)
# https://stackoverflow.com/questions/3962846/how-to-display-all-words-that-contain-these-characters/3963285#3963285
def xz_re_z_find(text):
for word in re.findall(r'(\b\w*z\w*\b)', text):
if 'x' in word:
print word
# Steven's solution #1 compiled
def xz_re_z_compiled_find(text):
pattern = re.compile(r'(\b\w*z\w*\b)')
for word in pattern.findall(text):
if 'x' in word:
print word
# Steven's solution #1 with the RE pre-compiled
def xz_re_z_precompiled_find(text):
for word in Z_RE_COMPILED.findall(text):
if 'x' in word:
print word
################################
# Steven Rumbalski's solution #2
# https://stackoverflow.com/questions/3962846/how-to-display-all-words-that-contain-these-characters/3962934#3962934
def xz_re_xz_find(text):
for word in re.findall(r'\b(\w*z\w*x\w*|\w*x\w*z\w*)\b', text):
print word
# Steven's solution #2 compiled
def xz_re_xz_compiled_find(text):
pattern = re.compile(r'\b(\w*z\w*x\w*|\w*x\w*z\w*)\b')
for word in pattern.findall(text):
print word
# Steven's solution #2 pre-compiled
def xz_re_xz_precompiled_find(text):
for word in XZ_RE_COMPILED.findall(text):
print word
#################################
# Wooble's simple string solution
def xz_str_find(text):
for word in text.split():
if 'x' in word and 'z' in word:
print word
functions = [
'xz_re_word_find',
'xz_re_word_compiled_find',
'xz_re_word_precompiled_find',
'xz_re_z_find',
'xz_re_z_compiled_find',
'xz_re_z_precompiled_find',
'xz_re_xz_find',
'xz_re_xz_compiled_find',
'xz_re_xz_precompiled_find',
'xz_str_find'
]
import_stuff = functions + [
'text',
'WORD_RE_COMPILED',
'Z_RE_COMPILED',
'XZ_RE_COMPILED'
]
if __name__ == '__main__':
text = open(sys.argv[1]).read()
timings = {}
setup = 'from __main__ import ' + ','.join(import_stuff)
for func in functions:
statement = func + '(text)'
timer = timeit.Timer(statement, setup)
min_time = min(timer.repeat(3, 10))
timings[func] = min_time
for func in functions:
print func + ":", timings[func], "seconds"
Running this script on a plaintext copy of Moby Dick obtained from Project Gutenberg, on Python 2.6, I get the following timings:
xz_re_word_find: 1.21829485893 seconds
xz_re_word_compiled_find: 1.42398715019 seconds
xz_re_word_precompiled_find: 1.40110301971 seconds
xz_re_z_find: 0.680151939392 seconds
xz_re_z_compiled_find: 0.673038005829 seconds
xz_re_z_precompiled_find: 0.673489093781 seconds
xz_re_xz_find: 1.11700701714 seconds
xz_re_xz_compiled_find: 1.12773990631 seconds
xz_re_xz_precompiled_find: 1.13285303116 seconds
xz_str_find: 0.590088844299 seconds
In Python 3.1 (after using 2to3 to fix the print statements), I get the following timings:
xz_re_word_find: 2.36110496521 seconds
xz_re_word_compiled_find: 2.34727501869 seconds
xz_re_word_precompiled_find: 2.32607793808 seconds
xz_re_z_find: 1.32204890251 seconds
xz_re_z_compiled_find: 1.34104800224 seconds
xz_re_z_precompiled_find: 1.34424304962 seconds
xz_re_xz_find: 2.33851099014 seconds
xz_re_xz_compiled_find: 2.29653286934 seconds
xz_re_xz_precompiled_find: 2.32416701317 seconds
xz_str_find: 0.656699895859 seconds
We can see that the regular expression-based functions tend to take twice as long to run as the string methods-based function in Python 2.6, and over 3 times as long in Python 3. The time difference is trivial for one-off parsing (nobody's going to miss those milliseconds), but for cases where the function must be called many times, the string methods-based approach is both simpler and faster.
Assuming you have the entire file as one large string in memory, and that the definition of a word is "a contiguous sequence of letters", then you could do something like this:
import re
for word in re.findall(r"\w+", mystring):
if 'x' in word and 'z' in word:
print word
>>> import re
>>> pattern = re.compile('\b(\w*z\w*x\w*|\w*x\w*z\w*)\b')
>>> document = '''Here is some data that needs
... to be searched for words that contain both z
... and x. Blah xz zx blah jal akle asdke asdxskz
... zlkxlk blah bleh foo bar'''
>>> print pattern.findall(document)
['xz', 'zx', 'asdxskz', 'zlkxlk']
If you don't want to have 2 problems:
for word in file('myfile.txt').read().split():
if 'x' in word and 'z' in word:
print word