I'm trying to write a script which will extract strings from an executable binary and save them in a file. Having this file be newline-separated isn't an option since the
To quote man strings:
STRINGS(1)                 GNU Development Tools                 STRINGS(1)

NAME
       strings - print the strings of printable characters in files.

[...]

DESCRIPTION
       For each file given, GNU strings prints the printable character
       sequences that are at least 4 characters long (or the number given
       with the options below) and are followed by an unprintable
       character. By default, it only prints the strings from the
       initialized and loaded sections of object files; for other types
       of files, it prints the strings from the whole file.
You could achieve a similar result by using a regex matching at least 4 printable characters. Something like this:
>>> import re
>>> content = "hello,\x02World\x88!"
>>> re.findall("[^\x00-\x1F\x7F-\xFF]{4,}", content)
['hello,', 'World']
Please note that this solution requires the entire file content to be loaded into memory.
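If holding the whole file in memory is a concern, one possible variation is to memory-map the file and run a bytes pattern over the mapping, so the operating system pages data in on demand. This is only a sketch under some assumptions: `regex_strings` and `PRINTABLE_RUN` are hypothetical names, and the pattern accepts only the bytes from space through tilde, which is the byte-level equivalent of the negated class above.

```python
import mmap
import re

# Runs of at least 4 printable ASCII bytes (space through tilde).
# PRINTABLE_RUN is a hypothetical name chosen for this sketch.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7E]{4,}")

def regex_strings(path, pattern=PRINTABLE_RUN):
    """Yield each printable run found in the file at `path` as a str."""
    with open(path, "rb") as f:
        try:
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        except ValueError:
            return  # an empty file cannot be memory-mapped
        with mm:
            # The re module accepts bytes-like objects, including mmap.
            for match in pattern.finditer(mm):
                yield match.group().decode("ascii")
```

It can be consumed like any other generator, for example `for s in regex_strings("something.bin"): print(s)`.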
Here's a generator that yields all the strings of printable characters >= min (4 by default) in length that it finds in filename:
import string

def strings(filename, min=4):
    with open(filename, errors="ignore") as f:  # Python 3.x
    # with open(filename, "rb") as f:           # Python 2.x
        result = ""
        for c in f.read():
            if c in string.printable:
                result += c
                continue
            if len(result) >= min:
                yield result
            result = ""
        if len(result) >= min:  # catch result at EOF
            yield result
Which you can iterate over:
for s in strings("something.bin"):
    print(s)  # do something with s
... or store in a list:
sl = list(strings("something.bin"))
I've tested this very briefly, and it seems to give the same output as the Unix strings command for the arbitrary binary file I chose. However, it's pretty naïve (for a start, it reads the whole file into memory at once, which might be expensive for large files), and is very unlikely to approach the performance of the Unix strings command.
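To address the memory point, here is a rough sketch of a variant that reads the file in fixed-size chunks and carries an unfinished run across chunk boundaries. The names `strings_chunked`, `PRINTABLE_BYTES`, `min_len` and `chunk_size` are made up for this example, and it assumes Python 3, where iterating over bytes yields integers. It has not been benchmarked against the Unix strings command.

```python
import string

# The same characters the generator above accepts, as byte values.
PRINTABLE_BYTES = frozenset(string.printable.encode("ascii"))

def strings_chunked(filename, min_len=4, chunk_size=64 * 1024):
    """Yield printable runs of at least min_len characters, reading in chunks."""
    run = bytearray()  # current run; survives across chunk boundaries
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            for byte in chunk:  # each item is an int in Python 3
                if byte in PRINTABLE_BYTES:
                    run.append(byte)
                    continue
                if len(run) >= min_len:
                    yield run.decode("ascii")
                run.clear()
    if len(run) >= min_len:  # catch a run that ends at EOF
        yield run.decode("ascii")
```

Only one chunk plus the current run is held at a time, so memory use no longer grows with file size, although a very long printable run is still buffered in full.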