问题
I use currently Vim to read SO's data dump. However, my Macbook slows down when I roll down just a few rows. This suggests me that there must be more efficient ways to read the data.
I know little MySQL. The files are in .xml -format. It is rather hard to read the data at the moment in .xml. It may be more efficient to convert the xml -files to MySQL and then read the files. I know only MS db -tool for such actions. However, I would like to know another tool too.
Problems
- to parse .xml to SQL -queries such that MySQL understand it. We need to know data structures of the data.
- to run the data in MySQL
- to find some tool similar to MS db -tool by which we can read the data effectively
How do you read SO's data dump effectively?
--
[edit]
- How can you run the 523 SQL queries to create the database in your terminal? I have the commands at the moment in a text -file.
- How can you "switch to [the recovery mode] to a simple recovery mode in the database?
回答1:
I made my first ever python program to read them and output SQL insert statements for use with mysql (It's ugly but worked). You'll need to create the tables first though by hand.
import xml.sax.handler
import xml.sax
import sys
class SOHandler(xml.sax.handler.ContentHandler):
def __init__(self):
self.errParse = 0
def startElement(self, name, attributes):
if name != "row":
self.table = name;
self.outFile = open(name+".sql","w")
self.errfile = open(name+".err","w")
else:
skip = 0
currentRow = u"insert into "+self.table+"("
for attr in attributes.keys():
currentRow += str(attr) + ","
currentRow = currentRow[:-1]
currentRow += u") values ("
for attr in attributes.keys():
try:
currentRow += u'"{0}",'.format(attributes[attr].replace('\\','\\\\').replace('"', '\\"').replace("'", "\\'"))
except UnicodeEncodeError:
self.errParse += 1;
skip = 1;
self.errfile.write(currentRow)
if skip != 1:
currentRow = currentRow[:-1]
currentRow += u");"
#print len(attributes.keys())
self.outFile.write(currentRow.encode("utf-8"))
self.outFile.write("\n")
self.outFile.flush()
print currentRow.encode("utf-8");
def characters(self, data):
pass
def endElement(self, name):
pass
if len(sys.argv) < 2:
print "Give me an xml file argument!"
sys.exit(1)
parser = xml.sax.make_parser()
handler = SOHandler()
parser.setContentHandler(handler)
parser.parse(sys.argv[1])
print handler.errParse
来源:https://stackoverflow.com/questions/1146470/to-read-sos-data-dump-effectively