问题
I have developed a public and open source App for Splunk (Nmon performance monitor for Unix and Linux Systems, see https://apps.splunk.com/app/1753/)
A master piece of the App is an old perl (recycled, modified and updated) script automatically launched by the App to convert the Nmon data (which is some kind of custom csv), reading it from stdin and writing out to formerly formatted csv files by section (a section is a performance monitor)
I want now to fully rewrite this script in Python, which is almost done for a first beta version... BUT i am facing difficulties to transpose data, and i'm afraid not being able to solve it myself.
This is why i am kindly asking for help today.
Here is the difficulty in details:
Nmon generates performance monitors for various sections (cpu, memory, disks...), for many of them there is no big difficulty but extracting the good timestamp and so on. But for all sections that have "device" notion (such as DISKBUSY in the provided example, which represents the percentage of time disks were busy) have to be transformed and transposed to be later exploitable
Currently, i am able to generate the data as follows:
Example:
time,sda,sda1,sda2,sda3,sda5,sda6,sda7,sdb,sdb1,sdc,sdc1,sdc2,sdc3
26-JUL-2014 11:10:44,4.4,0.0,0.0,0.0,0.4,1.9,2.5,0.0,0.0,10.2,10.2,0.0,0.0
26-JUL-2014 11:10:54,4.8,0.0,0.0,0.0,0.3,2.0,2.6,0.0,0.0,5.4,5.4,0.0,0.0
26-JUL-2014 11:11:04,4.8,0.0,0.0,0.0,0.4,2.3,2.1,0.0,0.0,17.8,17.8,0.0,0.0
26-JUL-2014 11:11:14,2.1,0.0,0.0,0.0,0.2,0.5,1.5,0.0,0.0,28.2,28.2,0.0,0.0
The goal is to transpose the data such as we will have in the header "time,device,value", example:
time,device,value
26-JUL-2014 11:10:44,sda,4.4
26-JUL-2014 11:10:44,sda1,0.0
26-JUL-2014 11:10:44,sda2,0.0
And so on.
One month ago, I've opened a question for almost the same need (for another app and not exactly the same data, but the same need to transpose columns to rows)
Python - CSV time oriented Transposing large number of columns to rows
I had a very great answer which perfectly did the trick, thus i am unable to recycle the piece of code into this new context. One of difference is that i want to include the data transposition inside within the code, such that the script only works in memory and avoid dealing with multiple temporary files.
Here is the piece of code:
Note: needs to use Python 2x
###################
# Dynamic Sections : data requires to be transposed to be exploitable within Splunk
###################
dynamic_section = ["DISKBUSY"]
for section in dynamic_section:
# Set output file
currsection_output = DATA_DIR + HOSTNAME + '_' + day + '_' + month + '_' + year + '_' + hour + minute + second + '_' + section + '.csv'
# Open output for writing
with open(currsection_output, "w") as currsection:
for line in data:
# Extract sections, and write to output
myregex = r'^' + section + '[0-9]*' + '|ZZZZ.+'
find_section = re.match( myregex, line)
if find_section:
# csv header
# Replace some symbols
line=re.sub("%",'_PCT',line)
line=re.sub(" ",'_',line)
# Extract header excluding data that always has Txxxx for timestamp reference
myregex = '(' + section + ')\,([^T].+)'
fullheader_match = re.search( myregex, line)
if fullheader_match:
fullheader = fullheader_match.group(2)
header_match = re.match( r'([a-zA-Z\-\/\_0-9]+,)([a-zA-Z\-\/\_0-9\,]*)', fullheader)
if header_match:
header = header_match.group(2)
# Write header
currsection.write('time' + ',' + header + '\n'),
# Extract timestamp
# Nmon V9 and prior do not have date in ZZZZ
# If unavailable, we'll use the global date (AAA,date)
ZZZZ_DATE = '-1'
ZZZZ_TIME = '-1'
# For Nmon V10 and more
timestamp_match = re.match( r'^ZZZZ\,(.+)\,(.+)\,(.+)\n', line)
if timestamp_match:
ZZZZ_TIME = timestamp_match.group(2)
ZZZZ_DATE = timestamp_match.group(3)
ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME
# For Nmon V9 and less
if ZZZZ_DATE == '-1':
ZZZZ_DATE = DATE
timestamp_match = re.match( r'^ZZZZ\,(.+)\,(.+)\n', line)
if timestamp_match:
ZZZZ_TIME = timestamp_match.group(2)
ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME
# Extract Data
myregex = r'^' + section + '\,(T\d+)\,(.+)\n'
perfdata_match = re.match( myregex, line)
if perfdata_match:
perfdata = perfdata_match.group(2)
# Write perf data
currsection.write(ZZZZ_timestamp + ',' + perfdata + '\n'),
# End for
# Open output for reading and show number of line we extracted
with open(currsection_output, "r") as currsection:
num_lines = sum(1 for line in currsection)
print (section + " section: Wrote", num_lines, "lines")
# End for
The line:
currsection.write('time' + ',' + header + '\n'),
will contain the header
And the line:
currsection.write(ZZZZ_timestamp + ',' + perfdata + '\n'),
contains the data line by line
Note: the final data (header and body data) should in target also contains other information, to simplify things i removed it in the code above
For static sections which does not require the data transposition, the same lines will be:
currsection.write('type' + ',' + 'serialnum' + ',' + 'hostname' + ',' + 'time' + ',' + header + '\n'),
And:
currsection.write(section + ',' + SN + ',' + HOSTNAME + ',' + ZZZZ_timestamp + ',' + perfdata + '\n'),
The great goal would be to be able to transpose the data just after the required definition and before writing it.
Also, performance and minimum system resources called (such as working with temporary files instead of memory) is a requirement to prevent from generating too high cpu load on systems periodically the script.
Could anyone help me to achieve this ? I've looked for it again and again, i'm pretty sure there is multiple ways to achieve this (zip, map, dictionary, list, split...) but i failed to achieve it...
Please be indulgent, this is my first real Python script :-)
Thank you very much for any help !
More details:
- testing nmon file
A small testing nmon file can be retrieved here: http://pastebin.com/xHLRbBU0
- Current complete script
The current complete script can be retrieved here: http://pastebin.com/QEnXj6Yh
To test the script, it is required to:
export the SPLUNK_HOME variable to anything relevant for you, ex:
mkdir /tmp/nmon2csv
--> place the script and nmon file here, allow execution on script
export SPLUNK_HOME=/tmp/nmon2csv
mkdir -p etc/apps/nmon
And finally:
cat test.nmon | ./nmon2csv.py
Data will be generated in /tmp/nmon2csv/etc/apps/nmon/var/*
Update: Working code using csv module:
###################
# Dynamic Sections : data requires to be transposed to be exploitable within Splunk
###################
dynamic_section = ["DISKBUSY","DISKBSIZE","DISKREAD","DISKWRITE","DISKXFER","DISKRIO","DISKWRIO","IOADAPT","NETERROR","NET","NETPACKET","JFSFILE","JFSINODE"]
for section in dynamic_section:
# Set output file (will opened after transpose)
currsection_output = DATA_DIR + HOSTNAME + '_' + day + '_' + month + '_' + year + '_' + hour + minute + second + '_' + section + '.csv'
# Open Temp
with TemporaryFile() as tempf:
for line in data:
# Extract sections, and write to output
myregex = r'^' + section + '[0-9]*' + '|ZZZZ.+'
find_section = re.match( myregex, line)
if find_section:
# csv header
# Replace some symbols
line=re.sub("%",'_PCT',line)
line=re.sub(" ",'_',line)
# Extract header excluding data that always has Txxxx for timestamp reference
myregex = '(' + section + ')\,([^T].+)'
fullheader_match = re.search( myregex, line)
if fullheader_match:
fullheader = fullheader_match.group(2)
header_match = re.match( r'([a-zA-Z\-\/\_0-9]+,)([a-zA-Z\-\/\_0-9\,]*)', fullheader)
if header_match:
header = header_match.group(2)
# Write header
tempf.write('time' + ',' + header + '\n'),
# Extract timestamp
# Nmon V9 and prior do not have date in ZZZZ
# If unavailable, we'll use the global date (AAA,date)
ZZZZ_DATE = '-1'
ZZZZ_TIME = '-1'
# For Nmon V10 and more
timestamp_match = re.match( r'^ZZZZ\,(.+)\,(.+)\,(.+)\n', line)
if timestamp_match:
ZZZZ_TIME = timestamp_match.group(2)
ZZZZ_DATE = timestamp_match.group(3)
ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME
# For Nmon V9 and less
if ZZZZ_DATE == '-1':
ZZZZ_DATE = DATE
timestamp_match = re.match( r'^ZZZZ\,(.+)\,(.+)\n', line)
if timestamp_match:
ZZZZ_TIME = timestamp_match.group(2)
ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME
# Extract Data
myregex = r'^' + section + '\,(T\d+)\,(.+)\n'
perfdata_match = re.match( myregex, line)
if perfdata_match:
perfdata = perfdata_match.group(2)
# Write perf data
tempf.write(ZZZZ_timestamp + ',' + perfdata + '\n'),
# Open final for writing
with open(currsection_output, "w") as currsection:
# Rewind temp
tempf.seek(0)
writer = csv.writer(currsection)
writer.writerow(['type', 'serialnum', 'hostname', 'time', 'device', 'value'])
for d in csv.DictReader(tempf):
time = d.pop('time')
for device, value in sorted(d.items()):
row = [section, SN, HOSTNAME, time, device, value]
writer.writerow(row)
# End for
# Open output for reading and show number of line we extracted
with open(currsection_output, "r") as currsection:
num_lines = sum(1 for line in currsection)
print (section + " section: Wrote", num_lines, "lines")
# End for
回答1:
The goal is to transpose the data such as we will have in the header "time,device,value"
This rough transposition logic looks like this:
text = '''time,sda,sda1,sda2,sda3,sda5,sda6,sda7,sdb,sdb1,sdc,sdc1,sdc2,sdc3
26-JUL-2014 11:10:44,4.4,0.0,0.0,0.0,0.4,1.9,2.5,0.0,0.0,10.2,10.2,0.0,0.0
26-JUL-2014 11:10:54,4.8,0.0,0.0,0.0,0.3,2.0,2.6,0.0,0.0,5.4,5.4,0.0,0.0
26-JUL-2014 11:11:04,4.8,0.0,0.0,0.0,0.4,2.3,2.1,0.0,0.0,17.8,17.8,0.0,0.0
26-JUL-2014 11:11:14,2.1,0.0,0.0,0.0,0.2,0.5,1.5,0.0,0.0,28.2,28.2,0.0,0.0
'''
import csv
for d in csv.DictReader(text.splitlines()):
time = d.pop('time')
for device, value in sorted(d.items()):
print time, device, value
Putting it all together into a complete script looks something like this:
import csv
with open('transposed.csv', 'wb') as destfile:
writer = csv.writer(destfile)
writer.writerow(['time', 'device', 'value'])
with open('data.csv', 'rb') as sourefile:
for d in csv.DictReader(sourcefile):
time = d.pop('time')
for device, value in sorted(d.items()):
row = [time, device, value]
writer.writerow(row)
来源:https://stackoverflow.com/questions/25005566/python-transpose-columns-to-rows-within-data-operation-and-before-writing-to-f