Python - Transpose columns to rows within data operation and before writing to file

问题

I have developed a public and open source App for Splunk (Nmon performance monitor for Unix and Linux Systems, see https://apps.splunk.com/app/1753/)

A master piece of the App is an old perl (recycled, modified and updated) script automatically launched by the App to convert the Nmon data (which is some kind of custom csv), reading it from stdin and writing out to formerly formatted csv files by section (a section is a performance monitor)

I want now to fully rewrite this script in Python, which is almost done for a first beta version... BUT i am facing difficulties to transpose data, and i'm afraid not being able to solve it myself.

This is why i am kindly asking for help today.

Here is the difficulty in details:

Nmon generates performance monitors for various sections (cpu, memory, disks...), for many of them there is no big difficulty but extracting the good timestamp and so on. But for all sections that have "device" notion (such as DISKBUSY in the provided example, which represents the percentage of time disks were busy) have to be transformed and transposed to be later exploitable

Currently, i am able to generate the data as follows:

Example:

time,sda,sda1,sda2,sda3,sda5,sda6,sda7,sdb,sdb1,sdc,sdc1,sdc2,sdc3
26-JUL-2014 11:10:44,4.4,0.0,0.0,0.0,0.4,1.9,2.5,0.0,0.0,10.2,10.2,0.0,0.0
26-JUL-2014 11:10:54,4.8,0.0,0.0,0.0,0.3,2.0,2.6,0.0,0.0,5.4,5.4,0.0,0.0
26-JUL-2014 11:11:04,4.8,0.0,0.0,0.0,0.4,2.3,2.1,0.0,0.0,17.8,17.8,0.0,0.0
26-JUL-2014 11:11:14,2.1,0.0,0.0,0.0,0.2,0.5,1.5,0.0,0.0,28.2,28.2,0.0,0.0

The goal is to transpose the data such as we will have in the header "time,device,value", example:

time,device,value
26-JUL-2014 11:10:44,sda,4.4
26-JUL-2014 11:10:44,sda1,0.0
26-JUL-2014 11:10:44,sda2,0.0

And so on.

One month ago, I've opened a question for almost the same need (for another app and not exactly the same data, but the same need to transpose columns to rows)

Python - CSV time oriented Transposing large number of columns to rows

I had a very great answer which perfectly did the trick, thus i am unable to recycle the piece of code into this new context. One of difference is that i want to include the data transposition inside within the code, such that the script only works in memory and avoid dealing with multiple temporary files.

Here is the piece of code:

Note: needs to use Python 2x

###################
# Dynamic Sections : data requires to be transposed to be exploitable within Splunk
###################


dynamic_section = ["DISKBUSY"]

for section in dynamic_section:

    # Set output file
    currsection_output = DATA_DIR + HOSTNAME + '_' + day + '_' + month + '_' + year + '_' + hour + minute + second + '_' + section + '.csv'

    # Open output for writing
    with open(currsection_output, "w") as currsection:

        for line in data:

            # Extract sections, and write to output
            myregex = r'^' + section + '[0-9]*' + '|ZZZZ.+'
            find_section = re.match( myregex, line)
            if find_section:

                # csv header

                # Replace some symbols
                line=re.sub("%",'_PCT',line)
                line=re.sub(" ",'_',line)

                # Extract header excluding data that always has Txxxx for timestamp reference
                myregex = '(' + section + ')\,([^T].+)'
                fullheader_match = re.search( myregex, line)            

                if fullheader_match:
                    fullheader = fullheader_match.group(2)

                    header_match = re.match( r'([a-zA-Z\-\/\_0-9]+,)([a-zA-Z\-\/\_0-9\,]*)', fullheader)    

                    if header_match:
                        header = header_match.group(2)

                        # Write header
                        currsection.write('time' + ',' + header + '\n'),


                # Extract timestamp

                # Nmon V9 and prior do not have date in ZZZZ
                # If unavailable, we'll use the global date (AAA,date)
                ZZZZ_DATE = '-1'
                ZZZZ_TIME = '-1'                

                # For Nmon V10 and more             

                timestamp_match = re.match( r'^ZZZZ\,(.+)\,(.+)\,(.+)\n', line)
                if timestamp_match:
                    ZZZZ_TIME = timestamp_match.group(2)
                    ZZZZ_DATE = timestamp_match.group(3)            
                    ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME

                # For Nmon V9 and less                  

                if ZZZZ_DATE == '-1':
                    ZZZZ_DATE = DATE
                    timestamp_match = re.match( r'^ZZZZ\,(.+)\,(.+)\n', line)
                    if timestamp_match:
                        ZZZZ_TIME = timestamp_match.group(2)                    
                        ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME

                # Extract Data
                myregex = r'^' + section + '\,(T\d+)\,(.+)\n'
                perfdata_match = re.match( myregex, line)
                if perfdata_match:
                    perfdata = perfdata_match.group(2)

                    # Write perf data
                    currsection.write(ZZZZ_timestamp + ',' + perfdata + '\n'),

        # End for

    # Open output for reading and show number of line we extracted
    with open(currsection_output, "r") as currsection:

        num_lines = sum(1 for line in currsection)
        print (section + " section: Wrote", num_lines, "lines")

# End for

The line:

                currsection.write('time' + ',' + header + '\n'),

will contain the header

And the line:

            currsection.write(ZZZZ_timestamp + ',' + perfdata + '\n'),

contains the data line by line

Note: the final data (header and body data) should in target also contains other information, to simplify things i removed it in the code above

For static sections which does not require the data transposition, the same lines will be:

                    currsection.write('type' + ',' + 'serialnum' + ',' + 'hostname' + ',' + 'time' + ',' + header + '\n'),

And:

                currsection.write(section + ',' + SN + ',' + HOSTNAME + ',' + ZZZZ_timestamp + ',' + perfdata + '\n'),

The great goal would be to be able to transpose the data just after the required definition and before writing it.

Also, performance and minimum system resources called (such as working with temporary files instead of memory) is a requirement to prevent from generating too high cpu load on systems periodically the script.

Could anyone help me to achieve this ? I've looked for it again and again, i'm pretty sure there is multiple ways to achieve this (zip, map, dictionary, list, split...) but i failed to achieve it...

Please be indulgent, this is my first real Python script :-)

Thank you very much for any help !

More details:

testing nmon file

A small testing nmon file can be retrieved here: http://pastebin.com/xHLRbBU0

Current complete script

The current complete script can be retrieved here: http://pastebin.com/QEnXj6Yh

To test the script, it is required to:

export the SPLUNK_HOME variable to anything relevant for you, ex:

mkdir /tmp/nmon2csv

--> place the script and nmon file here, allow execution on script

export SPLUNK_HOME=/tmp/nmon2csv
mkdir -p etc/apps/nmon

And finally:

cat test.nmon | ./nmon2csv.py

Data will be generated in /tmp/nmon2csv/etc/apps/nmon/var/*

Update: Working code using csv module:

###################
# Dynamic Sections : data requires to be transposed to be exploitable within Splunk
###################

dynamic_section = ["DISKBUSY","DISKBSIZE","DISKREAD","DISKWRITE","DISKXFER","DISKRIO","DISKWRIO","IOADAPT","NETERROR","NET","NETPACKET","JFSFILE","JFSINODE"]

for section in dynamic_section:

    # Set output file (will opened after transpose)
    currsection_output = DATA_DIR + HOSTNAME + '_' + day + '_' + month + '_' + year + '_' + hour + minute + second + '_' + section + '.csv'

    # Open Temp
    with TemporaryFile() as tempf:

        for line in data:

            # Extract sections, and write to output
            myregex = r'^' + section + '[0-9]*' + '|ZZZZ.+'
            find_section = re.match( myregex, line)
            if find_section:

                # csv header

                # Replace some symbols
                line=re.sub("%",'_PCT',line)
                line=re.sub(" ",'_',line)

                # Extract header excluding data that always has Txxxx for timestamp reference
                myregex = '(' + section + ')\,([^T].+)'
                fullheader_match = re.search( myregex, line)            

                if fullheader_match:
                    fullheader = fullheader_match.group(2)

                    header_match = re.match( r'([a-zA-Z\-\/\_0-9]+,)([a-zA-Z\-\/\_0-9\,]*)', fullheader)    

                    if header_match:
                        header = header_match.group(2)

                        # Write header
                        tempf.write('time' + ',' + header + '\n'),  

                # Extract timestamp

                # Nmon V9 and prior do not have date in ZZZZ
                # If unavailable, we'll use the global date (AAA,date)
                ZZZZ_DATE = '-1'
                ZZZZ_TIME = '-1'                

                # For Nmon V10 and more             

                timestamp_match = re.match( r'^ZZZZ\,(.+)\,(.+)\,(.+)\n', line)
                if timestamp_match:
                    ZZZZ_TIME = timestamp_match.group(2)
                    ZZZZ_DATE = timestamp_match.group(3)            
                    ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME

                # For Nmon V9 and less                  

                if ZZZZ_DATE == '-1':
                    ZZZZ_DATE = DATE
                    timestamp_match = re.match( r'^ZZZZ\,(.+)\,(.+)\n', line)
                    if timestamp_match:
                        ZZZZ_TIME = timestamp_match.group(2)                    
                        ZZZZ_timestamp = ZZZZ_DATE + ' ' + ZZZZ_TIME

                # Extract Data
                myregex = r'^' + section + '\,(T\d+)\,(.+)\n'
                perfdata_match = re.match( myregex, line)
                if perfdata_match:
                    perfdata = perfdata_match.group(2)

                    # Write perf data
                    tempf.write(ZZZZ_timestamp + ',' + perfdata + '\n'),


        # Open final for writing
        with open(currsection_output, "w") as currsection:

            # Rewind temp
            tempf.seek(0)

            writer = csv.writer(currsection)
            writer.writerow(['type', 'serialnum', 'hostname', 'time', 'device', 'value'])           

            for d in csv.DictReader(tempf):
                time = d.pop('time')
                for device, value in sorted(d.items()):
                    row = [section, SN, HOSTNAME, time, device, value]
                    writer.writerow(row)            

            # End for

    # Open output for reading and show number of line we extracted
    with open(currsection_output, "r") as currsection:

        num_lines = sum(1 for line in currsection)
        print (section + " section: Wrote", num_lines, "lines")

# End for

回答1:

The goal is to transpose the data such as we will have in the header "time,device,value"

This rough transposition logic looks like this:

text = '''time,sda,sda1,sda2,sda3,sda5,sda6,sda7,sdb,sdb1,sdc,sdc1,sdc2,sdc3
26-JUL-2014 11:10:44,4.4,0.0,0.0,0.0,0.4,1.9,2.5,0.0,0.0,10.2,10.2,0.0,0.0
26-JUL-2014 11:10:54,4.8,0.0,0.0,0.0,0.3,2.0,2.6,0.0,0.0,5.4,5.4,0.0,0.0
26-JUL-2014 11:11:04,4.8,0.0,0.0,0.0,0.4,2.3,2.1,0.0,0.0,17.8,17.8,0.0,0.0
26-JUL-2014 11:11:14,2.1,0.0,0.0,0.0,0.2,0.5,1.5,0.0,0.0,28.2,28.2,0.0,0.0
'''

import csv

for d in csv.DictReader(text.splitlines()):
    time = d.pop('time')
    for device, value in sorted(d.items()):
        print time, device, value

Putting it all together into a complete script looks something like this:

import csv

with open('transposed.csv', 'wb') as destfile:
    writer = csv.writer(destfile)
    writer.writerow(['time', 'device', 'value'])
    with open('data.csv', 'rb') as sourefile:
        for d in csv.DictReader(sourcefile):
            time = d.pop('time')
            for device, value in sorted(d.items()):
                row = [time, device, value]
                writer.writerow(row)

来源：https://stackoverflow.com/questions/25005566/python-transpose-columns-to-rows-within-data-operation-and-before-writing-to-f

标签

python

csv

transpose