AWS Athena export array of structs to JSON

后端 未结 2 1928
梦谈多话
梦谈多话 2021-02-12 19:24

I\'ve got an Athena table where some fields have a fairly complex nested format. The backing records in S3 are JSON. Along these lines (but we have several more levels of nest

相关标签:
2条回答
  • 2021-02-12 20:22

    The python code from @tarun almost got me there, but I had to modify it in several ways due to my data. In particular, I have:

    • json structures saved in Athena as strings
    • Strings that contain multiple words, and therefore need to be in between double quotes. Some of them contain "[]" and "{}" symbols.

    Here is the code that worked for me, hopefully will be useful for others:

    #!/usr/bin/env python
    import io
    import re, sys
    
    pattern1 = re.compile(r'(?<={)([a-z]+)=', re.I)
    pattern2 = re.compile(r':([a-z][^,{}. [\]]+)', re.I)
    pattern3 = re.compile(r'\\"', re.I)
    
    with io.open(sys.argv[1]) as f:
        headers = list(map(lambda f: f.strip(), f.readline().split(",")))
        print(headers)
        for line in f.readlines():
    
            orig_line = line
            #save the double quote cases, which mean there is a string with quotes inside
            line = re.sub('""', "#", orig_line)
            data = []
            for i, l in enumerate(line.split('","')):
                item = re.sub('^"|"$', "", l.rstrip())
                if (item[0] == "{" and item[-1] == "}") or (item[0] == "[" and item[-1] == "]"):
                    data.append(headers[i] + ":" + item)
                else: #we have a string
                    data.append(headers[i] + ": \"" + item + "\"")
    
            line = "{" + ','.join(data) + "}"
            line = pattern1.sub(r'"\1":', line)
            line = pattern2.sub(r':"\1"', line)
    
            #restate the double quotes to single ones, once inside the json
            line = re.sub("#", '"', line)
    
            print(line)
    
    0 讨论(0)
  • 2021-02-12 20:23

    I have skimmed through all the documentation and unfortunately there seems to be no way to do this as of now. The only possible workaround is

    converting a struct to a json when querying athena

    SELECT
      my_field,
      my_field.a,
      my_field.b,
      my_field.c.d,
      my_field.c.e
    FROM 
      my_table
    

    Or I would convert the data to json using post processing. Below script shows how

    #!/usr/bin/env python
    import io
    import re
    
    pattern1 = re.compile(r'(?<={)([a-z]+)=', re.I)
    pattern2 = re.compile(r':([a-z][^,{}. [\]]+)', re.I)
    pattern3 = re.compile(r'\\"', re.I)
    
    with io.open("test.csv") as f:
        headers = list(map(lambda f: f.strip(), f.readline().split(",")))
        for line in f.readlines():
            orig_line = line
            data = []
            for i, l in enumerate(line.split('","')):
                data.append(headers[i] + ":" + re.sub('^"|"$', "", l))
    
            line = "{" + ','.join(data) + "}"
            line = pattern1.sub(r'"\1":', line)
            line = pattern2.sub(r':"\1"', line)
            print(line)
    

    The output on your input data is

    {"timestamp":1.520640777666096E9,"stats":[{"time":15.0, "mean":45.23, "var":0.31}, {"time":19.0, "mean":17.315, "var":2.612}],"dets":[{"coords":[2.4, 1.7, 0.3], "header":{"frame":1, "seq":1, "name":"hello"}}],"pos":{"x":5.0, "y":1.4, "theta":0.04}
    }
    

    Which is a valid JSON

    0 讨论(0)
提交回复
热议问题