Reading the data written to S3 by an Amazon Kinesis Firehose stream

感情败类 · Asked 2021-02-18 15:17 · 9 answers · 2054 views

I am writing records to a Kinesis Firehose stream that are eventually written to an S3 file by Amazon Kinesis Firehose.

My record object looks like

ItemPurcha

Firehose writes these records to the S3 file as concatenated JSON objects with no delimiter between them, so the resulting file cannot be parsed as a single JSON document. How can I read the individual records back?
9 Answers
  • 2021-02-18 15:27

    You can find each valid JSON object by counting the curly braces. Assuming the file starts with a {, this Python snippet should work:

    import json

    def read_block(stream):
        # Read one character at a time and yield a block every time the
        # curly braces balance out, i.e. one complete JSON object.
        open_braces = 0
        block = ''
        while True:
            c = stream.read(1)
            if not c:
                break

            if c == '{':
                open_braces += 1
            elif c == '}':
                open_braces -= 1

            block += c

            if open_braces == 0:
                yield block
                block = ''


    if __name__ == "__main__":
        with open('firehose_json_blob', 'r') as f:
            for block in read_block(f):
                record = json.loads(block)
                print(record)

  • 2021-02-18 15:27

    If there's a way to change how the data is written, separate the records with a newline; that way you can read the data back simply, line by line (see the sketch below). If not, build a scanner object that uses "}" as its delimiter and read with that scanner. That would do the job.
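
    A minimal sketch of the newline-delimited case, assuming the producer appends "\n" to every record before sending it to the stream; the file name firehose_output is just a placeholder:

    import json

    # Assumes one JSON record per line (newline-delimited JSON).
    with open('firehose_output', 'r') as f:  # placeholder file name
        for line in f:
            line = line.strip()
            if line:
                record = json.loads(line)
                print(record)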

  • 2021-02-18 15:28

    If the input source for the Firehose is a Kinesis Analytics application, this concatenated JSON without a delimiter is a known issue, as cited here. You should have a Lambda function, as here, that outputs the JSON objects on separate lines.

  • 2021-02-18 15:33

    I think the best way to tackle this is to first create a properly formatted JSON file containing well-separated JSON objects. In my case I appended ',' to each event that was pushed into the Firehose. Then, after a file is saved in S3, all the files contain JSON objects separated by a delimiter (a comma, in our case). You also have to add '[' at the beginning and ']' at the end of the file. Then you have a proper JSON file containing multiple JSON objects, and parsing it becomes possible (see the sketch below).
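
    A minimal sketch of that wrapping step, assuming every record was written with a trailing comma; the file name firehose_output is just a placeholder:

    import json

    # Assumes each JSON record in the file is followed by a comma.
    with open('firehose_output', 'r') as f:  # placeholder file name
        body = f.read().strip()

    # Drop the comma after the last record, wrap everything in [ ] to form
    # a valid JSON array, and parse it in one go.
    records = json.loads('[' + body.rstrip(',') + ']')
    print('parsed', len(records), 'records')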

  • 2021-02-18 15:34

    I used a transformation Lambda to add a line break at the end of every record

    import base64
    import copy

    def lambda_handler(event, context):
        output = []
    
        for record in event['records']:
    
            # Decode from base64 (Firehose records are base64 encoded)
            payload = base64.b64decode(record['data'])
    
            # Read json as utf-8    
            json_string = payload.decode("utf-8")
    
            # Add a line break
            output_json_with_line_break = json_string + "\n"
    
            # Encode the data
            encoded_bytes = base64.b64encode(bytearray(output_json_with_line_break, 'utf-8'))
            encoded_string = str(encoded_bytes, 'utf-8')
    
            # Create a deep copy of the record and append to output with transformed data
            output_record = copy.deepcopy(record)
            output_record['data'] = encoded_string
            output_record['result'] = 'Ok'
    
            output.append(output_record)
    
        print('Successfully processed {} records.'.format(len(event['records'])))
    
        return {'records': output}
    
  • 2021-02-18 15:38

    I also had the same problem; here is how I solved it.

    1. Replace "}{" with "}\n{".
    2. Split the result on "\n".

      import re

      input_json_rdd.map(lambda x: re.sub("}{", "}\n{", x, flags=re.UNICODE)) \
                    .flatMap(lambda line: line.split("\n"))


    A nested JSON object contains several "}" characters, so splitting lines on "}" alone doesn't solve the problem.
