Question
I am creating an AWS Lambda function in Java to process a Kinesis Data Stream.
My current parsing setup involves these steps:
- Convert each record to a string using UTF-8, as suggested in the AWS documentation
for (KinesisEvent.KinesisEventRecord rec : event.getRecords())
{
    String stringRecords = new String(rec.getKinesis().getData().array(), "UTF-8");
    pageEventList.add(pageEvent);
}
- Clean up characters using regex patterns
a. non-ASCII: "[^\\x00-\\x7F]"
b. ASCII control characters: "[\\p{Cntrl}&&[^\r\n\t]]"
c. non-printable characters: "\\p{C}"
- Format the JSON objects, which arrive concatenated without square brackets or commas, into a JSON array string
int firstBeginningCurlyBracketIndex = cleanString.indexOf("{");
if (firstBeginningCurlyBracketIndex != -1) {
    cleanString = cleanString.substring(firstBeginningCurlyBracketIndex + 1);
    cleanString = "[{" + cleanString;
}
int lastIndexOfCurlyBracketIndex = cleanString.lastIndexOf("}");
if (lastIndexOfCurlyBracketIndex != -1) {
    cleanString = cleanString.substring(0, lastIndexOfCurlyBracketIndex);
    cleanString = cleanString + "}]";
}
cleanString = cleanString.replaceAll("}\\{", "\\},\\{");
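For step 1, one thing worth ruling out is that `ByteBuffer.array()` returns the entire backing array and ignores the buffer's position and limit. Whether that applies to the buffers handed to this Lambda is an assumption, not something confirmed for this stream. A minimal sketch using only `java.nio`, with the cleanup patterns from step 2 (the names `RecordDecoder`, `decode`, and `clean` are hypothetical):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class RecordDecoder {
    // Decodes only the bytes between the buffer's position and limit.
    // asReadOnlyBuffer() has its own position, so the caller's buffer is not consumed.
    static String decode(ByteBuffer data) {
        return StandardCharsets.UTF_8.decode(data.asReadOnlyBuffer()).toString();
    }

    // Applies patterns (a) and (b) from step 2: drop non-ASCII characters,
    // then drop ASCII control characters except CR, LF, and TAB.
    static String clean(String s) {
        return s.replaceAll("[^\\x00-\\x7F]", "")
                .replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");
    }

    public static void main(String[] args) {
        // Simulated payload: the real data sits in the middle of a larger backing array.
        byte[] backing = "XX{\"name\":\"banana\"}YY".getBytes(StandardCharsets.UTF_8);
        ByteBuffer slice = ByteBuffer.wrap(backing, 2, backing.length - 4);

        System.out.println(decode(slice));                                      // {"name":"banana"}
        System.out.println(new String(slice.array(), StandardCharsets.UTF_8)); // XX{"name":"banana"}YY
    }
}
```

If both lines print the same text for real records, the backing array is not the source of the stray characters and the cause lies elsewhere in the data.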
Having got this far, I currently use a regex to separate the objects before parsing each one into a JSON object. Reference: How to match string within parentheses (nested) in Java?
String REGEX_BRACKET_PATTERN_TWO_LAYERS = "(\\{(?:[^}{]+|\\{(?:[^}{]+|\\{[^}{]*\\})*\\})*\\})";
Pattern splitDelRegex = Pattern.compile(REGEX_BRACKET_PATTERN_TWO_LAYERS);
Matcher regexMatcher = splitDelRegex.matcher(nonAsciiRemovedString);
List<String> matcherList = new ArrayList<String>();
while (regexMatcher.find()) {
    String perm = regexMatcher.group(1);
    matcherList.add(perm);
}
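The regex above handles a fixed number of nesting levels. As an alternative, a brace-depth scanner can pull out every balanced top-level `{...}` span and skip whatever sits between spans. This is a minimal sketch, not a JSON validator: it assumes the garbage between objects contains no unbalanced `{`, and the names `JsonObjectSplitter` and `split` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class JsonObjectSplitter {
    // Returns every brace-balanced top-level {...} span; text outside the spans is skipped.
    static List<String> split(String input) {
        List<String> objects = new ArrayList<>();
        int depth = 0;            // current brace nesting level
        int start = -1;           // index of the '{' that opened the current top-level span
        boolean inString = false; // true while inside a double-quoted string within an object
        boolean escaped = false;  // true right after a backslash inside a string
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (inString) {
                if (escaped) escaped = false;
                else if (c == '\\') escaped = true;
                else if (c == '"') inString = false;
                continue; // braces inside strings do not affect depth
            }
            if (c == '"' && depth > 0) {
                inString = true;
            } else if (c == '{') {
                if (depth == 0) start = i;
                depth++;
            } else if (c == '}' && depth > 0) {
                depth--;
                if (depth == 0) objects.add(input.substring(start, i + 1));
            }
            // A '}' at depth 0 or any other character outside an object is ignored.
        }
        return objects;
    }

    public static void main(String[] args) {
        String raw = "{\"name\":\"banana\",\"tags\":{\"a\":\"}\"}}GD~{}FDSE-}{\"name\":\"orange\"}";
        for (String s : split(raw)) {
            System.out.println(s);
        }
    }
}
```

Each extracted span is only brace-balanced, so it would still need to go through Gson or Jackson on its own, inside a try/catch, so that one bad fragment does not fail the whole batch. Empty `{}` spans are returned as-is and can be filtered by the caller.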
I have also tried using Gson and Jackson to parse the JSON array string produced by step 3 (ref: How to parse JSON in Java). Parsing works fine until an invalid JSON fragment or stray string shows up in the data stream, at which point it throws an exception: java.lang.Exception: com.google.gson.JsonSyntaxException: java.lang.IllegalStateException: Expected BEGIN_ARRAY but was STRING at line 2 column 1 path $
The invalid JSON that causes this exception looks something like this:
[
...
{
"name": "banana"
"description": "description"
},
{
"name": "orange"
"description": "description"
}
GD~
{}
FDSE-}
]
My questions are:
Because the stray text at the end is unpredictable, I am having difficulty formatting the whole string into a valid JSON array string. Does anybody have a good idea for making sure this JSON array string is always valid?
Aside from the steps I have described for parsing a Kinesis Data Stream into JSON data (which do work using regex, although I still see that random string at the end), if anybody has experience with this parsing process, please share it with the community. I feel the AWS documentation on using Lambda with Kinesis is not detailed enough to cover the whole parsing process.
In addition, I am aware that this could all simply be due to the quality of the data in the stream. It would also be nice to hear about other people's experience handling their data in this situation.
Source: https://stackoverflow.com/questions/63071209/how-to-parse-kinesis-data-stream-in-aws-lambda-java