I am reading a file containing keywords line by line and found a strange problem. I expect that when consecutive lines have the same contents, they should be handl
There must be a space or some non-printable character at the start of the line. So either fix that or trim the strings during or before comparison.
[Edited]
In case String.trim() is of no avail, try String.replaceAll() with a suitable regex, for example str.replaceAll("\\p{Cntrl}", "").
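A minimal sketch of the trim-versus-regex advice above. The class and method names (LineCleaner, clean) are made up for illustration. Note that String.trim() only strips characters up to U+0020 and \p{Cntrl} only matches ASCII control characters, so neither removes a byte order mark (U+FEFF); the sketch assumes you also want to drop Unicode format characters via \p{Cf}.

```java
import java.util.regex.Pattern;

class LineCleaner {
    // \p{Cntrl} matches ASCII control characters (including tab);
    // \p{Cf} matches Unicode "format" characters, which include U+FEFF (the BOM).
    private static final Pattern INVISIBLE = Pattern.compile("[\\p{Cntrl}\\p{Cf}]");

    // Hypothetical helper: remove invisible characters, then trim ordinary spaces.
    // Caution: this also deletes tabs inside the line, which may not suit every file.
    static String clean(String line) {
        return INVISIBLE.matcher(line).replaceAll("").trim();
    }

    public static void main(String[] args) {
        String raw = "\uFEFFkeyword\t";
        System.out.println(raw.trim().equals("keyword"));  // false: trim() keeps U+FEFF
        System.out.println(clean(raw).equals("keyword"));  // true
    }
}
```

If tabs or other control characters are meaningful in your data, narrow the pattern to just the characters you want gone.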
What is the encoding of the file?
The unseen char at the start of the file could be the Byte Order Mark.
Saving with ANSI or UTF-8 without BOM can help highlight this for you.
The Byte Order Mark (BOM) is a Unicode character (U+FEFF). You will get characters like ï»¿ at the start of a text stream when the bytes are read with the wrong encoding, because BOM use is optional and, if used, it appears at the start of the text stream.
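To check whether a file really starts with a UTF-8 BOM, you can look at its first three bytes. The following is only a sketch; BomSniffer and hasUtf8Bom are hypothetical names, and the byte arrays are built in memory instead of being read from a real file.

```java
import java.nio.charset.StandardCharsets;

class BomSniffer {
    // The UTF-8 encoding of U+FEFF is the three bytes EF BB BF.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) {
        byte[] withBom = "\uFEFFKey1,Key2".getBytes(StandardCharsets.UTF_8);
        byte[] withoutBom = "Key1,Key2".getBytes(StandardCharsets.UTF_8);
        System.out.println(hasUtf8Bom(withBom));     // true
        System.out.println(hasUtf8Bom(withoutBom));  // false

        // Misreading the same bytes as a single-byte encoding yields three stray characters.
        String misread = new String(withBom, StandardCharsets.ISO_8859_1);
        System.out.println(misread.length() - "Key1,Key2".length()); // 3
    }
}
```

For a real file you could pass in the result of java.nio.file.Files.readAllBytes, or read just the first three bytes from a stream.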
File file = new File( csvFilename );
FileInputStream inputStream = new FileInputStream(file);
// [{"Key2":"21","Key1":"11","Key3":"31"} ]
InputStreamReader inputStreamReader = new InputStreamReader( inputStream, "UTF-8" );
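We can resolve this by explicitly passing UTF-8 as the charset to InputStreamReader. Decoded as UTF-8, the byte sequence EF BB BF (which shows up as ï»¿ when misread in a single-byte encoding) becomes one character, U+FEFF. That character is invisible but still present at the start of the first line, so it has to be removed.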
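Here is a small sketch of that idea: decode as UTF-8, then drop a leading U+FEFF from the first line only. BomAwareLines and readLines are illustrative names, and the input comes from an in-memory byte array so the example runs without a file on disk.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class BomAwareLines {
    // Reads all lines as UTF-8 and strips a BOM from the start of the first line.
    static List<String> readLines(InputStream in) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            boolean first = true;
            while ((line = reader.readLine()) != null) {
                if (first && line.startsWith("\uFEFF")) {
                    line = line.substring(1); // Java's UTF-8 decoder keeps the BOM as U+FEFF
                }
                first = false;
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "\uFEFFKey1,Key2\n11,21\n".getBytes(StandardCharsets.UTF_8);
        List<String> lines = readLines(new ByteArrayInputStream(data));
        System.out.println(lines.get(0).equals("Key1,Key2")); // true
    }
}
```

With a real file, pass a FileInputStream instead of the ByteArrayInputStream.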
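Using CharMatcher from the Google Guava jar, you can remove any non-printable characters and then retain only ASCII characters (dropping any accented characters) like this. Newer Guava releases provide these matchers as the methods CharMatcher.invisible() and CharMatcher.ascii() rather than as constants: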
String printable = CharMatcher.INVISIBLE.removeFrom( input );
String clean = CharMatcher.ASCII.retainFrom( printable );
Full Example to read data from the CSV file to JSON Object:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.stream.Stream;
import com.google.common.base.CharMatcher;
import org.json.simple.JSONObject; // assumption: json-simple; the original does not name the JSON library

public class CSV_FileOperations {
    static List<HashMap<String, String>> listObjects = new ArrayList<HashMap<String,String>>();
    protected static List<JSONObject> jsonArray = new ArrayList<JSONObject >();

    public static void main(String[] args) {
        String csvFilename = "D:/Yashwanth/json2Bson.csv";
        csvToJSONString(csvFilename);
        String jsonData = jsonArray.toString();
        System.out.println("File JSON Data : \n"+ jsonData);
    }

    @SuppressWarnings("deprecation") // CharMatcher.INVISIBLE / ASCII are the older constant forms
    public static String csvToJSONString( String csvFilename ) {
        try {
            File file = new File( csvFilename );
            FileInputStream inputStream = new FileInputStream(file);
            String fileExtensionName = csvFilename.substring(csvFilename.indexOf(".")); // fileName.split(".")[1];
            System.out.println("File Extension : "+ fileExtensionName);
            // [{"Key2":"21","Key1":"11","Key3":"31"} ]
            InputStreamReader inputStreamReader = new InputStreamReader( inputStream, "UTF-8" );
            BufferedReader buffer = new BufferedReader( inputStreamReader );
            Stream<String> readLines = buffer.lines();
            boolean headerStream = true;
            List<String> headers = new ArrayList<String>();
            for (String line : (Iterable<String>) () -> readLines.iterator()) {
                String[] columns = line.split(",");
                if (headerStream) {
                    System.out.println(" ===== Headers =====");
                    for (String keys : columns) {
                        // Strip the UTF-8 BOM (U+FEFF) from header keys « https://stackoverflow.com/a/11021401/5081877
                        String printable = CharMatcher.INVISIBLE.removeFrom( keys );
                        String clean = CharMatcher.ASCII.retainFrom(printable);
                        // replaceAll() takes a regex; replace() would look for the literal text "\P{Print}"
                        String key = clean.replaceAll("\\P{Print}", "");
                        headers.add( key );
                    }
                    headerStream = false;
                    System.out.println(" ===== ----- Data ----- =====");
                } else {
                    addCSVData(headers, columns );
                }
            }
            // Closing the BufferedReader also closes the wrapped reader and stream.
            buffer.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        // Return the collected rows as a JSON string (the original returned null here).
        return jsonArray.toString();
    }

    @SuppressWarnings("unchecked")
    public static void addCSVData( List<String> headers, String[] row ) {
        if( headers.size() == row.length ) {
            HashMap<String,String> mapObj = new HashMap<String,String>();
            JSONObject jsonObj = new JSONObject();
            for (int i = 0; i < row.length; i++) {
                mapObj.put(headers.get(i), row[i]);
                jsonObj.put(headers.get(i), row[i]);
            }
            jsonArray.add(jsonObj);
            listObjects.add(mapObj);
        } else {
            System.out.println("Avoiding the Row Data...");
        }
    }
}
File data of json2Bson.csv:
Key1 Key2 Key3
11 21 31
12 22 32
13 23 33
If spaces are not important in the processing, it would probably be worth doing a strLine.trim() call each time anyway. This is what I generally do when handling input like this - spaces can easily creep into a file if it has to be edited manually, and if they're not important they can and should be ignored.
Edit: is the file encoded as UTF-8? You may need to specify the encoding when you open the file. It could be the byte order mark or something like that, if it's happening on the first line.
Try:
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF8"));
Try trimming whitespace at the beginning and end of lines read. Just replace your while with:
while ((strLine = bufferedReader.readLine()) != null) {
    strLine = strLine.trim();
    logger.info(Arrays.toString(strLine.toCharArray()));
    // equals() accepts a null argument, so this is safe even if prevLine starts out as null
    if (strLine.equals(prevLine)) {
        logger.info("Skipping the duplicate lines " + strLine);
        continue;
    }
    prevLine = strLine;
}
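Keep in mind that trim() alone will not remove a byte order mark, since U+FEFF is not whitespace as far as trim() is concerned. Below is a self-contained sketch of the same loop that also drops U+FEFF before comparing. ConsecutiveDedup, normalize and dedupe are hypothetical names, and the lines come from a list rather than from your reader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ConsecutiveDedup {
    // Remove any U+FEFF characters, then trim ordinary leading/trailing whitespace.
    static String normalize(String line) {
        return line.replace("\uFEFF", "").trim();
    }

    // Keep a line only if it differs from the previously kept line after normalizing.
    static List<String> dedupe(List<String> lines) {
        List<String> kept = new ArrayList<>();
        String prev = null;
        for (String raw : lines) {
            String line = normalize(raw);
            if (line.equals(prev)) {
                continue; // consecutive duplicate, skip it
            }
            kept.add(line);
            prev = line;
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("\uFEFFalpha", "alpha ", "beta", "beta", "alpha");
        System.out.println(dedupe(input)); // [alpha, beta, alpha]
    }
}
```

Only consecutive duplicates are skipped, so the final "alpha" is kept, which matches the behaviour of the loop above.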
I had a similar case in my previous project. The culprit was the Byte order mark, which I had to get rid of. Eventually I implemented a hack based on this example. Check it out, might be that you have the same problem.