I have a data stream that may contain \r, \n, \r\n, \n\r or any combination of them. Is there a simple way to normalize the data so that all of them simply become \r\n?
This solution rewrites the string according to the translation table given in the question. It does not use an expensive regex, and it does not chain several Replace calls, each of which would loop over the data again with its own checks.
The search is done in a single for loop. Each time the capacity of the result array has to be increased, Array.Copy runs one more loop internally. Those are all the loops. In some cases, a larger page size might be more efficient.
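using System;

// Extension methods must live in a static class; the class name here is arbitrary.
public static class StringExtensions
{
    public static string NormalizeNewLine(this string val)
    {
        if (string.IsNullOrEmpty(val))
            return val;

        const int page = 6; // extra slots added each time the result array grows
        int a = page;       // spare slots used since the last growth; starts "full" to force the first growth
        int j = 0;          // write position in res
        int len = val.Length;
        char[] res = new char[len];
        for (int i = 0; i < len; i++)
        {
            char ch = val[i];
            if (ch == '\r')
            {
                int ni = i + 1;
                if (ni < len && val[ni] == '\n')
                {
                    // \r\n is already normalized: copy both characters
                    res[j++] = '\r';
                    res[j++] = '\n';
                    i++;
                }
                else
                {
                    // lone \r: the output grows by one character
                    if (a == page) // ensure capacity
                    {
                        char[] nres = new char[res.Length + page];
                        Array.Copy(res, 0, nres, 0, res.Length);
                        res = nres;
                        a = 0;
                    }
                    res[j++] = '\r';
                    res[j++] = '\n';
                    a++;
                }
            }
            else if (ch == '\n')
            {
                int ni = i + 1;
                if (ni < len && val[ni] == '\r')
                {
                    // \n\r: swap into \r\n, same length
                    res[j++] = '\r';
                    res[j++] = '\n';
                    i++;
                }
                else
                {
                    // lone \n: the output grows by one character
                    if (a == page) // ensure capacity
                    {
                        char[] nres = new char[res.Length + page];
                        Array.Copy(res, 0, nres, 0, res.Length);
                        res = nres;
                        a = 0;
                    }
                    res[j++] = '\r';
                    res[j++] = '\n';
                    a++;
                }
            }
            else
            {
                res[j++] = ch;
            }
        }
        return new string(res, 0, j);
    }
}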
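For comparison, here is a minimal sketch of the same mapping built on a StringBuilder, which manages its own capacity so the manual paging is not needed. The names NewLineSketch and NormalizeWithBuilder are made up for illustration, and the sketch assumes \n\r should count as one break just like \r\n, as in the method above. No performance claim is made relative to the array version.

```csharp
using System.Text;

public static class NewLineSketch
{
    // Hypothetical helper: maps \r, \n, \r\n and \n\r to \r\n in one pass.
    public static string NormalizeWithBuilder(string val)
    {
        if (string.IsNullOrEmpty(val))
            return val;

        var sb = new StringBuilder(val.Length + 16);
        for (int i = 0; i < val.Length; i++)
        {
            char ch = val[i];
            if (ch == '\r' || ch == '\n')
            {
                // The other half of a two-character break, if there is one.
                char partner = ch == '\r' ? '\n' : '\r';
                if (i + 1 < val.Length && val[i + 1] == partner)
                    i++; // consume the pair as a single break
                sb.Append("\r\n");
            }
            else
            {
                sb.Append(ch);
            }
        }
        return sb.ToString();
    }
}
```

Calling NewLineSketch.NormalizeWithBuilder("a\rb\n\rc") should give "a\r\nb\r\nc".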
The translation table really appeals to me, even if '\n\r' is not actually used on common platforms. Who would use two different types of line break to indicate two line breaks? If you need to know, you have to scan the document beforehand to check whether \n and \r are both used separately in it.
A regex would help. You could do something roughly like this:
(\r\n|\n\n|\n\r|\r|\n) replace with \r\n
This regex produced the following matches against the table posted (testing only the match side), so a replace should normalize.
\r => \r
\n => \n
\n\n => \n\n
\n\r => \n\r
\r\n => \r\n
\r\n => \r\n
\n => \n
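A small sketch for reproducing a match list like the one above in C#. It only lists what the pattern matches and does not replace anything. BreakMatchSketch and VisibleMatches are hypothetical names chosen for this example.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BreakMatchSketch
{
    // Same alternation as in the answer above.
    private static readonly Regex Breaks = new Regex(@"(\r\n|\n\n|\n\r|\r|\n)");

    // Returns every match with CR and LF written out as visible \r and \n.
    public static List<string> VisibleMatches(string input)
    {
        var result = new List<string>();
        foreach (Match m in Breaks.Matches(input))
            result.Add(m.Value.Replace("\r", "\\r").Replace("\n", "\\n"));
        return result;
    }
}
```

Note that under this pattern \n\n is a single match, so replacing each match with \r\n would turn two consecutive \n into one line break. Dropping the \n\n alternative avoids that.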
Normalise breaks, so that they are all \r\n
var normalisedString =
    sourceString
        .Replace("\r\n", "\n")
        .Replace("\n\r", "\n")
        .Replace("\r", "\n")
        .Replace("\n", "\r\n");
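You're overcomplicating this. Ignore every \r and turn every \n into an \r\n.
In C#, assuming buffer is a TextReader over the stream (the chunk size is arbitrary):
// Requires: using System.IO; using System.Text;
char[] chunk = new char[4096];
StringBuilder output = new StringBuilder();
int read;
while ((read = buffer.Read(chunk, 0, chunk.Length)) > 0)
{
    for (int k = 0; k < read; k++)
    {
        char c = chunk[k];
        switch (c)
        {
            case '\r': break;                          // ignore
            case '\n': output.Append("\r\n"); break;   // expand to \r\n
            default: output.Append(c); break;
        }
    }
}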
EDIT: \r alone is not a line terminator on current mainstream platforms (it was on classic Mac OS), so I doubt you really want to expand \r to \r\n.
I'm with Jamie Zawinski on RegEx:
"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."
For those of us who prefer readability:
Step 1
Replace \r\n by \n
Replace \n\r by \n (if you really want this, some posters seem to think not)
Replace \r by \n
Step 2
Replace \n by Environment.NewLine or \r\n or whatever.
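The two steps can be sketched in C# roughly as follows. TwoStepSketch and its newLine parameter are illustrative names, not part of any library, and the \n\r replacement is kept on the assumption that you do want it.

```csharp
using System;

public static class TwoStepSketch
{
    // newLine defaults to Environment.NewLine when not supplied.
    public static string Normalize(string text, string newLine = null)
    {
        if (text == null)
            return null;
        if (newLine == null)
            newLine = Environment.NewLine;

        // Step 1: collapse every variant to a bare \n.
        string unified = text
            .Replace("\r\n", "\n")
            .Replace("\n\r", "\n") // drop this line if \n\r should not count as one break
            .Replace("\r", "\n");

        // Step 2: expand \n to the target line break.
        return unified.Replace("\n", newLine);
    }
}
```

For example, TwoStepSketch.Normalize("a\r\nb\rc\nd", "\r\n") should return "a\r\nb\r\nc\r\nd". The order in step 1 matters: the two-character sequences have to be replaced before the lone \r.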
I believe this will do what you need:
using System.Text.RegularExpressions;
// ...
string normalized = Regex.Replace(originalString, @"\r\n|\n\r|\n|\r", "\r\n");
I'm not 100% sure of the exact syntax, and I don't have a .NET compiler handy to check. I wrote it in Perl and converted it into (hopefully correct) C#. The only real trick is to match "\r\n" and "\n\r" first.
To apply it to an entire stream, just run it on chunks of input, taking care that a two-character break is not split across a chunk boundary. (You could do this with a stream wrapper if you want.)
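One possible way to run the regex over chunks is sketched below. It assumes the input is available as a TextReader; the class name, method name and chunkSize parameter are invented for this example. A lone \r or \n at the very end of a chunk is held back and prepended to the next chunk, because it might be the first half of a \r\n or \n\r pair.

```csharp
using System.IO;
using System.Text.RegularExpressions;

public static class StreamNewlineSketch
{
    private static readonly Regex AnyBreak = new Regex(@"\r\n|\n\r|\n|\r");

    public static void Normalize(TextReader input, TextWriter output, int chunkSize = 4096)
    {
        char[] chunk = new char[chunkSize];
        string carry = ""; // a lone break character held back from the previous chunk
        int read;
        while ((read = input.Read(chunk, 0, chunk.Length)) > 0)
        {
            string text = carry + new string(chunk, 0, read);
            carry = "";
            string result = AnyBreak.Replace(text, m =>
            {
                // A single break character at the very end may continue in the next chunk.
                if (m.Length == 1 && m.Index + 1 == text.Length)
                {
                    carry = m.Value;
                    return "";
                }
                return "\r\n";
            });
            output.Write(result);
        }
        // Whatever was held back at the end of the input is a complete break.
        if (carry.Length > 0)
            output.Write("\r\n");
    }
}
```

With a StringReader over "a\r\nb" and a chunkSize of 2, the \r and \n land in different chunks and should still come out as a single \r\n.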
The original Perl:
$str =~ s/\r\n|\n\r|\n|\r/\r\n/g;
The test results:
[bash$] ./test.pl
\r -> \r\n
\n -> \r\n
\n\n -> \r\n\r\n
\n\r -> \r\n
\r\n -> \r\n
\r\n\n -> \r\n\r\n
Update: Now converts \n\r to \r\n, though I wouldn't call that normalization.