I have several textboxes where users can enter information into them. This can include commas, so I can't use the standard comma delimited strings.
What is a good delimiter for separating strings, one that users don't typically type in their writing? I'm going to combine these fields into a single string and pass it to my encryption method. After I decrypt it, I need to be able to reliably separate the fields again.
I'm using C# if it matters.
| would be next on my list; it is often used as an alternative to CSV. Search for "pipe delimited" and you will find many examples.
string[] items = new string[] {"Uno","Dos","Tres"};
string toEncrypt = String.Join("|", items);
// StringSplitOptions.None keeps empty fields, so a blank textbox survives the round trip
items = toEncrypt.Split(new char[] {'|'}, StringSplitOptions.None);
foreach (string s in items)
    Console.WriteLine(s);
And since everyone likes to be a critic about the encoding and not provide the code, here is one way to encode the text so your | delim won't collide.
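// requires: using System; using System.Text;
string[] items = new string[] {"Uno","Dos","Tres"};
// Base64 output never contains '|', so the delimiter cannot collide with the data
for (int i = 0; i < items.Length; i++)
    items[i] = Convert.ToBase64String(Encoding.UTF8.GetBytes(items[i]));
string toEncrypt = String.Join("|", items);
// StringSplitOptions.None keeps empty fields (an empty string encodes to an empty Base64 string)
items = toEncrypt.Split(new char[] {'|'}, StringSplitOptions.None);
foreach (string s in items)
    Console.WriteLine(Encoding.UTF8.GetString(Convert.FromBase64String(s)));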
I have seen unusual characters used as delimiters, even unusual character combinations like -|::|-, but even though they are less likely to occur, they still can.
You have basically two options if you want to make it watertight:
1: Use a character that is impossible to type, like the '\0' character:
Join:
string combined = string.Join("\0", inputArray);
Split:
string[] result = combined.Split('\0');
2: Escape the strings and use a character that the escaping encodes away as the delimiter, for example URL-encode the values and use & as the delimiter:
Join:
string combined = string.Join("&", inputArray.Select<string,string>(System.Web.HttpUtility.UrlEncode).ToArray());
Split:
string[] result = combined.Split('&').Select<string,string>(System.Web.HttpUtility.UrlDecode).ToArray();
I don't think I've willingly self-delimited a collection of strings since I stopped using C. There's just no need for it in a "modern" language, and, while trivial, the number of edge cases is enough to annoy you to death.
Store them in a List<string> or string[] and serialize/deserialize them. Use XML if you want human readability or interop, or binary serialize them if you don't. You can encrypt the output easily either way, and there's no ambiguity and no need to write your own escaping routines.
In C#, it's fewer LOC and takes less time to write than this answer did. There's no excuse for rolling your own solution.
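As a rough illustration of the serialization route, here is a minimal sketch using XmlSerializer on a List<string>. The class and method names (FieldSerializer, ToXml, FromXml) are made up for this example, and I have not checked how every control character or whitespace-only value fares, so test it against your own data before relying on it.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// Hypothetical helper: the names are illustrative, not from any library.
static class FieldSerializer
{
    private static readonly XmlSerializer Serializer =
        new XmlSerializer(typeof(List<string>));

    // Turn the fields into one XML string; the serializer does all the escaping.
    public static string ToXml(List<string> fields)
    {
        using (var writer = new StringWriter())
        {
            Serializer.Serialize(writer, fields);
            return writer.ToString();
        }
    }

    // Rebuild the list from a string produced by ToXml.
    public static List<string> FromXml(string xml)
    {
        using (var reader = new StringReader(xml))
        {
            return (List<string>)Serializer.Deserialize(reader);
        }
    }
}
```

The string returned by ToXml is what you would hand to the encryption method; after decrypting, FromXml gives the fields back without any delimiter of your own.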
The best solution is to stick to commas and introduce support for character escaping. Whatever character you select will eventually need to be entered, so you may as well provide support for this.
Think backslashes + double quotes inside double-quoted strings.
Don't pick a character like backtick because some users might not know how to type it in...
Any of the non-standard characters pipe |, backtick `, tilde ~, bang !, or semi-colon ; would probably work. However, if you go this route you are really venturing away from usability. Asking them to escape commas with a backslash or something is begging for them to miss one.
If CSV is not possible then you should consider changing your UI. (Heck, you should stay away from CSV anyway for a user input!) You say textbox so I assume you're in web or some kind of win forms or WPF (definitely not a console). All of those give you better UI control than a single textbox and forcing users to conform to your difficult UI design.
More information would definitely help better guide answers.
However, here is an example of escaping a comma with a backslash. Note that you cannot escape the backslash before a comma with this, so @"uno, dos, tr\\,es" will end up as {"uno", " dos", @"tr\,es"}.
string data = @"uno, dos, tr\,es";
string[] items = data.Split(','); // {"uno", " dos", @"tr\", "es"}
List<string> realitems = new List<string>();
// Walk backwards so a piece ending in a backslash can be glued to the piece after it
for (int i = items.Length - 1; i >= 0; i--)
{
    string item = items[i];
    if (item.Length == 0) { realitems.Insert(0, ""); continue; }
    if (realitems.Count == 0) { realitems.Insert(0, item); }
    else
    {
        // Drop the escaping backslash and restore the comma that Split removed
        if (item[item.Length - 1] == '\\') { realitems[0] = item.Substring(0, item.Length - 1) + "," + realitems[0]; }
        else { realitems.Insert(0, item); }
    }
}
// Should end up with {"uno", " dos", "tr,es"}
Will the user be entering delimited strings into the textboxes, or will they be entering individual strings which will then be built into delimited strings by your code?
In the first case it might be better to rethink your UI instead. For example, the user could enter one string at a time into a textbox and click an "Add to list" button after each one.
In the second case it doesn't really matter what delimiter you use. Choose any character you like, just ensure that you escape any other occurrences of that character.
EDIT
Since several comments on other answers are asking for code, here's a method to create a comma-delimited string, using backslash as the escape character:
public static string CreateDelimitedString(IEnumerable<string> items)
{
    StringBuilder sb = new StringBuilder();
    foreach (string item in items)
    {
        sb.Append(item.Replace("\\", "\\\\").Replace(",", "\\,"));
        sb.Append(",");
    }
    return (sb.Length > 0) ? sb.ToString(0, sb.Length - 1) : string.Empty;
}
And here's the method to convert that comma-delimited string back to a collection of individual strings:
public static IEnumerable<string> GetItemsFromDelimitedString(string s)
{
    bool escaped = false;
    StringBuilder sb = new StringBuilder();
    foreach (char c in s)
    {
        if ((c == '\\') && !escaped)
        {
            escaped = true;
        }
        else if ((c == ',') && !escaped)
        {
            yield return sb.ToString();
            sb.Remove(0, sb.Length);
        }
        else
        {
            sb.Append(c);
            escaped = false;
        }
    }
    yield return sb.ToString();
}
And here's some example usage:
string[] test =
{
    "no commas or backslashes",
    "just one, comma",
    @"a comma, and a\ backslash",
    @"lots, of\ commas,\ and\, backslashes",
    @"even\\ more,, commas\\ and,, backslashes"
};
string delimited = CreateDelimitedString(test);
Console.WriteLine(delimited);
foreach (string item in GetItemsFromDelimitedString(delimited))
{
    Console.WriteLine(item);
}
I figure eventually, every character is going to be used by someone. Users always find a way to break our HL7 parser.
Instead of a single character, maybe try a string that would be random enough that nobody'd ever use it. Something like "#!@!#".
Detect a character that is not used, and then use that. Your final combined string can start with the character that is used as the delimiter from that point on.
Example: your users enter "pants", ",;,;,;,;,;" and "|~~|". You iterate through a set of characters until you find one that is not used; it could be, say, "$". Your final concatenated string is then "$pants$,;,;,;,;,;$|~~|". The initial character tells your program which character is used as the delimiter. This way, there are no forbidden characters, period.
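A minimal sketch of this idea follows. The candidate list and the names (SelfDescribingDelimiter, Join, Split) are my own assumptions for illustration; a real version would want a longer candidate list and a decision about what to do when the list is exhausted.

```csharp
using System;
using System.Linq;

// Hypothetical helper: picks the first candidate character that no field uses
// and stores it as the first character of the combined string.
static class SelfDescribingDelimiter
{
    // Candidates are tried in this order; the list itself is an assumption.
    private const string Candidates = "|$~^#;\t\u001F";

    public static string Join(string[] fields)
    {
        foreach (char candidate in Candidates)
        {
            // A candidate is usable only if it appears in none of the fields.
            if (fields.All(f => f.IndexOf(candidate) < 0))
                return candidate + string.Join(candidate.ToString(), fields);
        }
        throw new InvalidOperationException("Every candidate delimiter occurs in the input.");
    }

    public static string[] Split(string combined)
    {
        if (string.IsNullOrEmpty(combined))
            throw new ArgumentException("Missing delimiter prefix.", nameof(combined));
        // The first character names the delimiter; the rest is the payload.
        return combined.Substring(1).Split(combined[0]);
    }
}
```

With the fields from the example above, '|' is rejected because the third field contains it, '$' is accepted, and Join returns "$pants$,;,;,;,;,;$|~~|".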
I assume from what you say that the user is entering data into separate fields, and then you are combining it. So the user never needs to know or care what the delimiter is.
Don't just try to pick a character that "nobody ever uses", because either by accident or in order to try to break your code, some user will eventually use it.
So, I would either:
1: Insert backslashes to escape commas and backslashes in the user input, then combine the strings with commas. To separate, you split on unescaped commas (which is a job for a state machine), then unescape each component.
2: Use an off-the-shelf means of serializing a list of strings. What's available depends on your environment; I don't know C#/.NET well enough to advise. In Java you could just serialize a vector or whatever.
3: Separate the data with a control character like ASCII-BEL or ASCII-VT (or ASCII-NUL if your strings are never treated as nul-terminated), and reject user input which contains that character.
The first option is good if the user has to be allowed to enter any char values they like. The second option is good if you don't care about bloating the data significantly. The third option is good if you don't mind rejecting smart-alec users (or those with unusual requirements) who try to insert funny data.
As has been noted, any character that you choose has the chance of appearing in the input, so you have to handle escaping. XML may be a good serialization format to use, since I believe that .NET has good XML creation and parsing support. This is likely to be much more robust than trying to implement your own character escaping, and will also be more extensible in the future.
Nobody said TAB? Tab delimited is great but it isn't easy to type tabs into GUIs (it tends to move you to the next screen element). But for files generated by computer TAB is perfect since it really should never appear in user generated text.
Why don't you just wrap each input in quotes?
That way you end up with this:
"Aaron","Johnson","25","I like cats, and dogs"
Don't forget to escape quotes on input...
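One way the quoting approach could look in C# is sketched below. It assumes the CSV convention of escaping an embedded quote by doubling it, which the answer above does not specify, and the names (QuotedFields, Join, Split) are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper: every field is wrapped in double quotes and
// embedded quotes are doubled (an assumed, CSV-style escaping rule).
static class QuotedFields
{
    public static string Join(IEnumerable<string> fields)
    {
        return string.Join(",", fields.Select(f => "\"" + f.Replace("\"", "\"\"") + "\""));
    }

    public static List<string> Split(string line)
    {
        var result = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;
        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c != '"') { current.Append(c); }
                // A doubled quote inside a field stands for one literal quote.
                else if (i + 1 < line.Length && line[i + 1] == '"') { current.Append('"'); i++; }
                // A single quote closes the field.
                else { inQuotes = false; result.Add(current.ToString()); current.Clear(); }
            }
            else if (c == '"') { inQuotes = true; }
            // Anything else outside quotes (the separating commas) is skipped.
        }
        if (inQuotes) throw new FormatException("Unterminated quoted field.");
        return result;
    }
}
```

Because commas only count as separators outside quotes, a value such as "I like cats, and dogs" comes back as a single field.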
Mark Brackett has the correct answer. I'll only add that the very number of answers to this simple question should put you off of using delimited strings, ever. Let this be a "word to the wise".
The backtick. Nobody uses the backtick.
The pipe character (|), perhaps? If your user base is remotely IT-shy, then this approach (asking them to delimit their text) might not be the best one to take; you could try something else, e.g. provide some means of dynamically adding a text box on the fly which accepts another string, etc.
If you provide a little more information about what you're doing, and for whom, it might be possible for someone to suggest an alternative approach.
Newline? (i.e. use a multi-line text box)
I would suggest using ";"
I prefer to use a combination of characters that would not likely be entered by a normal person as my delimiter when possible. For example, I've used ")^&^(" and set it up as a const "cDelimiter" in my code, then concatenated all of my fields with that. By using a small unique string, I greatly reduce the likelihood of the user accidentally entering my delimiter. A user entering a | or a ~ is admittedly unlikely, but that doesn't mean it won't happen.
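For reference, splitting on a multi-character delimiter in C# needs the string[] overload of Split. The sketch below is only an illustration: the class and method names are hypothetical, and the check that rejects fields containing the delimiter is my addition, since without it the round trip silently breaks.

```csharp
using System;

// Hypothetical helper around a multi-character delimiter string.
static class MultiCharDelimiter
{
    private const string Delimiter = ")^&^(";

    public static string Join(string[] fields)
    {
        foreach (string f in fields)
        {
            // Added safety check (not in the answer above): refuse ambiguous input.
            if (f.Contains(Delimiter))
                throw new ArgumentException("A field contains the delimiter sequence.");
        }
        return string.Join(Delimiter, fields);
    }

    public static string[] Split(string combined)
    {
        // The string[] overload splits on the whole sequence, not on its characters.
        return combined.Split(new[] { Delimiter }, StringSplitOptions.None);
    }
}
```

Single characters from the delimiter, such as | or ^, are still allowed in the fields; only the full sequence is rejected.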
Use a tab (or maybe \n) - which if entered by the user would cause the text box to be exited.
I also support the selection of TAB (\t) and, to some extent, the PIPE (|) symbol.
But the most used one in my experience is the semicolon (;), together with quoted fields and escapes for \ and \", which works very well. It just needs a parser that keeps state. The actual delimiting character becomes unimportant.
If you use no escaping, it is wise to count the fields per line and compare the count to what you expect. Most applications of this kind of file use a fixed number of fields, so you can catch errors in the entry, and you get the reassurance that everything is fine when the check does not trigger.
I know this response is considerably late but I experienced this issue a while back and addressed it reasonably well (IMHO). Hopefully, in the future, this will help someone else looking for an answer to a similar question.
While I would generally put myself in camps similar to Mike Ottum, John Saunders and Mark Brackett, the simple fact of the matter is that sometimes we developers have to do things we'd rather not. My particular case had a need for providing a (mostly) human-readable "id" to use in a RESTful URI that was derived from an organic composite key of an object. Binary or XML serialization was not really an option. So? I chose to re-invent as little of the wheel as I could. The System.Text.RegularExpressions.Regex class has escape/unescape methods that operate on those crazy regex patterns. There's a handful of escapable chars to pick from. I settled on the pipe ('|') character.
Here was my implementation (class-ified for reuse but you could carve out the good bits for a 7 line "inline" solution if that's how you like to roll):
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
namespace RPlus.DTO
{
    /// <summary>
    /// Provide safe string un/concatenating
    /// </summary>
    static class Glob
    {
        // a Regex Split param that basically says:
        // Split on the pipe char unless it is escaped, i.e. preceded by an
        // odd number of backslashes (so an item ending in '\' still splits)
        private const string _splitterer = @"(?<=(?<!\\)(?:\\\\)*)\|";
        // no explanation needed (hopefully)
        private const char _delimiter = '|';
        /// <summary>
        /// Produce a properly escaped concatenation
        /// from some number of strings
        /// </summary>
        /// <param name="items">strings to escape/concatenate</param>
        /// <returns>an escaped concatenation of items</returns>
        public static string To(IEnumerable<string> items)
        {
            var escapedItems = new List<string>();
            foreach (var s in items) escapedItems.Add(Regex.Escape(s));
            return string.Join(_delimiter.ToString(), escapedItems);
        }
        /// <summary>
        /// Unconcatenate/unescape a string into its original strings
        /// </summary>
        /// <param name="globbedValue">
        /// A value returned from Glob.To()
        /// </param>
        /// <returns>
        /// The original strings used to construct the globbedValue
        /// </returns>
        public static List<string> From(string globbedValue)
        {
            return From(globbedValue, default(int?));
        }
        /// <summary>
        /// Unconcatenate/unescape a string into its original strings
        /// </summary>
        /// <param name="globbedValue">
        /// A value returned from Glob.To()
        /// </param>
        /// <param name="expectedTokens">
        /// The number of string tokens that
        /// should be found in the concatenation
        /// </param>
        /// <returns>
        /// The original strings used to construct the globbedValue
        /// </returns>
        public static List<string> From(string globbedValue, int? expectedTokens)
        {
            var nugs = Regex.Split(globbedValue, _splitterer);
            if (expectedTokens.HasValue && nugs.Length != expectedTokens.Value)
                throw new ArgumentException("Unexpected number of tokens");
            var unescapedItems = new List<string>();
            foreach (var s in nugs) unescapedItems.Add(Regex.Unescape(s));
            return unescapedItems;
        }
    }
}
And here's some sample usage:
var glob = Glob.To(new string[] { "Foo|Bar", "Bar|Baz", "Baz|Qux" });
var orig = Glob.From(glob);
CAVEAT: Please don't try to find "a character that users won't ever input" for use as a delimiter of concatenated strings. Users WILL eventually input it. There's already enough "magic number" code out there waiting to explode. And there are manifold tried and tested solutions to the problem.
If you want to use a really unique separator, I would recommend ╡ or the unique string \u2561.
Why not use ASCII Code 31 (Unit Separator), which was specifically reserved for the purpose of separating data elements within strings?
It is impossible for a regular person to enter that character, difficult for someone with programmer tools to accidentally enter it and so you don't have to worry about escaping in edge cases.
Source: https://stackoverflow.com/questions/815782/what-is-a-more-unique-delimiter-than-comma-for-separating-strings