Algorithm for joining e.g. an array of strings

后端未结

关注

 16  1770

I have wondered for some time, what a nice, clean solution for joining an array of strings might look like. Example: I have [\"Alpha\", \"Beta\", \"Gamma\"] and want to join

相关标签:

16条回答

你的背包

2021-01-01 22:05
All of these solutions are decent ones, but for an underlying library, both independence of separator and decent speed are important. Here is a function that fits the requirement assuming the language has some form of string builder.
```
public static string join(String[] strings, String sep) {
  if(strings.length == 0) return "";
  if(strings.length == 1) return strings[0];
  StringBuilder sb = new StringBuilder();
  sb.append(strings[0]);
  for(int i = 1; i < strings.length; i++) {
    sb.append(sep);
    sb.append(strings[i]);
  }
  return sb.toString();
}
```
EDIT: I suppose I should mention why this would be speedier. The main reason would be because any time you call c = a + b; the underlying construct is usually c = (new StringBuilder()).append(a).append(b).toString();. By reusing the same string builder object, we can reduce the amount of allocations and garbage we produce.

And before someone chimes in with optimization is evil, we're talking about implementing a common library function. Acceptable, scalable performance is one of the requirements them. A join that takes a long time is one that's going to be not oft used.
0 讨论(0)
发布评论:

提交评论
- 加载中...
说谎

2021-01-01 22:08
collecting different language implementations ?
Here is, for your amusement, a Smalltalk version:
```
join:collectionOfStrings separatedBy:sep

  |buffer|

  buffer := WriteStream on:''.
  collectionOfStrings 
      do:[:each | buffer nextPutAll:each ]
      separatedBy:[ buffer nextPutAll:sep ].
  ^ buffer contents.
```
Of course, the above code is already in the standard library found as:

Collection >> asStringWith:

so, using that, you'd write:
```
#('A' 'B' 'C') asStringWith:','
```
But here's my main point:

I would like to put more emphasis on the fact that using a StringBuilder (or what is called "WriteStream" in Smalltalk) is highly recommended. Do not concatenate strings using "+" in a loop - the result will be many many intermediate throw-away strings. If you have a good Garbage Collector, thats fine. But some are not and a lot of memory needs to be reclaimed. StringBuilder (and WriteStream, which is its grand-grand-father) use a buffer-doubling or even adaptive growing algorithm, which needs MUCH less scratch memory.

However, if its only a few small strings you are concatenating, dont care, and "+" them; the extra work using a StringBuilder might be actually counter-productive, up to an implementation- and language-dependent number of strings.
0 讨论(0)
发布评论:

提交评论
- 加载中...
心在旅途

2021-01-01 22:10
In Perl, I just use the join command:
```
$ echo "Alpha
Beta
Gamma" | perl -e 'print(join(", ", map {chomp; $_} <> ))'
Alpha, Beta, Gamma
```
(The map stuff is mostly there to create a list.)

In languages that don't have a built in, like C, I use simple iteration (untested):
```
for (i = 0; i < N-1; i++){
    strcat(s, a[i]);
    strcat(s, ", ");
}
strcat(s, a[N]);
```
Of course, you'd need to check the size of s before you add more bytes to it.

You either have to special case the first entry or the last.
0 讨论(0)
发布评论:

提交评论
- 加载中...

夕颜

2021-01-01 22:13

@Mendelt Siebenga

Strings are corner-stone objects in programming languages. Different languages implement strings differently. An implementation of join() strongly depends on underlying implementation of strings. Pseudocode doesn't reflect underlying implementation.

Consider join() in Python. It can be easily used:

print ", ".join(["Alpha", "Beta", "Gamma"])
# Alpha, Beta, Gamma

It could be easily implemented as follow:

def join(seq, sep=" "):
    if not seq:         return ""
    elif len(seq) == 1: return seq[0]
    return reduce(lambda x, y: x + sep + y, seq)

print join(["Alpha", "Beta", "Gamma"], ", ")
# Alpha, Beta, Gamma

And here how join() method is implemented in C (taken from trunk):

PyDoc_STRVAR(join__doc__,
"S.join(sequence) -> string\n\
\n\
Return a string which is the concatenation of the strings in the\n\
sequence.  The separator between elements is S.");

static PyObject *
string_join(PyStringObject *self, PyObject *orig)
{
    char *sep = PyString_AS_STRING(self);
    const Py_ssize_t seplen = PyString_GET_SIZE(self);
    PyObject *res = NULL;
    char *p;
    Py_ssize_t seqlen = 0;
    size_t sz = 0;
    Py_ssize_t i;
    PyObject *seq, *item;

    seq = PySequence_Fast(orig, "");
    if (seq == NULL) {
        return NULL;
    }

    seqlen = PySequence_Size(seq);
    if (seqlen == 0) {
        Py_DECREF(seq);
        return PyString_FromString("");
    }
    if (seqlen == 1) {
        item = PySequence_Fast_GET_ITEM(seq, 0);
        if (PyString_CheckExact(item) || PyUnicode_CheckExact(item)) {
            Py_INCREF(item);
            Py_DECREF(seq);
            return item;
        }
    }

    /* There are at least two things to join, or else we have a subclass
     * of the builtin types in the sequence.
     * Do a pre-pass to figure out the total amount of space we'll
     * need (sz), see whether any argument is absurd, and defer to
     * the Unicode join if appropriate.
     */
    for (i = 0; i < seqlen; i++) {
        const size_t old_sz = sz;
        item = PySequence_Fast_GET_ITEM(seq, i);
        if (!PyString_Check(item)){
#ifdef Py_USING_UNICODE
            if (PyUnicode_Check(item)) {
                /* Defer to Unicode join.
                 * CAUTION:  There's no gurantee that the
                 * original sequence can be iterated over
                 * again, so we must pass seq here.
                 */
                PyObject *result;
                result = PyUnicode_Join((PyObject *)self, seq);
                Py_DECREF(seq);
                return result;
            }
#endif
            PyErr_Format(PyExc_TypeError,
                     "sequence item %zd: expected string,"
                     " %.80s found",
                     i, Py_TYPE(item)->tp_name);
            Py_DECREF(seq);
            return NULL;
        }
        sz += PyString_GET_SIZE(item);
        if (i != 0)
            sz += seplen;
        if (sz < old_sz || sz > PY_SSIZE_T_MAX) {
            PyErr_SetString(PyExc_OverflowError,
                "join() result is too long for a Python string");
            Py_DECREF(seq);
            return NULL;
        }
    }

    /* Allocate result space. */
    res = PyString_FromStringAndSize((char*)NULL, sz);
    if (res == NULL) {
        Py_DECREF(seq);
        return NULL;
    }

    /* Catenate everything. */
    p = PyString_AS_STRING(res);
    for (i = 0; i < seqlen; ++i) {
        size_t n;
        item = PySequence_Fast_GET_ITEM(seq, i);
        n = PyString_GET_SIZE(item);
        Py_MEMCPY(p, PyString_AS_STRING(item), n);
        p += n;
        if (i < seqlen - 1) {
            Py_MEMCPY(p, sep, seplen);
            p += seplen;
        }
    }

    Py_DECREF(seq);
    return res;
}

Note that the above Catenate everything. code is a small part of the whole function.

In pseudocode:

/* Catenate everything. */
for each item in sequence
    copy-assign item
    if not last item
        copy-assign separator

0 讨论(0)

鱼传尺愫

2021-01-01 22:14
The most elegant solution i found for problems like this is something like this (in pseudocode)
```
separator = ""
foreach(item in stringCollection)
{
    concatenatedString += separator + item
    separator = ","
}
```
You just run the loop and only after the second time around the separator is set. So the first time it won't get added. It's not as clean as I'd like it to be so I'd still add comments but it's better than an if statement or adding the first or last item outside the loop.
0 讨论(0)
发布评论:

提交评论
- 加载中...

醉梦人生

2021-01-01 22:15

join() in Perl:

use List::Util qw(reduce);

sub mjoin($@) {$sep = shift; reduce {$a.$sep.$b} @_ or ''}

say mjoin(', ', qw(Alpha Beta Gamma));
# Alpha, Beta, Gamma

Or without reduce:

 sub mjoin($@) 
 {
   my ($sep, $sum) = (shift, shift); 
   $sum .= $sep.$_ for (@_); 
   $sum or ''
 }

0 讨论(0)