Algorithm for joining e.g. an array of strings

陌清茗 2021-01-01 21:40

I have wondered for some time, what a nice, clean solution for joining an array of strings might look like. Example: I have ["Alpha", "Beta", "Gamma"] and want to join

16 answers
  • 2021-01-01 22:05

    All of these solutions are decent ones, but for an underlying library, both independence of separator and decent speed are important. Here is a function that fits the requirement assuming the language has some form of string builder.

    public static String join(String[] strings, String sep) {
      if(strings.length == 0) return "";
      if(strings.length == 1) return strings[0];
      StringBuilder sb = new StringBuilder();
      sb.append(strings[0]);
      for(int i = 1; i < strings.length; i++) {
        sb.append(sep);
        sb.append(strings[i]);
      }
      return sb.toString();
    }
    

    EDIT: I suppose I should mention why this would be speedier. The main reason is that any time you write c = a + b; the underlying construct is usually c = (new StringBuilder()).append(a).append(b).toString();. By reusing the same string builder object, we reduce the number of allocations and the amount of garbage we produce.

    And before someone chimes in that optimization is evil: we're talking about implementing a common library function. Acceptable, scalable performance is one of the requirements then. A join that takes a long time is one that isn't going to be used often.
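
    As a rough illustration of the same idea in Python, here is a minimal sketch that uses io.StringIO as the single reusable buffer, in the role StringBuilder plays above. The name join_with_buffer is made up for this example; in everyday Python you would simply call sep.join(strings).

```python
import io


def join_with_buffer(strings, sep):
    """Join strings with sep, writing everything into one buffer."""
    if len(strings) == 0:
        return ""
    if len(strings) == 1:
        return strings[0]
    buffer = io.StringIO()
    buffer.write(strings[0])           # first item: no separator in front
    for i in range(1, len(strings)):
        buffer.write(sep)
        buffer.write(strings[i])
    return buffer.getvalue()


print(join_with_buffer(["Alpha", "Beta", "Gamma"], ", "))  # Alpha, Beta, Gamma
```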

  • 2021-01-01 22:08

    Collecting different language implementations?
    Here is, for your amusement, a Smalltalk version:

    join:collectionOfStrings separatedBy:sep
    
      |buffer|
    
      buffer := WriteStream on:''.
      collectionOfStrings 
          do:[:each | buffer nextPutAll:each ]
          separatedBy:[ buffer nextPutAll:sep ].
      ^ buffer contents.
    

    Of course, the above code is already in the standard library, where it is found as:

    Collection >> asStringWith:

    so, using that, you'd write:

    #('A' 'B' 'C') asStringWith:','
    

    But here's my main point:

    I would like to put more emphasis on the fact that using a StringBuilder (or what is called "WriteStream" in Smalltalk) is highly recommended. Do not concatenate strings using "+" in a loop: the result will be many, many intermediate throw-away strings. If you have a good garbage collector, that's fine. But some collectors are not that good, and then a lot of memory needs to be reclaimed. StringBuilder (and WriteStream, which is its great-grandfather) uses a buffer-doubling or even adaptive growing algorithm, which needs MUCH less scratch memory.

    However, if it's only a few small strings you are concatenating, don't worry about it and just "+" them; the extra work of using a StringBuilder might actually be counter-productive, up to an implementation- and language-dependent number of strings.
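
    To make the buffer-doubling idea concrete, here is a small Python sketch. DoublingBuffer is a hypothetical class written only for this illustration; it is not how StringBuilder or WriteStream is actually implemented. It just counts how often the backing storage has to be regrown.

```python
class DoublingBuffer:
    """Toy append-only character buffer that doubles its capacity when full."""

    def __init__(self, capacity=16):
        self._chars = [""] * capacity   # preallocated backing storage
        self._length = 0                # number of slots actually in use
        self.reallocations = 0          # how many times storage was regrown

    def append(self, text):
        needed = self._length + len(text)
        if needed > len(self._chars):
            new_capacity = len(self._chars)
            while new_capacity < needed:
                new_capacity *= 2
            # Copy the used part into a new, larger backing list.
            padding = [""] * (new_capacity - self._length)
            self._chars = self._chars[:self._length] + padding
            self.reallocations += 1
        # Copy the new characters into the free slots.
        self._chars[self._length:needed] = text
        self._length = needed

    def contents(self):
        return "".join(self._chars[:self._length])


buffer = DoublingBuffer()
for _ in range(1000):
    buffer.append("ab")
print(len(buffer.contents()), buffer.reallocations)  # 2000 7
```

    Here 1000 appends cause only 7 regrowths, whereas naive "+" in a loop would, absent any runtime optimization, create a new intermediate string on every step.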

  • 2021-01-01 22:10

    In Perl, I just use the join command:

    $ echo "Alpha
    Beta
    Gamma" | perl -e 'print(join(", ", map {chomp; $_} <> ))'
    Alpha, Beta, Gamma
    

    (The map stuff is mostly there to create a list.)

    In languages that don't have a built-in, like C, I use simple iteration (untested):

    for (i = 0; i < N-1; i++){
        strcat(s, a[i]);
        strcat(s, ", ");
    }
    if (N > 0)
        strcat(s, a[N-1]);   /* the last element is a[N-1]; a[N] is out of bounds */
    

    Of course, you'd need to check the size of s before you add more bytes to it.

    You either have to special case the first entry or the last.
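
    Special-casing the first entry tends to be the easier of the two when the number of items is not known up front, because you never have to look ahead. A minimal Python sketch, with join_iter as a made-up name, assuming the items arrive from an arbitrary iterable:

```python
import io


def join_iter(items, sep=", "):
    """Join items from any iterable, treating the first one specially."""
    iterator = iter(items)
    out = io.StringIO()
    try:
        out.write(next(iterator))   # first entry: no separator in front
    except StopIteration:
        return ""                   # empty input
    for item in iterator:
        out.write(sep)              # every later entry is preceded by sep
        out.write(item)
    return out.getvalue()


lines = (line.strip() for line in ["Alpha\n", "Beta\n", "Gamma\n"])
print(join_iter(lines))  # Alpha, Beta, Gamma
```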

  • 2021-01-01 22:13

    @Mendelt Siebenga

    Strings are cornerstone objects in programming languages, and different languages implement them differently. An implementation of join() depends strongly on the underlying implementation of strings, which pseudocode doesn't reflect.

    Consider join() in Python. It can be easily used:

    print(", ".join(["Alpha", "Beta", "Gamma"]))
    # Alpha, Beta, Gamma
    

    It could be easily implemented as follows:

    from functools import reduce  # a builtin in Python 2; Python 3 needs this import

    def join(seq, sep=" "):
        if not seq:         return ""
        elif len(seq) == 1: return seq[0]
        return reduce(lambda x, y: x + sep + y, seq)
    
    print(join(["Alpha", "Beta", "Gamma"], ", "))
    # Alpha, Beta, Gamma
    

    And here is how the join() method is implemented in C (taken from trunk):

    PyDoc_STRVAR(join__doc__,
    "S.join(sequence) -> string\n\
    \n\
    Return a string which is the concatenation of the strings in the\n\
    sequence.  The separator between elements is S.");
    
    static PyObject *
    string_join(PyStringObject *self, PyObject *orig)
    {
        char *sep = PyString_AS_STRING(self);
        const Py_ssize_t seplen = PyString_GET_SIZE(self);
        PyObject *res = NULL;
        char *p;
        Py_ssize_t seqlen = 0;
        size_t sz = 0;
        Py_ssize_t i;
        PyObject *seq, *item;
    
        seq = PySequence_Fast(orig, "");
        if (seq == NULL) {
            return NULL;
        }
    
        seqlen = PySequence_Size(seq);
        if (seqlen == 0) {
            Py_DECREF(seq);
            return PyString_FromString("");
        }
        if (seqlen == 1) {
            item = PySequence_Fast_GET_ITEM(seq, 0);
            if (PyString_CheckExact(item) || PyUnicode_CheckExact(item)) {
                Py_INCREF(item);
                Py_DECREF(seq);
                return item;
            }
        }
    
        /* There are at least two things to join, or else we have a subclass
         * of the builtin types in the sequence.
         * Do a pre-pass to figure out the total amount of space we'll
         * need (sz), see whether any argument is absurd, and defer to
         * the Unicode join if appropriate.
         */
        for (i = 0; i < seqlen; i++) {
            const size_t old_sz = sz;
            item = PySequence_Fast_GET_ITEM(seq, i);
            if (!PyString_Check(item)){
    #ifdef Py_USING_UNICODE
                if (PyUnicode_Check(item)) {
                    /* Defer to Unicode join.
                     * CAUTION:  There's no gurantee that the
                     * original sequence can be iterated over
                     * again, so we must pass seq here.
                     */
                    PyObject *result;
                    result = PyUnicode_Join((PyObject *)self, seq);
                    Py_DECREF(seq);
                    return result;
                }
    #endif
                PyErr_Format(PyExc_TypeError,
                         "sequence item %zd: expected string,"
                         " %.80s found",
                         i, Py_TYPE(item)->tp_name);
                Py_DECREF(seq);
                return NULL;
            }
            sz += PyString_GET_SIZE(item);
            if (i != 0)
                sz += seplen;
            if (sz < old_sz || sz > PY_SSIZE_T_MAX) {
                PyErr_SetString(PyExc_OverflowError,
                    "join() result is too long for a Python string");
                Py_DECREF(seq);
                return NULL;
            }
        }
    
        /* Allocate result space. */
        res = PyString_FromStringAndSize((char*)NULL, sz);
        if (res == NULL) {
            Py_DECREF(seq);
            return NULL;
        }
    
        /* Catenate everything. */
        p = PyString_AS_STRING(res);
        for (i = 0; i < seqlen; ++i) {
            size_t n;
            item = PySequence_Fast_GET_ITEM(seq, i);
            n = PyString_GET_SIZE(item);
            Py_MEMCPY(p, PyString_AS_STRING(item), n);
            p += n;
            if (i < seqlen - 1) {
                Py_MEMCPY(p, sep, seplen);
                p += seplen;
            }
        }
    
        Py_DECREF(seq);
        return res;
    }
    

    Note that the "Catenate everything" step above is only a small part of the whole function.

    In pseudocode:

    /* Catenate everything. */
    for each item in sequence
        copy-assign item
        if not last item
            copy-assign separator
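
    The same two-pass structure (measure first, allocate once, then copy) can be sketched in Python with a preallocated bytearray. This is only an illustration of the technique, not a translation of the C function: join_prealloc is a hypothetical name, and the type checks, Unicode fallback and overflow checks are left out.

```python
def join_prealloc(sep, items):
    """Join byte strings by sizing the result first, then copying once."""
    items = list(items)
    if not items:
        return b""
    # Pre-pass: total payload plus one separator between each pair.
    total = sum(len(item) for item in items) + len(sep) * (len(items) - 1)
    result = bytearray(total)          # allocate the result space once
    pos = 0
    for index, item in enumerate(items):
        result[pos:pos + len(item)] = item
        pos += len(item)
        if index < len(items) - 1:     # no separator after the last item
            result[pos:pos + len(sep)] = sep
            pos += len(sep)
    return bytes(result)


print(join_prealloc(b", ", [b"Alpha", b"Beta", b"Gamma"]))  # b'Alpha, Beta, Gamma'
```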
    
  • 2021-01-01 22:14

    The most elegant solution I found for problems like this is something like this (in pseudocode):

    concatenatedString = ""
    separator = ""
    foreach(item in stringCollection)
    {
        concatenatedString += separator + item
        separator = ","
    }
    

    The separator starts out empty and is only set at the end of the first pass through the loop, so it doesn't get added before the first item. It's not as clean as I'd like it to be, so I'd still add comments, but it's better than an if statement or adding the first or last item outside the loop.
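
    For comparison, the separator-swap trick might look like this in Python. This is a sketch only: join_lazy_separator is an invented name, and io.StringIO stands in for whatever string builder the language offers.

```python
import io


def join_lazy_separator(items, sep=","):
    """Join items, switching the separator on after the first item."""
    out = io.StringIO()
    separator = ""              # empty in front of the first item
    for item in items:
        out.write(separator)
        out.write(item)
        separator = sep         # every later item is preceded by sep
    return out.getvalue()


print(join_lazy_separator(["Alpha", "Beta", "Gamma"]))  # Alpha,Beta,Gamma
```

    The loop body has no branch and needs no length check, so it also works for an empty collection or for an iterator of unknown length.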

  • 2021-01-01 22:15

    join() in Perl:

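    use feature 'say';
    use List::Util qw(reduce);
    
    # my $sep keeps the separator local; // '' only replaces undef (empty list)
    sub mjoin($@) { my $sep = shift; (reduce { $a . $sep . $b } @_) // '' }
    
    say mjoin(', ', qw(Alpha Beta Gamma));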
    # Alpha, Beta, Gamma
    

    Or without reduce:

     sub mjoin($@) 
     {
       my ($sep, $sum) = (shift, shift); 
       $sum .= $sep.$_ for (@_); 
       $sum // ''
     }
    