Parse string into array based on spaces or “double quotes strings”

庸人自扰 2021-01-21 09:29

I'm trying to take a user input string and parse it into an array called char *entire_line[100]; where each word is put at a different index of the array but if a part of the str

4 answers
  •  囚心锁ツ
    2021-01-21 10:19

    The strtok function is a terrible way to tokenize in C, except for one (admittedly common) case: simple whitespace-separated words. (Even then it's still not great, because it keeps hidden static state: it is not reentrant and two tokenizing loops cannot be nested. That is why we invented strsep for BSD way back when.)
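
    To make the "simple whitespace-separated words" case concrete, here is a minimal sketch of that one use of strtok. The helper name split_whitespace and its parameters are made up for illustration; they are not part of the question or of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: split `line` in place on spaces, tabs and newlines.
 * Stores at most `max` token pointers in `out` and returns how many were
 * stored. strtok overwrites delimiters with '\0', so `line` must be
 * writable (not a string literal). */
size_t split_whitespace(char *line, char **out, size_t max)
{
    size_t n = 0;
    char *tok;

    assert(line != NULL && out != NULL);

    tok = strtok(line, " \t\n");
    while (tok != NULL && n < max) {
        out[n++] = tok;
        tok = strtok(NULL, " \t\n"); /* NULL means "continue the same string" */
    }
    return n;
}
```

    Given the input say "hello world" now, this yields four tokens, with the quoted part split in two and the quote characters left attached, which is exactly what the question wants to avoid.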

    Your best bet in this case is to build your own simple state-machine:

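    /* needs <ctype.h> for isspace(); buffer is the NUL-terminated input line */
    char *p, *start_of_word = NULL; /* start_of_word: where the current token begins */
    int c;
    enum states { DULL, IN_WORD, IN_STRING } state = DULL;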
    
    for (p = buffer; *p != '\0'; p++) {
        c = (unsigned char) *p; /* convert to unsigned char for is* functions */
        switch (state) {
        case DULL: /* not in a word, not in a double quoted string */
            if (isspace(c)) {
                /* still not in a word, so ignore this char */
                continue;
            }
            /* not a space -- if it's a double quote we go to IN_STRING, else to IN_WORD */
            if (c == '"') {
                state = IN_STRING;
                start_of_word = p + 1; /* word starts at *next* char, not this one */
                continue;
            }
            state = IN_WORD;
            start_of_word = p; /* word starts here */
            continue;
    
        case IN_STRING:
            /* we're in a double quoted string, so keep going until we hit a close " */
            if (c == '"') {
                /* word goes from start_of_word to p-1 */
                /* ... do something with the word ... */
                state = DULL; /* back to "not in word, not in string" state */
            }
            continue; /* either still IN_STRING or we handled the end above */
    
        case IN_WORD:
            /* we're in a word, so keep going until we get to a space */
            if (isspace(c)) {
                /* word goes from start_of_word to p-1 */
                /* ... do something with the word ... */
                state = DULL; /* back to "not in word, not in string" state */
            }
            continue; /* either still IN_WORD or we handled the end above */
        }
    }
    

    Note that this does not account for the possibility of a double quote inside a word, e.g.:

    "some text in quotes" plus four simple words p"lus something strange"
    

    Work through the state machine above and you will see that "some text in quotes" turns into a single token (that ignores the double quotes), but p"lus is also a single token (that includes the quote), something is a single token, and strange" is a token. Whether you want this, or how you want to handle it, is up to you. For more complex but thorough lexical tokenization, you may want to use a code-building tool like flex.

    Also, when the for loop exits, if state is not DULL, you need to handle the final word (I left this out of the code above) and decide what to do if state is IN_STRING (meaning there was no close-double-quote).
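
    One possible way to fill in the two gaps (the "do something with the word" step and the end-of-loop handling) is sketched below. The function name split_line, the out/max parameters, terminating each token in place with '\0', and returning -1 for an unclosed quote are all assumptions made for this example, not something the question specifies.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum states { DULL, IN_WORD, IN_STRING };

/* Hypothetical helper: tokenize `buffer` in place with the state machine
 * described above. Stores at most `max` token pointers in `out`; tokens
 * beyond `max` are silently dropped. Returns the number of tokens stored,
 * or -1 (an assumed convention) if a double-quoted string is never closed. */
int split_line(char *buffer, char **out, int max)
{
    enum states state = DULL;
    char *start_of_word = NULL;
    char *p;
    int n = 0;

    assert(buffer != NULL && out != NULL);

    for (p = buffer; *p != '\0'; p++) {
        int c = (unsigned char) *p; /* unsigned char for is* functions */
        switch (state) {
        case DULL:
            if (isspace(c))
                continue;
            if (c == '"') {
                state = IN_STRING;
                start_of_word = p + 1; /* skip the opening quote */
            } else {
                state = IN_WORD;
                start_of_word = p;
            }
            continue;

        case IN_STRING:
            if (c == '"') {
                *p = '\0'; /* terminate the token at the closing quote */
                if (n < max)
                    out[n++] = start_of_word;
                state = DULL;
            }
            continue;

        case IN_WORD:
            if (isspace(c)) {
                *p = '\0'; /* terminate the token at the space */
                if (n < max)
                    out[n++] = start_of_word;
                state = DULL;
            }
            continue;
        }
    }

    /* End of input: decide what to do with whatever state we are left in. */
    if (state == IN_STRING)
        return -1; /* no closing double quote */
    if (state == IN_WORD && n < max)
        out[n++] = start_of_word; /* final word already ends at the '\0' */
    return n;
}
```

    Because tokens are terminated by overwriting characters in the input, buffer must be writable. The pointers stored in out point into buffer, so they could be copied straight into something like entire_line as long as buffer outlives them.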
