How to tokenize (words) classifying punctuation as space

遥遥无期 2020-12-03 19:11

Based on this question which was closed rather quickly:
Trying to create a program to read a users input then break the array into seperate words are my pointers all val

2 Answers
  • 2020-12-03 20:01

    How to tokenize a stream in C++ is already covered by a lot of questions.
    Example: How to read a file and get words in C++

    But what is harder to find is how to get the same functionality as strtok():

    Basically, strtok() lets you split a string on any set of user-defined characters, while operator >> on a C++ stream only splits on white space. Fortunately, what counts as white space is defined by the stream's locale, so we can modify the locale to treat other characters as space; this lets us tokenize the stream in a more natural fashion.
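
    For contrast, here is a minimal sketch of the default behaviour, where only white space separates words. The helper name splitOnWhitespace is hypothetical and not part of the original answer; it assumes the global locale has not been changed.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the original answer): reads tokens with
// operator >> under the stream's default locale, so only white space
// separates one word from the next.
std::vector<std::string> splitOnWhitespace(std::string const& text)
{
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string word;
    while (in >> word)
    {
        words.push_back(word);
    }
    return words;
}
```

    Called with "This, stri,plop", this returns the two tokens "This," and "stri,plop": the punctuation stays attached to the words, which is what the facet is meant to change.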

    #include <algorithm>
    #include <iterator>
    #include <locale>
    #include <string>
    #include <sstream>
    #include <iostream>
    #include <vector>
    
    // This is my facet that will treat the ,.- as space characters and thus ignore them.
    class WordSplitterFacet: public std::ctype<char>
    {
        public:
            typedef std::ctype<char>    base;
            typedef base::char_type     char_type;
    
            WordSplitterFacet(std::locale const& l)
                : base(table)
            {
                std::ctype<char> const&  defaultCType  = std::use_facet<std::ctype<char> >(l);
    
                // Copy the default value from the provided locale
                static  char data[256];
                for(int loop = 0;loop < 256;++loop) { data[loop] = static_cast<char>(loop);}
                defaultCType.is(data, data+256, table);
    
                // Modifications to default to include extra space types.
                table[',']  |= base::space;
                table['.']  |= base::space;
                table['-']  |= base::space;
            }
        private:
            base::mask  table[256];
    };
    

    We can then use this facet in a locale like this:

        std::ctype<char>*   wordSplitter(new WordSplitterFacet(std::locale()));
    
        <stream>.imbue(std::locale(std::locale(), wordSplitter));
    

    The next part of your question is how to store these words in an array. In C++ you would not; you would delegate that to std::vector and std::string. Reading your code, you can see that it does two major things in the same place:

    • It is managing memory.
    • It is tokenizing the data.

    There is a basic principle, Separation of Concerns, under which a piece of code should do only one of two things: resource management (memory management in this case) or business logic (tokenization of the data). Separating these into different parts of the code makes the code easier to write and easier to use. Fortunately, in this example all the resource management is already done by std::vector and std::string, which lets us concentrate on the business logic.

    As has been shown many times, the easy way to tokenize a stream is to use operator >> with a std::string, which breaks the stream into words. You can then use iterators to loop across the stream, tokenizing as you go.

    std::vector<std::string>  data;
    for(std::istream_iterator<std::string> loop(<stream>); loop != std::istream_iterator<std::string>(); ++loop)
    {
        // In here loop is an iterator that has tokenized the stream using
        // operator >> (which for std::string reads one space-separated word).
    
        data.push_back(*loop);
    }
    

    We can combine this with a standard algorithm to simplify the code:

    std::copy(std::istream_iterator<std::string>(<stream>), std::istream_iterator<std::string>(), std::back_inserter(data));
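
    When the vector starts out empty, the same iterator pair can also be handed straight to the vector's range constructor. A minimal sketch; the helper name readWords is hypothetical and not part of the original answer.

```cpp
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: builds the vector directly from the iterator range,
// so no explicit loop or std::copy call is needed.
std::vector<std::string> readWords(std::istream& in)
{
    return std::vector<std::string>(std::istream_iterator<std::string>(in),
                                    std::istream_iterator<std::string>());
}
```

    Any input stream can be passed in, for example a std::istringstream or std::cin; the words are split according to whatever locale that stream is imbued with.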
    

    Now combining all of the above into a single application:

    int main()
    {
        // Create the facet.
        std::ctype<char>*   wordSplitter(new WordSplitterFacet(std::locale()));
    
        // Here I am using a string stream.
        // But any stream can be used. Note you must imbue a stream before it is used.
        // Otherwise the imbue() will silently fail.
        std::stringstream   teststr;
        teststr.imbue(std::locale(std::locale(), wordSplitter));
    
        // Now that it is imbued we can use it.
        // If this was a file stream then you could open it here.
        teststr << "This, stri,plop";
    
        std::cout << "die monster !";
        std::vector<std::string>    data;
        std::copy(std::istream_iterator<std::string>(teststr), std::istream_iterator<std::string>(), std::back_inserter(data));
    
        // Copy the array to cout one word per line
        std::copy(data.begin(), data.end(), std::ostream_iterator<std::string>(std::cout, "\n"));
    }
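
    The same idea can be packaged as a reusable function that takes the extra separators as a parameter. This is a sketch of a variation, not the code above: it starts from std::ctype<char>::classic_table() rather than copying the classification from a supplied locale, and the names DelimiterFacet and splitWords are hypothetical.

```cpp
#include <cstddef>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical facet: the classic "C" classification plus a caller-supplied
// set of characters that should also count as white space.
class DelimiterFacet : public std::ctype<char>
{
public:
    explicit DelimiterFacet(std::string const& delimiters)
        : std::ctype<char>(table_)   // the base only stores the pointer here
    {
        // Start from the classic table (an assumption of this sketch; the
        // code above copies the classification from a locale instead).
        std::ctype_base::mask const* classic = std::ctype<char>::classic_table();
        for (std::size_t i = 0; i < table_size; ++i)
        {
            table_[i] = classic[i];
        }
        // Mark each requested delimiter as a space character.
        for (char c : delimiters)
        {
            table_[static_cast<unsigned char>(c)] |= std::ctype_base::space;
        }
    }

private:
    std::ctype_base::mask table_[table_size];
};

// Hypothetical helper: splits text on white space and on any character
// listed in delimiters.
std::vector<std::string> splitWords(std::string const& text,
                                    std::string const& delimiters)
{
    std::istringstream in;
    // Imbue before giving the stream any data; the locale owns the facet.
    in.imbue(std::locale(std::locale(), new DelimiterFacet(delimiters)));
    in.str(text);
    return std::vector<std::string>(std::istream_iterator<std::string>(in),
                                    std::istream_iterator<std::string>());
}
```

    For example, splitWords("This, stri,plop", ",.-") yields the three words "This", "stri" and "plop". Ordinary white space still separates words because the classic table already classifies it as space.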
    
  • 2020-12-03 20:04

    Have a look at Boost.Tokenizer for something that's much better in a C++ context than strtok().
