Wednesday, October 24, 2018

Splitting a string in C++


Splitting a string in C++

June 11, 2013; Updated November 25, 2018

Introduction

A common task in programming is to split a delimited string into an array of tokens. For example, it may be necessary to split a string containing spaces into an array of words. This is one area where programming languages like Java and Python surpass C++ since both of these programming languages include support for this in their standard libraries while C++ does not. However, this task can be accomplished in C++ in various ways.

In Java this task could be accomplished using the String.split method as follows.




In Python this task could be accomplished using the str.split method as follows.




In C++ this task is not quite so simple. It can still be accomplished in a variety of different ways.

Using the C runtime library

One option is to use the C runtime library. The following C runtime library functions can be used to split a string.
Unfortunately, there are various differences between platforms. This necessitates the use of C preprocessor directives to determine what is appropriate for the current platform.

The following code demonstrates the technique.


char* FindToken(
    char* str, const char* delim, char** saveptr)
{
#if (_SVID_SOURCE || _BSD_SOURCE || _POSIX_C_SOURCE >= 1 \
  || _XOPEN_SOURCE || _POSIX_SOURCE)
    return ::strtok_r(str, delim, saveptr);
#elif defined(_MSC_VER) && (_MSC_VER >= 1800)
    return strtok_s(token, delim, saveptr);
#else
    return std::strtok(token, delim);
#endif
}

wchar_t* FindToken(
    wchar_t* token, const wchar_t* delim, wchar_t** saveptr)
{
#if ( (defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L)) \
  || (defined(__cplusplus) && (__cplusplus >= 201103L)) )
    return std::wcstok(token, delim, saveptr);
#elif defined(_MSC_VER) && (_MSC_VER >= 1800)
    return wcstok_s(token, delim, saveptr);
#else
    return std::wcstok(token, delim);
#endif
}

char* CopyString(char* destination, const char* source)
{
    return std::strcpy(destination, source);
}

wchar_t* CopyString(wchar_t* destination, const wchar_t* source)
{
    return std::wcscpy(destination, source);
}

template <class charType>
size_t splitWithFindToken(
    const std::basic_string<charType>& str,
    const std::basic_string<charType>& delim,
    std::vector< std::basic_string<charType> >& tokens)
{
    std::unique_ptr<charType[]> ptr = std::make_unique<charType[]>(str.length() + 1);
    memset(ptr.get(), 0, (str.length() + 1) * sizeof(charType));
    CopyString(ptr.get(), str.c_str());
    charType* saveptr;
    charType* token = FindToken(ptr.get(), delim.c_str(), &saveptr);
    while (token != nullptr)
    {
        tokens.push_back(token);
        token = FindToken(nullptr, delim.c_str(), &saveptr);
    }
    return tokens.size();
}

Using the basic_istringstream class

Solution 1: std::istream_iterator

Perhaps the simplest method of accomplishing this task is using the basic_istringstream class as follows.




The splitWithStringStream function can be used as follows.




The splitWithStringStream function has the advantage of using nothing beyond functions that are part of the C++ standard library. To use it you just need to include the following C++ standard library headers: algorithm, iterator, sstream, string, and vector.

An alternate version of the function is as follows.




The splitWithStringStream and the splitWithStringStream1 functions do have two drawbacks. First, the functions are potentially inefficient and slow since the entire string is copied into the stream, which takes up just as much memory as the string. Second, they only support space delimited strings.



Solution 2: std::getline


The following function makes it possible to use a character other than space as the delimiter.




This function can be used as follows.




Note that this solution does not skip empty tokens, so the above example will result in tokens containing four items, one of which would be an empty string.

This function, like the splitWithStringStream and splitWithStringStream1 functions, is potentially inefficient and slow. It does allow the delimiter character to be specified. However, it only supports a single delimiting character. This function does not support strings in which several delimiting characters may be used.

To use the splitWithGetLine function you just need to include the following C++ standard library headers: algorithm, iterator, sstream, string, and vector.

Using only members of the basic_string class

Solution 1

It is possible to accomplish this task using only member functions of the basic_string class. The following function allows you to specify the delimiting character and uses only the find_first_not_of, find, and substr members of the basic_string class. The function also has optional parameters that allow you to specify that empty tokens should be ignored and to specify a maximum number of segments that the string should be split into.




This function does not suffer from the same potential performance issues as the stream based functions and it allows you to specify the delimiting character. However, it only supports a single delimiting character. This function does not support strings in which several delimiting characters may be used.

To use the splitWithBasicString function you just need to include the following C++ standard library headers: string and vector.

Solution 2

Sometimes the string that is to split uses several different delimiting characters. At other times it may simply be impossible to know for certain in advance what delimiting characters are used. In these cases you may know that the delimiting character could be one of several possibilities. In this case it is necessary for the function to be able to accept a string containing each possible delimiting character. This too can be accomplished using only member functions of the basic_string class.

The following function allows you to specify the delimiting character and uses only the find_first_not_of, find_first_of, and substr members of the basic_string class. The function also has optional parameters that allow you to specify that empty tokens should be ignored and to specify a maximum number of segments that the string should be split into.



Using Boost

Boost is a collection of peer-reviewed, cross-platform, open source C++ libraries that are designed to complement and extend the C++ standard library. Boost provides at least two methods for splitting a string.

Solution 1

One option is to use the boost::algorithm::split function in the Boost String Algorithms Library.

In order to use the split function simply include boost/algorithm/string.hpp and then call the function as follows.



Solution 2

Another option is to use the Boost Tokenizer Library. In order to use the Boost Tokenizer Library simply include boost/tokenizer.hpp. Then you can use the Boost Tokenizer as follows.



Using the C++ String Toolkit Library

Another option is to use the C++ String Toolkit Library. The following example shows how the strtk::parse function can be used to split a string.



Other Options

Of course there are many other options. Feel free to refer to the web pages listed in the references section below for many other options.

Summary

In this article I discussed several options for splitting strings in C++.
The code for the basic_istringstream class, the basic_string class, and Boost along with a complete sample demonstrating the use of the the functions can be found on Ideone.

References

No comments:

Post a Comment