Boost.Nowide
Boost.Nowide is a library implemented by Artyom Beilis that makes cross-platform Unicode aware programming easier.
The library provides an implementation of standard C and C++ library functions such that their inputs are UTF-8 aware on Windows, without requiring the use of the Wide API.
Consider a simple application that splits a big file into chunks so that they can be sent by e-mail. It requires only a few very simple tasks:
- Access the command line arguments: int main(int argc,char **argv)
- Open files: std::fstream::open(char const *,std::ios::openmode m)
- Remove files: std::remove(char const *file)
- Print file names to the console: std::cout << file_name
Unfortunately, it is impossible to implement this simple task in plain C++ if the file names contain non-ASCII characters.
A simple program that uses this API would work on systems that use UTF-8 internally, which covers the vast majority of Unix-like operating systems: Linux, Mac OS X, Solaris, BSD. But it would fail on files like War and Peace - Война и мир - מלחמה ושלום.zip under Microsoft Windows, because the native Windows Unicode aware API is the Wide API (UTF-16).
Thus, such a trivial task is very hard to implement in a cross-platform manner.
Boost.Nowide provides a set of standard library functions that are UTF-8 aware and makes Unicode aware programming easier.
The library provides:
- A way to make the argc, argv and env parameters of main use UTF-8
- stdio.h functions: fopen, freopen, remove, rename
- stdlib.h functions: system, getenv, setenv, unsetenv, putenv
- fstream: filebuf, fstream/ofstream/ifstream
- iostream: cout, cerr, clog, cin
Why not provide both Wide and Narrow implementations, so that the developer can choose to use Wide characters on Unix-like platforms?
Several reasons:
- wchar_t is not really portable: it can be 2 bytes, 4 bytes or even 1 byte, which makes Unicode aware programming harder.
- There is no fopen(wchar_t const *,wchar_t const *) in the standard library, so it is better to stick to the standards than to re-implement the Wide API in "Microsoft Windows Style".
The library is mostly header-only; only console I/O requires separate compilation under Windows.
As a developer you are expected to use the boost::nowide functions instead of the functions available in the std namespace.
For example, a Unicode-unaware implementation of a line counter:
    #include <fstream>
    #include <iostream>

    int main(int argc,char **argv)
    {
        if(argc!=2) {
            std::cerr << "Usage: file_name" << std::endl;
            return 1;
        }

        std::ifstream f(argv[1]);
        if(!f) {
            std::cerr << "Can't open a file " << argv[1] << std::endl;
            return 1;
        }
        int total_lines = 0;
        while(f) {
            if(f.get() == '\n')
                total_lines++;
        }
        f.close();
        std::cout << "File " << argv[1] << " has " << total_lines << " lines" << std::endl;
        return 0;
    }
To make this program handle Unicode properly, we make the following changes:
    #include <boost/nowide/args.hpp>
    #include <boost/nowide/fstream.hpp>
    #include <boost/nowide/iostream.hpp>

    int main(int argc,char **argv)
    {
        boost::nowide::args a(argc,argv); // Fix arguments - make them UTF-8
        if(argc!=2) {
            boost::nowide::cerr << "Usage: file_name" << std::endl; // Unicode aware console
            return 1;
        }

        boost::nowide::ifstream f(argv[1]); // argv[1] - is UTF-8
        if(!f) {
            // the console can display UTF-8
            boost::nowide::cerr << "Can't open a file " << argv[1] << std::endl;
            return 1;
        }
        int total_lines = 0;
        while(f) {
            if(f.get() == '\n')
                total_lines++;
        }
        f.close();
        // the console can display UTF-8
        boost::nowide::cout << "File " << argv[1] << " has " << total_lines << " lines" << std::endl;
        return 0;
    }
This very simple and straightforward approach helps in writing Unicode aware programs.
Of course, this simple set of functions does not cover all needs. However, if you need to access the Wide API from a Windows application that uses UTF-8 encoding internally, you can use functions like boost::nowide::widen and boost::nowide::narrow.
For example:

    CopyFileW( boost::nowide::widen(existing_file).c_str(),
               boost::nowide::widen(new_file).c_str(),
               TRUE);
This way the conversion is done at the last stage: you continue using UTF-8 strings everywhere, and you switch to the Wide API only at the glue points.
boost::nowide::widen returns std::wstring. Sometimes it is convenient to avoid the allocation and use on-stack buffers if possible. For this, Boost.Nowide provides the boost::nowide::basic_stackstring class.
The example above can then be rewritten as:
    boost::nowide::basic_stackstring<wchar_t,char,64> wexisting_file,wnew_file;
    if(!wexisting_file.convert(existing_file) || !wnew_file.convert(new_file)) {
        // invalid UTF-8
        return -1;
    }
    CopyFileW(wexisting_file.c_str(),wnew_file.c_str(),TRUE);
The types stackstring, wstackstring, short_stackstring and wshort_stackstring use buffers of 256 or 16 characters; if the string is longer, they fall back to memory allocation.
The library does not include windows.h, in order to prevent namespace pollution with numerous defines and types. Instead, the library declares prototypes of the Win32 API functions it needs.
However, you may request the use of the original windows.h header by defining BOOST_NOWIDE_USE_WINDOWS_H before including any of the Boost.Nowide headers.
The library provides UTF-8 aware versions of functions that usually live in the std:: namespace, placing them in the boost::nowide namespace for Microsoft Windows; for example, std::fopen becomes boost::nowide::fopen.
On POSIX platforms, boost::nowide::fopen and all the other functions are aliases of the standard library functions:
    namespace boost {
    namespace nowide {
    #ifdef BOOST_WINDOWS
    inline FILE *fopen(char const *name,char const *mode)
    {
        ...
    }
    #else
    using std::fopen;
    #endif
    } // nowide
    } // boost
Console I/O is implemented as a wrapper over ReadConsoleW/WriteConsoleW; if the stream is not a tty (for example, a pipe), ReadFile/WriteFile is used instead.
This approach eliminates the need for manual code page handling. If TrueType fonts are used, Unicode aware input and output work.
Q: Why does the library not convert strings from the locale's encoding to UTF-8 and vice versa on POSIX systems?
A: It is inherently incorrect to convert strings to/from locale encodings on POSIX platforms.
You can create a file named "\xFF\xFF.txt" (invalid UTF-8), remove it, or pass its name as a parameter to a program, and it would work whether the current locale is a UTF-8 locale or not. Also, changing the locale from, let's say, en_US.UTF-8 to en_US.ISO-8859-1 would not magically change all the file names in the OS or the strings a user may pass to the program (which is different on Windows).
POSIX OSs treat strings as NUL-terminated cookies.
So altering their content according to the locale would actually lead to incorrect behavior.
For example, this is a naive implementation of the standard program "rm":

    #include <cstdio>

    int main(int argc,char **argv)
    {
        for(int i=1;i<argc;i++)
            std::remove(argv[i]);
        return 0;
    }
It would work with ANY locale and changing the strings would lead to incorrect behavior.
The meaning of a locale under POSIX and Windows platforms is different and has very different effects.