String Theory

Published Monday April 28th, 2008 | by

You know that every toolkit and language that is more recent than C has its own string class. Some say it’s a feature of the C language not to have a string built-in, keeping it lean and simple. Others say it’s a drawback, thus making people have to roll out their own and make beginners always fall to the trap of comparing two strings for equality with == instead of strcmp.

C++ inherited that and its initial versions did not have a string class. It wasn’t until the Standard Template Library was standardised that the language got std::string. But that was too late for Qt: Qt 1.x had long been out and it included its own string class (QString). And for a long while, Qt could not rely on there being a sane implementation of STL in the user’s compilers.

Today that problem is solved, but QString is still there. For several reasons, including:

  • QString is implicitly-shared
  • QString is a Unicode string

The first of those innovations was already present in Qt 1.x, but the second is the one that interests us now. For Qt 2.0, QString was transformed from a simple array of bytes into a proper Unicode string. The old Qt 1 string class became the QCString class from Qt 2 and 3 fame and survives today in the form of Q3CString in Qt4, but implemented using the new QByteArray. (The old Qt 1 to 3 QByteArray class is still available as Q3MemArray today, using the almost identical class that used to be Qt 1.x’s QGArray).

As soon as you have Unicode strings in an 8-bit source code, a question always comes about: what encoding is your file? Even today, with the widespread use of UTF-8, we can’t rely on that fact (text editors in Windows being the worst example). And since Qt must convert from a given 8-bit encoding into the UTF-16 format used by QString, it has to know what you used. That’s why you can find the function QTextCodec::setCodecForCStrings() in the Qt API.

That’s fine for application developers, but how about library developers? If you want your library to be used by application developers who use other encodings in their applications, how do you solve the problem? One solution is to restrict yourself to ASCII codepoints only (that is, from 0 to 127), hoping that will be the correct thing. In most cases, that will work out just fine, but there are some weird encodings out there that are not fully compatible with ASCII.

(For example, did you know that the Shift-JIS encoding for Japanese does not have a backslash? Most Japanese users of Windows think that “C:¥Windows¥System32″ is what everyone sees)

There’s a drawback though: if the application developer set the codec to be something complex, all of your simple ASCII strings will still be parsed using that codec. You’ll then suggest using QString::fromLatin1, which appeared in Qt 2. That function converts a Latin 1-encoded (ISO-8859-1) string into UTF-16. That is, actually, a very simple operation because the ISO-8859-1 encoding matches exactly the first 256 codepoints in Unicode. Therefore, that’s a fast operation and you won’t have a performance penalty due to slow codecs.

The use of QString::fromLatin1 is widespread in Qt 2 and 3 applications and libraries, as well as Qt itself. In fact, you’re likely to find code like the following in many source files:

#define QFL1(x)     QString::fromLatin1(x)

That doesn’t eliminate all performance penalty, though. It creates a QString in any case and that could be a waste of processor time for some simple operations, especially on temporaries. For example, a comparison like:

    if (text == QString::fromLatin1("Qt"))
        doSomething()

will trigger the construction of a QString temporary and its internal data structures, will allocate memory on the heap to accommodate the UTF-16 version of “Qt”, execute the comparison and then delete all of that. Seems like a waste, right?

Well, we thought so too. For Qt 4.0, we introduced a very small new class called QLatin1String. I can summarise it like this:

class QLatin1String
{
public:
    inline explicit QLatin1String(const char *s) : chars(s) {}
    inline QLatin1String &operator=(const QLatin1String &other)
    { chars = other.chars; return *this; }

    // add here inlined operator==, operator!=, operator< , operator>, etc.
    // against QString and against const char*
private:
    const char *chars;
};

You can look up the complete version in Qt’s source code of today. But you can readily see that it’s small and almost trivial. This allows us to rewrite the comparison above as:

    if (text == QLatin1String("Qt"))
        doSomething()

And given a suitable operator== function that compares a QString against a QLatin1String without having to first convert any of them, we’re set. This is the best we can squeeze out of it.

Or is it?

Well, unfortunately, there are two problems with current Qt 4.4 code: first, there are many overloads missing in QString that could take a QLatin1String. Take, for example, QString::replace. It has overloads taking a QChar, QString and QRegExp. What that means is that a QString temporary is created if you were to call that function with a QLatin1String argument. We’re addressing this issue in Qt 4.5, along with a series of other optimisations. (the replace function itself will get an improvement)

The other issue is the signature of those functions. Let me take the simplest example from qstring.h, the QString constructor:

    inline QString(const QLatin1String &latin1);

As you can see, the function takes a constant reference to a QLatin1String object. And we saw in the definition above that QLatin1String is nothing more than a wrapper around a const char * pointer. That unfortunately means we’re telling the compiler to pass a reference to a pointer, or a double indirection to the actual data. You’d never write a function whose argument is of type const char * const &, would you?

In other words, in order to complete that call, the compiler must compute the address of the data, save it somewhere on a scratch area and then pass the address of that scratch area as argument to the callee. An analysis of the assembly code reveals our mistake:

C++ code x86 assembly x86-64 assembly IA-64 assembly
extern "C" function(const char*);
function("foo");
Non-PIC
        movl    $.LC0, (%esp)
        call    _function
        movl    $.LC0, %edi
        call    _function
Doesn’t exist
PIC code
        leal    .LC0@GOTOFF(%ebx), %eax
        movl    %eax, (%esp)
        call    _function
        leaq    .LC0(%rip), %rdi
        call    _function
        addl    out0 = @gprel(.LC0), r1;;
        br.call rp = _function
extern "C" function(const QLatin1String &);
function(QLatin1String("foo"));
Non-PIC code
        leal    -4(%ebp), %eax
        movl    $.LC0, -4(%ebp)
        movl    %eax, (%esp)
        call    _function
        movq    %rsp, %rdi
        movq    $.LC0, (%rsp)
        call    _function
Doesn’t exist
PIC code
        leal    .LC0@GOTOFF(%ebx), %eax
        movl    %eax, -8(%ebp)
        leal    -8(%ebp), %eax
        movl    %eax, (%esp)
        call    _function
        leaq    .LC0(%rip), %rax
        movq    %rsp, %rdi
        movq    %rax, (%rsp)
        call    _function
        addl r14 = @gprel(.LC0), r1
        adds out0 = 16, r12;;
        st8 [out0] = r14
        br.call rp = _function
    

The assembly generated for a calling with a character constant almost doesn’t involve memory operations: on the x86, function arguments are always passed on the stack, so there’s no avoiding that. However, on both the x86-64 and the IA-64, where the arguments are passed on registers, there are no memory operations at all. When the code gets assembled, all that will remain is an addition. However, the call with a contant reference always causes memory operations, in all architectures.

So what happens if we passed the QLatin1String object by value?

C++ code x86 assembly x86-64 assembly IA-64 assembly
extern "C" function(QLatin1String);
function(QLatin1String("foo"));
Non-PIC
        movl    $.LC0, (%esp)
        call    _function
        movl    $.LC0, %edi
        call    _function
Doesn’t exist
PIC code
        leal    .LC0@GOTOFF(%ebx), %eax
        movl    %eax, (%esp)
        call    _function
        leaq    .LC0(%rip), %rdi
        call    _function
        addl    out0 = @gprel(.LC0), r1;;
        br.call rp = _function

As you can see from the assembly output above, it’s exactly the same as the const char * version! So what should I do now? I should replace all of the arguments that take a const QLatin1String & in Qt’s API with a simple QLatin1String argument. And I have to do that now without breaking source or binary compatibility.

Some assembly explanation: PIC means “Position Independent Code”, meaning the code in question can be loaded at any address in memory. That’s the default and only mode for IA-64, whereas on the x86 and x86-64 platforms, it’s usually used in libraries in ELF-based systems. The .LC0 symbol is where the compiler put my "foo" NUL-terminated string. On the IA-64, the r1 register is called “global pointer” (sometimes seen as “gp” in the source code) and points to the center of the module’s data, whereas the r12 register is the “stack pointer” (sometimes seen as “sp”). On the x86, the compiler uses any available register to store the address to the GOT and, in this case, it chose the %ebx register; on the x86-64, it is not necessary since it used addressing relative to the instruction pointer. Those symbols starting with @ are linker flags: they tell the linker to generate some relocation information at that point in the code (GOTOFF is the offset to the GOT [Global Offset Table], @gprel is "relative to the gp", which is the exact same thing under a different name, so the operation above was: "out0 = (.LCO - gp) + gp").

The code above was generated with gcc -O2 -S, but I removed the instructions that weren’t relevant to the calling sequence.

Did you like this? Share it:
Bookmark and Share

Posted in Qt

15 comments to String Theory

Giovanni Bajo says:

Why don’t we have a QUtf8String() as well? Most of the optimizations would apply for comparing QString and UTF-8 as well. Suggesting Latin1 for programmer’s string is a little “European” ;)

Thiago Macieira says:

There are two reasons for restricting it to Latin 1:
1) it’s the fastest conversion from any encoding to UTF-16 and cannot fail, whereas UTF-8 can have invalid sequences and requires a bit more of code
2) it was meant for ASCII only, since that’s the language that most source code and even translatable strings are written in, avoiding any encoding issues, but the “fromAscii” and “toAscii” operations are already taken and don’t actually mean what they say (misnomer).

Besides, some old compilers cannot cope with the high-byte characters, sometimes even on comments.

In any case, as soon as C++0x is standardised with the new string literals (see http://en.wikipedia.org/wiki/C++0x#New_string_literals), we’ll probably add overloads for char16_t and char32_t.

Enrico Ros says:

Wow, this post is really useful to me.
Just for cuoriosity’s: what would have happened if the to the pass-by-value case if compiler optimization was turned off?

Enrico Ros says:

Sorry for the poor wording :-) , I rewrite:
Just for cuoriosity’s sake: what would have happened to the pass-by-value case if the compiler optimization was turned off?

Al_Ku says:

What I never quite understood is why can’t we just use

if(text == “Qt”) {…}

instead of

if(text == QLatin1String(“Qt”)) {…}

? QString::operator==() does take a (const char *) so I always thought this won’t internally create a QString from the char-array for comparing it. So I always thought that you guys use QLatin1String() throughout Qt to ensure that if QT_NO_CAST_FROM_ASCII is set that it will still work or something.

Thiago Macieira says:

@Enrico: if you turn optimisations off, the calling sequence for the character constant remains unchanged. However, both uses of QLatin1String will require a call to the non-inlined function QLatin1String::QLatin1String(const char*). That kills any performance optimisations you may introduce: to call an out-of-line member function, the object must have an address in memory.

@Al_Ku: you’re almost right. But you forgot the case where the application developer set the codec for C strings. In that case, we’re forced to convert to QString first, using the encoding that the developer chose. And that codec can be slow, relatively speaking. By using QLatin1String, you’re telling QString: this is Latin 1, not what the developer chose for his application.

Fabrizio says:

Thanks for this post Thiago – it helped me remember about the QLatinString issue once again ;-)
I’m really wondering if it will be possible to apply this “hack” without breaking binary compatibility though. But I hope so.
Let me say again that I love the Troll’s Labs blog so thanks for all the time you all spend blogging ;-)

Andreas Nilsson says:

Very interesting post. I’m not a very accomplished c/c++/qt programmer but i would be a bit careful with using assembly code to verify a ‘search replace’.

The assembly code shown above obviously shows that for this setup everything is fine, but what would happen with other opimizations or another compiler?

I may be wrong, but i would like a more thorough eplanation/proof to believe this.

David Johnson says:

“beginners always fall to the trap of comparing two strings for equality with == instead of strcmp.”

I’m sure you meant “strncmp”. Of course you did!

John K says:

Hm, I’d have to disagree with the idea that Qt 4.0 QStrings are ‘proper unicode’ strings. Essentially, they are UCS-2, an obsolete 16-bit encoding, and are therefore limited to supporting only codepoints from the Basic Multilingual Plane. This leaves many CJK, mathematical, and academic symbols in the dark.

John K says:

Sorry, disregard previous post. I think the documentation for QString has changed since I read it last.

John K says:

Sorry, disregard my previous post. I guess the documentation for QString has been improving since I last read it…

Thiago Macieira says:

@Andreas: QLatin1String is such a trivial class (the exact size of a pointer, which generally means the size of a general-purpose register) that there isn’t much optimisation to be done. There are very few ways to optimise the calling sequence. A non-optimised way would be to save the character constant address to a scratch area, then load copy it to the parameter passing area (the stack on x86 or a register on x86-64 and IA-64). That’s exactly what gcc does on -O0 or with -O2 -fno-inline.

This optimisation (passing by value instead of const-ref) applies to any type that:
1) is relatively small (to be on the safe side, I’d say 16 bytes, but different ABI apply to different platforms)
2) use the default copy constructor (that is, not defined or using the C++0x “= default” construct) in *all* of its types and bases. In other words, the copying does not involve calling any functions.

QLatin1String and QLatin1Char match both constraints.

Thiago Macieira says:

@John K: You’re right that, in their inception, QStrings were encoded in UCS-2. One of the reasons for that was that the Unicode representation on Windows systems is done using 16-bit integers. Also, at the time, Unicode was restricted to 16 bits anyways, so there was no reason to use 2 extra bytes.

However, that is also past. Since Qt 4.0, and with improvements done in other minor editions, QString no longer uses UCS-2 as its internal encoding. It actually uses UTF-16, which supports the full Unicode range (from U+0000 to U+10FFFF). The higher planes are achieved by way of UTF-16 surrogate pairs, found in the Unicode range U+D800 to U+DFFF. There’s also some UCS-4/UTF-32 API present in QString and QChar:
- http://doc.trolltech.com/4.4rc1/qstring.html#toUcs4
- http://doc.trolltech.com/4.4rc1/qstring.html#fromUcs4
- all of QChar’s static public members that return properties of Unicode codepoints

Unfortunately, QChar itself is a single UTF-16 word, which means it cannot represent a higher-plane codepoint. So you can’t use the non-static member functions when it contains a surrogate.

For instance, the following code is wrong:

  QChar *p = str.data();
  for ( ; !p->isNull(); ++p)
    *p = p->toUpper();

However, QString implements the correct process by converting surrogate pair to its UCS-4 form internally before applying any transformation. (The code is also wrong because the uppercase of a given letter may be two or more letters; QString does this correctly)

I am unsure about the state of text painting with higher planes in Qt. And I know that the support in QUrl (the ACE transformation and special case-folding) is missing. If you find anything lacking (and I’m sure you will), please let us know.

Andreas says:

Wow, Thiago. Cool blog.

Commenting closed.