On column positions in vim

This post describes some of my findings about how locations (specifically, column positions) work in vim. My interest in this originated from the work on ariadne-vim, a TAGS-like plugin for vim.

ariadne-vim works by sending source locations back and forth to ariadne, the server process that does all the intellectual work — parsing and resolving the code. Those source locations had better be computed identically on the server side and the vim side.

Byte offsets («columns»)

The main way column positions are represented in vim is by byte offsets. That’s what functions like getpos, col, and cursor work with.

As an example, consider the following code:

data    Maybe α = Just α | Nothing

To make things interesting, I used a Greek variable name and put a tab after data. These two things — multibyte characters and tabs — are what we will be concerned with.

So, let’s calculate the byte offset of the capital N in Nothing. The tab is just one byte. The Greek alpha, on the other hand… may occupy any number of bytes, depending on the encoding being used.

Assuming the UTF-8 encoding, where alpha occupies two bytes, the byte offset of N is 27 (counting from 1, as col does). But if the file were encoded using ISO/IEC 8859-7, where alpha is just one byte, then, as seen by vim, the position would be… still 27, as long as vim itself works in UTF-8. That’s because besides the file encoding (as specified by the fileencoding option) vim also has its own internal encoding (the encoding option), and that’s what is used to compute those byte offsets.
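To make the arithmetic concrete, here is a small Python sketch. Python is used only for illustration and is not part of ariadne-vim; the helper name byte_column is hypothetical. It assumes that vim's byte column is the 1-based count of bytes of the line as encoded in the internal encoding, which is what the paragraph above describes.

```python
LINE = "data\tMaybe α = Just α | Nothing"


def byte_column(line: str, char_index: int, internal_encoding: str) -> int:
    """1-based byte column of line[char_index] when the line is held
    in memory in the given encoding (hypothetical helper)."""
    # Bytes occupied by everything before the character, plus one.
    return len(line[:char_index].encode(internal_encoding)) + 1


n_index = LINE.index("N")
print(byte_column(LINE, n_index, "utf-8"))       # 27: each alpha is two bytes
print(byte_column(LINE, n_index, "iso-8859-7"))  # 25: each alpha is one byte
```

The second number is what you would see only if the internal encoding itself were ISO/IEC 8859-7; the encoding of the file on disk does not enter the calculation at all.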

The internal encoding is global for vim (unlike the file encoding, which is local to buffers) and is typically derived from the locale’s encoding.

Isn’t it great that byte offsets do not depend on file encodings? Not at all. It means that you cannot simply compute offsets externally just by counting bytes in the file. Instead, you have to decode the file using fileencoding, and then re-encode it using encoding — and of course you need to know what those encodings are!
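The decode-then-re-encode step can be sketched in Python as follows. This is a minimal illustration, not code from ariadne; the function name vim_byte_column is hypothetical, and it assumes that the offset falls on a character boundary and that the Python codec names correspond to the values of vim's fileencoding and encoding options.

```python
def vim_byte_column(raw_line: bytes, file_byte_offset: int,
                    fileencoding: str, encoding: str) -> int:
    """Map a 0-based byte offset within a line as stored on disk to the
    1-based byte column vim would report (hypothetical helper)."""
    # Decode the part of the line before the position using the file encoding...
    prefix = raw_line[:file_byte_offset].decode(fileencoding)
    # ...then measure it again in vim's internal encoding.
    return len(prefix.encode(encoding)) + 1


# The example line as it would be stored in an ISO/IEC 8859-7 file.
raw = "data\tMaybe α = Just α | Nothing".encode("iso-8859-7")
offset = raw.index(b"N")  # 24 bytes precede N on disk
print(vim_byte_column(raw, offset, "iso-8859-7", "utf-8"))  # 27
```

Counting bytes in the file alone would give 25 here, which is exactly the mismatch described above.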

Besides, the parser used by the server process, haskell-src-exts, computes all locations as characters, not bytes. It would be nice if we didn’t have to perform tricky conversions on those locations.

Virtual columns

The virtual column of a position is where on the screen that position actually occurs. It can be obtained using the virtcol function. It’s much closer to the character count than the ordinary column (the byte offset), because even if a character is multibyte, it still usually takes one column on the screen. (I’m going to ignore combining and double-width characters here.)

Tabs are also interpreted differently by virtcol. They occupy a variable number of columns, just as they do on the screen! For ariadne that’s actually a good thing, because the haskell-src-exts parser computes column numbers in the same way, using tab stops placed every 8 characters.

In our example with Maybe above, the virtual column of N is 28, and the virtual column of M in Maybe is 9, because it comes after the tab character and the preceding text is shorter than 8 characters. (This is all assuming a tab stop size of 8.)
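These numbers can be reproduced with a short Python sketch of the tab-stop rule. The function virtual_column is a hypothetical name, not a vim or haskell-src-exts API, and it assumes every character other than a tab occupies exactly one screen cell.

```python
def virtual_column(line: str, char_index: int, tabstop: int = 8) -> int:
    """1-based screen column at which line[char_index] starts
    (hypothetical helper; one cell per non-tab character is assumed)."""
    width = 0  # screen cells occupied by the text before the character
    for ch in line[:char_index]:
        if ch == "\t":
            # A tab extends up to the next multiple of the tab stop size.
            width += tabstop - width % tabstop
        else:
            width += 1
    return width + 1


LINE = "data\tMaybe α = Just α | Nothing"
print(virtual_column(LINE, LINE.index("M")))             # 9
print(virtual_column(LINE, LINE.index("N")))             # 28
print(virtual_column(LINE, LINE.index("N"), tabstop=2))  # 26
```

The last call shows how the result shifts once the tab stop size is no longer 8.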

The only issue is that virtcol is computed based on the current value of the ts option, which specifies the tab stop size. Generally speaking, users may have any value of ts, while the Haskell Report specifies that the tab stop size is 8, and haskell-src-exts computes locations based on that.

So, in ariadne-vim I temporarily set ts to 8. Initially I was concerned that this would lead to screen flickering, because every time we change the ts value, vim reformats the buffer accordingly. But an experiment revealed that the redraw happens only after the full command is completed. As long as we restore the ts value in the same command, the user won’t notice anything. It’s a hack, but the proper alternative (converting positions to byte counts on the server side) is very complicated.

That’s how we query the current position. How do we jump to a different one? Fortunately, the | motion operates on virtual columns, so we use that. We cannot use the cursor function, which deals with byte counts. And, of course, | is also sensitive to the value of ts, which again has to be modified temporarily.
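For comparison, here is a rough Python sketch of what mapping a virtual column back to a byte column involves for a single line, which is the conversion that using | avoids. The names are hypothetical, and the sketch assumes one screen cell per non-tab character, a tab stop size of 8, a UTF-8 internal encoding, and that a column past the end of the line is clamped to the last character. A real implementation would also have to handle combining and double-width characters and other encodings.

```python
def char_index_at_virtual_column(line: str, vcol: int, tabstop: int = 8) -> int:
    """0-based index of the character covering 1-based screen column vcol
    (hypothetical helper; one cell per non-tab character is assumed)."""
    width = 0  # screen cells occupied so far
    for index, ch in enumerate(line):
        step = tabstop - width % tabstop if ch == "\t" else 1
        if width + step >= vcol:
            return index  # this character covers the requested column
        width += step
    # Assumption: a column past the end clamps to the last character.
    return max(len(line) - 1, 0)


def byte_column_at_virtual_column(line: str, vcol: int,
                                  encoding: str = "utf-8") -> int:
    """1-based byte column corresponding to a virtual column."""
    index = char_index_at_virtual_column(line, vcol)
    return len(line[:index].encode(encoding)) + 1


LINE = "data\tMaybe α = Just α | Nothing"
print(char_index_at_virtual_column(LINE, 28))   # 24, the index of N
print(byte_column_at_virtual_column(LINE, 28))  # 27
```

Even in this simplified form the server would need to know the text of the line and vim's internal encoding, which is part of why the ts hack is the more practical route.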

Credits

Thanks to Ingo Karkat for explaining the situation to me.