On column positions in vim
This post describes some of my findings about how locations (specifically, column positions) work in vim. My interest in this originated from the work on ariadne-vim, a TAGS-like plugin for vim.
ariadne-vim works by sending source locations back and forth to ariadne, the server process that does all the intellectual work — parsing and resolving the code. Those source locations had better be computed identically on the server side and the vim side.
Byte offsets («columns»)
The main way column positions are represented in vim is by byte
offsets. That’s what functions like getpos, col, and cursor work with.
As an example, consider the following code:
data Maybe α = Just α | Nothing
To make things interesting, I used a Greek variable name and put a
tab after data. These two things, multibyte characters and tabs, are
what we will be concerned with.
So, let’s calculate the byte offset of the capital N in Nothing. The
tab is just one byte. The Greek alpha, on the other hand… may occupy
any number of bytes, depending on the encoding being used.
Assuming the UTF-8 encoding, where alpha occupies two bytes, the byte
offset of N is 27 (vim counts columns from 1). But if the file were
encoded using ISO/IEC 8859-7, where alpha is just one byte, then, as
seen by vim, the position would be… still 27, as long as vim’s internal
encoding is UTF-8. That’s because besides the file encoding (as
specified by the fileencoding option) vim also has its own internal
encoding (the encoding option), and that’s what is used to compute
those byte offsets.
The internal encoding is global for vim (unlike the file encoding, which is local to buffers) and is typically derived from the locale’s encoding.
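To make the arithmetic concrete, here is a minimal Python sketch that counts the bytes preceding N in the example line under both encodings. The helper name byte_col is made up for this illustration, and the line is written with a real tab after data.

```python
# The example line, with a real tab after "data".
LINE = "data\tMaybe α = Just α | Nothing"


def byte_col(line: str, char_index: int, encoding: str) -> int:
    """1-based byte column of line[char_index] when the text is stored in `encoding`."""
    # Count the bytes taken by everything before the character, then add 1
    # because columns are counted from 1.
    return len(line[:char_index].encode(encoding)) + 1


n_index = LINE.index("N")
print(byte_col(LINE, n_index, "utf-8"))      # 27: what vim reports with encoding=utf-8
print(byte_col(LINE, n_index, "iso8859_7"))  # 25: counting raw bytes of an ISO/IEC 8859-7 file
```

The second number is what a naive external tool would get by counting bytes in an ISO/IEC 8859-7 file, and it disagrees with what vim reports.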
Isn’t it great that byte offsets do not depend on file encodings? Not
at all. It means that you cannot simply compute offsets externally just
by counting bytes in the file. Instead, you have to decode the file
using fileencoding and then re-encode it using encoding, and of course
you need to know what those encodings are!
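The decode-then-re-encode step can be sketched in Python as follows. This is an illustration only: the function name vim_byte_col is hypothetical, the offset is assumed to fall on a character boundary, and the encoding names are Python codec names, which do not always match the names vim uses for fileencoding and encoding.

```python
def vim_byte_col(raw_line: bytes, file_byte_offset: int,
                 fileencoding: str, encoding: str) -> int:
    """Map a 0-based byte offset in a raw line on disk to vim's 1-based byte column.

    Assumes file_byte_offset falls on a character boundary.
    """
    # Decode the bytes preceding the position using the file encoding...
    prefix = raw_line[:file_byte_offset].decode(fileencoding)
    # ...and measure the same text again in vim's internal encoding.
    return len(prefix.encode(encoding)) + 1


# The example line as it would be stored in an ISO/IEC 8859-7 file.
raw = "data\tMaybe α = Just α | Nothing".encode("iso8859_7")
offset_of_n = raw.index(b"N")  # 24 bytes precede N on disk
print(vim_byte_col(raw, offset_of_n, "iso8859_7", "utf-8"))  # 27
```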
Besides, the parser used by the server process, haskell-src-exts, computes all locations as characters, not bytes. It would be nice if we didn’t have to perform tricky conversions on those locations.
Virtual columns
The virtual column of a position is where on the screen that position
actually occurs. It can be obtained using the virtcol function. It’s
much closer to the character count than the ordinary column (the byte
offset), because even if a character is multibyte, it still takes one
column on the screen. (I’m going to ignore combining characters here.)
The tabs are also interpreted differently by virtcol. They occupy a
variable number of columns, just as they do on the screen! For ariadne
it’s a good thing, actually, because the column numbers are computed by
the haskell-src-exts parser in the same way, using tab stops placed
every 8 characters.
In our example with Maybe above, the position of N is 28, and the
position of M in Maybe is 9, because it comes after the tab character
and the preceding text is shorter than 8 characters. (This is all
assuming a tab stop size of 8.)
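The same numbers can be reproduced outside vim. The following Python sketch is my own model of the tab-stop rule, not code from vim or haskell-src-exts. It assumes every non-tab character is exactly one screen cell wide (so no combining or double-width characters) and only handles positions on ordinary characters, not on the tab itself.

```python
# The example line, with a real tab after "data".
LINE = "data\tMaybe α = Just α | Nothing"


def virtual_col(line: str, char_index: int, tabstop: int = 8) -> int:
    """1-based screen column at which line[char_index] starts."""
    col = 0  # screen columns consumed by the preceding characters
    for ch in line[:char_index]:
        if ch == "\t":
            # A tab advances to the next multiple of the tab stop size.
            col += tabstop - col % tabstop
        else:
            # Assumption: every other character is one cell wide.
            col += 1
    return col + 1


print(virtual_col(LINE, LINE.index("M")))  # 9
print(virtual_col(LINE, LINE.index("N")))  # 28
```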
The only issue is that virtcol is computed based on the current value
of the ts option, which specifies the tab stop size. Generally
speaking, users may have any value of ts, while the Haskell report
specifies that the tab stop size is 8, and haskell-src-exts computes
locations based on that.
So, in ariadne-vim I temporarily set ts to 8. Initially I was concerned
that this would lead to screen flickering, because every time we change
the ts value, vim reformats the buffer accordingly. But an experiment
revealed that the redraw happens only after the full command is
completed. As long as we restore the ts value in the same command, the
user won’t notice anything. It’s a hack, but the proper alternative,
converting positions to byte counts on the server side, is very
complicated.
That’s how we query the current position. How do we jump to a
different one? Fortunately, the | motion operates with virtual columns,
so we use that. We cannot use the cursor function, which deals with
byte counts. And, of course, | is also sensitive to the value of ts,
which again has to be modified temporarily.
Credits
Thanks to Ingo Karkat for explaining the situation to me.