If you ever have to work with any legacy banking or card payment systems (or anything coming from a similar era a few steps before XML), your software likely touches a few archaic file formats that have a certain flavour in common — easy for software to read but awkward for humans.
Normally, fingers crossed, everything just works and you don't have to care, but while diagnosing problems, doing initial testing or otherwise working below the level of a proper parsing system, sometimes you just can't avoid looking at the damn things…
As an example, let's imagine a pointless hypothetical format, the FLAP5 file. It looks like this:
01HI 090925840+000000001.0009092510092581084739273 FFAILMismatched routing
01HI 090925840+000000128.3309092510092581083472983 FOK
9900020003
In my opinion this is, like many programming languages, much easier to read (and edit) with a bit of colour brought in to break up the noise:
01HI 090925840+000000001.0009092510092581084739273 FFAILMismatched routing $
01HI 090925840+000000128.3309092510092581083472983 FOK $
9900020003 $
This can even help reduce the confusion when using "word wrap", helping to overcome the unfortunate placement of spaces:
01HI 090925840+000000001.0009092510092581084739273 FFAILMismatched routing $ 01HI 090925840+000000128.3309092510092581083472983 FOK $ 9900020003 $
The problem with making this kind of minor life enhancement is it rarely feels worth it, unless you're spending enough time with these files to immediately justify the effort.
The effort also tends to be greater than normal, as there seems to be relatively little overlap between this kind of horrible backend job and the kind of person who's a "toolchain enthusiast", so it seems rare for any pre-made editor plugins to exist for a given institution's quirky specs.
Therefore, what I'm trying to show here is a way to quickly and conveniently generate highlighting rules for new arbitrary file formats in this style.
The general idea:
FILE flap5 .fl5
LINE transaction 01 109
  type 5
  date 6
  currency 3
  sign 1
  amount 12
  processeddate 6
  updateddate 6
  acct 12
  flag 1
  status 4
  comment 51
LINE summary 99 19
  count 4
  status 4
  comment 9
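To make the shape of that table concrete, here is a minimal Ruby sketch of how it could be read into plain data structures. This is not the converter from the repository: the names `Field`, `LineSpec`, `FileSpec`, `parse_table` and `consistent?` are made up for illustration, and it assumes the table is laid out one entry per line (`FILE name .ext`, then `LINE name prefix total_length`, then `field width [style]`).

```ruby
# Hypothetical sketch, not the real converter: parse a table description.
# Assumed layout: one entry per line, whitespace-separated columns.
Field    = Struct.new(:name, :width, :style)
LineSpec = Struct.new(:name, :prefix, :length, :fields)
FileSpec = Struct.new(:name, :extension, :lines)

def parse_table(text)
  spec = nil
  text.each_line do |raw|
    words = raw.split
    next if words.empty? # skip blank lines

    case words[0]
    when "FILE"
      spec = FileSpec.new(words[1], words[2], [])
    when "LINE"
      raise "LINE before FILE" unless spec
      # The prefix stays a string so that "01" keeps its leading zero.
      spec.lines << LineSpec.new(words[1], words[2], Integer(words[3], 10), [])
    else
      raise "field before LINE" if spec.nil? || spec.lines.empty?
      # The third column (custom style) is optional and may be nil.
      spec.lines.last.fields << Field.new(words[0], Integer(words[1], 10), words[2])
    end
  end
  spec
end

# Sanity check: prefix length plus field widths should equal the declared length.
def consistent?(line)
  line.prefix.length + line.fields.sum(&:width) == line.length
end
```

The `consistent?` check is handy because a miscounted width silently shifts every later field: for the table above, 2 + 107 = 109 for the transaction line and 2 + 17 = 19 for the summary line.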
My version of (2) and (3) is on GitHub as an example, and may already work for you if your problems are similar enough. The converter is written in good old Ruby, and currently generates highlighting for either Sublime Text or Vim (sorry, these are the things I use). I'll explain in more detail below.
I'm not trying to produce any kind of manual about how to write a syntax spec for any editor: these manuals don't look fun to write and I doubt I would do a good job. However, this particular kind of file format is pretty consistent and relatively simple, so I'm hoping my explanation will be OK even if you've not written any syntax rules before. Let's see.
In general, syntax highlighting is about recognizing meaningful elements inside a text file, so the game is to describe exactly what a particular "thing" looks like and how the "things" can be nested (or otherwise in what context each kind of "thing" can legally show up).
For programming languages, a description of the syntax is also something that the language's implementation has to have in order to be able to parse source code. Sadly there isn't really any common framework for these things though; there are thousands of different parser frameworks and their capabilities (and situations they're convenient for) vary wildly.
The situation is unfortunately similar for editors' highlighting definitions, and not every editor is even fully capable of accurately highlighting a particular language. The complexity can be quite intense even when trying to support things like <tags></tags> (whose names have to match), or trying to model a common language like C (or even Python!) which has non-regular elements. Ironically, Ruby (which I'm using here) is probably one of the most complex programming languages to parse.
None of this really matters for this job though, as for these files we can just think about the simplest version of reality — the parser is a "state machine", which transitions through different contexts where it expects specific things to appear next. In our case, it's just expecting single things (fields of n characters) to show up one at a time in order — once inside a line that started with a particular prefix to trigger the right context for the rest of that line.
Since we're just highlighting fixed cells, we don't really care about the content but just about the positions, so the fields can simply be matched like .{6} in regex terms ("any 6 characters").
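As a quick way to see the "any n characters" idea in action outside an editor, here is a small Ruby sketch that glues one `.{n}` group per field into a single regex and uses it to cut a line into cells. The helper names `line_regex` and `split_cells` are invented for this example, and the two 4-character fields are only illustrative widths.

```ruby
# Illustrative sketch: one ".{n}" capture group per fixed-width cell.
def line_regex(prefix, widths)
  cells = widths.map { |w| "(.{#{w}})" }.join
  # The trailing (.*) picks up anything that overruns the expected length.
  Regexp.new("\\A#{Regexp.escape(prefix)}#{cells}(.*)\\z")
end

def split_cells(line, prefix, widths)
  m = line_regex(prefix, widths).match(line.chomp)
  return nil unless m # wrong prefix, or the line is too short

  { cells: m.captures[0...-1], overrun: m.captures.last }
end

# Example with a "99" prefix followed by two 4-character fields:
#   split_cells("9900020003", "99", [4, 4])
#   split_cells("9900020003XX", "99", [4, 4])
```

The first call yields the cells `"0002"` and `"0003"` with an empty overrun; the second yields the same cells with `"XX"` left over, which is the part a highlighter would want to flag.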
Let's look at an example for Sublime Text to highlight a simple file which has two kinds of line, each with a prefix and then only two fields:
contexts:
  main:
    - match: "^01"
      scope: tabular.prefix
      push: l01field1
    - match: "^99"
      scope: tabular.prefix
      push: l99field1
  l01field1:
    - match: ".{10}"
      scope: tabular.plain
      pop: true
      push: l01field2
  l01field2:
    - match: ".{12}"
      scope: tabular.plain2
      pop: true
      push: l01_should_end
  l01_should_end:
    - match: ".*"
      scope: invalid.illegal
      pop: true
  l99field1:
    - match: ".{10}"
      scope: tabular.plain
      pop: true
      push: l99field2
  l99field2:
    - match: ".{12}"
      scope: tabular.plain2
      pop: true
      push: l99_should_end
  l99_should_end:
    - match: ".*"
      scope: invalid.illegal
      pop: true
In the main context (the default context, which applies anywhere in the file outside any other match), either the 01 or the 99 prefix just after the start of a line (^) can appear. They will push their own context into the highlighter's stack (so the following text will be inside a new l01field1 context within the main context).
Once this happens, the next field can then match, which will push a new l01field2 context but will first pop the original l01field1 context (so the second field will still be within "main", but won't be within the first field).
This continues for each field in order, after which hopefully the line ends and nothing more matches. But if the line overruns, the .* in the final l01_should_end (or l99_should_end) context will collect any extra characters that do show up, tagging them as invalid.illegal, which will make them show up in bright red. (Otherwise, that rule finds nothing to mark at the end of the line, and things will look normal.)
Back in the main context, the next line prefix will now be ready to match, and the cycle repeats.
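The chain of contexts above is regular enough to generate mechanically. Below is a rough Ruby sketch of that idea. It is not the actual converter: `contexts_for` and `main_rule` are hypothetical names, it assumes at least one field per line type, and it only alternates the two plain scopes.

```ruby
# Sketch only: emit the contexts for one line type as sublime-syntax text.
# Context names follow the pattern in the example (l01field1, l01_should_end).
# Assumes widths has at least one entry.
def contexts_for(prefix, widths)
  out = []
  widths.each_with_index do |width, i|
    scope = i.even? ? "tabular.plain" : "tabular.plain2"
    last  = i + 1 == widths.size
    # Each field hands over to the next one; the last hands over to the overrun check.
    nxt = last ? "l#{prefix}_should_end" : "l#{prefix}field#{i + 2}"
    out << "  l#{prefix}field#{i + 1}:"
    out << "    - match: \".{#{width}}\""
    out << "      scope: #{scope}"
    out << "      pop: true"
    out << "      push: #{nxt}"
  end
  out << "  l#{prefix}_should_end:"
  out << "    - match: \".*\""
  out << "      scope: invalid.illegal"
  out << "      pop: true"
  out.join("\n") + "\n"
end

# The entry for this line type inside the main context.
def main_rule(prefix)
  [
    "    - match: \"^#{prefix}\"",
    "      scope: tabular.prefix",
    "      push: l#{prefix}field1"
  ].join("\n") + "\n"
end
```

Printing `"contexts:\n  main:\n"`, then `main_rule` for each prefix, then `contexts_for` for each line type reproduces the shape of the hand-written example, for instance with `contexts_for("01", [10, 12])`.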
The converter script is just autogenerating chains of rules in this style. By default it just uses four different "scopes" (tags for regions of text, to which the color scheme can apply styles):
- invalid.illegal (the bright red "invalid syntax" warning built into Sublime) for over-running lines
- tabular.prefix for the brightly highlighted prefix
- tabular.plain and tabular.plain2 for alternating grey cells, like in the example above

The reason for the custom scopes is that, other than invalid.illegal, no built-in scopes exist which guarantee background colors. I think this is essential to make the fields' positions show up clearly, so it's necessary to define something reliable for this purpose.
Additional custom styles can be defined to make specific fields stand out. The input format has an optional third column after the field name and length, which if set will replace plain or plain2 in the output scope for that field, so it's just a matter of defining extra rules in the color scheme to match.
Settings > Customize Color Scheme
This will spawn a split window with the selected color scheme on the left, and a file for extensions and customization on the right.
Into the right pane, copy and paste the basic color definition "rules" from the example sublime/user.sublime-color-scheme.
The filename that Sublime offers for this file depends on your selected color scheme, since it expects you to be mostly customizing that scheme rather than adding new definitions.
(If you want to define custom styles for certain fields, you can do this with additional rules in this file. amount as a custom style in the input table would correspond to tabular.amount as the "scope" here.)
Tools > Developer > New Syntax
This will spawn a window which has no filename by default, but will save into the correct directory for Sublime to recognize the syntax definition.
Copy and paste the generated sublime/<format>.sublime-syntax file into this window, and save with the same name and extension.
(Alternatively, just find that same directory and copy the generated file directly in.)
(You could create a link straight from this Packages/User directory to the generated output, but Sublime only detects changes to files in the folder, and doesn't follow links, so live updates won't happen if you do this.)
Similarly you'll need a file in ~/.config/nvim/syntax for each format, and also something in ~/.config/nvim/ftdetect for the autocmd to select the filetype/syntax based on the file extension. Copying from the vim-syntax and vim-ftdetect output folders is hopefully easy, but live reloading on changes doesn't seem to be a thing that can easily happen.
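For reference, the ftdetect side is tiny: a single autocmd per format. Here is a hedged Ruby sketch of producing it. The helpers `ftdetect_line` and `ftdetect_path` are hypothetical, and naming the file after the format (`flap5.vim`) is my assumption about a sensible layout rather than a description of what the converter writes.

```ruby
# Sketch: build the one-line ftdetect file for a format.
def ftdetect_line(name, extension)
  ext = extension.sub(/\A\./, "") # accept ".fl5" or "fl5"
  "autocmd BufNewFile,BufRead *.#{ext} set filetype=#{name}\n"
end

# Assumed destination: ftdetect/<name>.vim under the Neovim config directory.
# The "~" is left unexpanded here; use File.expand_path before writing.
def ftdetect_path(name, root = "~/.config/nvim")
  File.join(root, "ftdetect", "#{name}.vim")
end

# Example:
#   ftdetect_line("flap5", ".fl5")
#   ftdetect_path("flap5")
```

Vim loads syntax/<filetype>.vim once the filetype is set, so the name used in `set filetype=` needs to match the name of the syntax file.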
The basic styles (tblPrefix, tblPlain, tblPlain2 and tblWarn) have highlight declarations in each generated syntax file.
If using custom styles shared across multiple file types, it's potentially easier to define those in init.vim or somewhere shared, rather than modifying the code to dump them or trying to somehow merge the generated and manual config. It's a matter of taste I suppose.
In this case, tagging a field as amount in the input table will result in the tag tblAmount for highlighting purposes.
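Put side by side, the naming rule for the two editors might be sketched like this in Ruby. The function names are invented, and whether a custom-styled field still counts towards the plain/plain2 alternation is an assumption on my part.

```ruby
# Sketch of the naming convention: a custom style replaces plain/plain2.
# index is the zero-based position of the field within its line.
def sublime_scope(index, style = nil)
  return "tabular.#{style}" if style

  index.even? ? "tabular.plain" : "tabular.plain2"
end

def vim_group(index, style = nil)
  # String#capitalize upcases the first letter and downcases the rest.
  return "tbl#{style.capitalize}" if style

  index.even? ? "tblPlain" : "tblPlain2"
end

# Example:
#   sublime_scope(4, "amount")  # "tabular.amount"
#   vim_group(4, "amount")      # "tblAmount"
```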