Quick Fixed-width Syntax Highlighting for Tabular Padded "Flat File" Formats

If you ever have to work with any legacy banking or card payment systems (or anything coming from a similar era a few steps before XML), your software likely touches a few archaic file formats that have a certain flavour in common — easy for software to read but awkward for humans.

Normally, fingers crossed, everything just works and you don't have to care, but while diagnosing problems, doing initial testing or otherwise working below the level of a proper parsing system, sometimes you just can't avoid looking at the damn things…

As an example, let's imagine a pointless hypothetical format, the FLAP5 file. It looks like this:

01HI   090925840+000000001.0009092510092581084739273 FFAILMismatched routing
01HI   090925840+000000128.3309092510092581083472983 FOK
9900020003

There are multiple types of record
Each record (or row) is represented by one line
The type of each line is indicated by some prefix
Lines are of a fixed width and composed of fixed width fields
Currencies are in annoying numeric codes
Amounts are in some absurd fixed decimal layout
It uses some randomly chosen date format that you have to look up whenever all the days are less than 13
Data columns are crammed together without any gaps
Variable length strings are padded with spaces
The last field on each line is a long, vaguely defined comment field

In my opinion this is, like many programming languages, much easier to read (and edit) with a bit of colour brought in to break up the noise:

01HI   090925840+000000001.0009092510092581084739273 FFAILMismatched routing                                 $
01HI   090925840+000000128.3309092510092581083472983 FOK                                                     $
9900020003         $

This can even help reduce the confusion when using "word wrap", helping to overcome the unfortunate placement of spaces:

01HI   090925840+000000001.0009092510092581084739273 FFAILMismatched routing                                 $
01HI   090925840+000000128.3309092510092581083472983 FOK                                                     $
9900020003         $

The problem with making this kind of minor life enhancement is it rarely feels worth it, unless you're spending enough time with these files to immediately justify the effort.

The effort also tends to be greater than normal, as there seems to be relatively little overlap between this kind of horrible backend job and the kind of person who's a "toolchain enthusiast", so it seems rare for any pre-made editor plugins to exist for a given institution's quirky specs.

Therefore, what I'm trying to show here is a way to quickly and conveniently generate highlighting rules for new arbitrary file formats in this style.

The general idea:

1. Copy the field names and widths from the table that the files' awful documentation PDF hopefully contains.
2. Wrangle them into some kind of minimal/consistent format:

FILE flap5 .fl5
LINE transaction 01 109
type 5
date 6
currency 3
sign 1
amount 12
processeddate 6
updateddate 6
acct 12
flag 1
status 4
comment 51

LINE summary 99 19
count 4
status 4
comment 9

3. Make a quick script to expand these field definitions into usable syntax highlighting rules.
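To make the input format a bit more concrete, here is a minimal Ruby sketch of reading it into a data structure. This is not the converter from the repository; the names parse_spec, check_widths, FileDef, LineDef and Field are made up for illustration. It assumes that the number after each LINE prefix is the total line width including the prefix, which is what the FLAP5 numbers above add up to.

```ruby
# Hypothetical data model for the minimal spec format (names are assumptions).
Field   = Struct.new(:name, :width, :style)
LineDef = Struct.new(:name, :prefix, :width, :fields)
FileDef = Struct.new(:name, :extension, :lines)

# Parse "FILE", "LINE" and field rows into a FileDef.
def parse_spec(text)
  file = nil
  text.each_line do |raw|
    parts = raw.split
    next if parts.empty? # blank lines only separate LINE blocks
    case parts[0]
    when "FILE"
      file = FileDef.new(parts[1], parts[2], [])
    when "LINE"
      raise "LINE before FILE" unless file
      file.lines << LineDef.new(parts[1], parts[2], Integer(parts[3], 10), [])
    else
      raise "field before LINE" unless file && file.lines.any?
      # parts[2] is the optional style column; nil when absent
      file.lines.last.fields << Field.new(parts[0], Integer(parts[1], 10), parts[2])
    end
  end
  file
end

# Assumption: the declared line width counts the prefix plus every field.
def check_widths(file)
  file.lines.each do |line|
    total = line.prefix.length + line.fields.sum(&:width)
    next if total == line.width
    raise "#{line.name}: fields add up to #{total}, expected #{line.width}"
  end
end

FLAP5_SPEC = <<~TEXT
  FILE flap5 .fl5
  LINE transaction 01 109
  type 5
  date 6
  currency 3
  sign 1
  amount 12
  processeddate 6
  updateddate 6
  acct 12
  flag 1
  status 4
  comment 51

  LINE summary 99 19
  count 4
  status 4
  comment 9
TEXT

flap5 = parse_spec(FLAP5_SPEC)
check_widths(flap5) # raises if a width was mistyped while copying from the PDF
```

A width check like this is cheap and tends to catch copy-paste slips from the documentation table before they turn into confusingly shifted colours.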

My version of steps (2) and (3) is on GitHub as an example, and may already work for you if your problems are similar enough. The converter is written in good old Ruby, and currently generates highlighting for either Sublime Text or Vim (sorry, these are the things I use). I'll explain in more detail below.

I'm not trying to produce a manual on how to write a syntax spec for any particular editor: these manuals don't look fun to write and I doubt I would do a good job. However, this particular kind of file format is pretty consistent and relatively simple, so I'm hoping my explanation will be enough even if you've not written any syntax rules before. Let's see.

How the highlighting rules work

In general, syntax highlighting is about recognizing meaningful elements inside a text file, so the game is to describe exactly what a particular "thing" looks like and how the "things" can be nested (or otherwise in what context each kind of "thing" can legally show up).

For programming languages, a description of the syntax is also something that the language's implementation has to have in order to be able to parse source code. Sadly there isn't really any common framework for these things though; there are thousands of different parser frameworks and their capabilities (and situations they're convenient for) vary wildly.

The situation with these parsers is unfortunately similar for editors' highlighting definitions, and not every editor is even fully capable of accurately highlighting a particular language. The complexity can be quite intense even when trying to support things like <tags></tags> (whose names have to match), or trying to model a common language like C (or even Python!) which has non-regular elements. Ironically, Ruby (which I'm using here) is probably one of the most complex programming languages to parse.

None of this really matters for this job though, as for these files we can just think about the simplest version of reality — the parser is a "state machine", which transitions through different contexts where it expects specific things to appear next. In our case, it's just expecting single things (fields of n characters) to show up one at a time in order — once inside a line that started with a particular prefix to trigger the right context for the rest of that line.

Since we're just highlighting fixed cells, we don't really care about the content but just about the positions, so the fields can simply be matched like .{6} in regex terms ("any 6 characters").
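To show the "positions, not content" idea outside of any editor, here is a small Ruby sketch that builds one regex from the FLAP5 transaction widths and uses it to cut up the first sample line. The helper names line_regex and split_line are hypothetical, and the sample is padded back out to 109 characters on the assumption that its trailing spaces were trimmed when it was pasted.

```ruby
# Field names and widths for the hypothetical FLAP5 "01" record (prefix excluded).
FIELDS = [
  ["type", 5], ["date", 6], ["currency", 3], ["sign", 1], ["amount", 12],
  ["processeddate", 6], ["updateddate", 6], ["acct", 12], ["flag", 1],
  ["status", 4], ["comment", 51]
].freeze

# Build a regex of named ".{n}" groups, one per field, anchored to the whole line.
def line_regex(prefix, fields)
  cells = fields.map { |name, width| "(?<#{name}>.{#{width}})" }.join
  Regexp.new("\\A#{Regexp.escape(prefix)}#{cells}\\z")
end

# Returns a hash of field name => raw cell text, or nil if the line doesn't fit.
def split_line(regex, line)
  match = regex.match(line)
  match && match.named_captures
end

# Assumption: the real line is space-padded to the full 109 characters.
SAMPLE = "01HI   090925840+000000001.0009092510092581084739273 FFAILMismatched routing".ljust(109)

cells = split_line(line_regex("01", FIELDS), SAMPLE)
cells.each { |name, text| puts format("%-14s|%s|", name, text) }
```

The highlighter does essentially the same thing, except that it tags each cell with a scope instead of capturing it.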

Let's look at an example for Sublime Text to highlight a simple file which has two kinds of line, each with a prefix and then only two fields:

contexts:
  main:
  - match: "^01"
    scope: tabular.prefix
    push: l01field1
  - match: "^99"
    scope: tabular.prefix
    push: l99field1
  l01field1:
  - match: ".{10}"
    scope: tabular.plain
    pop: true
    push: l01field2
  l01field2:
  - match: ".{12}"
    scope: tabular.plain2
    pop: true
    push: l01_should_end
  l01_should_end:
  - match: ".*"
    scope: invalid.illegal
    pop: true
  l99field1:
  - match: ".{10}"
    scope: tabular.plain
    pop: true
    push: l99field2
  l99field2:
  - match: ".{12}"
    scope: tabular.plain2
    pop: true
    push: l99_should_end
  l99_should_end:
  - match: ".*"
    scope: invalid.illegal
    pop: true

In the main context (the default context, which applies anywhere in the file outside any other match), either the 01 or the 99 prefix just after the start of a line (^) can appear. They will push their own context into the highlighter's stack (so the following text will be inside a new l01field1 context within the main context).

Once this happens, the next field can then match, which will push a new l01field2 context but will first pop the original l01field1 context (so the second field will still be within "main", but won't be within the first field).

This continues for each field in order, after which hopefully the line ends and nothing more matches. But if the line overruns, the .* in l01_should_end (or l99_should_end) will collect any extra characters that do show up, tagging them as invalid.illegal which will make them show up in bright red. (Otherwise, that rule will just capture the line break and things will look normal).

Back in the main context, the next line prefix will now be ready to match, and the cycle repeats.
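As a rough illustration of how mechanical this is, the following Ruby sketch prints the same chain of contexts as the example above from nothing more than a list of prefixes and field widths. It is a simplified stand-in rather than the actual converter: the function name sublime_contexts is made up, it omits the header a full .sublime-syntax file needs, and it assumes the prefixes contain no regex special characters.

```ruby
# Emit a Sublime "contexts:" section for lines given as [prefix, [width, ...]].
def sublime_contexts(lines)
  out = ["contexts:", "  main:"]

  # One rule per prefix in the main context, each entering its own field chain.
  lines.each do |prefix, _widths|
    out << "  - match: \"^#{prefix}\""
    out << "    scope: tabular.prefix"
    out << "    push: l#{prefix}field1"
  end

  lines.each do |prefix, widths|
    widths.each_with_index do |width, i|
      # The last field hands over to the overrun catcher instead of another field.
      following = i + 1 < widths.length ? "l#{prefix}field#{i + 2}" : "l#{prefix}_should_end"
      out << "  l#{prefix}field#{i + 1}:"
      out << "  - match: \".{#{width}}\""
      out << "    scope: #{i.even? ? 'tabular.plain' : 'tabular.plain2'}"
      out << "    pop: true"
      out << "    push: #{following}"
    end
    out << "  l#{prefix}_should_end:"
    out << "  - match: \".*\""
    out << "    scope: invalid.illegal"
    out << "    pop: true"
  end

  out.join("\n") + "\n"
end

puts sublime_contexts([["01", [10, 12]], ["99", [10, 12]]])
```

Building the text by hand like this keeps the output looking exactly like the hand-written example; a YAML library would also work but tends to quote and order things differently.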

The converter script is just autogenerating chains of rules in this style. By default it just uses four different "scopes" (tags for regions of text, to which the color scheme can apply styles):

invalid.illegal (the bright red "invalid syntax" warning built into Sublime) for over-running lines
three custom scopes which need to be added manually to the color scheme: tabular.prefix for the brightly highlighted prefix, and then tabular.plain and tabular.plain2 for alternating grey cells, like in the example above.

The reason for the custom scopes is that other than invalid.illegal, no built-in scopes exist which guarantee background colors. I think this is essential to make the fields' positions show up clearly, so it's necessary to define something reliable for this purpose.

Additional custom styles can be defined to make specific fields stand out. The input format has an optional third column after the field name and length, which if set will replace plain or plain2 in the output scope for that field, so it's just a matter of defining extra rules in the color scheme to match.
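The scope choice itself can be sketched in a few lines of Ruby. The name scope_for is hypothetical, and one detail is an assumption on my part: here the grey alternation follows the field's position, so a custom-styled field still "uses up" its turn.

```ruby
# Pick the scope for the field at a given zero-based position.
# A non-empty style from the optional third column wins over the grey cells.
def scope_for(index, style = nil)
  return "tabular.#{style}" if style && !style.empty?

  index.even? ? "tabular.plain" : "tabular.plain2"
end

# [name, width, optional style], as in the input table
fields = [["currency", 3], ["sign", 1], ["amount", 12, "amount"], ["processeddate", 6]]

fields.each_with_index do |(name, _width, style), i|
  puts "#{name}: #{scope_for(i, style)}"
end
```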

How to use the generator with Sublime Text

Settings > Customize Color Scheme

This will spawn a split window with the selected color scheme on the left, and a file for extensions and customization on the right.

Into the right pane, copy and paste the basic color definition "rules" from the example sublime/user.sublime-color-scheme.

The filename that Sublime offers when you save this pane depends on your selected color scheme, since it expects you to be mostly customizing that scheme rather than adding new definitions.

(If you want to define custom styles for certain fields, you can do this with additional rules in this file. amount as a custom style in the input table would correspond to tabular.amount as the "scope" here.)

Tools > Developer > New Syntax

This will spawn a window which has no filename by default, but will save into the correct directory for Sublime to recognize the syntax definition.

Copy and paste the generated sublime/<format>.sublime-syntax file into this window, and save with the same name and extension.

(Alternatively, just find that same directory and copy the generated file directly in.)

(You could create a link straight from this Packages/User directory to the generated output, but Sublime only detects changes to files in the folder, and doesn't follow links, so live updates won't happen if you do this.)
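For the color scheme step, the rules are plain JSON objects with a scope plus foreground and background colors, so they can be generated too if you end up with many custom styles. The Ruby sketch below is only an illustration: the colors are placeholders I picked, not the ones in the example user.sublime-color-scheme, and color_scheme is a made-up helper.

```ruby
require "json"

# Placeholder colors (assumptions); swap in whatever suits your color scheme.
BASE_RULES = {
  "tabular.prefix" => { "foreground" => "#000000", "background" => "#ffcc00" },
  "tabular.plain"  => { "foreground" => "#ffffff", "background" => "#444444" },
  "tabular.plain2" => { "foreground" => "#ffffff", "background" => "#666666" }
}.freeze

# Build the JSON for the customization pane, with optional extra scopes.
def color_scheme(extra = {})
  rules = BASE_RULES.merge(extra).map do |scope, colors|
    { "scope" => scope }.merge(colors)
  end
  JSON.pretty_generate({ "rules" => rules })
end

# A field tagged "amount" in the input table needs a tabular.amount rule.
puts color_scheme("tabular.amount" => { "foreground" => "#000000", "background" => "#99dd99" })
```

If the customization file already has content, merge the generated rules into its existing "rules" array rather than replacing the file.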

How to use the generator with Vim (or Neovim)

Similarly, you'll need a file in ~/.config/nvim/syntax for each format, and also something in ~/.config/nvim/ftdetect for the autocmd that selects the filetype/syntax based on the file extension (for plain Vim, the equivalent directories are ~/.vim/syntax and ~/.vim/ftdetect). Copying from the vim-syntax and vim-ftdetect output folders is hopefully easy, but live reloading on changes doesn't seem to be something that can easily happen.

The basic styles — tblPrefix, tblPlain, tblPlain2 and tblWarn — have highlight declarations in each generated syntax file.

If using custom styles shared across multiple file types, it's potentially easier to define those in init.vim or somewhere shared, rather than modifying the code to dump them or trying to somehow merge the generated and manual config. It's a matter of taste I suppose.

In this case, tagging a field as amount in the input table will result in the tag tblAmount for highlighting purposes.
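For a feel of what the Vim side could look like, here is a Ruby sketch that emits an ftdetect line and a chain of syntax match rules for one line type. This is my own guess at a workable shape, not necessarily what the generator outputs: the per-field group names (flap5L99F1 and so on) and the helper names are invented, and it leaves out the highlight declarations for tblPrefix, tblPlain, tblPlain2 and tblWarn themselves.

```ruby
# Map a field to its highlight group: "amount" becomes tblAmount,
# otherwise alternate between the two plain groups by position.
def vim_highlight(index, style = nil)
  return "tbl#{style.capitalize}" if style && !style.empty?

  index.even? ? "tblPlain" : "tblPlain2"
end

# Line for the ftdetect file, e.g. for "flap5" and ".fl5".
def vim_ftdetect(name, extension)
  "autocmd BufRead,BufNewFile *#{extension} set filetype=#{name}"
end

# Chain the fields with nextgroup so each one can only follow the previous one.
# fields is a list of [name, width, optional style].
def vim_syntax(name, prefix, fields)
  group = lambda do |i|
    i < fields.length ? "#{name}L#{prefix}F#{i + 1}" : "#{name}L#{prefix}End"
  end
  out = ["syntax match tblPrefix /^#{prefix}/ nextgroup=#{group.call(0)}"]
  fields.each_with_index do |(_name, width, style), i|
    out << "syntax match #{group.call(i)} /.\\{#{width}}/ contained nextgroup=#{group.call(i + 1)}"
    out << "highlight default link #{group.call(i)} #{vim_highlight(i, style)}"
  end
  # Anything left over after the last field is an overrun.
  out << "syntax match #{group.call(fields.length)} /.\\+/ contained"
  out << "highlight default link #{group.call(fields.length)} tblWarn"
  out
end

puts vim_ftdetect("flap5", ".fl5")
puts vim_syntax("flap5", "99", [["count", 4], ["status", 4], ["comment", 9]])
```

Linking per-field groups to the shared tbl* groups means custom styles defined once in init.vim apply to every generated format.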