Improving Pygments for Arbitrary LaTeX Escaping and Custom Highlighting

In LaTeX, the minted syntax highlighting package does not provide a way to highlight or gray out selected code segments. The limitations of the LaTeX formatter in Pygments also make it difficult to use escaped command sequences in strings or comments. In this post, I share my solutions to these two problems. This is based on my answer to a TeX.SE question: Resources for “beautifying” beamer presentations with source code?.

In the current implementation of minted, one cannot escape content inside comments and strings. The escape delimiters are also limited to one character each, which is inconvenient in practice. This problem can be fixed with the following file. It is essentially the same as pygmentize, but with LatexEmbeddedLexer replaced by our own version. In the new implementation (LatexExtendedEmbeddedLexer), we specify our own escape delimiters in the __init__ function; in this example, they are %* and *). The _find_safe_escape_tokens function is modified so that content can be escaped anywhere in the code.

from pygments.token import Token
from pygments.lexer import Lexer, do_insertions
import pygments.formatters.latex as pygfmtlatex

class LatexExtendedEmbeddedLexer(pygfmtlatex.LatexEmbeddedLexer):
    def __init__(self, left, right, lang, **options):
        super().__init__('', '', lang, **options)
        # define left and right delimiters here
        self.left = '%*'
        self.right = '*)'

    # modified so that we can escape stuff in comments and strings
    def _find_safe_escape_tokens(self, text):
        for i, t, v in self._filter_to(
            self.lang.get_tokens_unprocessed(text),
            lambda t: False  # never protect comments or strings from escaping
        ):
            if t is None:
                for i2, t2, v2 in self._find_escape_tokens(v):
                    yield i + i2, t2, v2
            else:
                yield i, None, v

# replace LatexEmbeddedLexer
pygfmtlatex.LatexEmbeddedLexer = LatexExtendedEmbeddedLexer

# the rest is the same as pygmentize
import re
import sys
from pygments.cmdline import main

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
    sys.exit(main())
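
To make the effect of multi-character delimiters concrete, here is a minimal standard-library sketch of how a piece of source text can be split on %* and *). The function name split_escapes is hypothetical and not part of Pygments; it only loosely mirrors the partition-based scanning done by the embedded lexer, and it yields plain tuples rather than real Pygments tokens.

```python
def split_escapes(text, left='%*', right='*)'):
    """Yield (is_escape, chunk) pairs for text containing left...right spans."""
    while text:
        before, sep, text = text.partition(left)
        if before:
            yield False, before
        if not sep:
            break  # no further left delimiter in the text
        inner, sep2, rest = text.partition(right)
        if sep2:
            # A complete left...right span: its content is passed to LaTeX as-is
            yield True, inner
            text = rest
        else:
            # Unterminated escape: treat the delimiter as ordinary text
            yield False, left
            text = inner


for is_escape, chunk in split_escapes('x = 1  # %*$\\alpha$*) done'):
    print(is_escape, repr(chunk))
```

Because the scan runs over the raw text, the same splitting applies whether the delimiters appear in code, in a comment, or inside a string literal.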

The next step is to implement our own formatter that supports highlighting and graying out. Unlike the existing LatexFormatter, I chose to format every character individually, because this makes background colors easier to implement. As a trade-off, borders are not supported. To keep the output small, style commands are defined dynamically, only for the styles that actually occur in the code listing.
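
These two ideas can be sketched in isolation. The names StyleRegistry, command_for, and wrap_chars are made up for this illustration, and the sketch is deliberately simpler than the actual formatter: it only shows how each distinct style gets a command name on first use, and how every character is wrapped in that command. The names use letters only because TeX command names cannot contain digits.

```python
def int_to_alph(n):
    """Map 0, 1, ..., 25, 26, ... to 'a', 'b', ..., 'z', 'aa', ..."""
    n += 1
    letters = []
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters.append(chr(ord('a') + rem))
    return ''.join(reversed(letters))


class StyleRegistry:
    """Hands out one LaTeX command name per distinct style, on first use."""

    def __init__(self, prefix):
        self.prefix = prefix
        self._index = {}

    def command_for(self, style):
        # Dicts are not hashable, so use a sorted tuple of items as the key
        key = tuple(sorted(style.items()))
        if key not in self._index:
            self._index[key] = len(self._index)
        return '\\%s@%s' % (self.prefix, int_to_alph(self._index[key]))


def wrap_chars(text, cmdname):
    """Wrap every character in its own command call."""
    return ''.join('%s{%s}' % (cmdname, ch) for ch in text)


registry = StyleRegistry('PY')
keyword = registry.command_for({'color': '008000', 'bold': True})
print(wrap_chars('def', keyword))  # prints \PY@a{d}\PY@a{e}\PY@a{f}
```

Only styles that are requested ever receive a name, so a listing that uses three styles produces three command definitions rather than one per token type in the style sheet.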

To support highlighting and graying out, a check is added in the formatter: special escaped sequences toggle the corresponding mode on and off. In this example, !!! toggles highlight mode and --- toggles gray out mode. The exact behavior of each mode can be changed in the format_unencoded function.

from io import StringIO

from pygments import highlight
from pygments.formatter import Formatter
import pygments.formatters.latex
from pygments.lexers.python import PythonLexer
from pygments.token import Token, STANDARD_TYPES
from pygments.util import get_bool_opt, get_int_opt
import copy

__all__ = ['LatexFormatter']

class Escaped:

    def __init__(self, s):
        self.s = s

def escape_tex(text, commandprefix):
    return text.replace('\\', '\x00'). \
                replace('{', '\x01'). \
                replace('}', '\x02'). \
                replace('\x00', r'\%sZbs' % commandprefix). \
                replace('\x01', r'\%sZob' % commandprefix). \
                replace('\x02', r'\%sZcb' % commandprefix). \
                replace('^', r'\%sZca' % commandprefix). \
                replace('_', r'\%sZus' % commandprefix). \
                replace('&', r'\%sZam' % commandprefix). \
                replace('<', r'\%sZlt' % commandprefix). \
                replace('>', r'\%sZgt' % commandprefix). \
                replace('#', r'\%sZsh' % commandprefix). \
                replace('%', r'\%sZpc' % commandprefix). \
                replace('$', r'\%sZdl' % commandprefix). \
                replace('-', r'\%sZhy' % commandprefix). \
                replace("'", r'\%sZsq' % commandprefix). \
                replace('"', r'\%sZdq' % commandprefix). \
                replace('~', r'\%sZti' % commandprefix). \
                replace(' ', r'\%sZsp' % commandprefix)

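def escape_tex_new(text, commandprefix):
    # escape character by character; mark escaped results with Escaped
    chars = []
    for c in text:
        new_c = escape_tex(c, commandprefix)
        if c != new_c:
            chars.append(Escaped(new_c))
        else:
            chars.append(c)
    return chars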






%% for compatibility with earlier versions

def _get_ttype_name(ttype):
    fname = STANDARD_TYPES.get(ttype)
    if fname:
        return fname
    aname = ''
    while fname is None:
        aname = ttype[-1] + aname
        ttype = ttype.parent
        fname = STANDARD_TYPES.get(ttype)
    return fname + aname

class LaTeXStyleManager:

    def __init__(self, prefix):
        self.prefix = prefix
        self.style_to_index_d = dict()
        self.style_to_def_d = dict()
        self.style_to_cmdname_d = dict()

        self.toggle_styles = ['bold', 'italic', 'underline', 'roman', 'sans', 'mono']
        self.toggle_style_cmds = [r'\bfseries', r'\itshape', r'\underline', r'\rmfamily', r'\sffamily', r'\ttfamily']

    def number_to_base(self, n, b):
        if n == 0:
            return [0]
        digits = []
        while n:
            digits.append(int(n % b))
            n //= b
        return digits

    def number_to_base_zero_less(self, n, b):
        digits = self.number_to_base(n, b)
        i = 0
        carry = False
        while i < len(digits):
            if carry:
                digits[i] -= 1
                carry = False
            if digits[i] < 0:
                carry = True
                digits[i] = b + digits[i]
            if digits[i] == 0:
                if i < len(digits) - 1:
                    carry = True
                    digits[i] = b
            i += 1
        assert not carry
        if digits[-1] == 0:
            del digits[-1]
        return digits

    def int_to_alph(self, num):
        base26 = self.number_to_base_zero_less(num + 1, 26)
        a_cc = ord('a')
        digits = [chr(a_cc + x - 1) for x in base26]
        s = ''.join(digits)
        return s

    def get_style_def(self, style_ind, style_d, comment_str=None):
        plain_defs = []
        surround_defs = []

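        if style_d['cmd']:
            plain_defs.append(style_d['cmd'])

        for st in self.toggle_styles:
            if style_d[st]:
                if st == 'underline':
                    # \underline takes an argument, so it must surround the text
                    surround_defs.append(r'\underline')
                else:
                    ind = self.toggle_styles.index(st)
                    plain_defs.append(self.toggle_style_cmds[ind])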

        if style_d['color']:
            plain_defs.append(r'\color[HTML]{%s}' % style_d['color'].upper())

        if style_d['bgcolor']:
            surround_defs.append(r'\colorbox[HTML]{%s}' % style_d['bgcolor'].upper())

        if style_d['bgcolor']:
            def_string = ''.join(plain_defs) + r'\strut\relax#1'
        else:
            def_string = ''.join(plain_defs) + r'\relax#1'
        for sd in surround_defs:
            def_string = sd + '{' + def_string + '}'

        if style_d['bgcolor']:
            def_string = r'\setlength{\fboxsep}{0pt}' + def_string

        def_string = r'\newcommand{\%s@%s}[1]{{%s}}' % (self.prefix, self.int_to_alph(style_ind), def_string)
        if comment_str is not None:
            def_string += '%' + comment_str
        cmd_name = r'\%s@%s' % (self.prefix, self.int_to_alph(style_ind))
        return def_string, cmd_name

    def rgb_color(self, col):
        if col:
            return [int(col[i] + col[i + 1], 16) for i in (0, 2, 4)]
        else:
            return [0,0,0]

    def get_default_style_d(self):
        ds = dict()
        for key in self.toggle_styles:
            ds[key] = None
        ds['color'] = None
        ds['bgcolor'] = None
        ds['cmd'] = ''
        return ds

    def style_to_tuple(self, style_d):
        toggle_bits = [False] * len(self.toggle_styles)
        for i in range(len(self.toggle_styles)):
            if style_d[self.toggle_styles[i]]:
                toggle_bits[i] = True
        color = self.rgb_color(style_d['color'])
        bg_color = self.rgb_color(style_d['bgcolor'])
        cmd = style_d.get('cmd', '')
        final = toggle_bits + [cmd] + color + bg_color
        return tuple(final)

    def has_style(self, style_d):
        style_tup = self.style_to_tuple(style_d)
        return style_tup in self.style_to_index_d

    def get_style_index(self, style_d, comment_str=None):
        style_tup = self.style_to_tuple(style_d)
        if style_tup in self.style_to_index_d:
            return self.style_to_index_d[style_tup]
        else:
            # first occurrence of this style: register it and build its definition
            self.style_to_index_d[style_tup] = len(self.style_to_index_d)
            st_ind = self.style_to_index_d[style_tup]
            complete_def, cmd_name = self.get_style_def(st_ind, style_d, comment_str)
            self.style_to_def_d[st_ind] = complete_def
            self.style_to_cmdname_d[st_ind] = cmd_name
            return st_ind

    def merge_styles(self, style_ds):
        sty = self.get_default_style_d()
        for style_d in style_ds:
            for key in self.toggle_styles:
                sty[key] = sty[key] or style_d[key]
            if style_d['color']:
                sty['color'] = style_d['color']
            if style_d['bgcolor']:
                sty['bgcolor'] = style_d['bgcolor']
            if 'cmd' in style_d:
                sty['cmd'] += style_d['cmd']
        return sty

    def get_style_cmdname(self, tts, styles):
        if len(tts) == 0:
            return None
        style_ds = []
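        for item in tts:
            if isinstance(item, dict):
                # a mode style (highlight/gray out) given directly as a dict
                style_ds.append(item)
            else:
                # a token type: look up its style
                style_ds.append(styles[item])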
        merged_style = self.merge_styles(style_ds)
        comment_str = None
        if not self.has_style(merged_style):
            comment_str = '+'.join([str(x) for x in tts])
        st_ind = self.get_style_index(merged_style, comment_str=comment_str)
        return self.style_to_cmdname_d[st_ind]

class LatexFormatter(pygments.formatters.latex.LatexFormatter):
    name = 'LaTeX'
    aliases = ['latex', 'tex']
    filenames = ['*.tex']

    def __init__(self, **options):
        Formatter.__init__(self, **options)
        self.docclass = options.get('docclass', 'article')
        self.preamble = options.get('preamble', '')
        self.linenos = get_bool_opt(options, 'linenos', False)
        self.linenostart = abs(get_int_opt(options, 'linenostart', 1))
        self.linenostep = abs(get_int_opt(options, 'linenostep', 1))
        self.verboptions = options.get('verboptions', '')
        self.nobackground = get_bool_opt(options, 'nobackground', False)
        self.commandprefix = options.get('commandprefix', 'PY')
        self.texcomments = get_bool_opt(options, 'texcomments', False)
        self.mathescape = get_bool_opt(options, 'mathescape', False)
        self.escapeinside = options.get('escapeinside', '')
        if len(self.escapeinside) == 2:
            self.left = self.escapeinside[0]
            self.right = self.escapeinside[1]
            self.escapeinside = ''
        self.envname = options.get('envname', 'Verbatim')

        self.style_manager = LaTeXStyleManager(prefix=self.commandprefix)


    def _create_tt_to_style_d(self):
        self.tt_to_style_d = dict()
        for ttype, ndef in self.style:
            self.tt_to_style_d[ttype] = ndef

    def format_unencoded(self, tokensource, outfile):
        t2n = self.ttype2name
        cp = self.commandprefix

        segments = []
        current_line = []

        # define custom modes
        # highlight mode
        highlight_mode_d = copy.copy(self.tt_to_style_d[Token.Text])
        highlight_mode_d['bgcolor'] = 'FFFF00'
        highlight_mode_d['cmd'] = '\\small'

        grayout_mode_d = copy.copy(self.tt_to_style_d[Token.Text])
        grayout_mode_d['color'] = '808080'
        grayout_mode_d['cmd'] = '\\tiny'

        # define the toggle strings of different modes
        mode_strings = ['!!!', '---']
        mode_styles = [highlight_mode_d, grayout_mode_d]
        mode_flags = [False] * len(mode_strings)

        def flush_line():
            nonlocal segments, current_line
            if len(current_line) > 0:
                segments.append(current_line)
            current_line = []

        toks = list(tokensource)
        for ttype, value in toks:
            new_value = []

            if ttype in Token.Comment:
                new_value.extend(escape_tex_new(value, cp))
                value = escape_tex(value, cp)
            elif ttype not in Token.Escape:
                new_value.extend(escape_tex_new(value, cp))
                value = escape_tex(value, cp)

            raw_ttype = []

            def write_content(cmdname=None, raw=False):
                nonlocal current_line, new_value
                real_cmdname = cmdname
                if raw:
                    # raw content is written without a style command
                    for item in new_value:
                        if isinstance(item, Escaped):
                            current_line.append(item.s)
                        else:
                            current_line.append(item)
                else:
                    if real_cmdname is None:
                        real_cmdname = ''
                    for item in new_value:
                        if isinstance(item, Escaped):
                            current_line.append('%s{%s}' % (real_cmdname, item.s))
                        else:
                            if item == '\n':
                                flush_line()
                            else:
                                current_line.append('%s{%s}' % (real_cmdname, item))

            if ttype in Token.Escape:
                # deal with mode toggle strings
                if value in mode_strings:
                    ind = mode_strings.index(value)
                    mode_flags[ind] = not mode_flags[ind]
                while ttype is not Token:
                    ttype = ttype.parent

                for i in range(len(mode_flags)):
                    if mode_flags[i]:
                        raw_ttype.append(mode_styles[i])

                cmdname = self.style_manager.get_style_cmdname(raw_ttype, self.tt_to_style_d)


        if self.full:
            realoutfile = outfile
            outfile = StringIO()

        # write the command definition
        style_defs = '\n'.join(self.style_manager.style_to_def_d.values())
        outfile.write(style_defs + '\n')
        outfile.write(r'\newcommand{\PYG@COMMAND@PREFIX}{%s}' % cp + '\n')
        outfile.write(r'\newcommand{\PYG@NUM@STYLES}{%d}' % len(self.style_manager.style_to_index_d) + '\n')
        outfile.write(CMD_TEMPLATE % {'cp': cp})

        outfile.write('\\begin{' + self.envname + '}[commandchars=\\\\\\{\\}')
        if self.linenos:
            start, step = self.linenostart, self.linenostep
            outfile.write(',numbers=left' +
                          (start and ',firstnumber=%d' % start or '') +
                          (step and ',stepnumber=%d' % step or ''))
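        if self.mathescape or self.texcomments or self.escapeinside:
            outfile.write(',codes={\\catcode`\\$=3\\catcode`\\^=7'
                          '\\catcode`\\_=8\\relax}')
        if self.verboptions:
            outfile.write(',' + self.verboptions)
        outfile.write(']\n')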

        all_lines = '\n'.join(map(lambda x : ''.join(x), segments))
        outfile.write(all_lines + '\n')

        outfile.write('\\end{' + self.envname + '}\n')

        if self.full:
            encoding = self.encoding or 'utf8'
            # map known existing encodings from LaTeX distribution
            encoding = {
                'utf_8': 'utf8',
                'latin_1': 'latin1',
                'iso_8859_1': 'latin1',
            }.get(encoding.replace('-', '_'), encoding)
            realoutfile.write(DOC_TEMPLATE %
                dict(docclass  = self.docclass,
                     preamble  = self.preamble,
                     title     = self.title,
                     encoding  = encoding,
                     styledefs = '',
                     code      = outfile.getvalue()))
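
The mode handling in format_unencoded can be sketched on its own. This is a simplified illustration under stated assumptions, not the formatter itself: the function name annotate_modes and the mode labels 'highlight' and 'grayout' are hypothetical, and the input is a list of (is_escape, text) pairs rather than Pygments tokens.

```python
MODE_STRINGS = ['!!!', '---']          # toggle strings, as in the formatter above
MODE_NAMES = ['highlight', 'grayout']  # hypothetical labels for this sketch


def annotate_modes(chunks):
    """Return [(text, active_modes)] for an iterable of (is_escape, text)."""
    flags = [False] * len(MODE_STRINGS)
    out = []
    for is_escape, text in chunks:
        if is_escape and text in MODE_STRINGS:
            # A toggle string flips its mode and produces no output itself
            i = MODE_STRINGS.index(text)
            flags[i] = not flags[i]
            continue
        active = tuple(name for name, on in zip(MODE_NAMES, flags) if on)
        out.append((text, active))
    return out


chunks = [
    (False, 'if y == 0:\n'),
    (True, '!!!'),
    (False, 'r = int(x % y)'),
    (True, '!!!'),
    (False, '\nreturn x'),
]
for text, modes in annotate_modes(chunks):
    print(repr(text), modes)
```

Keeping one flag per mode means the modes are independent: both can be active at once, and the styles of all active modes are merged on top of the token's own style.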

In the next step, we need to hack minted so that it uses our lexer and formatter. This is achieved with the following file. We changed the executable from pygmentize to python3, and we pass -x -f -P commandprefix=PY so that our own formatter is used. The code is quite long because I had to copy and paste the whole definition of \minted@pygmentize; for some reason, \patchcmd does not work with it.


% patch \MintedPygmentize so that it uses our script
% this has to be placed after the preamble

% patch \minted@pygmentize so that it uses our own formatter
      \csname minted@optlistcl@lang\minted@lang @i\endcsname}}%
        \detokenize{for /f "usebackq tokens=*"}\space\@percentchar\detokenize{a in (`kpsewhich}\space\minted@codefile\detokenize{`) do}\space
    \MintedPygmentize\space -l #2
    -x -f -P commandprefix=PY -F tokenmerge % using custom formatter and prefix
    \minted@optlistcl@g \csname minted@optlistcl@lang\minted@lang\endcsname 
    \minted@optlistcl@cmd -o \minted@outputdir\minted@infile\space
        \detokenize{`}kpsewhich \minted@codefile\space
          \detokenize{||} \minted@codefile\detokenize{`}%
  % For debugging, uncomment: %%%%
  % \immediate\typeout{\minted@cmd}%
  % %%%%
              \pdf@mdfivesum{\minted@cmd autogobble(\ifx\FancyVerbStartNum\z@ 0\else\FancyVerbStartNum\fi-\ifx\FancyVerbStopNum\z@ 0\else\FancyVerbStopNum\fi)}}}%
              {\immediate\write\minted@code{autogobble(\ifx\FancyVerbStartNum\z@ 0\else\FancyVerbStartNum\fi-\ifx\FancyVerbStopNum\z@ 0\else\FancyVerbStopNum\fi)}}{}%
            %Cheating a little here by using ASCII codes to write `{` and `}`
            %in the Python code
              \detokenize{python -c "import hashlib; import os;
                hasher = hashlib.sha1();
                f = open(os.path.expanduser(os.path.expandvars(\"}\minted@tmpfname@esc.mintedcmd\detokenize{\")), \"rb\");
                f = open(os.path.expanduser(os.path.expandvars(\"}\minted@argone@esc\detokenize{\")), \"rb\");
                f = open(os.path.expanduser(os.path.expandvars(\"}\minted@tmpfname@esc.mintedmd5\detokenize{\")), \"w\");
                macro = \"\\edef\\minted@hash\" + chr(123) + hasher.hexdigest() + chr(125) + \"\";
                f.write(\"\\makeatletter\" + macro + \"\\makeatother\\endinput\n\");
             {\edef\minted@hash{\mdfivesum file {#1}%
                \mdfivesum{\minted@cmd autogobble(\ifx\FancyVerbStartNum\z@ 0\else\FancyVerbStartNum\fi-\ifx\FancyVerbStopNum\z@ 0\else\FancyVerbStopNum\fi)}}}%
             {\edef\minted@hash{\mdfivesum file {#1}%
            \ShellEscape{move /y \minted@outputdir\minted@infile@windows\space\minted@outputdir\minted@actualinfile@windows}%
            \ShellEscape{mv -f \minted@outputdir\minted@infile\space\minted@outputdir\minted@actualinfile}%

An Example

To run this example, make sure:



% must not be included in the preamble

% declare listing style
  listing engine=minted,
  listing only,
  minted language={#1},
  minted options={
    escapeinside=|| % the value here doesn't matter, it just has to be present

% redefine line number style

    # Euclidean algorithm
    def GCD(x , y):
        """This is used to calculate the GCD of the given two numbers.
        You remember the farm land problem where we need to find the 
        largest , equal size , square plots of a given plot?
        Just testing an equation here: %*$\int_a^b \frac{x}{\sin(x)} dx$*)"""
        if y == 0:
            return x
        r = int(x % y)
        return GCD(y , r)

    # Euclidean algorithm
    def GCD(x , y):
        """This is used to calculate the GCD of the given two numbers.
        You remember the farm land problem where we need to find the 
        largest , equal size , square plots of a given plot?"""
        if y == 0:
            return x
        %*!!!*)r = int(x % y)
        return GCD(y , r)%*!!!*)

    # Euclidean algorithm
    def GCD(x , y):
        """This is used to calculate the GCD of the given two numbers.
        %*---*)You remember the farm land problem where we need to find the 
        largest , equal size , square plots of a given plot?"""%*---*)
        if y == 0:
            %*---*)return x%*---*)
        r = int(x % y)
        return GCD(y , r)


Escaping content in strings




Gray out