Reveal Thyself! Visualizing LuaTeX Node Structure

LuaTeX is a wonderful extension to the traditional TeX family (e.g. pdfTeX, XeTeX), for it provides a powerful programming back end for LaTeX. With the help of Lua, it is easier to work with the file system, carry out numerical computations, do simple data processing, handle verbatim content, and so on. LuaTeX is the dream tool for automatic generation of high-quality PDF documents.

The TeX compiler stores the document structure as nodes kept in linked lists, which hold data and metadata about each character, paragraph and page. With LuaTeX, these internal data structures become visible to TeX users for the first time, which means it is possible to view or even manipulate a document by accessing the internals of the TeX compiler. This feature sounds very promising. Sadly, documentation about LuaTeX is scarce. The most comprehensive document is its manual, which spends no more than a few paragraphs, and often only a few sentences, on most features. With this manual alone, it is very difficult to understand thoroughly how LuaTeX nodes work. In this post, I provide a graphical visualization of LuaTeX’s node structure as an alternative to existing presentations of LuaTeX nodes, which are usually text-based and often faulty.

GitHub: https://github.com/xziyue/luatex-node-inspect

LuaTeX nodes are “black boxes”

Possibly as a design choice for speed, LuaTeX implements nodes as userdata objects rather than the Lua tables most of us are familiar with. These objects resemble C/C++ extension modules in Python, but they are more opaque than their Python counterparts. In Python, a C/C++ extension module at least provides __dir__ or __doc__ attributes that list its members or document them. This information is not available for userdata objects. In Lua, such an extension only needs to implement an __index method, which returns the corresponding attribute value when given the name as a string. Since this function is written in C and compiled to binary form, there is no way for users to find out what members a userdata object contains. This is exactly the case for LuaTeX nodes: given a node object, there is no easy way to tell what data it holds.
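The contrast can be sketched in Python. The class below is hypothetical and is not part of LuaTeX or of this project: it mimics a userdata object by answering attribute lookups through a single hook (__getattr__, playing the role of Lua's __index), so its fields can be read by name but never appear in dir(). Unlike real userdata, the Python stand-in still keeps its storage in a reachable attribute; in LuaTeX that storage lives in C.

```python
class OpaqueNode:
    """Hypothetical stand-in for a LuaTeX node userdata object."""

    __slots__ = ("_fields",)

    def __init__(self, **fields):
        self._fields = dict(fields)

    def __getattr__(self, name):
        # Called only when normal lookup fails, like Lua's __index metamethod.
        # Unknown names yield None, much as a missing Lua field yields nil.
        return self._fields.get(name)


glue = OpaqueNode(id="glue", width=16.0, stretch=0)

# Asking for a field by name works...
print(glue.width)
# ...but introspection does not reveal which names are valid.
print("width" in dir(glue))
```

Without an external list of valid field names, the only option is to guess, which is why a field table is needed.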

Unboxing LuaTeX nodes

In order to understand node objects, we must find a way to list their members. Although it is impossible to do this directly in Lua, this information is tabulated in the LuaTeX manual. Therefore, I manually entered the members of all LuaTeX node types in luatex_node_inspect.py. This information is passed to Lua through a JSON file, luatex-node.json. With it in hand, the Lua script luatex-node-to-table.lua can recursively convert LuaTeX nodes from userdata to Lua tables.
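The idea of pairing a field table with a recursive conversion can be sketched in Python rather than Lua. Everything here is an assumption made for illustration: the layout of FIELD_TABLE, the FakeNode class and the node_to_dict function are hypothetical, and the real luatex-node.json and luatex-node-to-table.lua may be organized quite differently.

```python
import json

# Assumed layout: node type name -> list of field names to read.
# The real luatex-node.json may be organized differently.
FIELD_TABLE = json.loads("""
{
  "hlist": ["width", "height", "depth", "head"],
  "glue":  ["width", "stretch", "shrink"],
  "glyph": ["char", "font", "width"]
}
""")


class FakeNode:
    """Hypothetical opaque node: fields are readable only by name."""

    def __init__(self, type_name, **fields):
        self.type_name = type_name
        self._fields = fields

    def get(self, name):
        return self._fields.get(name)


def node_to_dict(node):
    """Read every known field of `node`, recursing into child nodes."""
    result = {"id": node.type_name}
    for name in FIELD_TABLE.get(node.type_name, []):
        value = node.get(name)
        if isinstance(value, FakeNode):
            value = node_to_dict(value)
        elif isinstance(value, list):
            value = [node_to_dict(child) for child in value]
        result[name] = value
    return result


box = FakeNode(
    "hlist", width=345.0, height=7.4, depth=0.0,
    head=[FakeNode("glue", width=0.0, stretch=65536, shrink=0),
          FakeNode("glyph", char=49, font=37, width=6.36)],
)
print(json.dumps(node_to_dict(box), indent=2))
```

The conversion only ever asks for names listed in the table, so a node type missing from the table comes out with nothing but its id.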

Generating Visualization

The Lua script generates a Lua table representation of LuaTeX nodes. With the inspect.lua library, we can already gain deeper insight into a TeX document. Nevertheless, the text-based representation is too long to interpret. For example, the inspect string for a blank document is shown below. It is very difficult to understand the content and structure of a document from it. As a result, I save the Lua table as a JSON file and use Python’s graphviz package to build a graphical representation of the node structure. The graphing code is generate_vis.py.

{
  attr = {
    ["0"] = 0,
    id = "attribute_list(40)"
  },
  depth = "0.000000pt",
  dir = "TLT",
  glue_order = 0,
  glue_set = 0.0,
  glue_sign = 0,
  head = { {
      attr = {
        ["0"] = 0,
        id = "attribute_list(40)"
      },
      id = "glue(12)",
      leader = {},
      next = { "<node    565 <    673 >    nil : vlist 0>" },
      prev = {},
      shrink = 0,
      shrink_over = {},
      stretch = 0,
      stretch_over = {},
      subtype = "userskip(0)",
      width = "16.000000pt"
    }, {
      attr = {
        ["0"] = 0,
        id = "attribute_list(40)"
      },
      depth = "0.000000pt",
      dir = "TLT",
      glue_order = 0,
      glue_set = 0.0,
      glue_sign = 0,
      head = { {
          attr = {
            ["0"] = 0,
            id = "attribute_list(40)"
          },
          depth = "0.000000pt",
          dir = "TLT",
          glue_order = 2,
          glue_set = 12.0,
          glue_sign = 1,
          head = { {
              attr = {
                ["0"] = 0,
                id = "attribute_list(40)"
              },
              id = "glue(12)",
              leader = {},
              next = { "<node    586 <    600 >    nil : hlist 2>" },
              prev = {},
              shrink = 0,
              shrink_over = {},
              stretch = 65536,
              stretch_over = {},
              subtype = "userskip(0)",
              width = "0.000000pt"
            }, {
              attr = {
                ["0"] = 0,
                id = "attribute_list(40)"
              },
              depth = "0.000000pt",
              dir = "TLT",
              glue_order = 0,
              glue_set = 0.0,
              glue_sign = "normal(0)",
              head = {},
              height = "0.000000pt",
              id = "hlist(0)",
              list = {},
              next = {},
              prev = { "<node    nil <    586 >    600 : glue 0>" },
              shift = 0,
              subtype = "box(2)",
              width = "345.000000pt"
            } },
          height = "12.000000pt",
          id = "vlist(1)",
          next = { "<node    618 <    579 >    627 : glue 0>" },
          prev = {},
          shift = 0,
          subtype = "unknown(0)",
          width = "345.000000pt"
        }, {
          attr = {
            ["0"] = 0,
            id = "attribute_list(40)"
          },
          id = "glue(12)",
          leader = {},
          next = { "<node    579 <    627 >    354 : glue 1>" },
          prev = { "<node    nil <    618 >    579 : vlist 0>" },
          shrink = 0,
          shrink_over = {},
          stretch = 0,
          stretch_over = {},
          subtype = "userskip(0)",
          width = "25.000000pt"
        }, {
          attr = {
            ["0"] = 0,
            id = "attribute_list(40)"
          },
          id = "glue(12)",
          leader = {},
          next = { "<node    627 <    354 >    650 : vlist 0>" },
          prev = { "<node    618 <    579 >    627 : glue 0>" },
          shrink = 0,
          shrink_over = {},
          stretch = 0,
          stretch_over = {},
          subtype = "lineskip(1)",
          width = "0.000000pt"
        }, {
          attr = {
            ["0"] = 0,
            id = "attribute_list(40)"
          },
          depth = "0.000000pt",
          dir = "TLT",
          glue_order = 2,
          glue_set = 539.94232177734,
          glue_sign = 1,
          head = { {
              attr = {
                ["0"] = 0,
                id = "attribute_list(40)"
              },
              data = "",
              id = "whatsit(8)",
              next = { "<node     80 <    405 >    345 : glue 10>" },
              prev = {},
              stream = 129,
              subtype = "write(1)"
            }, {
              attr = {
                ["0"] = 0,
                id = "attribute_list(40)"
              },
              id = "glue(12)",
              leader = {},
              next = { "<node    405 <    345 >    500 : hlist 2>" },
              prev = { "<node    nil <     80 >    405 : whatsit 1>" },
              shrink = 0,
              shrink_over = {},
              stretch = 0,
              stretch_over = {},
              subtype = "topskip(10)",
              width = "10.000000pt"
            }, {
              attr = {
                ["0"] = 0,
                id = "attribute_list(40)"
              },
              depth = "0.000000pt",
              dir = "TLT",
              glue_order = 0,
              glue_set = 0.0,
              glue_sign = "normal(0)",
              head = {},
              height = "0.000000pt",
              id = "hlist(0)",
              list = {},
              next = { "<node    345 <    500 >    398 : glue 0>" },
              prev = { "<node     80 <    405 >    345 : glue 10>" },
              shift = 0,
              subtype = "box(2)",
              width = "0.000000pt"
            }, {
              attr = {
                ["0"] = 0,
                id = "attribute_list(40)"
              },
              id = "glue(12)",
              leader = {},
              next = { "<node    500 <    398 >    493 : glue 0>" },
              prev = { "<node    405 <    345 >    500 : hlist 2>" },
              shrink = 0,
              shrink_over = {},
              stretch = 65536,
              stretch_over = {},
              subtype = "userskip(0)",
              width = "0.000000pt"
            }, {
              attr = {
                ["0"] = 0,
                id = "attribute_list(40)"
              },
              id = "glue(12)",
              leader = {},
              next = { "<node    398 <    493 >    nil : glue 0>" },
              prev = { "<node    345 <    500 >    398 : glue 0>" },
              shrink = 0,
              shrink_over = {},
              stretch = 0,
              stretch_over = {},
              subtype = "userskip(0)",
              width = "0.000000pt"
            }, {
              attr = {
                ["0"] = 0,
                id = "attribute_list(40)"
              },
              id = "glue(12)",
              leader = {},
              next = {},
              prev = { "<node    500 <    398 >    493 : glue 0>" },
              shrink = 0,
              shrink_over = {},
              stretch = 7,
              stretch_over = {},
              subtype = "userskip(0)",
              width = "0.000000pt"
            } },
          height = "550.000000pt",
          id = "vlist(1)",
          next = { "<node    354 <    650 >    664 : glue 2>" },
          prev = { "<node    579 <    627 >    354 : glue 1>" },
          shift = 0,
          subtype = "unknown(0)",
          width = "0.000000pt"
        }, {
          attr = {
            ["0"] = 0,
            id = "attribute_list(40)"
          },
          id = "glue(12)",
          leader = {},
          next = { "<node    650 <    664 >    nil : hlist 2>" },
          prev = { "<node    627 <    354 >    650 : vlist 0>" },
          shrink = 0,
          shrink_over = {},
          stretch = 0,
          stretch_over = {},
          subtype = "baselineskip(2)",
          width = "22.578125pt"
        }, {
          attr = {
            ["0"] = 0,
            id = "attribute_list(40)"
          },
          depth = "0.000000pt",
          dir = "TLT",
          glue_order = 2,
          glue_set = 169.31884765625,
          glue_sign = "stretching(1)",
          head = { {
              attr = {
                ["0"] = 0,
                id = "attribute_list(40)"
              },
              id = "glue(12)",
              leader = {},
              next = { "<node    643 <    636 >    657 : glyph 256>" },
              prev = {},
              shrink = 0,
              shrink_over = {},
              stretch = 65536,
              stretch_over = {},
              subtype = "userskip(0)",
              width = "0.000000pt"
            }, {
              attr = {
                ["0"] = 0,
                id = "attribute_list(40)"
              },
              char = "'1'(49)",
              components = {},
              data = 0,
              depth = "0.000000pt",
              expansion_factor = 0,
              font = 37,
              height = "7.421875pt",
              id = "glyph(29)",
              lang = 0,
              left = 2,
              next = { "<node    636 <    657 >    nil : glue 0>" },
              prev = { "<node    nil <    643 >    636 : glue 0>" },
              right = 3,
              subtype = "*unknown*(256)",
              uchyph = 1,
              width = "6.362305pt",
              xoffset = 0,
              yoffset = 0
            }, {
              attr = {
                ["0"] = 0,
                id = "attribute_list(40)"
              },
              id = "glue(12)",
              leader = {},
              next = {},
              prev = { "<node    643 <    636 >    657 : glyph 256>" },
              shrink = 0,
              shrink_over = {},
              stretch = 65536,
              stretch_over = {},
              subtype = "userskip(0)",
              width = "0.000000pt"
            } },
          height = "7.421875pt",
          id = "hlist(0)",
          next = {},
          prev = { "<node    354 <    650 >    664 : glue 2>" },
          shift = 0,
          subtype = "box(2)",
          width = "345.000000pt"
        } },
      height = "617.000000pt",
      id = "vlist(1)",
      next = {},
      prev = { "<node    nil <    565 >    673 : glue 0>" },
      shift = 4063232,
      subtype = "unknown(0)",
      width = "345.000000pt"
    } },
  height = "633.000000pt",
  id = "vlist(1)",
  next = {},
  prev = {},
  shift = 0,
  subtype = "unknown(0)",
  width = "407.000000pt"
}
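To give a feel for how such a table can be turned into a graph, here is a minimal sketch that walks a nested table and emits Graphviz DOT text. It is not the code in generate_vis.py: it deliberately uses only the standard library instead of the graphviz package, the function name table_to_dot is hypothetical, and it assumes children are stored as a list under "head", as in the dump above.

```python
def table_to_dot(root):
    """Return Graphviz DOT text for a nested node table (dicts with a 'head' list)."""
    lines = ["digraph nodes {", "  node [shape=box];"]
    counter = 0

    def visit(tbl):
        nonlocal counter
        name = f"n{counter}"
        counter += 1
        # Label each vertex with the node type and, when present, its subtype.
        label = str(tbl.get("id", "?"))
        if "subtype" in tbl:
            label += "\\n" + str(tbl["subtype"])
        label = label.replace('"', '\\"')
        lines.append(f'  {name} [label="{label}"];')
        # Empty Lua tables may arrive as {} or []; both mean "no children".
        for child in tbl.get("head") or []:
            if isinstance(child, dict):
                lines.append(f"  {name} -> {visit(child)};")
        return name

    visit(root)
    lines.append("}")
    return "\n".join(lines)


sample = {
    "id": "vlist(1)", "subtype": "unknown(0)",
    "head": [
        {"id": "glue(12)", "subtype": "baselineskip(2)"},
        {"id": "hlist(0)", "subtype": "box(2)",
         "head": [{"id": "glyph(29)", "char": "'1'(49)"}]},
    ],
}
print(table_to_dot(sample))
```

The resulting text can be saved to a file and rendered with Graphviz's dot command, for example dot -Tpdf nodes.dot -o nodes.pdf.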

Visualizing TeX documents’ internal representation

Throughout this section, I will be using the document template below, where contents are inserted between \begin{document} and \end{document}. The \AtBeginShipout hook is called for each page, and each call overwrites temp.json. Therefore, the current implementation only preserves the last page of the document. Modifications are needed to visualize multi-page documents.

\documentclass{article}
\usepackage{fontspec}
\usepackage{luacode}
\usepackage{graphicx}
\usepackage{xcolor}
\usepackage{atbegshi}

\setmainfont{DejaVu Serif}

\begin{luacode*}
require "luatex-node-to-table"
inspect = require"inspect"

local my_param = get_default_param()
my_param["expand_depth"] = 0

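-- Convert node n to a Lua table, then descend into its child list (if any).
function recursive_expand_node(n)
  local tbl = luatex_node_to_table(n, my_param)
  -- Nodes without a child list yield nil here.
  local head = n.head
  if head ~= nil then
    local lst_tbl = {}
    -- Drop the "list" entry so the children are stored only once, under "head".
    tbl["list"] = nil
    for n1 in node.traverse(head) do
      table.insert(lst_tbl, recursive_expand_node(n1))
    end
    tbl["head"] = lst_tbl
  end
  return tbl
end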
\end{luacode*}

\AtBeginShipout{%
  \directlua{
    local n = tex.box["AtBeginShipoutBox"]
    local all_n = recursive_expand_node(n)
    local inspect_text = inspect(all_n)
    texio.write_nl(inspect_text)
    local json_text = json.encode(all_n)
    local file = io.open("temp.json", "w")
    file:write(json_text)
    file:close()
  }
}

\begin{document}

\end{document}
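Once temp.json has been written, it can be inspected from Python before any drawing is done. The sketch below is an illustration under assumptions, not part of the repository: the helper names split_name_id and count_node_types are hypothetical, and an inline JSON string stands in for the file so the example runs without compiling a document. It relies on the "name(number)" string convention visible in the inspect dump.

```python
import json
import re
from collections import Counter

# Matches strings such as "glue(12)" or "baselineskip(2)" seen in the dump.
NAME_ID = re.compile(r"^(?P<name>.*)\((?P<num>-?\d+)\)$")


def split_name_id(text):
    """Split 'glue(12)' into ('glue', 12); return (text, None) if no match."""
    match = NAME_ID.match(text)
    if match is None:
        return text, None
    return match.group("name"), int(match.group("num"))


def count_node_types(tbl, counts=None):
    """Count nodes by type name, descending through 'head' lists."""
    counts = Counter() if counts is None else counts
    name, _ = split_name_id(str(tbl.get("id", "?")))
    counts[name] += 1
    for child in tbl.get("head") or []:
        if isinstance(child, dict):
            count_node_types(child, counts)
    return counts


# Stand-in for the contents of temp.json.
sample = json.loads("""
{"id": "vlist(1)", "head": [
  {"id": "glue(12)", "subtype": "userskip(0)"},
  {"id": "hlist(0)", "head": [
    {"id": "glue(12)"}, {"id": "glyph(29)", "char": "'1'(49)"}, {"id": "glue(12)"}
  ]}
]}
""")
print(count_node_types(sample))
```

A quick count like this is a cheap sanity check that the dump contains what the page should contain before rendering the full graph.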

Blank document

vis-1

Document with text “abc”

vis-2

Document with a picture

vis-3

Document with colored text

vis-4

Document with math

vis-5

vis-6

Document with rich media

vis-7

More reading