This is a simple C/C++ and Java tokenizer program written in C. The same
repository also offers separate programs for a Python tokenizer (`pytokenize`)
and a JavaScript tokenizer (`jstokenize`). They all share most of the
command-line options and have the same output formats. Here we focus on the
C/C++/Java tokenizer (`tokenize`), but most of this documentation applies
equally to the other tokenizer programs. The Makefile builds them all.
The following lexeme classes are recognized:
- identifier
- reserved word/keyword
- binary, octal, decimal, hexadecimal and floating-point numbers
- double-quoted string literal
- single-quoted character literal
- all single, double, and triple operator and punctuation symbols
- the preprocessor tokens # and ##
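As an illustration (this fragment is not from the repository), the following C code contains tokens from each of these classes:

```c
#define LIMIT 0x1F            /* '#' preprocessor token, identifiers, hex number */

int main(void) {              /* 'int' and 'void' are keywords */
    int bin = 0b1010;         /* binary number (C23 or compiler extension) */
    int oct = 0755;           /* octal number */
    double f = 2.5e-3;        /* floating-point number */
    char nl = '\n';           /* single-quoted character literal */
    const char *msg = "hi";   /* double-quoted string literal */
    bin <<= 1;                /* '<<=' is a triple operator symbol */
    return (bin & LIMIT) + oct + (int)f + nl + (msg != 0);
}
```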
For each correctly recognized token, the program determines its class/type and the exact coordinates (line number and column) of its starting character in the input text. All token literals are output exactly as they appear in the source text, without any interpretation of escape sequences.
A newline is defined as a single linefeed character `\n` or the combination
carriage return `\r` followed by linefeed `\n`.
Line continuations (a backslash immediately followed by a newline) are handled
at the character input level, so the token recognizers will only see logical
lines. Line and column reflect positions in the physical line structure, not the logical one.
For instance, a line continuation inside a string literal:
"A long string literal that is broken here \
to stretch over two lines."
upon output as a token becomes:
"A long string literal that is broken here to stretch over two lines."
Moreover, white space, control characters, and comments are skipped; anything left over is flagged as an illegal character.
Since Java at the lexical level is very close to C and C++, this tokenizer
can also be used for Java, although some literal peculiarities are not
recognized. The program looks at the filename extension to determine the
language. This can be overridden (and must be specified when reading from
standard input) with the `-l` option.
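For example, when reading from standard input the language must be given explicitly (the filename here is just a placeholder):

$ tokenize -l C++ < sample.cpp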
Depending on the language setting, the proper set of keywords is recognized.
For C and C++, their combined set of 95 keywords is recognized, on the
assumption that a C program will not inadvertently use C++ keywords as regular
identifiers.
The program options can be listed with the `-h` option:
$ tokenize -h
A tokenizer for C/C++ (and Java) source code with output in 6 formats.
Recognizes the following token classes: keyword, identifier, integer,
floating, string, character, operator, and preprocessor.
usage: tokenize [ -1acdhjl:m:no:rsvw ] [ FILES ]
Command line options are:
-a : append to output file instead of create or overwrite.
-c : treat a # character as the start of a line comment.
-d : print debug info to stderr; implies -v.
-h : print just this text to stderr and stop.
-j : assume input is Java (deprecated: use -l Java or .java).
-l<lang> : specify language explicitly (C, C++, Java).
-m<mode> : output mode either plain (default), csv, json, jsonl, xml, or raw.
-n : output newlines as a special pseudo token.
-o<file> : write output to this file (instead of stdout).
-s : enable a special start token specifying the filename.
-1 : treat all filename arguments as a continuous single input.
-v : print action summary to stderr.
-w : suppress all warning messages.
The program reads multiple files. Depending on the `-1` option, the files are
either treated as a single continuous input or processed separately. When
processed separately (no `-1` option), the output is as if each file were
processed individually, emitting start and end symbols where appropriate for
the mode setting.
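For example, to treat two files as one continuous input and write CSV output to a file:

$ tokenize -1 -m csv -o tokens.csv lexer.c parser.c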
The tokenizer has several output modes: plain text (the default), CSV, JSON, JSONL, XML, and raw. A sample of plain text output looks like this:
( 62, 0) preprocessor: #
( 62, 1) identifier: include
( 62, 9) string: "perfect_hash.h"
( 64, 0) preprocessor: #
( 64, 1) identifier: define
( 64, 8) identifier: token_add
( 64, 17) operator: (
( 64, 18) identifier: cc
( 64, 20) operator: )
( 65, 2) keyword: do
( 65, 5) operator: {
( 65, 7) keyword: if
( 65, 10) operator: (
( 65, 11) identifier: len
( 65, 15) operator: <
( 65, 17) identifier: MAX_TOKEN
( 65, 26) operator: )
( 65, 28) identifier: token
( 65, 33) operator: [
( 65, 34) identifier: len
( 65, 37) operator: ++
( 65, 39) operator: ]
( 65, 41) operator: =
( 65, 43) operator: (
( 65, 44) identifier: cc
( 65, 46) operator: )
( 65, 47) operator: ;
( 65, 49) operator: }
( 65, 51) keyword: while
( 65, 56) operator: (
( 65, 57) integer: 0
( 65, 58) operator: )
( 68, 0) keyword: int
( 68, 4) identifier: get
( 68, 7) operator: (
( 68, 8) operator: )
Line numbers are 1-based; columns start at 0 (Emacs-style). The token classes are:
| Class | Description |
|---|---|
| identifier | any identifier |
| keyword | a reserved word |
| integer | an integer number, irrespective of notation |
| floating | a floating-point number |
| string | a double-quoted string (possibly empty) |
| character | a single-quoted character |
| operator | any operator or punctuator symbol |
| preprocessor | either # or ## |
| filename | pseudo token: start of a new file |
| newline | pseudo token: end of a logical line |
The `filename` token is optional. It will be included when the `-s` option is
given. It is a pseudo token that provides the filename of the input as the
first token. Similarly, the `newline` token is a pseudo token and appears only
with the `-n` option. It signals the end of a logical line. Note that multiple
newlines occurring in sequence are not suppressed. The `newline` token has no
textual representation; in XML mode output, for example, it will appear as an
empty text element.
CSV output has this header and a few lines of sample rows:
line,column,class,token
...
624,12,operator,:
625,6,identifier,fprintf
625,13,operator,(
625,14,identifier,stdout
625,20,operator,","
625,22,string,"""</tokens>\n"""
...
The operator token `,` is escaped with double quotes, like so: `","`.
String tokens are escaped as well, and any original double quote is doubled.
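A minimal sketch of this CSV escaping rule (a hypothetical helper, not the repository's code): a field containing a comma or a double quote is wrapped in double quotes, and any embedded double quote is doubled:

```c
#include <stdio.h>
#include <string.h>

/* Print one CSV field, quoting it when it contains a comma or a quote. */
static void csv_field(FILE *out, const char *s) {
    if (strpbrk(s, ",\"") == NULL) {    /* nothing special: print as-is */
        fputs(s, out);
        return;
    }
    putc('"', out);
    for (; *s; s++) {
        if (*s == '"') putc('"', out);  /* double any embedded quote */
        putc(*s, out);
    }
    putc('"', out);
}
```

Applied to the string token `"</tokens>\n"` above, this yields `"""</tokens>\n"""`, matching the sample row.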
In JSON output all token values are represented as strings. String class tokens themselves are properly escaped; in particular, backslash escape characters are doubled.
[
...
{ "line": 624, "column": 12, "class": "operator", "token": ":" },
{ "line": 625, "column": 6, "class": "identifier", "token": "fprintf" },
{ "line": 625, "column": 13, "class": "operator", "token": "(" },
{ "line": 625, "column": 14, "class": "identifier", "token": "stdout" },
{ "line": 625, "column": 20, "class": "operator", "token": "," },
{ "line": 625, "column": 22, "class": "string", "token": "\"</tokens>\\n\"" },
...
]
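The JSON escaping step could be sketched like this (again a hypothetical helper): backslashes and double quotes inside the token text are prefixed with a backslash, which is why the backslash of the `\n` in the sample above appears doubled:

```c
#include <stdio.h>

/* Print a token as a JSON string. Control characters need no handling
   here because they never survive into tokens. */
static void json_string(FILE *out, const char *s) {
    putc('"', out);
    for (; *s; s++) {
        switch (*s) {
        case '"':  fputs("\\\"", out); break;
        case '\\': fputs("\\\\", out); break;
        default:   putc(*s, out);
        }
    }
    putc('"', out);
}
```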
There is also a related JSONL mode that outputs one token object per line, but does not collect them as a JSON array.
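For example, the earlier sample rows in JSONL mode would come out along these lines, with no enclosing array or separating commas:

{ "line": 625, "column": 6, "class": "identifier", "token": "fprintf" }
{ "line": 625, "column": 13, "class": "operator", "token": "(" }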
For XML output, the three characters `<`, `>`, and `&` are escaped by
replacing them with the corresponding entities in character and string class
tokens. (An alternative would be to use the CDATA construct.)
<?xml version="1.0" encoding="UTF-8"?>
<tokens>
...
<token line="624" column="12" class="operator">:</token>
<token line="625" column="6" class="identifier">fprintf</token>
<token line="625" column="13" class="operator">(</token>
<token line="625" column="14" class="identifier">stdout</token>
<token line="625" column="20" class="operator">,</token>
<token line="625" column="22" class="string">"</tokens>\n"></token>
...
</tokens>
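The entity replacement itself is straightforward; a hypothetical helper (not the repository's code) might look like this:

```c
#include <stdio.h>

/* Print token text with XML entity escaping for <, > and &. */
static void xml_text(FILE *out, const char *s) {
    for (; *s; s++) {
        switch (*s) {
        case '<': fputs("&lt;", out);  break;
        case '>': fputs("&gt;", out);  break;
        case '&': fputs("&amp;", out); break;
        default:  putc(*s, out);
        }
    }
}
```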