ANTLR with Kotlin - Part 1: Basics and Project Setup

ANTLR generates parsers from grammar files. You write the grammar, ANTLR produces the parser code. We’ll build a SQL parser that validates queries and extracts metadata.

What You’ll Build

A parser that reads this:

SELECT name, email FROM users WHERE age > 18

And produces this:

QueryMetadata(
    tables = listOf("users"),
    columns = listOf("name", "email"),
    conditions = listOf(Condition("age", ">", 18))
)

By Part 6, you’ll parse complex queries with JOINs, subqueries, and aggregates. Part 1 covers basics: SELECT * FROM users.

ANTLR Concepts

ANTLR (ANother Tool for Language Recognition) is a parser generator. You define a grammar, ANTLR generates Java or Kotlin code that parses text matching that grammar.

Lexer vs Parser

Lexer: Converts text into tokens.

Input: SELECT name FROM users Tokens: [SELECT, name, FROM, users]

Parser: Builds a tree from tokens.

Tokens: [SELECT, name, FROM, users] Tree:

selectStatement
  ├─ SELECT
  ├─ columnList
  │   └─ name
  ├─ FROM
  └─ tableReference
      └─ users

The lexer handles characters. The parser handles structure.

Grammar Files

ANTLR grammars use .g4 files. Two types:

Lexer grammar - defines tokens (keywords, identifiers, operators)
Parser grammar - defines syntax rules (statements, expressions)

Combined grammars include both in one file.

Project Setup

Create a Kotlin project with Gradle. ANTLR generates code during build.

build.gradle.kts

plugins {
    kotlin("jvm") version "1.9.21"
    id("antlr")
}

repositories {
    mavenCentral()
}

dependencies {
    antlr("org.antlr:antlr4:4.13.1")
    implementation("org.antlr:antlr4-runtime:4.13.1")
    implementation("com.strumenta:antlr-kotlin-runtime:1.0.0-RC1")

    testImplementation(kotlin("test"))
}

tasks.generateGrammarSource {
    maxHeapSize = "64m"
    arguments = arguments + listOf("-visitor", "-long-messages")
    outputDirectory = file("build/generated-src/antlr/main")
}

tasks.compileKotlin {
    dependsOn(tasks.generateGrammarSource)
}

kotlin {
    sourceSets["main"].kotlin.srcDir("build/generated-src/antlr/main")
}

How it works:

antlr plugin generates parser from .g4 files
-visitor flag creates visitor classes (we’ll use in Part 3)
Generated code goes to build/generated-src/antlr/main
Kotlin compilation waits for ANTLR generation

Project Structure

src/
  main/
    antlr/
      SimpleSql.g4          # Grammar definition
    kotlin/
      SqlParser.kt          # Our Kotlin code
  test/
    kotlin/
      SqlParserTest.kt      # Tests

Grammar files go in src/main/antlr/. ANTLR finds them automatically.

First Grammar: Simple SELECT

Create src/main/antlr/SimpleSql.g4:

grammar SimpleSql;

// Parser Rules
query
    : SELECT columns FROM table EOF
    ;

columns
    : STAR
    | columnList
    ;

columnList
    : IDENTIFIER (',' IDENTIFIER)*
    ;

table
    : IDENTIFIER
    ;

// Lexer Rules
SELECT  : [Ss][Ee][Ll][Ee][Cc][Tt] ;
FROM    : [Ff][Rr][Oo][Mm] ;
STAR    : '*' ;
IDENTIFIER : [a-zA-Z_][a-zA-Z0-9_]* ;
COMMA   : ',' ;
WS      : [ \t\r\n]+ -> skip ;

Grammar Breakdown

Parser rules (lowercase): Define syntax structure.

query
    : SELECT columns FROM table EOF
    ;

A query is: SELECT keyword, columns, FROM keyword, table name, end of file.

columns
    : STAR
    | columnList
    ;

Columns can be * or a list of identifiers. The | means “or”.

columnList
    : IDENTIFIER (',' IDENTIFIER)*
    ;

A column list is one identifier, optionally followed by comma + identifier (repeated). Matches name, name, email, name, email, age.

Lexer rules (UPPERCASE): Define tokens.

SELECT  : [Ss][Ee][Ll][Ee][Cc][Tt] ;

SELECT matches case-insensitive (SELECT, select, SeLeCt all work).

IDENTIFIER : [a-zA-Z_][a-zA-Z0-9_]* ;

Identifiers start with letter or underscore, followed by any alphanumeric or underscore. Matches users, user_id, firstName.

WS : [ \t\r\n]+ -> skip ;

Whitespace is recognized but skipped (not part of parse tree).

Using the Parser

Build the project to generate parser code:

./gradlew generateGrammarSource

ANTLR creates these files:

SimpleSqlLexer.java - tokenizes input
SimpleSqlParser.java - builds parse tree
SimpleSqlVisitor.java - interface for tree traversal
SimpleSqlBaseVisitor.java - base implementation

Parse a Query

Create src/main/kotlin/SqlParser.kt:

import org.antlr.v4.runtime.*
import org.antlr.v4.runtime.tree.*

fun parseQuery(sql: String): ParseTree {
    val input = CharStreams.fromString(sql)
    val lexer = SimpleSqlLexer(input)
    val tokens = CommonTokenStream(lexer)
    val parser = SimpleSqlParser(tokens)

    return parser.query()
}

fun main() {
    val sql = "SELECT * FROM users"
    val tree = parseQuery(sql)
    println(tree.toStringTree(SimpleSqlParser.ruleNames.toList()))
}

Output:

(query SELECT (columns *) FROM (table users) <EOF>)

How It Works

Diagram 1

CharStream: Wraps input string
Lexer: Converts characters to tokens
TokenStream: Holds tokens
Parser: Builds parse tree from tokens
Parse Tree: Structure representing the query

Each step transforms data:

"SELECT * FROM users" → CharStream
CharStream → [SELECT, STAR, FROM, IDENTIFIER("users"), EOF]
Tokens → Parse tree with query root node

Visualizing the Parse Tree

Diagram 2

The parser creates this tree automatically from the grammar rules.

Testing

Create src/test/kotlin/SqlParserTest.kt:

import org.junit.jupiter.api.Test
import org.junit.jupiter.api.assertDoesNotThrow
import kotlin.test.assertContains

class SqlParserTest {

    @Test
    fun `parse SELECT star`() {
        val sql = "SELECT * FROM users"
        val tree = parseQuery(sql)
        val treeStr = tree.toStringTree(SimpleSqlParser.ruleNames.toList())

        assertContains(treeStr, "SELECT")
        assertContains(treeStr, "users")
    }

    @Test
    fun `parse SELECT columns`() {
        val sql = "SELECT name, email FROM customers"
        assertDoesNotThrow {
            parseQuery(sql)
        }
    }

    @Test
    fun `case insensitive keywords`() {
        val sqls = listOf(
            "SELECT * FROM users",
            "select * from users",
            "SeLeCt * FrOm users"
        )

        sqls.forEach { sql ->
            assertDoesNotThrow("Failed: $sql") {
                parseQuery(sql)
            }
        }
    }
}

Run tests:

./gradlew test

All three tests pass. The grammar handles case-insensitive keywords and both * and column lists.

Common Pitfalls

Missing EOF: Always include EOF in your top-level rule.

query : SELECT columns FROM table ;  // Wrong - doesn't validate entire input
query : SELECT columns FROM table EOF ;  // Correct

Without EOF, the parser stops after matching the first valid query, ignoring remaining input. SELECT * FROM users garbage would parse successfully.

Lexer rule order matters: ANTLR tries rules top to bottom.

// Wrong - IDENTIFIER matches before keywords
IDENTIFIER : [a-zA-Z]+ ;
SELECT : 'SELECT' ;

// Correct - keywords before IDENTIFIER
SELECT : 'SELECT' ;
IDENTIFIER : [a-zA-Z]+ ;

If IDENTIFIER comes first, “SELECT” becomes an IDENTIFIER token, not SELECT. Keywords must come before generic patterns.

Left recursion in lexer: Lexer rules cannot be recursive.

NUMBER : NUMBER DIGIT ;  // ERROR - lexer rules can't recurse

Only parser rules support recursion. Use repetition operators in lexer rules:

NUMBER : DIGIT+ ;

What’s Next

Part 2 adds WHERE clauses with comparison operators, AND/OR logic, and expressions. We’ll extend the grammar to parse:

SELECT name, email FROM users WHERE age > 18 AND status = 'active'

You’ll learn operator precedence, expression parsing, and testing complex grammars.