ANTLR with Kotlin - Part 1: Basics and Project Setup
ANTLR generates parsers from grammar files: you write the grammar, and ANTLR produces the parser code. We’ll build a SQL parser that validates queries and extracts metadata.
What You’ll Build
A parser that reads this:
SELECT name, email FROM users WHERE age > 18
And produces this:
QueryMetadata(
    tables = listOf("users"),
    columns = listOf("name", "email"),
    conditions = listOf(Condition("age", ">", 18))
)
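The output above implies two small data holders. Their definitions are not shown in this part, so the following is only a guess at their shape: the property names come from the snippet, while the types (in particular `value: Any`) and the default for `conditions` are assumptions.

```kotlin
// Hypothetical shapes inferred from the sample output; not the series' actual definitions.
data class Condition(
    val column: String,   // left-hand side, e.g. "age"
    val operator: String, // comparison operator, e.g. ">"
    val value: Any        // assumed: the literal on the right-hand side
)

data class QueryMetadata(
    val tables: List<String>,
    val columns: List<String>,
    val conditions: List<Condition> = emptyList() // assumed default for queries without WHERE
)

fun main() {
    val metadata = QueryMetadata(
        tables = listOf("users"),
        columns = listOf("name", "email"),
        conditions = listOf(Condition("age", ">", 18))
    )
    println(metadata)
}
```

Data classes fit here because they provide structural equality and a readable `toString()`, which keeps test assertions short.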
By Part 6, you’ll parse complex queries with JOINs, subqueries, and aggregates. Part 1 covers the basics: SELECT * FROM users.
ANTLR Concepts
ANTLR (ANother Tool for Language Recognition) is a parser generator. You define a grammar, and ANTLR generates code that parses text matching that grammar. With the setup in this part, the generated code is Java, which we call from Kotlin.
Lexer vs Parser
Lexer: Converts text into tokens.
Input: SELECT name FROM users
Tokens: [SELECT, name, FROM, users]
Parser: Builds a tree from tokens.
Tokens: [SELECT, name, FROM, users]
Tree:
selectStatement
├─ SELECT
├─ columnList
│  └─ name
├─ FROM
└─ tableReference
   └─ users
The lexer handles characters. The parser handles structure.
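To make the lexer's job concrete, here is a hand-written sketch of the character-to-token step. It is an illustration only, not what ANTLR generates: the `tokenize` function and its single regular expression are assumptions made for this example, and the tokens are plain strings rather than typed token objects.

```kotlin
// Illustration only: splits input into identifier/keyword, '*' and ',' tokens.
// Whitespace (and anything else that does not match) is silently skipped.
private val TOKEN = Regex("""[A-Za-z_][A-Za-z0-9_]*|[*,]""")

fun tokenize(input: String): List<String> =
    TOKEN.findAll(input).map { it.value }.toList()

fun main() {
    println(tokenize("SELECT name FROM users")) // [SELECT, name, FROM, users]
}
```

A real lexer also records each token's type and position, which is what lets the parser reason about structure instead of raw text.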
Grammar Files
ANTLR grammars use .g4 files. Two types:
- Lexer grammar - defines tokens (keywords, identifiers, operators)
- Parser grammar - defines syntax rules (statements, expressions)
Combined grammars include both in one file.
Project Setup
Create a Kotlin project with Gradle. ANTLR generates code during build.
build.gradle.kts
plugins {
    kotlin("jvm") version "1.9.21"
    id("antlr")
}

repositories {
    mavenCentral()
}

dependencies {
    antlr("org.antlr:antlr4:4.13.1")
    implementation("org.antlr:antlr4-runtime:4.13.1")
    implementation("com.strumenta:antlr-kotlin-runtime:1.0.0-RC1")
    testImplementation(kotlin("test"))
}

tasks.generateGrammarSource {
    maxHeapSize = "64m"
    arguments = arguments + listOf("-visitor", "-long-messages")
    outputDirectory = file("build/generated-src/antlr/main")
}

tasks.compileKotlin {
    dependsOn(tasks.generateGrammarSource)
}

// Run tests on the JUnit Platform; the tests in this part use JUnit 5 (Jupiter) APIs
tasks.test {
    useJUnitPlatform()
}

kotlin {
    sourceSets["main"].kotlin.srcDir("build/generated-src/antlr/main")
}
How it works:
- The antlr plugin generates the parser from .g4 files
- The -visitor flag creates visitor classes (we’ll use them in Part 3)
- Generated code goes to build/generated-src/antlr/main
- Kotlin compilation waits for ANTLR generation
Project Structure
src/
  main/
    antlr/
      SimpleSql.g4       # Grammar definition
    kotlin/
      SqlParser.kt       # Our Kotlin code
  test/
    kotlin/
      SqlParserTest.kt   # Tests
Grammar files go in src/main/antlr/. The Gradle ANTLR plugin finds them there automatically.
First Grammar: Simple SELECT
Create src/main/antlr/SimpleSql.g4:
grammar SimpleSql;

// Parser Rules
query
    : SELECT columns FROM table EOF
    ;

columns
    : STAR
    | columnList
    ;

columnList
    : IDENTIFIER (',' IDENTIFIER)*
    ;

table
    : IDENTIFIER
    ;

// Lexer Rules
SELECT     : [Ss][Ee][Ll][Ee][Cc][Tt] ;
FROM       : [Ff][Rr][Oo][Mm] ;
STAR       : '*' ;
IDENTIFIER : [a-zA-Z_][a-zA-Z0-9_]* ;
COMMA      : ',' ;
WS         : [ \t\r\n]+ -> skip ;
Grammar Breakdown
Parser rules (lowercase): Define syntax structure.
query
    : SELECT columns FROM table EOF
    ;
A query is: SELECT keyword, columns, FROM keyword, table name, end of file.
columns
    : STAR
    | columnList
    ;
Columns can be * or a list of identifiers. The | means “or”.
columnList
    : IDENTIFIER (',' IDENTIFIER)*
    ;
A column list is one identifier, optionally followed by comma + identifier (repeated). Matches “name”, “name, email”, and “name, email, age”.
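The repetition in columnList corresponds to a simple loop. The sketch below is a hypothetical hand-written equivalent, working on a list of string tokens, to show what `IDENTIFIER (',' IDENTIFIER)*` accepts. The function name `parseColumnList` is an assumption for this example, and ANTLR's generated parser is structured differently.

```kotlin
// Hypothetical hand-written equivalent of: columnList : IDENTIFIER (',' IDENTIFIER)* ;
// Tokens are plain strings here; ANTLR's generated parser works on Token objects instead.
private val IDENTIFIER = Regex("[a-zA-Z_][a-zA-Z0-9_]*")

fun parseColumnList(tokens: List<String>): List<String> {
    var pos = 0

    // Consumes one token and checks that it looks like an identifier.
    fun identifier(): String {
        val tok = tokens.getOrNull(pos)
        require(tok != null && IDENTIFIER.matches(tok)) { "expected identifier at token $pos" }
        pos++
        return tok
    }

    val columns = mutableListOf(identifier())  // the mandatory first IDENTIFIER
    while (tokens.getOrNull(pos) == ",") {      // ( ',' IDENTIFIER )*
        pos++                                   // consume the comma
        columns += identifier()
    }
    require(pos == tokens.size) { "unexpected token at $pos" }
    return columns
}

fun main() {
    println(parseColumnList(listOf("name", ",", "email", ",", "age"))) // [name, email, age]
}
```

A trailing comma fails because the loop body insists on an identifier after every comma, which mirrors the grammar rule.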
Lexer rules (UPPERCASE): Define tokens.
SELECT : [Ss][Ee][Ll][Ee][Cc][Tt] ;
The SELECT rule matches case-insensitively (SELECT, select, and SeLeCt all work).
IDENTIFIER : [a-zA-Z_][a-zA-Z0-9_]* ;
Identifiers start with a letter or underscore, followed by any number of letters, digits, or underscores. Matches users, user_id, firstName.
WS : [ \t\r\n]+ -> skip ;
Whitespace is recognized but skipped (not part of parse tree).
Using the Parser
Build the project to generate parser code:
./gradlew generateGrammarSource
ANTLR creates these files:
- SimpleSqlLexer.java - tokenizes input
- SimpleSqlParser.java - builds parse tree
- SimpleSqlVisitor.java - interface for tree traversal
- SimpleSqlBaseVisitor.java - base implementation
Parse a Query
Create src/main/kotlin/SqlParser.kt:
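import org.antlr.v4.runtime.*
import org.antlr.v4.runtime.tree.*

fun parseQuery(sql: String): ParseTree {
    val input = CharStreams.fromString(sql)  // wrap the raw string
    val lexer = SimpleSqlLexer(input)        // characters -> tokens
    val tokens = CommonTokenStream(lexer)    // buffers tokens for the parser
    val parser = SimpleSqlParser(tokens)
    return parser.query()                    // invoke the start rule
}

fun main() {
    val sql = "SELECT * FROM users"
    val tree = parseQuery(sql)
    // toStringTree(List<String>) is declared on RuleContext, not on ParseTree,
    // so use the Trees helper, which accepts any tree node.
    println(Trees.toStringTree(tree, SimpleSqlParser.ruleNames.toList()))
}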
Output:
(query SELECT (columns *) FROM (table users) <EOF>)
How It Works
- CharStream: Wraps input string
- Lexer: Converts characters to tokens
- TokenStream: Holds tokens
- Parser: Builds parse tree from tokens
- Parse Tree: Structure representing the query
Each step transforms data:
- "SELECT * FROM users" → CharStream
- CharStream → [SELECT, STAR, FROM, IDENTIFIER("users"), EOF]
- Tokens → Parse tree with query root node
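To inspect the token step of this pipeline directly, a small helper can run only the lexer and list what it produced. This is a sketch that assumes the generated SimpleSqlLexer class is on the classpath; `dumpTokens` is a name invented for this example.

```kotlin
import org.antlr.v4.runtime.CharStreams
import org.antlr.v4.runtime.CommonTokenStream

// Assumes SimpleSqlLexer has been generated from SimpleSql.g4.
fun dumpTokens(sql: String): List<String> {
    val lexer = SimpleSqlLexer(CharStreams.fromString(sql))
    val stream = CommonTokenStream(lexer)
    stream.fill() // run the lexer until end of input
    return stream.getTokens().map { token ->
        // Map the numeric token type back to the rule name from the grammar
        val name = SimpleSqlLexer.VOCABULARY.getSymbolicName(token.type)
        "$name(${token.text})"
    }
}

fun main() {
    dumpTokens("SELECT * FROM users").forEach(::println)
}
```

Whitespace does not appear in the result because the WS rule skips it, and the final entry should be the EOF token that the query rule expects.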
Visualizing the Parse Tree
The parser creates this tree automatically from the grammar rules.
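One way to look at the tree without extra tooling is to print it with indentation. The helper below is a sketch: `renderTree` is a hypothetical name, and it relies on the `parseQuery` function from this part plus the runtime's `Trees.getNodeText` helper.

```kotlin
import org.antlr.v4.runtime.tree.ParseTree
import org.antlr.v4.runtime.tree.Trees

// Renders a parse tree as indented text, one node per line.
// ruleNames would typically be SimpleSqlParser.ruleNames.toList().
fun renderTree(node: ParseTree, ruleNames: List<String>, depth: Int = 0): String {
    // Rule nodes show their rule name; token nodes show the matched text.
    val line = "  ".repeat(depth) + Trees.getNodeText(node, ruleNames)
    val children = (0 until node.childCount).map { i ->
        renderTree(node.getChild(i), ruleNames, depth + 1)
    }
    return (listOf(line) + children).joinToString("\n")
}

fun main() {
    val tree = parseQuery("SELECT name, email FROM users")
    println(renderTree(tree, SimpleSqlParser.ruleNames.toList()))
}
```

The first line is the query root, with each child rule or token indented one level beneath its parent.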
Testing
Create src/test/kotlin/SqlParserTest.kt:
import org.antlr.v4.runtime.tree.Trees
import org.junit.jupiter.api.Test
import org.junit.jupiter.api.assertDoesNotThrow
import kotlin.test.assertContains

class SqlParserTest {
    @Test
    fun `parse SELECT star`() {
        val sql = "SELECT * FROM users"
        val tree = parseQuery(sql)
        // Trees.toStringTree accepts any tree node together with the rule names
        val treeStr = Trees.toStringTree(tree, SimpleSqlParser.ruleNames.toList())
        assertContains(treeStr, "SELECT")
        assertContains(treeStr, "users")
    }

    @Test
    fun `parse SELECT columns`() {
        val sql = "SELECT name, email FROM customers"
        assertDoesNotThrow {
            parseQuery(sql)
        }
    }

    @Test
    fun `case insensitive keywords`() {
        val sqls = listOf(
            "SELECT * FROM users",
            "select * from users",
            "SeLeCt * FrOm users"
        )
        sqls.forEach { sql ->
            assertDoesNotThrow("Failed: $sql") {
                parseQuery(sql)
            }
        }
    }
}
Run tests:
./gradlew test
All three tests pass. The grammar handles case-insensitive keywords and both * and column lists.
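One caveat: by default, ANTLR reports syntax errors to stderr and tries to recover instead of throwing, so assertDoesNotThrow also passes for invalid input. If tests should fail on bad SQL, one common approach is to replace the default error listeners with one that throws. The sketch below assumes the generated SimpleSqlLexer and SimpleSqlParser classes; `SqlSyntaxException`, `ThrowingErrorListener`, and `parseQueryStrict` are names invented for this example.

```kotlin
import org.antlr.v4.runtime.BaseErrorListener
import org.antlr.v4.runtime.CharStreams
import org.antlr.v4.runtime.CommonTokenStream
import org.antlr.v4.runtime.RecognitionException
import org.antlr.v4.runtime.Recognizer
import org.antlr.v4.runtime.tree.ParseTree

// Hypothetical exception type for this sketch.
class SqlSyntaxException(message: String) : RuntimeException(message)

// Turns every reported syntax error into an exception instead of a stderr message.
object ThrowingErrorListener : BaseErrorListener() {
    override fun syntaxError(
        recognizer: Recognizer<*, *>?,
        offendingSymbol: Any?,
        line: Int,
        charPositionInLine: Int,
        msg: String?,
        e: RecognitionException?
    ) {
        throw SqlSyntaxException("line $line:$charPositionInLine $msg")
    }
}

fun parseQueryStrict(sql: String): ParseTree {
    val lexer = SimpleSqlLexer(CharStreams.fromString(sql))
    lexer.removeErrorListeners()                 // drop the default console listener
    lexer.addErrorListener(ThrowingErrorListener)

    val parser = SimpleSqlParser(CommonTokenStream(lexer))
    parser.removeErrorListeners()
    parser.addErrorListener(ThrowingErrorListener)
    return parser.query()
}
```

With this variant, a test can wrap `parseQueryStrict("SELECT FROM")` in `assertFailsWith<SqlSyntaxException>` from kotlin.test to check that invalid queries are rejected.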
Common Pitfalls
Missing EOF: Always include EOF in your top-level rule.
query : SELECT columns FROM table ; // Wrong - doesn't validate entire input
query : SELECT columns FROM table EOF ; // Correct
Without EOF, the parser stops after matching the first valid query, ignoring remaining input. For example, SELECT * FROM users garbage would parse successfully.
Lexer rule order matters: the lexer picks the rule that matches the longest text, and when several rules match the same text, the one defined first wins.
// Wrong - IDENTIFIER matches before keywords
IDENTIFIER : [a-zA-Z]+ ;
SELECT : 'SELECT' ;
// Correct - keywords before IDENTIFIER
SELECT : 'SELECT' ;
IDENTIFIER : [a-zA-Z]+ ;
If IDENTIFIER comes first, “SELECT” becomes an IDENTIFIER token, not SELECT. Keywords must come before generic patterns.
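The selection behavior can be modelled in a few lines: the longest match wins, and a tie goes to the rule listed first. This is a simplified illustration using regular expressions, not ANTLR's actual implementation; `LexRule` and `pickRule` are names invented for this example.

```kotlin
// Simplified model of how a lexer chooses between rules at the start of the input.
// Illustration only; this is not ANTLR's implementation.
data class LexRule(val name: String, val pattern: Regex)

fun pickRule(rules: List<LexRule>, input: String): String? {
    var bestName: String? = null
    var bestLength = 0
    for (rule in rules) {
        // Only consider a match that starts at position 0
        val match = rule.pattern.find(input)?.takeIf { it.range.first == 0 }
        val length = match?.value?.length ?: 0
        if (length > bestLength) {  // strict '>' keeps the earlier rule on ties
            bestName = rule.name
            bestLength = length
        }
    }
    return bestName
}

fun main() {
    val keyword = LexRule("SELECT", Regex("SELECT"))
    val identifier = LexRule("IDENTIFIER", Regex("[a-zA-Z]+"))
    println(pickRule(listOf(identifier, keyword), "SELECT"))   // IDENTIFIER (tie, first rule wins)
    println(pickRule(listOf(keyword, identifier), "SELECT"))   // SELECT (tie, first rule wins)
    println(pickRule(listOf(keyword, identifier), "SELECTED")) // IDENTIFIER (longer match wins)
}
```

The last case shows why putting keywords first is safe: a longer identifier such as SELECTED still becomes an IDENTIFIER, because the longest match takes priority over rule order.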
Left recursion in lexer: Lexer rules cannot be left-recursive.
NUMBER : NUMBER DIGIT ; // ERROR - lexer rules can't be left-recursive
Left recursion is only supported in parser rules. Use repetition operators in lexer rules:
NUMBER : DIGIT+ ;
fragment DIGIT : [0-9] ;
What’s Next
Part 2 adds WHERE clauses with comparison operators, AND/OR logic, and expressions. We’ll extend the grammar to parse:
SELECT name, email FROM users WHERE age > 18 AND status = 'active'
You’ll learn operator precedence, expression parsing, and testing complex grammars.