Lexer Test Specifications

Conformance Level: These tests cover Levels 1-3 of the conformance requirements for tokenization.

Overview

The lexer (tokenizer) is responsible for converting a character stream into a token stream. These tests validate that the lexer correctly identifies and categorizes all TON language elements.

Test Cases

LEXER_001: Basic Structure Tokenization

Purpose: Verify tokenization of basic object structure

Input:

{ }

Expected Tokens:

  • Token[0]: Type=LeftBrace, Value="{"
  • Token[1]: Type=RightBrace, Value="}"

Validation: Exactly 2 tokens, correct types and values

LEXER_002: Object With Class Name

Purpose: Verify tokenization of object with class annotation

Input:

{(person)}

Expected Tokens:

  • Token[0]: Type=LeftBrace, Value="{"
  • Token[1]: Type=LeftParen, Value="("
  • Token[2]: Type=Identifier, Value="person"
  • Token[3]: Type=RightParen, Value=")"
  • Token[4]: Type=RightBrace, Value="}"

LEXER_003: Property Assignment

Purpose: Verify tokenization of property assignments

Input:

name = 'John', age = 30

Expected Tokens:

  • Token[0]: Type=Identifier, Value="name"
  • Token[1]: Type=Equals, Value="="
  • Token[2]: Type=String, Value="John"
  • Token[3]: Type=Comma, Value=","
  • Token[4]: Type=Identifier, Value="age"
  • Token[5]: Type=Equals, Value="="
  • Token[6]: Type=Number, Value="30"
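The token classes in LEXER_003 can be validated with a regex-driven scanner. The patterns below are assumptions (the TON grammar defines the real ones); the token names match the expected output above.

```python
import re

# One named group per token class; "ws" is matched but discarded.
TOKEN_RE = re.compile(r"""
    (?P<Identifier>[A-Za-z_]\w*)
  | (?P<Number>-?\d+(?:\.\d+)?)
  | (?P<String>'(?:\\.|[^'\\])*')
  | (?P<Equals>=)
  | (?P<Comma>,)
  | (?P<ws>\s+)
""", re.VERBOSE)

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise ValueError(f"invalid character at offset {pos}")
        if m.lastgroup != "ws":
            value = m.group()
            if m.lastgroup == "String":
                value = value[1:-1]  # unquoted value (escapes not decoded here)
            tokens.append((m.lastgroup, value))
        pos = m.end()
    return tokens

assert tokenize("name = 'John', age = 30") == [
    ("Identifier", "name"), ("Equals", "="), ("String", "John"),
    ("Comma", ","), ("Identifier", "age"), ("Equals", "="),
    ("Number", "30"),
]
```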

LEXER_004: String Literals

Purpose: Verify tokenization of different string quote styles

Test Cases:

Input           Expected Token Value
'Hello World'   Hello World
"Hello World"   Hello World
`Hello World`   Hello World

Validation: All produce Type=String with unquoted value

LEXER_005: Escape Sequences

Purpose: Verify proper handling of escape sequences in strings

Input:

'Line 1\nLine 2\t\''

Expected Token:

  • Type=String
  • Value="Line 1[newline]Line 2[tab]'"

Note: [newline] and [tab] represent actual control characters
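The escape processing required by LEXER_005 can be sketched as a small decoder. The `ESCAPES` table here covers only the sequences shown in this document; the real TON escape set may be larger.

```python
# Escape sequences assumed from the examples in this spec.
ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "\\": "\\",
           "'": "'", '"': '"', "`": "`"}

def decode_escapes(raw: str) -> str:
    out, i = [], 0
    while i < len(raw):
        ch = raw[i]
        if ch == "\\":
            # An invalid escape is a lexer error (see Error Cases).
            if i + 1 >= len(raw) or raw[i + 1] not in ESCAPES:
                raise ValueError(f"invalid escape sequence at index {i}")
            out.append(ESCAPES[raw[i + 1]])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)

# LEXER_005: raw body between the quotes, escapes processed
assert decode_escapes(r"Line 1\nLine 2\t\'") == "Line 1\nLine 2\t'"
```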

LEXER_006: Numeric Literals

Purpose: Verify tokenization of various number formats

Test Cases:

Input     Token Type   Token Value
123       Number       123
-456      Number       -456
3.14      Number       3.14
1.23e10   Number       1.23e10
0xFF      Number       0xFF
0b1010    Number       0b1010
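One possible regex for the number formats in LEXER_006 is shown below; it is an assumption for illustration, and the TON grammar may allow more (or fewer) forms.

```python
import re

NUMBER = re.compile(r"""
    -?(?:
        0[xX][0-9a-fA-F]+                # hexadecimal
      | 0[bB][01]+                       # binary
      | \d+(?:\.\d+)?(?:[eE][+-]?\d+)?   # integer / float / exponent
    )
""", re.VERBOSE)

for text in ["123", "-456", "3.14", "1.23e10", "0xFF", "0b1010"]:
    assert NUMBER.fullmatch(text), text

# Malformed prefixes from the Error Cases section must not match.
for bad in ["0x", "0b"]:
    assert not NUMBER.fullmatch(bad), bad
```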

LEXER_007: Keywords

Purpose: Verify recognition of reserved keywords

Test Cases:

Input       Token Type
true        Boolean
false       Boolean
null        Null
undefined   Undefined

LEXER_008: GUID Tokenization

Purpose: Verify GUID recognition with and without braces

Test Cases:

  • Input: 550e8400-e29b-41d4-a716-446655440000
    Expected: Type=Guid, Value=550e8400-e29b-41d4-a716-446655440000
  • Input: {550e8400-e29b-41d4-a716-446655440000}
    Expected: Type=Guid, Value={550e8400-e29b-41d4-a716-446655440000}
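GUID recognition for LEXER_008 can be sketched with the standard 8-4-4-4-12 hex-digit shape, with optional braces that must be balanced. The helper name and the balanced-brace rule are assumptions consistent with the two cases above.

```python
import re

HEX = "[0-9a-fA-F]"
GUID = re.compile(
    rf"\{{?{HEX}{{8}}-{HEX}{{4}}-{HEX}{{4}}-{HEX}{{4}}-{HEX}{{12}}\}}?"
)

def match_guid(text):
    m = GUID.fullmatch(text)
    # Reject mismatched braces like "{...": both or neither.
    if m and text.startswith("{") != text.endswith("}"):
        return None
    return m

assert match_guid("550e8400-e29b-41d4-a716-446655440000")
assert match_guid("{550e8400-e29b-41d4-a716-446655440000}")
assert not match_guid("{550e8400-e29b-41d4-a716-446655440000")
```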

LEXER_009: Enum Tokenization

Purpose: Verify enum and enum set tokenization

Test Cases:

Input          Token Value    Description
|active|       |active|       Single enum value
|0|            |0|            Numeric enum
|read|write|   |read|write|   Enum set
|0|2|4|        |0|2|4|        Numeric enum set
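A possible pattern for the enum tokens in LEXER_009: one or more pipe-delimited members, where a member is a name or a number. The pattern is deliberately loose and is an assumption, not the normative grammar.

```python
import re

# One or more members, each terminated by a pipe: |a|, |a|b|, |0|2|4|
ENUM = re.compile(r"\|(?:\w+\|)+")

for text in ["|active|", "|0|", "|read|write|", "|0|2|4|"]:
    assert ENUM.fullmatch(text), text

# An empty enum token should not match.
assert not ENUM.fullmatch("||")
```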

LEXER_010: Special Prefixes

Purpose: Verify tokenization of special prefix characters

Test Cases:

Input       Tokens                               Description
@name       AtSign, Identifier("name")           At-prefixed property
$value      StringHint, Identifier("value")      String type hint
%value      NumberHint, Identifier("value")      Number type hint
&value      GuidHint, Identifier("value")        GUID type hint
#@ header   HeaderPrefix, Identifier("header")   Header section
#! schema   SchemaPrefix, Identifier("schema")   Schema section

LEXER_011: Comment Handling

Purpose: Verify that comments are properly skipped

Test Case 1: Single-line comment

// This is a comment
name = 'value'

Expected: First token should be Identifier("name"), no comment tokens

Test Case 2: Multi-line comment

/* This is a
multi-line comment */
name = 'value'

Expected: First token should be Identifier("name"), no comment tokens
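Comment skipping can be illustrated by stripping comments while leaving string literals intact. This is a standalone sketch; a real lexer interleaves comment skipping with its main scan loop rather than preprocessing the source, and the regex below is an assumption.

```python
import re

# Group 1 captures string literals so they survive substitution;
# line and block comments match outside that group and are dropped.
COMMENT_OR_STRING = re.compile(
    r"""('(?:\\.|[^'\\])*'      # single-quoted string (kept)
      |  "(?:\\.|[^"\\])*")     # double-quoted string (kept)
      |  //[^\n]*               # single-line comment (dropped)
      |  /\*.*?\*/              # multi-line comment (dropped)
    """, re.DOTALL | re.VERBOSE)

def strip_comments(source: str) -> str:
    return COMMENT_OR_STRING.sub(lambda m: m.group(1) or "", source)

assert strip_comments("// This is a comment\nname = 'value'") == "\nname = 'value'"
assert strip_comments("/* multi\nline */name = 'value'") == "name = 'value'"
assert strip_comments("s = 'not // a comment'") == "s = 'not // a comment'"
```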

LEXER_012: Position Tracking

Purpose: Verify line and column number tracking

Input:

name = 'value'
age = 30

Expected Positions:

  • Token "name": Line=1, Column=1
  • Token "age": Line=2, Column=1
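Position tracking can be checked by converting a token's start offset to a 1-based line and column. This sketch assumes line endings have already been normalized to \n (per the Implementation Notes) before offsets are recorded.

```python
def position_at(source: str, offset: int) -> tuple[int, int]:
    """1-based (line, column) of the character at `offset`."""
    before = source[:offset]
    line = before.count("\n") + 1
    col = offset - (before.rfind("\n") + 1) + 1
    return line, col

src = "name = 'value'\nage = 30"
assert position_at(src, src.index("name")) == (1, 1)
assert position_at(src, src.index("age")) == (2, 1)
```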

LEXER_013: Multi-line String Literals

Purpose: Verify tokenization of triple-quoted strings

Input:

"""
Line 1
Line 2
"""

Expected Token:

  • Type=MultiLineString
  • Value="Line 1\nLine 2"
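One way to derive the expected value in LEXER_013: drop the triple-quote delimiters, then strip a single leading and trailing newline so the quotes can sit on their own lines. The trimming rule is an assumption inferred from the expected token above.

```python
def multiline_value(token_text: str) -> str:
    assert token_text.startswith('"""') and token_text.endswith('"""')
    body = token_text[3:-3]
    if body.startswith("\n"):   # quote on its own opening line
        body = body[1:]
    if body.endswith("\n"):     # quote on its own closing line
        body = body[:-1]
    return body

assert multiline_value('"""\nLine 1\nLine 2\n"""') == "Line 1\nLine 2"
```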

Implementation Notes

  • Whitespace between tokens should be ignored except within strings
  • Line endings can be \n, \r\n, or \r
  • Token position should reflect the starting position of the token
  • Escape sequences must be processed during tokenization
  • Comments should be completely skipped, not returned as tokens

Error Cases

The following should produce lexer errors:

  • Unterminated strings
  • Invalid escape sequences
  • Malformed numbers (e.g., 0x, 0b without digits)
  • Invalid characters outside of strings
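The unterminated-string case can be sketched as a scan from an opening quote that requires a matching unescaped closer before end of input. The function name and error type are illustrative.

```python
def scan_string(source: str, start: int) -> int:
    """Return the index just past the closing quote, or raise."""
    quote = source[start]
    i = start + 1
    while i < len(source):
        if source[i] == "\\":
            i += 2              # skip the escaped character
        elif source[i] == quote:
            return i + 1
        else:
            i += 1
    raise SyntaxError("unterminated string literal")

assert scan_string("'abc' rest", 0) == 5
try:
    scan_string("'abc", 0)
except SyntaxError:
    pass
else:
    raise AssertionError("expected SyntaxError")
```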