Lexer Test Specifications

Conformance Level: These tests cover Levels 1-3 of the conformance requirements for tokenization.

Overview

The lexer (tokenizer) is responsible for converting a character stream into a token stream. These tests validate that the lexer correctly identifies and categorizes all TON language elements.

Test Cases

LEXER_001: Basic Structure Tokenization

Purpose: Verify tokenization of basic object structure

Input:

{ }

Expected Tokens:

  • Token[0]: Type=LeftBrace, Value="{"
  • Token[1]: Type=RightBrace, Value="}"

Validation: Exactly 2 tokens, correct types and values

LEXER_002: Object With Class Name

Purpose: Verify tokenization of object with class annotation

Input:

{(person)}

Expected Tokens:

  • Token[0]: Type=LeftBrace, Value="{"
  • Token[1]: Type=LeftParen, Value="("
  • Token[2]: Type=Identifier, Value="person"
  • Token[3]: Type=RightParen, Value=")"
  • Token[4]: Type=RightBrace, Value="}"

LEXER_003: Property Assignment

Purpose: Verify tokenization of property assignments

Input:

name = 'John', age = 30

Expected Tokens:

  • Token[0]: Type=Identifier, Value="name"
  • Token[1]: Type=Equals, Value="="
  • Token[2]: Type=String, Value="John"
  • Token[3]: Type=Comma, Value=","
  • Token[4]: Type=Identifier, Value="age"
  • Token[5]: Type=Equals, Value="="
  • Token[6]: Type=Number, Value="30"
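The token classes in LEXER_003 can be validated with a regex-driven scanner. The patterns below are assumptions (the TON grammar defines the real ones); the token names match the expected output above.

```python
import re

# One named group per token class; "ws" is matched but discarded.
TOKEN_RE = re.compile(r"""
    (?P<Identifier>[A-Za-z_]\w*)
  | (?P<Number>-?\d+(?:\.\d+)?)
  | (?P<String>'(?:\\.|[^'\\])*')
  | (?P<Equals>=)
  | (?P<Comma>,)
  | (?P<ws>\s+)
""", re.VERBOSE)

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise ValueError(f"invalid character at offset {pos}")
        if m.lastgroup != "ws":
            value = m.group()
            if m.lastgroup == "String":
                value = value[1:-1]  # unquoted value (escapes not decoded here)
            tokens.append((m.lastgroup, value))
        pos = m.end()
    return tokens

assert tokenize("name = 'John', age = 30") == [
    ("Identifier", "name"), ("Equals", "="), ("String", "John"),
    ("Comma", ","), ("Identifier", "age"), ("Equals", "="),
    ("Number", "30"),
]
```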

LEXER_004: String Literals

Purpose: Verify tokenization of different string quote styles

Test Cases:

Input           Expected Token Value
'Hello World'   Hello World
"Hello World"   Hello World
`Hello World`   Hello World

Validation: All produce Type=String with unquoted value

LEXER_005: Escape Sequences

Purpose: Verify proper handling of escape sequences in strings

Input:

'Line 1\nLine 2\t\''

Expected Token:

  • Type=String
  • Value="Line 1[newline]Line 2[tab]'"

Note: [newline] and [tab] represent actual control characters
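The escape processing required by LEXER_005 can be sketched as a small decoder. The `ESCAPES` table here covers only the sequences shown in this document; the real TON escape set may be larger.

```python
# Escape sequences assumed from the examples in this spec.
ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "\\": "\\",
           "'": "'", '"': '"', "`": "`"}

def decode_escapes(raw: str) -> str:
    out, i = [], 0
    while i < len(raw):
        ch = raw[i]
        if ch == "\\":
            # An invalid escape is a lexer error (see Error Cases).
            if i + 1 >= len(raw) or raw[i + 1] not in ESCAPES:
                raise ValueError(f"invalid escape sequence at index {i}")
            out.append(ESCAPES[raw[i + 1]])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)

# LEXER_005: raw body between the quotes, escapes processed
assert decode_escapes(r"Line 1\nLine 2\t\'") == "Line 1\nLine 2\t'"
```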

LEXER_006: Numeric Literals

Purpose: Verify tokenization of various number formats

Test Cases:

Input     Token Type   Token Value
123       Number       123
-456      Number       -456
3.14      Number       3.14
1.23e10   Number       1.23e10
0xFF      Number       0xFF
0b1010    Number       0b1010
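One possible regex for the number formats in LEXER_006 is shown below; it is an assumption for illustration, and the TON grammar may allow more (or fewer) forms.

```python
import re

NUMBER = re.compile(r"""
    -?(?:
        0[xX][0-9a-fA-F]+                # hexadecimal
      | 0[bB][01]+                       # binary
      | \d+(?:\.\d+)?(?:[eE][+-]?\d+)?   # integer / float / exponent
    )
""", re.VERBOSE)

for text in ["123", "-456", "3.14", "1.23e10", "0xFF", "0b1010"]:
    assert NUMBER.fullmatch(text), text

# Malformed prefixes from the Error Cases section must not match.
for bad in ["0x", "0b"]:
    assert not NUMBER.fullmatch(bad), bad
```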

LEXER_007: Keywords

Purpose: Verify recognition of reserved keywords

Test Cases:

Input       Token Type
true        Boolean
false       Boolean
null        Null
undefined   Undefined

LEXER_008: GUID Tokenization

Purpose: Verify GUID recognition with and without braces

Test Cases:

  • Input: 550e8400-e29b-41d4-a716-446655440000
    Expected: Type=Guid, Value=550e8400-e29b-41d4-a716-446655440000
  • Input: {550e8400-e29b-41d4-a716-446655440000}
    Expected: Type=Guid, Value={550e8400-e29b-41d4-a716-446655440000}
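GUID recognition for LEXER_008 can be sketched with the standard 8-4-4-4-12 hex-digit shape, with optional braces that must be balanced. The helper name and the balanced-brace rule are assumptions consistent with the two cases above.

```python
import re

HEX = "[0-9a-fA-F]"
GUID = re.compile(
    rf"\{{?{HEX}{{8}}-{HEX}{{4}}-{HEX}{{4}}-{HEX}{{4}}-{HEX}{{12}}\}}?"
)

def match_guid(text):
    m = GUID.fullmatch(text)
    # Reject mismatched braces like "{...": both or neither.
    if m and text.startswith("{") != text.endswith("}"):
        return None
    return m

assert match_guid("550e8400-e29b-41d4-a716-446655440000")
assert match_guid("{550e8400-e29b-41d4-a716-446655440000}")
assert not match_guid("{550e8400-e29b-41d4-a716-446655440000")
```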

LEXER_009: Enum Tokenization

Purpose: Verify enum and enum set tokenization

Test Cases:

Input          Token Value    Description
|active|       |active|       Single enum value
|0|            |0|            Numeric enum
|read|write|   |read|write|   Enum set
|0|2|4|        |0|2|4|        Numeric enum set
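A possible pattern for the enum tokens in LEXER_009: one or more pipe-delimited members, where a member is a name or a number. The pattern is deliberately loose and is an assumption, not the normative grammar.

```python
import re

# One or more members, each terminated by a pipe: |a|, |a|b|, |0|2|4|
ENUM = re.compile(r"\|(?:\w+\|)+")

for text in ["|active|", "|0|", "|read|write|", "|0|2|4|"]:
    assert ENUM.fullmatch(text), text

# An empty enum token should not match.
assert not ENUM.fullmatch("||")
```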

LEXER_010: Special Prefixes

Purpose: Verify tokenization of special prefix characters

Test Cases:

Input       Tokens                               Description
@name       AtSign, Identifier("name")           At-prefixed property
$value      StringHint, Identifier("value")      String type hint
%value      NumberHint, Identifier("value")      Number type hint
&value      GuidHint, Identifier("value")        GUID type hint
#@ header   HeaderPrefix, Identifier("header")   Header section
#! schema   SchemaPrefix, Identifier("schema")   Schema section

LEXER_011: Comment Handling

Purpose: Verify that comments are properly skipped

Test Case 1: Single-line comment

// This is a comment
name = 'value'

Expected: First token should be Identifier("name"), no comment tokens

Test Case 2: Multi-line comment

/* This is a
multi-line comment */
name = 'value'

Expected: First token should be Identifier("name"), no comment tokens
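Comment skipping can be illustrated by stripping comments while leaving string literals intact. This is a standalone sketch; a real lexer interleaves comment skipping with its main scan loop rather than preprocessing the source, and the regex below is an assumption.

```python
import re

# Group 1 captures string literals so they survive substitution;
# line and block comments match outside that group and are dropped.
COMMENT_OR_STRING = re.compile(
    r"""('(?:\\.|[^'\\])*'      # single-quoted string (kept)
      |  "(?:\\.|[^"\\])*")     # double-quoted string (kept)
      |  //[^\n]*               # single-line comment (dropped)
      |  /\*.*?\*/              # multi-line comment (dropped)
    """, re.DOTALL | re.VERBOSE)

def strip_comments(source: str) -> str:
    return COMMENT_OR_STRING.sub(lambda m: m.group(1) or "", source)

assert strip_comments("// This is a comment\nname = 'value'") == "\nname = 'value'"
assert strip_comments("/* multi\nline */name = 'value'") == "name = 'value'"
assert strip_comments("s = 'not // a comment'") == "s = 'not // a comment'"
```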

LEXER_012: Position Tracking

Purpose: Verify line and column number tracking

Input:

name = 'value'
age = 30

Expected Positions:

  • Token "name": Line=1, Column=1
  • Token "age": Line=2, Column=1
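Position tracking can be checked by converting a token's start offset to a 1-based line and column. This sketch assumes line endings have already been normalized to \n (per the Implementation Notes) before offsets are recorded.

```python
def position_at(source: str, offset: int) -> tuple[int, int]:
    """1-based (line, column) of the character at `offset`."""
    before = source[:offset]
    line = before.count("\n") + 1
    col = offset - (before.rfind("\n") + 1) + 1
    return line, col

src = "name = 'value'\nage = 30"
assert position_at(src, src.index("name")) == (1, 1)
assert position_at(src, src.index("age")) == (2, 1)
```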

LEXER_013: Multi-line String Literals

Purpose: Verify tokenization of triple-quoted strings

Input:

"""
Line 1
Line 2
"""

Expected Token:

  • Type=MultiLineString
  • Value="Line 1\nLine 2"
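One way to derive the expected value in LEXER_013: drop the triple-quote delimiters, then strip a single leading and trailing newline so the quotes can sit on their own lines. The trimming rule is an assumption inferred from the expected token above.

```python
def multiline_value(token_text: str) -> str:
    assert token_text.startswith('"""') and token_text.endswith('"""')
    body = token_text[3:-3]
    if body.startswith("\n"):   # quote on its own opening line
        body = body[1:]
    if body.endswith("\n"):     # quote on its own closing line
        body = body[:-1]
    return body

assert multiline_value('"""\nLine 1\nLine 2\n"""') == "Line 1\nLine 2"
```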

Implementation Notes

  • Whitespace between tokens should be ignored except within strings
  • Line endings can be \n, \r\n, or \r
  • Token position should reflect the starting position of the token
  • Escape sequences must be processed during tokenization
  • Comments should be completely skipped, not returned as tokens

Error Cases

The following should produce lexer errors:

  • Unterminated strings
  • Invalid escape sequences
  • Malformed numbers (e.g., 0x, 0b without digits)
  • Invalid characters outside of strings
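The unterminated-string case can be sketched as a scan from an opening quote that requires a matching unescaped closer before end of input. The function name and error type are illustrative.

```python
def scan_string(source: str, start: int) -> int:
    """Return the index just past the closing quote, or raise."""
    quote = source[start]
    i = start + 1
    while i < len(source):
        if source[i] == "\\":
            i += 2              # skip the escaped character
        elif source[i] == quote:
            return i + 1
        else:
            i += 1
    raise SyntaxError("unterminated string literal")

assert scan_string("'abc' rest", 0) == 5
try:
    scan_string("'abc", 0)
except SyntaxError:
    pass
else:
    raise AssertionError("expected SyntaxError")
```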