Interpretation Schemas

Interpretation schemas are a powerful feature that allows you to define how to parse binary data (bytes) or text data (strings) directly in SQL. This enables structured extraction of data from file formats, network protocols, log files, and other structured data sources.

Overview

Interpretation schemas let you:

  • Define binary schemas to parse byte sequences (files, packets, etc.)
  • Define text schemas to parse character sequences (log files, CSV, etc.)
  • Use the Interpret() function to apply schemas to data
  • Combine with CROSS APPLY for powerful data transformations

Binary Schema Syntax

Binary schemas define how to interpret byte sequences:

INTERPRET binary Header {
    Magic:   int le,
    Version: short le,
    Length:  int le,
    Flags:   byte
}

Primitive Types

Type Description Size
byte Unsigned 8-bit 1 byte
sbyte Signed 8-bit 1 byte
short Signed 16-bit 2 bytes
ushort Unsigned 16-bit 2 bytes
int Signed 32-bit 4 bytes
uint Unsigned 32-bit 4 bytes
long Signed 64-bit 8 bytes
ulong Unsigned 64-bit 8 bytes
float 32-bit floating point 4 bytes
double 64-bit floating point 8 bytes

Endianness

Multi-byte types require an endianness specifier:

  • le - Little-endian (Intel, x86/x64)
  • be - Big-endian (network byte order, MIPS, PowerPC)
INTERPRET binary NetworkPacket {
    SourcePort: ushort be,      -- Network byte order
    DestPort:   ushort be,
    Length:     int le          -- Local byte order
}

Byte Arrays

Fixed-size byte arrays:

INTERPRET binary ImageHeader {
    Magic:    byte[4],          -- Fixed 4-byte signature
    Reserved: byte[16],         -- 16 reserved bytes
    Payload:  byte[256]         -- 256-byte payload
}

Text Schema Syntax

Text schemas define how to parse character sequences:

INTERPRET text LogEntry {
    Timestamp: between '[' ']',
    _:         literal ' ',
    Level:     until ':',
    _:         literal ': ',
    Message:   rest
}

Field Types

Type Description Example
literal Match exact string literal ': '
until Capture until delimiter until ','
between Capture between delimiters between '[' ']'
chars[n] Capture exactly n characters chars[10]
token Whitespace-delimited token token
rest Remaining content rest
whitespace Skip whitespace whitespace
pattern Regex pattern match pattern '\d+'

Modifiers

Text fields support optional modifiers:

INTERPRET text Record {
    Name:   chars[20] trim,     -- Trim whitespace
    Value:  until ',' lower,    -- Convert to lowercase
    Code:   chars[5] upper      -- Convert to uppercase
}

Available modifiers:

  • trim - Trim both ends
  • ltrim - Trim left
  • rtrim - Trim right
  • lower - Convert to lowercase
  • upper - Convert to uppercase
  • nested - Handle nested delimiters (for between)
  • optional - Field is optional

Using Interpretation Schemas

The Interpret() Function

Apply a schema to data using Interpret():

INTERPRET binary BmpHeader {
    Signature: byte[2],
    FileSize:  int le,
    Reserved:  int le,
    DataOffset: int le
}

SELECT 
    h.FileSize,
    h.DataOffset
FROM #os.file('image.bmp') f
CROSS APPLY Interpret(f.GetBytes(), BmpHeader) h

The Parse() Function

For text schemas, use Parse():

INTERPRET text CsvRow {
    Name:    until ',',
    _:       literal ',',
    Age:     until ',',
    _:       literal ',',
    Email:   rest
}

SELECT 
    r.Name,
    r.Age,
    r.Email
FROM #csv.lines('data.csv') l
CROSS APPLY Parse(l.Line, CsvRow) r

InterpretAt() for Offset Access

Read at a specific offset:

SELECT 
    h.Magic
FROM #os.file('data.bin') f
CROSS APPLY InterpretAt(f.GetBytes(), 100, Header) h

Schema Composition

Schemas can reference other schemas:

INTERPRET binary Point {
    X: int le,
    Y: int le
}

INTERPRET binary Rectangle {
    TopLeft:     Point,
    BottomRight: Point,
    Color:       int le
}

Error Handling

Interpretation functions throw exceptions on parse errors:

  • Use TryInterpret() or TryParse() for non-throwing variants (returns null on failure)
  • Error codes follow the ISExxxx format (Interpretation Schema Error)

Common error codes:

  • ISE001 - Insufficient data
  • ISE002 - Validation failed
  • ISE003 - Pattern mismatch
  • ISE004 - Literal not found
  • ISE005 - Delimiter not found

Practical Examples

Reading a ZIP File Header

INTERPRET binary ZipLocalHeader {
    Signature:         int le,
    VersionNeeded:     short le,
    Flags:             short le,
    CompressionMethod: short le,
    LastModTime:       short le,
    LastModDate:       short le,
    Crc32:             int le,
    CompressedSize:    int le,
    UncompressedSize:  int le,
    FileNameLength:    short le,
    ExtraFieldLength:  short le
}

SELECT 
    z.CompressedSize,
    z.UncompressedSize,
    z.CompressionMethod
FROM #os.file('archive.zip') f
CROSS APPLY Interpret(f.GetBytes(), ZipLocalHeader) z
WHERE z.Signature = 0x04034B50  -- PK\x03\x04

Parsing Apache Log Format

INTERPRET text ApacheLogEntry {
    IpAddress:  until ' ',
    _:          literal ' - - [',
    Timestamp:  until ']',
    _:          literal '] "',
    Method:     until ' ',
    _:          literal ' ',
    Path:       until ' ',
    _:          until '" ',
    StatusCode: until ' ',
    _:          literal ' ',
    Size:       rest
}

SELECT 
    a.IpAddress,
    a.Timestamp,
    a.Method,
    a.Path,
    a.StatusCode
FROM #text.lines('/var/log/apache/access.log') l
CROSS APPLY Parse(l.Line, ApacheLogEntry) a
WHERE a.StatusCode = '404'

Discard Fields

Use _ for fields you want to skip but not capture:

INTERPRET text KeyValue {
    Key:   until ':',
    _:     literal ': ',    -- Match but don't capture
    Value: rest
}

Multiple discard fields are allowed - they’re matched in order but not included in the result.


This site uses Just the Docs, a documentation theme for Jekyll.