Data Validation Error Format

Version 0.1.0

Author
Affiliation

Jakob Voß

Verbundzentrale des GBV (VZG)

Published

July 10, 2025

Abstract

This document specifies a data format to report validation errors of digital objects.

1 Introduction

Data validation is a crucial part of management of data qualitiy and interoperability. Validation is applied in many ways and contexts, for instance input forms and editors with visual feedback or schema languages with formal error reports. The diversity of use cases imply a variety of error results. No common standard exist to express error reports.1

The specification of Data Validation Error Format has two goals:

  • unify how validation errors are reported by different validators
  • address positions of errors in validated documents

Last but not least the format should help to better separate validation and presentation of validation results, so both can be solved by different applications.

Caution

The format is strictly limited to errors and error positions. Neither does it include other kinds of analysis results such as statistics and summaries of documents, nor does in include details about validation such as test cases, schema rules, and individual constraints. Errors can be linked to additional information with error types but the semantics of these types is out of the scope of this specification.

1.1 Overview

Figure 1 illustrates the validation process with core concepts used in this specification: a validator checks whether a document conforms to some requirements and returns a list of errors in return. Each error can refer to its location in the document via a position.

graph LR
   document --- validator --> errors
   errors -. positions .-> document
   validator(validator)

Figure 1: Validation process

Every document conforms to a document model. For instance JSON documents conforms to the JSON model, and character strings conforms to the model “sequence of characters from a known character set”. Document models come with encodings how to express documents in form of documents on a lower level. For instance JSON documents can be encoded with JSON syntax as Unicode strings and Unicode strings can be encoded with UTF-8 as sequences of bytes (solid arrows in Figure 2).

Note

Eventually all documents are given as digital objects, encoded as sequence of bytes. Encodings using a sequence of characters are also called textual data formats, in contrast to binary data formats.

An error position is given in form of one or more locators, each having a dimension and an address. Each dimension refers to a locator format for a set of document models. For instance JSON Pointer refers to JSON, character and line numbers refer to character strings with defined line breaks, and offsets refer to sequences of elements (Figure 2). Other examples of locator formats include XPath for XML, and row/column for tabular data.

Locators can also contain nested errors to address a more specific position within another position and to support error positions in nested documents such as archive files.

graph LR
   JSON -- JSON syntax --> Unicode
   Unicode -- UTF-8   --> Bytes
   Unicode[Unicode string]

   jsonpointer(JSON Pointer)
   char(character number)
   line(line number)
   offset

   style jsonpointer fill:#fff,stroke:#fff
   style char fill:#fff,stroke:#fff
   style line fill:#fff,stroke:#fff
   style offset fill:#fff,stroke:#fff

   jsonpointer -.-> JSON   
   char -.-> Unicode
   line -.-> Unicode
   offset -.-> Bytes

Figure 2: Example of encodings and locator formats

1.2 Examples

Documents can be invalid on many levels. For example the string {"åå":5} is valid JSON but it might be invalid if element åå is expected to hold a string instead of a number (Example 1). The error can be located with JSON Pointer in the JSON document and with character and line number:

{
  "message": "Expected string, got number at element /åå",
  "position": { "jsonpointer": "/åå", "char": "7", "line": 1 }
}
Example 1: Error in a JSON document

The string could also be part of a larger, newline-delimited JSON document. In this case it makes sense to use a nested error (Example 2):

{
  "message": "Invalid document at line 7",
  "position": [ {
    "dimension": "line",
    "address": "7",
    "errors": [ {
      "message": "Expected string, got number at element /åå",
      "position": {
        "jsonpointer": "/åå", "char": "7", "line": 1
      }
    } ]
  } ]
}
Example 2: Error in a newline-delimited JSON document

The document could also be invalid at JSON syntax level, for example if the closing } is missing (Example 3):

{
  "message": "Unexpected end of JSON input at character 8",
  "position": { "line": "1", "char": "8"  }
}
Example 3: Error in JSON syntax

A similar document could be invalid on byte level. The following table illustrates the document from Example 1 with ninth byte replaced by a value not allowed in UTF-8. It is common practice to replace such bytes with the Unicode replacement character U+FFFD but the resulting Unicode string is invalid JSON syntax still (Example 4). The example also illustrates another locator format linecol to give a character position by line and column.

Byte 7b 22 c3 a5 c3 a5 22 3a c0 7d
Code point U+007B U+0022 U+00E5 U+00E5 U+007B U+0022 ERROR⇒ U+FFFD U+0022
Character { " å å " : }
[
  {
    "level": "warning",
    "message": "Ill-formed UTF-8 byte sequence at offset 8",
    "position": { "line": "1", "char": "7", "offset": "8" }
  },
  {
    "level": "error",
    "message": "Expected JSON value at line 1, column 7",
    "position": { "line": "1", "char": "7", "linecol": "1:7" }
  }
]
Example 4: Invalid JSON on multiple levels

1.3 Conformance requirements

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “NOT RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in BCP 14 (RFC 2119 and RFC 8174) when, and only when, they appear in all capitals, as shown here.

Only section Section 2 to Section 4, excluding examples and notes, and the list of normative references are normative parts of this specification.

Specific support of Data Validation Error Format by an application depends on two options. Both MUST be documented by applications:

  1. Support of either the full format or only positions in condense form being locator maps
  2. The set of supported dimensions

2 Errors

An Error is a JSON object with:

  • mandatory field message with an error message, being a non-empty string. Applications MAY use a default value for error messages.

  • optional field types with an array of error types, each being a non-empty string. Error types can be used for grouping errors and they SHOULD be URIs. Repetitions of identical strings in the same array MUST be ignored.

  • optional field level with an error level, being one of the strings error, warning, or info. Application MUST NOT differentiate between error level error and no error level.

  • optional field position with a position. Applications MUST NOT differentiate between empty position and no position.

Note

Language and localization of error messages is out of the scope of this specification.

3 Positions

An error can have a position. A position is given

Every locator map can be transformed to an equivalent array of locators. The reverse transformation is only possible if there is at most one locator per dimension and no locator has nested errors.

Note

Locators of the same positions should refer to roughly the “same” part of a document or at least have a common intersection. This requirement is difficult to formalize because locators refer to different document models, so it is no normative part of this specification yet.

3.1 Locators

A Locator is a JSON object with

  • mandatory field dimension with the name of a dimension

  • mandatory field address with the address, being a string conforming to the locator format identified by the name of the dimension.

  • optional field errors with an array of nested errors within the located part of a document.

{ "dimension": "line", "address": "7" }
Example 5: A simple locator

Nested errors allow to reference locations within nested documents (Example 6 and Example 7):

{
  "message": "Invalid value in line 2 in file example.txt in file archive.zip",
  "position": [ {
    "dimension": "file",
    "address": "archive.zip",
    "errors": [ {
      "message": "Invalid value in line 2 in file example.txt",
      "position": [ {
        "dimension": "file",
        "address": "example.txt",
        "errors": [ { 
          "message": "Invalid value in line 2",
          "position": { "line": "2" }
        } ]
      } ]
    } ]
  } ]
}
Example 6: An error in line 2 of file example.txt in archive archive.zip
{
  "message": "Invalid character in line 7, column 3",
  "position": [ {
    "dimension": "linecol",
    "address": "7:3"
  }, {
    "dimension": "line",
    "address": "7",
    "errors": [ {
      "message": "Invalid character 3",
      "position": { "char": "3" }
    } ]
  } ]
}
Example 7: An error with position given in two forms, one with a nested error

3.2 Locator maps

A locator map is a JSON object that maps names of dimensions to addresses.

{ "line": "7", "char": "42" }
Example 8: A simple locator map indicating the position line 7, character 42

A locator map is equivalent to an array of locators with key and value of the JSON object entries mapped to field dimension and address of each locator. An array of locators can be reduced to a locator map by dropping all nested errors and selecting only the first locator of each locator format.

Applications MAY restrict their support of Data Validation Error Format to positions with locator maps. In this case nested errors and positions with multiple locators per dimension are not supported.

4 Dimensions

A dimension is a defined method to address parts of a document. Each dimension has:

  • a unique name, being a string that start with lowercase letter a to z, optionally followed by a sequence of lowercase letters, digits 0 to 9 and/or -.

  • a locator format, being a formal language of Unicode strings to encode references to parts of a document. The sets of strings of the language are called addresses.

  • a document model matching the locator format.

Applications SHOULD support the following dimensions:

name locator format document model
offset offset number sequence of elements
char character number sequence of characters or code points
cell cell reference tabular data
file file path directory tree
line line number sequence of lines
linecol line and column sequence of characters with line breaks
jsonpointer JSON Pointer JSON
xpath XML Path Expression XML or compatible hierarchies

See ?@sec-additional-dimensions for more dimensions.

Sequential document models

Offset number

The offset number locator format with name number is used to reference an element in a sequence of elements. The locator value is non-negative integer encoded as string without leading zeroes. The first element has number zero (locator value 0).

Character number

The character number locator format with name char is used to reference a character in a sequence of characters from a character set. The locator value is a positive integer encoded as string without leading zeroes. The first character has number one (locator value 1).

In Unicode strings, this locator format refers to code points instead of visual characters.

Line number

Warning

Possibly requires some more detailled specification. For instance line number depend on a common definition of line breaks, some formats include U+0B, U+0C, U+85, U+2028, U+2029…

Line and Column

Line number and [character position] within the line, separated by colon :

Tabular document models

Cell reference

The cell reference locator format with name cell is used to reference a cell or a range of cells in a table as known from spreadsheet software. The locator value consists of a pair of column and row, optionally followed by colon (:) and another pair of column and row. Columns are given in hexavigesimal system (A=1, B=2…, Z=26, AA=27, AB=28…) and rows are given by numbers, starting from 1.

Hierarchical document models

File path

The file path locator format with name file is used to reference a file or directory in a directory tree. The locator value must be a POSIX path, being a string optionally beginning with a slash (/), followed by zero or more file names, separated by slash. A file name is a non-empty sequence of Unicode code points excluding the slash (U+002F) and the null byte (U+0000).

Note

Depending in the document model, file names may be defined as binary string instead of Unicode strings. In most cases UTF-8 encoding can be assumed to map Unicode code points to bytes but (TODO: this requires more careful examination).

JSON Pointer

See https://datatracker.ietf.org/doc/html/rfc6901

XML Path Expression

TODO: Subset of XPath, see https://www.w3.org/TR/xpath20/#id-path-expressions (must start with /, no filter expressions, no reverse steps, no predicates except numbers.

Graph document models

5 References

5.1 Normative References

5.2 Informative references

Appendices

JSON Schemas

Error records can be validated with the non-normative JSON Schema schema.json in the specification repository. Rules not covered by the JSON Schema include:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "$defs": {
    "message": {
      "type": "string",
      "minLength": 1,
      "description": "error message"
    },
    "types": {
      "type": "array",
      "items": {
        "type": "string",
        "minLength": 1,
        "description": "identifier of an error type"
      }
    },
    "level": {
      "type": "string",
      "enum": ["error", "warning", "info"],
      "default": "error",
      "description": "error level"
    },
    "locator": {
      "type": "object",
      "properties": {
        "format": {
          "type": "string",
          "pattern": "^[a-z][a-z0-9-]*$"
        },
        "value": {
          "type": "string"
        },
        "position": { "$ref": "#/$defs/position" },
        "message": { "$ref": "#/$defs/message" },
        "types": { "$ref": "#/$defs/types" },
        "level": { "$ref": "#/$defs/level" }
      },
      "required": ["format", "locator"]
    },
    "position": {
      "description": "error position",
      "anyOf": [
        {
          "type": "array",
          "items": { "$ref": "#/$defs/locator" }
        },
        {
          "type": "object",
          "patternProperties": {
            "^[a-z][a-z0-9-]*$": {
              "type": "string"
            }
          },
          "additionalProperties": false
        }
      ]
    }
  },
  "properties": {
    "message": { "$ref": "#/$defs/message" },
    "types": { "$ref": "#/$defs/types" },
    "level": { "$ref": "#/$defs/level" },
    "position": { "$ref": "#/$defs/position" }
  },
  "required": ["message"]
}

Additional dimensions

The following dimensions are not normative part of the specification because they have not fully been specified yet:

name locator format document models
fq format and path all binary formats supported by fq (see Example 9)
rfc5147 RFC 5147 characters and lines
rfc7111 RFC 7111 tabular date
id Unicode string data models that refer to elements with an identifier

rfc5147, in contrast to char and line, also supports ranges. rfc7111, in contrast to cell, also supports ranges and multi-selection.

{
  "message": "Timestamp must not be in the future!",
  "position": {
    "fq": "gzip:.members[0].mtime"
  }
}
Example 9: Error using fq to locate the internal timestamp of a file in a .gz archive

The following locator formats or standards are yet to be evaluated for its use as dimension:

Changes

This document is managed in a revision control system at https://github.com/gbv/validation-error-format, including an issue tracker.

  • Version 0.1.0

    Work in progress.

Footnotes

  1. A notable exception are formats from software development used in unit testing such as JUnit XML and Test Anything Protocol.↩︎