Data Validation Error Format

Jakob Voß

1 Introduction

Data validation is a crucial part of managing data quality and interoperability. Validation is applied in many ways and contexts, for instance in input forms and editors with visual feedback or in schema languages with formal error reports. The diversity of use cases implies a variety of error results. No common standard exists to express error reports.¹

The specification of Data Validation Error Format has two goals:

unify how validation errors are reported by different validators
address positions of errors in validated documents

Last but not least, the format should help to better separate validation from the presentation of validation results, so that both can be handled by different applications.

Warning

The format is strictly limited to errors and error positions. Neither does it include other kinds of analysis results such as statistics and summaries of documents, nor does it include details about validation such as test cases, schema rules, and individual constraints. Errors can be linked to additional information with error types but the semantics of these types is out of the scope of this specification.

1.1 Overview

Figure 1 illustrates the validation process with core concepts used in this specification: a validator checks whether a document conforms to some requirements and returns a list of errors. Each error can refer to its locations in the document via positions.

graph LR
   document --- validator --> errors
   errors -. positions .-> document
   validator(validator)

Figure 1: Validation process

Every document conforms to a document model. For instance, JSON documents conform to the JSON model, and character strings conform to the model “sequence of characters from a known character set”. Document models come with encodings that express documents in the form of documents on a lower level. For instance, JSON documents can be encoded with JSON syntax as Unicode strings, and Unicode strings can be encoded with UTF-8 as sequences of bytes (solid arrows in Figure 2).

Note

Eventually all documents are given as digital objects, encoded as sequences of bytes. Encodings that use a sequence of characters are also called textual data formats, in contrast to binary data formats.

Error positions are given in form of locators, each expressed in a locator format. Locator formats refer to sets of document models: for instance JSON Pointer refers to JSON, character and line numbers refer to character strings with defined line breaks, and offsets refer to sequences of elements (Figure 2).

graph LR
   JSON -- JSON syntax --> Unicode
   Unicode -- UTF-8   --> Bytes
   Unicode[Unicode string]

   jsonpointer(JSON Pointer)
   char(character number)
   line(line number)
   offset

   style jsonpointer fill:#fff,stroke:#fff
   style char fill:#fff,stroke:#fff
   style line fill:#fff,stroke:#fff
   style offset fill:#fff,stroke:#fff

   jsonpointer -.-> JSON   
   char -.-> Unicode
   line -.-> Unicode
   offset -.-> Bytes

Figure 2: Example of encodings and locator formats

1.2 Example

Documents can be invalid with respect to document models on many levels. For example, the string {"åå":5} is valid JSON, but it might be invalid if element åå is expected to hold a string instead of a number (Example 1). The error can be located with a JSON Pointer in the JSON document and with the line and character number in its encoding as a Unicode string:

{
  "message": "Expected string, got number at element /åå",
  "position": { "jsonpointer": "/åå", "line": "1", "char": "7" }
}

Example 1: Error in a JSON document

The document could also be invalid at JSON syntax level, for example if the closing } is missing (Example 2). The error is also located by its byte offset:

{
  "message": "Unexpected end of JSON input at character 8",
  "position": { "line": "1", "char": "8", "offset": "9" }
}

Example 2: Error in JSON syntax

A similar document could be invalid on byte level. The following table illustrates the document from Example 1 with its ninth byte replaced by a value that is not allowed in UTF-8. It is common practice to replace such bytes with the Unicode replacement character U+FFFD, but the resulting Unicode string is still invalid JSON syntax (Example 3). The example also illustrates another locator format, linecol, which gives a character position by line and column.

Byte	`7b`	`22`	`c3`	`a5`	`c3`	`a5`	`22`	`3a`	`c0`	`7d`
Code point	`U+007B`	`U+0022`	`U+00E5`		`U+00E5`		`U+0022`	`U+003A`	`ERROR⇒` `U+FFFD`	`U+007D`
Character	`{`	`"`	`å`		`å`		`"`	`:`	`�`	`}`

[
  {
    "level": "warning",
    "message": "Ill-formed UTF-8 byte sequence at offset 8",
    "position": { "line": "1", "char": "", "offset": "8" }
  },
  {
    "level": "error",
    "message": "Expected JSON value at line 1, column 7",
    "position": { "line": "1", "char": "7", "linecol": "1:7", "offset": "8" }
  }
]

Example 3: Invalid JSON on multiple levels
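
The positions in Example 3 can be reproduced with a few lines of Python. This is only a sketch and not part of the format: the helper name `locate_invalid_utf8` is made up for illustration, and it assumes zero-based byte offsets and one-based character numbers as used in the examples above.

```python
def locate_invalid_utf8(data: bytes) -> dict:
    """Return a locator map for the first ill-formed UTF-8 byte, or {} if none."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the zero-based index of the first offending byte
        return {"offset": str(err.start)}
    return {}


# Bytes from the table above: the ninth byte is 0xC0, which never occurs in UTF-8
data = bytes.fromhex("7b22c3a5c3a5223ac07d")
byte_position = locate_invalid_utf8(data)

# Replace ill-formed bytes with U+FFFD, then count code points (the first is 1)
text = data.decode("utf-8", errors="replace")
char = text.index("\ufffd") + 1

# Byte offset of that character, computed from the well-formed prefix before it
offset = len(text[: char - 1].encode("utf-8"))

print(byte_position, {"line": "1", "char": str(char), "offset": str(offset)})
```

The character number and the byte offset differ because each å occupies two bytes in UTF-8 but counts as one character.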

1.3 Conformance requirements

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “NOT RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in BCP 14 (RFC 2119 and RFC 8174) when, and only when, they appear in all capitals, as shown here.

Only Section 2 to Section 4, excluding examples and notes, and the list of normative references are normative parts of this specification.

Specific support of Data Validation Error Format by an application depends on two options. Both MUST be documented by applications:

whether the application supports the full format or only the condensed form of positions (locator maps)
the set of supported locator formats

2 Errors

An Error is a JSON object with:

optional field message with an error message, being a non-empty string
optional field types with an array of error types, each being a non-empty string. Error types can be used for grouping errors and they SHOULD be URIs. Repetitions of identical strings in the same array SHOULD be ignored.
optional field level with an error level, being one of the strings error or warning.
optional field position with positions

An error is also called a warning if field level has value warning.

Applications SHOULD add a default value for errors without field message.
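
As a rough illustration of the two SHOULD rules above, the following Python sketch fills in a default message and ignores repeated error types. The function names and the default text `"Validation error"` are assumptions made for this example; the specification does not prescribe either.

```python
DEFAULT_MESSAGE = "Validation error"  # hypothetical default, not defined by the format


def normalize_error(error: dict) -> dict:
    """Return a copy of an error with a default message and without repeated types."""
    result = dict(error)
    result.setdefault("message", DEFAULT_MESSAGE)
    if "types" in result:
        # dict.fromkeys keeps the first occurrence of each string, in order
        result["types"] = list(dict.fromkeys(result["types"]))
    return result


def is_warning(error: dict) -> bool:
    """An error is called a warning only if field level has value warning."""
    return error.get("level") == "warning"


error = normalize_error({
    "types": ["http://example.org/type/a", "http://example.org/type/a"],
    "level": "warning",
})
print(error["message"], error["types"], is_warning(error))
```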

Note

Language and localization of error messages is out of the scope of this specification.

3 Positions

An error can have one or more positions. Positions are

either a JSON array of locators (detailed form),
or a locator map (condensed form).

Every locator map can be transformed to an equivalent array of locators. The reverse transformation is not always possible.

Locators of the same positions SHOULD have a non-empty intersection.

3.1 Locators

A Locator is a JSON object with

mandatory field format with the name of a locator format, being a non-empty string
mandatory field value with the locator value, being a string
optional field errors with an array of errors within the located document fragment.

Locator format and locator value MAY also be referred to as “dimension” and “address” of a locator.

{ "format": "line", "value": "7" }

Example 4: A simple locator indicating the position line (locator format) 7 (locator value)

Errors in the errors field are called nested errors. Nested errors make it possible to reference locations within nested documents (Example 5 and Example 6):

{
  "message": "Invalid value in line 2 in file example.txt in file archive.zip",
  "position": [ {
    "format": "file",
    "value": "archive.zip",
    "errors": [ {
      "message": "Invalid value in line 2 in file example.txt",
      "position": [ {
        "format": "file",
        "value": "example.txt",
        "errors": [ { 
          "message": "Invalid value in line 2",
          "position": { "line": "2" }
        } ]
      } ]
    } ]
  } ]
}

Example 5: An error in line 2 of file example.txt in archive archive.zip

{
  "message": "Invalid character in line 7, column 3",
  "position": [ {
    "format": "linecol",
    "value": "7:3"
  }, {
    "format": "line",
    "value": "7",
    "errors": [ {
      "message": "Invalid character 3",
      "position": { "char": "3" }
    } ]
  } ]
}

Example 6: An error with position given in two forms, one with a nested error
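
One way to consume nested errors is to walk them from the outermost document inwards and collect the chain of locators that leads to each innermost error. The sketch below does this for Example 5. The function names `as_locators` and `innermost` are illustrative only, and the code assumes that a position is either an array of locators or a JSON object mapping format names to values, as in the innermost error of Example 5.

```python
from typing import Iterator


def as_locators(position) -> list:
    """Expand a position (array of locators or format/value object) to a list."""
    if isinstance(position, dict):
        return [{"format": f, "value": v} for f, v in position.items()]
    return list(position or [])


def innermost(error: dict, path: tuple = ()) -> Iterator[tuple]:
    """Yield (path, message) for every locator that has no nested errors.

    path is a tuple of (format, value) pairs, outermost document first.
    """
    for locator in as_locators(error.get("position")):
        here = path + ((locator["format"], locator["value"]),)
        nested = locator.get("errors", [])
        if not nested:
            yield here, error.get("message")
        for inner in nested:
            yield from innermost(inner, here)


# The error from Example 5
example5 = {
    "message": "Invalid value in line 2 in file example.txt in file archive.zip",
    "position": [{
        "format": "file",
        "value": "archive.zip",
        "errors": [{
            "message": "Invalid value in line 2 in file example.txt",
            "position": [{
                "format": "file",
                "value": "example.txt",
                "errors": [{
                    "message": "Invalid value in line 2",
                    "position": {"line": "2"},
                }],
            }],
        }],
    }],
}

found = list(innermost(example5))
for path, message in found:
    print(" > ".join(f"{fmt} {value}" for fmt, value in path), "-", message)
```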

3.2 Locator maps

A locator map is a JSON object that maps locator format names to locator values.

{ "line": "7", "char": "42" }

Example 7: A simple locator map indicating the position line 7, character 42

A locator map is equivalent to an array of locators with key and value of the JSON object entries mapped to field format and value of each locator. An array of locators can be reduced to a locator map by dropping all nested errors and selecting only the first locator of each locator format.
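
The two transformations can be sketched in Python as follows. The function names are chosen for this example and are not defined by the specification.

```python
def map_to_locators(locator_map: dict) -> list:
    """Expand a locator map into the equivalent array of locators."""
    return [{"format": name, "value": value} for name, value in locator_map.items()]


def locators_to_map(locators: list) -> dict:
    """Reduce an array of locators to a locator map (lossy).

    Nested errors are dropped and only the first locator of each
    locator format is kept.
    """
    result = {}
    for locator in locators:
        result.setdefault(locator["format"], locator["value"])
    return result


# Position of Example 6, followed by a second locator of format "line"
locators = [
    {"format": "linecol", "value": "7:3"},
    {"format": "line", "value": "7",
     "errors": [{"message": "Invalid character 3", "position": {"char": "3"}}]},
    {"format": "line", "value": "8"},
]

reduced = locators_to_map(locators)
expanded = map_to_locators(reduced)
print(reduced)
print(expanded)
```

Expanding the reduced map again does not restore the nested error or the second line locator, which is why the reverse transformation is not always possible without loss.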

Applications MAY restrict their support of Data Validation Error Format to positions with locator maps. In this case nested positions and multiple positions of the same locator format are not supported.

4 Locator formats

A locator format is a formal language of Unicode strings to locate positions in a document. The strings of the language are called the locator values of the locator format.

Each locator format has a unique locator format name. The name is a string that starts with a lowercase letter a to z, optionally followed by a sequence of lowercase letters, digits 0 to 9 and/or -.
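
The naming rule translates directly into a regular expression. A minimal Python check, with a helper name chosen for this example, could look like this:

```python
import re

# One lowercase letter a-z, then any number of lowercase letters, digits or "-"
FORMAT_NAME = re.compile(r"[a-z][a-z0-9-]*")


def is_format_name(name: str) -> bool:
    """Check whether a string is a syntactically valid locator format name."""
    return FORMAT_NAME.fullmatch(name) is not None


for candidate in ["jsonpointer", "linecol", "x-1", "1line", "Line", "-a", ""]:
    print(repr(candidate), is_format_name(candidate))
```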

Each locator format can encode positions of documents that conform to a set of matching document models.

The set of normative locator formats has not been finally specified yet. The final version of this specification may need to define a registry of locator formats. The following locator formats will likely be included:

name	locator format	document models
`offset`	offset number	sequence of elements
`char`	character position	sequence of characters or code points
`cell`	cell reference	tabular data models
`file`	file path	directory tree
`line`	line number (first: 1)	sequence of lines
`linecol`	line number and column	sequence of characters with line breaks
`jsonpointer`	JSON Pointer	JSON
`xpath`	XPath (or a subset)	XML
`fq`	format and path	all binary formats supported by fq (see Example 8)

The locator formats require more detailed specification. For instance, line numbers depend on a common definition of line breaks: some formats include U+0B, U+0C, U+85, U+2028, U+2029…
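
The effect of differing line break definitions can be seen in Python, where `str.splitlines()` treats U+2028 (and several other characters, including U+0B, U+0C, U+85 and U+2029) as line boundaries, while splitting on U+000A alone does not. The helper `line_of` is only an illustration.

```python
text = "first\u2028second\nthird"

# Broad definition of line breaks, as implemented by str.splitlines()
lines_unicode = text.splitlines()

# Narrow definition: only U+000A ends a line
lines_lf = text.split("\n")


def line_of(needle: str, lines: list) -> int:
    """Return the one-based number of the first line that contains needle."""
    return next(n for n, line in enumerate(lines, start=1) if needle in line)


print(line_of("third", lines_unicode), line_of("third", lines_lf))
```

The same text fragment ends up with two different line numbers, so a line locator is only meaningful if validator and consumer agree on the definition.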

{
  "message": "Timestamp must not be in the future!",
  "position": {
    "fq": "gzip:.members[0].mtime"
  }
}

Example 8: Error using fq to locate the internal timestamp of a file in a .gz archive

Note

More candidates of locator formats to be specified:

RFC 5147 for ranges of lines and characters
RFC 7111 for tabular data
IIIF (section in an image)
RDF graphs (every subset of an RDF graph is another RDF graph)
Subsets of query languages (SQL, SPARQL…)
PDF highlighted text annotations
id for data models that refer to elements with an identifier
PICA Path
MARCspec
…

Offset number

The offset number locator format with name offset is used to reference an element in a sequence of elements. The locator value is a non-negative integer encoded as a string without leading zeroes. The first element has number zero (locator value 0).

Character position

The character position locator format with name char is used to reference a character in a sequence of characters from a character set. The locator value is a positive integer encoded as a string without leading zeroes. The first character has number one (locator value 1).

In Unicode strings, this locator format refers to code points instead of visual characters.
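
A minimal Python sketch of how values of the offset number and character position formats could be checked and resolved is given below. The function names are assumptions for this example. Python strings are indexed by code point, which matches the rule for Unicode strings stated above.

```python
import re

OFFSET_VALUE = re.compile(r"0|[1-9][0-9]*")  # non-negative, no leading zeroes
CHAR_VALUE = re.compile(r"[1-9][0-9]*")      # positive, no leading zeroes


def element_at(sequence, value: str):
    """Resolve an offset number: the first element has locator value 0."""
    if not OFFSET_VALUE.fullmatch(value):
        raise ValueError(f"invalid offset number: {value!r}")
    return sequence[int(value)]


def code_point_at(text: str, value: str) -> str:
    """Resolve a character position: the first character has locator value 1."""
    if not CHAR_VALUE.fullmatch(value):
        raise ValueError(f"invalid character position: {value!r}")
    return text[int(value) - 1]


# "é" written as base letter plus combining accent: one visual character,
# but two code points
text = "e\u0301!"
print(hex(element_at(text.encode("utf-8"), "0")), repr(code_point_at(text, "2")))
```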

Cell reference

The cell reference locator format with name cell is used to reference a cell or a range of cells in a table as known from spreadsheet software. The locator value consists of a pair of column and row, optionally followed by colon (:) and another pair of column and row. Columns are given in hexavigesimal system (A=1, B=2…, Z=26, AA=27, AB=28…) and rows are given by numbers, starting from 1.
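
The column numbering can be computed like a base-26 number without a zero digit. The following Python sketch parses a cell reference into numeric pairs of column and row. It assumes uppercase column letters and row numbers without leading zeroes, neither of which is stated explicitly above, and the function names are illustrative.

```python
import re

# Assumption: uppercase letters A-Z for columns, rows without leading zeroes
CELL = re.compile(r"([A-Z]+)([1-9][0-9]*)(?::([A-Z]+)([1-9][0-9]*))?")


def column_number(letters: str) -> int:
    """Convert column letters to a number (A=1, ..., Z=26, AA=27, ...)."""
    number = 0
    for letter in letters:
        number = number * 26 + (ord(letter) - ord("A") + 1)
    return number


def parse_cell(value: str) -> tuple:
    """Return ((column, row), (column, row)) for a cell or a range of cells."""
    match = CELL.fullmatch(value)
    if match is None:
        raise ValueError(f"invalid cell reference: {value!r}")
    col1, row1, col2, row2 = match.groups()
    start = (column_number(col1), int(row1))
    end = (column_number(col2), int(row2)) if col2 else start
    return start, end


print(parse_cell("B3"), parse_cell("A1:AB10"))
```

A single cell is returned as a range whose start and end are identical, so callers can treat both cases the same way.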

File path

The file path locator format with name file is used to reference a file or directory in a directory tree. The locator value must be a POSIX path, being a string optionally beginning with a slash (/), followed by zero or more file names, separated by slash. A file name is a non-empty sequence of Unicode code points excluding the slash (U+002F) and the null byte (U+0000).
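
Read literally, this grammar can be expressed as a regular expression. The Python sketch below follows the wording as given and makes no attempt to settle open questions such as whether an empty value or a trailing slash should be allowed. The helper name is an assumption for this example.

```python
import re

# A file name: one or more code points except "/" (U+002F) and U+0000
FILE_NAME = r"[^/\x00]+"

# Optional leading slash, then zero or more file names separated by slashes
FILE_PATH = re.compile(rf"/?(?:{FILE_NAME}(?:/{FILE_NAME})*)?")


def is_file_path(value: str) -> bool:
    """Check a locator value against the file path grammar as worded above."""
    return FILE_PATH.fullmatch(value) is not None


for candidate in ["archive.zip", "/data/example.txt", "/", "a//b", "a\x00b"]:
    print(repr(candidate), is_file_path(candidate))
```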

Note

Depending on the document model, file names may be defined as binary strings instead of Unicode strings. In most cases UTF-8 encoding can be assumed to map Unicode code points to bytes but (TODO: this requires more careful examination).

5 References

5.1 Normative References

Berners-Lee, T. and Fielding, R. and Masinter, L.: Uniform Resource Identifier (URI): Generic Syntax. RFC 3986, January 2005, http://www.rfc-editor.org/info/rfc3986.
Bradner, S.: Key words for use in RFCs to Indicate Requirement Levels. BCP 14, RFC 2119, March 1997, http://www.rfc-editor.org/info/rfc2119.
Bray, T.: The JavaScript Object Notation (JSON) Data Interchange Format. RFC 8259, December 2017, https://tools.ietf.org/html/rfc8259.
Bryan, P. and Zyp, K. and Nottingham, M.: JavaScript Object Notation (JSON) Pointer. RFC 6901, April 2013, https://tools.ietf.org/html/rfc6901.
Leiba, B.: Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words. BCP 14, RFC 8174, May 2017, http://www.rfc-editor.org/info/rfc8174.

5.2 Informative references

JSON Schema schema language

Appendices

The following information is non-normative.

JSON Schemas

Error records can be validated with the non-normative JSON Schema schema.json in the specification repository. Rules not covered by the JSON Schema include:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "$defs": {
    "message": {
      "type": "string",
      "minLength": 1,
      "description": "error message"
    },
    "types": {
      "type": "array",
      "items": {
        "type": "string",
        "minLength": 1,
        "description": "identifier of an error type"
      }
    },
    "level": {
      "type": "string",
      "enum": ["error", "warning"],
      "default": "error",
      "description": "error level ('error' or 'warning')"
    },
    "locator": {
      "type": "object",
      "properties": {
        "format": {
          "type": "string",
          "pattern": "^[a-z][a-z0-9-]*$"
        },
        "value": {
          "type": "string"
        },
        "position": { "$ref": "#/$defs/position" },
        "message": { "$ref": "#/$defs/message" },
        "types": { "$ref": "#/$defs/types" },
        "level": { "$ref": "#/$defs/level" }
      },
      "required": ["format", "value"]
    },
    "position": {
      "description": "positions",
      "anyOf": [
        {
          "type": "array",
          "items": { "$ref": "#/$defs/locator" }
        },
        {
          "type": "object",
          "patternProperties": {
            "^[a-z][a-z0-9-]*$": {
              "type": "string"
            }
          },
          "additionalProperties": false
        }
      ]
    }
  },
  "properties": {
    "message": { "$ref": "#/$defs/message" },
    "types": { "$ref": "#/$defs/types" },
    "level": { "$ref": "#/$defs/level" },
    "position": { "$ref": "#/$defs/position" }
  }
}

Changes

This document is managed in a revision control system at https://github.com/gbv/validation-error-format, including an issue tracker.

Version 0.1.0

Work in progress.

Footnotes

¹ Notable exceptions are formats from software development used in unit testing, such as JUnit XML and Test Anything Protocol.