Data Validation Error Format

Jakob Voß

Abstract

This document specifies a data format to report validation errors of digital objects with error positions independent from specific document models.

1 Introduction

All data is wrong, but some data is wrong on multiple levels.

Data validation is a crucial part of managing data quality and interoperability. Validation is applied in many ways and contexts, for instance in input forms and editors with visual feedback or in schema languages with formal error reports. The diversity of use cases implies a variety of error results. Existing standards for error reporting, such as JUnit XML and Test Anything Protocol, have narrow use cases in software development.

The specification of Data Validation Error Format has two goals:

unify how validation errors are reported by different applications
reference positions of errors in validated documents, independent from document models

Last but not least, the format should help to better separate validation from the presentation of validation results, so that both can be handled by different applications.

The format is strictly limited to errors and error positions. It includes neither other kinds of analysis results such as statistics and summaries of documents, nor details about validation such as test cases, schema rules, and individual constraints. Errors can be linked to additional information with error types but the semantics of these types is out of the scope of this specification.

This document is managed in a revision control system at https://github.com/gbv/validation-error-format, including an issue tracker.

1.1 Overview

Figure 1 illustrates the validation process with core concepts used in this specification: a validator checks whether a document conforms to some requirements and returns a list of errors. Each error can refer to its location in the document via a position.

graph LR
   document --- validator --> errors
   errors -. positions .-> document
   validator(validator)

Figure 1: Validation process

Every document conforms to a document model. For instance JSON documents conform to the JSON model, and character strings conform to the model “sequence of characters from a known character set”. Document models come with encodings that express documents in the form of documents on a lower level. For instance JSON documents can be encoded with JSON syntax as Unicode strings and Unicode strings can be encoded with UTF-8 as sequences of bytes (solid arrows in Figure 2).

Eventually all documents are given as digital objects, encoded as sequences of bytes. Encodings using a sequence of characters are also called textual data formats, in contrast to binary data formats.

An error position is given in form of one or more locators, each having a dimension and an address. Each dimension refers to a locator format for a set of document models. For instance JSON Pointer refers to JSON, character and line numbers refer to character strings with defined line breaks, and offsets refer to sequences of elements (Figure 2). Other examples of locator formats include XPath for XML, and row/column for tabular data.

Locators can also contain nested errors to reference a more specific position within another position and to support error positions in nested documents such as archive files.

graph LR
   JSON -- JSON syntax --> Unicode
   Unicode -- UTF-8   --> Bytes
   Unicode[Unicode string]

   jsonpointer(JSON Pointer)
   char(character number)
   line(line number)
   offset

   style jsonpointer fill:#fff,stroke:#fff
   style char fill:#fff,stroke:#fff
   style line fill:#fff,stroke:#fff
   style offset fill:#fff,stroke:#fff

   jsonpointer -.-> JSON   
   char -.-> Unicode
   line -.-> Unicode
   offset -.-> Bytes

Figure 2: Example of encodings and locator formats

1.2 Examples

Documents can be invalid on many levels. For example the string {"åå":5} is valid JSON but it might be invalid if element åå is expected to hold a string instead of a number (Example 1). The error can be located with JSON Pointer in the JSON document and with character and line number:

{
  "message": "Expected string, got number at element /åå",
  "position": { "jsonpointer": "/åå", "char": "7", "line": "1" }
}

Example 1: Error in a JSON document
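
How the char locator of Example 1 relates to a position in the UTF-8 encoded bytes can be sketched in Python. The helper name char_to_offset is hypothetical and not part of this specification; the sketch assumes that character numbers start at 1 and offsets start at 0, as defined for the char and offset dimensions in section 5.1.

```python
def char_to_offset(text, char, encoding="utf-8"):
    """Return the zero-based byte offset of a 1-based character number."""
    if not 1 <= char <= len(text) + 1:
        raise ValueError("character number out of range")
    # Encode everything before the character; the length is its offset.
    return len(text[: char - 1].encode(encoding))


document = '{"åå":5}'
print(char_to_offset(document, 7))  # 8
```

Character 7 of the document (the digit 5) starts at byte offset 8 because each å occupies two bytes in UTF-8.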

The string could also be part of a larger, newline-delimited JSON document. In this case it makes sense to use a nested error (Example 2):

{
  "message": "Invalid document at line 7",
  "position": [ {
    "dimension": "line",
    "address": "7",
    "errors": [ {
      "message": "Expected string, got number at element /åå",
      "position": {
        "jsonpointer": "/åå", "char": "7", "line": "1"
      }
    } ]
  } ]
}

Example 2: Error in a newline-delimited JSON document

The document could also be invalid at JSON syntax level, for example if the closing } is missing (Example 3):

{
  "message": "Unexpected end of JSON input at character 8",
  "position": { "line": "1", "char": "8"  }
}

Example 3: Error in JSON syntax

A similar document could be invalid on byte level. The following table illustrates the document from Example 1 with the ninth byte replaced by a value not allowed in UTF-8. It is common practice to replace such bytes with the Unicode replacement character U+FFFD, but the resulting Unicode string is still invalid JSON syntax (Example 4). The example also illustrates another locator format linecol to give a character position by line and column.

Byte	`7b`	`22`	`c3`	`a5`	`c3`	`a5`	`22`	`3a`	`c0`	`7d`
Code point	`U+007B`	`U+0022`	`U+00E5`		`U+00E5`		`U+0022`	`U+003A`	`ERROR⇒` `U+FFFD`	`U+007D`
Character	`{`	`"`	`å`		`å`		`"`	`:`	`�`	`}`

[
  {
    "level": "warning",
    "message": "Ill-formed UTF-8 byte sequence at offset 8",
    "position": { "line": "1", "char": "7", "offset": "8" }
  },
  {
    "level": "error",
    "message": "Expected JSON value at line 1, column 7",
    "position": { "line": "1", "char": "7", "linecol": "1:7" }
  }
]

Example 4: Invalid JSON on multiple levels
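
The two errors of Example 4 can be reproduced with Python's standard library. This is only a sketch of one possible validator, not a reference implementation: the choice of levels and message texts follows the example, and the positions are reduced to the locators that the Python exceptions provide directly.

```python
import json

raw = bytes.fromhex("7b22c3a5c3a5223ac07d")
errors = []

try:
    text = raw.decode("utf-8")
except UnicodeDecodeError as e:
    # e.start is the zero-based index of the first offending byte
    errors.append({
        "level": "warning",
        "message": f"Ill-formed UTF-8 byte sequence at offset {e.start}",
        "position": {"offset": str(e.start)},
    })
    # Common practice: continue with U+FFFD in place of the invalid byte
    text = raw.decode("utf-8", errors="replace")

try:
    json.loads(text)
except json.JSONDecodeError as e:
    # lineno and colno are 1-based and count code points
    errors.append({
        "level": "error",
        "message": f"Expected JSON value at line {e.lineno}, column {e.colno}",
        "position": {"line": str(e.lineno), "linecol": f"{e.lineno}:{e.colno}"},
    })

print(json.dumps(errors, indent=2))
```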

1.3 Conformance requirements

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “NOT RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in BCP 14 (RFC 2119 and RFC 8174) when, and only when, they appear in all capitals, as shown here.

Formal grammars in this specification are given in EBNF notation as defined in Extensible Markup Language (XML) 1.1 section 6.

Only sections 2 to 5, excluding examples and notes, and the list of normative references are normative parts of this specification.

2 Errors

An error is a JSON object with the following constraints:

an error SHOULD have a field message with an error message, being a non-empty string. Applications MAY use a default value for error messages. Language and localization of error messages is out of the scope of this specification.
an error MAY have field types with an array of error types, each being a non-empty string. Error types can be used for grouping errors and to reference a cause or constraint being violated by the error. Error types SHOULD be URIs (RFC 3986) or local identifiers with the same syntax as the name of a dimension.
an error MAY have field level with an error level, being one of the strings error, warning, or info. Applications MUST use the default value error if this field is not given.
an error MAY have field position with a position. Applications MUST NOT differentiate between no position and an empty position (an empty array or an empty JSON object).

Applications MUST use individual errors for individual positions of the kind of observation represented by the error. For instance a malformed character occurring two times in a document results in two errors.

By this definition the error {} is allowed and equivalent to {"level":"error"}.
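
The default level and the equivalence of empty and missing positions can be applied before errors are compared or stored. A minimal sketch in Python; the function name normalize_error is an assumption made for illustration and not part of the format:

```python
def normalize_error(error):
    """Return a copy with the default level applied and an empty position dropped."""
    result = dict(error)
    result.setdefault("level", "error")
    # An empty array or an empty object is the same as no position at all.
    if "position" in result and not result["position"]:
        del result["position"]
    return result


print(normalize_error({}))                # {'level': 'error'}
print(normalize_error({"position": []}))  # {'level': 'error'}
```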

A nested error is an error that is listed as part of a locator in its field errors or reports.

3 Positions

The position of an error is given

either in condense form with a locator map,
or in full form as JSON array of locators.

Every locator map can be transformed to an equivalent array of locators. The reverse transformation is only possible if no locator has nested errors and there is no more than one locator per dimension.

A position with multiple locators of the same dimension does not imply multiple errors; it references multiple elements involved in the same error (for instance a mismatch between two elements). Locators of different dimensions in the same position SHOULD refer to the same elements or have a common intersection.

3.1 Locator maps

A locator map is a JSON object that maps names of dimensions to addresses.

{ "line": "7", "char": "42" }

Example 5: A simple locator map indicating the position line 7, character 42

A locator map can be transformed to an equivalent array of locators with key and value of the JSON object entries mapped to field dimension and address of each locator.

[
  { "dimension": "line", "address": "7" },
  { "dimension": "char", "address": "42" }
]

Example 6: Equivalent array of locators
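
The transformation between both forms can be sketched in Python as follows. The function names to_locators and to_locator_map are hypothetical. The reverse direction raises an exception in the cases where this specification says no equivalent locator map exists; an optional field value is dropped because a locator map has no place for it.

```python
def to_locators(position):
    """Expand a position in either form into an array of locators."""
    if isinstance(position, dict):
        return [{"dimension": d, "address": a} for d, a in position.items()]
    return list(position)


def to_locator_map(locators):
    """Condense an array of locators into a locator map, if possible."""
    result = {}
    for locator in locators:
        if "errors" in locator or "reports" in locator:
            raise ValueError("locator has nested errors")
        if locator["dimension"] in result:
            raise ValueError("more than one locator per dimension")
        result[locator["dimension"]] = locator["address"]
    return result


print(to_locators({"line": "7", "char": "42"}))
```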

Applications MAY restrict their support of Data Validation Error Format to positions in condense form being locator maps.

3.2 Locators

A locator references an element of a document. A locator is a JSON object with the following constraints:

the locator MUST have a field dimension with the name of a dimension. Some dimensions imply a document model on elements referenced by locators of this dimension.
the locator MUST have a field address with the address, being a string conforming to the locator format identified by the name of the dimension.
the locator MAY have a field value with the referenced element encoded in some reasonable form (typically as JSON string). The value MUST be derived from the document, dimension, and address. Applications MAY replace the field with another value derived from document, dimension, and address.
the locator MAY have either a field errors with an array of errors within the located element or a field reports with an array of reports for the located element. Errors in field errors of a locator or as part of a report are called nested errors.

{ "dimension": "line", "address": "7" }

Example 7: A simple locator

Nested errors make it possible to reference locations within elements of a document. Positions of nested errors MUST be relative to the element referenced by their parent locator (Example 8 and Example 9):

{
  "message": "Invalid value in line 2 in file example.txt in file archive.zip",
  "position": [ {
    "dimension": "file",
    "address": "archive.zip",
    "errors": [ {
      "message": "Invalid value in line 2 in file example.txt",
      "position": [ {
        "dimension": "file",
        "address": "example.txt",
        "errors": [ { 
          "message": "Invalid value in line 2",
          "position": { "line": "2" }
        } ]
      } ]
    } ]
  } ]
}

Example 8: An error in line 2 of file example.txt in archive archive.zip

{
  "message": "Invalid character in line 7, column 3",
  "position": [ {
    "dimension": "linecol",
    "address": "7:3"
  }, {
    "dimension": "line",
    "address": "7",
    "errors": [ {
      "message": "Invalid character 3",
      "position": { "char": "3" }
    } ]
  } ]
}

Example 9: An error with position given in two forms, one with a nested error
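
An application that only wants the most specific errors can walk the nesting recursively. The following Python sketch uses a hypothetical function innermost_errors and, as a simplification, ignores nested errors that are given as part of reports:

```python
def innermost_errors(error, parents=()):
    """Yield (parents, error) pairs for every error without nested errors.

    parents is a tuple of (dimension, address) pairs of the enclosing locators.
    """
    position = error.get("position") or []
    children = []
    if isinstance(position, list):  # locator maps cannot carry nested errors
        for locator in position:
            step = (locator["dimension"], locator["address"])
            children += [(child, step) for child in locator.get("errors", [])]
    if not children:
        yield parents, error
    for child, step in children:
        yield from innermost_errors(child, parents + (step,))
```

Applied to Example 8, this yields a single pair: the two file locators as parents and the error "Invalid value in line 2" with its relative position.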

4 Reports

Reports summarize errors of the same type with additional metadata. A report is a JSON object with the following constraints:

the report MAY have field types with an array of error types
the report SHOULD have field errors with an array of zero or more errors, optionally followed by the value null to indicate an incomplete list.
the report SHOULD have field totalErrors with a non-negative integer number. The number MUST be equal to the length of array errors if the array does not contain value null and it MUST be equal to or larger than the length of the array if it does contain null.
the report MAY have field compliances with an array of positions, optionally followed by the value null to indicate an incomplete list. Positions in field compliances MUST NOT contain nested errors.
the report MAY have field totalCompliances with a non-negative integer number. The number MUST be equal to the length of array compliances if the array does not contain value null and it MUST be equal to or larger than the length of the array if it does contain null. Applications MUST compliances
the report MAY have field totalFindings with a non-negative integer number. The number MUST be equal to the sum of totalCompliances and totalErrors, if both are given.
the report SHOULD have field complete with a boolean value. The value MUST be false if any of the arrays errors and compliances contains null as last element.
the report MAY have field duration with a non-negative number giving the time in seconds it took to create the report.

The number of totalFindings MUST be equal to the sum of totalErrors and totalCompliances, if both are given.

Applications MUST process errors listed in field errors as follows:

every error type of the report is added to field types of the error unless it already exists in the array

{
  "message": "File records.xml contains invalid records",
  "position": [ { 
    "dimension": "file",
    "address": "records.xml",
    "reports": [ {
      "types": [ "record-must-be-valid" ],
      "errors": [
        { "position": { "xpath": "/records/record[2]" } }
      ],
      "compliances": [
        { "xpath": "/records/record[1]" },
        { "xpath": "/records/record[3]" }
      ]
    } ]
  } ]
}

Example 10: An error with nested errors in a report
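
The processing of report errors and the constraint on totals can be sketched in Python. Both function names are hypothetical. The check takes the constraint literally, so a trailing null counts towards the length of the array; other readings are possible.

```python
def report_errors(report):
    """Return the errors of a report with the report's error types merged in."""
    errors = []
    for error in report.get("errors", []):
        if error is None:  # trailing null marks an incomplete list
            continue
        merged = dict(error)
        types = list(merged.get("types", []))
        for error_type in report.get("types", []):
            if error_type not in types:
                types.append(error_type)
        if types:
            merged["types"] = types
        errors.append(merged)
    return errors


def total_is_consistent(report, items="errors", total="totalErrors"):
    """Check totalErrors (or totalCompliances) against the listed array."""
    if items not in report or total not in report:
        return True  # nothing to compare
    array = report[items]
    if None in array:
        return report[total] >= len(array)
    return report[total] == len(array)
```

For the report in Example 10, report_errors returns one error with position /records/record[2] and the type record-must-be-valid.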

5 Dimensions

A dimension is a defined method to reference elements of a document. Each dimension has:

a unique name, being a string that starts with a lowercase letter a to z, optionally followed by a sequence of lowercase letters and digits 0 to 9.
a locator format, being a formal language of Unicode strings to encode references to elements of a document. The strings of the language are called addresses.
a document model matching the locator format.

Some dimensions imply a document model on referenced elements (element model). For instance a line number references a character string and a JSON Pointer references a JSON value.

Applications SHOULD support the following dimensions. The appendix contains a non-normative note on additional dimensions not fully specified yet.

name	locator format	document model	element model
`id`	identifier	indexed set of elements	-
`offset`	offset number	sequence of elements	-
`char`	character number	character string	character
`line`	line number	sequence of character strings	character string
`linecol`	line and column	sequence of character strings	character
`cell`	cell reference	tabular data	-
`cells`	cell range	tabular data	tabular data
`rfc7111`	table selection	tabular data	tabular data
`file`	file path	directory tree	-
`jsonpointer`	JSON Pointer	JSON value	JSON value
`xpath`	XML Locator	XML or compatible hierarchies	XML element or attribute

The identifier locator format with name id and locator values being arbitrary Unicode strings subsumes every other locator format because locators of the same value reference the same element. It can be used for any kind of formalized reference to elements of a document, but its main use cases are record identifiers, unique names and similar identifier systems.

Dimensions are a subset of query languages. A dimension value references one element from a document. A query language (e.g. JSONPath, full XPath…) can locate a set of elements.

5.1 Sequential document models

Offset number

The offset number locator format with name offset is used to reference an element in a sequence of elements. The locator value is a non-negative integer encoded as string without leading zeroes. The first element has number zero (locator value 0).

Character number

The character number locator format with name char is used to reference a character in a sequence of characters from a character set. The locator value is a positive integer encoded as string without leading zeroes. The first character has number one (locator value 1).

In Unicode strings, this locator format refers to code points instead of visual characters.
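
The difference between code points and visual characters matters as soon as combining characters are involved. A short Python illustration, in which the helper name char_of is hypothetical:

```python
import unicodedata

composed = "\u00e5x"                                 # å as one code point, then x
decomposed = unicodedata.normalize("NFD", composed)  # a + combining ring, then x


def char_of(text, needle):
    """Return the 1-based character number (code point position) of needle."""
    return text.index(needle) + 1


print(char_of(composed, "x"))    # 2
print(char_of(decomposed, "x"))  # 3
```

Both strings are displayed the same way, but the letter x has a different character number in each of them.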

Line number

The line number locator format with name line is used to reference a line in a sequence of lines, each being a character string. The locator value is a positive integer encoded as string without leading zeroes. The first line has number one (locator value 1).

The document model of line number is not a character string with line breaks but a sequence of character strings. Splitting of character strings into lines is beyond the scope of this specification because multiple definitions of line break exist (U+0D optionally followed by U+0A, U+0A, U+0B, U+0C, U+85, U+2028, U+2029…).
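
How much the choice of line break definition affects line numbers can be seen with two splitting methods from the Python standard library. The sample string and the helper name line are made up for illustration:

```python
text = "a\nb\r\nc\u2028d"

lines_lf = text.split("\n")    # only U+000A counts as a line break
lines_any = text.splitlines()  # Python's broader set of line boundaries


def line(lines, number):
    """Return the line addressed by a 1-based line number."""
    return lines[number - 1]


print(len(lines_lf), len(lines_any))  # 3 4
```

The same locator value 3 references the string c followed by U+2028 and d in the first sequence, but only c in the second, so validator and consumer need to agree on how lines are split.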

Line and Column

The line and column locator format with name linecol is used to reference a character in a sequence of character strings. The locator value consists of a line number and a character number within the line, separated by colon (:).
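
A linecol address can be rewritten as a line locator with a nested error holding a char locator, as in Example 9. A minimal Python sketch with a hypothetical function name; the regular expression assumes both numbers are positive integers without leading zeroes, like the line and char locator values:

```python
import re

LINECOL = re.compile(r"([1-9][0-9]*):([1-9][0-9]*)")


def linecol_to_nested(address):
    """Turn a linecol address into a line locator with a nested char position."""
    match = LINECOL.fullmatch(address)
    if not match:
        raise ValueError(f"not a linecol address: {address}")
    line, char = match.groups()
    return {
        "dimension": "line",
        "address": line,
        "errors": [{"position": {"char": char}}],
    }


print(linecol_to_nested("7:3"))
```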

5.2 Tabular document models

Tabular data is known from spreadsheet software and CSV files. This document model does not include table headers. A table with a header row and unique column names can better be mapped to a hierarchical model or modelled as a sequence of indexed sets. For instance an error in column title of the third row of a table (not counting the header row) could have the following locator:

{
  "dimension": "offset", "address": "3",
  "errors": [ { "position": [ 
    { "dimension": "id", "address": "title" } ] } ]
}

Cell reference

The cell reference locator format with name cell is used to reference an individual cell in tabular data. The locator value consists of a pair of column and row. Columns are given in hexavigesimal system (A=1, B=2…, Z=26, AA=27, AB=28…) and rows are given by numbers, starting from 1.
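
The column numbering can be converted with a few lines of Python. The function names are hypothetical, and the sketch covers only the column part of a cell reference:

```python
def column_number(letters):
    """Convert column letters to a number (A=1, Z=26, AA=27)."""
    if not letters:
        raise ValueError("empty column")
    number = 0
    for ch in letters:
        if not "A" <= ch <= "Z":
            raise ValueError(f"invalid column letter: {ch}")
        number = number * 26 + (ord(ch) - ord("A") + 1)
    return number


def column_letters(number):
    """Convert a positive column number back to letters."""
    if number < 1:
        raise ValueError("column numbers start at 1")
    letters = ""
    while number > 0:
        number, remainder = divmod(number - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


print(column_number("AB"), column_letters(28))  # 28 AB
```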

Cell range

The cell range locator format with name cells is used to reference a selection of connected cells in tabular data. The locator value consists of a cell reference, optionally followed by colon (:) and another cell reference.

Table selection

The table selection locator format with name rfc7111 is used to reference a selection of connected cells in tabular data. It can reference individual cells, cell ranges, full rows and full columns. The locator format is defined by the following grammar:

TableSelection  ::=  Cells | Rows | Columns
Cells           ::=  "cell=" CellPosition ( "-" CellPosition )?
Rows            ::=  "row=" RowPosition ( "-" RowPosition )?
Columns         ::=  "col=" ColPosition ( "-" ColPosition )?
CellPosition    ::=  RowPosition "," ColPosition
ColPosition     ::=  PositiveInteger
RowPosition     ::=  PositiveInteger
PositiveInteger ::=  [1-9] [0-9]*

The table selection locator format is a proper subset of the RFC 7111 URI Fragment Identifier syntax, excluding multi-selections, so every element referenced by a table selection is itself tabular data.
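
The grammar is regular, so a table selection can be checked and parsed with a regular expression. The following Python sketch is one possible translation of the grammar; the function name and the shape of the return value are assumptions made for illustration:

```python
import re

INT = r"[1-9][0-9]*"
CELL = rf"{INT},{INT}"
TABLE_SELECTION = re.compile(
    rf"cell=(?P<cell>{CELL}(?:-{CELL})?)"
    rf"|row=(?P<row>{INT}(?:-{INT})?)"
    rf"|col=(?P<col>{INT}(?:-{INT})?)"
)


def parse_table_selection(address):
    """Return the kind of selection and its one or two positions."""
    match = TABLE_SELECTION.fullmatch(address)
    if not match:
        raise ValueError(f"not a table selection: {address}")
    kind = match.lastgroup  # name of the alternative that matched
    parts = match.group(kind).split("-")
    if kind == "cell":
        # each cell position is a (row, column) pair
        values = [tuple(int(n) for n in part.split(",")) for part in parts]
    else:
        values = [int(part) for part in parts]
    return kind, values


print(parse_table_selection("cell=4,1-6,2"))  # ('cell', [(4, 1), (6, 2)])
```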

5.3 Hierarchical document models

File path

The file path locator format with name file is used to reference a file or directory in a directory tree. The locator value must be a POSIX path, being a string optionally beginning with a slash (/), followed by zero or more file names, separated by slash. A file name is a non-empty sequence of Unicode code points excluding the slash (U+002F) and the null byte (U+0000).

JSON Pointer

The JSON Pointer locator format with name jsonpointer is used to reference a JSON value within a JSON value. The locator value and its semantics are defined in RFC 6901.
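
To fill field value of a locator or to check that an address references an existing element, a JSON Pointer has to be resolved against the document. A minimal Python sketch of the evaluation rules of RFC 6901 for already parsed JSON; the function name is hypothetical and the special array token "-" is rejected here because it never references an existing element:

```python
import re

ARRAY_INDEX = re.compile(r"0|[1-9][0-9]*")


def resolve_pointer(document, pointer):
    """Return the JSON value referenced by a JSON Pointer."""
    if pointer == "":
        return document  # the empty pointer references the whole document
    if not pointer.startswith("/"):
        raise ValueError("JSON Pointer must be empty or start with /")
    value = document
    for token in pointer[1:].split("/"):
        # Unescape ~1 before ~0, as required by RFC 6901
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(value, list):
            if not ARRAY_INDEX.fullmatch(token):
                raise ValueError(f"invalid array index: {token}")
            value = value[int(token)]
        elif isinstance(value, dict):
            value = value[token]
        else:
            raise KeyError(token)
    return value


print(resolve_pointer({"åå": 5}, "/åå"))  # 5
```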

XML Locator

The XML Locator format with name xpath is defined by the following grammar, with rule QName defined in the Namespaces in XML 1.0 specification:

XMLLocator       ::=  ( "/" NodeTest )+ ( "/" AttributeTest )?
NodeTest         ::=  QName Position?
AttributeTest    ::=  "@" QName
Position         ::=  "[" PositiveInteger "]"
PositiveInteger  ::=  [1-9] [0-9]*

Applications MUST NOT use one XML Locator to reference multiple XML elements. For this reason applications MAY always append the string [1] to an XML Locator if it does not end with a Position or AttributeTest.

XML Locator is a proper subset of (X)Path Expressions from XPath specifications, limited to reference individual XML elements or attributes.
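
The grammar and the optional appending of [1] can be approximated in Python as follows. This is a rough sketch only: the pattern for names is a deliberately simplified, ASCII-only stand-in for QName (an assumption, the real rule allows many more characters), and the function name is hypothetical.

```python
import re

NAME = r"[A-Za-z_][A-Za-z0-9_.-]*"  # simplified stand-in for NCName (assumption)
QNAME = rf"(?:{NAME}:)?{NAME}"
XML_LOCATOR = re.compile(rf"(?:/{QNAME}(?:\[[1-9][0-9]*\])?)+(?:/@{QNAME})?")


def normalize_xml_locator(locator):
    """Append [1] unless the locator ends with a Position or AttributeTest."""
    if not XML_LOCATOR.fullmatch(locator):
        raise ValueError(f"not an XML Locator: {locator}")
    if locator.endswith("]") or "@" in locator:
        return locator
    return locator + "[1]"


print(normalize_xml_locator("/records/record"))  # /records/record[1]
```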

6 References

6.1 Normative References

Berners-Lee, T. and Fielding, R. and Masinter, L.: Uniform Resource Identifier (URI): Generic Syntax. RFC 3986, January 2005, http://www.rfc-editor.org/info/rfc3986.
Bradner, S.: Key words for use in RFCs to Indicate Requirement Levels. BCP 14, RFC 2119, March 1997, http://www.rfc-editor.org/info/rfc2119.
Bray, T.: The JavaScript Object Notation (JSON) Data Interchange Format. RFC 8259, December 2017. https://tools.ietf.org/html/rfc8259
Bray, T. et al.: Extensible Markup Language (XML) 1.1 (Second Edition). W3C Recommendation, August 2006. https://www.w3.org/TR/xml11/
Bray, T. et al.: Namespaces in XML 1.0 (Third Edition). W3C Recommendation, December 2009. https://www.w3.org/TR/REC-xml-names/
Bryan, P. and Zyp, K. and Nottingham, M.: JavaScript Object Notation (JSON) Pointer. RFC 6901, April 2013. https://tools.ietf.org/html/rfc6901
Leiba, B.: Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words. BCP 14, RFC 8174, May 2017. http://www.rfc-editor.org/info/rfc8174.

6.2 Informative references

J. Clark and S. DeRose: XML Path Language (XPath) Version 1.0. W3C Recommendation, November 1999. https://www.w3.org/TR/xpath-10/
M. Hausenblas, E. Wilde, and J. Tennison: URI Fragment Identifiers for the text/csv Media Type. RFC 7111, January 2014. https://tools.ietf.org/html/rfc7111
A. Wright, H. Andrews, B. Hutton, and G. Dennis: JSON Schema Draft 2020-12. June 2022. https://json-schema.org/draft/2020-12/json-schema-core.html

Appendices

JSON Schemas

Error records can be validated with the following non-normative, non-exhaustive JSON Schema (schema.json in the specification repository):

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "message": {
      "description": "error message",
      "type": "string",
      "minLength": 1
    },
    "types": { "$ref": "#/$defs/types" },
    "level": { 
      "description": "error level",
      "type": "string",
      "enum": ["error", "warning", "info"],
      "default": "error"
    },
    "position": { "$ref": "#/$defs/position" },
    "id": {
      "description": "error instance identifier",
      "type": "string"
    }
  },
  "$defs": {
    "errors": { 
      "type": "array",
      "items": { "$ref": "#" }
    },
    "types": {
      "type": "array",
      "items": {
        "description": "identifier of an error type",
        "type": "string",
        "minLength": 1
      }
    },
    "position": {
      "description": "error position",
      "anyOf": [
        {
          "description": "locators",
          "type": "array",
          "items": {
            "$ref": "#/$defs/locator" 
          }
        },
        {
          "description": "locator map",
          "type": "object",
          "patternProperties": {
            "^[a-z][a-z0-9]*$": {
              "type": "string"
            }
          },
          "additionalProperties": false
        }
      ]
    },
    "locator": {
      "type": "object",
      "allOf": [
        {
          "properties": {
            "dimension": {
              "type": "string",
              "pattern": "^[a-z][a-z0-9]*$"
            },
            "address": { "type": "string" },
            "value": { }
          }
        }, {
          "properties": {
            "errors": { "$ref": "#/$defs/errors" },
            "reports": {
              "type": "array",
              "items": { "$ref": "#/$defs/report" }
            }
          },
          "not": { "required": ["errors", "reports"] }
        }
      ],
      "required": ["dimension", "address"]
    },
    "report": {
      "type": "object",
      "properties": {
        "types": { "$ref": "#/$defs/types" },
        "errors": { 
          "type": "array",
          "items": { "anyOf": [
              { "$ref": "#" },
              { "type": "null" } ]
          }
        },
        "totalErrors": { "type": "integer", "minimum": 0 },
        "compliances": { 
          "type": "array",
          "items": { "anyOf": [
              { "$ref": "#/$defs/position" },
              { "type": "null" } ]
          }
        },
        "totalCompliances": { "type": "integer", "minimum": 0 },
        "totalFindings": { "type": "integer", "minimum": 0 },
        "complete": { "type": "boolean" },
        "duration": { "type": "number", "minimum": 0 }
      }
    }
  }
}

Additional dimensions

The following locator formats or standards are being considered for addition as dimensions:

fq

The fq tool supports analysis of many binary formats so it could be used to locate elements of binary data models. The locator value would be the format, followed by a colon, followed by a path expression:

{
  "message": "Timestamp of .gz archive must not be in the future",
  "position": {
    "fq": "gzip:.members[0].mtime"
  }
}
{
  "message": "Image width of PNG file is too large",
  "position": {
    "fq": "png:.chunks[0].width"
  }
}

rdfterm

The dimension with name rdfterm uses rule object of the RDF 1.1 N-Triples grammar as locator format to locate an RDF term in an RDF graph, both as defined in RDF 1.1 Concepts and Abstract Syntax.

The locator format can also be used to locate the focus node of a SHACL Validation Report.

turtle

The dimension with name turtle follows RDF Turtle syntax to locate a set of RDF triples in an RDF graph.

triplepattern

The dimension with name triplepattern uses a subset of Property Path Patterns from SPARQL 1.1 Query Language as locator format to locate a set of RDF triples in an RDF graph.

The locator format can also be used to reference the focus node, path and value of a SHACL Validation Report.