graph LR
  validator(validator)
  document --- validator --> errors
  errors -. positions .-> document
1 Introduction
Data validation is a crucial part of managing data quality and interoperability. Validation is applied in many ways and contexts, for instance in input forms and editors with visual feedback, or in schema languages with formal error reports. This diversity of use cases implies a variety of error results, and no common standard exists to express error reports.[1]
The specification of Data Validation Error Format has two goals:
- unify how validation errors are reported by different validators
- address positions of errors in validated documents
Last but not least, the format should help to better separate validation from the presentation of validation results, so that both can be handled by different applications.
The format is strictly limited to errors and error positions. It includes neither other kinds of analysis results, such as statistics and summaries of documents, nor details about validation, such as test cases, schema rules, and individual constraints. Errors can be linked to additional information with error types, but the semantics of these types is out of the scope of this specification.
1.1 Overview
Figure 1 illustrates the validation process with the core concepts used in this specification: a validator checks whether a document conforms to some requirements and returns a list of errors. Each error can refer to its location in the document via a position.
Every document conforms to a document model. For instance, JSON documents conform to the JSON model, and character strings conform to the model “sequence of characters from a known character set”. Document models come with encodings that express documents as documents on a lower level. For instance, JSON documents can be encoded with JSON syntax as Unicode strings, and Unicode strings can be encoded with UTF-8 as sequences of bytes (solid arrows in Figure 2).
Eventually all documents are given as digital objects, encoded as sequences of bytes. Encodings using a sequence of characters are also called textual data formats, in contrast to binary data formats.
An error position is given in the form of one or more locators, each having a dimension and an address. Each dimension refers to a locator format for a set of document models. For instance, JSON Pointer refers to JSON, character and line numbers refer to character strings with defined line breaks, and offsets refer to sequences of elements (Figure 2). Other examples of locator formats include XPath for XML and row/column for tabular data.
Locators can also contain nested errors to address a more specific position within another position and to support error positions in nested documents such as archive files.
graph LR
  JSON -- JSON syntax --> Unicode
  Unicode -- UTF-8 --> Bytes
  Unicode[Unicode string]
  jsonpointer(JSON Pointer)
  char(character number)
  line(line number)
  offset
  style jsonpointer fill:#fff,stroke:#fff
  style char fill:#fff,stroke:#fff
  style line fill:#fff,stroke:#fff
  style offset fill:#fff,stroke:#fff
  jsonpointer -.-> JSON
  char -.-> Unicode
  line -.-> Unicode
  offset -.-> Bytes
1.2 Examples
Documents can be invalid on many levels. For example, the string {"åå":5} is valid JSON, but it might be invalid if element åå is expected to hold a string instead of a number (Example 1). The error can be located with JSON Pointer in the JSON document and with character and line number:
{
"message": "Expected string, got number at element /åå",
"position": { "jsonpointer": "/åå", "char": "7", "line": "1" }
}
The string could also be part of a larger, newline-delimited JSON document. In this case it makes sense to use a nested error (Example 2):
{
"message": "Invalid document at line 7",
"position": [ {
"dimension": "line",
"address": "7",
"errors": [ {
"message": "Expected string, got number at element /åå",
"position": {
"jsonpointer": "/åå", "char": "7", "line": "1"
}
} ]
} ]
}
The document could also be invalid at JSON syntax level, for example if the closing } is missing (Example 3):
{
"message": "Unexpected end of JSON input at character 8",
"position": { "line": "1", "char": "8" }
}
A similar document could be invalid on byte level. The following table illustrates the document from Example 1 with its ninth byte replaced by a value not allowed in UTF-8. It is common practice to replace such bytes with the Unicode replacement character U+FFFD, but the resulting Unicode string is still invalid JSON (Example 4). The example also illustrates another locator format, linecol, which gives a character position by line and column.
Byte | 7b | 22 | c3 a5 | c3 a5 | 22 | 3a | c0 | 7d |
---|---|---|---|---|---|---|---|---|
Code point | U+007B | U+0022 | U+00E5 | U+00E5 | U+0022 | U+003A | ERROR ⇒ U+FFFD | U+007D |
Character | { | " | å | å | " | : | � | } |
[
{
"level": "warning",
"message": "Ill-formed UTF-8 byte sequence at offset 8",
"position": { "line": "1", "char": "7", "offset": "8" }
},
{
"level": "error",
"message": "Expected JSON value at line 1, column 7",
"position": { "line": "1", "char": "7", "linecol": "1:7" }
}
]
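The byte-level error of Example 4 can be reproduced with a short Python sketch. This is only an illustration under stated assumptions: the helper name `byte_level_error` is hypothetical and not part of this specification, and the character number is counted over the whole document, as in the single-line example.

```python
from typing import Optional

# Bytes of the document from Example 4; the byte at offset 8 is 0xC0.
DOC = bytes.fromhex("7b22c3a5c3a5223ac07d")


def byte_level_error(data: bytes) -> Optional[dict]:
    """Return an error for the first ill-formed UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the zero-based byte offset of the ill-formed sequence,
        # so all bytes before it decode cleanly.
        char = len(data[: exc.start].decode("utf-8")) + 1
        return {
            "level": "warning",
            "message": f"Ill-formed UTF-8 byte sequence at offset {exc.start}",
            "position": {"char": str(char), "offset": str(exc.start)},
        }
    return None


# Lenient decoding replaces the ill-formed byte with U+FFFD instead of failing.
replaced = DOC.decode("utf-8", errors="replace")
```

Here `byte_level_error(DOC)` reports offset 8 and character 7, matching the first error of Example 4, while `replaced` is the string that a JSON parser would subsequently reject.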
1.3 Conformance requirements
The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “NOT RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in BCP 14 (RFC 2119 and RFC 8174) when, and only when, they appear in all capitals, as shown here.
Only Section 2 to Section 4, excluding examples and notes, and the list of normative references are normative parts of this specification.
Specific support of Data Validation Error Format by an application depends on two options. Both MUST be documented by applications:
- Support of either the full format or only positions in condensed form (locator maps)
- The set of supported dimensions
2 Errors
An Error is a JSON object with:
- mandatory field message with an error message, being a non-empty string. Applications MAY use a default value for error messages.
- optional field types with an array of error types, each being a non-empty string. Error types can be used for grouping errors and they SHOULD be URIs. Repetitions of identical strings in the same array MUST be ignored.
- optional field level with an error level, being one of the strings "error", "warning", or "info". Applications MUST NOT differentiate between error level "error" and no error level.
- optional field position with a position. Applications MUST NOT differentiate between empty position and no position.
Language and localization of error messages is out of the scope of this specification.
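The rules for Error objects can be illustrated with a minimal Python sketch. The function name `normalize_error` and the choice to normalize by dropping redundant fields are assumptions made for illustration only; the specification merely requires that applications treat the equivalent forms alike.

```python
LEVELS = ("error", "warning", "info")


def normalize_error(error: dict) -> dict:
    """Check the fields of an Error and return a canonical copy (hypothetical helper)."""
    message = error.get("message")
    if not isinstance(message, str) or not message:
        raise ValueError("field 'message' must be a non-empty string")
    result = {"message": message}

    if "types" in error:
        types = error["types"]
        if not isinstance(types, list) or any(
            not isinstance(t, str) or not t for t in types
        ):
            raise ValueError("field 'types' must be an array of non-empty strings")
        # Repetitions of identical strings are ignored; order is preserved.
        result["types"] = list(dict.fromkeys(types))

    level = error.get("level", "error")
    if level not in LEVELS:
        raise ValueError("field 'level' must be error, warning, or info")
    # Level "error" and a missing level are equivalent, so it is left out.
    if level != "error":
        result["level"] = level

    # An empty position and a missing position are equivalent.
    if error.get("position"):
        result["position"] = error["position"]
    return result
```

With this sketch, two Error objects that differ only in an explicit level "error", an empty position, or repeated types normalize to the same dictionary.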
3 Positions
An error can have a position. A position is given
- either in full form as a JSON array of locators,
- or in condensed form as a locator map.
Every locator map can be transformed to an equivalent array of locators. The reverse transformation is only possible if there is at most one locator per dimension and no locator has nested errors.
Locators of the same position should refer to roughly the “same” part of a document or at least have a common intersection. This requirement is difficult to formalize because locators refer to different document models, so it is not yet a normative part of this specification.
3.1 Locators
A Locator is a JSON object with
- mandatory field dimension with the name of a dimension
- mandatory field address with the address, being a string conforming to the locator format identified by the name of the dimension.
- optional field errors with an array of nested errors within the located part of a document.
{ "dimension": "line", "address": "7" }
Nested errors make it possible to reference locations within nested documents (Example 6 and Example 7):
{
"message": "Invalid value in line 2 in file example.txt in file archive.zip",
"position": [ {
"dimension": "file",
"address": "archive.zip",
"errors": [ {
"message": "Invalid value in line 2 in file example.txt",
"position": [ {
"dimension": "file",
"address": "example.txt",
"errors": [ {
"message": "Invalid value in line 2",
"position": { "line": "2" }
} ]
} ]
} ]
} ]
}
Example 6: error in file example.txt in archive archive.zip
{
"message": "Invalid character in line 7, column 3",
"position": [ {
"dimension": "linecol",
"address": "7:3"
}, {
"dimension": "line",
"address": "7",
"errors": [ {
"message": "Invalid character 3",
"position": { "char": "3" }
} ]
} ]
}
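Consumers of nested errors typically need to walk the structure recursively. The following Python sketch shows one possible traversal; the function names `as_locators` and `walk` are hypothetical, and the representation of a path as a tuple of (dimension, address) pairs is an assumption made for illustration.

```python
def as_locators(position):
    """Return a position as a list of locators, whether given in full form or as a locator map."""
    if isinstance(position, dict):
        return [{"dimension": d, "address": a} for d, a in position.items()]
    return list(position or [])


def walk(error, prefix=()):
    """Yield (path, message) for an error and all errors nested inside its locators."""
    locators = as_locators(error.get("position"))
    if not locators:
        # Error without position: report it at the enclosing path.
        yield prefix, error["message"]
    for locator in locators:
        path = prefix + ((locator["dimension"], locator["address"]),)
        yield path, error["message"]
        for nested in locator.get("errors", []):
            yield from walk(nested, path)
```

Applied to the nested archive error above, the last item yielded has the path file archive.zip, file example.txt, line 2 and the message of the innermost error.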
3.2 Locator maps
A locator map is a JSON object that maps names of dimensions to addresses.
{ "line": "7", "char": "42" }
A locator map is equivalent to an array of locators with key and value of the JSON object entries mapped to fields dimension and address of each locator. An array of locators can be reduced to a locator map by dropping all nested errors and selecting only the first locator of each locator format.
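The two transformations between locator maps and arrays of locators can be sketched in Python as follows. The function names are hypothetical, and treating an empty errors array like a missing one is an assumption of this sketch.

```python
def map_to_locators(locator_map: dict) -> list:
    """Expand a locator map into the equivalent array of locators."""
    return [{"dimension": d, "address": a} for d, a in locator_map.items()]


def locators_to_map(locators: list) -> dict:
    """Reduce an array of locators to a locator map.

    Nested errors are dropped and only the first locator of each
    dimension is kept, so the reduction can lose information.
    """
    result = {}
    for locator in locators:
        result.setdefault(locator["dimension"], locator["address"])
    return result


def is_lossless(locators: list) -> bool:
    """Check whether the reduction to a locator map keeps all information."""
    dimensions = [locator["dimension"] for locator in locators]
    has_nested = any(locator.get("errors") for locator in locators)
    return len(dimensions) == len(set(dimensions)) and not has_nested
```

For instance, `map_to_locators({"line": "7", "char": "42"})` gives two locators that `locators_to_map` turns back into the same map, whereas a position with nested errors fails the `is_lossless` check.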
Applications MAY restrict their support of Data Validation Error Format to positions with locator maps. In this case nested errors and positions with multiple locators per dimension are not supported.
4 Dimensions
A dimension is a defined method to address parts of a document. Each dimension has:
- a unique name, being a string that starts with a lowercase letter a to z, optionally followed by a sequence of lowercase letters, digits 0 to 9, and/or -.
- a locator format, being a formal language of Unicode strings to encode references to parts of a document. The strings of this language are called addresses.
- a document model matching the locator format.
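The naming rule for dimensions can be checked with a regular expression. The following Python sketch uses the same pattern as the non-normative JSON Schema in the appendix; the function name `is_dimension_name` is hypothetical.

```python
import re

# Lowercase letter, then any number of lowercase letters, digits, or hyphens.
DIMENSION_NAME = re.compile(r"[a-z][a-z0-9-]*")


def is_dimension_name(name: str) -> bool:
    """Check whether a string is a syntactically valid dimension name."""
    # fullmatch avoids accepting a trailing newline, which "$" would allow.
    return DIMENSION_NAME.fullmatch(name) is not None
```

Names such as jsonpointer or rfc5147 pass this check, while names starting with a digit, a hyphen, or an uppercase letter do not.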
Applications SHOULD support the following dimensions:
name | locator format | document model |
---|---|---|
offset | offset number | sequence of elements |
char | character number | sequence of characters or code points |
cell | cell reference | tabular data |
file | file path | directory tree |
line | line number | sequence of lines |
linecol | line and column | sequence of characters with line breaks |
jsonpointer | JSON Pointer | JSON |
xpath | XML Path Expression | XML or compatible hierarchies |
See appendix Additional dimensions for more dimensions.
Sequential document models
Offset number
The offset number locator format with name offset is used to reference an element in a sequence of elements. The locator value is a non-negative integer encoded as a string without leading zeroes. The first element has number zero (locator value 0).
Character number
The character number locator format with name char is used to reference a character in a sequence of characters from a character set. The locator value is a positive integer encoded as a string without leading zeroes. The first character has number one (locator value 1).
In Unicode strings, this locator format refers to code points instead of visual characters.
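The difference between zero-based offsets and one-based character numbers is easy to get wrong, so the following Python sketch parses both kinds of addresses and maps a character number to the byte offset of that character in the UTF-8 encoding. All function names are hypothetical; the sketch relies on Python strings being sequences of code points.

```python
import re


def parse_offset(address: str) -> int:
    """Parse an offset number: non-negative integer without leading zeroes."""
    if not re.fullmatch(r"0|[1-9][0-9]*", address):
        raise ValueError(f"invalid offset number: {address!r}")
    return int(address)


def parse_char(address: str) -> int:
    """Parse a character number: positive integer without leading zeroes."""
    if not re.fullmatch(r"[1-9][0-9]*", address):
        raise ValueError(f"invalid character number: {address!r}")
    return int(address)


def char_to_byte_offset(text: str, address: str) -> str:
    """Return the offset address of a character in the UTF-8 encoding of text."""
    number = parse_char(address)
    if number > len(text):
        raise ValueError("character number exceeds length of text")
    # Bytes used by all code points before the addressed character.
    return str(len(text[: number - 1].encode("utf-8")))
```

For the string {"åå":5} from Example 1, character number 7 corresponds to byte offset 8 because each å takes two bytes in UTF-8.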
Line number
Possibly requires a more detailed specification. For instance, line numbers depend on a common definition of line breaks; some formats include U+0B, U+0C, U+85, U+2028, U+2029…
Line and Column
Line number and character position within the line, separated by a colon (:).
…
Tabular document models
Cell reference
The cell reference locator format with name cell is used to reference a cell or a range of cells in a table as known from spreadsheet software. The locator value consists of a pair of column and row, optionally followed by a colon (:) and another pair of column and row. Columns are given in hexavigesimal system (A=1, B=2…, Z=26, AA=27, AB=28…) and rows are given by numbers, starting from 1.
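A minimal Python sketch of parsing cell references follows. The function names are hypothetical, and two details are assumptions not stated in the text above: column letters are taken to be uppercase, and row numbers are taken to have no leading zeroes.

```python
import re

# Assumption: uppercase column letters, row numbers without leading zeroes.
CELL = re.compile(r"([A-Z]+)([1-9][0-9]*)")


def column_number(letters: str) -> int:
    """Convert column letters to a number (A=1, Z=26, AA=27, AB=28)."""
    number = 0
    for letter in letters:
        number = number * 26 + (ord(letter) - ord("A") + 1)
    return number


def parse_cell(address: str) -> list:
    """Parse a cell or range into a list of one or two (column, row) pairs."""
    parts = address.split(":")
    if len(parts) > 2:
        raise ValueError(f"invalid cell reference: {address!r}")
    cells = []
    for part in parts:
        match = CELL.fullmatch(part)
        if match is None:
            raise ValueError(f"invalid cell reference: {address!r}")
        cells.append((column_number(match.group(1)), int(match.group(2))))
    return cells
```

For example, `parse_cell("B3:AA10")` yields the pairs (2, 3) and (27, 10).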
Hierarchical document models
File path
The file path locator format with name file is used to reference a file or directory in a directory tree. The locator value must be a POSIX path, being a string optionally beginning with a slash (/), followed by zero or more file names, separated by slash. A file name is a non-empty sequence of Unicode code points excluding the slash (U+002F) and the null character (U+0000).
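The file path syntax can be checked with a few lines of Python. This sketch follows a literal reading of the definition above, so the empty string and a single slash are accepted (zero file names) while a trailing or doubled slash is rejected (empty file name). The function name `is_file_path` is hypothetical, and whether this literal reading is intended is an open assumption.

```python
def is_file_path(address: str) -> bool:
    """Check a file path address against a literal reading of the definition."""
    if "\x00" in address:
        return False
    # Strip the optional leading slash.
    rest = address[1:] if address.startswith("/") else address
    if rest == "":
        return True  # zero file names
    # Every name between slashes must be non-empty.
    return all(name != "" for name in rest.split("/"))
```

Under this reading, archive.zip and /dir/example.txt are valid addresses, while dir//example.txt is not.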
Depending on the document model, file names may be defined as binary strings instead of Unicode strings. In most cases UTF-8 encoding can be assumed to map Unicode code points to bytes but (TODO: this requires more careful examination).
JSON Pointer
…
XML Path Expression
TODO: Subset of XPath, see https://www.w3.org/TR/xpath20/#id-path-expressions (must start with /, no filter expressions, no reverse steps, no predicates except numbers).
Graph document models
…
5 References
5.1 Normative References
Berners-Lee, T. and Fielding, R. and Masinter, L.: Uniform Resource Identifier (URI): Generic Syntax. RFC 3986, January 2005, http://www.rfc-editor.org/info/rfc3986.
Bradner, S.: Key words for use in RFCs to Indicate Requirement Levels. BCP 14, RFC 2119, March 1997, http://www.rfc-editor.org/info/rfc2119.
Bray, T.: The JavaScript Object Notation (JSON) Data Interchange Format. RFC 8259, December 2017, https://tools.ietf.org/html/rfc8259.
Bryan, P. and Zyp, K. and Nottingham, M.: JavaScript Object Notation (JSON) Pointer. RFC 6901, April 2013, https://tools.ietf.org/html/rfc6901.
Leiba, B.: Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words. BCP 14, RFC 8174, May 2017, http://www.rfc-editor.org/info/rfc8174.
5.2 Informative references
- JSON Schema schema language
- XPath XML Path Language
Appendices
JSON Schemas
Error records can be validated with the non-normative JSON Schema schema.json in the specification repository. Rules not covered by the JSON Schema include:
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"$defs": {
"message": {
"type": "string",
"minLength": 1,
"description": "error message"
},
"types": {
"type": "array",
"items": {
"type": "string",
"minLength": 1,
"description": "identifier of an error type"
}
},
"level": {
"type": "string",
"enum": ["error", "warning", "info"],
"default": "error",
"description": "error level"
},
"locator": {
"type": "object",
"properties": {
"dimension": {
"type": "string",
"pattern": "^[a-z][a-z0-9-]*$"
},
"address": {
"type": "string"
},
"errors": {
"type": "array",
"items": { "$ref": "#" }
}
},
"required": ["dimension", "address"]
},
"position": {
"description": "error position",
"anyOf": [
{
"type": "array",
"items": { "$ref": "#/$defs/locator" }
},
{
"type": "object",
"patternProperties": {
"^[a-z][a-z0-9-]*$": {
"type": "string"
}
},
"additionalProperties": false
}
]
}
},
"properties": {
"message": { "$ref": "#/$defs/message" },
"types": { "$ref": "#/$defs/types" },
"level": { "$ref": "#/$defs/level" },
"position": { "$ref": "#/$defs/position" }
},
"required": ["message"]
}
Additional dimensions
The following dimensions are not a normative part of the specification because they have not been fully specified yet:
name | locator format | document models |
---|---|---|
fq | format and path | all binary formats supported by fq (see Example 9) |
rfc5147 | RFC 5147 | characters and lines |
rfc7111 | RFC 7111 | tabular data |
id | Unicode string | data models that refer to elements with an identifier |
rfc5147, in contrast to char and line, also supports ranges. rfc7111, in contrast to cell, also supports ranges and multi-selection.
{
"message": "Timestamp must not be in the future!",
"position": {
"fq": "gzip:.members[0].mtime"
}
}
The following locator formats or standards are yet to be evaluated for their use as dimensions:
- RangeAddress of References in OpenDocument
- IIIF (section in an image)
- RDF graphs (every subset of an RDF graph is another RDF graph)
- SHACL focus node and result path serialized in a subset of RDF Turtle
- Subset of SQL SELECT statements
- PDF highlighted text annotations
- Some variant of Property Graph Exchange Format (PG) and/or CYPHER Query for property graphs
- PICA Path
- MARCspec
- …
Changes
This document is managed in a revision control system at https://github.com/gbv/validation-error-format, including an issue tracker.
Version 0.1.0
Work in progress.
Footnotes
[1] Notable exceptions are formats from software development used in unit testing, such as JUnit XML and Test Anything Protocol.