graph LR document --- validator --> errors errors -. positions .-> document validator(validator)
1 Introduction
Data validation is a crucial part of managing data quality and interoperability. Validation is applied in many ways and contexts, for instance input forms and editors with visual feedback or schema languages with formal error reports. The diversity of use cases implies a variety of error results. No common standard exists to express error reports.[1]
The specification of Data Validation Error Format has two goals:
- unify how validation errors are reported by different validators
- address positions of errors in validated documents
Last but not least, the format should help to separate validation from the presentation of validation results, so that each can be handled by a different application.
The format is strictly limited to errors and error positions. It does not include other kinds of analysis results such as statistics and summaries of documents, nor does it include details about validation such as test cases, schema rules, and individual constraints. Errors can be linked to additional information with error types, but the semantics of these types is out of the scope of this specification.
1.1 Overview
Figure 1 illustrates the validation process with core concepts used in this specification: a validator checks whether a document conforms to some requirements and returns a list of errors. Each error can refer to its locations in the document via positions.
Every document conforms to a document model. For instance JSON documents conform to the JSON model, and character strings conform to the model “sequence of characters from a known character set”. Document models come with encodings that express documents in the form of documents on a lower level. For instance JSON documents can be encoded with JSON syntax as Unicode strings and Unicode strings can be encoded with UTF-8 as sequences of bytes (solid arrows in Figure 2).
Eventually all documents are given as digital objects, encoded as sequences of bytes. Encodings using a sequence of characters are also called textual data formats, in contrast to binary data formats.
Error positions are given in the form of locators, each expressed in a locator format. Locator formats refer to sets of document models: for instance JSON Pointer refers to JSON, character and line numbers refer to character strings with defined line breaks, and offsets refer to sequences of elements (dotted arrows in Figure 2).
graph LR JSON -- JSON syntax --> Unicode Unicode -- UTF-8 --> Bytes Unicode[Unicode string] jsonpointer(JSON Pointer) char(character number) line(line number) offset style jsonpointer fill:#fff,stroke:#fff style char fill:#fff,stroke:#fff style line fill:#fff,stroke:#fff style offset fill:#fff,stroke:#fff jsonpointer -.-> JSON char -.-> Unicode line -.-> Unicode offset -.-> Bytes
1.2 Example
Documents can be invalid compared to document models on many levels. For example the string `{"åå":5}` is valid JSON but it might be invalid if element `åå` is expected to hold a string instead of a number (Example 1). The error can be located with JSON Pointer in the JSON document and with line and character number of its encoding as Unicode string:
{
"message": "Expected string, got number at element /åå",
"position": { "jsonpointer": "/åå", "line": "1", "char": "7" }
}
The document could also be invalid at JSON syntax level, for example if the closing `}` is missing (Example 2). The error is also located by its byte offset:
{
"message": "Unexpected end of JSON input at character 8",
"position": { "line": "1", "char": "8", "offset": "9" }
}
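The relation between the `line`, `char`, and `offset` locators of the same position can be sketched in Python. This is a minimal illustration, not part of the specification: the function name `locator_map` is hypothetical, the document is assumed to be encoded in UTF-8, and only U+000A is assumed to count as a line break.

```python
def locator_map(text: str, index: int) -> dict[str, str]:
    """Locator map for the code point at 0-based `index` of `text`.

    Assumptions: UTF-8 encoding, only U+000A is a line break.
    """
    prefix = text[:index]
    return {
        "line": str(prefix.count("\n") + 1),         # lines start at 1
        "char": str(index + 1),                      # characters start at 1
        "offset": str(len(prefix.encode("utf-8"))),  # byte offsets start at 0
    }


document = '{"åå":5}'
# Position of the number 5 from Example 1
print(locator_map(document, document.index("5")))
```

Character number and byte offset differ here because each `å` is one code point but two bytes in UTF-8.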
A similar document could be invalid on byte level. The following table illustrates the document from Example 1 with its ninth byte replaced by a value not allowed in UTF-8. It is common practice to replace such bytes with the Unicode replacement character `U+FFFD`, but the resulting Unicode string is still invalid JSON syntax (Example 3). The example also illustrates another locator format, `linecol`, which gives a character position by line and column.
Byte | 7b | 22 | c3 a5 | c3 a5 | 22 | 3a | c0 | 7d |
---|---|---|---|---|---|---|---|---|
Code point | U+007B | U+0022 | U+00E5 | U+00E5 | U+0022 | U+003A | ERROR ⇒ U+FFFD | U+007D |
Character | { | " | å | å | " | : | � | } |
[
{
"level": "warning",
"message": "Ill-formed UTF-8 byte sequence at offset 8",
"position": { "line": "1", "offset": "8" }
},
{
"level": "error",
"message": "Expected JSON value at line 1, column 7",
"position": { "line": "1", "char": "7", "linecol": "1:7", "offset": "8" }
}
]
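How a validator might detect the ill-formed byte and derive the first record of Example 3 can be sketched with Python's built-in UTF-8 decoder. The function name `utf8_warning` is hypothetical and the message text is simply copied from the example; the sketch only reports the `offset` locator.

```python
from typing import Optional

# Bytes of the document from the table above
data = bytes.fromhex("7b22c3a5c3a5223ac07d")


def utf8_warning(data: bytes) -> Optional[dict]:
    """Return a warning for the first ill-formed UTF-8 sequence, if any."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as e:
        # e.start is the 0-based byte index of the offending sequence
        return {
            "level": "warning",
            "message": f"Ill-formed UTF-8 byte sequence at offset {e.start}",
            "position": {"offset": str(e.start)},
        }
    return None


print(utf8_warning(data))

# Common practice: replace ill-formed bytes with U+FFFD and continue
repaired = data.decode("utf-8", errors="replace")
print(repaired)
```

The repaired string can then be passed to a JSON parser, which would report the second error of Example 3.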
1.3 Conformance requirements
The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “NOT RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in BCP 14 (RFC 2119 and RFC 8174) when, and only when, they appear in all capitals, as shown here.
Only Section 2 to Section 4, excluding examples and notes, and the list of normative references are normative parts of this specification.
Specific support of Data Validation Error Format by an application depends on two options. Both MUST be documented by applications:
- Support of either the full format or only the condense form of positions (locator maps)
- The set of supported locator formats
2 Errors
An Error is a JSON object with:
- optional field `message` with an error message, being a non-empty string
- optional field `types` with an array of error types, each being a non-empty string. Error types can be used for grouping errors and they SHOULD be URIs. Repetitions of identical strings in the same array SHOULD be ignored.
- optional field `level` with an error level, being one of the strings `error` or `warning`.
- optional field `position` with positions
An error is also called a warning if field `level` has value `warning`.
Applications SHOULD add a default value for errors without field `message`.
Language and localization of error messages are out of the scope of this specification.
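An application consuming errors might apply the two SHOULD rules above roughly as follows. This is a sketch under assumptions: the function names and the default message text `"Validation error"` are hypothetical and not defined by this specification.

```python
def normalize_error(error: dict, default_message: str = "Validation error") -> dict:
    """Return a copy of `error` with a default message and unique error types."""
    result = dict(error)
    # Add a default value for errors without field "message"
    result.setdefault("message", default_message)
    if "types" in result:
        # Ignore repetitions of identical strings, keeping the first occurrence
        result["types"] = list(dict.fromkeys(result["types"]))
    return result


def is_warning(error: dict) -> bool:
    """An error is called a warning if field "level" has value "warning"."""
    return error.get("level") == "warning"


print(normalize_error({"types": ["a", "b", "a"], "level": "warning"}))
```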
3 Positions
An error can have one or more positions. Positions are
- either a JSON array of locators (detailed form),
- or a locator map (condense form).
Every locator map can be transformed to an equivalent array of locators. The reverse transformation is not always possible.
Locators of the same positions SHOULD have a non-empty intersection.
3.1 Locators
A Locator is a JSON object with
- mandatory field `format` with the name of a locator format, being a non-empty string
- mandatory field `value` with the locator value, being a string
- optional field `errors` with an array of errors within the located document fragment.
Locator format and locator value MAY also be referred to as “dimension” and “address” of a locator.
{ "format": "line", "value": "7" }
Errors in the `errors` field are called nested errors. Nested errors make it possible to reference locations within nested documents (Example 5 and Example 6):
{
"message": "Invalid value in line 2 in file example.txt in file archive.zip",
"position": [ {
"format": "file",
"value": "archive.zip",
"errors": [ {
"message": "Invalid value in line 2 in file example.txt",
"position": [ {
"format": "file",
"value": "example.txt",
"errors": [ {
"message": "Invalid value in line 2",
"position": { "line": "2" }
} ]
} ]
} ]
} ]
}
Nested error in file `example.txt` in archive `archive.zip`
{
"message": "Invalid character in line 7, column 3",
"position": [ {
"format": "linecol",
"value": "7:3"
}, {
"format": "line",
"value": "7",
"errors": [ {
"message": "Invalid character 3",
"position": { "char": "3" }
} ]
} ]
}
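Nested errors can be traversed recursively to find the most specific errors together with the chain of locators leading to them. The following Python sketch is one possible way to do this; the function name `innermost_errors` is hypothetical.

```python
def innermost_errors(error: dict, path: tuple = ()):
    """Yield (path, error) for every error without nested errors.

    `path` is a tuple of (format, value) pairs of the enclosing locators.
    """
    position = error.get("position", [])
    found_nested = False
    if isinstance(position, list):  # detailed form; locator maps cannot nest
        for locator in position:
            step = path + ((locator["format"], locator["value"]),)
            for nested in locator.get("errors", []):
                found_nested = True
                yield from innermost_errors(nested, step)
    if not found_nested:
        yield path, error


example6 = {
    "message": "Invalid character in line 7, column 3",
    "position": [
        {"format": "linecol", "value": "7:3"},
        {"format": "line", "value": "7", "errors": [
            {"message": "Invalid character 3", "position": {"char": "3"}},
        ]},
    ],
}
for path, err in innermost_errors(example6):
    print(path, err["message"])
```

Applied to Example 5, the sketch yields a single error reached via the two `file` locators for `archive.zip` and `example.txt`.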
3.2 Locator maps
A locator map is a JSON object that maps locator format names to locator values.
{ "line": "7", "char": "42" }
A locator map is equivalent to an array of locators with key and value of the JSON object entries mapped to fields `format` and `value` of each locator. An array of locators can be reduced to a locator map by dropping all nested errors and selecting only the first locator of each locator format.
Applications MAY restrict their support of Data Validation Error Format to positions with locator maps. In this case nested positions and multiple positions of the same locator format are not supported.
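Both transformations described above can be sketched in a few lines of Python. The function names are hypothetical; the reduction is lossy as described, since nested errors and repeated locator formats are dropped.

```python
def map_to_locators(locator_map: dict) -> list:
    """Expand a locator map (condense form) into an array of locators."""
    return [{"format": name, "value": value} for name, value in locator_map.items()]


def locators_to_map(locators: list) -> dict:
    """Reduce an array of locators to a locator map.

    Nested errors are dropped and only the first locator
    of each locator format is kept.
    """
    result = {}
    for locator in locators:
        result.setdefault(locator["format"], locator["value"])
    return result


print(map_to_locators({"line": "7", "char": "42"}))
print(locators_to_map([
    {"format": "line", "value": "7", "errors": [{"message": "nested"}]},
    {"format": "line", "value": "9"},
]))
```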
4 Locator formats
A locator format is a formal language of Unicode strings to locate positions in a document. The strings of the language are called locator values of the locator format.
Each locator format has a unique locator format name. The name is a string that starts with a lowercase letter `a` to `z`, optionally followed by a sequence of lowercase letters, digits `0` to `9`, and/or `-`.
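The rule for locator format names translates directly into a regular expression. A minimal Python check, with the hypothetical function name `is_format_name`, could look like this:

```python
import re

# Lowercase letter, then any number of lowercase letters, digits, or "-"
FORMAT_NAME = re.compile(r"[a-z][a-z0-9-]*")


def is_format_name(name: str) -> bool:
    # fullmatch avoids the pitfall of "$" also matching before a trailing newline
    return FORMAT_NAME.fullmatch(name) is not None


for name in ["jsonpointer", "x-1", "1line", "Line", ""]:
    print(repr(name), is_format_name(name))
```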
Each locator format can encode positions of documents that conform to a set of matching document models.
The set of normative locator formats has not been finally specified yet. The final version of this specification may need to define a registry of locator formats. The following locator formats will likely be included:
name | locator format | document models |
---|---|---|
`offset` | offset number | sequence of elements |
`char` | character position | sequence of characters or code points |
`cell` | cell reference | tabular data models |
`file` | file path | directory tree |
`line` | line number (first: 1) | sequence of lines |
`linecol` | line number and column | sequence of characters with line breaks |
`jsonpointer` | JSON Pointer | JSON |
`xpath` | XPath (or a subset) | XML |
`fq` | format and path | all binary formats supported by fq (see Example 8) |
The locator formats require a more detailed specification. For instance, line numbers depend on a common definition of line breaks: some formats also treat U+0B, U+0C, U+85, U+2028, U+2029… as line breaks.
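The effect of differing line break definitions can be seen with Python's standard string methods. The sketch below compares a definition that only accepts U+000A with the broader definition used by `str.splitlines`, which also treats code points such as U+000B, U+000C, U+0085, U+2028, and U+2029 as line boundaries. Both function names are hypothetical and neither definition is mandated by this specification.

```python
def line_number_lf(text: str, index: int) -> int:
    """1-based line number of position `index`, counting only U+000A."""
    return text.count("\n", 0, index) + 1


def line_number_splitlines(text: str, index: int) -> int:
    """1-based line number using the line boundaries of str.splitlines.

    A sentinel character keeps the last line non-empty, so the
    result is the number of line breaks before `index` plus one.
    """
    return len((text[:index] + "x").splitlines())


text = "a\x0bb\nc\u2028d"
index = text.index("d")
print(line_number_lf(text, index), line_number_splitlines(text, index))
```

The same character ends up on line 2 or on line 4 depending on the definition, which is why a `line` locator is only meaningful together with an agreed definition of line breaks.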
{
"message": "Timestamp must not be in the future!",
"position": {
"fq": "gzip:.members[0].mtime"
}
}
More candidates of locator formats to be specified:
- RFC 5147 for ranges of lines and characters
- RFC 7111 for tabular data
- IIIF (section in an image)
- RDF graphs (every subset of an RDF graph is another RDF graph)
- Subsets of query languages (SQL, SPARQL…)
- PDF highlighted text annotations
- `id` for data models that refer to elements with an identifier
- PICA Path
- MARCspec
- …
Offset number
The offset number locator format with name `offset` is used to reference an element in a sequence of elements. The locator value is a non-negative integer encoded as string without leading zeroes. The first element has number zero (locator value `0`).
Character position
The character position locator format with name `char` is used to reference a character in a sequence of characters from a character set. The locator value is a positive integer encoded as string without leading zeroes. The first character has number one (locator value `1`).
In Unicode strings, this locator format refers to code points instead of visual characters.
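Locator values of the offset number and character position formats can be checked with simple patterns. The following Python sketch uses hypothetical function names; the explicit class `[0-9]` is used because `\d` would also accept non-ASCII digits in Python.

```python
import re

OFFSET_VALUE = re.compile(r"0|[1-9][0-9]*")  # non-negative, no leading zeroes
CHAR_VALUE = re.compile(r"[1-9][0-9]*")      # positive, no leading zeroes


def is_offset_value(value: str) -> bool:
    return OFFSET_VALUE.fullmatch(value) is not None


def is_char_value(value: str) -> bool:
    return CHAR_VALUE.fullmatch(value) is not None


print(is_offset_value("0"), is_char_value("0"))    # first element vs. no character
print(is_offset_value("07"), is_char_value("42"))
```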
Cell reference
The cell reference locator format with name `cell` is used to reference a cell or a range of cells in a table as known from spreadsheet software. The locator value consists of a pair of column and row, optionally followed by colon (`:`) and another pair of column and row. Columns are given in hexavigesimal system (A=1, B=2…, Z=26, AA=27, AB=28…) and rows are given by numbers, starting from 1.
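Parsing a cell reference can be sketched as follows. The exact syntax is not fixed yet by this specification, so the sketch assumes uppercase column letters directly followed by the row number, as in spreadsheet software (for example `B2` or `A1:C3`); the function names are hypothetical.

```python
import re

# Assumed syntax: uppercase column letters, then a row number without leading zeroes
CELL = re.compile(r"([A-Z]+)([1-9][0-9]*)(?::([A-Z]+)([1-9][0-9]*))?")


def column_number(letters: str) -> int:
    """Convert hexavigesimal column letters to a number (A=1, Z=26, AA=27)."""
    number = 0
    for letter in letters:
        number = number * 26 + (ord(letter) - ord("A") + 1)
    return number


def parse_cell(value: str) -> tuple:
    """Return ((column, row), (column, row)) for the start and end of a range."""
    match = CELL.fullmatch(value)
    if match is None:
        raise ValueError(f"not a cell reference: {value!r}")
    col1, row1, col2, row2 = match.groups()
    start = (column_number(col1), int(row1))
    end = (column_number(col2), int(row2)) if col2 else start
    return start, end


print(parse_cell("A1"))
print(parse_cell("B2:AA10"))
```

A single cell is returned as a range whose start and end are identical.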
File path
The file path locator format with name `file` is used to reference a file or directory in a directory tree. The locator value must be a POSIX path, being a string optionally beginning with a slash (`/`), followed by zero or more file names, separated by slash. A file name is a non-empty sequence of Unicode code points excluding the slash (`U+002F`) and the null character (`U+0000`).
Depending on the document model, file names may be defined as binary strings instead of Unicode strings. In most cases UTF-8 encoding can be assumed to map Unicode code points to bytes but (TODO: this requires more careful examination).
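The syntax of file path locator values given above can be expressed as a regular expression. This Python sketch, with the hypothetical function name `is_file_path`, follows the wording literally, so the empty string and a single slash are accepted while empty file names and a trailing slash are not. It operates on Unicode strings only and does not address the open question of binary file names.

```python
import re

# Optional leading slash, then zero or more file names separated by slash.
# A file name is a non-empty run of code points other than "/" and U+0000.
FILE_PATH = re.compile(r"/?(?:[^/\x00]+(?:/[^/\x00]+)*)?")


def is_file_path(value: str) -> bool:
    return FILE_PATH.fullmatch(value) is not None


for value in ["archive.zip", "/dir/example.txt", "/", "a//b", "a/"]:
    print(repr(value), is_file_path(value))
```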
5 References
5.1 Normative References
Berners-Lee, T. and Fielding, R. and Masinter, L.: Uniform Resource Identifier (URI): Generic Syntax. RFC 3986, January 2005, http://www.rfc-editor.org/info/rfc3986.
Bradner, S.: Key words for use in RFCs to Indicate Requirement Levels. BCP 14, RFC 2119, March 1997, http://www.rfc-editor.org/info/rfc2119.
Bray, T.: The JavaScript Object Notation (JSON) Data Interchange Format. RFC 8259, December 2017, https://tools.ietf.org/html/rfc8259.
Bryan, P. and Zyp, K. and Nottingham, M.: JavaScript Object Notation (JSON) Pointer. RFC 6901, April 2013, https://tools.ietf.org/html/rfc6901.
Leiba, B.: Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words. BCP 14, RFC 8174, May 2017, http://www.rfc-editor.org/info/rfc8174.
5.2 Informative references
- JSON Schema schema language
Appendices
The following information is non-normative.
JSON Schemas
Error records can be validated with the non-normative JSON Schema `schema.json` in the specification repository. Rules not covered by the JSON Schema include:
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"$defs": {
"message": {
"type": "string",
"minLength": 1,
"description": "error message"
},
"types": {
"type": "array",
"items": {
"type": "string",
"minLength": 1,
"description": "identifier of an error type"
}
},
"level": {
"type": "string",
"enum": ["error", "warning"],
"default": "error",
"description": "error level ('error' or 'warning')"
},
"locator": {
"type": "object",
"properties": {
"format": {
"type": "string",
"pattern": "^[a-z][a-z0-9-]*$"
},
"value": {
"type": "string"
},
"position": { "$ref": "#/$defs/position" },
"message": { "$ref": "#/$defs/message" },
"types": { "$ref": "#/$defs/types" },
"level": { "$ref": "#/$defs/level" }
},
"required": ["format", "value"]
},
"position": {
"description": "positions",
"anyOf": [
{
"type": "array",
"items": { "$ref": "#/$defs/locator" }
},
{
"type": "object",
"patternProperties": {
"^[a-z][a-z0-9-]*$": {
"type": "string"
}
},
"additionalProperties": false
}
]
}
},
"properties": {
"message": { "$ref": "#/$defs/message" },
"types": { "$ref": "#/$defs/types" },
"level": { "$ref": "#/$defs/level" },
"position": { "$ref": "#/$defs/position" }
}
}
Changes
This document is managed in a revision control system at https://github.com/gbv/validation-error-format, including an issue tracker.
Version 0.1.0
Work in progress.
Footnotes
[1] Notable exceptions are formats from software development used in unit testing, such as JUnit XML and Test Anything Protocol.