Stencila Schema
Extensions to schema.org to support semantic, composable, parameterize-able and executable documents
Introduction
This is the Stencila Schema, an extension to schema.org to support semantic, composable, parameterize-able and executable documents (we call them stencils for short). It also provides implementations of schema.org types (and our extensions) for several languages including JSON Schema, TypeScript, Python and R. It is a central part of our platform and is used widely throughout our open-source tools as the data model for executable documents.
Why an extension to schema.org?
Schema.org is "a collaborative, community activity with a mission to create, maintain, and promote schemas for structured data on the Internet, on web pages, in email messages, and beyond". Schema.org is used by most major search engines to provide richer, more semantic search results. More and more websites are using the schema.org vocabulary, and there is increasing uptake in the research community, e.g. bioschemas.org and codemeta.github.io.
The schema.org vocabulary encompasses many varied concepts and topics. Of particular relevance to Stencila are types for research outputs such as ScholarlyArticle, Dataset and SoftwareSourceCode, and their associated metadata, e.g. Person and Organization.
However, schema.org does not provide types for the content of research articles. This is where our extensions come in. This schema adds types (and some properties to existing types) to be able to represent a complete, executable research article. These extension types include "static" nodes such as Paragraph, Heading and Figure, and "dynamic" nodes involved in execution such as CodeChunk and Parameter.
It's about names, not formats
An important aspect of schema.org and similar vocabularies is that they really just define a shared way of naming things. They are format-agnostic. As schema.org says, it can be used with "many different encodings, including RDFa, Microdata and JSON-LD".
We extend this philosophy to the encoding of executable articles, allowing them to be encoded in several existing document formats. For example, the following very small Article, containing only one Paragraph and no metadata, can be represented in Markdown:
Hello world!
as YAML,
type: Article
content:
  - type: Paragraph
    content:
      - Hello world!
as a Jupyter Notebook,
{
  "nbformat": 4,
  "nbformat_minor": 4,
  "metadata": {
    "title": ""
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": ["Hello world!"]
    }
  ]
}
as JSON-LD,
{
  "@context": "http://schema.stenci.la/v1/jsonld/",
  "type": "Article",
  "content": [
    {
      "type": "Paragraph",
      "content": ["Hello world!"]
    }
  ]
}
or as HTML with Microdata,
<article itemscope="" itemtype="http://schema.org/Article">
  <p itemscope="" itemtype="http://schema.stenci.la/Paragraph">Hello world!</p>
</article>
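To make the "names, not formats" point concrete, here is a minimal Python sketch that holds the same one-paragraph Article as plain data and serializes it to the JSON-LD form shown above. The helper name to_jsonld is hypothetical and is not part of any Stencila package; the context URL is the one used in the JSON-LD example.

```python
import json

# The same one-paragraph Article as in the examples above, as plain Python data.
article = {
    "type": "Article",
    "content": [
        {"type": "Paragraph", "content": ["Hello world!"]},
    ],
}


def to_jsonld(node: dict, context: str = "http://schema.stenci.la/v1/jsonld/") -> str:
    """Serialize a node as JSON-LD by placing an @context entry first.

    Hypothetical helper for illustration only.
    """
    return json.dumps({"@context": context, **node}, indent=2)


jsonld_text = to_jsonld(article)
```

The node itself does not change between encodings; only the serialization step differs.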
This repository does not deal with format conversion per se. Please see Encoda for that. However, when developing our schema.org extensions, we aimed not to reinvent the wheel, and to maintain consistency and compatibility with existing schemas for representing document content. Those include:
But, sometimes (often) we need more than just names
Despite its name, schema.org does not define strong rules around the shape of data, as, say, a database schema or XML schema does. All the properties of schema.org types are optional, and although they have "expected types", this is not enforced. In addition, properties can be a single value or an array, but always have a singular name. For example, an Article has an author property which could be undefined, a string, a Person or an Organization, or an array of Person or Organization items.
This flexibility makes a lot of sense for the primary purpose of schema.org: semantic annotation of other content. However, for use as an internal data model, as in Stencila, it can result in a lot of defensive code to check exactly which of these alternatives a property value is. And writing more code than you need to is A Bad Thing™.
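As an illustration of the kind of defensive code this implies, here is a minimal Python sketch that normalizes a schema.org-style author value into a list. The function name normalize_authors and the choice to wrap bare strings as a Person with a name are assumptions made for this example, not part of the schema.

```python
from typing import Any, Dict, List


def normalize_authors(article: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the `author` property as a list of Person/Organization dicts.

    Handles every shape described above: missing, a string, a single
    node, or an array of nodes.
    """
    author = article.get("author")
    if author is None:
        return []
    items = author if isinstance(author, list) else [author]
    normalized = []
    for item in items:
        if isinstance(item, str):
            # Assumption for this sketch: a bare string is a person's name.
            normalized.append({"type": "Person", "name": item})
        elif isinstance(item, dict) and item.get("type") in ("Person", "Organization"):
            normalized.append(item)
        else:
            raise ValueError(f"Unexpected author value: {item!r}")
    return normalized
```

Every consumer of the data would need a check like this for every flexible property, which is the overhead a stricter schema avoids.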
Instead, we wanted a schema that placed some restrictions on the shape of executable documents. This has flow-on benefits for developer experience, such as type inference and checking. To achieve this, the Stencila Schema defines schema.org types using JSON Schema. Yes, that's a lot of "schemas", but bear with us...
Using JSON Schema for validation and type safety
JSON Schema is "a vocabulary that allows you to annotate and validate JSON documents". It is a draft internet standard which, like schema.org, has growing adoption, e.g. schemastore.org.
In Stencila Schema, when we define a type of document node, either a schema.org type or an extension, we define it:
- as a JSON Schema document, with restrictions on the cardinality, type and shape of its properties
- using schema.org type and property names, pluralized as appropriate to avoid confusion
For example, an Article is defined to have an optional authors property (note the s this time) which is always an array whose items are either a Person or an Organization.
{
  "title": "Article",
  "@id": "schema:Article",
  "description": "An article, including news and scholarly articles.",
  "properties": {
    "authors": {
      "@id": "schema:author",
      "description": "The authors of this creative work.",
      "type": "array",
      "items": {
        "anyOf": [
          {
            "$ref": "Person.schema.json"
          },
          {
            "$ref": "Organization.schema.json"
          }
        ]
      }
    }
    ...
To keep things simpler, this is a stripped-down version of the actual Article.schema.json.
With a JSON Schema, we are able to:
- use a JSON Schema validator to check that content meets the schema
- generate types (i.e. interface and class elements) matching the schema in other languages.
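To give a feel for the second point, here is a hand-written Python sketch of the kind of classes that could correspond to the stripped-down Article schema above. It is not the output of the actual generator: the name field on Person and Organization and the check_article helper are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Person:
    # Assumed property, included only so the example has some data.
    name: Optional[str] = None


@dataclass
class Organization:
    # Assumed property, included only so the example has some data.
    name: Optional[str] = None


@dataclass
class Article:
    # Mirrors the schema: `authors` is optional but, when present, is
    # always an array of Person or Organization items.
    authors: List[Union[Person, Organization]] = field(default_factory=list)


def check_article(article: Article) -> None:
    """Raise TypeError if any author is not a Person or Organization."""
    for author in article.authors:
        if not isinstance(author, (Person, Organization)):
            raise TypeError(f"Invalid author: {author!r}")
```

Because authors has a single, known shape, a static type checker can flag misuse without the per-property checks that the looser schema.org definition would require.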
But, JSON Schema can be a pain to write
JSON can be quite fiddly to write by hand. And JSON Schema lacks a way to easily express parent-child relationships between types. For these reasons, we define types using YAML with custom keywords such as extends, and from those definitions generate JSON Schema and, ultimately, bindings for each language.
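The following Python sketch shows one way an extends keyword could be resolved into a flat JSON Schema. It is an illustration under stated assumptions rather than the actual build code: the definitions dictionary stands in for already-parsed YAML files and is heavily simplified, and the function names are hypothetical.

```python
from typing import Any, Dict

# Hypothetical, simplified type definitions, as they might look after
# loading the YAML files into dictionaries.
definitions: Dict[str, Dict[str, Any]] = {
    "Thing": {"properties": {"name": {"type": "string"}}},
    "CreativeWork": {
        "extends": "Thing",
        "properties": {"authors": {"type": "array"}},
    },
    "Article": {
        "extends": "CreativeWork",
        "properties": {"content": {"type": "array"}},
    },
}


def resolve_properties(title: str, defs: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Collect a type's properties, inheriting from ancestors via `extends`."""
    definition = defs[title]
    parent = definition.get("extends")
    inherited = resolve_properties(parent, defs) if parent else {}
    # Properties defined on the child override inherited ones of the same name.
    return {**inherited, **definition.get("properties", {})}


def to_json_schema(title: str, defs: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Build a flat JSON Schema object for one type."""
    return {
        "title": title,
        "type": "object",
        "properties": resolve_properties(title, defs),
    }


article_schema = to_json_schema("Article", definitions)
```

Flattening inherited properties into each generated schema keeps the output usable by any standard JSON Schema validator, which knows nothing about the custom extends keyword.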
Documentation
Documentation is available at https://schema.stenci.la/.
Alternatively, you may want to directly consult the type definitions (*.yaml files) and documentation (*.md files) in the schema directory.
Usage
JSON-LD context
A JSON-LD @context is generated from the JSON Schema sources and published at https://schema.stenci.la/stencila.jsonld.
Individual files are published for each extension type, e.g. https://schema.stenci.la/CodeChunk.jsonld, and for each extension property, e.g. https://schema.stenci.la/rowspan.jsonld.
Programming language bindings
Bindings for this schema, in the form of installable packages, are currently generated for TypeScript/JavaScript, Python and R.
Depending on the capabilities of the host language, these packages expose type definitions as well as utility functions for constructing valid Stencila Schema nodes. Each package has its own documentation, auto-generated from the code.
Contributing
We welcome contributions of all kinds: ideas, examples, bug reports, documentation, code and questions.
Please see CONTRIBUTING.md for a guide on how to contribute to the schema definitions. See the README.md file in each language sub-folder, e.g. py, for advice on developing the language bindings.
Acknowledgments
Thanks to the developers of all the existing schemas and open source tools we use in this repo, including:
- Schema.org
- CodeMeta
Available types
Schemas marked with "U" are considered unstable and are more likely to change.
- Prose
- Code
- Data
- Validation
- Metadata
- Miscellaneous