
RFC: Glossary Data Formats

First stage

For now, we can use the following format to specify glossary entries for the Journal:

<dl>
  <dt>Media Fragment</dt>
  <dt>Media Fragments</dt>
  <dd>Paragraphs, datasets, images, topic maps, glossaries… that go into forming one or more document(s).</dd>
  <dd>This glossary entry is a media fragment, and each of its parts could be as well.</dd>
</dl>

As the Journal is currently stored in WordPress, we might use its data model to some extent, but we will abandon it more and more by abusing WordPress as generic online data storage, where arbitrary data is stored in the content-body of a blog post. This will render all plugins, themes and administrative tools on the server useless and might even break the HTML that gets generated for the web, so users who visit the URL in a browser won’t see anything reasonable anymore. The views rendered by a server are irrelevant anyway, as the client can’t rely on this external entity to cooperate on ViewSpec operations, or even to be present/reachable at all.

After retrieval, a client is supposed to look into the content-body of a blog post, and if there’s an XHTML <dl> in there, these glossary definitions should be added to the glossary of the Journal, the latter consisting of all <dl>s found in all blog posts. Because of the semantic <dl> markup, we can ignore the explicit WordPress category that marks posts as glossary entries; on the other hand, the entire Journal can’t have two separate glossaries at this first stage. There’s no need to require all glossary entries to be in a single <dl> in a single post, or to demand that a post containing a <dl> contain no more than one definition. One post could contain more than one <dl>, but this is certainly not encouraged. There are no recommendations on what to do with other data found in a post content-body outside of <dl>. Clients are free to ignore it, as it is expected that different types of data will be stored separately instead of being embedded in big collections. That way clients can explicitly retrieve only the data that’s of interest to them and avoid having to extract relevant data from mixed media fragments in complex ways. It is expected that on the client side, the glossary data will get applied onto other, independent media fragments in the process of building the document.
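
The retrieval step might be sketched as follows in Python. This is only an illustration under assumptions: the function names `find_dl_elements` and `collect_glossary` are made up here, the post bodies are sample data, and the content-body is assumed to be well-formed XHTML without namespaces (as in the first stage). Posts that don't parse are simply skipped.

```python
import xml.etree.ElementTree as ET


def find_dl_elements(content_body: str) -> list[ET.Element]:
    """Return every <dl> element found in one post's content-body."""
    # Wrap the body in a synthetic root (hypothetical <post>) so that
    # fragments with several top-level nodes still parse as one document.
    root = ET.fromstring(f"<post>{content_body}</post>")
    return list(root.iter("dl"))


def collect_glossary(posts: list[str]) -> list[ET.Element]:
    """Build the Journal glossary as the list of all <dl>s in all posts."""
    glossary = []
    for body in posts:
        try:
            glossary.extend(find_dl_elements(body))
        except ET.ParseError:
            # Assumption: bodies that are not well-formed are ignored.
            continue
    return glossary


# Sample content-bodies (illustrative only).
posts = [
    "<p>Other data a client may ignore.</p>"
    "<dl><dt>Media Fragment</dt><dd>Part of a document.</dd></dl>",
    "A regular blog post without glossary data.",
    "<dl><dt>Example Term</dt><dd>Example description.</dd></dl>",
]

print(len(collect_glossary(posts)))
```

Data outside of the <dl>s is never touched here, which matches the freedom of clients to ignore it.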

At the first stage, XHTML’s <dl> supports multiple terms and multiple definitions for each glossary entry. If I understand correctly, all consecutive <dt>s are alternatives to each other (synonyms, abbreviations, plurals, languages even?) until the first <dd> is encountered, of which there might be several to express different definitions for the same term(s) (different in meaning, not highlighting different aspects of the same meaning). Clients are expected to associate the term(s) and description(s) with some logic of their own, as XHTML unfortunately doesn’t provide explicit grouping. Within the <dt>s and <dd>s, other semantics are neither supported nor encouraged. Clients are free to ignore such markup, but they must interpret all text nodes inside the unsupported markup as belonging to the text of the term or definition. Clients are allowed to interpret other valid XHTML found in a <dt> or <dd> if they can and want to, but they can never expect an implicitly granted meaning on the Journal, nor will there be a future stage that includes unrelated XHTML elements in glossary definitions via this backdoor.
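
One possible grouping logic, as a rough Python sketch: consecutive <dt>s are collected until the <dd>s that follow them end, and a <dt> after a <dd> starts the next group. The function name `group_entries` and the sample markup are assumptions for illustration. Nested markup is ignored, but its text nodes are kept via `itertext()`.

```python
import xml.etree.ElementTree as ET


def group_entries(dl: ET.Element) -> list[tuple[list[str], list[str]]]:
    """Split the children of a <dl> into (terms, definitions) groups."""
    entries = []
    terms, definitions = [], []
    for child in dl:
        # Keep all text nodes, including those inside unsupported markup.
        text = "".join(child.itertext()).strip()
        if child.tag == "dt":
            if definitions:
                # A <dt> following a <dd> closes the previous group.
                entries.append((terms, definitions))
                terms, definitions = [], []
            terms.append(text)
        elif child.tag == "dd":
            definitions.append(text)
    if terms or definitions:
        entries.append((terms, definitions))
    return entries


# Illustrative sample with two groups and some nested markup.
sample = ET.fromstring(
    "<dl>"
    "<dt><em>Media</em> Fragment</dt><dt>Media Fragments</dt>"
    "<dd>First meaning.</dd><dd>Second meaning.</dd>"
    "<dt>Other Term</dt><dd>Another definition.</dd>"
    "</dl>"
)
for terms, definitions in group_entries(sample):
    print(terms, definitions)
```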

We look at <dl> as some kind of microformat serialized in XML. As XML namespaces aren’t included in the first stage, the unambiguous meaning of this “magic” markup name is derived from a declaration: it has the XHTML meaning whenever it is found in the data of the official location(s) and sources of the Journal. For now, definitions should be globally unique and not published in two separate places or as duplicates in the same place. If clients encounter a term that’s already in the local data storage, they’re free to ignore the newly retrieved definition, which includes all other term alternatives of the definition and all definition descriptions that are associated with the term in question, but not other definitions of the same <dl> that might provide new, previously unknown definitions.
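
A client that chooses to ignore already-known terms could merge newly retrieved definitions roughly like this. It is a sketch, not a required behavior: the `Entry` shape (a list of terms plus a list of descriptions) and the name `merge_entries` are assumptions made for this example.

```python
# Assumed in-memory shape of one definition: (terms, descriptions).
Entry = tuple[list[str], list[str]]


def merge_entries(storage: list[Entry], incoming: list[Entry]) -> list[Entry]:
    """Add incoming definitions, skipping any whose terms are already known."""
    known = {term for terms, _ in storage for term in terms}
    merged = list(storage)
    for terms, descriptions in incoming:
        if any(term in known for term in terms):
            # One known term is enough to skip the whole definition,
            # including its other term alternatives and all descriptions.
            continue
        # Other definitions from the same <dl> are still accepted.
        merged.append((terms, descriptions))
        known.update(terms)
    return merged


storage = [(["Media Fragment"], ["Already stored description."])]
incoming = [
    (["Media Fragment", "Media Fragments"], ["Newly retrieved description."]),
    (["Other Term"], ["Previously unknown definition."]),
]
print(merge_entries(storage, incoming))
```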

The option for clients to publish and retrieve/interpret glossary entries to/from the Journal in the JSON serialization of the microformat with the implicit meaning from XHTML is deprecated, as the structure standardized in HTML might not be expressible in JSON. The initial flawed suggestion was:

{
  "dl":
  {
    "dt":
    [
      "Media Fragment",
      "Media Fragments"
    ],
    "dd":
    [
      "Paragraphs, datasets, images, topic maps, glossaries… that go into forming one or more document(s).",
      "This glossary entry is a media fragment, and each of its parts could be as well."
    ]
  }
}
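
One possible reading of why this particular shape is flawed: a <dl> may hold several term/definition groups in sequence, but a JSON object with repeated "dt"/"dd" keys can't carry them reliably. The small Python sketch below shows that the standard `json` module keeps only the last value for a duplicate key. The sample data is made up for illustration.

```python
import json

# Two groups written with repeated keys, mirroring the suggested shape.
two_groups = (
    '{"dl": {'
    '"dt": ["Media Fragment"], "dd": ["First group."], '
    '"dt": ["Other Term"], "dd": ["Second group."]'
    "}}"
)

parsed = json.loads(two_groups)

# The first "dt"/"dd" pair is silently replaced by the second one.
print(parsed["dl"])
```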

The old proposal added: Clients are free to refuse to post/retrieve/interpret either JSON or XML on the text transport layer and to work exclusively with one of them. They can determine the format by looking at the first magic byte, which has to be either < or {; no whitespace or other data is permitted in front of this indicator. If a term is encountered twice in the two formats, the client is not allowed to add the term (including all other terms grouped with it) with the associated description(s) to the local storage. Clients are free, however, to replace a term (including all other terms grouped with it) together with the associated description(s) with another term (including all other terms grouped with it) plus the associated description(s). In those cases, it doesn’t matter whether the other terms of a definition are unique and not encountered in another format, or are newly introduced by the new data and haven’t been in the storage before: a conflicting term renders the entire definition exclusive. Servers are free to accept the same or different definitions for the same term in the two formats, be it because a conflicting term has different other terms in its term group or because the descriptions differ. They are encouraged to deny the posting of the same term in the same format (regardless of the other terms in the group or differences in the definition), but there is no requirement to do any checking, as clients will handle the conflict anyway, so it’s more a question of data/storage optimization for the server operator.
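
For illustration only, the magic-byte check of the old (deprecated) proposal could look like this in Python. The function name `detect_format` and the choice to raise `ValueError` for anything else are assumptions of this sketch.

```python
def detect_format(payload: bytes) -> str:
    """Decide between XML and JSON by the very first byte of the payload."""
    first = payload[:1]
    if first == b"<":
        return "xml"
    if first == b"{":
        return "json"
    # Leading whitespace or any other data in front is not permitted.
    raise ValueError("payload must start with '<' or '{'")


print(detect_format(b"<dl><dt>Media Fragment</dt></dl>"))
print(detect_format(b'{"dl": {}}'))
```

A client that works on only one of the formats would simply refuse payloads for which the other format is detected.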

The use of the blog post title/subject for the glossary term and the blog post content-body for the term description is deprecated. Clients can, if they want to rely on a potentially changing category ID or name, interpret such posts as glossary definitions, but they’re free to ignore such posts as not being glossary data.

Second stage

<?ohs version="1.0" format="xml" encoding="UTF-8"?>
<dl xmlns="http://www.w3.org/1999/xhtml"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    rdf:about="ohsidentifierscheme:info.doug-50/journal/category/12?created=2018-02-11T23:42">
  <dt>Media Fragment</dt>
  <dt>Media Fragments</dt>
  <dd>Paragraphs, datasets, images, topic maps, glossaries… that go into forming one or more document(s).</dd>
  <dd>This glossary entry is a media fragment, and each of its parts could be as well.</dd>
</dl>
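
A client could read such a second-stage document along these lines. This is a hedged Python sketch: it assumes the document is well-formed XML, relies on the standard parser dropping the leading processing instruction, and uses the constant names `XHTML` and `RDF` only for this example.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

document = """<?ohs version="1.0" format="xml" encoding="UTF-8"?>
<dl xmlns="http://www.w3.org/1999/xhtml"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    rdf:about="ohsidentifierscheme:info.doug-50/journal/category/12?created=2018-02-11T23:42">
  <dt>Media Fragment</dt>
  <dt>Media Fragments</dt>
  <dd>Paragraphs, datasets, images, topic maps, glossaries… that go into forming one or more document(s).</dd>
  <dd>This glossary entry is a media fragment, and each of its parts could be as well.</dd>
</dl>"""

root = ET.fromstring(document)

# Namespaced names are written as {namespace-uri}local-name.
identifier = root.get(f"{{{RDF}}}about")
terms = [dt.text for dt in root.findall(f"{{{XHTML}}}dt")]
definitions = [dd.text for dd in root.findall(f"{{{XHTML}}}dd")]

print(identifier)
print(terms)
```

With namespaces in place, the meaning of <dl> no longer depends on where the data was found.
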

Published in Suggestions
