Erik Naggum on the SGML/XML Dichotomies: Attributes vs. Sub-elements and Data vs. Metadata

Newsgroups: comp.lang.lisp
Subject: Re: XML and lisp
From: Erik Naggum <e...@naggum.net>
Message-ID: <3207672197075433@naggum.net>
Organization: Naggum Software, Oslo, Norway
User-Agent: Gnus/5.0808 (Gnus v5.8.8) Emacs/20.7
Date: Fri, 24 Aug 2001 20:03:19 GMT
NNTP-Posting-Date: Fri, 24 Aug 2001 22:03:19 MET DST

* Kent M Pitman <pit...@world.std.com>
> Certainly what you say is undeniably true in terms of practice, and I'd even
> give you that the notational distinction is not worth the mechanism, but
> is there somewhere that the language actually forces this "role" relationship?

  No, there is nothing that requires there to be element attributes as a
  distinct concept from element contents.  There are, however, a number of
  practical things that follow from making that arbitrary distinction which
  can look like rationales, but if you ask yourself "why can it not be a
  subelement", there are no real answers, only appeals to the idea that
  there somehow _has_ to be a distinction.  It took me years to figure out
  that the whole attribute idea is completely vacuous, and I worked with
  the creator of SGML himself for several years on several SGML-related
  standards and projects.  I started writing "A conceptual introduction to
  SGML" back in 1994, but as I had pained my way through five chapters, I
  had to realize that it was all wrong.  There was a basic design mistake
  in the whole language framework.  That mistake is, simply put: "what
  is good enough for the users of the language is not good enough for its
  creators".  Each and every level of "containership" in SGML has its own
  syntax, optimized for the task.  Each and every level has a different
  syntax for "the writing on the box" as opposed to "the contents of the
  box".  This follows from a very simple, yet amazingly elusive principle
  in its design: Meta-data is conceptually incompatible with data.  This is
  in fact wrong.  Meta-data is only data viewed from a different angle, and
  vice versa.  SGML forces you to remain loyal to your chosen angle of view.

> I wrote a package in Java at a prior employer which automatically
> generated XML representations for classes as elements based on Java
> metadata, and the tack I took was not that the XML attributes contain
> meta-data and the contents data but rather that the XML attributes
> contain atomic data and the contents contain compound data, since this is
> IN FACT what the real distinction is.

  The key to understanding this is that there is no _one_ real distinction.
  There are in fact any number of "real distinctions".  You just found one
  way to wrap your world in the attribute/contents dichotomy because it was
  there.  What would you do if it was not?  What would you do if you had
  only sub-elements?  Would you have _invented_ attributes?  I do not think
  anyone would have, because using sub-elements exacts no higher cost than
  using attributes.

> In effect, what I got out of this was a description that allowed two
> syntaxes: an easy syntax for easy things, and a hard syntax for hard
> things.

  I propose an easier syntax for the harder things and a slightly harder
  syntax for the easier things so they do not impose any easy-vs-hard
  misconceptions on the user and designer.  By making both things cost the
  same, the decision to use an attribute or a sub-element becomes a very
  different choice.

> But what I'm really wondering is whether SGML has some "intended use"
> spec that tells you that you have to put meta-info in the "car" of the
> "form", and info in the "cdr".  I thought the use of these containers was
> semantics-free.

  The intended use has less to do with it than the notion that you can
  define what is meta-information and what is information at the time you
  want to decide whether something goes in an attribute or a sub-element.
  My argument is that this is impossible.  Whether it is meta-information
  or information is a reflection of the actual use, not the intended use.

  However, given that the mechanism was created, and I will argue that it
  was not so much created as it was never thought possible to be any other
  way, it was used to define several language properties.  "Now that we
  have this, would it not also be nice to have that."  This means that
  several of the attribute types grew very far apart from the contents of
  sub-elements and you sort of "had" to use them as attributes, but only
  sort of, because the application can and does define the semantics of
  everything, and if you want ID and IDREF, you can make the same choice as
  you would in Common Lisp to use symbols or a hash table of strings.
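
  As a rough illustration of the "hash table of strings" choice, the
  sketch below lets the application, not the parser, give ID/IDREF
  meaning to ordinary sub-elements.  The element names `id`, `idref`,
  `section`, `see` and the helper names are assumptions made for this
  example only; nothing here is part of SGML or XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: identifiers and references are plain sub-elements.
DOC = """
<doc>
  <section><id>intro</id><title>Introduction</title></section>
  <section><id>body</id><title>Body</title><see><idref>intro</idref></see></section>
</doc>
"""

def build_id_table(root):
    """Map the text of each <id> child to the element that contains it."""
    table = {}
    for elem in root.iter():
        id_child = elem.find("id")  # direct children only
        if id_child is not None:
            table[id_child.text.strip()] = elem
    return table

def resolve(root, table):
    """Return (name, target element) pairs for every <idref> in the tree."""
    return [(ref.text.strip(), table[ref.text.strip()])
            for ref in root.iter("idref")]

root = ET.fromstring(DOC)
table = build_id_table(root)
for name, target in resolve(root, table):
    print(name, "->", target.find("title").text)
```

  The lookup table is an ordinary dictionary keyed by strings, so the
  semantics of "identifier" live entirely in the application.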

> >   I have come to _loathe_ the half-assed hybrid that some XML-in-Lisp tools
> >   use and produce, because it makes XML just as evil in Lisp as it was in
> >   XML to begin with, and we have gained absolutely nothing in either power
> >   of processing or in abstraction, which is so very un-Lisp-like.
> >
> > <foo bar="zot">quux</foo>
> >
> >   should be read as
> >
> > (foo (bar "zot") "quux")
> >
>
> Maybe. Macsyma used a similar notation for years (though without the restriction
> on container-ness).  I don't think the answer is to change to do the rewrite
> you suggest.

  I cannot follow you here.  I am not suggesting a rewrite.  I suggest that
  there is _no_ distinction between attribute and sub-element contents.
  What I am trying to communicate is so emphatically _NOT_ syntax that we
  will have a severe communications problem if this is not understood.  The
  syntax has a function, and I am challenging the _function_ of the syntax
  that is believed by many people to support a concept I _also_ challenge.
  What do you gain from the attribute-vs-contents dichotomy?  Why do you
  need it?  What does it do for you?  What would you have done if it were
  not there?  What choices and design decisions went into attributes that
  would go into contents if you did not have attributes?
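
  One way to make these questions concrete is a reader that refuses to
  keep the distinction.  The following is a minimal sketch, assuming
  Python's standard `xml.etree.ElementTree` parser; the function names
  `to_sexp` and `render` are made up for this example.  Attributes are
  emitted as leading sub-elements, so the attribute form and the
  sub-element form of the same information come out identical.

```python
import xml.etree.ElementTree as ET

def to_sexp(elem):
    """Return [tag, item...] with each attribute turned into a [name, value] item."""
    form = [elem.tag]
    for name, value in elem.attrib.items():
        form.append([name, value])          # attribute becomes a sub-element
    if elem.text and elem.text.strip():
        form.append(elem.text.strip())
    for child in elem:
        form.append(to_sexp(child))
        if child.tail and child.tail.strip():
            form.append(child.tail.strip())  # text following the child
    return form

def render(form):
    """Print the nested list in the parenthesized notation used in the post."""
    if isinstance(form, str):
        return '"' + form + '"'
    return "(" + " ".join([form[0]] + [render(x) for x in form[1:]]) + ")"

with_attribute = ET.fromstring('<foo bar="zot">quux</foo>')
with_subelement = ET.fromstring('<foo><bar>zot</bar>quux</foo>')

print(render(to_sexp(with_attribute)))   # (foo (bar "zot") "quux")
print(render(to_sexp(with_subelement)))  # (foo (bar "zot") "quux")
```

  Once both inputs produce the same structure, whatever the attribute
  syntax was supposed to buy has to be found somewhere other than in
  the data.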

> I don't understand why it's not natural to add the
> following as legal syntaxes:
>
>  <foo bar=<zot/>>
>
> or
>
>  <foo bar=<string>zot</string>>quux</foo>

  Imagine that all attributes are in fact sub-elements, and this problem
  just goes away.  Please, discard the concept of attributes.  They no
  longer exist.  What used to be called "attributes" are only sub-elements
  with special treatment and a whole bunch of arbitrary restrictions, one
  of which is lack of internal structure (except insofar as defined by the
  NOTATION attribute of attributes in SGML).

> This would keep people from feeling the attribute list was a shorthand
> area and would also allow the storing of complex meta-data.

  But that is not my goal.  My goal is to get rid of the idea that there is
  a distinction that can be made once and for all, and prematurely at that,
  that some information is meta-data and some information is data.  The
  core philosophical mistake in SGML is that you can specify these things
  before you know them.  SGML is great for after-the-fact description of
  structures you already know how to deal with perfectly.  It absolutely
  sucks for structures that are in any way yet to be defined.  This is
  _because_ it is impossible to define what is considered meta-information
  and what is considered information before you actually have a full-blown
  software application that is hard to change your mind about.  SGML was
  supposedly designed to free data from the vagaries of software, but when
  it adopted the attribute-content dichotomy, it dove right into dependency
  on the software design process instead of the information design process.

> Do you know what the reason was that recursive structures were not
> allowed in this position in XML?

  Yes, as a matter of fact, I do.  Recursive structures are in fact allowed
  in attribute values, provided that your application processes them and not
  the SGML/XML parser.  Back in the SGML days, the NOTATION attribute of
  both elements and attribute values was designed as an "escape" to the
  application to let some other syntax processor deal with the string of
  characters.  (Please understand that everything SGML/XML is a string of
  characters.  There are no _values_.  Imposing valuedom on strings is the
  kind of semantics that SGML/XML specifically does _not_ support.)
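
  The division of labor described above can be sketched as follows.
  The XML parser hands back the attribute value as a string and nothing
  more; a separate reader supplied by the application finds the
  structure in it.  The parenthesized notation inside the attribute and
  the `read_sexp` helper are assumptions for this example, loosely
  standing in for whatever syntax a NOTATION would name.

```python
import re
import xml.etree.ElementTree as ET

# Tokens: parentheses, double-quoted strings, or bare words.
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')

def read_sexp(text):
    """Parse one parenthesized form into nested lists of strings."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1                 # consume the closing parenthesis
            return items
        return tok.strip('"')        # leaves stay strings, never "values"

    return read()

elem = ET.fromstring("""<foo bar='(zot (quux "a b") (n 42))'>text</foo>""")
raw = elem.get("bar")                # the XML parser sees only characters
print(type(raw).__name__)            # str
print(read_sexp(raw))                # ['zot', ['quux', 'a b'], ['n', '42']]
```

  Note that even after the application's reader has run, `42` is still
  the string "42"; treating it as a number would be one more layer of
  semantics imposed by the application.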

> Or perhaps it was the fact that the "real world" substitutes for "parsed
> structure" things like that weird assembly code like notation which looks
> like
>
>  (A
>  AHREF=foo.html
>  -Text
>  )A
>
> Perhaps someone was just being uncreative about how a compound-structure
> could be offered as an attribute.

  No, they never actually thought of it that way.  You have to understand
  and appreciate that the design process for SGML was such that some people
  had a very clear picture of the meta-information-vs-information dichotomy
  and that it never occurred to anyone that meta-information had exactly
  the same properties as information.

  Whoever first decided to define HTML in such a way that unknown elements
  should be displayed suffered from exactly the same problem.  As a sorry
  consequence, we have elements that have to contain _comments_ that are
  the real contents because that somebody did not foresee the need to have
  meta-information in contents.  I argue that this is a result of "getting"
  the invalid meta-information/information dichotomy.  If that person had
  not been bitten by the false idea that meta-information is fundamentally
  different from information, he would have realized that there would be a
  need to use element contents for meta-information, as well.

> Good.  I'd hate for it to be "lost" as merely a post here, though I think
> it's fun that you felt comfortable in sharing your thoughts.

  Well, it took ten years of discomfort with the "attribute" concept before
  I went back to examine the genesis of the various forms of attributes and
  persisted in asking the question "could it not have been done with
  sub-elements", and finally found that the reason it could not was that
  somebody did not _want_ it to be done with sub-elements, and that the
  root cause of this was a fundamental misunderstanding of the relationship
  between information and meta-information.  Just like Plato and Aristotle
  agreed that ideas and concepts were somehow "inherent" in the things we
  saw and not a property of the person who observed and organized them in
  his own mind, SGML embodies the false premise that structuring has some
  inherent qualities and processing that structure should reflect its
  inherent qualities.  The result is that the processing defines the
  structure.  If there is a mismatch between the two, the result is a very
  painful and elaborate processing, and it can be solved very simply by
  removing the attribute/sub-contents dichotomy, because once we do that,
  we return to first principles and can move forward with the same
  knowledge and experience that created the attributes, but now we can do
  it with sub-elements, instead, and I can promise you that once you start
  off on that road, the least of your worries will be recursive structure
  in attribute values.
