Erik Naggum on the SGML/XML Dichotomies: Attributes vs. Sub-elements and Data vs. Metadata
Newsgroups: comp.lang.lisp
Subject: Re: XML and lisp
From: Erik Naggum <e...@naggum.net>
Message-ID: <3207672197075433@naggum.net>
Organization: Naggum Software, Oslo, Norway
User-Agent: Gnus/5.0808 (Gnus v5.8.8) Emacs/20.7
Date: Fri, 24 Aug 2001 20:03:19 GMT
NNTP-Posting-Date: Fri, 24 Aug 2001 22:03:19 MET DST
* Kent M Pitman <pit...@world.std.com>
> Certainly what you say is undeniably true in terms of practice, and I'd even
> give you that the notational distinction is not worth the mechanism, but
> is there somewhere that the language actually forces this "role" relationship?
No, there is nothing that requires there to be element attributes as a
distinct concept from element contents. There are, however, a number of
practical things that follow from making that arbitrary distinction which
can look like rationales, but if you ask yourself "why can it not be a
subelement", there are no real answers, only appeals to the idea that
there somehow _has_ to be a distinction. It took me years to figure out
that the whole attribute idea is completely vacuous, and I worked with
the creator of SGML himself for several years on several SGML-related
standards and projects. I started writing "A conceptual introduction to
SGML" back in 1994, but as I had pained my way through five chapters, I
had to realize that it was all wrong. There was a basic design mistake
in the whole language framework. That mistake is, simply put: "what
is good enough for the users of the language is not good enough for its
creators". Each and every level of "containership" in SGML has its own
syntax, optimized for the task. Each and every level has a different
syntax for "the writing on the box" as opposed to "the contents of the
box". This follows from a very simple, yet amazingly elusive principle
in its design: Meta-data is conceptually incompatible with data. This is
in fact wrong. Meta-data is only data viewed from a different angle, and
vice versa. SGML forces you to remain loyal to your chosen angle of view.
> I wrote a package in Java at a prior employer which automatically
> generated XML representations for classes as elements based on Java
> metadata, and the tack I took was not that the XML attributes contain
> meta-data and the contents data but rather that the XML attributes
> contain atomic data and the contents contain compound data, since this is
> IN FACT what the real distinction is.
The key to understanding this is that there is no _one_ real distinction.
There are in fact any number of "real distinctions". You just found one
way to wrap your world in the attribute/contents dichotomy because it was
there. What would you do if it was not? What would you do if you had
only sub-elements? Would you have _invented_ attributes? I do not think
anyone would have, because using sub-elements exacts no higher cost than
using attributes.
> In effect, what I got out of this was a description that allowed two
> syntaxes: an easy syntax for easy things, and a hard syntax for hard
> things.
I propose an easier syntax for the harder things and a slightly harder
syntax for the easier things so they do not impose any easy-vs-hard
misconceptions on the user and designer. By making both things cost the
same, the decision to use an attribute or a sub-element becomes a very
different choice.
> But what I'm really wondering is whether SGML has some "intended use"
> spec that tells you that you have to put meta-info in the "car" of the
> "form", and info in the "cdr". I thought the use of these containers was
> semantics-free.
The intended use has less to do with it than the notion that you can
define what is meta-information and what is information at the time you
want to decide whether something goes in an attribute or a sub-element.
My argument is that this is impossible. Whether it is meta-information
or information is a reflection of the actual use, not the intended use.
However, given that the mechanism was created, and I will argue that it
was not so much created as it was never thought possible to be any other
way, it was used to define several language properties. "Now that we
have this, would it not also be nice to have that." This means that
several of the attribute types grew very far apart from the contents of
sub-elements and you sort of "had" to use them as attributes, but only
sort of, because the application can and does define the semantics of
everything, and if you want ID and IDREF, you can make the same choice as
you would in Common Lisp to use symbols or a hash table of strings.
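
As an illustration of the "hash table of strings" choice, here is a minimal
sketch in Python (not part of the original post). It assumes a made-up document
in which identifiers and references are ordinary sub-elements named `id` and
`idref`; those names, and the helper `resolve`, are hypothetical. The parser
knows nothing about them. The application alone decides that they identify and
refer.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <id> and <idref> are plain sub-elements, not attributes.
doc = ET.fromstring(
    "<doc>"
    "<section><id>intro</id><title>Introduction</title></section>"
    "<section><id>body</id><title>Body</title>"
    "<see><idref>intro</idref></see></section>"
    "</doc>"
)

# Hash table of strings: the application, not the parser, gives <id> its meaning.
by_id = {}
for element in doc.iter():
    id_child = element.find("id")  # direct children only
    if id_child is not None:
        by_id[id_child.text] = element

def resolve(reference):
    """Follow an <idref> sub-element to the element that carries that <id>."""
    return by_id[reference.findtext("idref")]

target = resolve(doc.find("./section/see"))
print(target.findtext("title"))
```

Nothing here needed an attribute type declared in advance; the lookup table
is built from content at the time the application chooses to treat it as such.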
> > I have come to _loathe_ the half-assed hybrid that some XML-in-Lisp tools
> > use and produce, because it makes XML just as evil in Lisp as it was in
> > XML to begin with, and we have gained absolutely nothing in either power
> > of processing or in abstraction, which is so very un-Lisp-like.
> >
> > <foo bar="zot">quux</foo>
> >
> > should be read as
> >
> > (foo (bar "zot") "quux")
> >
>
> Maybe. Macsyma used a similar notation for years (though without the restriction
> on container-ness). I don't think the answer is to change to do the rewrite
> you suggest.
I cannot follow you here. I am not suggesting a rewrite. I suggest that
there is _no_ distinction between attribute and sub-element contents.
What I am trying to communicate is so emphatically _NOT_ syntax that we
will have a severe communications problem if this is not understood. The
syntax has a function, and I am challenging the _function_ of the syntax
that is believed by many people to support a concept I _also_ challenge.
What do you gain from the attribute-vs-contents dichotomy? Why do you
need it? What does it do for you? What would you have done if it were
not there? What choices and design decisions went into attributes that
would go into contents if you did not have attributes?
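
To make the claim concrete, here is a minimal sketch in Python (not part of
the original post) of a reader that refuses to distinguish the two. The
function names `to_sexp` and `render` are illustrative assumptions. An
attribute and a sub-element with the same name and content produce the same
structure, so nothing downstream can tell which spelling the author used.

```python
import xml.etree.ElementTree as ET

def to_sexp(element):
    """Return a nested list: [tag, sub-forms and text in document order]."""
    form = [element.tag]
    # Attributes become ordinary sub-forms, indistinguishable from children.
    for name, value in element.attrib.items():
        form.append([name, value])
    if element.text and element.text.strip():
        form.append(element.text.strip())
    for child in element:
        form.append(to_sexp(child))
        # Text following a child belongs to the parent's contents.
        if child.tail and child.tail.strip():
            form.append(child.tail.strip())
    return form

def render(form):
    """Write the nested list in Lisp-style notation (no escaping of quotes)."""
    if isinstance(form, str):
        return '"' + form + '"'
    head, *rest = form
    return "(" + " ".join([head] + [render(item) for item in rest]) + ")"

attr_form = to_sexp(ET.fromstring('<foo bar="zot">quux</foo>'))
elem_form = to_sexp(ET.fromstring("<foo><bar>zot</bar>quux</foo>"))

print(render(attr_form))
print(attr_form == elem_form)
```

The sketch is about the data structure, not the notation: once both inputs
collapse to one form, the question "attribute or sub-element?" has no answer
left to give.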
> I don't understand why it's not natural to add the
> following as legal syntaxes:
>
> <foo bar=<zot/>>
>
> or
>
> <foo bar=<string>zot</string>>quux</foo>
Imagine that all attributes are in fact sub-elements, and this problem
just goes away. Please, discard the concept of attributes. They no
longer exist. What used to be called "attributes" are only sub-elements
with special treatment and a whole bunch of arbitrary restrictions, one
of which is lack of internal structure (except insofar as defined by the
NOTATION attribute of attributes in SGML).
> This would keep people from feeling the attribute list was a shorthand
> area and would also allow the storing of complex meta-data.
But that is not my goal. My goal is to get rid of the idea that there is
a distinction that can be made once and for all, and prematurely at that,
that some information is meta-data and some information is data. The
core philosophical mistake in SGML is that you can specify these things
before you know them. SGML is great for after-the-fact description of
structures you already know how to deal with perfectly. It absolutely
sucks for structures that are in any way yet to be defined. This is
_because_ it is impossible to define what is considered meta-information
and what is considered information before you actually have a full-blown
software application that is hard to change your mind about. SGML was
supposedly designed to free data from the vagaries of software, but when
it adopted the attribute-content dichotomy, it dove right into dependency
on the software design process instead of the information design process.
> Do you know what the reason was that recursive structures were not
> allowed in this position in XML?
Yes, as a matter of fact, I do. Recursive structures are in fact allowed
in attribute values, provided that your application processes them and not
the SGML/XML parser. Back in the SGML days, the NOTATION attribute of
both elements and attribute values was designed as an "escape" to the
application to let some other syntax processor deal with the string of
characters. (Please understand that everything SGML/XML is a string of
characters. There are no _values_. Imposing valuedom on strings is the
kind of semantics that SGML/XML specifically does _not_ support.)
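
A minimal sketch in Python (not part of the original post) of what "let the
application process it" can look like. It assumes a hypothetical convention in
which an attribute named `notation` tells the application which syntax
processor to hand another attribute's string to. This only imitates the idea
of the SGML NOTATION escape; it is not that mechanism, and the `PROCESSORS`
table is invented for the example.

```python
import json
import xml.etree.ElementTree as ET

element = ET.fromstring(
    """<foo notation="json" bar='{"zot": {"depth": [1, [2, [3]]]}}'>quux</foo>"""
)

# The XML parser delivers nothing but a string of characters.
raw = element.get("bar")
print(type(raw).__name__)

# The application picks a syntax processor and imposes structure itself.
PROCESSORS = {"json": json.loads}  # hypothetical notation registry
value = PROCESSORS[element.get("notation")](raw)
print(value["zot"]["depth"])
```

The recursion lives entirely on the application side of the boundary. As far
as the parser is concerned, `bar` held a flat run of characters and nothing
more.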
> Or perhaps it was the fact that the "real world" substitutes for "parsed
> structure" things like that weird assembly code like notation which looks
> like
>
> (A
> AHREF=foo.html
> -Text
> )A
>
> Perhaps someone was just being uncreative about how a compound-structure
> could be offered as an attribute.
No, they never actually thought of it that way. You have to understand
and appreciate that the design process for SGML was such that some people
had a very clear picture of the meta-information-vs-information dichotomy
and that it never occurred to anyone that meta-information had exactly
the same properties as information.
Whoever first decided to define HTML in such a way that unknown elements
should be displayed suffered from exactly the same problem. As a sorry
consequence, we have elements that have to contain _comments_ that are
the real contents because that somebody did not foresee the need to have
meta-information in contents. I argue that this is a result of "getting"
the invalid meta-information/information dichotomy. If that person had
not been bitten by the false idea that meta-information is fundamentally
different from information, he would have realized that there would be a
need to use element contents for meta-information, as well.
> Good. I'd hate for it to be "lost" as merely a post here, though I think
> it's fun that you felt comfortable in sharing your thoughts.
Well, it took ten years of discomfort with the "attribute" concept before
I went back to examine the genesis of the various forms of attributes and
persisted in asking the question "could it not have been done with
sub-elements", and finally found that the reason it could not was that
somebody did not _want_ it to be done with sub-elements, and that the
root cause of this was a fundamental misunderstanding of the relationship
between information and meta-information. Just like Plato and Aristotle
agreed that ideas and concepts were somehow "inherent" in the things we
saw and not a property of the person who observed and organized them in
his own mind, SGML embodies the false premise that structuring has some
inherent qualities and processing that structure should reflect its
inherent qualities. The result is that the processing defines the
structure. If there is a mismatch between the two, the result is a very
painful and elaborate processing, and it can be solved very simply by
removing the attribute/sub-contents dichotomy, because once we do that,
we return to first principles and can move forward with the same
knowledge and experience that created the attributes, but now we can do
it with sub-elements, instead, and I can promise you that once you start
off on that road, the least of your worries will be recursive structure
in attribute values.
///