Note: this page is a work in progress.

For questions or comments about this page, please use the contact form or email mj@mjclement.com. If you happen to catch an error, please let me know.

What is HTML?

HTML is HyperText Markup Language: a language for the markup of hypertext.

Hypertext
In the simplest possible terms, hypertext is text plus links. The earliest description of hypertext I have heard is as "text that is not constrained to be linear". The reader can follow multiple different paths by following links, and is not likely to read a given set of documents in any one particular sequence. This is a significant difference from print media where there is pervasive linearity across larger units of text.
Markup
"Markup" is a term with a history in typesetting, referring to additional information added to a text by the author or editor before it goes to press, to give a typesetter clues about how it should be printed.
Language
HTML is a "language," not a natural language but a computer language. A computer language can be written by people, and can be understood by computers.

Note that, at the time of writing, the latest versions of HTML are called XHTML. Everything in this document applies to XHTML as well as older plain HTML. Most people use "HTML" to include both HTML and XHTML, and that is the usage here.

How hypertext works

Hypertext defines uni-directional links between documents. For a link to take you to another document, there must be some way of identifying the document that the link points to. This is accomplished using a URI (Uniform Resource Identifier) or URL (Uniform Resource Locator). A URI uniquely identifies a document on the Web. A link in HTML consists of an anchor and a URL. The anchor is the text or image that can be clicked or otherwise activated; the URL identifies the document your browser should display when that happens.

How HTML works

HTML hypertext documents are simple text files. These files can be viewed in a browser designed for HTML, but they can also be opened in any program that can read or edit text. This means HTML editing does not require a special HTML editor: you can use a typical text editor (think "Notepad") to create HTML pages. Of course, there are also editors specifically designed to create and edit HTML. Different people prefer different tools to manipulate HTML. I personally use a text editor, but would not be opposed to using an HTML-specific editor, if I ever find one that creates HTML output as clean as mine.

HTML documents are divided into paragraphs, and can have headings, lists of items (which can be numbered), and tables for laying out information in rows and columns. All of these are indicated in HTML files by using elements. Elements are often loosely referred to as tags. Each element must have one of the names defined by the HTML specification. Browsers determine the presentation of an HTML document by reading the elements and arranging them sensibly.
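
As a rough illustration of the kinds of elements listed above, here is a small fragment with a heading, a paragraph, a numbered list, and a table. The content is made up for the example; only the element names come from HTML itself.

```html
<h1>Shopping notes</h1>
<p>Things to buy this week:</p>

<!-- "ol" is an ordered (numbered) list; "ul" would give a bulleted one -->
<ol>
  <li>Bread</li>
  <li>Coffee</li>
</ol>

<!-- A table is built from rows ("tr") holding header ("th") or data ("td") cells -->
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Bread</td><td>2.50</td></tr>
  <tr><td>Coffee</td><td>7.00</td></tr>
</table>
```

Nothing in this fragment says how the list or table should look; that is left to the browser.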

Tags

Tags divide an HTML document into elements, and those elements define the structure and content of the document. The tags themselves aren't visible in the browser; they are simply there to mark the beginning and end of elements.

Tags are marked by angle brackets. The left angle bracket "<" indicates the beginning of the tag, then the name of the tag follows, and the right angle bracket ">" marks the end. Opening tags mark the beginning of an element. Closing tags have the same name as the opening tag, but start with a left angle bracket followed by a slash, "</". For example, a paragraph element contains an opening paragraph tag, the paragraph content, and a closing paragraph tag:

<p>paragraph contents</p>

A few commonly used tags include "<p>" for paragraphs, "<h1>", "<h2>", ... "<h6>" for headings, "<em>" for emphasized text, and "<strong>" for stronger emphasis.
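
A short sketch of these tags used together might look like the following. The wording of the headings and the sentence is invented for the example.

```html
<h1>Main heading</h1>
<h2>A subheading</h2>
<p>This word is <em>emphasized</em>, and this one is <strong>strongly emphasized</strong>.</p>
```

Visual browsers commonly show "<em>" in italics and "<strong>" in bold, but that is a choice made by the browser rather than something the tags require.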

Attributes

Elements can also have attributes. Attributes are named strings associated with a particular element. Here is an example of a paragraph element with a class attribute:

<p class="example">
Here is some example text.
</p>

This creates a paragraph just like any other HTML paragraph, but also associates this paragraph with the "example" class. Designating various classes of paragraphs might be used, for example, to make all the paragraphs of a certain class show up in a certain color.
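
As a hedged sketch of how such a class might be given a color, the fragment below pairs the paragraph with a small style sheet rule. The color and the second paragraph are assumptions chosen for illustration.

```html
<!-- In a full document, this style element belongs inside the head element -->
<style type="text/css">
  /* Applies to every paragraph whose class attribute is "example" */
  p.example { color: green; }
</style>

<p class="example">Here is some example text.</p>
<p>This paragraph has no class and keeps the default color.</p>
```

The HTML only says that the first paragraph is an "example"; the decision to show examples in green lives entirely in the style rule and can be changed without touching the paragraph.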

Links

Links are created with the "<a>" or anchor element. Anchor elements can use the "href" attribute to specify the URL to which the link points. In visual Web browsers, links are usually blue and underlined, and clicking one "follows the link," that is, the browser loads the document at the URL specified by the "href" attribute.
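
A minimal example of a link follows. The URL is a placeholder on the reserved example.com domain, not a real page.

```html
<p>
  Read the <a href="http://www.example.com/guide.html">HTML guide</a>
  for more details.
</p>
```

Here "HTML guide" is the anchor text the reader clicks, and the value of "href" is the URL the browser goes to.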

Images

Images are not directly included in HTML files; rather, they are stored in separate files, using common image file formats like JPEG or PNG. HTML documents contain "<img>" elements, which reference the actual image file. The img element uses a "src" attribute to define the location of the file containing the image. The browser is responsible for finding the image, downloading it, and displaying it on the page. If the image can't be found, you may see a "broken image" icon displayed by your browser to indicate that the image is missing. HTML documents should use the "alt" attribute of the image element to specify alternate text for an image. This text will be displayed in place of the image while it is downloading or if it is unavailable, and will be presented in place of the image to those who cannot view images.
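
A sketch of an image element is shown below. The file name and the alternate text are hypothetical.

```html
<img src="images/sunset.jpg" alt="Sunset over the harbor" />
```

The trailing slash follows XHTML syntax for elements that have no content; browsers also accept the tag without it.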

Entities

Entities make it possible to include certain special characters in HTML documents that could not otherwise be written directly. For example, the left angle bracket character "<" is normally interpreted as the beginning of a tag. If you want to include a literal left angle bracket in your document (as in the previous sentence) it must be represented in the HTML by an entity. Entities begin with an ampersand "&" and end with a semicolon ";". Whatever is between the ampersand and the semicolon is interpreted as the name of an entity. The entity for the left angle bracket, for example, is "&lt;". Of course, if "&lt;" were really in the HTML, your browser would simply show a left angle bracket. So another entity must be used to represent the literal ampersand. The ampersand entity is "&amp;".
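
The fragment below sketches how these entities look in an HTML source file. It also uses "&gt;", the matching entity for the right angle bracket, which is not discussed above.

```html
<!-- Displays as: In HTML, a paragraph starts with the <p> tag. -->
<p>In HTML, a paragraph starts with the &lt;p&gt; tag.</p>

<!-- Displays as: To show the entity itself, write &lt; in the source. -->
<p>To show the entity itself, write &amp;lt; in the source.</p>
```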

Who invented HTML?

Numerous people contributed to the development of HTML, including Tim Berners-Lee and Dan Connolly, who created the first specification of HTML.

HTML is now maintained by the W3C, which also creates and maintains other standards to help the Web move along "towards its full potential." The latest incarnation of HTML from the W3C is XHTML, which is a reformulation of HTML as an XML application. Everything in this document describes XHTML as well as traditional HTML.

Learning HTML

All there is to HTML is elements, which can contain text and can also have attributes. To use HTML you need to be familiar with the elements that are available, and what attributes they can have. A good start is the latest specifications from the W3C. There are also gentler ways to learn HTML if you cannot easily understand the specifications themselves. In a few hours, you can learn to create basic hypertext containing links, with headings, organized in paragraphs, with lists, tables, and images.

The best way to learn HTML is to create some HTML documents in a text editor. Using a WYSIWYG HTML editor will not teach you HTML, but will in fact prevent you from learning it. HTML was designed as a human-readable and human-writable language. It uses a syntax that makes sense and is easy to manipulate without complicated tools. There is nothing to prevent anyone from learning HTML and writing HTML simply and easily with a basic text editing tool. As a learning aid, almost any browser lets you view the HTML source of the page you are viewing.
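
As a starting point for experimenting, here is a minimal sketch of a complete page that could be typed into a text editor, saved under a name such as first.html (a hypothetical file name), and opened in a browser. It assumes the XHTML 1.0 Strict document type; the title and body text are made up.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
  <!-- The title appears in the browser's title bar, not in the page itself -->
  <title>My first page</title>
</head>
<body>
  <h1>Hello</h1>
  <p>This is my first page. Here is a <a href="http://www.w3.org/">link to the W3C</a>.</p>
</body>
</html>
```

Everything the reader sees goes inside the body element; the head holds information about the document.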

Problems with HTML

HTML has been in use since the early 1990s. The development of HTML has proceeded in a somewhat haphazard way, with mixed results. The original specification of HTML was targeted at structural markup of documents, which would describe what the content of the document was and leave the formatting and display of the document up to the user agent (browser), which is to say, up to the user. Later development of HTML was often driven by browser vendors with business agendas that were not aligned with the original intentions of HTML, nor with the best interests of users. HTML has become somewhat bloated by inappropriate additions. Many of these developments were designed to give control over the document display to the document author. In practice this often meant taking that same control away from the user, the person using a browser to actually read the HTML document. Unfortunately, a lot of simply horrible HTML was produced as many designers tried to treat the Web as just another form of print media.

Recently, with the push for style sheets by the W3C, some attempt is being made to restore HTML as a structural markup language and cut away some of the uglier developments (such as the brain-damaged <FONT> element). The W3C has also used XML as the basis of future HTML versions. While the XML syntax is not quite as compact as the original SGML-based HTML, it allows for more rigorous syntax checking, and stricter HTML means a better Web for everyone. Most of the current problems related to HTML are not actually HTML problems, but problems with rendering, style sheets, and the rampant misuse of HTML features to do things HTML simply was not designed to do. HTML is perfectly adequate when used as intended.

Markup and Stylesheets

Good textual markup, including good HTML, indicates the structural or contextual functions of elements of the text. The formatting is then determined based on the structure of the text, using some externally defined rules. As an example, there are three tags in HTML which usually produce italicized text. They are: <em>, <i>, and <cite>. Although most browsers display these exactly the same way, each actually indicates something different. The <i> tag simply means "italicize this text." The <em> tag means "emphasize this text" and most browsers accomplish this by italicizing it. The <cite> tag means "this text is a citation," such as the name of a book or a play, and most browsers indicate this using italics.

Why are there three ways of doing the same thing? Well, it isn't the same thing at all, and this becomes clear when we imagine using the HTML in different ways. Let's say we have a document containing all three tags. Perhaps we write a program to read the document in a vaguely human-sounding voice. When our program gets to the <em> tag, it knows what to do: it should emphasize that word, phrase, or sentence. Perhaps it does this by speaking louder, or in a higher tone. Real people use a number of different means to add emphasis when speaking. When our program reads the <cite> text, it also will know what to do. The text is a title of a work, and there are various ways of indicating this when speaking. Often people add some "space" or silence before and after a title. Our program might even say "citation" to make it very clear, or use a completely different voice, or it might even just read the title normally and let the hearer identify it as a title by context. However, it certainly would not use the same modulation as with emphasized text. When our program gets to the <i> tag, it will have no idea how to read the text. It should be italicized when printed, but why? Is it a citation? Italicized for emphasis? An excerpt in another language?
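
To make the contrast concrete, here is a small sketch in which all three tags appear. The sentences are invented for the example.

```html
<!-- "em" marks emphasis; "cite" marks the title of a work -->
<p>You <em>must</em> read <cite>No Exit</cite> by Sartre.</p>

<!-- "i" only asks for italics; the markup does not say why -->
<p>The play has a certain <i>je ne sais quoi</i>.</p>
```

A typical visual browser would show all three marked phrases in italics, yet only the first two tell a program what kind of text it is dealing with.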

Perhaps we write another program to translate documents from English into German. When this program gets to a citation, it will know not to translate the title of a work. A bit of emphasized text should be translated normally. A phrase quoted in French shouldn't be translated, and our translation program, expecting English, would choke on French anyway. Again, with the <i> tag, there is no way for the program to know what to do with the italicized text. Finally, consider a search program looking through many documents for references to Sartre's No Exit. If citation tags have been used, the search program can eliminate all occurrences of "No Exit" that are not references to the work in question, or at least some work with the same title. Of course, such a search can only work if the documents being searched all use the <cite> tags properly, and experience shows this cannot be relied on.

A person reading italicized text can tell from context why it is italicized. That is why it is all right to use italics for so many different purposes. A computer cannot make decisions based on context, since computers are nowhere near any understanding of natural language. So if we are creating text in a computer, we should specify what is different about parts of the text, not how they should be formatted, at least if we want to be able to use the text in more than one way. Should HTML not have an <i> tag? Ideally, such a tag would be needed very seldom, if at all. In every case where printed text is meant to be italicized, different tags that indicate what it is about the text that makes it different would be used. However, there are a few cases in which it is desirable to have text that is italicized for no other underlying reason. For example, it is nice to be able to say: here is an example of italicized text. Such a use of italics has no deeper meaning. Furthermore, there will probably never be tags for all the reasons people have for putting text in italics, and it is certainly better to have a generic "italic" tag than for people to use an inaccurate tag just to force italics.

Unfortunately, many people use the wrong HTML tags in the wrong places. Worse, HTML simply lacks the tags needed to mark many structural distinctions. To use the example above, there should be a way to indicate the use of a word borrowed from another language. In spite of what was said above, computers can make some contextual decisions about content. For example, an ideal HTML editor would contain a library of titles of many books, films, and other works, and would automatically offer to mark up such occurrences as citations as they are entered. If you typed "Shakespeare's Hamlet" in such an editor, "Hamlet" would be recognized as the title of a play and marked up automatically. An editor might also provide a similar function for words borrowed from another language. This would encourage the use of proper HTML tags without requiring manual insertion of all of them.
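
One partial mechanism that HTML does provide is the "lang" attribute, which can be set on most elements to declare the language of their content. The fragment below is only a sketch, assuming a generic "<span>" element is acceptable where no more specific element applies.

```html
<!-- "fr" is the language code for French -->
<p>The play has a certain <span lang="fr">je ne sais quoi</span>.</p>
```

This records the language of the phrase rather than the fact that it is a borrowing, so it only partly addresses the gap described above. XHTML 1.0 documents would usually add a matching "xml:lang" attribute as well.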

Stylesheets allow the separation of formatting considerations from the actual information. It is possible to specify several stylesheets which may be used to view the same information with different formatting applied, or even on different kinds of devices. Unfortunately, stylesheets are not as widely used as they should be, and are not well integrated into browsers: for example, it is not possible to choose a stylesheet in one place for a whole site (without using cookies and scripts), and some browsers do not even allow switching between stylesheets. Few HTML editors encourage the use of stylesheets.
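
As a sketch of how several stylesheets can be attached to one document, the head section below links a default stylesheet, an alternate one, and one used only for printing. The file names and titles are hypothetical.

```html
<head>
  <title>Stylesheet example</title>
  <!-- Default stylesheet, applied when the page loads -->
  <link rel="stylesheet" type="text/css" href="screen.css" title="Standard" />
  <!-- Alternate stylesheet that a browser may offer to the reader -->
  <link rel="alternate stylesheet" type="text/css" href="large-print.css" title="Large print" />
  <!-- Stylesheet applied only when the page is printed -->
  <link rel="stylesheet" type="text/css" href="print.css" media="print" />
</head>
```

Whether the reader can actually pick the "Large print" alternative depends on the browser, which is the integration problem described above.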