
the sleepy snake

html

HTML 4.0 module

This module is a Python wrapper for HTML tags. It contains classes representing all HTML 4.0 tags, allowing easy construction and parsing of HTML pages from pure Python. Parsing support is somewhat limited: the module does no magic at all and relies exclusively on Python's HTMLParser, so the HTML to parse has to be "well-formed". Also, no effort is made to generate nicely formatted output. All this is left as an exercise for tidylib or the like.
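
To illustrate what relying on the standard parser implies, here is a minimal sketch that uses the parser directly, not this module. It assumes Python 3, where the parser lives in `html.parser` (it is the `HTMLParser` module in Python 2). The `TagCollector` class and the `collect` function are hypothetical names introduced for illustration only.

```python
from html.parser import HTMLParser  # Python 3 standard library


class TagCollector(HTMLParser):
    """Hypothetical helper: records parser events in the order they arrive."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text nodes to keep the event list short.
        if data.strip():
            self.events.append(("data", data.strip()))


def collect(markup):
    """Feed markup to the parser and return the recorded events."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.events


balanced = collect("<p>Hello, <b>World</b>!</p>")
unbalanced = collect("<p>Hello, <b>World!</p>")  # <b> is never closed
```

The parser only reports events. For the unbalanced input no end event for `b` is ever produced, so anything that builds a tag tree on top of it has to cope with the gap itself, which is presumably why well-formed input is required.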

The module does some error checking, such as reporting unsupported attributes for a given tag or unsupported child tags.
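
The kind of check described here can be pictured with a small self-contained sketch. It does not use this module: the `RULES` table, the `check_tag` function and the local `TagError` class are hypothetical stand-ins, and the real implementation may work quite differently.

```python
class TagError(Exception):
    """Local stand-in for an error of the same name; not the module's class."""


# Hypothetical rule table: tag name -> (allowed attributes, allowed child tags)
RULES = {
    "ul": ({"class", "id"}, {"li"}),
    "li": ({"class", "id"}, {"a", "b"}),
}


def check_tag(name, attrs=(), children=()):
    """Raise TagError for attributes or child tags the rule table does not list."""
    allowed_attrs, allowed_children = RULES[name]
    for attr in attrs:
        if attr not in allowed_attrs:
            raise TagError("<%s> does not support attribute %r" % (name, attr))
    for child in children:
        if child not in allowed_children:
            raise TagError("<%s> cannot contain <%s>" % (name, child))
    return True


# A valid combination passes; an unsupported attribute is reported.
ok = check_tag("ul", attrs=["class"], children=["li"])
try:
    check_tag("ul", attrs=["href"])
    message = None
except TagError as exc:
    message = str(exc)
```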

The documentation is split into the following subsections:

html tags: HTML 4.0 tags
generic tags: generic tags like Text() or Raw()
generic classes: higher level classes
path syntax: XPath-like parsing
colors: HTML colors
constants: HTML constants

The html module defines the following errors:

PathError
ParserError
TagError

The html module provides the following functions:

add_attrs(tag, *attrs): adds attributes to a tag class
escape_string(string): replaces chars in a string with HTML entities
new_tag(name, has_endtag=True, attrs=ANY, can_contain=ANY): creates a new class of tags
set_verbosity(verbosity): sets verbosity for the module
html_tidy(commandline, what): calls tidy to tidy what
unescape_string(string): replaces HTML entities in string with chars
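
As a rough illustration of the escape and unescape round trip, the following sketch uses the Python 3 standard library's `html.escape` and `html.unescape` as stand-ins. This is an assumption made for the example: it does not call `escape_string` or `unescape_string`, and which characters those functions replace is not stated above and may differ.

```python
import html  # Python 3 standard library, NOT the module documented here

raw = '<a href="x">Tom & Jerry</a>'

# Markup-significant characters become entities ...
escaped = html.escape(raw)

# ... and converting the entities back restores the original string.
restored = html.unescape(escaped)
```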


Sample usage:

from html import *

# constructing a simple page
page = (
    HtmlFile()
    (
        Doctype(),
        Html()
        (
            Head()
            (
                Title()('my title')
            ),
            Body()
            (
                'Hello, World!'
            )
        )
    )
)
page.save(outpath="myfile.html")

# parsing a page (and tidying its markup)
page = HtmlFile(url="some/url", tidy="path/to/tidy")
for tag in page.walk():
    print(tag)
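
The nested `Tag()(children)` form used in the sample is easier to read once you see that a tag instance can be callable. The sketch below is a self-contained illustration of that pattern, not this module's code: the `Tag` base class, its `render` method and the four subclasses are hypothetical and only mimic the names used above.

```python
class Tag(object):
    """Hypothetical tag: attributes at construction, children via call."""

    name = "tag"

    def __init__(self, **attrs):
        self.attrs = attrs
        self.children = []

    def __call__(self, *children):
        # Store the children and return self so that calls can be nested.
        self.children.extend(children)
        return self

    def render(self):
        # Note: this sketch does not escape text or attribute values.
        attrs = "".join(
            ' %s="%s"' % (key, value) for key, value in sorted(self.attrs.items())
        )
        inner = "".join(
            child.render() if isinstance(child, Tag) else str(child)
            for child in self.children
        )
        return "<%s%s>%s</%s>" % (self.name, attrs, inner, self.name)


class Html(Tag):
    name = "html"


class Head(Tag):
    name = "head"


class Title(Tag):
    name = "title"


class Body(Tag):
    name = "body"


page = Html()(
    Head()(Title()("my title")),
    Body()("Hello, World!"),
)
markup = page.render()
```

Because `__call__` returns the instance itself, each `Tag()(...)` expression evaluates to a fully populated tag that can be passed straight to its parent, which is what lets the page be written as one nested expression.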