image

the sleepy snake

index :: html :: PathSyntax

Path Syntax

To query a tag for child tags the html module provides the following syntax, somewhat close to xpath:

path selectors

/* selects all child tags
//* all tags (recursive)
/.. parent tag of the tag
//.. not used (have no idea what it should return)
/table all emidiate child tables
//table all tables located anywhere lower in the hirarchy

index selection

/*[0] first child tag
/table[0]/tr[-1] last table row of first table
/*/*/td all td's wich have at least two ancestors

attributes

//[@*] any tag with at least one attribute
//[@foo] any child tag with an attribute foo
//[@foo="bar"] any child tag with an attribute "foo" with the value "bar"
//[@*="foo"] any tag with at least one attribute with the value "foo"
//[@foo=*] attrinute "foo" must have a value (no flag attribute)
//[@*=*] at least one attribute and every attribute must have a value (no flag attributes allowed)
//[not(@*)] any tag with no attributes at all
//[not(@foo)] may not not have an attribute "foo"
//[not(@foo="bar")] might have an attribute "foo" but not with its value set to"bar"
//[not(@*="bar")] at least one attribute but no attribute may have the value "bar"
//[not(@foo=*)] attrinute "bar" must not not have value (flag attribute)
//[not(@*=*)] at least one attribute but no attribute must have a value (all flag attrributes)

combining attributes and indices is supported aswell

/b[1][2][@foo="bar"][not(@bar="baz")] first and second <b> for wich the attributes selectors are True

other selectors

root() root node of the document
raw() raw tag
text() text node
comment() comment node
doctype() doctype declaration
sgml-declaration() SGML declaration
processing-instruction() processing instruction

Notes:

  • index errors are ignored in patterns, check any return value carefully
  • attribute values should best be quoted (" and ' supported)!
  • no spaces allowed in patterns! table[not(@*)] is ok table[ @* ] will not find the tag

sample

from html import *

t = (
    Table()
    (
        Tr(),
        Tr(),
        Tr()    
    )
)

print t.path('//tr')
>> [<html.html_tags.Tr object at ...>, <html.html_tags.Tr object at ...>, <html.html_tags.Tr object at ...>]