Path Syntax
To query a tag for child tags, the html module provides the following path syntax, which is loosely modeled on XPath:
path selectors
| /* | selects all child tags
|
| //* | all tags (recursive)
|
| /.. | parent tag of the tag
|
| //.. | not used (it is unclear what this should return)
|
| /table | all immediate child tables
|
| //table | all tables located anywhere lower in the hierarchy
|
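The difference between / (immediate children) and // (recursive) can be illustrated with a minimal sketch. It does not use the html module; Node, child_tags and descendant_tags are hypothetical names invented here, and the traversal order is an assumption.

```python
# Hypothetical stand-in for a tag tree; this is NOT the html module's own class.
class Node(object):
    def __init__(self, name, *children):
        self.name = name
        self.children = list(children)


def child_tags(node, name):
    """Model of '/name': immediate children only ('*' matches any tag)."""
    return [c for c in node.children if name in ("*", c.name)]


def descendant_tags(node, name):
    """Model of '//name': every matching tag anywhere below node, depth first."""
    found = []
    for c in node.children:
        if name in ("*", c.name):
            found.append(c)
        found.extend(descendant_tags(c, name))
    return found


inner = Node("table")
body = Node("body", Node("table", Node("tr", Node("td", inner))), Node("p"))

print(len(child_tags(body, "table")))       # immediate child tables only
print(len(descendant_tags(body, "table")))  # also finds the nested table
print(len(descendant_tags(body, "*")))      # every tag below body
```

Here the nested table is reached only by the recursive form, which is the behaviour the table above describes for //table.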
index selection
| /*[0] | first child tag
|
| /table[0]/tr[-1] | last table row of first table
|
| /*/*/td | all td's which have at least two ancestors
|
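A chained pattern such as /table[0]/tr[-1] can be read as one step per path segment. The following is a rough model of that reading, not the module's implementation; Node and step are hypothetical names, and treating an out-of-range index as "no match" is an assumption based on the note that index errors are ignored.

```python
# Hypothetical stand-in for a tag tree; this is NOT the html module's own class.
class Node(object):
    def __init__(self, name, *children):
        self.name = name
        self.children = list(children)


def step(nodes, name, index=None):
    """Apply one '/name' or '/name[index]' step to every node in nodes."""
    result = []
    for node in nodes:
        matches = [c for c in node.children if name in ("*", c.name)]
        if index is None:
            result.extend(matches)
        else:
            try:
                result.append(matches[index])  # negative indices count from the end
            except IndexError:
                pass  # assumption: an out-of-range index silently matches nothing
    return result


rows = [Node("tr"), Node("tr"), Node("tr")]
root = Node("body", Node("table", *rows), Node("table"))

first_table = step([root], "table", 0)
last_row = step(first_table, "tr", -1)   # models /table[0]/tr[-1]
missing = step([root], "table", 5)       # out of range, yields an empty list
```

In this model a bad index never raises, so an empty result is the only sign that a step matched nothing.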
attributes
| //[@*] | any tag with at least one attribute
|
| //[@foo] | any child tag with an attribute foo
|
| //[@foo="bar"] | any child tag with an attribute "foo" with the value "bar"
|
| //[@*="foo"] | any tag with at least one attribute with the value "foo"
|
| //[@foo=*] | attribute "foo" must have a value (no flag attribute)
|
| //[@*=*] | at least one attribute and every attribute must have a value (no flag attributes allowed)
|
| //[not(@*)] | any tag with no attributes at all
|
| //[not(@foo)] | must not have an attribute "foo"
|
| //[not(@foo="bar")] | may have an attribute "foo", but not with the value "bar"
|
| //[not(@*="bar")] | at least one attribute but no attribute may have the value "bar"
|
| //[not(@foo=*)] | attribute "foo" must not have a value (flag attribute)
|
| //[not(@*=*)] | at least one attribute, but no attribute may have a value (all flag attributes)
|
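The attribute selectors above can be restated as plain predicates. This sketch is one possible reading of the table, not the module's code: attributes are modelled as a dict in which a flag attribute maps to None, and PREDICATES and matching are hypothetical names.

```python
# Attributes are modelled as a plain dict; a flag attribute (no value) maps to None.
# This representation is an assumption made for the sketch.
PREDICATES = {
    '[@*]':              lambda a: len(a) > 0,
    '[@foo]':            lambda a: 'foo' in a,
    '[@foo="bar"]':      lambda a: a.get('foo') == 'bar',
    '[@*="foo"]':        lambda a: 'foo' in a.values(),
    '[@foo=*]':          lambda a: a.get('foo') is not None,
    '[@*=*]':            lambda a: len(a) > 0 and None not in a.values(),
    '[not(@*)]':         lambda a: len(a) == 0,
    '[not(@foo)]':       lambda a: 'foo' not in a,
    '[not(@foo="bar")]': lambda a: a.get('foo') != 'bar',
    '[not(@*="bar")]':   lambda a: len(a) > 0 and 'bar' not in a.values(),
    '[not(@foo=*)]':     lambda a: 'foo' in a and a['foo'] is None,
    '[not(@*=*)]':       lambda a: len(a) > 0 and all(v is None for v in a.values()),
}


def matching(attrs):
    """Return, sorted, the selectors from the table that accept this attribute dict."""
    return sorted(p for p, test in PREDICATES.items() if test(attrs))


for attrs in ({}, {'foo': 'bar'}, {'foo': None}):
    print(attrs, matching(attrs))
```

Note that in this reading several not(...) forms are not simple negations: [not(@*="bar")] and [not(@*=*)] still require at least one attribute, as the table states.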
combining attributes and indices is supported as well
| /b[1][2][@foo="bar"][not(@bar="baz")] | first and second <b> for which the attribute selectors are True
|
other selectors
| root() | root node of the document
|
| raw() | raw tag
|
| text() | text node
|
| comment() | comment node
|
| doctype() | doctype declaration
|
| sgml-declaration() | SGML declaration
|
| processing-instruction() | processing instruction
|
Notes:
- index errors are ignored in patterns; check every return value carefully
- attribute values should be quoted (both " and ' are supported)
- no spaces are allowed in patterns: table[not(@*)] is ok, but table[ @* ] will not find the tag
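Since unquoted values and stray spaces make a pattern fail silently, it can help to build patterns in one place. The helper below is a hypothetical sketch, not part of the html module; it assumes that a tag name may be combined with an attribute selector as in table[not(@*)], and it refuses values containing both quote characters because no escaping rule is documented.

```python
def attr_pattern(tag, name, value):
    """Build '//tag[@name="value"]' without spaces and with a quoted value.

    Hypothetical helper, not part of the html module.
    """
    for part in (tag, name):
        if not part or any(ch.isspace() for ch in part):
            raise ValueError("tag and attribute names must be non-empty and space-free")
    if '"' not in value:
        quote = '"'
    elif "'" not in value:
        quote = "'"
    else:
        # No escaping rule is documented, so refuse rather than guess.
        raise ValueError("value contains both quote characters")
    return "//%s[@%s=%s%s%s]" % (tag, name, quote, value, quote)


print(attr_pattern("table", "foo", "bar"))
```

Whether spaces inside a quoted value are accepted is not stated in the notes, so the helper leaves the value itself unchecked.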
sample
from html import *
t = (
    Table()
    (
        Tr(),
        Tr(),
        Tr()
    )
)
print(t.path('//tr'))
>> [<html.html_tags.Tr object at ...>, <html.html_tags.Tr object at ...>, <html.html_tags.Tr object at ...>]