Parser_Html_Table is an abstract class containing methods useful for parsing HTML tables in generic HTML files.

Public Member Functions
	open ($url)
	Opens URL.
	get_line ()
	get_line appends AT LEAST one line from the $file into the $buffer.
	parse ($url)
	Parse method.

Public Attributes
const	TIMEOUT = 3

Protected Member Functions
	find_tag_and_trim ($tag)
	find_tag_and_trim($tag) tries to find the tag in $this->buffer and trims the beginning of the buffer up to and including the $tag. Returns false if the string is not found.
	find_tags_and_trim ($tags)
	Same as the previous function, but searches for multiple tags.
	find_row_end ()
	This function tries to find the end of a table row.
	get_table_rows ()
	get_table_rows tries to fill the buffer with at least one table row (...<[/]tr>) string.
	get_charset ()
	Sets charset.

Protected Attributes
	$file
	$charset
	$buffer
	$eoln_pos
	$matches
	$table_end = false

Detailed Description

Parser_Html_Table is an abstract class containing methods useful for parsing HTML tables in generic HTML files.

Motivation: we want to parse HTML tables to get interesting data from various web sites. The HTML code of such tables often does not conform to XML/XHTML rules. Often it does not even conform to HTML4, e.g. the table row is not closed by </tr>, the table cell is not closed by </td>, etc. Therefore, XML parsers can't be used for this, and the Tidy extension is not available on all hosting services. If you need to parse a table that is neither valid XHTML nor valid HTML 4.0, look at this class. The methods have been optimized for performance and memory efficiency. For an example of how to use this class, see the Parser_Ebanka class.

Author: Tomas <Dulik at unart dot cz>

Version: 1.0

Member Function Documentation

Parser_Html_Table::find_row_end ( )

protected

This function tries to find the end of a table row.

It can also handle rows that are not terminated by a proper closing row tag.

Returns: integer The position of the last row or table tag in the buffer (opening or closing, as matched by the expression below), or false if no such tag is found.

PHP5 version: in PHP5, strripos can search for a whole string, whereas strrpos in PHP4 uses only the first character of the needle.

PHP4 version: we have to use Perl-compatible regular expressions. This is only 0.03 sec/100 kB slower than the PHP5 strripos version:

$matchCnt = preg_match("/<[\/]?(?:tr|table)(?!.*<[\/]?(tr|table))/si", $this->buffer, $matches, PREG_OFFSET_CAPTURE);
if ($matchCnt == 1) return $matches[0][1];
else return false;
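
The regular expression above can be tried outside the class. The following is a minimal sketch, not the class's own code: find_last_row_tag() is a hypothetical standalone function that takes the buffer as a parameter instead of reading $this->buffer, and the sample buffer is invented for illustration.

```php
<?php
// Hypothetical standalone helper; the real method works on $this->buffer.
function find_last_row_tag(string $buffer)
{
    // Match the start of a <tr>, </tr>, <table> or </table> tag that is not
    // followed by any other such tag, i.e. the last one in the buffer.
    $pattern = "/<[\/]?(?:tr|table)(?!.*<[\/]?(tr|table))/si";
    $count = preg_match($pattern, $buffer, $matches, PREG_OFFSET_CAPTURE);
    // With PREG_OFFSET_CAPTURE, $matches[0] is [matched text, byte offset].
    return ($count === 1) ? $matches[0][1] : false;
}

// Rows here are "closed" only by the next opening <tr>.
$pos = find_last_row_tag("<table><tr><td>1<td>2<tr><td>3");
$none = find_last_row_tag("no tags here");
```

The negative lookahead is what makes the match the last tag rather than the first one; everything before the returned offset consists of complete rows.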

Parser_Html_Table::find_tag_and_trim ( $tag )

protected

find_tag_and_trim($tag) tries to find the tag in $this->buffer and trims the beginning of the buffer up to and including the $tag. Returns false if the string is not found.

Returns true if the string is found; $this->buffer then contains the text that follows the first occurrence of $tag.
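
The behaviour described above can be sketched as follows. This is an illustration under assumptions, not the class's implementation: trim_to_tag() is a hypothetical standalone function, the buffer is passed by reference instead of living in $this->buffer, and the case-insensitive search is assumed.

```php
<?php
// Hypothetical standalone helper; the real method works on $this->buffer.
function trim_to_tag(string &$buffer, string $tag): bool
{
    // Case-insensitive search (an assumption; HTML tag names vary in case).
    $pos = stripos($buffer, $tag);
    if ($pos === false) {
        return false; // tag not found, buffer is left untouched
    }
    // Drop everything up to and including the first occurrence of the tag.
    $buffer = (string) substr($buffer, $pos + strlen($tag));
    return true;
}

$buffer = "<html><body><TABLE><tr><td>42";
$found = trim_to_tag($buffer, "<table>");
// $buffer now starts right after the <TABLE> tag.
```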

Parser_Html_Table::find_tags_and_trim ( $tags )

protected

Same as the previous function, but searches for multiple tags.

If a tag is found, returns the index of that tag in the $tags array. If no tag is found, returns the number of $tags + 1.

Parameters

array $tags

Returns: integer

Parser_Html_Table::get_line ( )

get_line appends AT LEAST one line from the $file into the $buffer.

Returns: boolean. Updates buffer and eoln_pos.

In PHP4, this is MUCH faster than using fgets because of a PHP bug. In PHP5, this is usually still faster than the following version based on fgets:

protected function get_line_fgets() {
    if (!feof($this->file)) $this->buffer .= fgets($this->file);
    else return false;
    $this->eoln_pos = strlen($this->buffer);
    return true;
}

Note: for HTML files with very long lines (hundreds of kilobytes without a single EOLN), fgets would be useless, because reading a single line would take a lot of memory. For such files, you should modify the code of this function as follows: replace ...eoln_pos=strripos($this->buffer,"\n")) with something like ...eoln_pos=find_row_end()
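
For comparison with the fgets version, a block-reading approach can be sketched as below. This is only a sketch under assumptions: append_lines() is a hypothetical standalone function, the block size and the in-memory test stream are invented for illustration, and the real method keeps its state in $this->file, $this->buffer and $this->eoln_pos.

```php
<?php
// Hypothetical sketch: append blocks from $file to $buffer until the buffer
// holds at least one end-of-line, then record the position of the last one.
function append_lines($file, string &$buffer, &$eoln_pos, int $block = 8192): bool
{
    $appended = false;
    while (!feof($file)) {
        $chunk = fread($file, $block);
        if ($chunk === false || $chunk === '') {
            break; // read error or nothing more to read
        }
        $buffer .= $chunk;
        $appended = true;
        $pos = strrpos($buffer, "\n");
        if ($pos !== false) {
            $eoln_pos = $pos; // last complete line ends here
            return true;
        }
    }
    if (!$appended) {
        return false; // end of file, nothing was added
    }
    $eoln_pos = strlen($buffer); // final line without a trailing EOLN
    return true;
}

// Demonstration on an in-memory stream with a deliberately tiny block size.
$file = fopen('php://memory', 'w+');
fwrite($file, "ab\ncd\nef");
rewind($file);
$buffer = '';
$eoln_pos = null;
$ok = append_lines($file, $buffer, $eoln_pos, 4);
```

Because whole blocks are read, the buffer usually ends with a partial line; the caller is expected to process the text up to $eoln_pos and keep the remainder for the next call.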

Parser_Html_Table::get_table_rows ( )

protected

get_table_rows tries to fill the buffer with at least one table row (...<[/]tr>) string.

It then parses the rows using a regular expression, which returns the content of the table cells in the $this->matches array. Because fread reads whole blocks, it is possible this

Returns: bool
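
To illustrate how a regular expression can return cell contents even when closing tags are missing, here is a small sketch. The pattern and the function name extract_cells() are assumptions made for this example; the expression actually used by the class is not shown in this documentation.

```php
<?php
// Hypothetical helper: collect the contents of <td> cells from a buffer of
// table rows, tolerating missing </td> and </tr> tags.
function extract_cells(string $rows): array
{
    // A cell runs from its <td ...> tag to the next <td>, </td>, <tr>, </tr>
    // or </table> tag, or to the end of the buffer. This pattern is an
    // assumption for illustration, not the one used by the class.
    $pattern = '/<td[^>]*>(.*?)(?=<\/?t[dr]|<\/table|$)/si';
    preg_match_all($pattern, $rows, $matches);
    return $matches[1]; // captured cell contents, in document order
}

$cells = extract_cells("<tr><td>1<td>2</td><tr><td class=\"x\">3");
```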

Parser_Html_Table::open ( $url )

Opens URL.

Parameters

string $url

Parser_Html_Table::parse ( $url )

abstract

Parse method.

Parameters

string $url

The documentation for this class was generated from the following file:

Parser_Html_Table.php
