FreenetIS
Parser_Html_Table Class Reference

Parser_Html_Table is an abstract class containing methods useful for parsing HTML tables in generic HTML files.

Inheritance diagram for Parser_Html_Table:
Parser_Ebanka

Public Member Functions

 open ($url)
 Opens URL.
 get_line ()
 get_line appends AT LEAST one line from the $file into the $buffer.
 parse ($url)
 Parse method.

Public Attributes

const TIMEOUT = 3

Protected Member Functions

 find_tag_and_trim ($tag)
 Tries to find $tag in $this->buffer and trims the beginning of the buffer up to and including the tag. Returns false if the tag is not found.
 find_tags_and_trim ($tags)
 The same as find_tag_and_trim(), but searches for multiple tags.
 find_row_end ()
 This function tries to find the end of a table row.
 get_table_rows ()
 get_table_rows tries to fill the buffer with at least one table row (...<[/]tr>) string.
 get_charset ()
 Sets charset.

Protected Attributes

 $file
 $charset
 $buffer
 $eoln_pos
 $matches
 $table_end = false

Detailed Description

Parser_Html_Table is an abstract class containing methods useful for parsing HTML tables in generic HTML files.

Motivation: we want to parse HTML tables to extract interesting data from various web sites. The HTML code of these tables often does not conform to XML/XHTML rules. Often it does not even conform to HTML4, e.g. a table row is not closed by </tr>, a table cell is not closed by </td>, etc. Therefore, XML parsers cannot be used for this, and the Tidy extension is not available on all hosting services. If you need to parse a table that is neither XHTML nor HTML 4.0, look at this class. The methods have been optimized for performance and memory efficiency. For an example of how to use this class, see the Parser_Ebanka class.

Author
Tomas <Dulik at unart dot cz>
Version
1.0

Member Function Documentation

Parser_Html_Table::find_row_end ( )
protected

This function tries to find the end of a table row.

It can also handle rows that are terminated incorrectly by a tag other than </tr>.

Returns
integer The position of the row end tag (</tr> or the incorrect terminating tag), or false if the tag is not found.

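PHP5 version: in PHP5, strripos can search for a whole string, not only for a single character as in PHP4.

PHP4 version: we have to use Perl-compatible regular expressions. This is only 0.03 s per 100 kB slower than the PHP5 strripos version:

    // Match the last opening or closing tr/table tag in the buffer:
    // the negative lookahead rejects any tag that is followed by another one.
    $matchCnt = preg_match(
        "/<[\/]?(?:tr|table)(?!.*<[\/]?(tr|table))/si",
        $this->buffer, $matches, PREG_OFFSET_CAPTURE
    );
    if ($matchCnt == 1)
        return $matches[0][1];   // byte offset of the matched tag
    else
        return false;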
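
The strripos-based code itself is not shown on this page. As a rough sketch only, one way to get the same result as the regular expression above with strripos is to take the right-most position among the four tag prefixes it accepts. The function name find_row_end_sketch and the $buffer parameter are hypothetical; the real method works on $this->buffer and may be written differently.

```php
<?php
// Hypothetical standalone helper, not the class's actual code.
function find_row_end_sketch($buffer)
{
    $last = false;
    // Same tag set as the regular expression: <tr, </tr, <table, </table
    foreach (array('<tr', '</tr', '<table', '</table') as $needle) {
        // strripos: case-insensitive search for the last occurrence
        $pos = strripos($buffer, $needle);
        if ($pos !== false && ($last === false || $pos > $last)) {
            $last = $pos;
        }
    }
    // Offset of the last row/table tag, or false if none is present
    return $last;
}

var_dump(find_row_end_sketch("<tr><td>a</td></tr>"));
var_dump(find_row_end_sketch("no tags here"));
```

Like the regular expression, this treats any opening or closing tr or table tag as a possible row boundary, so a buffer that ends in the middle of a row is cut at the last complete boundary.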

Parser_Html_Table::find_tag_and_trim (   $tag)
protected

Tries to find $tag in $this->buffer and trims the beginning of the buffer up to and including the tag.

Returns false if the tag is not found. Returns true if the tag is found; $this->buffer then contains the text that follows the first occurrence of $tag.
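
The behaviour described above can be illustrated with a small standalone sketch. The name find_tag_and_trim_sketch and the by-reference $buffer parameter are hypothetical (the real method uses $this->buffer), and the case-insensitive search is an assumption.

```php
<?php
// Hypothetical standalone version of the described behaviour.
function find_tag_and_trim_sketch(&$buffer, $tag)
{
    $pos = stripos($buffer, $tag);   // case-insensitivity is an assumption
    if ($pos === false) {
        return false;                // tag not found, buffer left untouched
    }
    // Drop everything up to and including the first occurrence of the tag.
    $buffer = (string) substr($buffer, $pos + strlen($tag));
    return true;
}

$buffer = '<html><body><TABLE><tr><td>1</td>';
if (find_tag_and_trim_sketch($buffer, '<table>')) {
    echo $buffer, "\n";
}
```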

Parser_Html_Table::find_tags_and_trim (   $tags)
protected

The same as find_tag_and_trim(), but searches for multiple tags.

If a tag is found, returns the index of that tag in the $tags array. If no tag is found, returns the number of $tags + 1.

Parameters
array $tags
Returns
integer
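
A minimal sketch of the multi-tag search, under stated assumptions: when several tags occur in the buffer, the one that appears first wins, and the search is case-insensitive. Neither detail is confirmed by this page. The name find_tags_and_trim_sketch and the by-reference $buffer parameter are hypothetical.

```php
<?php
// Hypothetical standalone version; the real method uses $this->buffer.
function find_tags_and_trim_sketch(&$buffer, array $tags)
{
    $best_pos = false;
    $best_idx = count($tags) + 1;    // "not found" value described above
    foreach ($tags as $idx => $tag) {
        $pos = stripos($buffer, $tag);
        // Keep the tag that occurs earliest in the buffer (assumption).
        if ($pos !== false && ($best_pos === false || $pos < $best_pos)) {
            $best_pos = $pos;
            $best_idx = $idx;
        }
    }
    if ($best_pos !== false) {
        // Trim up to and including the tag that was found.
        $buffer = (string) substr($buffer, $best_pos + strlen($tags[$best_idx]));
    }
    return $best_idx;
}

$buffer = 'x</td>y<td>z';
var_dump(find_tags_and_trim_sketch($buffer, array('<td>', '</td>')), $buffer);
```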
Parser_Html_Table::get_line ( )

get_line appends AT LEAST one line from the $file into the $buffer.

Returns
boolean. The method updates $buffer and $eoln_pos. In PHP4, this is much faster than using fgets because of a PHP bug. In PHP5, it is usually still faster than the following version based on fgets:

    protected function get_line_fgets()
    {
        if (!feof($this->file))
            $this->buffer .= fgets($this->file);
        else
            return false;
        $this->eoln_pos = strlen($this->buffer);
        return true;
    }

Note: for HTML files with very long lines (hundreds of kilobytes without a single EOLN), fgets would be useless, because reading a single line would take a lot of memory. For such files, you should modify the code of this function as follows: replace ...eoln_pos=strripos($this->buffer,"\n")) with something like ...eoln_pos=find_row_end()
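
The block-reading code of get_line is not reproduced on this page. The following is only a sketch of the idea under stated assumptions: read fixed-size blocks with fread until the buffer contains an end-of-line, and remember the position of the last one. The class name Line_Reader_Sketch, the BLOCK_SIZE value and the public properties are hypothetical.

```php
<?php
// Hypothetical reader; not the actual Parser_Html_Table code.
class Line_Reader_Sketch
{
    const BLOCK_SIZE = 8192;         // assumed block size

    public $file;
    public $buffer = '';
    public $eoln_pos = false;

    public function __construct($file)
    {
        $this->file = $file;
    }

    public function get_line()
    {
        while (!feof($this->file)) {
            $chunk = fread($this->file, self::BLOCK_SIZE);
            if ($chunk === false || $chunk === '') {
                break;               // nothing more to read
            }
            $this->buffer .= $chunk;
            // Position of the last end-of-line currently in the buffer
            $this->eoln_pos = strripos($this->buffer, "\n");
            if ($this->eoln_pos !== false) {
                return true;         // at least one complete line is buffered
            }
        }
        return false;
    }
}

$file = fopen('php://memory', 'w+');
fwrite($file, "first line\nsecond line\n");
rewind($file);
$reader = new Line_Reader_Sketch($file);
var_dump($reader->get_line(), $reader->eoln_pos);
```

For files with very long lines, the strripos call in this sketch is the spot where a row-end search would be substituted, as the note above suggests.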

Parser_Html_Table::get_table_rows ( )
protected

get_table_rows tries to fill the buffer with at least one table row (...<[/]tr>) string.

It then parses the rows using a regular expression, which returns the content of the table cells in the $this->matches array. Because fread reads whole blocks, it is possible this

Returns
bool
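
The regular expression used by get_table_rows is not shown on this page. As an illustration only, a tolerant cell pattern could end each cell at the next cell or row tag, so that a missing </td> does not matter. The function name parse_cells_sketch and the pattern are assumptions, not the class's actual expression.

```php
<?php
// Hypothetical helper: extracts cell contents from buffered table rows.
function parse_cells_sketch($rows)
{
    // A cell runs from <td ...> to the next td/tr/table tag or the end of
    // the input, so unclosed cells are still captured.
    $pattern = '~<td\b[^>]*>(.*?)(?=</td|<td|</tr|<tr|</table|$)~si';
    preg_match_all($pattern, $rows, $matches);
    return array_map('trim', $matches[1]);
}

$rows = '<tr><td>1<td>2</td></tr><tr><td class="a">3</td>';
print_r(parse_cells_sketch($rows));
```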
Parser_Html_Table::open (   $url)

Opens URL.

Parameters
string $url
Parser_Html_Table::parse (   $url)
abstract

Parse method.

Parameters
string $url

The documentation for this class was generated from the following file: