Parser_Html_Table is an abstract class containing methods useful for parsing HTML tables in generic HTML files.
Public Member Functions | |
open ($url) | |
Opens URL. | |
get_line () | |
get_line appends AT LEAST one line from the $file into the $buffer. | |
parse ($url) | |
Parse method. |
Public Attributes | |
const | TIMEOUT = 3 |
Protected Member Functions | |
find_tag_and_trim ($tag) | |
find_tag_and_trim($tag) tries to find $tag in $this->buffer and trims the beginning of the buffer up to and including the $tag. Returns false if the string is not found. | |
find_tags_and_trim ($tags) | |
The same as the previous function, but searches for multiple tags. | |
find_row_end () | |
Tries to find the end of a table row. | |
get_table_rows () | |
get_table_rows tries to fill the buffer with at least one table row (...<[/]tr>) string. | |
get_charset () | |
Sets charset. |
Protected Attributes | |
$file | |
$charset | |
$buffer | |
$eoln_pos | |
$matches | |
$table_end = false |
Parser_Html_Table is an abstract class containing methods useful for parsing HTML tables in generic HTML files.
Motivation: we want to parse HTML tables to get interesting data from various web sites. The HTML code of the tables often does not conform to XML/XHTML rules. Often it does not even conform to HTML4: for example, a table row is not closed by </tr>, a table cell is not closed by </td>, etc. Therefore, XML parsers can't be used for this, and the Tidy extension is not available on all hostings. If you need to parse a table that is neither XHTML nor HTML 4.0, look at this class. The methods are optimized for speed and memory efficiency. For an example of how to use this class, see the Parser_Ebanka class.
Parser_Html_Table::find_row_end | ( | ) |
protected |
Tries to find the end of a table row.
It can even handle rows terminated incorrectly by <tr> instead of </tr>.
Returns the position of the row end (<tr> or </tr>), or false if the tag is not found.
PHP5 version: in PHP5, strripos can search for a whole string, not only a single character as in PHP4.
PHP4 version: we have to use Perl-compatible regular expressions. This is only 0.03 sec per 100 kB slower than the PHP5 strripos version.
$matchCnt = preg_match("/<[\/]?(?:tr|table)(?!.*<[\/]?(tr|table))/si", $this->buffer, $matches, PREG_OFFSET_CAPTURE);
if ($matchCnt == 1)
    return $matches[0][1];
else
    return false;
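The PHP5 approach mentioned above can be sketched as a standalone function. This is only an illustration under assumptions: the name find_row_end_sketch and the $buffer parameter are hypothetical (the class method works on $this->buffer), and the tag list simply mirrors the tags matched by the regular expression in the PHP4 version.

```php
<?php
// Hypothetical standalone sketch, not the class's actual source.
// Returns the offset of the last "<tr", "</tr", "<table" or "</table"
// in $buffer, or false if none of them occurs.
function find_row_end_sketch($buffer)
{
    $last = false;
    foreach (array('<tr', '</tr', '<table', '</table') as $needle) {
        // strripos: case-insensitive search for the last occurrence
        $pos = strripos($buffer, $needle);
        if ($pos !== false && ($last === false || $pos > $last)) {
            $last = $pos;
        }
    }
    return $last;
}
```

Like the regular expression, this sketch treats an opening <tr> as a valid row boundary, which is what lets unclosed rows be handled.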
Parser_Html_Table::find_tag_and_trim | ( | $tag | ) |
protected |
find_tag_and_trim($tag) tries to find $tag in $this->buffer and trims the beginning of the buffer up to and including the $tag.
Returns true if the string is found, in which case $this->buffer has been trimmed up to and including the first occurrence of $tag. Returns false if the string is not found.
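The find-and-trim behaviour described here can be sketched as follows. This is a minimal sketch under assumptions, not the class's actual code: the function name and the by-reference $buffer parameter are hypothetical stand-ins for $this->buffer, and the case-insensitive search (stripos) is an assumption the description does not confirm.

```php
<?php
// Hypothetical standalone sketch, not the class's actual source.
// Drops everything up to and including the first occurrence of $tag.
function find_tag_and_trim_sketch(&$buffer, $tag)
{
    // Assumption: tags are matched case-insensitively.
    $pos = stripos($buffer, $tag);
    if ($pos === false) {
        return false; // buffer is left untouched
    }
    // The cast guards against substr() returning false on older PHP
    // versions when the tag ends exactly at the end of the buffer.
    $buffer = (string) substr($buffer, $pos + strlen($tag));
    return true;
}

$buffer = "junk<table><tr>";
$found = find_tag_and_trim_sketch($buffer, "<table>");
// $found is true and $buffer is now "<tr>"
```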
Parser_Html_Table::find_tags_and_trim | ( | $tags | ) |
protected |
The same as the previous function, but searches for multiple tags.
If a tag is found, returns the index of that tag in the $tags array. If no tag is found, returns the number of $tags + 1.
array | $tags |
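A multi-tag variant could look like the sketch below. Several details are assumptions rather than documented behaviour: the function name and the by-reference $buffer are hypothetical, the search is case-insensitive, and when more than one tag occurs the one appearing earliest in the buffer is chosen. The not-found return value follows the description above.

```php
<?php
// Hypothetical standalone sketch, not the class's actual source.
// Finds whichever of $tags occurs first in $buffer, trims the buffer
// up to and including it, and returns that tag's index in $tags.
function find_tags_and_trim_sketch(&$buffer, $tags)
{
    $bestIndex = null;
    $bestPos = null;
    foreach ($tags as $index => $tag) {
        $pos = stripos($buffer, $tag);
        // Assumption: the earliest occurrence in the buffer wins.
        if ($pos !== false && ($bestPos === null || $pos < $bestPos)) {
            $bestIndex = $index;
            $bestPos = $pos;
        }
    }
    if ($bestIndex === null) {
        return count($tags) + 1; // "not found" value per the description
    }
    $buffer = (string) substr($buffer, $bestPos + strlen($tags[$bestIndex]));
    return $bestIndex;
}
```

Since the not-found value is an integer rather than false, a caller would compare the result against count($tags) instead of testing it for truthiness.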
Parser_Html_Table::get_line | ( | ) |
get_line appends AT LEAST one line from the $file into the $buffer.
protected function get_line_fgets() {
    if (!feof($this->file))
        $this->buffer .= fgets($this->file);
    else
        return false;
    $this->eoln_pos = strlen($this->buffer);
    return true;
}
Note: for HTML files with very long lines (hundreds of kilobytes without a single EOLN), fgets would be useless, since reading a single line would take a lot of memory. For such files, modify the code of this function as follows: replace ...eoln_pos=strripos($this->buffer,"\n")) with something like ...eoln_pos=find_row_end()
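The block-oriented alternative to fgets can be sketched like this. It is an illustration under assumptions and not the class's actual get_line: the function name, the parameters standing in for $this->file, $this->buffer and $this->eoln_pos, the 8192-byte block size, and the end-of-file handling are all hypothetical. It uses strrpos, which behaves the same as strripos for a "\n" needle.

```php
<?php
// Hypothetical standalone sketch, not the class's actual source.
// Appends blocks from $file to $buffer until the buffer contains at
// least one end of line, and stores the last EOL offset in $eoln_pos.
function get_line_sketch($file, &$buffer, &$eoln_pos, $block_size = 8192)
{
    $appended = false;
    while (!feof($file)) {
        $chunk = fread($file, $block_size);
        if ($chunk === false || $chunk === '') {
            break; // nothing more to read right now
        }
        $buffer .= $chunk;
        $appended = true;
        $pos = strrpos($buffer, "\n");
        if ($pos !== false) {
            $eoln_pos = $pos;
            return true;
        }
    }
    if ($appended) {
        // Data ended without a final newline: treat the whole
        // buffer as the last line (an assumption of this sketch).
        $eoln_pos = strlen($buffer);
        return true;
    }
    return false;
}
```

Swapping the strrpos call for a row-end search is the kind of change the note above suggests for files without line breaks.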
Parser_Html_Table::get_table_rows | ( | ) |
protected |
get_table_rows tries to fill the buffer with at least one table row (...<[/]tr>) string.
It then parses the rows using a regular expression, which returns the content of the table cells in the $this->matches array. Because fread reads whole blocks, it is possible this
Parser_Html_Table::open | ( | $url | ) |
Opens URL.
string | $url |
Parser_Html_Table::parse | ( | $url | ) |
abstract |
Parse method.
string | $url |