
Thread: Parse HTML


  1. #1
    jack jill

    Parse HTML

    Hi..
    Can anyone help me write a regular expression to parse an HTML page (removing all the tags except the <a> tag) using PHP?
    Thanks

  2. #2
    ALL

    Re: Parse HTML

    so, you want to keep the links, but remove everything else... like:
    HTML Code:
    <html><head><title>something</title></head><body>some text<br />something else<br /><input type="password" value="something" /><a src="somewhere1.txt">something 1</a><a src="somewhere2.txt">something 2</a></body></html>
    this is where i am stumped...

    do you want to extract:
    somewhere1.txt && something 1
    and
    somewhere2.txt && something 2

    or
    <a src="somewhere1.txt">something 1</a>
    <a src="somewhere2.txt">something 2</a>

    if you can give me an exact example of the output you want, that would be great, because i don't want to make something that you don't need/want.

    -ALL

  3. #3
    jack jill

    Re: Parse HTML

    Thanks !
    Actually I need to keep track of both the link text (something 1) and its src (somewhere1.txt). So I should keep the code as
    <a src="somewhere1.txt">something 1</a>
    Thanks again!
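
For the goal as stated here (drop every tag but keep the <a> elements intact), PHP's built-in strip_tags() may already be enough, without a regular expression. A minimal sketch, assuming markup modeled on the sample from post #2; the variable names are illustrative only:

```php
<?php
// Sample markup modeled on the thread's example (an assumption, not the asker's real page).
$html = '<html><head><title>something</title></head><body>some text<br />'
      . '<input type="password" value="something" />'
      . '<a src="somewhere1.txt">something 1</a></body></html>';

// The second argument lists the tags to keep. Every other tag is removed,
// but the text inside the removed tags stays in the result.
$linksOnly = strip_tags($html, '<a>');

echo $linksOnly, "\n";
```

Note that the text of removed elements (the title and the body text) is still present in the output; if only the <a> elements themselves are wanted, an extraction approach fits better than stripping.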


  4. #4
    ALL

    Re: Parse HTML

    try playing around with this:
    PHP Code:
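    function ExtractLinks($html){
      $tmp = array();
      // Find each "<a" and the "</a>" that follows it
      while(($start = strpos($html, '<a')) !== false){
        $end = strpos($html, '</a>', $start);
        if($end === false) break; // unclosed tag: stop rather than loop forever
        // substr() takes a length, not an end offset
        array_push($tmp, substr($html, $start, $end + 4 - $start));
        // Continue with whatever follows the closing tag
        $html = substr($html, $end + 4);
      }
      return $tmp;
    }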

    the only problem i can see right now is that if someone is using "<a " to decorate the text without using an "href", then it will still capture it even though it is not a "REAL" link
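
One way to sidestep the "not a REAL link" problem is to let an HTML parser find the anchors and then check for the attribute explicitly. The sketch below is only an illustration under assumptions: the helper name extractLinkPairs, its $attr parameter, and the 'target'/'text' keys are made up for this example, and it relies on PHP's DOM extension being enabled.

```php
<?php
// Hypothetical helper, not from the thread: collect <a> elements that carry
// a given attribute, returning the attribute value and the link text.
function extractLinkPairs(string $html, string $attr = 'href'): array
{
    // Collect parser warnings internally instead of printing them for sloppy markup
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $pairs = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if (!$a->hasAttribute($attr)) {
            continue; // skip anchors used only for decoration or as named targets
        }
        $pairs[] = ['target' => $a->getAttribute($attr), 'text' => $a->textContent];
    }
    return $pairs;
}

$html = '<p><a name="top">Top</a> <a href="somewhere1.txt">something 1</a></p>';
print_r(extractLinkPairs($html));
```

For markup like the sample in this thread, which uses src instead of href, the same function could be called as extractLinkPairs($html, 'src').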

  5. #5
    DeadMeatGF

    Re: Parse HTML

    If you use
    PHP Code:
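    function ExtractLinks($html){
      $tmp = array();
      // Only anchors that start with "<a href=" count as real links
      while(($start = strpos($html, '<a href=')) !== false){
        $end = strpos($html, '</a>', $start);
        if($end === false) break; // unclosed tag: stop rather than loop forever
        // substr() takes a length, not an end offset
        array_push($tmp, substr($html, $start, $end + 4 - $start));
        // Continue with whatever follows the closing tag
        $html = substr($html, $end + 4);
      }
      return $tmp;
    }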

    It should avoid the decoration issue?
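
Searching for the literal string "<a href=" only works when href is the first attribute of the tag. Since the original question asked for a regular expression, here is a hedged sketch of one that accepts href in any position. The helper name extractHrefAnchors and the pattern itself are assumptions made for illustration, and regular expressions remain fragile on malformed or nested HTML.

```php
<?php
// Hypothetical helper, not from the thread: return every <a ...>...</a>
// whose opening tag contains an href attribute, in any attribute position.
function extractHrefAnchors(string $html): array
{
    // <a\s             an "a" tag followed by whitespace (so <abbr> is not matched)
    // [^>]*\bhref\s*=  href somewhere inside the same opening tag
    // .*?<\/a>         shortest run up to the closing tag; "s" lets "." span newlines
    $pattern = '/<a\s[^>]*\bhref\s*=[^>]*>.*?<\/a>/is';
    preg_match_all($pattern, $html, $matches);
    return $matches[0];
}

$html = '<a name="top">Top</a><a class="x" href="y.txt">Y</a><b>bold</b>';
print_r(extractHrefAnchors($html));
```

The anchor without an href is skipped, and the one with href after another attribute is still captured.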


