Regex help - URGENT will pay!

Discussion in 'PHP' started by thesurface, Jul 12, 2012.

  1. #1
    I need to extract this from the HTML code below: 110.234.71.142:8080


    <td class="leftborder timestamp" rel="1342078386"><span class="updatets ">
    18 secs</span></td>
             <td><span><style>
    .wef6{display:none}
    .N6zc{display:inline}
    .ANrs{display:none}
    .qQPY{display:inline}
    .Cory{display:none}
    .Jgqn{display:inline}
    </style><span class="N6zc">110</span><span style="display:none">180</span><span class="ANrs">13</span><div style="display:none">176</div>.<span class="N6zc">234</span><span class="Cory">7</span>.<span style="display: inline">71</span><span class="wef6">63</span>.<span class="227">142</span></span></td>    
             <td>
    8080</td>
    Code (markup):
     
    thesurface, Jul 12, 2012 IP
  2. Vooler

    Vooler Well-Known Member

    #2
    Why not do it using DOMDocument? Try the following.

    
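    // $code holds the HTML fragment from the first post
    libxml_use_internal_errors(TRUE);   // suppress warnings about the incomplete markup
    $dom = new DOMDocument();
    $dom->loadHTML($code);
    $xml = simplexml_import_dom($dom);
    libxml_clear_errors();
    libxml_use_internal_errors(FALSE);
    
    // print the text of every <span>, hidden ones included
    foreach($xml->xpath("//span") as $item){
        echo (string)$item . PHP_EOL;
    }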
    
    PHP:
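    The loop above prints every span, including the hidden decoy values. A minimal sketch of one way to keep only the visible numbers is shown here. It assumes the only CSS involved is the display property, set either in the inline <style> block or in a style attribute, and that a class without a rule is visible. The helper name visibleOctets is hypothetical.

```php
<?php
// Sketch only: visibleOctets() is a hypothetical helper, not part of any library.
function visibleOctets(string $html): array
{
    // Map each class in the inline <style> block to its display value.
    preg_match_all('/\.([\w-]+)\s*\{\s*display:\s*([a-z-]+)\s*\}/i', $html, $m);
    $display = array_combine($m[1], $m[2]);

    libxml_use_internal_errors(true);   // the fragment is not a full document
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors(false);

    $octets = [];
    $xpath = new DOMXPath($dom);
    // Only <span>/<div> elements without child elements, in document order.
    foreach ($xpath->query('//span[not(*)] | //div[not(*)]') as $node) {
        $text = trim($node->textContent);
        if (!preg_match('/^[0-9]+$/', $text)) {
            continue;                   // skips text such as "18 secs"
        }
        $inline = strtolower(str_replace(' ', '', $node->getAttribute('style')));
        $class  = trim($node->getAttribute('class'));
        // Assumption: a class with no rule in the <style> block counts as visible.
        $hidden = strpos($inline, 'display:none') !== false
            || ($display[$class] ?? 'inline') === 'none';
        if (!$hidden) {
            $octets[] = $text;
        }
    }
    return $octets;
}

// The IP cell from the first post.
$cell = '<td><span><style>.wef6{display:none}.N6zc{display:inline}.ANrs{display:none}.qQPY{display:inline}.Cory{display:none}.Jgqn{display:inline}</style><span class="N6zc">110</span><span style="display:none">180</span><span class="ANrs">13</span><div style="display:none">176</div>.<span class="N6zc">234</span><span class="Cory">7</span>.<span style="display: inline">71</span><span class="wef6">63</span>.<span class="227">142</span></span></td>';

echo implode('.', visibleOctets($cell)), PHP_EOL;
```

    For the sample in the first post this should yield 110.234.71.142. The port sits in a separate <td> and would still have to be read on its own.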
     
    Vooler, Jul 13, 2012 IP
  3. EricBruggema

    EricBruggema Well-Known Member

    #3
    Why not first remove everything between <style> and </style>, and then strip all the remaining tags?

    You can use stripos to find </style>.
    php.net/strip_tags is also nice! :)
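    A rough sketch of that idea, using stripos and strip_tags on the IP cell from the first post. The helper name textAfterStyle is made up for this example. Note that strip_tags knows nothing about CSS, so the hidden decoy numbers stay in the result and would have to be filtered out before this step.

```php
<?php
// Sketch only: textAfterStyle() is a hypothetical helper name.
function textAfterStyle(string $html): string
{
    $close = '</style>';
    $pos = stripos($html, $close);
    if ($pos !== false) {
        // Drop everything up to and including the closing </style> tag.
        $html = substr($html, $pos + strlen($close));
    }
    // Remove the remaining tags, keeping only their text.
    return strip_tags($html);
}

// The IP cell from the first post.
$cell = '<span><style>.wef6{display:none}.N6zc{display:inline}.ANrs{display:none}.qQPY{display:inline}.Cory{display:none}.Jgqn{display:inline}</style><span class="N6zc">110</span><span style="display:none">180</span><span class="ANrs">13</span><div style="display:none">176</div>.<span class="N6zc">234</span><span class="Cory">7</span>.<span style="display: inline">71</span><span class="wef6">63</span>.<span class="227">142</span></span>';

// Prints 11018013176.2347.7163.142 (decoys included), not the real IP.
echo textAfterStyle($cell), PHP_EOL;
```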
     
    EricBruggema, Jul 13, 2012 IP
  4. omgitsfletch

    omgitsfletch Well-Known Member

    #4
    It looks like it has some false IP addresses too. Hmm, can you show me where this code is being displayed, if possible? I may be able to help you with something that parses it properly.
     
    omgitsfletch, Sep 17, 2012 IP
  5. ratan1980

    ratan1980 Member

    #5
    1) <span[\s]class="N6zc">([0-9]+).*?<span[\s]class="N6zc">([0-9]+).*?="display:[\s]inline">([0-9]+).*?</span>.<span[\s]class=.*?>([0-9]+)
    2) ([0-9]+)</td>
    Above are the two regexes you can use: the first one extracts the IP address and the second one the port number. Then you can combine all the numbers. Note that the first one relies on the class name N6zc from this sample.
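    For illustration, here is a small sketch of how the two patterns above might be applied with preg_match. The "~" delimiter is chosen because the patterns contain "/". The function name extractProxy is hypothetical, and the sketch assumes the markup has exactly the layout of the sample, class names included.

```php
<?php
// The two patterns from the post, wrapped in "~" delimiters.
$ipPattern   = '~<span[\s]class="N6zc">([0-9]+).*?<span[\s]class="N6zc">([0-9]+).*?="display:[\s]inline">([0-9]+).*?</span>.<span[\s]class=.*?>([0-9]+)~';
$portPattern = '~([0-9]+)</td>~';

// Sketch only: extractProxy() is a hypothetical helper name.
function extractProxy(string $html, string $ipPattern, string $portPattern): ?string
{
    if (!preg_match($ipPattern, $html, $ip) || !preg_match($portPattern, $html, $port)) {
        return null; // the markup did not match the expected layout
    }
    // $ip[1] to $ip[4] are the four octets; $port[1] is the port number.
    return implode('.', array_slice($ip, 1, 4)) . ':' . $port[1];
}

// The sample from the first post, on a single line.
$html = '<td class="leftborder timestamp" rel="1342078386"><span class="updatets ">18 secs</span></td> <td><span><style>.wef6{display:none}.N6zc{display:inline}.ANrs{display:none}.qQPY{display:inline}.Cory{display:none}.Jgqn{display:inline}</style><span class="N6zc">110</span><span style="display:none">180</span><span class="ANrs">13</span><div style="display:none">176</div>.<span class="N6zc">234</span><span class="Cory">7</span>.<span style="display: inline">71</span><span class="wef6">63</span>.<span class="227">142</span></span></td> <td>8080</td>';

// Prints 110.234.71.142:8080 for this sample.
echo extractProxy($html, $ipPattern, $portPattern), PHP_EOL;
```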
     
    ratan1980, Sep 20, 2012 IP
  6. plussy

    plussy Peon

    #6
    OK, it looks like you are doing some scraping there.

    The problem I see (not sure if I am right) is that the CSS classes in the <style></style> block are generated at random.

    If you want a script that handles random classes, then look at this one.

    It might be a bit long, but I commented every step and made it simple to understand.

    
    <?php
    $code = '<td class="leftborder timestamp" rel="1342078386"><span class="updatets ">18 secs</span></td>         <td><span><style>.wef6{display:none}.N6zc{display:inline}.ANrs{display:none}.qQPY{display:inline}.Cory{display:none}.Jgqn{display:inline}</style><span class="N6zc">110</span><span style="display:none">180</span><span class="ANrs">13</span><div style="display:none">176</div>.<span class="N6zc">234</span><span class="Cory">7</span>.<span style="display: inline">71</span><span class="wef6">63</span>.<span class="227">142</span></span></td>             <td>8080</td>';
    
    // remove all line feeds    
    $code = str_replace("\n",'',$code);        
    
    // get the inline styles    
    preg_match_all('|<style>(.*?)</style>|',$code,$arr);        
    
    // get each style rule    
    $parts = explode('.',$arr[1][0]);        
    
    // ignore first one as empty    
    unset($parts[0]);        
    
    // delete style from $code    
    $code = str_replace($arr[0][0],'',$code);        
    
    // loop through all style rules    
    foreach ($parts as $part) {                        
    
        // get what display the rule is            
        preg_match('|\{(display:.*)\}|',$part,$style);                        
        
        // get the class name (everything before the opening brace)
        $class = strstr($part, '{', true);

        // change class to style attribute on span elements
        $code = str_replace('class="'.$class.'"','style="'.$style[1].'"',$code);
    }        
    
    // check if there are any classes left on span elements.
    preg_match_all('|span (class=".*?")|',$code,$arr);

    // $arr[1] holds the captured class attributes; $arr[0] (the full matches) is not needed.
    foreach ($arr[1] as $attr) {

        // change each leftover class to display:inline as there is no other rule defined for it.
        $code = str_replace($attr,'style="display:inline"',$code);
    }

    // get all inline spans that contain only digits (this skips text such as "18 secs")
    preg_match_all('|<span style="display:\s*inline">([0-9]+)</span>|',$code,$arr);
    
    // join them with a .    
    $ip = implode('.',$arr[1]);            
    
    // get port number (tolerate any amount of whitespace around the table cell)
    preg_match('|</td>\s*<td>\s*([0-9]+)\s*</td>|',$code,$arr);
    
    $port = $arr[1];        
    echo $ip.':'.$port;    
    
    ?>
    
    PHP:
    Hope this helps you.
     
    plussy, Sep 20, 2012 IP