PHP - Find difference in 2 strings and get path to difference

Discussion in 'PHP' started by deriklogov, Oct 30, 2017.

  1. #1
    I have 2 HTML files that use the same template; only some fields differ, and I need to get the full XPath to those differences using PHP.

    1st)
    <html><body><div class="price">12,400</div><div class="make">Acura</div>

    2nd)
    <html><body><div class="price">15,400</div><div class="make">Bmw</div>

    As you can see from the example, the template is the same but the price and the make differ, so the PHP script is supposed to output these XPath expressions:

    //div[@class='price']
    //div[@class='make']
    The script needs to find the differences between the 2 files and get the XPath to each one. The template is not known in advance and could be different every time.

    Any Help Appreciated!!!
     
    deriklogov, Oct 30, 2017 IP
  2. SoftLink

    #2
    I'm not sure exactly what you're trying to do.
    What do you mean by xpath? That's an xml term.

    Is the html generated from php?
    PHP can't directly read the html on a page.
    It is executed before the page is rendered and is usually used to write the html.

    It's easy to do in Javascript because Javascript can access the html (dom object).
    Can you tell us a bit more about what you're trying to do?
     
    SoftLink, Oct 30, 2017 IP
  3. deriklogov

    #3
    Those HTML files are not generated by PHP; they are static files.
    In any language, including PHP, you can reach any DOM node using XPath:
    load the HTML into $dom, create a DOMXPath, and then you can access any node with XPath queries.

    So the PHP script needs to find the dynamic parts that differ between those 2 HTML files (in the example above, the price value and the make of the vehicle), and then I need the XPath to that dynamic content.
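
For readers unfamiliar with the DOM extension, the load-then-query flow described above might look roughly like this. It is only a sketch: the sample markup is a hypothetical stand-in for the first file, and loadHTML() is used on the assumption that the real files are not well-formed XML.

```php
<?php
// Hypothetical sample markup, modelled on the first file in the thread.
$html = '<html><body><div class="price">12,400</div><div class="make">Acura</div></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect HTML parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Query by attribute, then read each match's text and its positional path.
$found = array();
foreach ($xpath->query("//div[@class='price']") as $node) {
    $found[] = array('path' => $node->getNodePath(), 'text' => $node->nodeValue);
}

print_r($found);
```

Note that getNodePath() reports a positional path for the matched node, not the attribute-based expression used in the query.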
     
    deriklogov, Oct 30, 2017 IP
  4. PoPSiCLe

    #4
    Not familiar with the functions (and too lazy to run after the docs right now), but if the content is read into the container-variable ($dom in this case) as an array (or you can make it do that), you could just read each file into separate arrays and do something like:
    
    function arrayDiff($A, $B) {
        $intersect = array_intersect($A, $B);
        return array_merge(array_diff($A, $intersect), array_diff($B, $intersect));
    }
    
    This will give you an array of the differences (the non-matching elements), which can then be parsed to get the Xpath. Might be too complex for what you need, but it's at least one way to go about it.
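
As a rough illustration of what arrayDiff() returns, here it is applied to two hypothetical lists of text values (one entry per text node; how those lists get extracted from the files is left open). One caveat worth keeping in mind: array_diff() and array_intersect() compare by value, not by position, so a value that merely moved, or that also occurs somewhere in the other file, would not be reported.

```php
<?php
// arrayDiff() as posted above: values that appear in only one of the two arrays.
function arrayDiff($A, $B) {
    $intersect = array_intersect($A, $B);
    return array_merge(array_diff($A, $intersect), array_diff($B, $intersect));
}

// Hypothetical text lists, one entry per text node of each file.
$first  = array('Price', '12,400', 'Make', 'Acura');
$second = array('Price', '15,400', 'Make', 'Bmw');

$changed = arrayDiff($first, $second);
print_r($changed);   // the values that are not shared: 12,400, Acura, 15,400, Bmw
```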
     
    PoPSiCLe, Oct 30, 2017 IP
  5. SoftLink

    #5
    Oh ok, static files & php xml.
    You'd have to find the differences first and get an array of each string that is different.
    Then you can query for those in xml.

    To get different phrases you have to extract all the text from the html, into an array.
    The only other way I can see to do it is compare individual words which means you lose the phrases.

    So, I've written some code to extract the 'phrases' from the html.
    I think the xpath query is correct but I can't get it to return the actual path.
    I've read twice that it can't be done.
    If you can do it please let me know how you did it.
    
    <?php
    $strConstant = file_get_contents("Test1.htm");
    $strVariable = file_get_contents("Test2.htm");
    
    $arXPaths = getXPaths($strVariable, getDiffArray($strConstant, $strVariable));
    foreach($arXPaths as $value) {
       echo $value . "<br/>";
    }
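    function getXPaths($strVariable, $arDiff) {
       $arXPaths = array();
       if(empty($arDiff) || !is_array($arDiff)) return $arXPaths; // nothing to look up

       $doc = new DOMDocument();
       // loadHTML() tolerates markup that is not well-formed XML; keep its warnings quiet
       libxml_use_internal_errors(true);
       $doc->loadHTML($strVariable);
       libxml_clear_errors();
       $xpathvar = new DOMXPath($doc);

       foreach($arDiff as $strDiff) {
         // Match text nodes in PHP rather than splicing $strDiff into the query,
         // so quotes inside the text cannot break the XPath expression
         foreach($xpathvar->query("//text()") as $textNode) {
           if(strpos($textNode->nodeValue, $strDiff) !== false) {
             // positional path such as /html/body/div[1], not the //div[@class='price'] form
             $arXPaths[] = $textNode->parentNode->getNodePath();
           }
         }
       }
       return $arXPaths;
    }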
    
    function getDiffArray($strConstant, $strVariable){
       $arDiff = array();
       $arConstant = getElemTextArray($strConstant);
       $arVariable = getElemTextArray($strVariable);  
       $diff = diff($arConstant, $arVariable);
       if(is_array($diff)) {
         foreach($diff as $k){
             if(is_array($k))
             {
               if(!empty($k['i'])) {
                 foreach($k['i'] as $key => $value) {          
                   $arDiff[] = $value;            
                 }
               }                
             }
         }
       }
       return $arDiff;
    }
    function diff($old, $new){
       /*
       (C) Paul Butler 2007 <http://www.paulbutler.org/>
       May be used and distributed under the zlib/libpng license.
       */
       $matrix = array();
       $maxlen = 0;
       foreach($old as $oindex => $ovalue){
         $nkeys = array_keys($new, $ovalue);
         foreach($nkeys as $nindex){
           $matrix[$oindex][$nindex] = isset($matrix[$oindex - 1][$nindex - 1]) ?
             $matrix[$oindex - 1][$nindex - 1] + 1 : 1;
           if($matrix[$oindex][$nindex] > $maxlen){
             $maxlen = $matrix[$oindex][$nindex];
             $omax = $oindex + 1 - $maxlen;
             $nmax = $nindex + 1 - $maxlen;
           }
         }
       }
       if($maxlen == 0) return array(array('d'=>$old, 'i'=>$new));
       return array_merge(
         diff(array_slice($old, 0, $omax), array_slice($new, 0, $nmax)),
         array_slice($new, $nmax, $maxlen),
         diff(array_slice($old, $omax + $maxlen), array_slice($new, $nmax + $maxlen)));
    }
    
    function getElemTextArray($html) {
       $arTexts = array();
       // Capture every run of text that sits between a '>' and the next '<'
       if (preg_match_all("/>([^<]+)/", $html, $arMatches)) {
         foreach($arMatches[1] as $strText) {
           $strText = trim($strText);
           if($strText === '') continue; // skip whitespace-only gaps between tags
           $arTexts[] = $strText;
         }
       }
       return $arTexts;
    }
    ?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Untitled Document</title>
    
    
    </head>
    
    <body>
    
    </body>
    </html>
    
    
    
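
On the question of getting something closer to //div[@class='price'] than a positional path: as far as I know the DOM extension has no built-in call for that, but an expression can be assembled from an element's attributes. The sketch below is one possible approach, not a complete solution. attributePath() is a hypothetical helper, the sample markup is made up, and the expression it builds may match more than one element if the id or class is not unique.

```php
<?php
// Hypothetical helper: prefer an id- or class-based expression,
// otherwise fall back to the positional path from getNodePath().
function attributePath(DOMElement $el) {
    foreach (array('id', 'class') as $attr) {
        $value = $el->getAttribute($attr);
        // Skip empty values and values with a single quote, which would break the literal below.
        if ($value !== '' && strpos($value, "'") === false) {
            return "//" . $el->nodeName . "[@" . $attr . "='" . $value . "']";
        }
    }
    return $el->getNodePath();
}

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep HTML parser warnings quiet
$dom->loadHTML('<html><body><div class="price">12,400</div><p>note</p></body></html>');
libxml_clear_errors();

// Build an expression for each element directly under <body>.
$paths = array();
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
    if ($child instanceof DOMElement) {
        $paths[] = attributePath($child);
    }
}
print_r($paths);
```

The div gets an attribute-based expression; the p has neither id nor class, so it keeps its positional path.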
     
    SoftLink, Oct 31, 2017 IP
  6. ThePHPMaster

    #6
    You can't, at least not without developing some sort of intelligent detection (pretty sure it is not worth it for you). Every time the HTML changes you would need to re-work it.
     
    ThePHPMaster, Oct 31, 2017 IP
  7. SoftLink

    #7
    You just need a constant to compare each file with.
    What I wrote expects the constant to have the same html as the variable.
    It looks for text that's different inside each element.
    If they're 2 completely different files the only way to do it is with a word by word comparison.
    In that case a diff would pretty much be meaningless anyway.

    It wouldn't matter if the template changes.
    You just need to update the constant so the html (not necessarily the text) for the constant & variable are the same.
     
    SoftLink, Oct 31, 2017 IP
  8. Einheijar

    #8
    @SoftLink: You're on the right track

    First off, compile a list of the XPaths that contain text in doc1, then compare them against doc2.
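
A minimal sketch of that idea, assuming both files share the same element structure so that positional paths line up. textByPath() is a hypothetical helper, and the two strings stand in for the files from the first post.

```php
<?php
// Hypothetical helper: map each text-bearing element's positional path to its trimmed text.
function textByPath($html) {
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // keep HTML parser warnings quiet
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $map = array();
    // Select only text nodes that contain something other than whitespace.
    foreach ($xpath->query('//text()[normalize-space(.) != ""]') as $textNode) {
        $path = $textNode->parentNode->getNodePath();
        $text = trim($textNode->nodeValue);
        // An element can hold several text nodes; join them under the same path.
        $map[$path] = isset($map[$path]) ? $map[$path] . ' ' . $text : $text;
    }
    return $map;
}

// Hypothetical inputs modelled on the two files in the first post.
$doc1 = '<html><body><div class="price">12,400</div><div class="make">Acura</div></body></html>';
$doc2 = '<html><body><div class="price">15,400</div><div class="make">Bmw</div></body></html>';

$map1 = textByPath($doc1);
$map2 = textByPath($doc2);

// Paths present in both documents whose text differs.
$changedPaths = array();
foreach ($map1 as $path => $text) {
    if (isset($map2[$path]) && $map2[$path] !== $text) {
        $changedPaths[] = $path;
    }
}
print_r($changedPaths);
```

Because the comparison is keyed on the path, it does not depend on knowing the template in advance, but it would miss elements that exist in only one of the two files.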
     
    Einheijar, Nov 2, 2017 IP
  9. SoftLink

    #9
    Thanks. Yea, that'd be another way to do it; maybe even quicker.
     
    SoftLink, Nov 2, 2017 IP