PHP - Find difference in 2 strings and get path to difference

deriklogov Well-Known Member

Messages:: 1,078

Likes Received:: 22

Best Answers:: 0

Trophy Points:: 130

#1

I have 2 HTML files that use the same template; only some fields differ, and I need to get the full XPath to those differences using PHP.

1st)
<html><body><div class="price">12,400</div><div class="make">Acura</div>

2nd)
<html><body><div class="price">15,400</div><div class="make">Bmw</div>

As you can see from the example, it's the same template, but the price and the make are different. So the PHP script is supposed to output these XPaths:

//div[@class='price']
//div[@class='make']

The script needs to find the differences between the 2 files and get the XPath to each difference; obviously template is unknown and every time could be different.

Any Help Appreciated!!!

deriklogov, Oct 30, 2017 IP

SoftLink Active Member

Messages:: 120

Likes Received:: 5

Best Answers:: 0

Trophy Points:: 60

#2

I'm not sure exactly what you're trying to do.
What do you mean by xpath? That's an xml term.

Is the html generated from php?
PHP can't directly read the html on a page.
It is executed before the page is rendered and is usually used to write the html.

It's easy to do in Javascript because Javascript can access the html (dom object).
Can you tell us a bit more about what you're trying to do?

SoftLink, Oct 30, 2017 IP

deriklogov Well-Known Member

#3

Those HTML files are not generated by PHP; they are static files.
In any language, including PHP, you can reach any DOM node using XPath:
load the HTML into $dom, create a DOMXPath, and then you can access any node with XPath queries.

So the PHP script needs to find the dynamic part between those 2 HTML files (in the example above, the dynamic parts are the price value and the make of the vehicle), and then I need to get the XPath to that dynamic content.
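
For example, a minimal sketch of that loading and querying step. The sample string is the first file from my post with closing tags added, and the variable names are just for illustration:

```php
<?php
// Sample markup from the first post, with closing tags added.
$html = '<html><body><div class="price">12,400</div><div class="make">Acura</div></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();

// Query the document with an XPath expression.
$xpath = new DOMXPath($dom);
$priceNodes = $xpath->query("//div[@class='price']");
$price = $priceNodes->item(0)->textContent;
$pricePath = $priceNodes->item(0)->getNodePath();

echo $price . "\n";     // 12,400
echo $pricePath . "\n"; // /html/body/div[1]
```

Going from an XPath to a node is the easy direction; what I need is the reverse, from the changed text back to a path.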

deriklogov, Oct 30, 2017 IP

PoPSiCLe Illustrious Member

Messages:: 4,623

Likes Received:: 725

Best Answers:: 152

Trophy Points:: 470

#4

Not familiar with the functions (and too lazy to run after the docs right now), but if the content is read into the container-variable ($dom in this case) as an array (or you can make it do that), you could just read each file into separate arrays and do something like:
function arrayDiff($A, $B) {
    $intersect = array_intersect($A, $B);
    return array_merge(array_diff($A, $intersect), array_diff($B, $intersect));
}
Code (markup):
This will give you an array of the differences (the non-matching elements), which can then be parsed to get the Xpath. Might be too complex for what you need, but it's at least one way to go about it.
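
A quick usage sketch, assuming the text of each page has already been pulled into a flat array (the two arrays below are made up from the sample in the first post):

```php
<?php
function arrayDiff($A, $B) {
    $intersect = array_intersect($A, $B);
    return array_merge(array_diff($A, $intersect), array_diff($B, $intersect));
}

// Hypothetical text lists; 'Price' stands in for text shared by both pages.
$first  = array('Price', '12,400', 'Acura');
$second = array('Price', '15,400', 'Bmw');

// Values that appear in only one of the two lists.
$changed = arrayDiff($first, $second);
print_r($changed); // 12,400, Acura, 15,400, Bmw
```

Note that this compares values only: positions are lost, and a value that appears in both lists (even in different places) is treated as unchanged.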

PoPSiCLe, Oct 30, 2017 IP

SoftLink Active Member

#5

Oh ok, static files & php xml.
You'd have to find the differences first and get an array of each string that is different.
Then you can query for those in xml.

To get different phrases you have to extract all the text from the html, into an array.
The only other way I can see to do it is compare individual words which means you lose the phrases.

So, I've written some code to extract the 'phrases' from the html.
I think the xpath query is correct but I can't get it to return the actual path.
I've read twice that it can't be done.
If you can do it please let me know how you did it.


<?php
$strConstant = file_get_contents("Test1.htm");
$strVariable = file_get_contents("Test2.htm");

$arXPaths = getXPaths($strVariable, getDiffArray($strConstant, $strVariable));
foreach($arXPaths as $value) {
   echo $value . "<br/>";
}
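function getXPaths($strVariable, $arDiff) {
   $arXPaths = array();
   // Return an empty array (not false) so the caller's foreach is always safe.
   if(empty($arDiff) || !is_array($arDiff)) return $arXPaths;

   $doc = new DOMDocument();
   // loadHTML() tolerates markup that is not well-formed XML; loadXML() fails on it.
   libxml_use_internal_errors(true);
   $doc->loadHTML($strVariable);
   libxml_clear_errors();
   $xpathvar = new DOMXPath($doc); // one instance is enough for all queries

   foreach($arDiff as $strDiff) {
     // Note: a $strDiff containing a single quote would break this XPath literal.
     $query = "//*[text()[contains(.,'" . $strDiff . "')]]";
     $queryResult = $xpathvar->query($query);
     if($queryResult === false) continue; // malformed expression
     foreach($queryResult as $node) {
       // Positional path such as /html/body/div[1], not a class-based one.
       $arXPaths[] = $node->getNodePath();
     }
   }
   return $arXPaths;
}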

function getDiffArray($strConstant, $strVariable){
   $arDiff = array();
   $arConstant = getElemTextArray($strConstant);
   $arVariable = getElemTextArray($strVariable);  
   $diff = diff($arConstant, $arVariable);
   if(is_array($diff)) {
     foreach($diff as $k){
         if(is_array($k))
         {
           if(!empty($k['i'])) {
             foreach($k['i'] as $key => $value) {          
               $arDiff[] = $value;            
             }
           }                
         }
     }
   }
   return $arDiff;
}
function diff($old, $new){
   /*
   (C) Paul Butler 2007 <http://www.paulbutler.org/>
   May be used and distributed under the zlib/libpng license.
   */
   $matrix = array();
   $maxlen = 0;
   foreach($old as $oindex => $ovalue){
     $nkeys = array_keys($new, $ovalue);
     foreach($nkeys as $nindex){
       $matrix[$oindex][$nindex] = isset($matrix[$oindex - 1][$nindex - 1]) ?
         $matrix[$oindex - 1][$nindex - 1] + 1 : 1;
       if($matrix[$oindex][$nindex] > $maxlen){
         $maxlen = $matrix[$oindex][$nindex];
         $omax = $oindex + 1 - $maxlen;
         $nmax = $nindex + 1 - $maxlen;
       }
     }
   }
   if($maxlen == 0) return array(array('d'=>$old, 'i'=>$new));
   return array_merge(
     diff(array_slice($old, 0, $omax), array_slice($new, 0, $nmax)),
     array_slice($new, $nmax, $maxlen),
     diff(array_slice($old, $omax + $maxlen), array_slice($new, $nmax + $maxlen)));
}

function getElemTextArray($html) {
   $arTexts = array();
   // Capture group 1 holds each run of text that follows a '>' up to the next '<'.
   $reg = "/(?<=>)\s*(?=<)|(?<=>)\n*([^<]+)/";
   if (preg_match_all($reg, $html, $arMatches)) {
     foreach($arMatches[1] as $value) {
       $value = trim($value);
       // Compare against '' instead of empty() so a text of "0" is kept.
       if($value === '') continue;
       $arTexts[] = $value;
     }
   }
   return $arTexts;
}
?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Untitled Document</title>


</head>

<body>

</body>
</html>

Code (markup):
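
getNodePath() only gives a positional path. If the class-based form from the first post is wanted, something like the following might work as a rough sketch. classBasedPath() is a made-up helper, not a DOM function, and it assumes the class attribute is enough to identify the element, which won't hold when several elements share a class:

```php
<?php
// Hypothetical helper: prefer //tag[@class='...'] and fall back to the positional path.
function classBasedPath(DOMElement $node) {
    $class = $node->getAttribute('class');
    // Skip class values containing a single quote; they would break the XPath literal.
    if ($class !== '' && strpos($class, "'") === false) {
        return '//' . $node->nodeName . "[@class='" . $class . "']";
    }
    return $node->getNodePath();
}

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML('<html><body><div class="make">Bmw</div><p>plain</p></body></html>');
libxml_clear_errors();

$div = $dom->getElementsByTagName('div')->item(0);
$p   = $dom->getElementsByTagName('p')->item(0);
$divPath = classBasedPath($div);
$pPath   = classBasedPath($p);   // no class attribute, so the positional path is used

echo $divPath . "\n"; // //div[@class='make']
echo $pPath . "\n";
```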

SoftLink, Oct 31, 2017 IP

ThePHPMaster Well-Known Member

Messages:: 737

Likes Received:: 52

Best Answers:: 33

Trophy Points:: 150

#6

deriklogov said: ↑

obviously template is unknown and every time could be different

You can't, at least not without developing some sort of intelligent detection (pretty sure it is not worth it for you). Every time the HTML changes you would need to re-work it.

ThePHPMaster, Oct 31, 2017 IP

SoftLink Active Member

#7

You just need a constant to compare each file with.
What I wrote expects the constant to have the same html as the variable.
It looks for text that's different inside each element.
If they're 2 completely different files the only way to do it is with a word by word comparison.
In that case a diff would pretty much be meaningless anyway.

It wouldn't matter if the template changes.
You just need to update the constant so the html (not necessarily the text) for the constant & variable are the same.

SoftLink, Oct 31, 2017 IP

Einheijar Well-Known Member

Messages:: 539

Likes Received:: 13

Best Answers:: 3

Trophy Points:: 165

#8

@SoftLink: You're on the right track

First off, compile a list of XPaths for the elements that contain text in doc1, then compare it to the same list from doc2.
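
Roughly like this, as a sketch only. textByPath() and changedPaths() are names I made up, and it assumes both files really do share the same structure, so the same node path exists in each:

```php
<?php
// Hypothetical helper: map each element's node path to its own direct text.
function textByPath($html) {
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $map = array();
    $xpath = new DOMXPath($dom);
    // Elements that have at least one non-whitespace text node as a direct child.
    foreach ($xpath->query('//*[text()[normalize-space()]]') as $node) {
        $text = '';
        foreach ($node->childNodes as $child) {
            if ($child->nodeType === XML_TEXT_NODE) {
                $text .= $child->nodeValue;
            }
        }
        $map[$node->getNodePath()] = trim($text);
    }
    return $map;
}

// Hypothetical helper: paths in the second document whose text is new or changed.
// Paths that exist only in the first document are not reported.
function changedPaths($htmlA, $htmlB) {
    $a = textByPath($htmlA);
    $b = textByPath($htmlB);
    $paths = array();
    foreach ($b as $path => $text) {
        if (!array_key_exists($path, $a) || $a[$path] !== $text) {
            $paths[] = $path;
        }
    }
    return $paths;
}

$doc1 = '<html><body><div class="price">12,400</div><div class="make">Acura</div></body></html>';
$doc2 = '<html><body><div class="price">15,400</div><div class="make">Bmw</div></body></html>';
$paths = changedPaths($doc1, $doc2);
print_r($paths); // /html/body/div[1] and /html/body/div[2]
```

Comparing by path instead of by text means identical strings in different places don't get mixed up, but it does depend on the template being the same in both files.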

Einheijar, Nov 2, 2017 IP

SoftLink Active Member

#9

Einheijar said: ↑

@SoftLink: You're on the right track
First off, compile a list of XPaths for the elements that contain text in doc1, then compare it to the same list from doc2.

Thanks. Yea, that'd be another way to do it; maybe even quicker.

SoftLink, Nov 2, 2017 IP
