Google's HTML is not fully standards compliant

Discussion in 'HTML & Website Design' started by Eager2Seo, Sep 22, 2010.

  1. #1
    This could be a PHP or HTML post....
    I'm in internet marketing now, and I was writing an SEO tool to pull info off Google pages. Many of their class and id attribute values do not have quotes, like <div id=header> for example. This was wreaking havoc with my scraping tools.

    I used this PHP to fix it:

    $body = preg_replace("/(class|id)=(\w+?)([> ])/",'$1="$2"$3',$body);
    Basically it fixes that problem.
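
    A slightly more defensive version of that replacement might look like the sketch below. The function name quote_class_and_id is made up for illustration, and the pattern is an assumption about what the scraped markup looks like: it also accepts hyphenated values such as top-nav and leaves already-quoted values alone, but it is still a regex and not a real HTML parser.

```php
<?php
// Hypothetical helper (not from the original post): wraps unquoted
// class/id values in double quotes. Sketch only; it does not understand
// comments, scripts, or text that merely looks like an attribute.
function quote_class_and_id(string $html): string
{
    // [\w-]+ accepts letters, digits, underscores and hyphens.
    // The lookahead requires whitespace, ">" or "/>" after the value,
    // so the character following the value is left untouched.
    $result = preg_replace('/\b(class|id)=([\w-]+)(?=[\s>]|\/>)/', '$1="$2"', $html);

    // preg_replace returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

$body = '<div id=header class=top-nav><p class="ok">hi</p></div>';
echo quote_class_and_id($body), "\n";
```

    Note that preg_replace returns the modified string instead of changing its argument, so the result has to be assigned or returned.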

    Maybe it is to discourage scraping?
     
    Eager2Seo, Sep 22, 2010
  2. drhowarddrfine

    #2
    What gives you the impression that attribute values need to be quoted? While this is an SGML requirement, it's not required in HTML.

    Also, in the past, people complained that Google didn't use a doctype, without realizing that Google served its pages with proper HTTP Content-Type headers, so the doctype wasn't needed.

    From the W3C docs:
    I presume Google is relying on those "certain cases". A test case with a div that has an unquoted id does not generate any error in the validator.
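
    As a rough illustration, the check below encodes the rule as I understand it from HTML 4.01 (section 3.2.2): a value may go unquoted only if it contains nothing but letters, digits, hyphens, periods, underscores, and colons. The function name may_be_unquoted is hypothetical, and this is a sketch of that one rule, not a validator.

```php
<?php
// Hypothetical helper: true when a value could be written without quotes
// under the HTML 4.01 rule (letters, digits, "-", ".", "_", ":" only).
function may_be_unquoted(string $value): bool
{
    // The D modifier stops "$" from matching before a trailing newline.
    return preg_match('/^[A-Za-z0-9._:-]+$/D', $value) === 1;
}

var_dump(may_be_unquoted('header'));    // bool(true)
var_dump(may_be_unquoted('main nav'));  // bool(false): contains a space
```

    So id=header is fine under that rule, while a value containing a space or a slash would still need quotes.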
     
    Last edited: Sep 22, 2010
    drhowarddrfine, Sep 22, 2010