Avoid inserting duplicate record

Discussion in 'MySQL' started by JamieC, Oct 18, 2005.

  1. #1
    Hi All,

    We have recently purchased an XML feed of news articles from a news organisation.

    I'm trying to write a script to copy this feed into a mySQL database. The problem I have is that their XML feed updates in a "queue" style system where as a new article is added, the last article is removed.

    I'm trying to write an insert statement like this:

    INSERT INTO articles ( OriginalID , SourceID , Created , Headline , Article , ArticleSummary , PictureURL , PictureURLSmall ) VALUES ( foo , bar ... );
    Code (markup):
    ...that will avoid adding duplicate stories to my database. The field that I can used to identify duplicates is 'OriginalID'. Ideally I would like to do this in a single statement - in fact, i'm sure i've done it before but I can't find the code!

    My primary key is an auto increment on the database, rather than using their article id, which is why I can't rely on mySQL to handle this - this is a precaution in case I add additional news sources at a later stage.

    Maybe something along the lines of 'IF NOT EXISTS'?

    Any suggestions?

    Cheers!

    - Jamie
     
    JamieC, Oct 18, 2005 IP
  2. jbw

    jbw Peon

    Messages:
    343
    Likes Received:
    12
    Best Answers:
    0
    Trophy Points:
    0
    #2
    create a unique key on the field, or do a select first to see if it exists inside a transaction that is followed by the insert if needed, if you are using something that support it (not myisam).

    Another option is to make the primary key a composite key that includes both your unique id and the data sources.
     
    jbw, Oct 18, 2005 IP
  3. JamieC

    JamieC Well-Known Member

    Messages:
    226
    Likes Received:
    5
    Best Answers:
    0
    Trophy Points:
    138
    #3
    This is my ideal solution, however I am having trouble composing a conditional statement for mySQL. Any clues as to the syntax?

    Cheers,

    Jamie
     
    JamieC, Oct 18, 2005 IP
  4. jbw

    jbw Peon

    Messages:
    343
    Likes Received:
    12
    Best Answers:
    0
    Trophy Points:
    0
    #4
    actually I think the key ideas are more the idea solution really.

    but to do it htis way you would do

    mysql_query(
    'begin work;
    select count(*) from blah where blah');

    now back in your code you check the value of count

    if ( count < 1) mysql_query('insert blah blah');
    mysql_query('commit;');


    btw, postgreql is a better db for doing things like this.
     
    jbw, Oct 18, 2005 IP
  5. JamieC

    JamieC Well-Known Member

    Messages:
    226
    Likes Received:
    5
    Best Answers:
    0
    Trophy Points:
    138
    #5
    Hi jbw,

    Thanks for your help through the day. In the end I went, as you suggested, for the unique key option. Just not sure if I like having to deliberately create "exceptions" - but it seems to be the most efficient method.

    Just to clarify, I am actually working with mySQL and VBScript - weird combo I know!
     
    JamieC, Oct 18, 2005 IP
  6. J.D.

    J.D. Peon

    Messages:
    1,198
    Likes Received:
    65
    Best Answers:
    0
    Trophy Points:
    0
    #6
    You can go with an additional lookup, as jbw suggests, or you can just look up some data from the last feed item you processed (e.g. look up description and URL with the highest autoincremented record ID) and then skip all new feed entries until you get the item you found in your database. Then you can process all feed entries following this one.

    J.D.
     
    J.D., Oct 23, 2005 IP
  7. JamieC

    JamieC Well-Known Member

    Messages:
    226
    Likes Received:
    5
    Best Answers:
    0
    Trophy Points:
    138
    #7
    Hi all,

    Bit of a follow up to this query, which is why I have included it in the same thread. I have now discovered that our supplier makes modifications to existing articles in their XML feed. This means that I need to update rows that have new content.

    My current insert statement is:

    
    INSERT INTO articles ( OriginalID , SourceID , Created , Headline , Article , ArticleSummary , PictureURL , PictureURLSmall ) VALUES( " & ItemID & " , 1 , '" & Created & "' , '" & EscapeCharacters( Headline ) & "' , '" & EscapeCharacters( Article ) & "' , '" & EscapeCharacters( ArticleSummary ) & "' , '" & EscapeCharacters( PictureURL ) & "' , '" & EscapeCharacters( PictureURLSmall ) & "' )"
    
    Code (markup):
    The table also includes an autonumbered ArticleID. I have set OriginalID to be unique, so when an existing article is encountered currently the INSERT does not take place.

    I now want to UPDATE existing records with the content in the XML feed (which may or may not have changed).

    I have been investigating the ON DUPLICATE KEY UPDATE feature, but the documentation is slightly vague.

    Does anyone have any suggestions as to how I can achieve this?

    Thanks in advance,

    Jamie
     
    JamieC, Oct 28, 2005 IP
  8. clickbuild

    clickbuild Member

    Messages:
    89
    Likes Received:
    4
    Best Answers:
    0
    Trophy Points:
    48
    #8
    If the feed you are getting contains anything unique such as an ID or title (which shouldn't change) then use this concept to save time

    Setup the table with a unique key (ID or TITLE)

    // this will only insert unique result - using the insert failure as a filter
    RESULT = QUERY: insert this row into the datbase including ID/TITLE

    // if you want to update the table do this - using the failure above as an
    // indicator that the record exists
    if the RESULT fails
    UPDATE this row based on ID

    Once you get this running properly, just suppress any error messages and away you go.

    Hope that makes sense - I wanted to put a non-code example.
     
    clickbuild, Oct 29, 2005 IP