View Full Version : Avoid inserting duplicate record
JamieC
Oct 18th 2005, 3:44 am
Hi All,
We have recently purchased an XML feed of news articles from a news organisation.
I'm trying to write a script to copy this feed into a mySQL database. The problem I have is that their XML feed updates in a "queue" style system where as a new article is added, the last article is removed.
I'm trying to write an insert statement like this:
INSERT INTO articles ( OriginalID , SourceID , Created , Headline , Article , ArticleSummary , PictureURL , PictureURLSmall ) VALUES ( foo , bar ... );
...that will avoid adding duplicate stories to my database. The field that I can used to identify duplicates is 'OriginalID'. Ideally I would like to do this in a single statement - in fact, i'm sure i've done it before but I can't find the code!
My primary key is an auto increment on the database, rather than using their article id, which is why I can't rely on mySQL to handle this - this is a precaution in case I add additional news sources at a later stage.
Maybe something along the lines of 'IF NOT EXISTS'?
Any suggestions?
Cheers!
- Jamie
jbw
Oct 18th 2005, 3:47 am
create a unique key on the field, or do a select first to see if it exists inside a transaction that is followed by the insert if needed, if you are using something that support it (not myisam).
Another option is to make the primary key a composite key that includes both your unique id and the data sources.
JamieC
Oct 18th 2005, 3:59 am
do a select first to see if it exists inside a transaction that is followed by the insert if needed
This is my ideal solution, however I am having trouble composing a conditional statement for mySQL. Any clues as to the syntax?
Cheers,
Jamie
jbw
Oct 18th 2005, 6:24 am
actually I think the key ideas are more the idea solution really.
but to do it htis way you would do
mysql_query(
'begin work;
select count(*) from blah where blah');
now back in your code you check the value of count
if ( count < 1) mysql_query('insert blah blah');
mysql_query('commit;');
btw, postgreql is a better db for doing things like this.
JamieC
Oct 18th 2005, 9:42 am
Hi jbw,
Thanks for your help through the day. In the end I went, as you suggested, for the unique key option. Just not sure if I like having to deliberately create "exceptions" - but it seems to be the most efficient method.
Just to clarify, I am actually working with mySQL and VBScript - weird combo I know!
J.D.
Oct 23rd 2005, 9:44 am
You can go with an additional lookup, as jbw suggests, or you can just look up some data from the last feed item you processed (e.g. look up description and URL with the highest autoincremented record ID) and then skip all new feed entries until you get the item you found in your database. Then you can process all feed entries following this one.
J.D.
JamieC
Oct 28th 2005, 6:59 am
Hi all,
Bit of a follow up to this query, which is why I have included it in the same thread. I have now discovered that our supplier makes modifications to existing articles in their XML feed. This means that I need to update rows that have new content.
My current insert statement is:
INSERT INTO articles ( OriginalID , SourceID , Created , Headline , Article , ArticleSummary , PictureURL , PictureURLSmall ) VALUES( " & ItemID & " , 1 , '" & Created & "' , '" & EscapeCharacters( Headline ) & "' , '" & EscapeCharacters( Article ) & "' , '" & EscapeCharacters( ArticleSummary ) & "' , '" & EscapeCharacters( PictureURL ) & "' , '" & EscapeCharacters( PictureURLSmall ) & "' )"
The table also includes an autonumbered ArticleID. I have set OriginalID to be unique, so when an existing article is encountered currently the INSERT does not take place.
I now want to UPDATE existing records with the content in the XML feed (which may or may not have changed).
I have been investigating the ON DUPLICATE KEY UPDATE feature, but the documentation is slightly vague.
Does anyone have any suggestions as to how I can achieve this?
Thanks in advance,
Jamie
clickbuild
Oct 29th 2005, 12:07 pm
If the feed you are getting contains anything unique such as an ID or title (which shouldn't change) then use this concept to save time
Setup the table with a unique key (ID or TITLE)
// this will only insert unique result - using the insert failure as a filter
RESULT = QUERY: insert this row into the datbase including ID/TITLE
// if you want to update the table do this - using the failure above as an
// indicator that the record exists
if the RESULT fails
UPDATE this row based on ID
Once you get this running properly, just suppress any error messages and away you go.
Hope that makes sense - I wanted to put a non-code example.
vBulletin® v3.6.8, Copyright ©2000-2008, Jelsoft Enterprises Ltd.