Large CSV Import

Discussion in 'PHP' started by meloncreative, Nov 25, 2010.

  1. #1
    Hi all

    So... I am creating an import script for putting contacts into a database. The script we had worked OK for 500 KB / 20k-row CSV files, but anything much bigger than that started to run into the max execution time limit. Rather than raise the limit, I want to create something that will run in the background and work as efficiently as possible.

    So basically the CSV file is uploaded, then you choose whether duplicates should be ignored or overwritten, and you match up the fields in the CSV (using the first line as a field title row) to the fields in the database.

    The email address field is singled out, as it is checked against duplicates that already exist in the system.

    It then saves these values, along with the filename, into an import queue table, which is processed by a CRON job. Each run of the CRON job looks in the queue, finds the first incomplete import, and starts work on that file from where it left off. When the batch is complete it updates the row with a pointer into the file for the next batch, along with how many contacts were imported and how many duplicates there were.
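    A minimal sketch of the resume-from-pointer idea, storing a byte offset between CRON runs. The function and parameter names here are my own, not from the actual script:

```php
<?php
// Process up to $batchSize CSV rows starting at byte $offset,
// returning the new offset to store back in the queue table for
// the next CRON run. Names are illustrative.
function processBatch(string $file, int $offset, int $batchSize, callable $handleRow): int
{
    $fh = fopen($file, 'r');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $file");
    }
    fseek($fh, $offset);          // resume exactly where the last batch stopped
    $done = 0;
    while ($done < $batchSize && ($row = fgetcsv($fh)) !== false) {
        $handleRow($row);         // import / duplicate-check one contact
        $done++;
    }
    $newOffset = ftell($fh);      // pointer to save in the queue row
    fclose($fh);
    return $newOffset;
}
```

Storing a byte offset rather than a line count means each batch seeks straight to its start instead of re-reading everything before it.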

    So far so good, but checking for duplicates massively slows down the script. I can run 1,000 lines of the file in 0.04 seconds without checking, but with checking that increases to 14-15 seconds, and it gets longer the more contacts there are in the DB.

    For every line it tries to import, it's doing a SELECT query on the contact table, and although I am not doing SELECT *, it still adds up to a lot of DB activity. One thought was to load every email address in the contacts table into an array beforehand, but this table could be massive, so that's likely to be just as inefficient.
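    A middle ground between one-SELECT-per-row and loading the whole table: check a batch's worth of emails in a single query, then test each row against the resulting in-memory set. A sketch, assuming mysqli on PHP 8.1+ (table and column names are illustrative):

```php
<?php
// Build a "?,?,?" placeholder list for an IN () clause.
function inPlaceholders(int $n): string
{
    return implode(',', array_fill(0, $n, '?'));
}

// Return the subset of $emails already present for this site,
// flipped into a set so callers can use isset() for O(1) lookups.
function findExisting(mysqli $db, int $siteId, array $emails): array
{
    if ($emails === []) {
        return [];
    }
    $sql = "SELECT email FROM contacts WHERE site_id = ? AND email IN ("
         . inPlaceholders(count($emails)) . ")";
    $stmt = $db->prepare($sql);
    $stmt->execute([$siteId, ...$emails]); // execute() takes params in PHP 8.1+
    $rows = $stmt->get_result()->fetch_all(MYSQLI_ASSOC);
    return array_flip(array_column($rows, 'email'));
}
```

One query per 1,000-row batch instead of 1,000 queries cuts the round-trip overhead, and the lookup set only ever holds one batch's emails, not the whole table.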

    Any ideas on optimising this process?
     
    meloncreative, Nov 25, 2010 IP
  2. meloncreative

    meloncreative Peon

    Messages:
    37
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #2
    also worth noting: my email field isn't a primary key. This is because the contact table is used by multiple sites, and each contact is assigned a site_id. Duplicates need to be calculated within a single site_id.
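    That duplicate rule can still be expressed in the schema as a composite unique index (the table and index names here are just illustrative):

```sql
-- The same email may appear under different site_ids,
-- but only once per site.
ALTER TABLE contacts ADD UNIQUE KEY uniq_site_email (site_id, email);
```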
     
    meloncreative, Nov 25, 2010 IP
  3. ThePHPMaster

    ThePHPMaster Well-Known Member

    Messages:
    737
    Likes Received:
    52
    Best Answers:
    33
    Trophy Points:
    150
    #3
    ThePHPMaster, Nov 25, 2010 IP
  4. meloncreative

    meloncreative Peon

    Messages:
    37
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #4
    No, as this would waste IDs. However, I am now using ON DUPLICATE KEY UPDATE, with a composite unique key to ensure the site_id and email fields don't clash.
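    For anyone following along, a sketch of what that looks like (column names are illustrative, not from the actual schema):

```sql
-- With a UNIQUE KEY on (site_id, email), a single query per row
-- (or per multi-row batch) handles both the insert and the
-- overwrite case, with no SELECT needed at all.
INSERT INTO contacts (site_id, email, name)
VALUES (1, 'a@example.com', 'Alice')
ON DUPLICATE KEY UPDATE name = VALUES(name);
```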

    nearly there :)
     
    meloncreative, Nov 26, 2010 IP
  5. olddocks

    olddocks Notable Member

    Messages:
    3,275
    Likes Received:
    165
    Best Answers:
    0
    Trophy Points:
    215
    #5
    First you will need to import all the data from the CSV into MySQL. To accomplish this, follow these steps:

    1. Import the CSV data into an Excel sheet and create a blank column in front for the auto-increment id.
    2. Create the MySQL table fields, with id set to auto-increment.
    3. Save the Excel sheet back to comma-separated CSV. Make sure each line starts with a comma (,) so that the blank first field maps to the id column and the auto-increment values are created automatically on import.
    4. Once you have the data in MySQL you can do whatever you want with it.

    That's how I used to do it, and I thought I could share my experience!
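    For what it's worth, the Excel round-trip can usually be skipped: MySQL's LOAD DATA INFILE loads a CSV directly, and an AUTO_INCREMENT id fills itself in when the column list omits it (table and column names are illustrative):

```sql
-- Load the CSV straight into MySQL; id is left out of the column
-- list, so AUTO_INCREMENT assigns it automatically.
LOAD DATA LOCAL INFILE 'contacts.csv'
INTO TABLE contacts
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(email, name);
```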
     
    olddocks, Nov 26, 2010 IP
  6. meloncreative

    meloncreative Peon

    Messages:
    37
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #6
    Didn't really read the thread, did you? ;) Thanks anyway, I guess, but that's not what I was after.
     
    meloncreative, Nov 26, 2010 IP
  7. wmtips

    wmtips Well-Known Member

    Messages:
    601
    Likes Received:
    70
    Best Answers:
    1
    Trophy Points:
    150
    #7
    Make sure you have an index on the fields in the WHERE clause of your SELECT.
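    For the duplicate check described above, that would be a composite index covering both filtered columns (names illustrative):

```sql
-- Lets MySQL resolve
--   SELECT ... WHERE site_id = ? AND email = ?
-- from the index alone instead of scanning the table.
ALTER TABLE contacts ADD INDEX idx_site_email (site_id, email);
```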
     
    wmtips, Nov 26, 2010 IP