Results 1 to 16 of 16
  1. #1
    Join Date
    Feb 2004
    Posts
    385

    HIGH Volume MSSQL List Compare

    Hello folks,

    I have a MSSQL database with about 90,000+ email addresses.

    I have 35,000 email addresses which I have on a text file.

    If I run a query for me to do an export on the database to pull up all 90,000+ email addresses - would it have the potential to crash the server? What would be the safest way to export all the email addresses?

    My goal is to basically be able to compare both lists and make sure there are no matches with either lists, and if there are then it would be removed.

    I want to go about with this as safe as possible. Please post your suggestions here. Appreciate it.
    “Intelligence is not to make no mistakes, but quickly to see how to make them good.” - Bertolt Brecht

  2. #2
    Join Date
    May 2009
    Posts
    766
    Quote Originally Posted by reddem0n View Post
    Hello folks,

    I have a MSSQL database with about 90,000+ email addresses.

    I have 35,000 email addresses which I have on a text file.

    If I run a query for me to do an export on the database to pull up all 90,000+ email addresses - would it have the potential to crash the server?
    Yes. This is MS after all (I know...let the flames commence)

    Quote Originally Posted by reddem0n View Post
    What would be the safest way to export all the email addresses?

    My goal is to basically be able to compare both lists and make sure there are no matches with either lists, and if there are then it would be removed.
    This is totally unclear. If you have a matching email address in both lists, you want to remove it from both? Are you simply trying to merge the data? Need specifics here, or we don't even know what query you're trying to run. Give us an example with sample data in two lists and what the comparison process should look like.

  3. #3
    Join Date
    Feb 2004
    Posts
    385
    Sorry, I meant to say that I want to remove any matching email addresses from list 2 (35k emails) found on list 1 (90k emails).
    “Intelligence is not to make no mistakes, but quickly to see how to make them good.” - Bertolt Brecht

  4. #4
    Join Date
    Nov 2001
    Location
    Vancouver
    Posts
    2,422
    I'd suck the email addresses in to a table and then you can create a view / union and export that:
    Code:
    mysql> select * from foo;
    +------+------+
    | id   | name |
    +------+------+
    | NULL | Joe  |
    | NULL | Hary |
    +------+------+
    2 rows in set (0.00 sec)
    
    mysql> select * from bar;
    +------+------+
    | id   | name |
    +------+------+
    | NULL | Joe  |
    | NULL | Hary |
    | NULL | Mike |
    +------+------+
    3 rows in set (0.00 sec)
    
    mysql> (select f.name from foo f) union (select b.name from bar b);
    +------+
    | name |
    +------+
    | Joe  |
    | Hary |
    | Mike |
    +------+
    3 rows in set (0.00 sec)
    The principle is the same in any ANSI SQL db.

    90 - 120K records is *nothing*.
    “Even those who arrange and design shrubberies are under
    considerable economic stress at this period in history.”

  5. #5
    Join Date
    Nov 2001
    Location
    Vancouver
    Posts
    2,422
    Alternatively you could write only the email column from your db, and take your existing other text data file, and run the following Python script against them, creating a sorted output file called "filtered.txt":

    PHP Code:
    #!/usr/bin/env python

    datafiles = ['emails1.txt''emails2.txt']
    output 'filtered.txt'

    emails set()

    for 
    datafile in datafiles:
        for 
    line in open(datafile'r'):
            
    email line.strip().lower()
            if 
    email:
                
    emails.add(email)

    fout open(output'w')
    for 
    email in sorted(emails):
        
    fout.write('%s\n' email
    “Even those who arrange and design shrubberies are under
    considerable economic stress at this period in history.”

  6. #6
    Join Date
    May 2009
    Posts
    766
    I don't know what you'd do on MS equipment as far as using a script to do the processing, but say your table is "a" and you load the data from the text file into a table called "b". You can get the dataset you're looking for with the following query:

    PHP Code:
    SELECT FROM b WHERE email NOT IN (SELECT email FROM a); 

  7. #7
    Join Date
    Nov 2001
    Location
    Vancouver
    Posts
    2,422
    In my two examples I assumed the OP wanted a "union" or combined list with no duplicates (perhaps because I've had to do that one too many times!).

    mattle's example shows the differences (one not in the other) between the two sets.

    Whatever the OP wants, it's covered now.
    “Even those who arrange and design shrubberies are under
    considerable economic stress at this period in history.”

  8. #8
    presuming you want to delete from table B if already in table A

    Code:
    delete b
    where b.email in select a.email from a where a.email=b.email
    That is a subquery, you can also do it with a left outer join which is the preferred way to do it in TSQL.

    Code:
    delete b 
    from b
    left join a on a.email=b.email
    where b.email is null
    Both examples are drawn from the examples in the MS SQL Books Online. It should be on your CD.
    edgedirector.com
    managed dns global failover and load balance (gslb)
    exactstate.com
    uptime report for webhostingtalk.com

  9. #9

    *

    Quote Originally Posted by mattle View Post
    Yes. This is MS after all (I know...let the flames commence)

    No, if an indiot gets hold of sa on MSSQL, in might run slower. Or, he himself may not know how to get something done. But, it won't crash. Certainly not on something as puny as 90k records.
    edgedirector.com
    managed dns global failover and load balance (gslb)
    exactstate.com
    uptime report for webhostingtalk.com

  10. #10
    sorry, wrong direction, should be:

    Code:
    delete b 
    from b,a
    where b.email = a.email
    edgedirector.com
    managed dns global failover and load balance (gslb)
    exactstate.com
    uptime report for webhostingtalk.com

  11. #11
    Join Date
    May 2009
    Posts
    766
    Quote Originally Posted by mattle
    Are you simply trying to merge the data? Need specifics here, or we don't even know what query you're trying to run.
    Quote Originally Posted by reddem0n View Post
    Sorry, I meant to say that I want to remove any matching email addresses from list 2 (35k emails) found on list 1 (90k emails).
    Quote Originally Posted by mwatkins
    In my two examples I assumed the OP wanted a "union" or combined list with no duplicates (perhaps because I've had to do that one too many times!).
    ...still better than getting your specs from accountants...

  12. #12
    Join Date
    May 2009
    Posts
    766
    Quote Originally Posted by plumsauce View Post
    No, if an indiot gets hold of sa on MSSQL, in might run slower. Or, he himself may not know how to get something done. But, it won't crash. Certainly not on something as puny as 90k records.
    PHP Code:
    if ($self->isReasonable())
    {
      exit;
    }
    else
    {
      
    wetGroundWithFlammableAccelerant();
      switch (
    $self->mood)
      {
        case 
    "righteous":
          
    makeLinuxCase("With the real DBAs all on Linux, only idiots are left");
          break;
        case 
    "snarky":
          
    mockSpelling("indiot");
          ++
    $irony;
          break;
        default:
          
    callNames();
          break;
      }


  13. #13
    Join Date
    Nov 2001
    Location
    Vancouver
    Posts
    2,422
    Quote Originally Posted by mattle View Post
    ...still better than getting your specs from accountants...
    oooh, they are the worst!

    Still, I missed the relevant clarification -- good spotting.
    “Even those who arrange and design shrubberies are under
    considerable economic stress at this period in history.”

  14. #14
    Join Date
    Mar 2009
    Posts
    2,222
    Quote Originally Posted by reddem0n View Post
    Hello folks,

    I have a MSSQL database with about 90,000+ email addresses.

    I have 35,000 email addresses which I have on a text file.

    If I run a query for me to do an export on the database to pull up all 90,000+ email addresses - would it have the potential to crash the server? What would be the safest way to export all the email addresses?

    My goal is to basically be able to compare both lists and make sure there are no matches with either lists, and if there are then it would be removed.

    I want to go about with this as safe as possible. Please post your suggestions here. Appreciate it.
    90,000 rows is pretty small.

    Here are some suggestions in order of increasing programming skills required, though all are easy.

    If it were MySQL, I would say:

    Solution A: minimum programming skills

    1) Create an index for the database table email column, using ther MEMORY engine.

    2) Read the text file a line at a time. For each line, extract the email address and issue an SQL querty to see if it exists in the database table. This query will simply access the in-memory index, so the overall process should run about as fast as you can read the text file.

    3) If the SQL query results in one or more rows that match the email address in the text file, do whatever you were going to do.

    If for some reason you cannot create an index in memory, here is an alternative solution; it requires slightly more programming.

    Solution B: little programming skill required

    1) Create an associate array in your program, indexed on email_address, with default value 0.

    2) Read the entire database table row by row; for each row, extract the email address, and set its corresponding value in the associative array to 1

    3) Read the text file line by line; for each line, extract the email address, and check the corresponding value in the associative array. If the value is "1", there is a match.

    And if the programming language does not support associative arrays, then

    Solution C: some programming skill required to ensure correct working.

    Use a standard array containing the email addresses; when you extrract the email addresses from the database, use ORDER BY to get them in sorted order, add them to the array in that order, and use a binary search to check if an email address from the text file is in the array.

    The binary search will require execution of about 100 statements per item in the text file, which should take less than 0.1 milliseconds of CPU time, oer elss than 4 seconds for the complete job.

  15. #15
    Join Date
    Nov 2001
    Location
    Vancouver
    Posts
    2,422
    PHP Code:
    # given the clarification of the spec, using text files, in Python:
    def get_emails(filename):
        return [
    line.strip().lower() for line in open(filename) if line.strip()]

    emails = [for e in get_emails('emails2.txt') if
              
    e not in get_emails('emails1.txt')] 
    Writing out the file:

    PHP Code:
    with open('result.txt''w') as fout:
        for 
    email in emails:
            
    fout.write('%s\n' email
    I want to go about with this as safe as possible.
    To fulfil your requirement be sure you do not accidentally delete rows in your existing 90K table. If concerned about this make a copy of both tables:

    create table email_a as select * from orig_email_tablename;
    # import the other data
    create table email_b as select * from imported_email_tablename;

    ... and operate on those.

    In either case to use SQL for the work you've got to be able to import lines from a text file into your MSSQL db or to use Python or some other language you'll need to export rows from MSSQL to a text file. As we've said here, 90K rows of email addresses is trivial. No matter what approach you use, this can be done in memory these days without thinking about it.
    “Even those who arrange and design shrubberies are under
    considerable economic stress at this period in history.”

  16. #16
    Join Date
    Feb 2004
    Posts
    385
    thanks for all your help guys.
    “Intelligence is not to make no mistakes, but quickly to see how to make them good.” - Bertolt Brecht

Similar Threads

  1. Replies: 6
    Last Post: 09-27-2006, 11:16 PM
  2. [NL] Colocation High Volume
    By TwiLightINC in forum Dedicated Hosting Offers
    Replies: 0
    Last Post: 04-04-2006, 04:57 PM
  3. Windows VPS compare list?
    By skyaus in forum VPS Hosting
    Replies: 33
    Last Post: 06-20-2005, 07:04 AM
  4. perl or shell list compare/extraction
    By Slidey in forum Programming Discussion
    Replies: 3
    Last Post: 03-08-2005, 09:54 AM
  5. Need high volume solution for my biz
    By pepper9099 in forum Ecommerce Hosting & Discussion
    Replies: 9
    Last Post: 01-21-2005, 08:54 AM

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •