Results 1 to 16 of 16
Thread: HIGH Volume MSSQL List Compare
-
10-21-2009, 12:48 PM #1Aspiring Evangelist
- Join Date
- Feb 2004
- Posts
- 385
HIGH Volume MSSQL List Compare
Hello folks,
I have a MSSQL database with about 90,000+ email addresses.
I have 35,000 email addresses which I have on a text file.
If I run a query for me to do an export on the database to pull up all 90,000+ email addresses - would it have the potential to crash the server? What would be the safest way to export all the email addresses?
My goal is to basically be able to compare both lists and make sure there are no matches with either lists, and if there are then it would be removed.
I want to go about with this as safe as possible. Please post your suggestions here. Appreciate it.“Intelligence is not to make no mistakes, but quickly to see how to make them good.” - Bertolt Brecht
-
10-21-2009, 01:05 PM #2Web Hosting Master
- Join Date
- May 2009
- Posts
- 766
Yes. This is MS after all (I know...let the flames commence)
This is totally unclear. If you have a matching email address in both lists, you want to remove it from both? Are you simply trying to merge the data? Need specifics here, or we don't even know what query you're trying to run. Give us an example with sample data in two lists and what the comparison process should look like.
-
10-21-2009, 01:23 PM #3Aspiring Evangelist
- Join Date
- Feb 2004
- Posts
- 385
Sorry, I meant to say that I want to remove any matching email addresses from list 2 (35k emails) found on list 1 (90k emails).
“Intelligence is not to make no mistakes, but quickly to see how to make them good.” - Bertolt Brecht
-
10-21-2009, 02:44 PM #4Web Hosting Master
- Join Date
- Nov 2001
- Location
- Vancouver
- Posts
- 2,422
I'd suck the email addresses in to a table and then you can create a view / union and export that:
Code:mysql> select * from foo; +------+------+ | id | name | +------+------+ | NULL | Joe | | NULL | Hary | +------+------+ 2 rows in set (0.00 sec) mysql> select * from bar; +------+------+ | id | name | +------+------+ | NULL | Joe | | NULL | Hary | | NULL | Mike | +------+------+ 3 rows in set (0.00 sec) mysql> (select f.name from foo f) union (select b.name from bar b); +------+ | name | +------+ | Joe | | Hary | | Mike | +------+ 3 rows in set (0.00 sec)
90 - 120K records is *nothing*.“Even those who arrange and design shrubberies are under
considerable economic stress at this period in history.”
-
10-21-2009, 03:04 PM #5Web Hosting Master
- Join Date
- Nov 2001
- Location
- Vancouver
- Posts
- 2,422
Alternatively you could write only the email column from your db, and take your existing other text data file, and run the following Python script against them, creating a sorted output file called "filtered.txt":
PHP Code:#!/usr/bin/env python
datafiles = ['emails1.txt', 'emails2.txt']
output = 'filtered.txt'
emails = set()
for datafile in datafiles:
for line in open(datafile, 'r'):
email = line.strip().lower()
if email:
emails.add(email)
fout = open(output, 'w')
for email in sorted(emails):
fout.write('%s\n' % email)
“Even those who arrange and design shrubberies are under
considerable economic stress at this period in history.”
-
10-21-2009, 04:29 PM #6Web Hosting Master
- Join Date
- May 2009
- Posts
- 766
I don't know what you'd do on MS equipment as far as using a script to do the processing, but say your table is "a" and you load the data from the text file into a table called "b". You can get the dataset you're looking for with the following query:
PHP Code:SELECT * FROM b WHERE email NOT IN (SELECT email FROM a);
-
10-21-2009, 04:59 PM #7Web Hosting Master
- Join Date
- Nov 2001
- Location
- Vancouver
- Posts
- 2,422
In my two examples I assumed the OP wanted a "union" or combined list with no duplicates (perhaps because I've had to do that one too many times!).
mattle's example shows the differences (one not in the other) between the two sets.
Whatever the OP wants, it's covered now.“Even those who arrange and design shrubberies are under
considerable economic stress at this period in history.”
-
10-21-2009, 05:06 PM #8******* Unleaded
- Join Date
- Feb 2004
- Posts
- 3,849
presuming you want to delete from table B if already in table A
Code:delete b where b.email in select a.email from a where a.email=b.email
Code:delete b from b left join a on a.email=b.email where b.email is null
edgedirector.com
managed dns global failover and load balance (gslb)
exactstate.com
uptime report for webhostingtalk.com
-
10-21-2009, 05:10 PM #9******* Unleaded
- Join Date
- Feb 2004
- Posts
- 3,849
edgedirector.com
managed dns global failover and load balance (gslb)
exactstate.com
uptime report for webhostingtalk.com
-
10-21-2009, 05:23 PM #10******* Unleaded
- Join Date
- Feb 2004
- Posts
- 3,849
sorry, wrong direction, should be:
Code:delete b from b,a where b.email = a.email
edgedirector.com
managed dns global failover and load balance (gslb)
exactstate.com
uptime report for webhostingtalk.com
-
10-21-2009, 11:55 PM #11Web Hosting Master
- Join Date
- May 2009
- Posts
- 766
-
10-22-2009, 12:07 AM #12Web Hosting Master
- Join Date
- May 2009
- Posts
- 766
PHP Code:if ($self->isReasonable())
{
exit;
}
else
{
wetGroundWithFlammableAccelerant();
switch ($self->mood)
{
case "righteous":
makeLinuxCase("With the real DBAs all on Linux, only idiots are left");
break;
case "snarky":
mockSpelling("indiot");
++$irony;
break;
default:
callNames();
break;
}
}
-
10-22-2009, 12:33 AM #13Web Hosting Master
- Join Date
- Nov 2001
- Location
- Vancouver
- Posts
- 2,422
-
10-22-2009, 02:11 AM #14Web Hosting Master
- Join Date
- Mar 2009
- Posts
- 2,222
90,000 rows is pretty small.
Here are some suggestions in order of increasing programming skills required, though all are easy.
If it were MySQL, I would say:
Solution A: minimum programming skills
1) Create an index for the database table email column, using ther MEMORY engine.
2) Read the text file a line at a time. For each line, extract the email address and issue an SQL querty to see if it exists in the database table. This query will simply access the in-memory index, so the overall process should run about as fast as you can read the text file.
3) If the SQL query results in one or more rows that match the email address in the text file, do whatever you were going to do.
If for some reason you cannot create an index in memory, here is an alternative solution; it requires slightly more programming.
Solution B: little programming skill required
1) Create an associate array in your program, indexed on email_address, with default value 0.
2) Read the entire database table row by row; for each row, extract the email address, and set its corresponding value in the associative array to 1
3) Read the text file line by line; for each line, extract the email address, and check the corresponding value in the associative array. If the value is "1", there is a match.
And if the programming language does not support associative arrays, then
Solution C: some programming skill required to ensure correct working.
Use a standard array containing the email addresses; when you extrract the email addresses from the database, use ORDER BY to get them in sorted order, add them to the array in that order, and use a binary search to check if an email address from the text file is in the array.
The binary search will require execution of about 100 statements per item in the text file, which should take less than 0.1 milliseconds of CPU time, oer elss than 4 seconds for the complete job.
-
10-22-2009, 01:31 PM #15Web Hosting Master
- Join Date
- Nov 2001
- Location
- Vancouver
- Posts
- 2,422
PHP Code:# given the clarification of the spec, using text files, in Python:
def get_emails(filename):
return [line.strip().lower() for line in open(filename) if line.strip()]
emails = [e for e in get_emails('emails2.txt') if
e not in get_emails('emails1.txt')]
PHP Code:with open('result.txt', 'w') as fout:
for email in emails:
fout.write('%s\n' % email)
I want to go about with this as safe as possible.
create table email_a as select * from orig_email_tablename;
# import the other data
create table email_b as select * from imported_email_tablename;
... and operate on those.
In either case to use SQL for the work you've got to be able to import lines from a text file into your MSSQL db or to use Python or some other language you'll need to export rows from MSSQL to a text file. As we've said here, 90K rows of email addresses is trivial. No matter what approach you use, this can be done in memory these days without thinking about it.“Even those who arrange and design shrubberies are under
considerable economic stress at this period in history.”
-
10-26-2009, 04:14 PM #16Aspiring Evangelist
- Join Date
- Feb 2004
- Posts
- 385
thanks for all your help guys.
“Intelligence is not to make no mistakes, but quickly to see how to make them good.” - Bertolt Brecht
Similar Threads
-
[NL, Amsterdam/Schiphol] High-Speed, High-Volume promo! 1TB on Gbit from only €100!
By Normanu in forum Dedicated Hosting OffersReplies: 6Last Post: 09-27-2006, 11:16 PM -
[NL] Colocation High Volume
By TwiLightINC in forum Dedicated Hosting OffersReplies: 0Last Post: 04-04-2006, 04:57 PM -
Windows VPS compare list?
By skyaus in forum VPS HostingReplies: 33Last Post: 06-20-2005, 07:04 AM -
perl or shell list compare/extraction
By Slidey in forum Programming DiscussionReplies: 3Last Post: 03-08-2005, 09:54 AM -
Need high volume solution for my biz
By pepper9099 in forum Ecommerce Hosting & DiscussionReplies: 9Last Post: 01-21-2005, 08:54 AM