hosted by liquidweb


Go Back   Web Hosting Talk : Other Forums : Web Hosting Lounge : Data sorting / duplication removal
Reply

Web Hosting Lounge Forum for general conversation, share interests, have a laugh or discuss topics not related to the above or below forums.
Forum Jump

Data sorting / duplication removal

Reply Post New Thread In Web Hosting Lounge Subscription
 
Send news tip View All Posts Thread Tools Search this Thread Display Modes
  #1  
Old 01-20-2005, 07:40 PM
fang666 fang666 is offline
Junior Guru Wannabe
 
Join Date: Sep 2004
Posts: 73

Data sorting / duplication removal


Hello,

A client has a list of their customers, in a spreadsheet. As this list came from 3 different databases, and is to be imported into 1 new system, there is a fair bit of duplication.

Is there somebody that offers a service where they take the spreadsheet, and sort through it removing duplications?

Reply With Quote
Sponsored Links
  #2  
Old 01-20-2005, 07:50 PM
cywkevin cywkevin is offline
Predatory Poster
 
Join Date: Jul 2003
Location: Goleta, CA
Posts: 5,550
Yeah this is a common need in business. Try googling someone that will do it. Or put in a request over at employment offer sin this forum.

__________________
Patron: I'd like my free lunch please.
Cafe Manager: Free lunch? Did you read the fine print stating it was an April Fool's joke.
Patron: I read the same way I listen, I ignore the parts I don't agree with. I'm suing you for false advertising.
Cafe Owner: Is our lawyer still working pro bono?

Reply With Quote
  #3  
Old 01-20-2005, 07:55 PM
Rob83 Rob83 is offline
Web Hosting Master
 
Join Date: Dec 2003
Location: Miami, FL
Posts: 3,262
why not import the three spreadsheets into 1, then sort them by the command term (ie. tel #) and then delete the dups?

Reply With Quote
Sponsored Links
  #4  
Old 01-20-2005, 08:08 PM
fang666 fang666 is offline
Junior Guru Wannabe
 
Join Date: Sep 2004
Posts: 73
Quote:
Originally posted by Rob83
why not import the three spreadsheets into 1, then sort them by the command term (ie. tel #) and then delete the dups?
There are 30,000 records.

Reply With Quote
  #5  
Old 01-20-2005, 08:11 PM
Rob83 Rob83 is offline
Web Hosting Master
 
Join Date: Dec 2003
Location: Miami, FL
Posts: 3,262
Quote:
Originally posted by fang666
There are 30,000 records.
You can easily run a macro that will search through the 30,000 records for duplicates by analyzing specific rows and columns.

Quote:
Sub Find_Duplicates()

Dim CompareRange As Variant
Dim Selection As Variant
Dim x As Variant
Dim y As Variant

Set Selection = Range("B2:B6")
Set CompareRange = Range("C2:C6")

For Each x In Selection
For Each y In CompareRange
If x = y Then x.Offset(0, 2) = x
Next y
Next x

End Sub

Reply With Quote
  #6  
Old 01-20-2005, 08:22 PM
fang666 fang666 is offline
Junior Guru Wannabe
 
Join Date: Sep 2004
Posts: 73
Rob - thanks for the informative reply.

The issue is, that there could be a few customers with the same name. So in the event of a duplicated name, we need to check the address to see if its the same.

Then of course, if the address is the same but entered slightly differently......

Reply With Quote
  #7  
Old 01-20-2005, 08:36 PM
JayC JayC is offline
Web Hosting Master
 
Join Date: Aug 2000
Location: NYC
Posts: 6,627
Yep, eliminating duplicates in addresses isn't that straightforward. The way it's generally done is to first "standardize" the address by formatting them to USPS specifications. There is software available to do that. Ideally it will check that the address range exists, and identify any addresses that are not possible. You can then find exact duplicates with that output, or take a further step of tokenizing the addresses and matching that version; or more sophisticated dupe software will scan through the addresses and give each a score that measures how closely it resembles other addresses. Some will be clearly atches, some clearly not, and some in a gray area that may or may not be. You can either treat all of those one way or the other, or manually evaluate them.

Near matches on names can be handled generally the same way... in fact the level of match between the names associated with each address can be part of the criteria for weighing whether two records are duplicates or not.

If you do a search and find a commercial service to do it, they should take an approach along those lines.

__________________
Specializing in SEO and PPC management.

Reply With Quote
Reply

Related posts from TheWhir.com
Title Type Date Posted
NaviSite Launches Two Enterprise Cloud Storage and Backup Services Web Hosting News 2013-03-06 11:54:15
New Google Transparency Report Provides Insight into URL Takedown Requests Web Hosting News 2012-05-25 13:59:26
Data, Data Everywhere! Blog 2011-08-16 15:05:45
Canadian Government Releases Data Center, Email Consolidation Plan Web Hosting News 2011-08-05 14:05:15
Data Storage Corporation Names Director of Sales Web Hosting News 2011-07-13 14:04:08


Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes
Postbit Selector

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off

Forum Jump
Login:
Log in with your username and password
Username:
Password:



Forgot Password?
Advertisement:
Web Hosting News:



 

X

Welcome to WebHostingTalk.com

Create your username to jump into the discussion!

WebHostingTalk.com is the largest, most influentual web hosting community on the Internet. Join us by filling in the form below.


(4 digit year)

Already a member?