  1. #1
    Join Date
    Jun 2009
    Posts
    50

    Would cloud hosting be appropriate for this?

    A company we do business with wants to give us a 1.2 GB CSV file daily of about 900,000 products. The whole file doesn't change daily, just some products are dropped or added, but we still need to pull it every day to stay current. Each product has several associated images and some data associated with it (product id, color, price, etc.)

    We then need to display a portion of that list to our partners and let them search through it etc. Each partner would only need to be able to view a few hundred records.

    What I was thinking is that we'd import the 1.2 GB file once into a MySQL database, then every day pull the file and check it against the database for changes. The problem is the dedi server we're using is a Pentium 4 with 2 GB of RAM, and I know it would slow to a crawl if we tried to do this on the same server as our other sites.

    Would cloud hosting be an appropriate solution here? We could store the entire database externally like data.mydomain.com, have a script that pulls the file from that company's server, and then compare/update the database as necessary. That way the rest of the sites on our current dedicated server won't slow to a crawl. I could even maintain a duplicate database for yesterday so that while it's being updated, customers can still browse yesterday's listings.
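
    A minimal sketch of the "duplicate database for yesterday" idea, assuming today's data is loaded into a staging table in the background and then swapped in. SQLite (from Python's standard library) stands in for MySQL here, and the table and column names (products, products_staging, products_old) are hypothetical:

```python
import sqlite3

SCHEMA = "(product_id TEXT PRIMARY KEY, color TEXT, price REAL)"


def refresh_products(conn, rows):
    """Load today's rows into a staging table, then swap it in for the live one."""
    cur = conn.cursor()

    # Build the new copy off to the side; readers keep using `products`.
    cur.execute("DROP TABLE IF EXISTS products_staging")
    cur.execute(f"CREATE TABLE products_staging {SCHEMA}")
    cur.execute("BEGIN")
    cur.executemany("INSERT INTO products_staging VALUES (?, ?, ?)", rows)
    cur.execute("COMMIT")

    # Swap inside one transaction: yesterday's data is kept as products_old.
    cur.execute("BEGIN")
    cur.execute("DROP TABLE IF EXISTS products_old")
    cur.execute("ALTER TABLE products RENAME TO products_old")
    cur.execute("ALTER TABLE products_staging RENAME TO products")
    cur.execute("COMMIT")


# isolation_level=None puts the connection in autocommit mode so the
# explicit BEGIN / COMMIT statements above control the transactions.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(f"CREATE TABLE products {SCHEMA}")
conn.execute("INSERT INTO products VALUES ('A1', 'red', 9.99)")

refresh_products(conn, [("A1", "red", 8.99), ("B2", "blue", 4.50)])

live = conn.execute(
    "SELECT product_id, price FROM products ORDER BY product_id"
).fetchall()
previous = conn.execute("SELECT product_id, price FROM products_old").fetchall()
```

    In MySQL the same swap could probably be done with a single RENAME TABLE statement that renames both tables at once, so visitors never see a half-loaded table.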

    Or I could possibly just host the domain we're using for this entirely in the cloud, rather than just the database. Not sure if there's any advantage to that.

    Would you recommend cloud hosting for this, or any other ideas? It seems to make more sense to me than just getting another dedicated server with better processor/more RAM because we're only going to need that while the file is being parsed. It's not as if tens of thousands of partners are going to be browsing the list. Maybe 100 a day, though I'd like the room to grow to 2000-3000.

    Or Amazon SimpleDB is an idea, maybe?
    Last edited by satoo; 06-06-2011 at 10:25 PM.

  2. #2
    For reading from the database, you'll basically want to make sure the entire database fits in RAM. A 1.2 GB CSV file may end up being 2 GB or more once you factor in indexes and other overhead within a MySQL database. If there are only a few writes to the database (only a few updates), then the hard drive system might not need to be anything special, but if you're doing a lot of random writes as well, then you'll definitely want to be storing everything on an SSD.

    Your thoughts are right that a Pentium 4 with 2 GB of RAM is likely to choke on this kind of thing, but I wouldn't necessarily say you need a cloud solution for this. Keep in mind a cloud is basically a VPS with shared storage. Since this is a database-heavy workload, the reduced I/O performance from shared cloud storage is likely to kneecap your performance.

    A decent dedicated server (quad core, 4-8 GB RAM, RAID 1 storage for redundancy / failover) is likely to blow cloud out of the water and cost a similar amount of money. If you do end up needing to upgrade to SSD storage later on to keep the I/O performance high enough, you're not going to be able to do that in cloud anyway; a dedicated server is the only way to go there.
    IOFLOOD.com -- We Love Servers
    Phoenix, AZ Dedicated Servers in under an hour
    ★ Ryzen 9: 7950x3D ★ Dual E5-2680v4 Xeon ★
    Contact Us: sales@ioflood.com

  3. #3
    Join Date
    Mar 2004
    Location
    Cheshire, UK & WA, USA
    Posts
    245
    Funkywizard, that isn't technically true.

    Cloud is many different things to many different companies.

    For example, using cloud he could set up 1 webserver and load-balanced MySQL servers in either master/slave or master/master replication, share the resources across those 3 servers, and get far better performance than putting all the resources into one server.

    He'd then benefit from not only being guarded against hardware failure (one server means if a major component dies the server dies), but also being able to access features such as scaling (he could have a plan that scales to 3 or more MySQL servers should the need arise) and other things like this.

    Amazon SimpleDB looks very interesting as well, and with a little tweaking your database could use this system to great effect.

    I/O shouldn't be a problem with 100 partners browsing the list, nor should it be a problem in the near future and I'm not sure why you'd suggest you couldn't use SSD on a cloud setup.

    Satoo, the best thing to do is work out what you can do using a dedicated server and get a price etc. Then look at a few different cloud options. From there you'll have the pros and cons of each system and costs and it'll allow you to decide which is the best option based on your knowledge of the system. It may mean you spend a few days working hard comparing options, but you'll be thankful in the future!
    Old School Web Hoster
    138Media LTD (Media and Consulting)

  4. #4
    Join Date
    Jun 2009
    Posts
    50
    I'm open to getting a second dedi server, it just seems overkill to me because the processing power is only necessary once a day when the db is updated. Really, each partner will only be browsing a very small portion of the list...maybe 100 rows each depending on what zip code they're operating in.

    This is not a super high traffic site, just one with unique needs.

    That's why I thought cloud hosting made more logical sense, but Amazon SimpleDB (or perhaps some competitor's alternative) is another idea. I'm also open to some other solution altogether.

  5. #5
    Realistically, a VPS that's not oversold will work well for this as well, but, similar to cloud, you have no way of knowing who your neighbors on the server are, how many resources they're using, how many resources are left over for you, etc. Certainly if your goal is simply not to overload your existing server, and you don't mind if the daily CSV import takes a long time, then either cloud or VPS will be adequate here. To get the same performance from cloud that you would get from a dedicated server, you're generally going to pay a lot more money. This is the overlooked factor with "cloud": it sounds great because you're paying hourly and you can scale up and down, but the performance per dollar is usually so much better on dedicated servers that you can "wastefully" pay for an entire server's performance that you only need occasionally, and still come out cheaper than doing the same thing on cloud.

  6. #6
    Join Date
    Mar 2004
    Location
    Cheshire, UK & WA, USA
    Posts
    245
    Funky,

    I think you are spreading a bit of misinformation here. The majority of cloud platforms, be it OnApp, AppLogic or Amazon, all provide dedicated resources. In fact the majority of them do not have overselling features. With the number of cloud systems out there, it is a very broad statement to say cloud would be more expensive, especially if, as the customer has stated, the extra resources would only be used once a day.

    Even if you are a big believer in hardware over cloud, let's try and keep things factual so the thread creator has a good view on things.

    Satoo,

    If you can provide some specific details, for example how long you think the extra resources would be needed, I'm sure some people with experience of Amazon etc. can give you some pricing.

  7. #7
    Join Date
    Nov 2009
    Posts
    544
    satoo;

    It seems that throwing hardware at this issue would be futile.

    Though I certainly do not know your system, it appears that this would be a data manipulation problem that would be handled off line much more efficiently. I would see the process as something like:

    Downloading the daily file - comparing it to yesterday's file - creating an SQL file containing just the changes - uploading the SQL file to update the database.
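
    The comparison step could be sketched roughly like this in Python. It is only an illustration: the key column name (product_id) and the sample rows are assumptions, and a hash per row is kept instead of the full row so that two large files need not sit in memory at once:

```python
import csv
import hashlib
import io


def row_digests(lines, key="product_id"):
    """Map each product id to a hash of its full row."""
    reader = csv.DictReader(lines)
    digests = {}
    for row in reader:
        # Join the fields in header order; a missing field counts as empty.
        joined = "\x1f".join(row[name] or "" for name in reader.fieldnames)
        digests[row[key]] = hashlib.sha1(joined.encode("utf-8")).hexdigest()
    return digests


def diff_feeds(old, new):
    """Return sorted lists of added, removed and changed product ids."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(k for k in new.keys() & old.keys() if new[k] != old[k])
    return added, removed, changed


# Tiny stand-ins for yesterday's and today's files (hypothetical data).
yesterday = io.StringIO(
    "product_id,color,price\nA1,red,9.99\nB2,blue,4.50\nC3,green,1.00\n"
)
today = io.StringIO(
    "product_id,color,price\nA1,red,8.99\nB2,blue,4.50\nD4,black,2.00\n"
)

added, removed, changed = diff_feeds(row_digests(yesterday), row_digests(today))
```

    With real files, open(...) handles would replace the StringIO objects, and the three id lists would drive the INSERT, DELETE and UPDATE statements written to the SQL file.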

    In any case, the process can be handled relatively quickly (off line) by any desktop PC; processing this on a web server is probably the wrong thing to do.

    Are you able to link to the images on the company server or do you need to download these also?

    A couple of years ago, we handled something like this for an after-market / OEM auto parts catalog (~same number of products with a much higher web visitor load), only we got files from several distributors, manufacturers and media vendors. I know you are having fun with this.
    Last edited by srfreeman; 06-07-2011 at 12:32 AM.

  8. #8
    Join Date
    Jun 2009
    Posts
    50
    That's a good idea. We could mass import once, then have some locally-running script automatically connect to the online database and do the changes.

    As for hotlinking, yeah I've been kinda wondering that myself. They're using a Rackspace CDN solution to host all the images so it seems that they're ok with us doing that? If we do it and they say what are you doing, I suppose that's easy enough to get around because storage is cheap enough that we could just store everything indefinitely.

  9. #9
    Join Date
    Nov 2009
    Posts
    544
    satoo;

    Yep, off line processing works well and the CDN should be good for the images.

    After re-reading your original post: are you planning on hosting this on a shared hosting server with MySQL in the same box? If so, a P4 with just 2GB RAM may be a bit light for the large database; kicking in another 2GB RAM could be a good idea.

    Scaling far beyond your hundred or so users on this size system is probably a pipe dream. If or when there is enough money in the project, moving them to their own box would be a good idea. Moving to a virtual system (cloud, if you will) could have many advantages when scaling your application to an increased user base, but it is not worth doing just for the file processing. And yes, as mentioned earlier, I/O speed on a public cloud will require well-tuned queries and judicious caching, but you should be doing that anyway.

  10. #10
    Join Date
    Jun 2009
    Posts
    50
    It's our dedi server, but yeah there are other sites on it. Nothing really processor or RAM-intensive...mostly static html actually, except for our two blogs.

    We also have two VPS at two other hosting companies for two completely unrelated projects. It just seemed to me a database this large couldn't work on a VPS, otherwise I would just do that.

    The 100 users wouldn't be concurrent. But throughout the day, there may be a total of 100 to start with, realistically a few hundred throughout the day after the first month and gradually going up.

    Maybe another dedicated box is a good idea. Or Amazon SimpleDB to host the DB entirely, still process the files offline, and just add/remove the records to that. Then all we'd really be hosting is the design itself and both the data and images would be pulled remotely.

    It seems interesting http://docs.amazonwebservices.com/Am...eveloperGuide/ and they have a PHP SDK for it. That way we can kind of scale as we need it. I realize at a certain point it would probably cost just as much as getting the dedi box to start with...

  11. #11
    The major issue here is going to be the database. You need to run some performance tests to see how long it is going to take to upload a 1.2GB CSV, and then process the 900,000 records. Databases are the least likely application to perform well in a cloud environment, and you won't want this affecting the performance of the overall site.
    ██ Enterprise Class Cloud Hosting And Disaster Recovery. SAN Replication.
    ██ VMware Hosting on HP Blades With NetApp or EqualLogic SAN Storage. 100% Guaranteed Uptime.
    ██ Build Your Own Virtual DataCentre In The Cloud. Fully Integrated With vCenter.
    ██ StratoGen Are An Authorised VMware Partner | StratoGen.net

  12. #12
    Quote Originally Posted by Latic View Post
    Funky,

    I think you are spreading a bit of misinformation here. The majority of cloud platforms, be it OnApp, AppLogic or Amazon, all provide dedicated resources. In fact the majority of them do not have overselling features. With the number of cloud systems out there, it is a very broad statement to say cloud would be more expensive, especially if, as the customer has stated, the extra resources would only be used once a day.

    Even if you are a big believer in hardware over cloud, let's try and keep things factual so the thread creator has a good view on things.

    Satoo,

    If you can provide some specific details for example how long you think the extra resources would be needed I'm sure some people with experience of Amazon etc can give you some pricing.
    Which is it: are the resources dedicated, or can you burst to use them once a day, leaving them free for others the rest of the time? "Cloud" doesn't give you magical servers; they have the same CPUs and capabilities as any other kind of server, just at greater expense, and shared with other users.

    edit: since you'd like to work with factual information on clouds, and you use Amazon as an example, let's look at their offerings. Their "small" instance, which has 1.7 GB of RAM, is absolutely the smallest that could be feasible for this, in order to stand a decent chance of importing / merging changed items by keeping most of the database in memory during the lookup process every day. At 8.5 cents / hour for Linux, that puts you at $61.20 / mo, assuming there is no bandwidth use whatsoever, which is something Amazon charges a mint for. Considering you can get a "full blown" dedicated server with quad core and plenty of RAM in the ~$100 / mo range, you're certainly not saving much by "going cloud", and you're cutting your performance dramatically. The small instance comes with "1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit)". Their definition says: "One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor."

    So in essence, you're paying a minimum of $60 / mo for a Xen VPS with zero bandwidth (only 15 cents / GB, what a deal!), 1.7 GB of RAM, and availability of 1 - 1.2 GHz of circa 2007 CPU. Sign me up! Volumedrive may not be the best example because they're out of stock and have had some customer support slowness and other issues, but just to compare, they've been offering a dedicated server with a quad Xeon and 100 meg unmetered for just north of $50 / mo, less than the minimum price for Amazon. A more typical offering might be in the $100 range. In no case does the "cloudiness" of the offer make it any more appropriate; in fact, because the resources are dedicated, they're promising that you won't be able to get the resources of multiple cores if you need them.
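
    For anyone checking the arithmetic: the monthly figure is just the hourly rate times the hours in a 30-day month. The short Python check below also adds a transfer charge, under the assumption (made purely for illustration) that the full 1.2 GB feed is pulled every day and billed at the 15 cents / GB rate quoted above; actual billing rules may differ.

```python
HOURS_PER_MONTH = 24 * 30  # 720 hours in a 30-day month


def monthly_instance_cost(hourly_rate, hours=HOURS_PER_MONTH):
    """Cost of leaving one instance running all month, in dollars."""
    return round(hourly_rate * hours, 2)


def monthly_transfer_cost(gb_per_day, rate_per_gb, days=30):
    """Cost of moving a fixed amount of data every day, in dollars."""
    return round(gb_per_day * days * rate_per_gb, 2)


instance = monthly_instance_cost(0.085)        # 8.5 cents / hour
transfer = monthly_transfer_cost(1.2, 0.15)    # assumed: 1.2 GB / day at $0.15 / GB
total = round(instance + transfer, 2)

print(f"instance ${instance:.2f} + transfer ${transfer:.2f} = ${total:.2f} / mo")
```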
    Last edited by funkywizard; 06-07-2011 at 04:13 AM.

  13. #13
    Join Date
    Nov 2009
    Posts
    544
    satoo;

    Really, you could host that database on a net-pad in today's world. You just do it carefully.

    The one issue with running large MySQL databases and/or multiple MySQL databases on machines with low RAM is the memory mapping feature. When the entire database won't fit into memory (or into the memory left over when another database is already using some), it will move to the disk, causing slowdowns and somewhat <tech_term>squirrely</tech_term> operation. Certainly, memory mapping can be turned off, as can other features, but reducing features to gain performance is a slippery slope to a bad place.

    Though I have not personally used SimpleDB for a 900,000 row database, it certainly seems to be a viable idea and its data partitioning may suit you well. I enjoy working with Amazon's system (some will consider that a character flaw) and after 30 years of working with databases, the cloud systems are much faster than many high end servers I have worked with in the past (yes, we had 900,000 row databases back then and snappy queries too). Code, logic and design still rule.

    Yes, ditching the database altogether and using files is certainly viable. In fact this method is championed by many when considering scalability.

    Best of success with the path you choose.

  14. #14
    Join Date
    Apr 2002
    Location
    Auckland - New Zealand
    Posts
    1,575
    So the bottom line is: with databases, use dedicated non-cloud (for MySQL, Oracle et al.) or switch to using files on disk that can merge themselves in the event of a failover, failback etc. If you are a corporate and this is your customer base, then do something else.

    Even Oracle RAC solutions that cost millions of dollars can't cope with failover geographically; it's not a shameful thing to admit. Only about 1% of folks use such Oracle database functionality, so we don't often hear that even Oracle RAC fails as well. It's not a be-all and end-all solution. If you can stand 15 mins downtime, then it's usually easier and more functional for your business to have a human fail over for you.
    A human is probably half the cost of a decent IBM/EMC SAN with failover, if you buy a contractor from a reputable DBA solutions company.
    Last edited by StevenG; 06-07-2011 at 06:08 AM.

  15. #15
    Join Date
    Mar 2004
    Location
    Cheshire, UK & WA, USA
    Posts
    245
    funkywizard,

    I know Amazon is expensive. That is why I'm telling him to consider other alternatives, do some homework and compare his options!

    Whether resources are dedicated comes down to how each cloud is run. With AppLogic, for example, you'd have his instance running with X dedicated resources, and then you'd decide what resources were needed in reserve.

    You are going to find the same everywhere: bad planning, no matter what the technology, will result in a bad service.

  16. #16
    Join Date
    Nov 2009
    Posts
    544
    funkywizard;

    It would seem that the hosting world of today has taught the "There is more to it than price" lesson well.

    While it is true that, in a vacuum, a given application may run well on a cheap leased server on just any network, once you add in the need for scalability and flexibility often necessary in today's marketplace, the additional options made available through the likes of Amazon become advantageous.

    The "performance" of any application on any platform is a result of the application code quality, not the hardware used. The fact that you need to know what you are doing is as important today as it has ever been. Many deplore the laundry list of options available today but to those who can take advantage of them, they are invaluable.

    StevenG;

    While I don't completely understand the point of your rant, if you are saying that the correct tool should be chosen for any job, I may agree fully. Knowing your tools is the job of any craftsman.

    Latic;

    Expensive is always a relative term, but the admonition to do the homework is certainly a good one.

    In terms of homework and education: in systems such as yours, how is one to know what resources are in reserve and what the maximum scalability will be down the road?

  17. #17
    Join Date
    Jun 2009
    Posts
    50
    By reducing the total number of image URLs for any one row to 5 from 20, and getting rid of some unimportant fields, we got the CSV down to 134 MB.
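
    For anyone wanting to do the same kind of trimming, here is a rough Python sketch. The column names (product_id, color, price, notes, image_url_1 ...) are made up for the example, since the real feed's header isn't shown in this thread:

```python
import csv
import io


def trim_feed(src, dst, keep_fields, image_prefix="image_url_", max_images=5):
    """Copy a CSV feed, keeping only the wanted fields and the first few image columns."""
    reader = csv.DictReader(src)
    image_cols = [f for f in reader.fieldnames if f.startswith(image_prefix)]
    kept = [f for f in reader.fieldnames if f in keep_fields] + image_cols[:max_images]

    # extrasaction="ignore" silently drops every column not listed in `kept`.
    writer = csv.DictWriter(
        dst, fieldnames=kept, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return kept


# Hypothetical three-image feed trimmed down to two images per row.
source = io.StringIO(
    "product_id,color,price,notes,image_url_1,image_url_2,image_url_3\n"
    "A1,red,9.99,internal,a1.jpg,a2.jpg,a3.jpg\n"
)
output = io.StringIO()
kept_columns = trim_feed(
    source, output, keep_fields={"product_id", "color", "price"}, max_images=2
)
trimmed_lines = output.getvalue().splitlines()
```

    Because rows are streamed one at a time, this kind of script should cope with the full-size file on a modest machine.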

    A P4 with 2 gigs of RAM could easily handle that, right, even though there are still 900,000 records? I could get another VPS just for this database, too, like with just 1 gig?

  18. #18
    Join Date
    Nov 2009
    Posts
    544
    Well, that was some major surgery. I would still do the daily update processing off line.

    There are so many options available to you and there is no way we can know your query requirements so being able to say "A P4 with 2 gigs of RAM could easily handle that", with any certainty, is impossible.

    It does appear that, given that the usage is nearly all read (with one tiny daily update that can be done at off-peak times), "in memory" features can affect the performance of your application. There is no time like the present to do some testing.

    You may want to read up on MySQL's Memory Storage Engine: http://dev.mysql.com/doc/refman/5.6/...ge-engine.html

  19. #19
    Join Date
    Mar 2004
    Location
    Cheshire, UK & WA, USA
    Posts
    245
    srfreeman,

    As standard we would normally do dedicated resources. We do however have customers with all sorts of configurations and custom appliances we have built around existing systems, which scale when needed.

    For an application such as this, without finding out lots more detail about the database etc., you'd look at having a webserver and probably two MySQL servers running master to master. If you wanted to load balance MySQL, then you could have MySQL servers in reserve ready to come online when certain circumstances happened. As people have mentioned on this thread, it may be best if one MySQL server was used to update and another was used by the end user; that wouldn't require scaling.

  20. #20
    Quote Originally Posted by satoo View Post
    By reducing the total number of image URLs for any one row to 5 from 20, and getting rid of some unimportant fields, we got the CSV down to 134 MB.

    A P4 with 2 gigs of RAM could easily handle that, right, even though there are still 900,000 records? I could get another VPS just for this database, too, like with just 1 gig?
    A 134 MB CSV file shouldn't be too terrible to process on your existing server. Just be sure to check the size of the resulting MySQL database, and do some tests on how long various operations take on that database before sending live traffic to it, and you should be fine.
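
    A rough idea of what such a test could look like, sketched in Python with SQLite standing in for MySQL (so the absolute numbers won't carry over, only the method). The schema, the zip column and the synthetic rows are all assumptions for illustration:

```python
import os
import sqlite3
import tempfile
import time


def build_test_db(path, n_rows):
    """Create a products table with synthetic rows and an index on zip."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE products "
        "(product_id INTEGER PRIMARY KEY, zip TEXT, color TEXT, price REAL)"
    )
    # 500 distinct zip codes, so each one matches n_rows / 500 products.
    rows = ((i, f"{10000 + i % 500:05d}", "red", 9.99) for i in range(n_rows))
    conn.executemany("INSERT INTO products VALUES (?, ?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_products_zip ON products (zip)")
    conn.commit()
    return conn


def time_query(conn, zip_code):
    """Return (number of matching rows, elapsed seconds) for one zip lookup."""
    start = time.perf_counter()
    rows = conn.execute(
        "SELECT product_id, price FROM products WHERE zip = ?", (zip_code,)
    ).fetchall()
    return len(rows), time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "products_test.db")
    conn = build_test_db(db_path, 50_000)
    matched, seconds = time_query(conn, "10042")
    size_bytes = os.path.getsize(db_path)
    conn.close()

print(f"{matched} rows in {seconds:.4f}s, database file is {size_bytes} bytes")
```

    Running the same kind of check against the real import, with the real queries, is what would actually tell you whether the existing box is enough.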

  21. #21
    Join Date
    Nov 2009
    Posts
    544
    Latic;

    Thanks for the response. My question (for those not reading the entire thread: "How is one to know what resources are in reserve and what the maximum scalability will be down the road?") was generic, but the OP's application gives a good example, so let's go with it.

    Memory limitations are certainly on a per node basis. Though we can know the memory requirements of the application today and order the applicable, dedicated resources, we do not know the memory requirements of the future. How can we know the resources available on a given node or in fact the entire installation, for growth?

    Though it certainly may not be applicable to your system, there is so much talk about minimal installations using the likes of AppLogic that I have a real fear of using any. When installing an application, with the implied promise of being able to grow it in place, I would want the assurance that the resources will be available sometime in the future.

    In today's market, I can be relatively sure that the likes of Amazon or IBM will have the resources available but the smaller companies don't give me the same warm fuzzy feeling.

  22. #22
    Join Date
    Mar 2004
    Location
    Cheshire, UK & WA, USA
    Posts
    245
    Srfreeman,

    I can't talk about others, but if I was building an application for the future in AppLogic I'd personally not keep scaling up a single virtual dedicated server. Instead I'd build the application so I could scale up by adding more webservers behind a load balancer, and the same with MySQL.

    Doing that would stop you worrying about memory on a single node, because each of your webservers could be placed on different nodes, and the same with the MySQL servers. You'd also take away healing / recovery time, because if a single node within a cloud failed you'd still have several other servers running on the other nodes within the cloud.

    In regards to memory for an entire cloud the way we work is in N + 1 (at least) so at a very basic level if you have 6 nodes in a grid the maximum resources you'd ever use would be 4.5 of the nodes with one left for failover. The reason I say 4.5 is at that level you know at some point you'll be getting close to using a full 5 nodes and therefore you'll need to add another node to keep in N+1.

    The other option you have is simply getting a service provider to set up a private cloud, which means that instead of putting your trust in the company to keep things running smoothly and manage resources, you could monitor this yourself. Start small with 2 nodes: 1 node to use and 1 node to fail over to. You'd probably be looking at around 32 cores and 32GB between the two machines, so that's 16 cores and 16GB to use, or something like that.

    As a company I believe we are making a move for private grids to smaller servers but more of them as it makes it a little cheaper for the customers and also allows them to use more of their resources.

    For example, 4 servers in a grid with 8 cores and 8GB in each means the total across the grid is 32 cores and 32GB; keeping N+1, however, you could then use 24 cores and 24GB.
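
    The N+1 arithmetic can be written down as a small Python helper. This is only a sketch of the calculation described in this post; the function name and parameters are made up for the example:

```python
def usable_capacity(nodes, cores_per_node, ram_gb_per_node, spare_nodes=1):
    """Cores and RAM (GB) left for customers once spare nodes are held back."""
    if nodes <= spare_nodes:
        raise ValueError("need more nodes than spares")
    active = nodes - spare_nodes
    return active * cores_per_node, active * ram_gb_per_node


# 4 nodes with 8 cores / 8GB each, one kept free for failover.
grid_cores, grid_ram = usable_capacity(nodes=4, cores_per_node=8, ram_gb_per_node=8)

# The 2-node starter private cloud: 16 cores / 16GB per node, one node spare.
small_cores, small_ram = usable_capacity(nodes=2, cores_per_node=16, ram_gb_per_node=16)

print(grid_cores, grid_ram, small_cores, small_ram)
```

    The smaller-but-more-numerous grid leaves three quarters of the hardware usable, where the 2-node setup leaves only half.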

    With cloud you are right to understand / fear that you are at the mercy of how well a network / cloud is being managed. But with good application design and planning you can minimise the worry.

    I hope some of the above is useful. It also shows why I'm a massive advocate of talking with a good few service providers and seeing which ones can match your goals etc.

  23. #23
    I think you should prefer a dedicated server for this purpose, so that you can store and access the data conveniently with connectivity in the Gbps range.

  24. #24
    Join Date
    Nov 2009
    Posts
    544
    Quote Originally Posted by Latic View Post
    For example 4 servers in a grid 8 cores in each and 8gb in each means total across the grid is 32cores and 32gig, however keeping N+1 you could then use 24 cores and 24 gig.
    If a company with this configuration were to sell 12 VMs with 2 cores and 2GB RAM each, it would appear fine. But if I own 2 of them (1 web and 1 MySQL) and my plan were to spin up 2 web server clones to handle expected traffic in the future, how would I know that the resources to do so are not available?
    Last edited by srfreeman; 06-18-2011 at 02:09 PM.

  25. #25
    Private grids or clouds are basically just dedicated servers running virtualisation technology - the idea of the cloud is that you can scale massively beyond what you might be able to afford / want to have sitting on standby for your own use all the time, and only pay for it when you've scaled.
    ____________________________________________
    European and USA IAAS Cloud Hosting
    http://www.dediserve.com Dublin, London, Dallas
    Ranked by Cloudharmony.com as the fastest cloud in the world.
