Web Hosting Talk







View Full Version : Database for file storage question


ncis
12-10-2009, 05:53 AM
Hello,

We run some fairly large trafficked sites with millions of records, basically like business directories.

At the moment all images/pictures uploaded by the businesses are stored in the filesystem with the reference to the file location and the business ID stored in a mysql table. This works great, however backing up and restoring these folders sucks up a lot of juice from the server (we do backups nightly) when the folder gets tar'd and gzipped.

I've been experimenting with SQLite and my question is... is there any cons to using an SQLite database to store all the files along with the business ID? so there would be one table with two columns, ID (indexed) and the picture contents (BLOB).

I've tested it out on another site that had about 50k images, so what I did was during the INSERT to the SQLite database was gzipped the files bytes, then base64 encoded the string (in PHP) so the inserted content was not all the gibberish text.

Performance on that was pretty good, selecting 50 - 100 records did take a second or two but it solves my problem with dealing with large file systems as all of that content is in one file.

Anyone done anything similar or can recommend a free/open source software that can store files in a flat file but is designed to handle millions/10's of millions of records with large blob's?

I've done a lot of googling to find a solution for this for LINUX based OS's and PHP api's but didn't really find anything.

Sorry for the long post!

xtrac568
12-10-2009, 06:23 AM
if the only problem is backup, you can use incremental backup, e.g. rsync, instead of making full backup tar archive every time.

ncis
12-10-2009, 06:36 AM
that is an option, I've never dabbled with rsync and I'm not sure how well it would perform on large filesystems and I am curious to know if the incremental backups are stored in seperate tarballs and requires rsync to restore.

The way our backups are done is once a day the folder is tar'd/gzipped with the date as the filename and sent off to Amazon S3 for archival (same process for our mysql databases using hotcopy). In the event of the disk failing or OS crashing (and this happened 3 times in the past year already) we can easily replace the drive, transfer the DB and the filesystem archives back from amazon, extract and voila we're live. Restoring takes literally 10 minutes total.

Rsync might be a last resort, the appeal of a flat file storage system using SQLite or an alternative is mouth watering for its simplicity :)

mattle
12-10-2009, 09:43 AM
Um...what's simpler than this?


root@host:~# mkdir test1
root@host:~# mkdir test2
root@host:~# touch test1/file1
root@host:~# touch test1/file2
root@host:~# rsync -avx test1/ test2/
building file list ... done
./
file1
file2

sent 179 bytes received 70 bytes 498.00 bytes/sec
total size is 0 speedup is 0.00
root@host:~# ls -l test2
total 0
-rw-r--r-- 1 root other 0 Dec 10 08:35 file1
-rw-r--r-- 1 root other 0 Dec 10 08:36 file2
root@host:~# rm test2/file2
root@host:~# rsync -avx test1/ test2/
building file list ... done
./
file2

### NOTICE: only the missing file is copied ###

sent 137 bytes received 48 bytes 370.00 bytes/sec
total size is 0 speedup is 0.00
Here's how I'd implement it:

/dev/hda1: /path/to/web/data
/dev/hdb1: /path/to/backup/data

Do an rsync between those two drives. That way, you've always got an uncompressed backup on the same machine...if the main drive fails, you're set.

After the rsync, create your .tgz archive at a low priority (continue to give preferential treatment to apache processes) on the backup drive (reserve disk spins on the main drive for real web accesses) and then ship it off to your Amazon cloud. This method also prevents files from changing while creating the archive.

tim2718281
12-10-2009, 12:48 PM
Hello,

We run some fairly large trafficked sites with millions of records, basically like business directories.

At the moment all images/pictures uploaded by the businesses are stored in the filesystem with the reference to the file location and the business ID stored in a mysql table. This works great, however backing up and restoring these folders sucks up a lot of juice from the server (we do backups nightly) when the folder gets tar'd and gzipped.


Have you measured the benefit of gzipping the image files? I'd have thought it would make .jpg files larger rather than smaller.

foobic
12-10-2009, 05:41 PM
I think gzip almost always achieves some compression, but with data that's already compressed (.jpg / .gif / .png images) the benefit would be tiny and the cpu usage could be significant.

Another vote for rsync. You might also want to look into s3fs - mount your S3 bucket as a local filesystem.

mattle
12-11-2009, 09:29 AM
I think gzip almost always achieves some compression, but with data that's already compressed (.jpg / .gif / .png images) the benefit would be tiny and the cpu usage could be significant.

Another vote for rsync. You might also want to look into s3fs - mount your S3 bucket as a local filesystem.

If you do that...make sure you use the -z option...even though rsync will just mirror your filesystem, you can compress the stream. If you're paying for bandwidth with Amazon...