If I have a static database consisting of folders and files, would access and manipulation be faster than SQL server type databases, considering this would be used in a CGI script?
When working with files and folders, what are the tricks to better performance?
I'll add to the "it depends" crowd.
This is the kind of question that has no generic answer but is heavily dependent on the situation at hand. I even recently moved some data from a SQL database to a flat file system because the overhead of the DB, combined with some DB connection reliability issues, made using flat files a better choice.
Some questions I would ask myself when making the choice include:
- How am I consuming the data? For example, will I just be reading rows from beginning to end in the order they were entered? Or will I be searching for rows that match multiple criteria?
- How often will I be accessing the data during one program execution? Will I go once to get all books with Salinger as the author, or will I go several times to get several different authors? Will I go more than once for several different criteria?
- How will I be adding data? Can I just append a row to the end and have that be perfect for my retrieval, or will the data need to be re-sorted?
- How logical will the code look in six months? I emphasize this because I think it is too often forgotten when designing things (not just code; this hobby horse actually dates from my days as a Navy mechanic cursing mechanical engineers). In six months, when I have to maintain your code (or you do, after working on another project), which way of storing and retrieving data will make more sense? If going from flat files to a DB yields a 1% efficiency improvement but adds a week of figuring things out when you have to update the code, have you really improved things?
It depends on what your information is and what your access patterns and scale are. Two of the biggest benefits of a relational database are:
- Caching. Unless you're very clever, you can't write a cache as good as that of a DB server.
- The query optimizer.
However, for certain specialized applications, neither of these two benefits manifests itself compared to a files-and-folders data store, so the answer is a resounding "depends".
As for files/folders, the tricks are:
- Cache the contents of frequently requested files
- Have small directories (files in deeply nested small directories are much faster to access than in a flatter structure, due to the time it takes to read the contents of a big directory).
- There are other, more advanced optimizations (slicing across disks, placement in different places on a disk or on different partitions, etc.), but if you need THAT level, you are better off with a database in the first place.
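The first two tricks above can be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: the helper names (`sharded_path`, `store_record`, `fetch_record`), the two-level MD5-prefix layout, and the sample key are all assumptions made for the example. Note that under plain CGI each request starts a new process, so the in-process `%cache` only helps within a single request unless the script runs in a persistent environment.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Basename qw(dirname);
use File::Path qw(make_path);
use File::Spec;
use File::Temp qw(tempdir);

# Map a key to a nested path such as <root>/ab/cd/<key>, using the first
# four hex digits of the key's MD5 digest, so no single directory grows large.
sub sharded_path {
    my ($root, $key) = @_;
    my $digest = md5_hex($key);
    return File::Spec->catfile($root, substr($digest, 0, 2), substr($digest, 2, 2), $key);
}

my %cache;    # in-process cache: path => file contents

# Write a record into its shard directory, creating the directory if needed.
sub store_record {
    my ($root, $key, $content) = @_;
    my $path = sharded_path($root, $key);
    make_path(dirname($path));
    open my $fh, '>', $path or die "Cannot write $path: $!";
    print {$fh} $content;
    close $fh or die "Cannot close $path: $!";
    delete $cache{$path};    # drop any stale cached copy
    return $path;
}

# Read a record, consulting the cache before touching the disk.
sub fetch_record {
    my ($root, $key) = @_;
    my $path = sharded_path($root, $key);
    return $cache{$path} if exists $cache{$path};
    open my $fh, '<', $path or die "Cannot read $path: $!";
    local $/;                # slurp mode: read the whole file at once
    my $content = <$fh>;
    close $fh;
    return $cache{$path} = $content;
}

my $root = tempdir(CLEANUP => 1);
store_record($root, 'salinger.txt', "The Catcher in the Rye\n");
print fetch_record($root, 'salinger.txt');    # read from disk
print fetch_record($root, 'salinger.txt');    # served from %cache
```

Hashing the key spreads files evenly across directories regardless of how the keys themselves are distributed; a layout based on a prefix of the key itself would also work if the keys are already well spread.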
As a general rule, databases are slower than files.
If you require indexing of your files, a hard-coded access path on customised indexing structures will always have the potential to be faster if you do it correctly.
But 'performance' is not the goal when choosing a database over a file-based solution.
You should ask yourself whether your system needs any of the benefits that a database would provide. If so, then the small performance overhead is quite acceptable.
So:
- Do you need to deal with multiple users and concurrent updates? (Well; you did say it's static.)
- Do you need flexibility in order to easily query the data from a variety of angles?
- Do you have multiple users, and could gain from making use of an existing security model?
Basically, the question is more about which would be easier to develop. The performance difference between the two is not worth wasting dev time on.
From my little bit of experience, server-based databases (even those served on the local machine) tend to have very slow throughput compared to local filesystems. However, this depends on several things, one of which is asymptotic complexity. If you compare scanning a big list of files against using a database with an index to look up an item, the database wins.
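The asymptotic point can be illustrated without a database at all. The sketch below uses made-up "id,author" rows and hypothetical helper names (`scan_lookup`, `build_index`): a linear scan touches rows one by one, while an index built once up front answers each lookup in roughly constant time. A database index plays the same role as the hash here.

```perl
use strict;
use warnings;
use List::Util qw(first);

# Hypothetical records: "id,author" lines like those a flat file might hold.
my @rows = map { "$_,author$_" } 1 .. 100_000;

# Linear scan: examine rows one by one until the id matches. O(n) per lookup.
sub scan_lookup {
    my ($rows, $id) = @_;
    my $hit = first { (split /,/, $_, 2)[0] == $id } @$rows;
    return defined $hit ? (split /,/, $hit, 2)[1] : undef;
}

# Index: one pass up front builds id => author; each later lookup is a
# single hash access, O(1) on average.
sub build_index {
    my ($rows) = @_;
    my %index;
    for my $row (@$rows) {
        my ($id, $author) = split /,/, $row, 2;
        $index{$id} = $author;
    }
    return \%index;
}

my $index  = build_index(\@rows);
my $wanted = 99_999;
print scan_lookup(\@rows, $wanted), "\n";    # walks almost the whole list
print $index->{$wanted}, "\n";               # single hash lookup
```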
My little bit of experience is with PostgreSQL. I had a table with three million rows, and I went to update a mere 8,000 records. It took 8 seconds.
As for the quote "Premature optimization is the root of all evil.", I would take that with a grain of salt. If you write your application using a database, then find it to be slow, it might take a tremendous amount of time to switch to a filesystem-based approach or something else (e.g. SQLite). I would say your best bet is to create a very simple prototype of your workload, and test it with both approaches. I believe it is important to know which is faster in this case.
As others have pointed out: it depends!
If you really need to find out which is going to be more performant for your purposes, you may want to generate some sample data to store in each format and then run some benchmarks. The Benchmark.pm module comes with Perl, and makes it fairly simple to do a side-by-side comparison with something like this:
    use Benchmark qw(:all);

    my $count = 1000;  # Some large-ish number of trials is recommended.

    cmpthese($count, {
        'File System' => sub { ...your filesystem code... },
        'Database'    => sub { ...your database code... },
    });
You can type perldoc Benchmark to get more complete documentation.
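As a concrete, runnable variation on that skeleton, here is a sketch that compares two file layouts using only core modules. The sample data (2,000 made-up records), the file names, and the helpers `read_per_file` and `read_by_scan` are assumptions for illustration; swap in your real filesystem and database code to get numbers that mean something for your workload.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use File::Spec;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
my $big = File::Spec->catfile($dir, 'all.csv');

# Hypothetical sample data: 2,000 records, stored both ways.
open my $all, '>', $big or die "Cannot write $big: $!";
for my $id (1 .. 2000) {
    print {$all} "$id,value$id\n";
    my $one = File::Spec->catfile($dir, "$id.txt");
    open my $fh, '>', $one or die "Cannot write $one: $!";
    print {$fh} "value$id\n";
    close $fh;
}
close $all;

# Layout A: one small file per record, addressed directly by name.
sub read_per_file {
    my ($id) = @_;
    open my $fh, '<', File::Spec->catfile($dir, "$id.txt") or die "open: $!";
    chomp(my $value = <$fh>);
    return $value;
}

# Layout B: a single flat file scanned from the top until the id matches.
sub read_by_scan {
    my ($id) = @_;
    open my $fh, '<', $big or die "open: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($key, $value) = split /,/, $line, 2;
        return $value if $key == $id;
    }
    return;
}

# A negative count asks Benchmark to run each case for at least that many
# CPU seconds instead of a fixed number of iterations.
cmpthese(-1, {
    'One file per key' => sub { read_per_file(1 + int rand 2000) },
    'Scan flat file'   => sub { read_by_scan(1 + int rand 2000) },
});
```

The absolute rates depend heavily on the machine and on whether the files are already in the operating system's cache, so treat the output as a relative comparison only.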
Using files instead of a DB is very useful for images, if the site structure is suitable. Create folders that correspond to your data and place the images inside. For example, say you have an article site and you store your articles in the DB. You don't have to store image paths in the DB: name the folders after your primary keys (1, 2, 3, ...) and put the images inside. This approach works for all media files: e-books, music files, videos. The same logic works for XML files if you won't be searching inside them.
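A minimal sketch of that folder-per-primary-key layout might look like this. The function name `images_for_article`, the accepted file extensions, and the demo file names are all hypothetical; the id check is there because in a CGI script the id would typically come from user input and should not be allowed to contain path components.

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Spec;
use File::Temp qw(tempdir);

# Return the image files stored under <root>/<article id>/, sorted by name.
# Returns an empty list if the article has no folder.
sub images_for_article {
    my ($root, $id) = @_;
    die "Invalid article id\n" unless $id =~ /\A\d+\z/;
    my $dir = File::Spec->catdir($root, $id);
    opendir my $dh, $dir or return;
    my @images = sort grep { /\.(?:jpe?g|png|gif)\z/i } readdir $dh;
    closedir $dh;
    return map { File::Spec->catfile($dir, $_) } @images;
}

# Demo with a throwaway directory tree (hypothetical file names).
my $root = tempdir(CLEANUP => 1);
make_path(File::Spec->catdir($root, '1'));
for my $name (qw(cover.png figure.jpg notes.txt)) {
    open my $fh, '>', File::Spec->catfile($root, '1', $name)
        or die "Cannot create $name: $!";
    close $fh;
}
print "$_\n" for images_for_article($root, 1);    # lists the .png and .jpg only
```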
It depends on the profile of the data and what logic you are going to be using to access it. If you simply need to save and fetch named nodes then a filesystem-based database may be faster and more efficient. (You could also have a look at Berkeley DB for that purpose.) If you need to do index-based searches, and especially if you need to join different sets of data based on keys, then an SQL database is your best bet.
I would just go with whatever solution seems the most natural for your application.
As others have said, it depends: on the size and nature of the data and the operations you're planning to run on it.
Particularly for a CGI script, you're going to incur a performance hit for connecting to a database server on every page view. However, if you create a naive file-based approach, you could easily create worse performance problems ;-)
As well as a Berkeley DB File solution you could also consider using SQLite. This creates a SQL interface to a database stored in a local file. You can access it with DBI and SQL but there's no server, configuration or network protocol. This could allow easier migration if a database server is necessary in the future (example: if you decide to have multiple front-end servers, but need to share state).
Without knowing any details, I'd suggest using a SQLite/DBI solution then reviewing the performance. This will give flexibility with a reasonably simple start up and decent performance.
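A rough sketch of the SQLite/DBI route is below. It assumes the DBD::SQLite driver is installed from CPAN (it is not part of core Perl); the `books` table, its columns, and the `titles_by_author` helper are made up for the example. An in-memory database keeps the sketch self-contained, whereas a real script would point `dbname` at a file on disk.

```perl
use strict;
use warnings;
use DBI;    # assumes the DBD::SQLite driver is installed

# Assumption: ':memory:' is used only to keep this sketch self-contained;
# a real script would use a file path, e.g. dbname=/path/to/data.db.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 1 });

# Hypothetical schema, with an index on the column we search by.
$dbh->do('CREATE TABLE books (id INTEGER PRIMARY KEY, author TEXT, title TEXT)');
$dbh->do('CREATE INDEX idx_books_author ON books (author)');

my $insert = $dbh->prepare('INSERT INTO books (author, title) VALUES (?, ?)');
$insert->execute('Salinger', 'The Catcher in the Rye');
$insert->execute('Salinger', 'Franny and Zooey');
$insert->execute('Heller',   'Catch-22');

# Fetch all titles for one author, using a placeholder for the value.
sub titles_by_author {
    my ($dbh, $author) = @_;
    my $titles = $dbh->selectcol_arrayref(
        'SELECT title FROM books WHERE author = ? ORDER BY title',
        undef, $author);
    return @$titles;
}

print "$_\n" for titles_by_author($dbh, 'Salinger');
```

Because the queries go through DBI, moving to a database server later mostly means changing the connection string and reviewing any SQLite-specific SQL.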
To quickly access files, depending on what you are doing, an mmap can be very handy. I just wrote about this in the Effective Perl blog as "Memory-map files instead of slurping them".
However, I expect that a database server would be much faster. It's difficult to say what would be faster for you when we have no idea what you are doing, what sort of data you need to access, and so on.
I'm going to give you the same answer everyone else gave you: it depends.
In a simple scenario with a single server that only returns data (read-only), yes, the file system will be great and easy to manage.
But when you have more than one server, you'll have to manage a distributed file system such as GlusterFS or Ceph.
A database is a tool that manages all of that for you: distributed storage, compression, reads and writes, locks, etc.
Hope that's helpful.
As others said, a DB is a tool, and it creates some overhead. But if your data is static and read-only, reading directly from files will be faster. Here are some tests that I've done. I had files with the name of the file as .csv. In the database I had an indexed 'date' column in order to find the same records in the database. Each day has 30K-50K records/rows and 100 columns of different data types (90% floats).
DB Info: PostgreSQL 11.5, 16GB of RAM
Table:
335,162,867 records
Table size: 110GB
Index size: 7GB
Total size: 117GB
Files:
Number of files: 8033
Total Files size: 158GB
Number of records/lines per file/date: 30K - 50K
Reading data for a random date (1986-2019) from a file was consistently 4-5 times faster than reading data for the same date from PostgreSQL.
Source: https://stackoverflow.com/questions/2147902/is-it-faster-to-access-data-from-files-or-a-database-server