Interesting insights.
Thanks, guys and gals.
It seems like it really does depend on the entirety of the solution's design and usage (hence I don't always like agile when I can't see the requirements ahead of time).
I wanted to add some other considerations from the mainframe world.
Generally we deal with large data volumes and small batch windows, so performance is key, but so are reliability, maintainability, sustainability, scalability (ja, all the -abilities :-), and also consistency and integrity... OK, I think you get the picture.
What we learned here is that generally, when you process in batch mode (bulk data arrives at a single point in time, e.g. all debits due on the month-end date), we will use files. These are normally indexed or flat, depending on whether you need keyed access or have to read every last record, respectively.
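To make that keyed-vs-sequential distinction concrete, here is a minimal Python sketch (file names and record layout are made up for illustration; on the mainframe this would be something like VSAM/QSAM, not Python):

```python
import dbm

# Keyed access: an indexed store, good when you only need specific records.
with dbm.open("accounts", "c") as db:   # 'c' = create the index file if missing
    db[b"ACC001"] = b"balance=100.00"
    db[b"ACC002"] = b"balance=250.50"
    print(db[b"ACC002"])                # direct lookup by key, no scan

# Sequential access: a flat file, good when you must read every last record.
with open("debits.dat", "w") as f:
    f.write("ACC001|debit|10.00\nACC002|debit|25.00\n")
with open("debits.dat") as f:
    for record in f:                    # touch every record exactly once
        account, kind, amount = record.rstrip("\n").split("|")
```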
The files are generally pre-allocated and defined, and careful consideration is given to block sizes so they align with pages. Bear in mind that a sequential read actually retrieves blocks, which may contain multiple records in sequence, for every read command. Aligned properly, this spares the file system manager from having to retrieve data in less optimal ways (unaligned record sizes), dynamically rearrange its allocation and indexes (very similar to GC), lock records at page or block level when you cannot do dirty/uncommitted reads, or make space in the middle of an indexed file under random inserts.
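Roughly what proper blocking buys you, sketched in Python (the block and record sizes are invented; the point is that the block length is an exact multiple of the record length, so no record ever straddles a block boundary):

```python
RECORD_LEN = 64          # fixed-length records (assumed for illustration)
RECORDS_PER_BLOCK = 128  # block size is an exact multiple of the record size
BLOCK_LEN = RECORD_LEN * RECORDS_PER_BLOCK

def read_records(path):
    """Yield records one at a time, but hit the file system once per block."""
    with open(path, "rb") as f:
        while block := f.read(BLOCK_LEN):      # one I/O call, many records
            for i in range(0, len(block), RECORD_LEN):
                record = block[i:i + RECORD_LEN]
                if len(record) == RECORD_LEN:  # guard against a short tail
                    yield record
```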
So sometimes, even when a large percentage of the work is keyed inserts from one input to another, or matching between the two, you can still use flat files and do things like low-key logic matching as you run, using the most probable file as your driver file. We sometimes even find it more efficient to dump the DB to a flat file, process that, and reload afterwards (if you have to hit every record, for example), or even to read the underlying DB files directly. (PS: a nice option for resilience: if the DB connection or something else makes the DB unavailable and you need to continue reading, check the return code and read the DB file system directly when the return code indicates an error.)
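The low-key matching idea looks something like this sketch, assuming both files are pipe-delimited and pre-sorted ascending on the key in the first field:

```python
def key(line):
    return line.rstrip("\n").split("|", 1)[0]

def match(driver_path, lookup_path):
    """Two-file match: both files pre-sorted ascending on the first field.
    The driver file (your 'most probable' file) paces the whole loop."""
    with open(driver_path) as driver, open(lookup_path) as lookup:
        lk = next(lookup, None)
        for dr in driver:
            # advance the lookup side while its key is still behind the driver's
            while lk is not None and key(lk) < key(dr):
                lk = next(lookup, None)
            if lk is not None and key(lk) == key(dr):
                print("matched:", key(dr))
            else:
                print("no match:", key(dr))
```

Because both sides only ever move forward, one sequential pass over each file is all it costs.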
Driver file logic (which drives the main logic, i.e. for every one of the records on that master file, do XYZ) is generally, as a rule of thumb, used for flat-file processing, and these files are typically pre-sorted in a prior step using high-speed assembly-based sort utilities to suit the logic breaks expected while processing the records (i.e. if you want to process per client, pre-sort on client; per branch, pre-sort on branch; etc.).
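A minimal Python illustration of such a logic break: groupby only works here because the pre-sort guaranteed all of a client's records arrive adjacent to each other (the data itself is made up):

```python
import itertools

# Records pre-sorted on client in the prior step (what the sort utility is for):
records = [
    ("clientA", 10.0), ("clientA", 5.0),
    ("clientB", 7.5),  ("clientB", 2.5), ("clientB", 1.0),
]

# The logic break: every change of client key closes one unit of work.
for client, group in itertools.groupby(records, key=lambda r: r[0]):
    total = sum(amount for _, amount in group)
    print(f"{client}: {total:.2f}")   # do ...XYZ... once per client
```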
Anyway, when it comes to high-speed real-time requirements (the opposite of batch bulk), we would normally use a DB. Mostly this relates to UI, transactional (messaging), event-driven, or interactive use cases, which are better suited to direct/keyed access and hardly ever need the bulk throughput a batch process requires.
You also have the possibility to read and lock at record level automatically to ensure integrity (and you can go faster if you don't care about integrity and can afford to read uncommitted/dirty)...
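For example, a hedged sketch assuming a PostgreSQL back end via psycopg2 (the accounts table is made up, and note that PostgreSQL silently upgrades READ UNCOMMITTED to READ COMMITTED, though other engines honour it):

```python
import psycopg2  # assuming PostgreSQL; lock/isolation syntax varies by engine

conn = psycopg2.connect("dbname=bank")

# Integrity path: lock the row while you read-modify-write it.
with conn, conn.cursor() as cur:
    cur.execute("SELECT balance FROM accounts WHERE id = %s FOR UPDATE", (1,))
    (balance,) = cur.fetchone()          # row stays locked until commit
    cur.execute("UPDATE accounts SET balance = %s WHERE id = %s",
                (balance - 10, 1))
# leaving the `with conn` block commits and releases the lock

# Speed path: accept uncommitted/dirty reads where the engine allows it.
with conn, conn.cursor() as cur:
    cur.execute("SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED")
    cur.execute("SELECT balance FROM accounts WHERE id = %s", (1,))
```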
If you think about it, the OS is file-system based, and therefore the DB ultimately has to be file-system based too. However, the DB is extremely optimized for the typical event/real-time, concurrent, high-integrity use cases. A dev would sometimes have to do a lot of work to get all that functionality in a real-time/event-driven scenario: think deadlocks, rollbacks, syncpoints, concurrency, etc., over a complex flow, while still remaining highly responsive.
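Here is a rough Python/psycopg2 sketch of the kind of plumbing involved: rollback on deadlock, back off, retry, and take locks in a fixed order to avoid the deadlock in the first place (table and schema are invented for illustration):

```python
import time
import psycopg2
from psycopg2 import errors

def transfer(conn, src, dst, amount, retries=3):
    """One unit of work with rollback and retry on deadlock."""
    for attempt in range(retries):
        try:
            with conn.cursor() as cur:
                # lock both rows in a fixed (sorted) order to avoid deadlocks
                for acc in sorted((src, dst)):
                    cur.execute("SELECT balance FROM accounts "
                                "WHERE id = %s FOR UPDATE", (acc,))
                cur.execute("UPDATE accounts SET balance = balance - %s "
                            "WHERE id = %s", (amount, src))
                cur.execute("UPDATE accounts SET balance = balance + %s "
                            "WHERE id = %s", (amount, dst))
            conn.commit()                     # the syncpoint: all or nothing
            return
        except errors.DeadlockDetected:
            conn.rollback()                   # undo the half-done work...
            time.sleep(0.1 * (attempt + 1))   # ...back off and try again
    raise RuntimeError("gave up after repeated deadlocks")
```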
Writing this just made me realize that there is much more to consider when deciding what to use, when, and why. Sometimes even the same data at different points in its life is best suited to different choices. I.e. you might acquire the data through front ends using DBs or other direct keyed-access methods optimized for that, but you may want to drop it onto the file system and process a flat file if you have to do a monthly recon on that data at a specific point in the month, for example.
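As a hedged example of that dump-to-file step (table and column names invented), unloading the month's transactions pre-sorted so the recon pass can then run as a straight flat-file read:

```python
import csv
import psycopg2

# Unload the month's transactions to a sorted flat file, then recon against
# that snapshot instead of hammering the live DB with a full-table pass.
conn = psycopg2.connect("dbname=bank")
with conn.cursor() as cur, open("recon_2024_06.dat", "w", newline="") as out:
    cur.execute("SELECT account_id, posted_on, amount FROM transactions "
                "WHERE posted_on >= %s AND posted_on < %s "
                "ORDER BY account_id, posted_on",   # pre-sort for the recon pass
                ("2024-06-01", "2024-07-01"))
    writer = csv.writer(out, delimiter="|")
    for row in cur:
        writer.writerow(row)
```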
It does seem that all of this is design dependent. Say, for example, you can't do certain things at the point of the transaction/event due to response-time requirements and have to defer them until later while still keeping integrity in place, as opposed to being able to do everything in one simpler process without deferring.
It will also be very dependent on the domain. I.e. will you know if Google missed a random hit when returning search results? But I bet you will know when your bank does not show a transaction (especially a credit to you... lol).