I am working on a (database-ish) project, where data is stored in a flat file. For reading/writing I'm using the RandomAccessFile class. Will I gain anything f
In my experience from C++ development the answer is: yes, using multiple threads can improve performance when reading files. This applies to both sequential and random access. I have verified this more than once, although I always found that the real bottlenecks were somewhere else.
The reason is that a thread performing disk access is suspended until the disk operation has completed. But most disks today support Native Command Queuing (both SAS and SATA drives, as do most RAID systems) and therefore do not have to handle requests in the order you make them.
Thus if you read 4 file chunks sequentially, your program has to wait for the first chunk, then request the second one, and so on. If you request the 4 chunks with 4 threads, they may be returned all at once. This kind of optimization has limits, but it works (although I only have experience with C++ here). I measured that multiple threads can improve sequential read performance by more than 100%.
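Since the question is about Java, here is a rough sketch of the same idea there, not a tuned implementation. It assumes positional reads on a shared FileChannel (read(ByteBuffer, long) does not touch the channel's own position, so threads do not interfere). The class name ParallelChunkReader, the 64 KB buffer and the chunking scheme are made up for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: splits a file into one chunk per thread and reads them concurrently.
class ParallelChunkReader {

    /** Reads the whole file with the given number of threads; returns the bytes read. */
    static long readInChunks(Path path, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            long size = channel.size();
            long chunk = (size + threads - 1) / threads; // round up so the tail is not lost
            List<Future<Long>> parts = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                long start = Math.min(size, i * chunk);
                long end = Math.min(size, start + chunk);
                parts.add(pool.submit(() -> readRange(channel, start, end)));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // blocks until that chunk has been read
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    /** Reads the bytes in [start, end) using positional reads only. */
    static long readRange(FileChannel channel, long start, long end) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(64 * 1024);
        long position = start;
        while (position < end) {
            buffer.clear();
            buffer.limit((int) Math.min(buffer.capacity(), end - position));
            int n = channel.read(buffer, position);
            if (n < 0) {
                break; // file is shorter than expected
            }
            position += n;
        }
        return position - start;
    }
}
```

Calling ParallelChunkReader.readInChunks(path, 4) issues the four chunk requests concurrently. Whether that is actually faster depends on the disk and on OS caching, so measure on your own hardware.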
I am surprised that every answer talks about performance but no one distinguishes latency from throughput, although both are performance characteristics. While you may gain additional throughput by employing multiple threads, as @RED SOFT ADAIR has shown, you trade off latency, especially in the case of Native Command Queuing.
There is an option to memory-map your flat file with NIO. In that case the OS memory manager becomes responsible for moving sections of the file in and out. You can also apply region locks for writers.
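A minimal sketch of what that could look like, assuming fixed-size records of one long each. The class MappedRecords, the record layout and the method names are invented for the example; only FileChannel.map and FileChannel.lock are the real NIO calls.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical record store: each record is a single 8-byte long.
class MappedRecords {
    static final int RECORD_SIZE = Long.BYTES;

    /** Writes one record through a memory mapping while holding a lock on its region. */
    static void writeRecord(Path path, int index, long value) throws IOException {
        try (FileChannel channel = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            long offset = (long) index * RECORD_SIZE;
            // Exclusive lock on just this record's byte range.
            try (FileLock lock = channel.lock(offset, RECORD_SIZE, false)) {
                MappedByteBuffer map =
                        channel.map(FileChannel.MapMode.READ_WRITE, offset, RECORD_SIZE);
                map.putLong(0, value);
                map.force(); // ask the OS to flush the mapped region to disk
            }
        }
    }

    /** Reads one record through a read-only mapping. */
    static long readRecord(Path path, int index) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer map = channel.map(
                    FileChannel.MapMode.READ_ONLY, (long) index * RECORD_SIZE, RECORD_SIZE);
            return map.getLong(0);
        }
    }
}
```

One caveat: a FileLock is held on behalf of the whole JVM, so it coordinates separate processes, not threads inside one process. Threads in the same JVM still need ordinary Java synchronisation.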
I have now run a benchmark with the code below (excuse me, it's in C++). The code reads a 5 MB text file with a number of threads passed as a command line argument.
The results clearly show that multiple threads always speed up a program:
Update: It occurred to me that file caching plays quite a role here. So I made copies of the test data file, rebooted, and used a different file for each run. Updated results below (old ones in brackets). The conclusion remains the same.
Runtime in Seconds
Machine A (Dual Quad Core XEON running XP x64 with 4 10k SAS Drives in RAID 5)
Machine B (Dual Core Laptop running XP with one fragmented 2.5 Inch Drive)
Sourcecode (Windows):
// FileReadThreads.cpp : Defines the entry point for the console application.
//
#include "Windows.h"
#include "stdio.h"
#include "stdlib.h"   // atoi, exit
#include "conio.h"
#include <sys/timeb.h>
#include <io.h>
///////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////
int threadCount = 1;
char *fileName = 0;
int fileSize = 0;
double GetSecs(void);
///////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////
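DWORD WINAPI FileReadThreadEntry(LPVOID lpThreadParameter)
{
    char tx[255];
    // The thread index is passed through the pointer-sized parameter.
    int index = (int)(INT_PTR)lpThreadParameter;
    FILE *file = fopen(fileName, "rt");
    if(! file)
    {
        printf("THREAD %4lu: unable to open %s\n", GetCurrentThreadId(), fileName);
        return 1;
    }
    // Each thread reads its own contiguous slice of the file.
    int start = (fileSize / threadCount) * index;
    int end = (fileSize / threadCount) * (index + 1);
    fseek(file, start, SEEK_SET);
    printf("THREAD %4lu started: Bytes %d-%d\n", GetCurrentThreadId(), start, end);
    for(;;)
    {
        // Read line by line until EOF or the end of this thread's slice.
        if(! fgets(tx, sizeof(tx), file))
            break;
        if(ftell(file) >= end)
            break;
    }
    fclose(file);
    printf("THREAD %4lu done\n", GetCurrentThreadId());
    return 0;
}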
///////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////
int main(int argc, char* argv[])
{
    if(argc <= 1)
    {
        printf("Usage: <InputFile> <threadCount>\n");
        exit(-1);
    }
    if(argc > 2)
        threadCount = atoi(argv[2]);
    // WaitForMultipleObjects accepts at most MAXIMUM_WAIT_OBJECTS handles.
    if(threadCount < 1 || threadCount > MAXIMUM_WAIT_OBJECTS)
    {
        printf("threadCount must be between 1 and %d\n", MAXIMUM_WAIT_OBJECTS);
        exit(-1);
    }
    fileName = argv[1];
    // Open the file once up front to determine its size.
    FILE *file = fopen(fileName, "rt");
    if(! file)
    {
        printf("Unable to open %s\n", argv[1]);
        exit(-1);
    }
    fseek(file, 0, SEEK_END);
    fileSize = ftell(file);
    fclose(file);
    printf("Starting to read file %s with %d threads\n", fileName, threadCount);
    ///////////////////////////////////////////////////////////////////////////
    // Start threads
    ///////////////////////////////////////////////////////////////////////////
    double start = GetSecs();
    HANDLE mWorkThread[MAXIMUM_WAIT_OBJECTS];
    for(int i = 0; i < threadCount; i++)
    {
        mWorkThread[i] = CreateThread(
            NULL,
            0,
            FileReadThreadEntry,
            (LPVOID)(INT_PTR) i,
            0,
            NULL);
    }
    WaitForMultipleObjects(threadCount, mWorkThread, TRUE, INFINITE);
    // GetSecs() returns milliseconds, hence the division by 1000.
    printf("Runtime %.2f Secs\nDone\n", (GetSecs() - start) / 1000.);
    for(int i = 0; i < threadCount; i++)
        CloseHandle(mWorkThread[i]);
    return 0;
}
///////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////
// Despite its name, this returns the current local time in milliseconds.
double GetSecs(void)
{
    struct timeb timebuffer;
    ftime(&timebuffer);
    return (double)timebuffer.millitm +
           ((double)timebuffer.time * 1000.) - // Timezone needed for DbfGetToday
           ((double)timebuffer.timezone * 60. * 1000.);
}
Oops, correction: RandomAccessFile is not synchronised, and sharing an instance between threads is not entirely safe. (I originally wrote that it is synchronised, so that sharing an instance would leave only one thread running at once anyway.) You will, as ever, need to be careful when you have multiple threads accessing the same mutable data structure, particularly when the vagaries of operating systems are involved.
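To make the danger concrete: seek() followed by read() is two separate steps, so another thread can move the file pointer in between. One possible way around that (a sketch only; the wrapper class SharedFile and its method names are invented here) is to make each seek-plus-transfer pair atomic:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical wrapper that serialises access to one shared RandomAccessFile.
class SharedFile {
    private final RandomAccessFile file;

    SharedFile(RandomAccessFile file) {
        this.file = file;
    }

    /** Reads length bytes starting at offset; seek and read happen as one atomic step. */
    byte[] readAt(long offset, int length) throws IOException {
        byte[] data = new byte[length];
        synchronized (file) {
            file.seek(offset);
            file.readFully(data);
        }
        return data;
    }

    /** Writes data starting at offset; seek and write happen as one atomic step. */
    void writeAt(long offset, byte[] data) throws IOException {
        synchronized (file) {
            file.seek(offset);
            file.write(data);
        }
    }
}
```

The price is that only one thread touches the file at a time. If you want real concurrency, give each thread its own RandomAccessFile instance or use positional reads on a FileChannel.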
Small operations on RandomAccessFile are hideously slow.
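RandomAccessFile does no buffering of its own, so many tiny calls mean many trips to the OS. A common workaround, sketched here with an invented helper class BulkRead, is to fetch a whole block with one readFully() and decode it in memory:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;

// Hypothetical helper: one bulk read instead of many small ones.
class BulkRead {

    /** Reads count big-endian ints starting at offset with a single read call. */
    static int[] readInts(RandomAccessFile file, long offset, int count) throws IOException {
        byte[] raw = new byte[count * Integer.BYTES];
        file.seek(offset);
        file.readFully(raw); // one bulk read instead of count separate readInt() calls
        // ByteBuffer is big-endian by default, matching readInt()/writeInt().
        ByteBuffer buffer = ByteBuffer.wrap(raw);
        int[] values = new int[count];
        for (int i = 0; i < count; i++) {
            values[i] = buffer.getInt();
        }
        return values;
    }
}
```

How much this helps depends on the JDK version and record sizes, so treat it as something to benchmark, not a guarantee.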
For maximum performance, you are probably better off going straight for java.nio, although I would suggest getting something working before getting it to work fast. OTOH, keep performance in mind.
A fairly common question. Basically, using multiple threads will not make your hard drive go any faster. Instead, performing concurrent requests can make it slower.
Disk subsystems, esp. IDE, EIDE and SATA, are designed to be fastest at sequential reads/writes.