I am working on a personal database project. My input file comes from: http://ir.dcs.gla.ac.uk/resources/test_collections/cran/
After splitting it into 1400 separate files (named 00001.txt, ..., 01400.txt) and applying stemming to them, I store them in a specific folder, let's call it StemmedFolder, in the following format:
in StemmedFolder: 00001.txt includes:
investig
aerodynam
wing
slipstream
brenckman
experiment
investig
aerodynam
wing
in StemmedFolder: 00756.txt includes:
remark
eddi
viscos
compress
mix
flow
lu
ting
And so on....
I wrote code that does the following:
- reads the StemmedFolder and counts the unique words
- sorts them alphabetically
- adds the ID of the document
- saves each result to a new file, 00001.txt to 01400.txt, as described below
(I can provide my code for these 4 steps if anybody needs to see the implementation or wants to suggest changes.)
The output for each document goes to a separate file (1400 files, named 00001.txt, 00002.txt, ...) in a specific folder, let's call it FrequenceyFolder, in the following format:
in FrequenceyFolder: 00001.txt includes:
00001,aerodynam,2
00001,agre,3
00001,angl,1
00001,attack,7
00001,basi,4
....
in FrequenceyFolder: 00999.txt includes:
00999,aerodynam,5
00999,evalu,1
00999,lift,3
00999,ratio,2
00999,result,9
....
in FrequenceyFolder: 01400.txt includes:
01400,subtract,1
01400,support,1
01400,theoret,1
01400,theori,1
01400,.....
______________
Now my question:
I need to combine these 1400 files again into a single txt file in the following format, which requires some calculation:
'aerodynam' totalFrequency=3docs: [[Doc_00001,5],[Doc_01344,4],[Doc_00123,3]]
'book' totalFrequency=2docs: [[Doc_00562,6],[Doc_01111,1]]
....
....
'result' totalFrequency=1doc: [[Doc_00010,5]]
....
....
'zzzz' totalFrequency=1doc: [[Doc_01235,1]]
Thanks for spending time reading this long post
You can use a Map of List:
Map<String, List<FileInformation>> statistics = new HashMap<>();
In the above map, the key will be the word and the value will be a List<FileInformation> object describing the statistics of the individual files containing the word. The FileInformation class can be declared as follows:
class FileInformation {
    int occurrenceCount;
    String fileName;
    // getters and setters
}
To populate the above Map, use the following steps:
1. Read each file in the FrequencyFolder.
2. When you come across a word for the first time, put it as a key in the Map.
3. Create a FileInformation object, set the occurrenceCount to the number of occurrences found, and set the fileName to the name of the file it was found in. Add this object to the List<FileInformation> corresponding to the key created in step 2.
4. The next time you come across the same word in another file, create a new FileInformation object and add it to the List<FileInformation> corresponding to the entry in the map for the word.
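The steps above could be sketched roughly as follows. This is only an illustration under a few assumptions: the class name FrequencyIndexBuilder, the helper methods addLine and build, and the constructor on FileInformation are made up for this example, every line in the frequency files is assumed to have the form docId,word,count, and the document ID from the line is stored as the fileName. Lines that do not match that form are simply skipped.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FrequencyIndexBuilder {

    // Same fields as the FileInformation class above; the constructor is added for brevity.
    static class FileInformation {
        final String fileName;
        final int occurrenceCount;

        FileInformation(String fileName, int occurrenceCount) {
            this.fileName = fileName;
            this.occurrenceCount = occurrenceCount;
        }
    }

    // Parses one "docId,word,count" line and records it in the map.
    // Assumption: anything that is not three comma-separated fields with a numeric count is ignored.
    static void addLine(Map<String, List<FileInformation>> statistics, String line) {
        String[] parts = line.trim().split(",");
        if (parts.length != 3) {
            return;
        }
        int count;
        try {
            count = Integer.parseInt(parts[2].trim());
        } catch (NumberFormatException e) {
            return;
        }
        // computeIfAbsent creates the list the first time a word is seen (step 2)
        // and returns the existing list on later sightings (step 4).
        statistics.computeIfAbsent(parts[1], k -> new ArrayList<>())
                  .add(new FileInformation(parts[0], count));
    }

    // Step 1: read every .txt file in the given folder and feed its lines to addLine.
    static Map<String, List<FileInformation>> build(Path folder) throws IOException {
        Map<String, List<FileInformation>> statistics = new HashMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(folder, "*.txt")) {
            for (Path file : files) {
                for (String line : Files.readAllLines(file)) {
                    addLine(statistics, line);
                }
            }
        }
        return statistics;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java FrequencyIndexBuilder <frequency folder>");
            return;
        }
        Map<String, List<FileInformation>> statistics = build(Paths.get(args[0]));
        System.out.println(statistics.size() + " distinct words");
    }
}
```

Using computeIfAbsent means the "first time" and "next time" cases do not need separate branches; an explicit containsKey check followed by put would work just as well.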
Once you have the Map populated, printing the statistics should be a piece of cake.
for (String word : statistics.keySet()) {
    List<FileInformation> fileInfos = statistics.get(word);
    for (FileInformation fileInfo : fileInfos) {
        // sum up the occurrenceCount for the word to get the total frequency
    }
}
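As a rough sketch of the printing step, the following shows one way to produce lines in the format asked for in the question. Several things here are assumptions rather than anything stated above: the class name FrequencyReport and the helpers totalOccurrences and formatEntry are invented for the example, the number after totalFrequency= is taken to be the number of documents (as the question's sample seems to show), documents are listed by descending count, and a TreeMap is used so the words come out alphabetically. The sum of the counts, as suggested by the comment in the loop above, is computed by a separate helper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FrequencyReport {

    // Same fields as the FileInformation class above; the constructor is added for brevity.
    static class FileInformation {
        final String fileName;
        final int occurrenceCount;

        FileInformation(String fileName, int occurrenceCount) {
            this.fileName = fileName;
            this.occurrenceCount = occurrenceCount;
        }
    }

    // Sums occurrenceCount over all files that contain the word.
    static int totalOccurrences(List<FileInformation> fileInfos) {
        int total = 0;
        for (FileInformation fileInfo : fileInfos) {
            total += fileInfo.occurrenceCount;
        }
        return total;
    }

    // Builds one output line; assumes totalFrequency means the number of documents.
    static String formatEntry(String word, List<FileInformation> fileInfos) {
        // Copy before sorting so the caller's list is left untouched.
        List<FileInformation> sorted = new ArrayList<>(fileInfos);
        sorted.sort(Comparator.comparingInt((FileInformation f) -> f.occurrenceCount).reversed());

        StringBuilder sb = new StringBuilder();
        sb.append('\'').append(word).append("' totalFrequency=").append(sorted.size())
          .append(sorted.size() == 1 ? "doc: [" : "docs: [");
        for (int i = 0; i < sorted.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append("[Doc_").append(sorted.get(i).fileName).append(',')
              .append(sorted.get(i).occurrenceCount).append(']');
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        // TreeMap keeps the words in alphabetical order; the data is made up for illustration.
        Map<String, List<FileInformation>> statistics = new TreeMap<>();
        statistics.put("result", Arrays.asList(new FileInformation("00010", 5)));
        statistics.put("book", Arrays.asList(
                new FileInformation("01111", 1),
                new FileInformation("00562", 6)));

        for (Map.Entry<String, List<FileInformation>> entry : statistics.entrySet()) {
            System.out.println(formatEntry(entry.getKey(), entry.getValue()));
        }
    }
}
```

If the intended totalFrequency is the summed count rather than the number of documents, replacing sorted.size() in the header with totalOccurrences(sorted) would be the only change needed.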
Source: https://stackoverflow.com/questions/30523883/hashmap-single-key-holding-a-class-count-the-key-and-retrieve-counter