I have 400 files, each containing about 500,000 characters, and those characters are drawn from only about 20 letters. I want to make a histogram indicating the most 10 le
Note: This answers the original version of the question (the data consists of 10 letters only; a histogram is wanted). The question was completely changed (the data consists of about 20 letters, and a histogram of the 10 most used letters is wanted).
If the ten letters are arbitrary and not known in advance, you can't use hist(..., 10). Consider the following example with three arbitrary "letters":
h = hist([1 2 2 10], 3);
The result is not [1 2 1], as you might expect. The problem is that hist chooses equal-width bins spanning the range of the data, rather than one bin per distinct value.
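To make the effect concrete, here is a small sketch of what that call returns. The variable names x and centers are only illustrative.

```matlab
x = [1 2 2 10];              % three distinct "letters": 1, 2 and 10
[h, centers] = hist(x, 3);   % second output holds the bin centers

% The range 1..10 is split into three bins of width 3,
% so 1 and 2 share the first bin and the middle bin stays empty.
disp(h)        % 3 0 1
disp(centers)  % 2.5 5.5 8.5
```

Because the bins depend only on the minimum and maximum of the data, distinct letters that happen to lie close together are merged.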
Here are three approaches to do what you want:
You can find the letters with unique and then do the sum with bsxfun:
letters = unique(part(:)).'; %'// these are the letters in your file
h = sum(bsxfun(@eq, part(:), letters)); %// count occurrences of each letter
The second line of the above approach could be replaced by histc specifying the bin edges:
letters = unique(part(:)).';
h = histc(part(:).', letters); %// one count per letter; part(:).' also handles matrix input
Or you could use sparse to do the accumulation:
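t = sparse(1, part, 1); %// repeated indices are summed; part must hold positive integers (use double(part) for char data)
[~, letters, h] = find(t); %// nonzero positions are the letters, their values are the counts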
As an example, for part = [1 2 2 10], any of the three approaches gives the expected result:
letters =
1 2 10
h =
1 2 1
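For the changed version of the question (about 20 letters, with only the 10 most used wanted), one possible extension is to sort the counts and keep the largest ones. This is a minimal sketch, not part of the original answer: the sample string and the names k, order, sortedCounts, topLetters and topCounts are made up for illustration.

```matlab
% Hypothetical sample data; in practice part would hold the file contents.
part = 'CABCCBDCDD';

letters = unique(part(:)).';                    % distinct letters, sorted
h = sum(bsxfun(@eq, part(:), letters), 1);      % count occurrences of each letter

[sortedCounts, order] = sort(h, 'descend');     % most frequent first
k = min(10, numel(letters));                    % keep at most 10 letters
topLetters = letters(order(1:k));
topCounts  = sortedCounts(1:k);

disp(topLetters)  % CDBA
disp(topCounts)   % 4 3 2 1
```

The selected counts could then be drawn with bar(topCounts), using topLetters as the tick labels.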