问题
I'm using a similarity matrix with values between 0 and 1 (1 means that the elements are equals), and I'm trying to plot a MDS with python and scikit-learn.
I found multiple examples, but I'm not sure about what to give as an input to mds.fit().
For now, my data looks like that (file.csv) :
; A ; B ; C ; D ; E
A ; 1 ; 0.1 ; 0.2 ; 0.5 ; 0.2
B ; 0.1 ; 1 ; 0.3 ; 1 ; 0
C ; 0.2 ; 0.3 ; 1 ; 0.8 ; 0.6
D ; 0.5 ; 1 ; 0.8 ; 1 ; 0.2
E ; 0.2 ; 0 ; 0.6 ; 0.2 ; 1
I'm currently using this code :
import pandas
from sklearn import manifold
import matplotlib.pyplot as plt
data = pandas.read_table("file.csv", ";", header=0, index_col=0)
mds = manifold.MDS(n_components=2, random_state=1, dissimilarity="precomputed")
mds.fit(data)
points = mds.embedding_
# Prepare axes
ax = plt.axes([0,0,2,2])
ax.set_aspect(aspect='equal')
# Plot points
plt.scatter(points[:,0], points[:,1], color='silver', s=150)
# Add labels
for i in range(data.shape[0]):
ax.annotate(data.index[i], (points[i,0], points[i,1]), color='blue')
#plt.show() # Open display and show at screen
plt.savefig('out.png', format='png', bbox_inches='tight') # PNG
#plt.savefig('out.jpg', format='jpg', bbox_inches='tight') # JPG
I'm unsure about what sklearn is doing. I read a lot of examples where people are using "dissimilarity matrix" with 0 in the middles (instead of 1).
Should I make a transformation ? Or not ? If yes, which transformation should be done ? (I read there that a simple substraction is enough... but other methods exist... I'm a bit lost :( )
Does sklearn and MDS automatically understand the input ? (as a similarity or dissimilarity matrix with the 0 or 1 in the middle ?) Or does it use a distance matrix ? (In this case, how to obtain it from a similarity matrix ?)
In this link, they say the similarity are between 1 and -1... I'm using similarities between 0 and 1... I suppose I should transform my data? Which transformation should be used?
回答1:
I did a comparison with XLSTAT (an excel extension) in order to try a lot of scenarios and compare how to do what.
First : my input matrix was a "similarity" matrix, because I could interpret it as: "A and A are 100% equal". As MDS is taking a dissimilarity matrix as input, I must apply a transformation.
- In the litterature Ricco Rakotomalala's french course on data science (p 208-209), the easy way is to substract the maximum value to each cell (make a "1 - cell" operation). So you can easily make a python program, or (as I keep a trace of each matrix) an AWK pre-processing program :
similarity-to-dissimilarity-simple.awk
# We keep the tags around the CSV matrix
# X ; Word1 ; Word2 ; ...
# Header
NR == 1 {
# First column is just "X" (or space)
printf("%s", "X");
# For each column, print the word
for (i = 2; i <= NF; i++)
{
col = $i;
printf("%s%s", OFS, col);
}
# End of line
printf("\n");
}
# Other lines are processed
# WordN ; 1 ; 0.5 ; 0.2 ; ...
NR != 1 {
# First column is the word/tag
col = $1;
printf("%s", col);
# For each column, process the number
for (i = 2; i <= NF; i++)
{
# dissimilarity = (1 - similarity)
NUM = $i;
VAL = 1 - NUM;
printf("%s%s", OFS, VAL);
}
printf("\n");
}
It can be called using the command :
awk -F ";" -v OFS=";" -f similarity-to-dissimilarity-simple.awk input.csv > output-simple.csv
- A more complex way of calculating (I can't find back the reference, sorry :( ) is based on another transformation on each cell :
This method seems to be perfectly adapted if the diagonal does not contain the same value (I saw there a co-occurrence matrix... it should apply to his cas). In my case, as the diagonal is ALWAYS full of 1, I reduced it to :
The AWK program to make this transformation (I implemented the simplified one, because of my data) is therefore :
similarity-to-dissimilarity-complex.awk
# Header
# X ; Word1 ; Word2 ; ...
NR == 1 {
# First column is just "X" (or space)
printf("%s", "X");
# For each column, print the word
for (i = 2; i <= NF; i++)
{
col = $i;
printf("%s%s", OFS, col);
}
# End of line
printf("\n");
}
# Other lines are processed
# WordN ; 1 ; 0.5 ; 0.2 ; ...
NR != 1 {
# First column is the word
col = $1;
printf("%s", col);
# For each column, process the number
for (i = 2; i <= NF; i++)
{
# dissimilarity = (2 - 2 * similarity)^-1/2
NUM = $i;
VAL = sqrt(2 - 2 * NUM);
printf("%s%s", OFS, VAL);
}
printf("\n");
}
And you can call it with this command :
awk -F ";" -v OFS=";" -f similarity-to-dissimilarity-complex.awk input.csv > output-complex.csv
When I used the Kruskal's stress to check which version was better... in my case, the simple similarity to dissimilarity (1 - cell) was the best (I kept a stress between 0,34 and 0,32... which is not good... where the complex shows bigger values than 0,34, which is worse).
来源:https://stackoverflow.com/questions/64235182/how-to-plot-a-mds-from-a-similarity-matrix