lda

Kaldi-dnn Study Notes 01

非 Y 不嫁゛ posted on 2019-12-03 04:11:51
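1. Kaldi has four DNN implementations: a. nnet1 - Karel's implementation; simple, supports only a single GPU, maintained by Karel. b. nnet2 - Daniel Povey's p-norm-based implementation; flexible, supports multiple GPUs and CPUs, maintained by Daniel. c. nnet3 - an improved nnet2, maintained by Daniel. d. (nnet3 + chain) - Daniel Povey's improvement on nnet3; it allows real-time decoding, at 3 to 5 times the decoding speed of nnet3. At present, minibatch Stochastic Gradient Descent gives the best results for DNN gradient descent: an average gradient is estimated from a small sample of τ examples, and that small sample is called a minibatch. 2. Starting with nnet2: a. The top-level nnet2 training script is steps/nnet2/train_pnorm_fast.sh, which parallelizes training across multiple compute nodes. b. Input features to the neural network: the input features are configurable, typically MFCC+LDA+MLLT+fMLLR, 40-dimensional features; what the network sees is a window of 7 frames (the center frame plus 3 frames on each side). Because neural networks have difficulty learning from correlated input data, the 40*7-dimensional features are put through a fixed decorrelating transform, via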
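The 7-frame window described in the snippet can be illustrated with a small sketch. This is not Kaldi's code: the function name `splice_frames` and the choice to repeat the first and last frame at utterance edges are assumptions made for illustration. It only shows how 40-dimensional frames with 3 frames of context on each side become 280-dimensional inputs.

```python
def splice_frames(frames, left=3, right=3):
    """Concatenate each frame with its neighbours into one long vector."""
    n = len(frames)
    spliced = []
    for t in range(n):
        window = []
        for offset in range(-left, right + 1):
            # Assumed edge handling: clamp to the first/last frame.
            idx = min(max(t + offset, 0), n - 1)
            window.extend(frames[idx])
        spliced.append(window)
    return spliced


# Toy utterance: 10 frames of 40-dimensional features (values are made up).
frames = [[float(t)] * 40 for t in range(10)]
spliced = splice_frames(frames)
print(len(spliced), len(spliced[0]))  # 10 280
```

Each output row holds 7 * 40 = 280 values, with the center frame occupying positions 120 to 159.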

The Cold-Start Problem in Recommender Systems

风流意气都作罢 posted on 2019-12-03 04:09:33
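The Cold-Start Problem in Recommender Systems. Reposted from http://blog.csdn.net/zhangjunjie789/article/details/51379127 The cold-start problem is how to design a personalized recommender system without a large amount of user data, so that users are satisfied with the recommendations and willing to use the system. Cold-start problems fall into three main categories: (1) User cold start: how to make personalized recommendations for new users; when a new user first uses the site, the system has no behavioral data for them. (2) Item cold start: how to recommend a new item to the users who may be interested in it. (3) System cold start: how to design a personalized recommender system for a newly built site, when the site has few users and little user behavior, only some item information. Main solutions to cold start: (1) Provide non-personalized recommendations, such as a popularity ranking, and switch to personalized recommendations once enough user data has been collected. User registration information comes in three kinds: 1) demographic information: age, gender, occupation, ethnicity, education, place of residence, and so on; a typical example is Lifestyle Finder, developed by Bruce Krulwich; 2) descriptions of user interests: some sites ask users to fill these in; 3) off-site user behavior data imported from other sites. Two recommender-system datasets include demographic information: the BookCrossing dataset and the Lastfm dataset. The more demographic features are used, the more accurately user interests can be predicted. (2) Use user registration information, such as gender and age, for coarse-grained personalization. The personalized recommendation process based on registration information: 1)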
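The first solution in the snippet (serve a popularity ranking until enough user data exists, then switch) can be sketched as follows. The threshold `MIN_EVENTS`, the function names, and the placeholder `by_history` recommender are all hypothetical; the source does not specify how much data counts as enough.

```python
from collections import Counter

MIN_EVENTS = 5  # hypothetical threshold for "enough" user data


def popular_items(interactions, k):
    """Non-personalized fallback: the k most frequently seen items."""
    counts = Counter(item for _, item in interactions)
    return [item for item, _ in counts.most_common(k)]


def recommend(user, interactions, personalized, k=3):
    """Use the popularity ranking until the user has enough history."""
    history = [item for u, item in interactions if u == user]
    if len(history) < MIN_EVENTS:
        return popular_items(interactions, k)
    return personalized(user, history, k)


def by_history(user, history, k):
    """Placeholder for a real personalized model (assumption)."""
    return history[:k]


# Made-up (user, item) interaction log.
interactions = [("u1", "a"), ("u2", "a"), ("u3", "a"),
                ("u1", "b"), ("u2", "b"), ("u1", "c")]
print(recommend("new_user", interactions, by_history, k=2))  # ['a', 'b']
```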

Kaldi-dnn Study Notes

坚强是说给别人听的谎言 posted on 2019-12-03 04:07:36
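1. Kaldi has four DNN implementations: a. nnet1 - Karel's implementation; simple, supports only a single GPU, maintained by Karel. b. nnet2 - Daniel Povey's p-norm-based implementation; flexible, supports multiple GPUs and CPUs, maintained by Daniel. c. nnet3 - an improved nnet2, maintained by Daniel. d. (nnet3 + chain) - Daniel Povey's improvement on nnet3; it allows real-time decoding, at 3 to 5 times the decoding speed of nnet3. At present, minibatch Stochastic Gradient Descent gives the best results for DNN gradient descent: an average gradient is estimated from a small sample of τ examples, and that small sample is called a minibatch. 2. Starting with nnet2: a. The top-level nnet2 training script is steps/nnet2/train_pnorm_fast.sh, which parallelizes training across multiple compute nodes. b. Input features to the neural network: the input features are configurable, typically MFCC+LDA+MLLT+fMLLR, 40-dimensional features; what the network sees is a window of 7 frames (the center frame plus 3 frames on each side). Because neural networks have difficulty learning from correlated input data, the 40*7-dimensional features are put through a fixed decorrelating transform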

Simple Python implementation of collaborative topic modeling?

北城余情 posted on 2019-12-03 03:34:19
Question: I came across these 2 papers, which combine collaborative filtering (matrix factorization) and topic modelling (LDA) to recommend similar articles/posts to users based on the topic terms of the posts/articles they are interested in. The papers (in PDF) are "Collaborative Topic Modeling for Recommending Scientific Articles" and "Collaborative Topic Modeling for Recommending GitHub Repositories". The new algorithm is called collaborative topic regression. I was hoping to find some python code

Spark MLlib LDA, how to infer the topics distribution of a new unseen document?

Anonymous (unverified) posted on 2019-12-03 03:04:01
Question: I am interested in applying LDA topic modelling using Spark MLlib. I have checked the code and the explanations in here, but I couldn't find how to then use the model to find the topic distribution of a new, unseen document. Answer 1: As of Spark 1.5 this functionality has not been implemented for the DistributedLDAModel. What you need to do is convert your model to a LocalLDAModel using the toLocal method and then call the topicDistributions(documents: RDD[(Long, Vector)]) method, where documents are the new (i.e. out-of-training)

LDA ignoring n_components?

Anonymous (unverified) posted on 2019-12-03 02:20:02
Question: When I try to work with LDA from Scikit-Learn, it only ever gives me one component, even though I am asking for more: >>> from sklearn.lda import LDA >>> x = np.random.randn(5,5) >>> y = [True, False, True, False, True] >>> for i in range(1,6): ... lda = LDA(n_components=i) ... model = lda.fit(x,y) ... model.transform(x) Gives /Users/orthogonal/virtualenvs/osxml/lib/python2.7/site-packages/sklearn/lda.py:161: UserWarning: Variables are collinear warnings.warn("Variables are collinear") array([[-0.12635305], [-1.09293574], [ 1
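A likely explanation, offered here as background rather than as the accepted answer to this question: linear discriminant analysis can produce at most min(n_classes - 1, n_features) discriminant directions, which is the bound scikit-learn documents for `n_components`. The helper below is a hypothetical illustration in plain Python, not a scikit-learn function; it applies that bound to the labels from the question.

```python
def max_lda_components(n_features, labels):
    """Upper bound on discriminant components: min(n_classes - 1, n_features)."""
    n_classes = len(set(labels))
    return min(n_features, n_classes - 1)


# Labels from the question: only two distinct classes (True and False).
y = [True, False, True, False, True]
print(max_lda_components(5, y))  # 1
```

With two classes the bound is 1 regardless of how many features or requested components there are, which matches the single-column output shown in the question.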

how to determine the number of topics for LDA?

走远了吗. posted on 2019-12-03 02:17:22
I am new to LDA and I want to use it in my work. However, I have run into some problems. To get the best performance, I want to estimate the best number of topics. After reading "Finding Scientific Topics", I know that I can first calculate log P(w|z) and then use the harmonic mean of a series of P(w|z) to estimate P(w|T). My question is: what does "a series of" mean? Unfortunately, there is no hard science yielding the correct answer to your question. To the best of my knowledge, the hierarchical Dirichlet process (HDP) is quite possibly the best way to arrive at the optimal number of
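One common reading of "a series of" is a set of values of P(w|z) evaluated at several topic assignments z drawn by the sampler, for example one per retained Gibbs iteration. Under that assumption, the harmonic-mean estimate can be computed in log space to avoid underflow. The sketch below is illustrative only; the function name and the sample values are made up.

```python
import math


def log_harmonic_mean(log_likelihoods):
    """Return log of the harmonic mean of exp(ll) over the given samples.

    harmonic mean = M / sum(1 / p_m), so
    log(harmonic mean) = log(M) - logsumexp(-ll_m).
    """
    m = len(log_likelihoods)
    neg = [-ll for ll in log_likelihoods]
    top = max(neg)  # shift by the maximum for numerical stability
    log_sum = top + math.log(sum(math.exp(v - top) for v in neg))
    return math.log(m) - log_sum


# Hypothetical log P(w|z) values from four sampler states.
samples = [-5012.3, -5008.7, -5010.1, -5009.4]
print(log_harmonic_mean(samples))
```

Repeating this for each candidate number of topics T gives one estimate of log P(w|T) per T to compare. The harmonic mean is dominated by the smallest sampled likelihoods, so the estimate can be unstable with few samples.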

Understanding LAPACK calls in C++ with a simple example

Anonymous (unverified) posted on 2019-12-03 02:14:01
Question: I am a beginner with LAPACK and C++/Fortran interfacing. I need to solve linear equations and eigenvalue problems using LAPACK/BLAS on Mac OS X Lion. OS X Lion provides optimized BLAS and LAPACK libraries (in /usr/lib), and I am linking against these libraries instead of downloading them from netlib. My program (reproduced below) compiles and runs fine, but it gives me wrong answers. I have searched the web and Stack Overflow, and the issue may have to do with how C++ and Fortran store arrays in different formats (row major vs Column
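The storage-order difference mentioned in the question can be shown with a small language-neutral sketch, written in Python here for brevity. It flattens the same 2x3 matrix the way a C/C++ 2-D array is laid out (row by row) and the way Fortran, and therefore LAPACK, expects it (column by column). The function names are illustrative.

```python
def to_row_major(matrix):
    """Flatten row by row, as a C/C++ 2-D array is laid out in memory."""
    return [x for row in matrix for x in row]


def to_column_major(matrix):
    """Flatten column by column, as Fortran (and LAPACK) expects."""
    rows, cols = len(matrix), len(matrix[0])
    return [matrix[i][j] for j in range(cols) for i in range(rows)]


a = [[1, 2, 3],
     [4, 5, 6]]
print(to_row_major(a))     # [1, 2, 3, 4, 5, 6]
print(to_column_major(a))  # [1, 4, 2, 5, 3, 6]
```

Passing the row-major buffer to a routine that reads it as column-major makes the routine see the transpose of the intended matrix, which is one way to get plausible-looking but wrong results.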

Run cvb in mahout 0.8

≡放荡痞女 posted on 2019-12-03 01:48:23
The current Mahout 0.8-SNAPSHOT includes a Collapsed Variational Bayes (cvb) implementation for topic modeling and has removed the Latent Dirichlet Allocation (lda) approach, because cvb parallelizes much better. Unfortunately, the only documentation on how to run an example and generate meaningful output is for lda. Thus, I want to: (1) preprocess some texts correctly; (2) run the cvb0_local version of cvb; (3) inspect the results by looking at the top n words in each of the generated topics. So here are the Mahout commands I had to call in a Linux shell to do it. $MAHOUT_HOME points to my mahout/bin

python mallet LDA FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\abc\\AppData\\Local\\Temp\\d33563_state.mallet.gz'

Anonymous (unverified) posted on 2019-12-03 01:10:02
Question: This is my first time using mallet LDA. Basically, I downloaded the mallet-2.0.8 zip file and the JDK. I installed the JDK and extracted mallet-2.0.8 to a destination folder. I set MALLET_HOME. Here is my code: mallet_path='C:/Users/abc/mallet-2.0.8/bin/mallet' ldamallet=gensim.models.wrappers.LdaMallet(mallet_path,corpus=corpus,num_topics=20,id2word=id2word) However, it gives the error: FileNotFoundError: [Errno 2]. I tried mallet_path='C:\\Users\\abc\\mallet-2.0.8\\bin\\mallet' and mallet_path=r'C:\Users\abc\mallet-2.0.8\bin\mallet' I got the same
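A small check, using only the standard library, of why the three path spellings in the question behave identically: under Windows path rules they all normalize to the same location. This only rules out the spelling of the path as the cause; it does not identify the actual source of the error, which the snippet does not state.

```python
import ntpath  # Windows path rules, importable on any platform

candidates = [
    'C:/Users/abc/mallet-2.0.8/bin/mallet',
    'C:\\Users\\abc\\mallet-2.0.8\\bin\\mallet',
    r'C:\Users\abc\mallet-2.0.8\bin\mallet',
]

# normpath converts forward slashes to backslashes and collapses separators.
normalized = {ntpath.normpath(p) for p in candidates}
print(len(normalized))  # 1
```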