fasttext | 易学教程

How did I use AI to translate "wash your hands" into 500 languages?

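By combining human and machine-generated translations, key health phrases can be translated into local languages around the world. You may not know that 7,117 languages are currently spoken in the world: not dialects, but living languages! Yet much of the world's digital media is available in only a few dozen languages, and translation platforms such as Google Translate support only about 100. As a result, billions of people worldwide are marginalized because they lack timely access to information. The current coronavirus (COVID-19) pandemic has made this painfully clear, highlighting the need to translate health-related phrases (such as "wash your hands" or "keep your distance") immediately and rapidly into less widely used languages. To that end, I applied state-of-the-art AI techniques to construct phrases close to "wash your hands" in 544 languages and counting (my GPU is still running). Multilingual Unsupervised and Supervised Embeddings (MUSE) was used to train cross-lingual word embeddings between each of these 544 languages and English. These embeddings then make it possible to extract phrases similar to the target phrase from existing documents. I did this work together with colleagues at SIL International, who gathered further human translations of the phrase. The combination of these human translations and some of my machine translations can be searched on this Ethnologue guide page (machine-generated phrases are marked with a small robot icon), and more translations will be added as they are generated or collected. Leveraging existing corpora SIL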

Fasttext (Bag of Tricks for Efficient Text Classification) reading notes

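Paper: Bag of Tricks for Efficient Text Classification. Venue: EACL 2017. Notes by Hytn Chen, updated 2020-02-23. Related text classification methods: several convolutional neural network pipelines are used for text classification, including 1-D convolution, stacked convolution (Goldberg Book), dilated CNN (Kalchbrenner et al. 2016), and dynamic CNN; see this article for a detailed walkthrough. In short, the main role of a CNN in text classification is to encode the text, after which a classifier handles the classification itself. The main drawback of CNNs is that they train relatively slowly, which limits their use on very large datasets. The model proposed in the paper is, put simply, a text representation plus a linear model. The text representation combines n-grams, a word lookup table, and CBOW; the linear model combines a hierarchical softmax with a rank constraint (which achieves parameter sharing). Input layer: start with the n-gram representation, which is easy to understand. Suppose a sentence has N words. 1-grams take each single word as a group, giving N groups. 2-grams take two words as a group; counting every pair of words gives $\frac{N(N-1)}{2}$ groups (restricted to adjacent words, as n-grams usually are, there are N − 1). With larger n, the dictionary (which must index each specific group, one id per group) grows exponentially, so the paper uses a hashed dictionary to avoid this (all the n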
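
The hashed n-gram dictionary mentioned in these notes can be illustrated with a minimal sketch. It assumes contiguous word n-grams and uses `zlib.crc32` as a stand-in hash; the function names and the bucket count are hypothetical, and this is not fastText's own hash function.

```python
import zlib


def word_ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def hashed_ngram_ids(tokens, n, num_buckets):
    """Map each n-gram to a bucket index in [0, num_buckets).

    Distinct n-grams may collide in the same bucket; that is the price
    paid for keeping the lookup table at a fixed size.
    """
    ids = []
    for gram in word_ngrams(tokens, n):
        key = " ".join(gram).encode("utf-8")
        ids.append(zlib.crc32(key) % num_buckets)  # stand-in hash (assumption)
    return ids


tokens = "please wash your hands".split()
bigrams = word_ngrams(tokens, 2)
bucket_ids = hashed_ngram_ids(tokens, 2, num_buckets=1000)
print(bigrams)
print(bucket_ids)
```

Whatever the vocabulary size, the table indexed by `bucket_ids` never needs more than `num_buckets` rows, which is the point of hashing instead of enumerating every n-gram.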

OCR: links to open-source projects

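https://github.com/Raymondhhh90/idcardocr: web deployment; recognizes second-generation resident ID card information; somewhat slow, pending optimization
https://github.com/wzb19960208/idCardRecognition: ID card recognition
https://github.com/rmtheis/android-ocr: Tesseract-based ID card recognition
https://github.com/developer79433/passport_mrz_detector_cpp: passport recognition
https://github.com/evilgix/Evil: optical recognition of bank cards, ID cards, and house numbers
https://github.com/YCG09/chinese_ocr: end-to-end detection and recognition of variable-length Chinese text, built on Tensorflow and Keras. Note: 3 million samples
https://github.com/AstarLight/CPS-OCR-Engine: recognition of 3,755 printed Chinese characters (the level-1 character set)
https://github.com/senlinuc/caffe_ocr: CNN+BLSTM+CTC recognition architecture
https://github.com/simplezhli/Tesseract-OCR-Scanner: automatic scanning and recognition of phone numbers, built on Tesseract-OCR
License plate recognition
https://github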

Word2Vec in practice

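Word2Vec in practice. Contents: 1 Overview of the gensim word2vec API; 2 Model training (1. Reading the data, 2. Preprocessing, 3. Training the model, 4. Testing the results); 3 Comparison with Fasttext (1 Introduction to Fasttext, 2 Training a Fasttext model, 3 Comparing the two). I had studied the principles of Word2Vec before but had never used it in a project. This time I got hold of a batch of patent data, so I tried it hands-on. Data reference: https://github.com/newzhoujian/LCASPatentClassification
1 Overview of the gensim word2vec API. In gensim, the word2vec-related API lives in the package gensim.models.word2vec, and the algorithm parameters are on the class gensim.models.word2vec.Word2Vec. The parameters to pay attention to are:
1) sentences: the corpus to analyze, which can be a list or be iterated from a file. An example of reading from a file comes later.
2) size: the dimensionality of the word vectors, default 100 (gensim 4.0 renamed this parameter to vector_size). The right value generally depends on the size of the corpus. For a small corpus, say less than 100M of text, the default is usually fine. For a very large corpus, a larger dimensionality is recommended.
3) window: the maximum distance between a word and its context words, denoted c in our algorithm-principles article. The larger the window, the more distant the words that still count as context. The default is 5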
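
To make the `window` parameter concrete, here is a minimal pure-Python sketch that lists the (center, context) pairs produced for a given maximum distance. It does not call gensim; the function name `context_pairs` is hypothetical, and real implementations may sample a smaller effective window per word, which this sketch ignores.

```python
def context_pairs(tokens, window):
    """List (center, context) pairs within `window` positions of each word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                # leftmost context position
        hi = min(len(tokens), i + window + 1)  # one past the rightmost position
        for j in range(lo, hi):
            if j != i:                         # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["a", "b", "c", "d"]
pairs_w1 = context_pairs(tokens, window=1)
pairs_w2 = context_pairs(tokens, window=2)
print(len(pairs_w1), len(pairs_w2))
```

With `window=2` the pair `("a", "c")` appears, while with `window=1` it does not, which is the sense in which a larger window lets more distant words count as context.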

Gensim: Any chance to get word frequency in Word2Vec format?

Question: I am doing my research with a fastText pre-trained model and I need word frequencies for further analysis. Do the .vec or .bin files provided on the fastText website contain word frequency information? If so, how do I get it? I am using load_word2vec_format to load the model and tried model.wv.vocab[word].count, which only gives the word's frequency rank, not the original word frequency. Answer 1: I don't believe those formats include any word frequency information. To the extent any pre-trained word
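
A small sketch may help show why the answer says no frequencies are available from the text format. In the word2vec text format, the header line gives the vocabulary size and the dimensionality, and each following line holds a word and its vector components, with no field for counts. The sample data and the `read_vec` helper below are hypothetical; the only thing recoverable per word besides the vector is its position in the file.

```python
import io

# Tiny made-up file in word2vec text format: "vocab_size dim", then one word per line.
SAMPLE_VEC = """3 2
the 0.1 0.2
of 0.3 0.4
fasttext 0.5 0.6
"""


def read_vec(stream):
    """Parse word2vec text format into (vectors, ranks)."""
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors, ranks = {}, {}
    for rank, line in enumerate(stream):
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        assert len(values) == dim    # every token after the word is a vector component
        vectors[word] = values
        ranks[word] = rank           # file position only; no count is stored
    assert len(vectors) == vocab_size
    return vectors, ranks


vectors, ranks = read_vec(io.StringIO(SAMPLE_VEC))
print(ranks)
```

If a file happens to be sorted by descending frequency, `ranks` gives a frequency ordering, but that ordering is an assumption about how the file was written and carries no actual counts.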

Several text classification models: an introduction and comparison

Text classification models. 1. fastText https://fasttext.cc/docs/en/unsupervised-tutorial.html fastText model architecture: x1, x2, …, xN−1, xN denote the n-gram vectors of a text, and the text feature is the average of these vectors. This resembles CBOW, mentioned earlier: CBOW uses the context to predict the center word, whereas here all the n-grams are used to predict the given class. The code is below; according to the author it runs only on Linux:

    #!/usr/bin/python
    # -*- coding: UTF-8 -*-
    import pandas as pd
    import random
    import fasttext
    import jieba
    from sklearn.model_selection import train_test_split
    import os

    """
    Function description: load the data
    """
    def loadData():
        # Read the data in with pandas
        df_military = pd.read_csv("./data/junshi.csv", encoding="utf-8")
        df_military = df_military.dropna()
        df_sports = pd.read_csv("./data/sports.csv",encoding =
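
The architecture described in this excerpt (averaged n-gram vectors fed to a linear classifier) can be sketched as a forward pass in plain Python. This is a toy illustration under stated assumptions: the embedding table, the weight matrix, and the function names are made up, a plain softmax stands in for whatever output layer the real model uses, and no training is shown.

```python
import math


def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(feature_ids, embeddings, weights):
    """Average the feature embeddings, apply a linear layer, then softmax."""
    hidden = average_vectors([embeddings[i] for i in feature_ids])
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    return softmax(scores)


# Hypothetical toy parameters: 3 features (words or n-grams), 2 dimensions, 2 classes.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[2.0, 0.0], [0.0, 2.0]]
probs = predict_proba([0, 2], embeddings, weights)
print(probs)
```

Because the hidden representation is just an average, word order is lost unless n-gram features are included among `feature_ids`, which is why the model adds them.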

FastText: a simple practice

fastText principles and text classification in practice: https://blog.csdn.net/feilong_csdn/article/details/88655927
Python interface: https://github.com/salestock/fastText.py

    import fasttext

    root_path = "/Users/documents/"
    train_file = root_path + "target.train"
    valid_file = root_path + "target.valid"
    model_save_path = root_path + "model_py"

    def showResult(classifier, valid_file):
        # test() returns (number of examples, precision@1, recall@1)
        result = classifier.test(valid_file)
        print("Number of examples:", result[0])
        print("P@1:", result[1])
        print("R@1:", result[2])

    # Train a supervised classifier for 15 epochs
    classifier = fasttext.train_supervised(input=train_file, epoch=15)
    #classifier.save_model(model_save_path)
    #classifier =
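
The training and validation files passed to `train_supervised` and `test` above are plain text, one example per line, with the label marked by the `__label__` prefix (fastText's default). A minimal sketch of producing such lines follows; the helper name and the sample texts are hypothetical, and tokenization (for example with jieba for Chinese) is assumed to have happened already.

```python
def to_fasttext_line(label, text):
    """Format one example as '__label__<label> <text>' on a single line."""
    cleaned = " ".join(text.split())  # collapse newlines and repeated spaces
    return "__label__{} {}".format(label, cleaned)


# Made-up examples; the second element is already-tokenized text.
samples = [
    ("sports", "the match  ended\nin a draw"),
    ("military", "naval exercise begins"),
]
lines = [to_fasttext_line(label, text) for label, text in samples]
print("\n".join(lines))
```

Writing `lines` to a file, one per line, gives input in the shape that a path such as `train_file` is expected to point to.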

How to use Hindi Model in RASA NLU?

Question: I have built my model for the Hindi language using FastText with a spaCy backend. I followed this tutorial (this URL) to build my model using FastText. I have also linked my model with spaCy using the following command: python -m spacy link nl_model hi. The model is linked successfully, as you can check in the image below. Now I can't find any help on using the Hindi language: what kind of config files do I need, where do I import the Hindi model, and how do I proceed? I also have a question like how our data

Handling C++ arrays in Cython (with numpy and pytorch)

Question: I am trying to use Cython to wrap a C++ library (fastText, if it's relevant). The C++ library classes load a very large array from disk. My wrapper instantiates a class from the C++ library to load the array, then uses Cython memory views and numpy.asarray to turn the array into a numpy array, then calls torch.from_numpy to create a tensor. The problem is how to handle deallocating the memory for the array. Right now, I get "pointer being freed was not allocated" when the program exits
