word

PHP word index, performance and reasonable results

久未见 submitted on 2019-12-06 10:25:29
Question: I'm currently working on an indexer for a search feature. The indexer will work over data from "fields". Fields look like:

Field_id  Field_type  Field_name   Field_Data
- 101     text        Name         Intel i7
- 102     integer     Cores        4 physical, 4 virtual
- 103     select      Vendor       Intel
- 104     multitext   Description  The i7 is intel's next gen range of cpus.

The indexer would generate the following results/index:

Keyword     Occurrences
- intel     101, 103, 104
- i7        101, 104
- physical  102
- virtual   102
- next      104
- gen       104
-
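The keyword table in the question is an inverted index: each keyword maps to the field IDs whose data contains it. The question itself is about PHP; the following is only a minimal Python sketch of the idea. The stop-word list, the `min_length` cutoff, and the name `build_index` are assumptions for illustration and do not come from the question.

```python
import re
from collections import defaultdict

# Field rows copied from the question: (field_id, field_data).
FIELDS = [
    (101, "Intel i7"),
    (102, "4 physical, 4 virtual"),
    (103, "Intel"),
    (104, "The i7 is intel's next gen range of cpus."),
]

# Assumed stop-word list; the question does not specify one.
STOP_WORDS = {"the", "is", "of"}


def build_index(fields, min_length=2):
    """Map each lowercased keyword to the sorted field IDs that contain it."""
    index = defaultdict(set)
    for field_id, data in fields:
        # Split on anything that is not a letter or digit.
        for token in re.findall(r"[a-z0-9]+", data.lower()):
            # Drop very short tokens (such as "4" or the "s" of "intel's")
            # and stop words.
            if len(token) >= min_length and token not in STOP_WORDS:
                index[token].add(field_id)
    return {keyword: sorted(ids) for keyword, ids in index.items()}


index = build_index(FIELDS)
for keyword, ids in index.items():
    print(keyword, ids)
```

Storing the IDs in a set keeps a keyword that occurs several times in one field from being listed twice.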

PHP - Search String for a Specific Word Array and Match with an Optional + or -

假如想象 submitted on 2019-12-06 09:46:00
Question: I need to search a string for a specific word and have the match be a variable. I have a specific list of words in an array:

$names = array("Blue", "Gold", "White", "Purple", "Green", "Teal", "Purple", "Red");
$drag = "Glowing looks to be +Blue.";
$match = "+Blue";
echo $match; // +Blue

What I need to do is search $drag with the $names and find matches with an optional + or - character and have $match become the result.

Answer 1: Build a regular expression by joining the terms of the array with | , and
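The answer's starting point, joining the terms with | to form one alternation, can be sketched as follows. This is a Python illustration rather than the answer's PHP, and the rest of the answer is cut off, so the optional `[+-]?` prefix, the word boundaries, and the name `find_match` are assumptions about how it might continue.

```python
import re

names = ["Blue", "Gold", "White", "Purple", "Green", "Teal", "Purple", "Red"]
drag = "Glowing looks to be +Blue."


def find_match(text, words):
    """Return the first listed word found in text, with its optional + or - prefix."""
    # dict.fromkeys removes the duplicate "Purple" while keeping order;
    # re.escape protects any term that contains regex metacharacters.
    alternation = "|".join(re.escape(word) for word in dict.fromkeys(words))
    pattern = re.compile(r"[+-]?\b(?:" + alternation + r")\b")
    found = pattern.search(text)
    return found.group(0) if found else None


match = find_match(drag, names)
print(match)
```

The word boundaries keep a name such as "Blue" from matching inside a longer word like "Bluebird".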

How to make this random text generator more efficient in Python?

本秂侑毒 submitted on 2019-12-06 06:21:47
Question: I'm working on a random text generator (without using Markov chains), and currently it works without too many problems. First, here is my code flow:

1. Enter a sentence as input (this is called the trigger string and is assigned to a variable)
2. Get the longest word in the trigger string
3. Search the whole Project Gutenberg database for sentences that contain this word, regardless of case
4. Return the longest sentence that has the word I spoke about in step 3
5. Append the sentence in Step 1 and Step 4
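The flow above can be sketched in a few lines. This is not the asker's code, which is not shown in the excerpt; it is a minimal sketch that assumes the corpus is already loaded as one string and that sentences end in ., ! or ?. The function names and the sample corpus are hypothetical.

```python
import re


def longest_word(text):
    """Step 2: return the longest word in the trigger string (first one wins ties)."""
    words = re.findall(r"[A-Za-z']+", text)
    return max(words, key=len)


def longest_sentence_with(word, corpus):
    """Steps 3 and 4: the longest sentence containing the word, ignoring case."""
    # Naive sentence split on whitespace that follows ., ! or ?.
    sentences = re.split(r"(?<=[.!?])\s+", corpus)
    # Compile once so the pattern is not rebuilt for every sentence.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    candidates = [s for s in sentences if pattern.search(s)]
    return max(candidates, key=len) if candidates else None


def generate(trigger, corpus):
    """Step 5: append the found sentence to the trigger string."""
    sentence = longest_sentence_with(longest_word(trigger), corpus)
    return trigger if sentence is None else trigger + " " + sentence


# Hypothetical stand-in for the Project Gutenberg text.
corpus = "The whale swam. A Whale is a very large animal indeed. Dogs bark."
result = generate("I saw a whale", corpus)
print(result)
```

The sketch assumes the trigger string contains at least one word; `max` raises `ValueError` on an empty list.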

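Normalization and analyzers in Elastic Search

纵然是瞬间 submitted on 2019-12-06 05:31:02
Normalization provides a more complete inverted index for keywords. Examples include tense conversion (like | liked), singular/plural conversion (man | men), full and abbreviated forms (china | cn), and synonyms (small | little). For instance, can a document containing china be found when the search term is cn? Can data containing dogs be found when the search term is dog? If an abbreviation (cn) or a singular/plural variant (dog & dogs) returns the desired results, the search engine's normalization is said to be user-friendly.

Normalization exists to improve recall, that is, the engine's ability to find matching documents. It does its work together with the analyzer. The analyzer's job is to process the fields of a Document: it splits field data into terms while the inverted index is being built.

For example, when the inverted index is created for the sentence "I think dogs is human’s best friend.", the analyzer splits the data. The sentence is split into these terms: think, dog, human, best, friend. Common search terms are think, human, best, and friend; words such as is, a, the, and i are rarely used as search terms.

1 Common analyzers provided by ES by default
Sentence to analyze: Set the shape to semi-transparent by calling set_trans(5)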
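
The splitting and normalization described earlier for the dogs sentence can be imitated with a toy analyzer. This is a hedged Python sketch, not how Elasticsearch implements its analyzers: the stop-word set comes from the words the text names, while the `NORMALIZE` lookup table and the name `analyze` are assumptions standing in for real stemming and synonym filters.

```python
import re

# Words the text says are rarely used as search terms.
STOP_WORDS = {"i", "is", "a", "the"}

# Assumed lookup table built from the text's examples; a real analyzer
# would use stemmers and synonym filters instead of a fixed dictionary.
NORMALIZE = {
    "dogs": "dog",
    "liked": "like",
    "men": "man",
    "cn": "china",
    "little": "small",
}


def analyze(text):
    """Lowercase, strip possessives, tokenize, drop stop words, normalize."""
    lowered = text.lower()
    # Remove possessive endings written with a straight or curly apostrophe.
    lowered = re.sub(r"['’]s\b", "", lowered)
    tokens = re.findall(r"[a-z0-9]+", lowered)
    return [NORMALIZE.get(t, t) for t in tokens if t not in STOP_WORDS]


terms = analyze("I think dogs is human’s best friend.")
print(terms)
```

Applying the same `analyze` function to the query string is what lets a search for cn reach a document indexed under china.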

Complex XSLT split?

让人想犯罪 __ submitted on 2019-12-06 03:49:41
Question: Is it possible to split a tag at lower-to-upper-case boundaries? For example, the tag 'UserLicenseCode' should be converted to 'User License Code' so that the column headers look a little nicer. I've done something like this in the past using Perl's regular expressions, but XSLT is a whole new ball game for me. Any pointers in creating such a template would be greatly appreciated! Thanks, Krishna

Answer 1: Using recursion, it is possible to walk through a string in XSLT to evaluate every character.
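
The character-by-character recursion the answer describes can be illustrated outside XSLT. The following Python sketch is not the answer's template, which is cut off in this excerpt; it only mirrors the shape of the recursion, and `split_camel` is a hypothetical name.

```python
def split_camel(tag):
    """Insert a space at every lower-to-upper-case boundary, recursively."""
    # Base case: fewer than two characters means there is no boundary left.
    if len(tag) < 2:
        return tag
    head, rest = tag[0], tag[1:]
    # A boundary is a lowercase character followed by an uppercase one.
    separator = " " if head.islower() and rest[0].isupper() else ""
    return head + separator + split_camel(rest)


header = split_camel("UserLicenseCode")
print(header)
```

An XSLT 1.0 template would follow the same pattern, calling itself on the remainder of the string until the string is exhausted.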

Emacs regular expression: what \< and \> can do that \b cannot do?

旧城冷巷雨未停 submitted on 2019-12-05 17:33:03
Question: The "Regexp Backslash" section of the GNU Emacs Manual says that \< matches at the beginning of a word, \> matches at the end of a word, and \b matches a word boundary. \b works just as it does in other, non-Emacs regular expressions, but \< and \> seem to be particular to Emacs regular expressions. Are there cases where \< and \> are needed instead of \b ? For instance, \bword\b would match the same as \<word\> would, and the only difference is that the latter is more readable.

Answer 1: You can get unexpected results

Unicode-ready wordsearch - Question

社会主义新天地 submitted on 2019-12-05 16:46:47
Is this code OK? I don't really have a clue which normalization form I should use (the only thing I noticed is that with NFD I get wrong output).

#!/usr/local/bin/perl
use warnings;
use 5.014;
use utf8;
binmode STDOUT, ':encoding(utf-8)';
use Unicode::Normalize;
use Unicode::Collate::Locale;
use Unicode::GCString;

my $text = "my taxt täxt";
my %hash;
while ( $text =~ m/(\p{Alphabetic}+(?:'\p{Alphabetic}+)?)/g ) { #'
    my $word = $1;
    my $NFC_word = NFC( $word );
    $hash{$NFC_word}++;
}
my $collator = Unicode::Collate::Locale->new( locale => 'DE' );
for my $word ( $collator->sort( keys %hash ) ) {
    my

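General-purpose web page similarity detection using the text similarity algorithms provided by the word segmenter

守給你的承諾、 submitted on 2019-12-05 13:52:52
Implementation code: General-purpose web page similarity detection using the text similarity algorithms provided by the word segmenter

Run results:
Number of blog posts checked: 128

1. Blog post checked: Word-usage analysis of 192 software books (part 5): most complex word usage, level 99. Similarity scores:
Simple=0.968589
Cosine=0.955598
EditDistance=0.916884
EuclideanDistance=0.00825
ManhattanDistance=0.001209
Jaccard=0.859838
JaroDistance=0.824469
JaroWinklerDistance=0.894682
SørensenDiceCoefficient=0.924638
SimHashPlusHammingDistance=0.976563
Blog post URL 1: http://my.oschina.net/apdplat/blog/388816
Blog post URL 2: http://yangshangchuan.iteye.com/blog/2194214

2. Blog post checked: Analysis of APDPlat's system startup and shutdown process. Similarity scores:
Simple=0.837996
Cosine=0.711649
EditDistance=0.55001
EuclideanDistance=0.003669
ManhattanDistance=0.000992
Jaccard=0.549422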
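
Two of the measures named in the results, Jaccard and Cosine, can be sketched on token lists. This Python sketch uses the textbook definitions and plain whitespace tokenization as assumptions; the word library segments Chinese text and its own implementation and scaling may differ, so these functions are not expected to reproduce the scores listed in the results.

```python
import math
from collections import Counter


def tokenize(text):
    """Assumed tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def jaccard(a, b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine(a, b):
    """Cosine of the angle between the two term-frequency vectors."""
    counts_a, counts_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


page_1 = "system startup and shutdown process"
page_2 = "system startup and restart process"
print(jaccard(page_1, page_2), cosine(page_1, page_2))
```

Jaccard ignores how often a term occurs, while cosine weights repeated terms, which is one reason the two scores differ for the same pair of pages.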

SOLR4.2+NUTCH1.6

点点圈 submitted on 2019-12-05 04:40:22
1. Integrating NUTCH1.6 with SOLR4.2

wget http://archive.apache.org/dist/lucene/solr/4.2.0/solr-4.2.0.tgz
tar -xzvf solr-4.2.0.tgz
cd solr-4.2.0/example

Copy the schema-solr4.xml file from nutch's conf directory to the solr/collection1/conf directory and rename it to schema.xml, overwriting the original file.
Edit solr/collection1/conf/schema.xml and add the following under <fields>:

<field name="_version_" type="long" indexed="true" stored="true"/>

2. Configuring the Chinese word segmenter "word" for SOLR4.2

See the Solr plugin section of https://github.com/ysc/word

3. Running SOLR4.2

Start the SOLR4.2 server: java -jar start.jar &
SOLR4.2 web interface: http://host2:8983

4. Running NUTCH to submit the index

Run the solrindex command: bin/nutch solrindex http://host2:8983/solr data/crawldb -linkdb data

Pass multi-word arguments to a bash function

故事扮演 submitted on 2019-12-04 23:30:53
Inside a bash script function, I need to work with the command-line arguments of the script, and also with another list of arguments. So I'm trying to pass two argument lists to a function. The problem is that multi-word arguments get split.

function params() {
    for PARAM in $1; do
        echo "$PARAM"
    done
    echo .
    for ITEM in $2; do
        echo "$ITEM"
    done
}

PARAMS="$@"
ITEMS="x y 'z t'"
params "$PARAMS" "$ITEMS"

Calling the script gives me:

myscript.sh a b 'c d'
a
b
c
d
.
x
y
'z
t'

Since there are two lists, they must be passed as a whole to the function. The question is how to iterate the elements while