问题
I am introducing sunspot search into my project. I got a POC by just searching by the name field. When I introduced the description field and reindexed sold I get the following error.
** Invoke sunspot:reindex (first_time)
** Invoke environment (first_time)
** Execute environment
** Execute sunspot:reindex
Skipping progress bar: for progress reporting, add gem 'progress_bar' to your Gemfile
rake aborted!
RSolr::Error::Http: RSolr::Error::Http - 400 Bad Request
Error: {'responseHeader'=>{'status'=>400,'QTime'=>18},'error'=>{'msg'=>'Illegal character ((CTRL-CHAR, code 11))
at [row,col {unknown-source}]: [42,1]','code'=>400}}
Request Data: "<?xml version=\"1.0\" encoding=\"UTF-8\"?><add><doc><field name=\"id\">ItemsDesign 1322</field><field name=\"type\">ItemsDesign</field><field name=\"type\">ActiveRecord::Base</field><field name=\"class_name\">ItemsDesign</field><field name=\"name_text\">River City Clocks Musical Multi-Colored Quartz Cuckoo Clock</field><field name=\"description_text\">This colorful chalet style German quartz cuckoo clock accurately keeps time and plays 12 different melodies. Many colorful flowers are painted on the clock case and figures of a Saint Bernard and Alpine horn player are on each side of the clock dial. Two decorative pine cone weights are suspended beneath the clock case by two chains. The heart shaped pendulum continously swings back and forth. On every
I assuming that the bad char is that you can see at the bottom. that is littered in a lot of the descriptions. I'm not even sure what char that is.
What can I do to get solr to ignore it or clean the data so that sold can handle it.
Thanks
回答1:
Put the following in an initializer to automatically clean sunspot calls of any UTF8 control characters:
# config/initializers/sunspot.rb
module Sunspot
#
# DataExtractors present an internal API for the indexer to use to extract
# field values from models for indexing. They must implement the #value_for
# method, which takes an object and returns the value extracted from it.
#
module DataExtractor #:nodoc: all
#
# AttributeExtractors extract data by simply calling a method on the block.
#
class AttributeExtractor
def initialize(attribute_name)
@attribute_name = attribute_name
end
def value_for(object)
Filter.new( object.send(@attribute_name) ).value
end
end
#
# BlockExtractors extract data by evaluating a block in the context of the
# object instance, or if the block takes an argument, by passing the object
# as the argument to the block. Either way, the return value of the block is
# the value returned by the extractor.
#
class BlockExtractor
def initialize(&block)
@block = block
end
def value_for(object)
Filter.new( Util.instance_eval_or_call(object, &@block) ).value
end
end
#
# Constant data extractors simply return the same value for every object.
#
class Constant
def initialize(value)
@value = value
end
def value_for(object)
Filter.new(@value).value
end
end
#
# A Filter to allow easy value cleaning
#
class Filter
def initialize(value)
@value = value
end
def value
strip_control_characters @value
end
def strip_control_characters(value)
return value unless value.is_a? String
value.chars.inject("") do |str, char|
unless char.ascii_only? and (char.ord < 32 or char.ord == 127)
str << char
end
str
end
end
end
end
end
Source (Sunspot Github Issues): Sunspot Solr Reindexing failing due to illegal characters
回答2:
I tried the solution @thekingoftruth proposed, however it did not solve the problem. Found an alternative version of the Filter class in the same github thread that he links to and that solved my problem.
The main difference was the i use nested models through HABTM relationships.
This is my search block in the model:
searchable do
text :name, :description, :excerpt
text :venue_name do
venue.name if venue.present?
end
text :artist_name do
artists.map { |a| a.name if a.present? } if artists.present?
end
end
Here is the initializer that worked for me:
(in: config/initializers/sunspot.rb
)
module Sunspot
#
# DataExtractors present an internal API for the indexer to use to extract
# field values from models for indexing. They must implement the #value_for
# method, which takes an object and returns the value extracted from it.
#
module DataExtractor #:nodoc: all
#
# AttributeExtractors extract data by simply calling a method on the block.
#
class AttributeExtractor
def initialize(attribute_name)
@attribute_name = attribute_name
end
def value_for(object)
Filter.new( object.send(@attribute_name) ).value
end
end
#
# BlockExtractors extract data by evaluating a block in the context of the
# object instance, or if the block takes an argument, by passing the object
# as the argument to the block. Either way, the return value of the block is
# the value returned by the extractor.
#
class BlockExtractor
def initialize(&block)
@block = block
end
def value_for(object)
Filter.new( Util.instance_eval_or_call(object, &@block) ).value
end
end
#
# Constant data extractors simply return the same value for every object.
#
class Constant
def initialize(value)
@value = value
end
def value_for(object)
Filter.new(@value).value
end
end
#
# A Filter to allow easy value cleaning
#
class Filter
def initialize(value)
@value = value
end
def value
if @value.is_a? String
strip_control_characters_from_string @value
elsif @value.is_a? Array
@value.map { |v| strip_control_characters_from_string v }
elsif @value.is_a? Hash
@value.inject({}) do |hash, (k, v)|
hash.merge( strip_control_characters_from_string(k) => strip_control_characters_from_string(v) )
end
else
@value
end
end
def strip_control_characters_from_string(value)
return value unless value.is_a? String
value.chars.inject("") do |str, char|
unless char.ascii_only? && (char.ord < 32 || char.ord == 127)
str << char
end
str
end
end
end
end
end
回答3:
You need to get rid of control characters from UTF8 while saving your content. Solr will not reindex this properly and throw this error.
http://en.wikipedia.org/wiki/UTF-8#Codepage_layout
You can use something like this:
name.gsub!(/\p{Cc}/, "")
edit: If you want to override it globally I think it could be possible by overriding value_for_methods in AttributeExtractor and if needed BlockExtractor. https://github.com/sunspot/sunspot/blob/master/sunspot/lib/sunspot/data_extractor.rb I wasn't checking this. If you manage to add some global patch, please let me know. I had lately same issue.
来源:https://stackoverflow.com/questions/23375336/solr-sunspot-bad-request-illegal-character