Search and Replace Words in HTML

小蘑菇 2021-02-18 14:24

What I'm trying to do is make a 'jargon buster'. Basically, I have some HTML and some glossary terms in a database. When the person clicks on jargon buster it replaces the wor

3 Answers
  • 2021-02-18 14:31

    Use the negated word-character class \W to match any character other than letters, digits, and underscores in your regex pattern. Because this would still fail at the boundaries of the text blob, you also need to test for those conditions. Thus, using the word 'term' as the text you are searching for:

    (^term$)|(^term\W)|(\Wterm\W)|(\Wterm$)
    

    The first condition matches when term is the only content of the blob, the second when it's the first word, the third when it's contained within the blob, and the last when it's the last word.

    If you want to treat any other characters as word characters (say, a hyphen), you would need to replace the \W with [^\w\-].
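For anyone trying this in JavaScript, here is a minimal sketch of the same idea. It folds the four alternatives into the equivalent `(^|\W)term(\W|$)` form and keeps the surrounding characters via capture groups. The names `escapeRegExp` and `wrapTerm`, and the `<b>` wrapper, are made up for illustration and are not from the question.

```javascript
// Hypothetical helper: escape regex metacharacters in a glossary term.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: wrap whole-word occurrences of `term` in <b> tags.
// (^|\W) and (\W|$) cover the same four cases as the alternation above.
function wrapTerm(text, term) {
  const pattern = new RegExp('(^|\\W)(' + escapeRegExp(term) + ')(\\W|$)', 'gi');
  // $1 and $3 put back the non-word characters that the match consumed.
  return text.replace(pattern, '$1<b>$2</b>$3');
}

console.log(wrapTerm('terminal term', 'term'));
```

One caveat with this approach: the \W on each side is consumed by the match, so in a string like 'term term' the single space is used up by the first match and the second occurrence is skipped. Zero-width assertions (\b or lookarounds) avoid that.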

    Hope this helps. There are probably optimizations that can be performed as well, but this should at least be a good starting point.

  • 2021-02-18 14:40

    Assuming all your glossary "words" consist of standard "word" characters (i.e. [A-Za-z0-9_]), a simple word boundary assertion can be placed before and after the word in the regex pattern. Try replacing the pertinent statement with this:

    $element->innertext = preg_replace(
        '/\b'. $glossary_word .'\b/i',
        '<a '. $glossary_tip .' >'. $glossary['word'] .'</a>',
        $element->innertext);
    

    This assumes that $glossary_word has been run through preg_quote (which your code does). Pass '/' as the second argument to preg_quote so that the pattern delimiter is escaped as well.
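The same word-boundary idea, sketched in JavaScript in case it helps to experiment outside PHP. `escapeRegExp` plays the role of preg_quote here. The function name `linkGlossaryWord` and the `title` attribute are assumptions for illustration, and `tip` is assumed to be already safe to place in an attribute.

```javascript
// Hypothetical stand-in for preg_quote: escape regex metacharacters.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: wrap whole-word, case-insensitive matches of `word`.
function linkGlossaryWord(text, word, tip) {
  const pattern = new RegExp('\\b' + escapeRegExp(word) + '\\b', 'gi');
  // A replacer function avoids '$' sequences in `tip` being interpreted.
  return text.replace(pattern, function (match) {
    return '<a title="' + tip + '">' + match + '</a>';
  });
}

console.log(linkGlossaryWord('API and apis', 'api', 'x'));
console.log(linkGlossaryWord('e-mail', 'mail', 't'));
```

Unlike the PHP above, this sketch keeps the casing of the matched text instead of substituting the glossary's own spelling. The second call shows the limitation with hyphens: \b treats '-' as a boundary, so 'mail' is matched inside 'e-mail'.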

    However, if the glossary words may contain other non-standard word characters (such as a '-' dash), a more complex regex can be formulated which incorporates lookahead and lookbehind to ensure that only whole words are matched. For example:

    $re_pattern = "/         # Match a glossary whole word.
        (?<=[\s'\"]|^)       # Word preceded by whitespace, quote or BOS.
        {$glossary_word}     # Word to be matched.
        (?=[\s'\".?!,;:]|$)  # Word followed by ws, quote, punct or EOS.
        /ix";
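A rough JavaScript equivalent of the lookaround pattern, as a sketch only. JavaScript has no x (extended) flag, so the pattern is assembled from commented string parts instead. Lookbehind needs an engine with ES2018 support. The helper names are hypothetical.

```javascript
// Hypothetical stand-in for preg_quote: escape regex metacharacters.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build a whole-word pattern from lookarounds.
function wholeWordPattern(word) {
  // Word preceded by whitespace, a quote, or the beginning of the string.
  const before = "(?<=[\\s'\"]|^)";
  // Word followed by whitespace, a quote, punctuation, or the end of the string.
  const after = "(?=[\\s'\".?!,;:]|$)";
  return new RegExp(before + escapeRegExp(word) + after, 'gi');
}

const text = 'Send e-mail, "e-mail" and e-mail.';
console.log(text.replace(wholeWordPattern('e-mail'), '<b>$&</b>'));
console.log('Use e-mail, not mail-order.'.replace(wholeWordPattern('mail'), '<b>$&</b>'));
```

Because the assertions are zero-width, adjacent occurrences separated by a single space are both matched, and 'mail' is left alone inside 'e-mail' and 'mail-order'.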
    
  • 2021-02-18 14:40

    I had this problem in JS when extracting individual words. What I did was the following (you can translate it from JS to PHP). It actually works REALLY well for me. :)

    var words = document.body.innerHTML;
    
    // FIRST PASS
    
    // remove scripts
    words = words.replace(/<script[\s\S]*?>[\s\S]*?<\/script>/gi, '');
    // remove CSS
    words = words.replace(/<style[\s\S]*?>[\s\S]*?<\/style>/gi, '');
    // remove comments
    words = words.replace(/<!--[\s\S]*?-->/g, '');
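    // remove HTML character entities such as &amp; or &#160;
    // (the narrower pattern avoids eating plain text between a stray '&' and a later ';')
    words = words.replace(/&[a-z0-9#]+;/gi, ' ');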
    // remove all HTML
    words = words.replace(/<[\s\S]*?>/g, '');
    
    // SECOND PASS
    
    // remove all newlines
    words = words.replace(/\n/g, ' ');
    // replace multiple spaces with 1 space
    words = words.replace(/\s{2,}/g, ' ');
    
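    // split into words (letters, hyphens, apostrophes) and drop the empty
    // strings that split() leaves at the start and end
    words = words.split(/[^a-z'-]+/i).filter(function (w) { return w.length > 0; });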
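If you want to try the steps above without a browser, here is a sketch that packs them into one function taking an HTML string. The name `extractWords` is made up. It deviates from the snippet above in two ways: tags are replaced with a space rather than an empty string, so words in adjacent elements are not glued together, and entities are matched with a narrower pattern.

```javascript
// Hypothetical helper: reduce an HTML string to a list of words.
function extractWords(html) {
  return html
    // FIRST PASS: drop scripts, styles, comments, remaining tags, entities.
    .replace(/<script[\s\S]*?>[\s\S]*?<\/script>/gi, ' ')
    .replace(/<style[\s\S]*?>[\s\S]*?<\/style>/gi, ' ')
    .replace(/<!--[\s\S]*?-->/g, ' ')
    .replace(/<[^>]*>/g, ' ')
    .replace(/&[a-z0-9#]+;/gi, ' ')
    // SECOND PASS: split on anything that is not a letter, hyphen, or
    // apostrophe (this also absorbs newlines and repeated spaces).
    .split(/[^a-z'-]+/i)
    .filter(function (word) { return word.length > 0; });
}

const sample = '<p>Hello&nbsp;world</p><script>var x = 1;</script><p>it\'s well-known</p>';
console.log(extractWords(sample));
```

Regex-based stripping like this is only a rough approximation of HTML parsing, so treat it as a starting point for collecting candidate words, not as a sanitizer.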
    