Remove all strings in { } delimiter using Regex or Html Agility Pack in ASP.NET web forms [duplicate]

Question

This question already has answers here:

What to do Regular expression pattern doesn't match anywhere in string? (8 answers)

RegEx match open tags except XHTML self-contained tags (34 answers)

Closed 5 years ago.

I'm trying to extract only the text content from a web page and display it. I use HtmlAgilityPack for the extraction, but the returned text includes the JavaScript and CSS text, which I don't want. To delete it, I'm trying to detect the { } delimiters and remove every string inside them with a regex, but it is not working because the { } are nested. This is the regex I'm trying:

string regex = "\t|\n|<.*?>|(\\[.*\\])|(\".*\")|('.*')|(\\(.*\\))|{\\[.*\\]}|{\".*\"}|{'.*'}|{\\(.*\\)}";
TextArea1.Value = Regex.Replace(s, regex, "");

Input Text:

Los Angeles Times - California, national and world news - Los Angeles Times;},svginImg:function;a.onload=function{var a=navigator.userAgent||navigator.vendor||window.opera;return/;},isIE9:function==9;}},notmobileCalccheck:function;a.style.cssText=;return !!a.style.length;},isAndroidBrowser:function{var a=navigator.userAgent||navigator.vendor;return/android/i.test&&!window.opera;},isSupportedBrowser:function&&!window.opera;},getScreenWidth:function;},isSupported:function isSupported{a=sessionStorage==;}else{try{a=this.supportsSvg{a=false;}}if<=8;}};trb.utils.redirect=function;b.name=;document.body.appendChild;b.submit;if{localStorage=d;}else{for{var c={};for{c;}return null;},remove:function remove;localStorage.removeItem{var b=localStorage;if;a=),f;for;}}},remove:function remove{a.trb=a.trb||{};trb.data=trb.data||{};trb.data.isMobile=trb.browsersupport.isMobile;trb.data.isIE9=trb.browsersupport.isIE9;trb.data.facebookAppId=;trb.data.parentSectionPath=);}if;}trb.data.isSectionFront=true;if;}trb.data.videos={};trb.data.videos.ndnFallbackJsURL=;trb.data.initialpathname=;trb.data.pages=trb.data.pages||{};trb.data.pages={};trb.data.pages.unsupportedBrowserPath=;trb.svg={};trb.svg.data={};trb.svg.data.svgStrings={};trb.svg.data.svgStrings.logoShort=;trb.svg.data.svgStrings.logo=;trb.svg.data.svgStrings.loadingCircle=;trb.svg.data.map={mastheadLogo:{colors:{PRIMARY_COLOR:},string:trb.svg.data.svgStrings.loadingCircle}}; { background: #404040; } .trb_allContentWrapper { background: #333; }

Answer 1:

I was using HtmlAgilityPack to load a web page and extract only its text content, but the extracted text also included the CSS and JavaScript. I first tried a regex that detects the { } delimiters to remove them from the output, but that proved hard. I then tried another way that works and is much simpler: using Descendants() from HtmlAgilityPack. My code is:

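// Requires: using System.Linq; using System.Text.RegularExpressions; using HtmlAgilityPack;
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);

// Remove script, style, and comment nodes before reading the text.
doc.DocumentNode.Descendants()
    .Where(n => n.Name == "script" || n.Name == "style" || n.Name == "#comment")
    .ToList()
    .ForEach(n => n.Remove());

// Strip tabs, newlines, and any leftover tags from the remaining text.
string s = doc.DocumentNode.InnerText;
TextArea1.Value = Regex.Replace(s, @"\t|\n|<.*?>", "");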

I found this approach at THIS LINK, and everything works now.

Answer 2:

Why don't you simply try:

/\{.*?\}/g

and replace with nothing.

Answer 3:

If you want to match every span from a '{' to the next '}', including every character between the pair that isn't '}', use the following:

/\{[^\}]+\}/g
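On nested input, a single pass of a pattern like this stops at the first closing brace and leaves the outer one behind. One possible workaround, offered as a sketch and not as part of the answer above, is to exclude both brace characters from the character class so that only innermost pairs match, then repeat the replacement until the text stops changing. The class and method names here are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InnermostBraceRemover
{
    // Matches a brace pair that contains no other braces (an innermost pair).
    private static readonly Regex Innermost = new Regex(@"\{[^{}]*\}");

    // Removes innermost pairs repeatedly until no pair is left.
    public static string RemoveNested(string input)
    {
        string previous;
        do
        {
            previous = input;
            input = Innermost.Replace(input, "");
        } while (input != previous);
        return input;
    }

    public static void Main()
    {
        // One pass of the pattern from the answer leaves a stray brace.
        Console.WriteLine(Regex.Replace("a{b{c}d}e", @"\{[^\}]+\}", "")); // ad}e
        // The repeated innermost replacement removes the whole nested span.
        Console.WriteLine(RemoveNested("a{b{c}d}e"));                     // ae
    }
}
```

Each pass peels off one nesting level, so deeply nested text needs as many passes as its maximum depth.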

Answer 4:

You have nested braces.

In Perl and PHP, you could match the nested braces using (?R) (recursion syntax), and Ruby offers the equivalent \g<0>. But .NET does not have recursion. Does this mean we are lost? Luckily, no.

Balancing Groups to the Rescue

C# regex cannot use recursion, but it has an awesome feature called balancing groups.

This regex will match complete nested braces.

(?<counter>{)(?>(?<counter>{)|(?<-counter>})|[^{}]+)+?(?(counter)(?!))

For instance, it will match

{sdfs{sdfs}sd{d{ab}}fs}
{ab}
But not {aa
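As a rough sketch of how this pattern could be used from C# to delete the matched spans, the snippet below passes it to Regex.Replace. The braces are escaped here only for readability, and the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BalancedBraceStripper
{
    // The balancing-group pattern from above, with the literal braces escaped.
    private static readonly Regex Balanced = new Regex(
        @"(?<counter>\{)(?>(?<counter>\{)|(?<-counter>\})|[^{}]+)+?(?(counter)(?!))");

    // Deletes every complete, balanced { ... } span, including nested ones.
    public static string RemoveBalancedBraces(string input)
    {
        return Balanced.Replace(input, "");
    }

    public static void Main()
    {
        Console.WriteLine(RemoveBalancedBraces("a{b{c}d}e{f}g")); // aeg
        Console.WriteLine(RemoveBalancedBraces("x{aa"));          // x{aa (unbalanced, left alone)
    }
}
```

Text with an opening brace that is never closed is left untouched, so scraped input with unbalanced braces may still need separate cleanup.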

Answer 5:

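// Assumes `text` is a string variable holding the input.
// Empties each { ... } span by removing the text between a '{' and the next '}'.
// The braces themselves stay in place, and nested braces are not handled.
int x = 0;
while ((x = text.IndexOf('{', x)) >= 0)
{
    x++;                           // first character after '{'
    int y = text.IndexOf('}', x);
    if (y < 0) break;              // no closing brace remains
    text = text.Remove(x, y - x);  // strings are immutable, so reassign
}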
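The loop above pairs each '{' with the next '}', so it does not account for nesting. A different technique, shown here only as a sketch under that assumption, is a single pass that tracks the current brace depth and keeps only the characters seen at depth zero. The names are hypothetical.

```csharp
using System;
using System.Text;

public static class DepthCounterStripper
{
    // Copies only the characters that sit outside every { ... } span.
    public static string RemoveBraced(string input)
    {
        var result = new StringBuilder();
        int depth = 0;
        foreach (char c in input)
        {
            if (c == '{')
            {
                depth++;              // entering a span
            }
            else if (c == '}' && depth > 0)
            {
                depth--;              // leaving a span
            }
            else if (depth == 0)
            {
                result.Append(c);     // outside all spans, keep it
            }
        }
        return result.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(RemoveBraced("a{b{c}d}e{f}g")); // aeg
    }
}
```

A stray '}' at depth zero is kept as ordinary text, while an unclosed '{' swallows everything after it, which may or may not be what you want for scraped input.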

Source: https://stackoverflow.com/questions/24114019/remove-all-strings-in-delimiter-using-regex-or-html-agility-pack-in-asp-net

Tags

ASP.NET

regex

html-agility-pack