Question
I'm trying to extract only the text content from a web page and display it. I use HtmlAgilityPack for the extraction, but the returned text includes JavaScript and CSS, which I don't want. To remove it, I'm trying to detect the { } delimiters and delete everything inside them with a regex, but this doesn't work because the braces are nested. This is the regex I'm trying:
string regex = "\t|\n|<.*?>|(\\[.*\\])|(\".*\")|('.*')|(\\(.*\\))|{\\[.*\\]}|{\".*\"}|{'.*'}|{\\(.*\\)}";
TextArea1.Value = Regex.Replace(s, regex, "");
Input Text:
Los Angeles Times - California, national and world news - Los Angeles Times;},svginImg:function;a.onload=function{var a=navigator.userAgent||navigator.vendor||window.opera;return/;},isIE9:function==9;}},notmobileCalccheck:function;a.style.cssText=;return !!a.style.length;},isAndroidBrowser:function{var a=navigator.userAgent||navigator.vendor;return/android/i.test&&!window.opera;},isSupportedBrowser:function&&!window.opera;},getScreenWidth:function;},isSupported:function isSupported{a=sessionStorage==;}else{try{a=this.supportsSvg{a=false;}}if<=8;}};trb.utils.redirect=function;b.name=;document.body.appendChild;b.submit;if{localStorage=d;}else{for{var c={};for{c;}return null;},remove:function remove;localStorage.removeItem{var b=localStorage;if;a=),f;for;}}},remove:function remove{a.trb=a.trb||{};trb.data=trb.data||{};trb.data.isMobile=trb.browsersupport.isMobile;trb.data.isIE9=trb.browsersupport.isIE9;trb.data.facebookAppId=;trb.data.parentSectionPath=);}if;}trb.data.isSectionFront=true;if;}trb.data.videos={};trb.data.videos.ndnFallbackJsURL=;trb.data.initialpathname=;trb.data.pages=trb.data.pages||{};trb.data.pages={};trb.data.pages.unsupportedBrowserPath=;trb.svg={};trb.svg.data={};trb.svg.data.svgStrings={};trb.svg.data.svgStrings.logoShort=;trb.svg.data.svgStrings.logo=;trb.svg.data.svgStrings.loadingCircle=;trb.svg.data.map={mastheadLogo:{colors:{PRIMARY_COLOR:},string:trb.svg.data.svgStrings.loadingCircle}}; { background: #404040; } .trb_allContentWrapper { background: #333; }
Answer 1:
I had been using HtmlAgilityPack to load a web page and extract only its text content, but the extracted text also included the CSS and JavaScript. I first tried the regex approach of detecting the { } delimiters, but that proved hard. I then tried another way that works and is much simpler: using Descendants() from HtmlAgilityPack. My code is:
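// Requires: using System.Linq; using System.Text.RegularExpressions; using HtmlAgilityPack;
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
// Remove <script>, <style>, and comment nodes before reading the text.
doc.DocumentNode.Descendants()
    .Where(n => n.Name == "script" || n.Name == "style" || n.Name == "#comment")
    .ToList() // materialize the list first so nodes can be removed safely
    .ForEach(n => n.Remove());
string s = doc.DocumentNode.InnerText;
// Strip tabs, newlines, and any leftover tags.
TextArea1.Value = Regex.Replace(s, @"\t|\n|<.*?>", "");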
I found this approach at THIS LINK, and everything works now.
Answer 2:
Why don't you simply try:
/\{.*?\}/g
and replace with nothing.
Answer 3:
You want to match every span from '{' to '}', including every character between the pair that isn't '}'. Use the following:
/\{[^\}]+\}/g
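To illustrate how single-level patterns such as `\{.*?\}` and `\{[^\}]+\}` behave on nested input, here is a small C# sketch. The class and method names (`FlatBraceDemo`, `StripLazy`, `StripNegated`) are hypothetical and exist only for this example. Both patterns stop at the first closing brace, so on nested input they leave the tail of the outer block behind.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FlatBraceDemo
{
    // Lazy match: from a '{' to the nearest following '}'.
    public static string StripLazy(string s) =>
        Regex.Replace(s, @"\{.*?\}", "");

    // Negated class: '{', then one or more non-'}' characters, then '}'.
    public static string StripNegated(string s) =>
        Regex.Replace(s, @"\{[^\}]+\}", "");
}
```

For the input `a{b{c}d}e`, both methods return `ad}e`: the match ends at the inner `}`, and the rest of the outer block survives. The negated-class version also requires at least one character between the braces, so it leaves an empty `{}` untouched.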
Answer 4:
You have nested braces.
In Perl and PHP, you could match the nested braces using (?R) (recursion syntax); Ruby has an equivalent, \g<0>. But .NET does not have recursion. Does this mean we are lost? Luckily, no.
Balancing Groups to the Rescue
C# regex cannot use recursion, but it has an awesome feature called balancing groups.
This regex will match complete nested braces.
(?<counter>{)(?>(?<counter>{)|(?<-counter>})|[^{}]+)+?(?(counter)(?!))
For instance, it will match
{sdfs{sdfs}sd{d{ab}}fs}
{ab}
But it will not match
{aa
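A minimal sketch of applying the balancing-group pattern from C# to delete nested brace blocks. The class and method names (`BraceStripper`, `Strip`) are hypothetical, and the braces are escaped here for readability; otherwise the pattern is the one from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BraceStripper
{
    // "counter" is pushed on every '{' and popped on every '}'.
    // The final conditional fails the match while "counter" is non-empty,
    // so only fully balanced blocks are matched.
    private static readonly Regex NestedBraces = new Regex(
        @"(?<counter>\{)(?>(?<counter>\{)|(?<-counter>\})|[^{}]+)+?(?(counter)(?!))");

    // Removes every balanced {...} block, including nested ones.
    public static string Strip(string input) =>
        NestedBraces.Replace(input, "");
}
```

With this sketch, `BraceStripper.Strip("a{b{c}d}e")` returns `ae`, while an unclosed block such as `x {aa` is left unchanged because the pattern never matches it.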
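Answer 5:
// Repeatedly remove the innermost {...} pair; s is the input string.
string text = s;
int from = 0, close;
while ((close = text.IndexOf('}', from)) >= 0)
{
    // The nearest '{' before this '}' opens the innermost pair.
    int open = text.LastIndexOf('{', close);
    if (open < 0) { from = close + 1; continue; } // unmatched '}': skip it
    text = text.Remove(open, close - open + 1);
    from = open; // resume where the removed block started
}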
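A single pass that tracks nesting depth is another way to drop brace-delimited spans without index arithmetic. The sketch below is an alternative technique, not this answer's code, and the names (`BraceScanner`, `RemoveBraced`) are hypothetical. It assumes that a stray `}` outside any block should be dropped and that an unclosed `{` swallows the rest of the input.

```csharp
using System;
using System.Text;

public static class BraceScanner
{
    public static string RemoveBraced(string input)
    {
        var result = new StringBuilder(input.Length);
        int depth = 0; // number of currently open '{'

        foreach (char c in input)
        {
            if (c == '{')
            {
                depth++;
            }
            else if (c == '}')
            {
                // Close a block if one is open; a stray '}' is dropped.
                if (depth > 0) depth--;
            }
            else if (depth == 0)
            {
                // Keep only characters outside every block.
                result.Append(c);
            }
        }
        return result.ToString();
    }
}
```

Unlike the regex approaches, this version discards everything after an unclosed `{`, which may or may not be what you want for truncated script fragments.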
Source: https://stackoverflow.com/questions/24114019/remove-all-strings-in-delimiter-using-regex-or-html-agility-pack-in-asp-net