Parse Atom/RSS feed and remove HTML tags

难免孤独 2021-01-26 15:20

I am developing this code using PowerShell. I need to be able to remove the HTML tags from the feed content.

  Invoke-WebRequest -Uri 'https://psu.box.com/shared/static/jf36ohodxnw7oe


        
2 answers
  • 2021-01-26 16:00

    You should be able to use the following script. It uses the HTMLFile COM object to parse the HTML in each item's description.

      # Download the feed, then load the saved file as XML.
      Invoke-WebRequest -Uri 'https://*.rss' -OutFile C:\*.rss
      [xml]$Content = Get-Content C:\*.rss -Raw
      $Regex = '(?s)SE1046.*?Description := "(?<Description>.*?)"'

      # -match needs a string on the left, so test the document's raw XML text.
      If ($Content.OuterXml -match $Regex) {
          "Description is '$($Matches['Description'])'"
          # do something here with $Matches['Description']
      }
      Else {
          "No match."
      }

      $Feed = $Content.rss.channel
      ForEach ($msg in $Feed.Item) {
          $ParseData = $msg.description
          ForEach ($Datum in $ParseData) {
              # -like only matches a substring when the pattern has wildcards;
              # -as [int] yields $null instead of throwing on non-numeric text.
              If ($Datum -like '*Title*') { $Upvote = ($Datum -split ' ' | Select-Object -First 1) -as [int] }#EndIf
              If ($Datum -like '*comments*') { $Downvote = ($Datum -split ' ' | Select-Object -First 1) -as [int] }#EndIf
          }#EndForEach

          # Load the description's HTML into an HTMLFile COM document (Windows only).
          $HTML = New-Object -ComObject 'HTMLFile'
          $HTML.IHTMLDocument2_write($ParseData.InnerText)

          [PSCustomObject]@{
              'LastUpdated' = [datetime]$msg.pubDate
              'Title' = $msg.title
              'Category' = $msg.category
              'Author' = $msg.author
              'Link' = $msg.link
              'UpVotes' = $Upvote
              'DownVotes' = $Downvote
              'Validations' = $Validation
              'WorkArounds' = $Workaround
              # Text of every <p> element, with the markup removed.
              'Comments' = $HTML.all.tags('p') | ForEach-Object innerText
              'FeedbackID' = $FeedBackID
          }#EndPSCustomObject
      }
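
    To try the loop without downloading anything, a small inline feed can stand in for the real one. This is only a sketch: the sample XML, `$Sample`, `$Rows` and the `RawHtml` property are made up for illustration, and the real feed's layout may differ. It reads the description through `SelectSingleNode`, which returns the element whether the text is wrapped in CDATA or not. The shorthand `$msg.description` yields a plain string for a text-only element, in which case `.InnerText` comes back empty.

    ```powershell
    # Hypothetical two-item feed used only for illustration.
    [xml]$Sample = '<rss version="2.0">
      <channel>
        <item>
          <title>First post</title>
          <description><![CDATA[<p>Hello <b>world</b></p>]]></description>
        </item>
        <item>
          <title>Second post</title>
          <description>Plain text only</description>
        </item>
      </channel>
    </rss>'

    $Rows = ForEach ($msg in $Sample.rss.channel.item) {
        [PSCustomObject]@{
            Title   = $msg.title
            # SelectSingleNode always returns the element node, so InnerText
            # works for both CDATA and plain-text descriptions.
            RawHtml = $msg.SelectSingleNode('description').InnerText
        }
    }

    $Rows
    ```

    `RawHtml` still contains the markup at this point; removing the tags is a separate step.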
    
  • 2021-01-26 16:13

    You could replace <br/> with actual line breaks, then tag-strip the rest completely:

    $commentsPlain = $msg.description.InnerText -replace '<br ?/?>',[System.Environment]::NewLine -replace '<[^>]+>'
    
    [PSCustomObject]@{
        'LastUpdated' = [datetime]$msg.pubDate
        'Title' = $msg.title
        'Category' = $msg.category
        'Author' = $msg.author
        'Link' = $msg.link
        'UpVotes' = $Upvote
        'DownVotes' = $Downvote
        'Validations' = $Validation
        'WorkArounds' = $Workaround
        'Comments' = $commentsPlain
        'FeedbackID' = $FeedBackID
    }
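
    The two `-replace` steps can be checked on a sample string before wiring them into the loop. The sketch below uses a made-up `$RawHtml` value and adds one step that is not part of the snippet above: decoding HTML entities such as `&amp;` with `[System.Net.WebUtility]::HtmlDecode`. Decoding runs last so that a decoded `<` or `>` is not mistaken for a tag.

    ```powershell
    # Hypothetical description text; real feed content will differ.
    $RawHtml = '<p>Fish &amp; chips<br/>5 &gt; 3</p>'

    # 1. Turn <br>, <br/> and <br /> into real line breaks.
    $Plain = $RawHtml -replace '<br ?/?>', [System.Environment]::NewLine
    # 2. Drop every remaining tag.
    $Plain = $Plain -replace '<[^>]+>', ''
    # 3. Decode entities (an extra step, assumed useful for feed text).
    $Plain = [System.Net.WebUtility]::HtmlDecode($Plain)

    $Plain
    ```

    A regex like this is fine for simple markup such as feed descriptions, but it is not an HTML parser, so content with `>` inside attribute values may need a real parser.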
    