Text mining with tm.plugin.webmining package using GoogleFinanceSource function

Question

I am studying text mining with the online book http://tidytextmining.com/. In the fifth chapter (http://tidytextmining.com/dtm.html#financial), the following code:

library(tm.plugin.webmining)
library(purrr)
library(dplyr)  # provides data_frame(), mutate() and %>%

company <- c("Microsoft", "Apple", "Google", "Amazon", "Facebook",
             "Twitter", "IBM", "Yahoo", "Netflix")
symbol <- c("MSFT", "AAPL", "GOOG", "AMZN", "FB", "TWTR", "IBM", "YHOO", "NFLX")

download_articles <- function(symbol) {
    WebCorpus(GoogleFinanceSource(paste0("NASDAQ:", symbol)))
}
stock_articles <- data_frame(company = company,
                             symbol = symbol) %>%
    mutate(corpus = map(symbol, download_articles))

gives me the error:

StartTag: invalid element name
Extra content at the end of the document
Error: 1: StartTag: invalid element name
2: Extra content at the end of the document

Any hints? Someone suggested removing the company and symbol entries for "Twitter", but the code still fails with the same error. Many thanks in advance.

Answer 1:

I am having the same issue, but I have narrowed it down slightly. This single call produces the same error:

GoogleFinanceSource("NASDAQ:MSFT")

StartTag: invalid element name
Extra content at the end of the document
Error: 1: StartTag: invalid element name
2: Extra content at the end of the document

I also saw that others have suggested removing Twitter. I see the point: it would have failed anyway because Twitter is not listed on NASDAQ. However, I tried the suggested "NYSE:TWTR" and got the same result.

I then tried GoogleNewsSource to see whether I would hit the same issue, and got a different error. This GitHub issue suggests it is caused by the parser: github.com/mannau/tm.plugin.webmining/issues/14. I wonder whether the two issues are related.

GoogleNewsSource("Microsoft")

Unknown IO error failed to load external entity "http://news.google.com/news?hl=en&q=Microsoft&ie=utf-8&num=100&output=rss"
Error: 1: Unknown IO error2: failed to load external entity "http://news.google.com/news?hl=en&q=Microsoft&ie=utf-8&num=100&output=rss"

That being said, I have found a workaround using a shortened ticker list and YahooFinanceSource, as follows:

company <- c("Microsoft", "Apple", "Google")
symbol <- c("MSFT", "AAPL", "GOOG")

download_articles <- function(symbol) {
    WebCorpus(YahooFinanceSource(symbol))
}

stock_articles <- data_frame(company = company,
                             symbol = symbol) %>%
    mutate(corpus = map(symbol, download_articles))
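
If you would rather keep the full ticker list and simply skip the symbols that fail, the downloader can be wrapped so that an error returns NULL instead of aborting the whole run. This is only a minimal sketch in base R: safe_download and fake_download are hypothetical names of my own, not part of tm.plugin.webmining, and the stand-in downloader exists only so the example runs without the package or a network connection.

```r
# Wrap any downloader so that an error becomes NULL instead of stopping the run.
safe_download <- function(download_fn) {
  function(symbol) {
    tryCatch(
      download_fn(symbol),
      error = function(e) {
        message("Skipping ", symbol, ": ", conditionMessage(e))
        NULL
      }
    )
  }
}

# Hypothetical stand-in downloader, used only to demonstrate the wrapper offline.
# With the package installed, it could be replaced by something like:
#   function(symbol) WebCorpus(YahooFinanceSource(symbol))
fake_download <- function(symbol) {
  if (symbol == "TWTR") stop("feed could not be parsed")
  paste("articles for", symbol)
}

symbols <- c("MSFT", "AAPL", "TWTR")
corpora <- lapply(symbols, safe_download(fake_download))
names(corpora) <- symbols

# Keep only the symbols whose download succeeded.
corpora <- Filter(Negate(is.null), corpora)
names(corpora)
```

With the stand-in above, only MSFT and AAPL remain in corpora, and a message reports that TWTR was skipped.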

Answer 2:

The problem is that the tm.plugin.webmining package is out of date.

Only YahooFinanceSource and YahooNewsSource are still working at the time of this reply.

Here is a quick reference and test.

According to the vignette written by the package author, there should be 8 possible sources:

GoogleBlogSearchSource
GoogleFinanceSource
GoogleNewsSource
NYTimesSource
ReutersNewsSource
YahooFinanceSource
YahooInplaySource
YahooNewsSource

But according to the GitHub page, the first one, GoogleBlogSearchSource, has already been discontinued. For the remaining 7 sources, I ran a simple test to see whether they still work:

library(tm)
library(tm.plugin.webmining)

googlefinance <- WebCorpus(GoogleFinanceSource("A"))
googlenews <- WebCorpus(GoogleNewsSource("A"))
nytimes <- WebCorpus(NYTimesSource("A", appid = nytimes_appid))
reutersnews <- WebCorpus(ReutersNewsSource("A"))
yahoofinance <- WebCorpus(YahooFinanceSource("A"))
yahooinplay <- WebCorpus(YahooInplaySource())
yahoonews <- WebCorpus(YahooNewsSource("M"))

The results show that all of the Yahoo sources are technically still running, but YahooInplaySource returns 0 documents no matter which parameters I chose.

> googlefinance <- WebCorpus(GoogleFinanceSource("NASDAQ:MSFT"))
StartTag: invalid element name
Extra content at the end of the document
Error in inherits(x, "WebSource") : 1: StartTag: invalid element name
2: Extra content at the end of the document
> googlefinance <- WebCorpus(GoogleFinanceSource("A"))
StartTag: invalid element name
Extra content at the end of the document
Error in inherits(x, "WebSource") : 1: StartTag: invalid element name
2: Extra content at the end of the document
> googlenews <- WebCorpus(GoogleNewsSource("A"))
Unknown IO errorfailed to load external entity "http://news.google.com/news?hl=en&q=A&ie=utf-8&num=100&output=rss"
Error in inherits(x, "WebSource") : 
  1: Unknown IO error2: failed to load external entity "http://news.google.com/news?hl=en&q=A&ie=utf-8&num=100&output=rss"
> nytimes <- WebCorpus(NYTimesSource("A", appid = nytimes_appid))
Error in inherits(x, "WebSource") : object 'nytimes_appid' not found
> reutersnews <- WebCorpus(ReutersNewsSource("A"))
Entity 'ldquo' not defined
Entity 'rdquo' not defined
Opening and ending tag mismatch: div line 60 and body
Opening and ending tag mismatch: body line 59 and html
Premature end of data in tag html line 1
Error in inherits(x, "WebSource") : 1: Entity 'ldquo' not defined
2: Entity 'rdquo' not defined
3: Opening and ending tag mismatch: div line 60 and body
4: Opening and ending tag mismatch: body line 59 and html
5: Premature end of data in tag html line 1
> yahoofinance <- WebCorpus(YahooFinanceSource("A"))
> yahoofinance
<<WebCorpus>>
Metadata:  corpus specific: 3, document level (indexed): 0
Content:  documents: 16
> yahooinplay <- WebCorpus(YahooInplaySource())
> yahooinplay
<<WebCorpus>>
Metadata:  corpus specific: 3, document level (indexed): 0
Content:  documents: 0
> yahoonews <- WebCorpus(YahooNewsSource("A"))
> yahoonews
<<WebCorpus>>
Metadata:  corpus specific: 3, document level (indexed): 0
Content:  documents: 0
> yahoonews <- WebCorpus(YahooNewsSource("M"))
> yahoonews
<<WebCorpus>>
Metadata:  corpus specific: 3, document level (indexed): 0
Content:  documents: 10
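
The same check can be scripted so that the status of every source ends up in one table instead of a long console log. The following is a rough sketch under my own assumptions: probe_sources is a hypothetical helper, not a function from the package, and the two stand-in checks replace the real WebCorpus calls so the example runs offline.

```r
# Run each zero-argument check and record whether it succeeded.
# On success, `detail` holds the document count; on failure, the error message.
probe_sources <- function(checks) {
  rows <- lapply(names(checks), function(name) {
    result <- tryCatch(
      list(ok = TRUE, detail = paste(length(checks[[name]]()), "documents")),
      error = function(e) list(ok = FALSE, detail = conditionMessage(e))
    )
    data.frame(source = name, ok = result$ok, detail = result$detail,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Hypothetical stand-ins so the sketch runs without the package or a network.
# With tm.plugin.webmining loaded, the list could instead look like:
#   list(
#     GoogleFinanceSource = function() WebCorpus(GoogleFinanceSource("NASDAQ:MSFT")),
#     YahooFinanceSource  = function() WebCorpus(YahooFinanceSource("MSFT"))
#   )
checks <- list(
  working = function() list("doc1", "doc2"),
  broken  = function() stop("StartTag: invalid element name")
)

report <- probe_sources(checks)
print(report)
```

Each check is wrapped in a function so that nothing is downloaded until probe_sources calls it inside tryCatch, which keeps one broken source from stopping the rest of the test.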

It is also worth mentioning that even though YahooFinanceSource is working, it won't return content similar to what GoogleFinanceSource was supposed to return. If you want to play with the examples from the book in the question, I think you could use YahooNewsSource with a customized list of queries.

Answer 3:

In the line of code below, try changing the default ie = "utf-8" to ie = "ansi". Apply the same change in your script and it should work.

WebCorpus(GoogleFinanceSource("NASDAQ:MSFT", params = list(hl = "en", q = "NASDAQ:MSFT", ie = "ansi", start = 0, num = 20, output = "rss")))

Source: https://stackoverflow.com/questions/47790148/text-mining-with-tm-plugin-webmining-package-using-googlefinancesource-function

Tags

text-mining