R: Scrape Atom Feed to a Data Frame
I am working on a scraper in R for an Atom feed, and I am having trouble grabbing the link for each article. Here's my code:

    library(RCurl)  # getURL()
    library(XML)    # htmlParse(), xpathSApply(), xmlValue()

    url <- "http://www.stwnewspress.com/search/?mode=article&q=&nsa=eedition&t=article&l=1000&s=&sd=desc&f=atom&d=&d1=&d2="
    pageSource <- getURL(url, encoding = "UTF-8")
    parsed <- htmlParse(pageSource)
    titles <- xpathSApply(parsed, '//entry/title', xmlValue)
    authors <- xpathSApply(parsed, '//entry/author', xmlValue)
    links <- xpathSApply(parsed, '//entry/link/@href')
    # pubDates is not defined in the code shown here
    dataFrame <- data.frame(pubDates, titles, authors)

My problem is that I am getting 18 titles, 18 authors, and 20 links. I think I am picking up two extra links somewhere in the feed page, but I'm not sure how to exclude them.
Thank you for your help!
You can work with the "//entry" nodes rather than with the individual child nodes. Some entry nodes contain more than one link, for example:

    out <- xpathApply(parsed, "//entry", function(x) {
      children <- xmlChildren(x)
      title <- xmlValue(children$title)
      author <- xmlValue(children$author)
      # keep every <link> child, since an entry may have several
      link <- children[names(children) %in% "link"]
      link <- sapply(link, function(y) xmlGetAttr(y, "href"))
      data.frame(title, author, link, stringsAsFactors = FALSE)
    })

    > out
    [[1]]
                                                   title            author
    1 Soap opera star in serious injury accident in Ohio CNHI News Service
    2 Soap opera star in serious injury accident in Ohio CNHI News Service
                                                                                                                                                                                      link
    1 http://www.stwnewspress.com/cnhi_network/article_71fb99db-0d47-5ead-9276-cae9c947babc.html
    2 http://bloximages.chicago2.vip.townnews.com/stwnewspress.com/content/tncms/assets/v3/editorial/d/97/d97a9815-29c8-5b90-be11-41a3a8b12e9f/54354a7b66bd9.image.jpg?resize=300%2C450

    [[2]]
                                                 title
    1 Q5: Reaches close to voter registration deadline
                                        author
    1 Michelle Charles / Stillwater News Press
                                                                                               link
    1 http://www.stwnewspress.com/news/local_news/article_ba35bd60-4ea4-11e4-8da8-93d495865336.html

Then you can rbind the individual entries together:

    res <- do.call(rbind.data.frame, out)

    > str(res)
    'data.frame':   147 obs. of  3 variables:
     $ title : chr  "Soap opera star in serious injury accident in Ohio" "Soap opera star in serious injury accident in Ohio" "Q5: Reaches close to voter registration deadline" "Oklahoma State Investigation" ...
     $ author: chr  "CNHI News Service" "CNHI News Service" "Michelle Charles / Stillwater News Press" "Megan Sando / Stillwater News Press" ...
     $ link  : chr  "http://www.stwnewspress.com/cnhi_network/article_71fb99db-0d47-5ead-9276-cae9c947babc.html" "http://bloximages.chicago2.vip.townnews.com/stwnewspress.com/content/tncms/assets/v3/editorial/d/97/d97a9815-29c8-5b90-be11-41a"| __truncated__ "http://www.stwnewspress.com/news/local_news/article_ba35bd60-4ea4-11e4-8da8-93d495865336.html" "http://www.stwnewspress.com/news/local_news/article_7023a110-4ea4-11e4-82dd-f735d5c5ed44.html" ...
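The per-entry approach can be tried without network access on a small inline feed. The sketch below is only an illustration: the two-entry feed, the example.com URLs, and the names `feed`, `doc`, `n_titles`, `n_links`, and `per_entry` are made up for this example, and the Atom namespace declaration is left out on purpose so that a plain "//entry" path works with xmlParse().

```r
library(XML)

# Hypothetical two-entry feed; the second entry carries two <link> elements.
# The Atom namespace declaration is omitted so plain XPath works with xmlParse().
feed <- '<feed>
  <entry>
    <title>First story</title>
    <author><name>Author A</name></author>
    <link href="http://example.com/a.html"/>
  </entry>
  <entry>
    <title>Second story</title>
    <author><name>Author B</name></author>
    <link href="http://example.com/b.html"/>
    <link href="http://example.com/b.jpg"/>
  </entry>
</feed>'

doc <- xmlParse(feed, asText = TRUE)

# Document-wide queries return vectors of different lengths,
# which is the mismatch described in the question.
n_titles <- length(xpathSApply(doc, "//entry/title", xmlValue))
n_links  <- length(xpathSApply(doc, "//entry/link", xmlGetAttr, "href"))

# Per-entry extraction keeps every link next to its own title and author.
per_entry <- xpathApply(doc, "//entry", function(x) {
  children <- xmlChildren(x)
  links <- children[names(children) %in% "link"]
  data.frame(
    title  = xmlValue(children$title),
    author = xmlValue(children$author),
    link   = unname(sapply(links, xmlGetAttr, "href")),
    stringsAsFactors = FALSE
  )
})
res <- do.call(rbind.data.frame, per_entry)
```

Here the entry with two links contributes two rows, with its title and author repeated, so the columns always line up.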
To understand how the function works, look at what x is:

    url <- "http://www.stwnewspress.com/search/?mode=article&q=&nsa=eedition&t=article&l=1000&s=&sd=desc&f=atom&d=&d1=&d2="
    pageSource <- getURL(url, encoding = "UTF-8")
    parsed <- htmlParse(pageSource)
    # x is the first <entry> node, i.e. what the function receives on its first call
    x <- parsed["//entry"][[1]]
    children <- xmlChildren(x)

    > names(children)
    [1] "title"    "author"   "link"     "id"       "content"  "category"
    [7] "updated"
    > children$title
    <title>BYRON YORK: Jindal is a GOP dark horse in 2016 race</title>
    > xmlValue(children$title)
    [1] "BYRON YORK: Jindal is a GOP dark horse in 2016 race"
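If only one link per article is wanted, as in the question, the same per-entry function can keep a single link instead of all of them. The sketch below assumes that the first link child of each entry is the article URL. The sample output in this thread suggests that ordering, but nothing in the feed guarantees it. The inline feed, the example.com URLs, and the helper name `one_row_per_entry` are hypothetical, and the Atom namespace is again omitted so that "//entry" works with xmlParse().

```r
library(XML)

# Hypothetical feed: the second entry has an article link followed by an image link.
feed <- '<feed>
  <entry>
    <title>First story</title>
    <author><name>Author A</name></author>
    <link href="http://example.com/a.html"/>
  </entry>
  <entry>
    <title>Second story</title>
    <author><name>Author B</name></author>
    <link href="http://example.com/b.html"/>
    <link href="http://example.com/b.jpg"/>
  </entry>
</feed>'

doc <- xmlParse(feed, asText = TRUE)

# Hypothetical helper: build exactly one row per <entry>.
one_row_per_entry <- function(x) {
  children <- xmlChildren(x)
  links <- children[names(children) %in% "link"]
  data.frame(
    title  = xmlValue(children$title),
    author = xmlValue(children$author),
    # Assumption: the first <link> is the article URL; NA if the entry has none.
    link   = if (length(links) > 0) xmlGetAttr(links[[1]], "href") else NA_character_,
    stringsAsFactors = FALSE
  )
}

articles <- do.call(rbind.data.frame, xpathApply(doc, "//entry", one_row_per_entry))
```

With this variant the result has as many rows as there are entries, so titles, authors, and links can sit in one data frame without duplicated rows.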