R - Quirky behavior with rvest Wikipedia scraping


Sorry, this is a weird question, but I can't seem to figure it out myself. The good news is: I think it's totally reproducible.

I'm trying to build a simple R function that uses {rvest} to scrape Wikipedia for the hometown of musicians. Basically, the function I wrote works for some artists but doesn't work for others (it returns NULL). (Randy Newman is one such artist, so I'll use him as the example.)

When I run the whole thing (below), findhome("Randy Newman") returns NULL. But when I attempt to debug it, that is, I run the tablemusic() function definition, set artist <- "Randy Newman", and run the guts of the artistdata() function line by line, it works!

And then, once I've done that, I can run findhome("Randy Newman") and it works just fine. What gives?! Do I have things in the wrong order or something? I can't seem to figure it out.

Any help is appreciated. Here is the code:

library(rvest)

findhome <- function(artist) {

    ## function to find the table with the right info
    tablemusic <- function(data) {
        if(!any(grepl("years active|labels|instruments", data[,1], ignore.case=TRUE))) {
            for (i in 2:5) {
                data <- try(url %>% html %>% html_nodes(xpath=paste('//*[@id="mw-content-text"]/table[', i, ']', sep="")) %>% html_table(fill=TRUE), silent=TRUE)
                if(!class(data)=="try-error" & length(data)>0) {
                    if(class(data)!="data.frame") {data <- data.frame(data, stringsAsFactors=FALSE)}
                    if(any(grepl("years active|labels|instruments", data[,1], ignore.case=TRUE))) {
                        break
                    }
                }
            }
        }
        if(class(data)=="try-error" | length(data)<1) {
            data <- NULL
        } else if (!any(grepl("years active|labels|instruments", data[,1], ignore.case=TRUE))) {
            data <- NULL
        }
        data
    }

    ## function to pull the data, and try different pages if the first is wrong
    artistdata <- function(artist) {
        artist <- gsub(" ", "_", artist)
        artist <- gsub("'", "%27", artist)

        ## first try at getting the data
        url <- paste("https://en.wikipedia.org/wiki/", artist, sep="")
        data <- try(url %>% html %>% html_nodes(xpath='//*[@id="mw-content-text"]/table[1]') %>% html_table(fill=TRUE), silent=TRUE)

        ## check if it's the right page (to deal with disambiguation issues)
        if(!class(data)=="try-error" & length(data)>0) {
            if(class(data)!="data.frame") {data <- data.frame(data, stringsAsFactors=FALSE)}
            data <- tablemusic(data)
        }

        ## if try-error or no music table was found, try _(band)
        if(class(data)=="try-error" | is.null(data) | length(data)<1) {
            url <- paste("https://en.wikipedia.org/wiki/", artist, "_(band)", sep="")
            data <- try(url %>% html %>% html_nodes(xpath='//*[@id="mw-content-text"]/table[1]') %>% html_table(fill=TRUE), silent=TRUE)
            if(class(data)=="try-error"){
                data <- NULL
            } else {
                if(class(data)!="data.frame") {data <- data.frame(data, stringsAsFactors=FALSE)}
                data <- tablemusic(data)
            }
        } else {
            if(class(data)!="data.frame") {data <- data.frame(data, stringsAsFactors=FALSE)}
        }

        ## if try-error or no music table was found, try _(musician)
        if(class(data)=="try-error" | is.null(data) | length(data)<1) {
            url <- paste("https://en.wikipedia.org/wiki/", artist, "_(musician)", sep="")
            data <- try(url %>% html %>% html_nodes(xpath='//*[@id="mw-content-text"]/table[1]') %>% html_table(fill=TRUE), silent=TRUE)
            if(class(data)=="try-error"){
                data <- NULL
            } else {
                if(class(data)!="data.frame") {data <- data.frame(data, stringsAsFactors=FALSE)}
                data <- tablemusic(data)
            }
        } else {
            if(class(data)!="data.frame") {data <- data.frame(data, stringsAsFactors=FALSE)}
        }
        data
    }

    ## first try at finding the data
    data <- artistdata(artist)

    ## if that fails, try the part of the name before "and" / "&"
    if(is.null(data)){data <- artistdata(unlist(strsplit(artist, " and| &"))[1])}

    ## if there are no matches, return
    if(class(data)=="try-error" | is.null(data)) {
        data <- ""
        return()
    } else {
        if(class(data)!="data.frame") {data <- data.frame(data, stringsAsFactors=FALSE)}
    }

    ## if we have a matching page, pull the relevant data
    origin <- data[data[,1]=="Origin",2]
    if(length(origin)>0) {
        home <- origin
    } else {
        born <- data[data[,1]=="Born",2]
        if (length(born)>0) {
            home <- unlist(strsplit(born, "age.[0-9]+\\)"))[2]
        } else {
            home <- ""
        }
    }
    home
}

findhome("Randy Newman")

I figured it out. I had to add a url parameter to the tablemusic() function. As it was, it was recycling the url from past searches. Thanks for the suggestion.
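For anyone who hits the same thing, here is a minimal sketch of what I believe was going on. It needs no network access, and outer_buggy(), outer_fixed(), helper() and worker() are hypothetical stand-ins for findhome(), tablemusic() and artistdata(), not the real functions. The helper reads url as a free variable, so R looks it up where the helper was defined rather than where it was called from; the url set inside the sibling function is invisible to it, and the lookup falls through to the global environment (and finally to the base function url()).

```r
# Hypothetical stand-ins for the functions in the question; no scraping here.
outer_buggy <- function(name) {
  # helper() uses `url` as a free variable. It is not defined in
  # outer_buggy(), so R searches the global environment next, and
  # finally finds base::url, which is a function, not a string.
  helper <- function() {
    if (is.character(url)) url else NULL
  }
  worker <- function(name) {
    # This `url` is local to worker() and invisible to helper()
    url <- paste0("https://en.wikipedia.org/wiki/", name)
    helper()
  }
  worker(name)
}

outer_fixed <- function(name) {
  # Same structure, but the URL is passed in explicitly
  helper <- function(url) url
  worker <- function(name) {
    url <- paste0("https://en.wikipedia.org/wiki/", name)
    helper(url)
  }
  worker(name)
}

# Fresh session: no global `url`, so the helper finds nothing usable
res_before <- outer_buggy("Randy_Newman")

# Debugging line by line leaves a `url` behind in the global environment
url <- "https://en.wikipedia.org/wiki/Someone_Else"
res_after <- outer_buggy("Randy_Newman")   # picks up the stale global value

# Passing the URL as an argument does not depend on global state
res_fixed <- outer_fixed("Randy_Newman")

rm(url)  # clean up the leftover global
```

In this sketch res_before is NULL, res_after is the stale Someone_Else address, and res_fixed is the Randy_Newman address. That would explain why the function only started "working" after I had run the guts of artistdata() by hand.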

