I am currently exploring the possibility of extracting country names from author affiliations in PubMed articles. My sample data looks like:
Mechanical and Production Engineering Department, National University of Singapore.
Here is a simple solution that might get you started some of the way. It makes use of a database containing city and country data in the maps package. If you can get hold of a better database, it should be simple to modify the code.
library(maps)
library(plyr)
# Load data from package maps
data(world.cities)
# Create test data
aa <- c(
"Mechanical and Production Engineering Department, National University of Singapore.",
"Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, U.K.",
"Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK.",
"Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285."
)
# Remove punctuation from data
caa <- gsub("[[:punct:]]", "", aa)
# Split data at word boundaries
saa <- strsplit(caa, " ")
# Match on cities in world.cities
# Assumes that if there are multiple matches, the last one takes precedence, i.e. max()
llply(saa, function(x) x[max(which(x %in% world.cities$name))])
# Match on countries in world.cities$country.etc
llply(saa, function(x) x[which(x %in% world.cities$country.etc)])
This is the result for cities:
[[1]]
[1] "Singapore"
[[2]]
[1] "Cambridge"
[[3]]
[1] "Cambridge"
[[4]]
[1] "Indianapolis"
And the result for countries:
[[1]]
[1] "Singapore"
[[2]]
[1] "UK"
[[3]]
[1] "UK"
[[4]]
character(0)
With a bit of data cleanup you may be able to do something with this.
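One way to start that cleanup, as a sketch building on the code above (saa and world.cities are assumed from there): collect the city and country matches into a single data frame so missing values show up as NA. The safe_pick() helper is hypothetical, not from maps or plyr; it just applies the same "last match wins" rule while handling the no-match case.
# Hypothetical helper: return the last matching element, or NA if none match
safe_pick <- function(x, pool) {
  hits <- x[x %in% pool]
  if (length(hits) == 0) NA_character_ else tail(hits, 1)
}
# Combine city and country matches into one data frame for easier cleanup
ldply(saa, function(x) data.frame(
  city    = safe_pick(x, world.cities$name),
  country = safe_pick(x, world.cities$country.etc),
  stringsAsFactors = FALSE
))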
@Andrie's answer is nice, but it misses cities and countries whose names are more than one word, e.g. New Zealand or New York. The second example is a concern, as it would be labelled a match to York, UK rather than New York, USA.
This alternative should capture those cases a bit better.
library(maps)
library(plyr)
# Load data from package maps
data(world.cities)
# Create test data
aa <- c(
"Mechanical and Production Engineering Department, National University of Singapore.",
"Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, U.K.",
"Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK.",
"Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285."
)
# Split each affiliation into fields at ", " so multi-word names stay intact
saa <- sapply(aa, strsplit, split = ", ", USE.NAMES = FALSE)
# Match whole fields against cities in world.cities
llply(saa, function(x) x[which(x %in% world.cities$name)])
# Match whole fields against countries in world.cities$country.etc
llply(saa, function(x) x[which(x %in% world.cities$country.etc)])
The downside is that any entry without a separate country or city field will not return anything, e.g. the University of Singapore example.
Cities:
[[1]]
character(0)
[[2]]
[1] "Cambridge"
[[3]]
[1] "Cambridge"
[[4]]
[1] "Indianapolis"
That is less of an issue for me than the multi-word city/country problem. Choose whichever is a better fit for your data. Maybe there's a way of combining the two?
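For what it's worth, here is a rough sketch of one way to combine them, assuming the aa vector and world.cities data from above: try whole-field matching first (which preserves multi-word names), then fall back to word-level matching when no field matches. The match_place() function is only an illustration, not from either answer.
# Try matching whole ", "-separated fields first (handles multi-word names),
# then fall back to matching individual words if no field matches
match_place <- function(affiliation, pool) {
  fields <- strsplit(gsub("\\.$", "", affiliation), ", ")[[1]]
  hit <- fields[fields %in% pool]
  if (length(hit) == 0) {
    words <- strsplit(gsub("[[:punct:]]", "", affiliation), " ")[[1]]
    hit <- words[words %in% pool]
  }
  if (length(hit) == 0) NA_character_ else tail(hit, 1)
}
# Countries, then cities
sapply(aa, match_place, pool = world.cities$country.etc, USE.NAMES = FALSE)
sapply(aa, match_place, pool = world.cities$name, USE.NAMES = FALSE)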
One approach would be to split your strings in order to isolate the geographical information (for example by deleting everything up to the first comma), and then submit the result to a geocoding service.
For example, the Google Geocoding API lets you send an address and get back a location and the corresponding geographical information, such as the country. I don't think there is a ready-made R package for this, but you can find some functions here, for example:
Geocoding in R with Google Maps
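To give an idea of what that could look like in R, here is a sketch only; the use of httr and jsonlite is my choice rather than anything from the linked post, and you would need your own API key.
library(httr)
library(jsonlite)
# Query the Google Geocoding API for an address and extract the country
# component from the response (returns NA if the lookup fails)
geocode_country <- function(address, api_key) {
  resp <- GET("https://maps.googleapis.com/maps/api/geocode/json",
              query = list(address = address, key = api_key))
  parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"),
                     simplifyVector = FALSE)
  if (parsed$status != "OK") return(NA_character_)
  for (comp in parsed$results[[1]]$address_components) {
    if ("country" %in% unlist(comp$types)) return(comp$long_name)
  }
  NA_character_
}
# Example: drop everything up to the first comma, then geocode
# geocode_country(sub("^[^,]*,\\s*", "", aa[2]), api_key = "YOUR_KEY")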
There are also extensions in other languages, such as Ruby:
http://geokit.rubyforge.org/
It also depends on the number of observations you have; the free Google API, for example, is limited to about 200 addresses per IP per day, if I remember correctly.