Based on the list below, I have to create a DataFrame with "state" and "region" columns:
Original data:
Alabama[edit]
Auburn (Auburn Universi
You can find an example of cleaning this dataset in the tutorial Pythonic Data Cleaning With NumPy and Pandas.
You can use a single for-loop over the lines of the file and build the list in O(n) time:
import pandas as pd

university_towns = []
with open('input/university_towns.txt') as file:
    for line in file:
        edit_pos = line.find('[edit]')
        if edit_pos != -1:
            # Remember this `state` until the next one is found
            state = line[:edit_pos]
        else:
            # Otherwise, we have a town; keep `state` as last seen
            parens = line.find(' (')
            town = line[:parens] if parens != -1 else line.rstrip('\n')
            university_towns.append((state, town))

towns_df = pd.DataFrame(university_towns,
                        columns=['State', 'RegionName'])
Alternatively, you can do the string processing with Pandas' .str
accessor:
import pandas as pd

university_towns = []
with open('input/university_towns.txt') as file:
    for line in file:
        if '[edit]' in line:
            # Remember this `state` until the next one is found
            state = line
        else:
            # Otherwise, we have a town; keep `state` as last seen
            university_towns.append((state, line))

towns_df = pd.DataFrame(university_towns,
                        columns=['State', 'RegionName'])
towns_df['State'] = towns_df.State.str.replace(r'\[edit\]\n', '', regex=True)
towns_df['RegionName'] = (towns_df.RegionName
                          .str.strip()
                          .str.replace(r' \(.*', '', regex=True)
                          .str.replace(r'\[.*', '', regex=True))
Output:
>>> towns_df.head()
State RegionName
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
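If you want to avoid the Python-level loop entirely, the same reconstruction can be done with vectorized string operations: mark the state lines, forward-fill the state down to the town rows, then drop the state rows. A minimal sketch, with a few sample lines hard-coded for illustration; in practice you would read them from the file as above:

```python
import pandas as pd

lines = pd.Series([
    'Alabama[edit]',
    'Auburn (Auburn University)',
    'Florence (University of North Alabama)',
    'Alaska[edit]',
    'Fairbanks (University of Alaska Fairbanks)',
])

# State rows are the ones carrying the '[edit]' marker
is_state = lines.str.contains(r'\[edit\]', regex=True)

towns_df = pd.DataFrame({
    # Keep the state name only on state rows, then forward-fill
    # it onto the town rows that follow
    'State': (lines.str.replace(r'\[edit\]', '', regex=True)
                   .where(is_state)
                   .ffill()),
    # Strip the parenthesized university list from town rows
    'RegionName': lines.str.replace(r' \(.*', '', regex=True),
})[~is_state].reset_index(drop=True)
```

This gives the same State/RegionName pairs as the loop versions; whether it is clearer is a matter of taste, but it keeps all the work inside pandas.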