How is Q-learning different from value iteration in reinforcement learning?
I know that Q-learning is model-free and that its training samples are transitions (s, a, s', r).
I don't think the accepted answer captures the essence of the difference. To quote the latest edition of Sutton and Barto's book:
" Having q∗ makes choosing optimal actions even easier. With q∗, the agent does not even have to do a one-step-ahead search: for any state s, it can simply find any action that maximizes q∗(s; a). The action-value function effectively caches the results of all one-step-ahead searches. It provides the optimal expected long-term return as a value that is locally and immediately available for each state{action pair. Hence, at the cost of representing a function of state{action pairs, instead of just of states, the optimal action value function allows optimal actions to be selected without having to know anything about possible successor states and their values, that is, without having to know anything about the environment’s dynamics. "
Usually, in real problems, the agent doesn't know the world (i.e. the transition) dynamics, but we definitely know the rewards, because those are what the environment gives back during the interaction, and the reward function is something we define ourselves.
The real difference between Q-learning and plain value iteration is this: after you have computed V*, you still need to do a one-step look-ahead over successor states to identify the optimal action for each state, and that look-ahead requires the transition dynamics. But if you have Q*, the optimal policy is simply to pick the action a that maximizes Q*(s, a).
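Here is a minimal tabular NumPy sketch of that contrast; the MDP arrays P and R, the tiny state/action counts, and the hyperparameters (gamma, alpha, epsilon, iteration counts) are all made-up for illustration. Value iteration uses the model P both in its backup and in the final greedy look-ahead, while Q-learning only consumes sampled transitions (s, a, r, s') and reads the greedy action straight off the Q-table.

```python
import numpy as np

# Made-up tabular MDP for illustration:
# P[s, a, s'] = transition probability, R[s, a] = expected reward.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # shape (S, A, S')
R = rng.normal(size=(n_states, n_actions))

# Value iteration: needs the model (P, R) for the backup AND for the
# one-step look-ahead that extracts the greedy policy from V*.
V = np.zeros(n_states)
for _ in range(1000):
    V = np.max(R + gamma * P @ V, axis=1)             # Bellman optimality backup
policy_from_V = np.argmax(R + gamma * P @ V, axis=1)  # look-ahead uses P again

# Q-learning: the update only touches sampled transitions (s, a, r, s');
# P and R appear below solely to simulate the environment's responses.
Q = np.zeros((n_states, n_actions))
alpha, epsilon, s = 0.1, 0.1, 0
for _ in range(50_000):
    a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
    s_next = rng.choice(n_states, p=P[s, a])           # environment samples s'
    r = R[s, a]                                        # environment emits reward
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])  # model-free update
    s = s_next
policy_from_Q = np.argmax(Q, axis=1)                   # no look-ahead, no model

print(policy_from_V, policy_from_Q)
```

Note that extracting `policy_from_Q` is a pure table lookup, whereas `policy_from_V` has to re-evaluate the expectation over successors through P, which is exactly the extra knowledge a model-free agent doesn't have.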