Extracting a list of substrings from MongoDB using a Regular Expression

I need to extract a part of a string that matches a regex and return it.

I have a set of documents such as:

{\"_id\" :12121, \"fileName\" : \"apple.d


                      
3 Answers

盖世英雄少女心 (2021-01-15 06:04):

It's almost impossible to do this in the aggregation pipeline: you want to project your matches and include only the part after the period.
There is not (yet) an operator to locate the position of the period.
You need that position because $substr (https://docs.mongodb.com/manual/reference/operator/aggregation/substr/) requires a start position.
In addition, $regex is only for matching; you cannot use it in a projection to replace text.

I think for now it's easier to do it in code. There you could use a regex replace or any other solution provided by your language.
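
As a rough illustration of the "do it in code" route, here is a minimal JavaScript sketch. The helper name `extractExtension` is made up for this example, and the sample documents are hard-coded in the shape shown elsewhere in this thread; in a real application they would come from a driver query against the collection.

```javascript
// Hypothetical helper (not part of any MongoDB API): returns the extension,
// including the leading period, or null when the name has no extension.
function extractExtension(fileName) {
  const match = /\.[0-9a-z]+$/i.exec(fileName);
  return match ? match[0] : null;
}

// Sample documents, assumed to have already been fetched from the collection.
const docs = [
  { _id: 12121, fileName: "apple.doc" },
  { _id: 12125, fileName: "rap.txt" },
  { _id: 12126, fileName: "tap.pdf" },
  { _id: 12127, fileName: "cricket.txt" },
];

// One extension per document, in document order.
const extensions = docs.map((doc) => extractExtension(doc.fileName));
console.log(extensions); // [ '.doc', '.txt', '.pdf', '.txt' ]
```

Anchoring the pattern with `$` means only the final extension is returned, and a name without a period gives `null` instead of a bogus value.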
                                                                        
                                                        
            
            
              
                

攒了一身酷 (2021-01-15 06:30):

This will be possible in the upcoming version of MongoDB (as of this writing) using the aggregation framework and the $indexOfCP operator. Until then, your best bet is MapReduce.

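// Emit each document's extension: everything from the first period onward.
var mapper = function() { 
    emit(this._id, this.fileName.substring(this.fileName.indexOf(".")));
};

// Each _id is emitted exactly once, so the reduce function is never invoked
// and can be left empty.
db.coll.mapReduce(mapper, 
                  function(key, values) {}, 
                  { "out": { "inline": 1 }}
)["results"]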


Which yields:

[
    {
        "_id" : 12121,
        "value" : ".doc"
    },
    {
        "_id" : 12125,
        "value" : ".txt"
    },
    {
        "_id" : 12126,
        "value" : ".pdf"
    },
    {
        "_id" : 12127,
        "value" : ".txt"
    }
]




For completeness, here is the solution using the aggregation framework*

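db.coll.aggregate(
    [
        // Keep only documents whose fileName ends in an extension.
        { "$match": { "fileName": /\.[0-9a-z]+$/i } },
        { "$group": { 
            "_id": null,
            "extensions":  { 
                "$push": {
                    // Take everything from the first period to the end of the
                    // string (a length of -1 means "the rest of the string").
                    "$substr": [ 
                        "$fileName", 
                        { "$indexOfCP": [ "$fileName", "." ] }, 
                        -1 
                    ]
                }
            }
        }}
    ])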


which produces:

{ 
    "_id" : null, 
    "extensions" : [ ".doc", ".txt", ".pdf", ".txt" ] 
}




* Requires the current development version of MongoDB (as of the time of this writing).
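
To make the index arithmetic concrete, here is a rough plain-JavaScript analogue of what the $indexOfCP / $substr pair computes for a single value. The function name `extensionFromFirstPeriod` is invented for illustration, and the sketch assumes ASCII file names, where JavaScript string indices and code point indices coincide.

```javascript
// Hypothetical analogue of { $substr: [ s, { $indexOfCP: [ s, "." ] }, -1 ] }.
function extensionFromFirstPeriod(fileName) {
  const start = fileName.indexOf("."); // position of the FIRST period, or -1
  // $substr is documented to return an empty string for a negative start,
  // so the sketch does the same when there is no period.
  if (start === -1) return "";
  return fileName.slice(start); // from the period to the end of the string
}

console.log(extensionFromFirstPeriod("apple.doc"));      // .doc
console.log(extensionFromFirstPeriod("archive.tar.gz")); // .tar.gz
```

Because the cut happens at the first period, a name with several periods (the `archive.tar.gz` example here is hypothetical) keeps everything after the first one, which may or may not be what you want.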
                                                                        
                                                        
            
            
              
                

轮回少年 (2021-01-15 06:30):

Starting with MongoDB 4.2, the $regexFind aggregation operator makes things easier:

// { _id : 12121, fileName: "apple.doc" }
// { _id : 12125, fileName: "rap.txt" }
// { _id : 12126, fileName: "tap.pdf" }
// { _id : 12127, fileName: "cricket.txt" }
// { _id : 12129, fileName: "oops" }
db.collection.aggregate([
  { $set: { ext: { $regexFind: { input: "$fileName", regex: /\.\w+$/ } } } },
  { $group: { _id: null, extensions: { $addToSet: "$ext.match" } } }
])
// { _id: null, extensions: [ ".doc", ".pdf", ".txt" ] }


This makes use of:

The $set operator, which adds a new field to each of the documents.

This new field (ext) is the result of the $regexFind operator, which captures the result of a matching regex. If a match is found, it returns a document that contains information on the first match. If no match is found, it returns null. For instance:

For { fileName: "tap.pdf" }, it produces { ext: { match: ".pdf", idx: 3, captures: [] } }.
For { fileName: "oops" }, it produces { ext: null }.

Finally, using a $group stage, coupled with $addToSet on the match subfield, we can generate the list of distinct extensions.
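
The two stages can be mimicked in plain JavaScript to see what each one contributes. This is only a sketch: `regexFind` below is a made-up stand-in, not a MongoDB API, and it reports JavaScript string indices, which match MongoDB's code point indices only for simple names like these.

```javascript
// Hypothetical stand-in for $regexFind: information on the first match, or null.
function regexFind(input, regex) {
  const m = regex.exec(input);
  if (m === null) return null;
  return { match: m[0], idx: m.index, captures: m.slice(1) };
}

const fileNames = ["apple.doc", "rap.txt", "tap.pdf", "cricket.txt", "oops"];

// Analogue of the $set stage: attach an `ext` field to every document.
const withExt = fileNames.map((fileName) => ({
  fileName,
  ext: regexFind(fileName, /\.\w+$/),
}));

// Analogue of $group with $addToSet: collect the distinct matches,
// skipping documents for which nothing matched.
const extensions = new Set();
for (const doc of withExt) {
  if (doc.ext !== null) extensions.add(doc.ext.match);
}

console.log(withExt[2].ext); // { match: '.pdf', idx: 3, captures: [] }
console.log([...extensions]);
```

A JavaScript Set keeps insertion order, whereas $addToSet makes no ordering promise, so the order of the resulting list should not be relied on in either case.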

                                                                        
                                                        
            
            
              
                