Extracting data between two tags in HTML file

遥遥无期 2020-12-04 00:43

I've got a HUUUGE HTML file here saved on my system, which contains data from a product catalogue. The data is structured such that for each product record the name is bet

2 Answers
  • 2020-12-04 01:30

    A file of size 50 MB isn't so big that you can't just load its contents directly into MATLAB as a string, which you can do with the function FILEREAD:

    strContents = fileread('yourfile.html');
    

    Assuming the file format you have above, you can then parse the contents with the function REGEXP (using named token capture):

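    %# Match <tag>'value'</tag> for the three known tags; \k<tag> requires
    %# the closing tag to repeat the name captured by the opening tag
    expr = '<(?<tag>name|prodId|color)>''([^<>]+)''</\k<tag>>';
    tokens = regexp(strContents,expr,'tokens');  %# One {tag, value} cell per match
    tokens = vertcat(tokens{:});                 %# Stack into an N-by-2 cell array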
    

    And the contents of tokens using your sample file contents will be:

    tokens = 
    
        'name'      'hat'        
        'prodId'    '1829493'    
        'color'     'cyan'       
        'name'      'shirt'      
        'prodId'    '193'        
        'name'      'dress'      
        'prodId'    '18'         
        'color'     'dark purple'
    

    You may then want to parse the resulting N-by-2 cell array and place the contents in a structure array with fields 'name', 'prodId', and 'color'. The difficulty is that not every entry will have all three fields. Assuming each 'name' will be followed by either a 'prodId', a 'color', or both (in the order 'prodId' then 'color'), then the following code should work for you:

    s = struct('name',[],'prodId',[],'color',[]);  %# Initialize structure
    nTokens = size(tokens,1);                      %# Get number of tokens
    nameIndex = find(strcmp(tokens(:,1),'name'));  %# Find indices of 'name'
    [s(1:numel(nameIndex)).name] = deal(tokens{nameIndex,2});  %# Fill 'name' field
    
    %# Find and fill 'prodId' that follows a 'name':
    index = strcmp(tokens(min(nameIndex+1,nTokens),1),'prodId');
    [s(index).prodId] = deal(tokens{nameIndex(index)+1,2});
    
    %# Find and fill 'color' that follows a 'name':
    index = strcmp(tokens(min(nameIndex+1,nTokens),1),'color');
    [s(index).color] = deal(tokens{nameIndex(index)+1,2});
    
    %# Find and fill 'color' that follows a 'prodId':
    index = strcmp(tokens(min(nameIndex+2,nTokens),1),'color');
    [s(index).color] = deal(tokens{min(nameIndex(index)+2,nTokens),2});
    

    And the contents of s using your sample file contents will be:

    >> s(1)
    
          name: 'hat'
        prodId: '1829493'
         color: 'cyan'
    
    >> s(2)
    
          name: 'shirt'
        prodId: '193'
         color: []
    
    >> s(3)
    
          name: 'dress'
        prodId: '18'
         color: 'dark purple'
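
    If the ordering assumption above does not hold for every record, a plain loop over the rows of tokens may be a more forgiving alternative. The following is only a sketch, not part of the approach above: it starts a new record at each 'name' row and uses dynamic field names to store whatever fields follow it. The tokens array here is typed in by hand from the listing above so the snippet runs on its own.

    ```matlab
    % Token pairs copied from the listing above (normally produced by regexp).
    tokens = {'name',   'hat';     'prodId', '1829493'; 'color', 'cyan'; ...
              'name',   'shirt';   'prodId', '193'; ...
              'name',   'dress';   'prodId', '18';      'color', 'dark purple'};

    % Start with an empty struct array that already has the three fields.
    s = struct('name', {}, 'prodId', {}, 'color', {});
    k = 0;                                % index of the current record
    for i = 1:size(tokens, 1)
        tag = tokens{i, 1};
        if strcmp(tag, 'name')
            k = k + 1;                    % each 'name' starts a new record
        end
        if k > 0                          % skip fields seen before the first 'name'
            s(k).(tag) = tokens{i, 2};    % dynamic field name; unset fields stay []
        end
    end
    ```

    This trades the vectorized indexing for a loop, so it will likely be slower on a very large token list, but it does not care in which order 'prodId' and 'color' appear after a 'name'.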
    
  • 2020-12-04 01:30

    There are two ways of solving this sort of problem: string manipulation with regexes (as suggested by gnovice) or parsing the file (or a mix of the two). Parsing is often best if your file is very well structured; regexes win for messy files.

    Here's the parsing solution.

    Start by downloading xml_io_tools and calling xml_read on your file. Your example isn't completely reproducible, so here are two different versions of the data.

    Save this to test1.xml:

    <?xml version="1.0" encoding="utf-8"?>
    <root>
    <name>'hat'</name>
    <prodId>'1829493'</prodId>
    <color>'cyan'</color>
    <name>'dress'</name>
    <prodId>'18'</prodId>
    <color>'dark purple'</color>
    </root>
    

    Save this to test2.xml:

    <?xml version="1.0" encoding="utf-8"?>
    <root>
    <item>
    <name>'hat'</name>
    <prodId>'1829493'</prodId>
    <color>'cyan'</color>
    </item>
    <item>
    <name>'dress'</name>
    <prodId>'18'</prodId>
    <color>'dark purple'</color>
    </item>
    </root>
    

    Now compare

    x1 = xml_read('test1.xml')
    x2 = xml_read('test2.xml')
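
    If you would rather not install anything, the same parsing idea can be sketched with MATLAB's built-in xmlread, which returns a Java DOM document. This swaps xml_read for a different function, and it assumes the <item>-wrapped layout of test2.xml saved above; the variable names are just for illustration.

    ```matlab
    % Sketch: parse test2.xml with the built-in xmlread (Java DOM) instead of xml_read.
    % Assumes test2.xml from above is in the current folder.
    doc    = xmlread('test2.xml');
    items  = doc.getElementsByTagName('item');
    n      = items.getLength();
    fields = {'name', 'prodId', 'color'};

    % Preallocate one record per <item>, with all fields empty.
    s = repmat(struct('name', [], 'prodId', [], 'color', []), 1, n);
    for k = 1:n
        item = items.item(k - 1);                  % DOM node lists are zero-based
        for f = 1:numel(fields)
            nodes = item.getElementsByTagName(fields{f});
            if nodes.getLength() > 0               % a record may lack a field
                txt = char(nodes.item(0).getTextContent());
                % Drop the single quotes that surround each value in the file
                s(k).(fields{f}) = regexprep(txt, '^''|''$', '');
            end
        end
    end
    ```

    Because each record sits inside its own <item>, a missing field simply stays empty, with no need to guess which 'color' belongs to which 'name'.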
    