I am trying to create an array of field names that I can use later in my script. Regular expressions are kicking my butt. I haven't written code in a long time. The fiel
Your sample data isn't XML. Your slashes are backwards. Assuming it is XML you're trying to parse, the answer is 'don't use regular expressions'.
They're simply not able to cope with the recursion and nesting to the degree necessary.
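To make the nesting problem concrete, here is a minimal sketch. The `<list>`/`<item>` markup is made-up sample data, not taken from the question; it just shows a simple non-greedy pattern pairing the outer opening tag with the inner closing tag.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical sample: an <item> nested inside another <item>.
my $xml = '<list><item>outer <item>inner</item> tail</item></list>';

# A naive pattern stops at the first </item> it finds, which belongs
# to the inner element, not the outer one the match started on.
my ($captured) = $xml =~ m{<item>(.*?)</item>};

print "$captured\n";    # prints: outer <item>inner
```

The capture is neither the outer element's content nor the inner one's, which is the kind of mismatch a real parser avoids.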
So with that in mind, assuming your sample data is actually well-formed XML and the backslashes are a typo, something like XML::Twig will do it quite handily:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig -> parse ( \*DATA );
#extract a single field value
print $twig -> root -> first_child_text('title'),"\n";
#get a field name
print $twig -> root -> first_child -> tag,"\n";
#can also use att() if you have attributes
print "Field names:\n";
#children() returns all the children of the current (in this case root) node
#We use map to access all, and tag to read their 'name'.
#att or trimmed_text would do other parts of the XML.
print join ( "\n", map { $_ -> tag } $twig -> root -> children );
__DATA__
<XML>
<record>DEFECT000179</record><state>Approved</state><title>Something is broken</title>
</XML>
This prints:
Something is broken
record
Field names:
record
state
title
You also have a variety of other really useful tools, such as pretty_print for formatting your output XML, twig_handlers that let you manipulate XML as you parse (particularly handy with purge), cut and paste to move nodes around, and get_xpath to let you use an XPath expression to find elements based on path and attributes.
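As a rough sketch of two of those tools, the following uses twig_handlers to collect a value during parsing and get_xpath to look one up afterwards. It reuses the sample record from the script above, held in a string; the variable names (`@states_seen`, `$title`) are just illustrative.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

# Same sample record as in the script above, held in a string for this sketch.
my $xml = '<XML><record>DEFECT000179</record><state>Approved</state>'
        . '<title>Something is broken</title></XML>';

# twig_handlers: the callback fires as each <state> element finishes parsing.
my @states_seen;
my $twig = XML::Twig->new(
    twig_handlers => {
        state => sub {
            my ( $t, $elt ) = @_;
            push @states_seen, $elt->text;
        },
    },
);
$twig->parse($xml);

# get_xpath: look up elements by path once parsing is done.
my ($title) = $twig->get_xpath('//title');

print "state: $_\n" for @states_seen;
print "title: ", $title->text, "\n";
```

For large files you would typically call purge on the twig inside the handler to free what has already been processed. It is left out here because purged elements are no longer available to a later get_xpath call.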
Edit: Based on comments, if you really want to extract data from:
</something>
The thing that's going wrong in your regex is that .* is greedy: it matches as much as it can, so it runs on to the last > on the line. You either need to use a negated character class, like:
m,</([^>]+)>,g
Or a nongreedy match:
m,</(.*?)>,g
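A small side-by-side sketch of the difference, assuming a well-formed version of the sample line (forward slashes in the closing tags). The variable names are just for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Well-formed version of the sample line (forward slashes in closing tags).
my $line = '<record>DEFECT000179</record><state>Approved</state>'
         . '<title>Something is broken</title>';

my @greedy  = $line =~ m{</(.*)>}g;       # .* runs on to the last '>' on the line
my @lazy    = $line =~ m{</(.*?)>}g;      # .*? stops at the first '>'
my @negated = $line =~ m{</([^>]+)>}g;    # [^>]+ cannot cross a '>'

print "greedy:  ", scalar(@greedy), " match\n";
print "lazy:    @lazy\n";
print "negated: @negated\n";
```

The greedy version produces a single match that swallows everything from the first closing tag to the end of the line, while the other two each return record, state and title.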
Oh, and given you've a backslash - you need to escape it:
my $firstLineOfXMLFile = '<record>DEFECT000179<\record><state>Approved<\state><title>Something is broken<\title>';
my @fieldNames = $firstLineOfXMLFile =~ m(<\\(.*?)>)g;
print join( "\n", @fieldNames ), "\n";    # one field name per line
That will do the trick. (But seriously, deliberately creating something that looks like XML but isn't is a really bad thing to do.)