Good morning guys
Is there a good way to use regular expressions in C# to find all file names and their paths within a string variable?
For e
Here's something I came up with:
using System;
using System.Text.RegularExpressions;

public class Test
{
    public static void Main()
    {
        string s = @"Hello John these are the files you have to send us today:
C:\projects\orders20101130.docx also we would like you to send
C:\some\file.txt, C:\someother.file and d:\some file\with spaces.ext
Thank you";
        Extract(s);
    }

    // Drive letter, optional directories, then a file name with an extension (group 1).
    private static readonly Regex rx = new Regex
        (@"[a-z]:\\(?:[^\\:]+\\)*((?:[^:\\]+)\.\w+)", RegexOptions.IgnoreCase);

    static void Extract(string text)
    {
        MatchCollection matches = rx.Matches(text);
        foreach (Match match in matches)
        {
            // match.Value is the full path; group 1 is the file name alone.
            Console.WriteLine("'{0}', file: '{1}'", match.Value, match.Groups[1].Value);
        }
    }
}
Produces: (see on ideone)
'C:\projects\orders20101130.docx', file: 'orders20101130.docx'
'C:\some\file.txt', file: 'file.txt'
'C:\someother.file', file: 'someother.file'
'd:\some file\with spaces.ext', file: 'with spaces.ext'
The regex is not especially robust: it assumes a drive-letter root, directory and file names that contain no colon, and an extension made up only of word characters. It did work for your examples, though.
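If you would rather collect the results than print them, the same pattern can be wrapped in a method that returns the pairs. This is only a sketch: the class name PathFinder, the method FindFiles and the group name "file" are made up for illustration, and the pattern carries the same assumptions as the one above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PathFinder
{
    // Same pattern as above, with the file-name group given a name (hypothetical: "file").
    private static readonly Regex Rx = new Regex(
        @"[a-z]:\\(?:[^\\:]+\\)*(?<file>[^:\\]+\.\w+)",
        RegexOptions.IgnoreCase);

    // Returns one (full path, file name) pair per match found in the text.
    public static List<KeyValuePair<string, string>> FindFiles(string text)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Rx.Matches(text))
        {
            result.Add(new KeyValuePair<string, string>(m.Value, m.Groups["file"].Value));
        }
        return result;
    }

    public static void Main()
    {
        string s = @"Send C:\some\file.txt, C:\someother.file and d:\some file\with spaces.ext please";
        foreach (var pair in FindFiles(s))
        {
            Console.WriteLine("'{0}', file: '{1}'", pair.Key, pair.Value);
        }
    }
}
```

The named group reads a little better than Groups[1] once the pattern grows, but it behaves exactly like the numbered one.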
Here is a version of the program if you use <file> tags. Change the regex and Extract to:
private static readonly Regex rx = new Regex
    (@"<file>(.+?)</file>", RegexOptions.IgnoreCase);

static void Extract(string text)
{
    MatchCollection matches = rx.Matches(text);
    foreach (Match match in matches)
    {
        Console.WriteLine("'{0}'", match.Groups[1]);
    }
}
Also available on ideone.
If you use <file> tags and the final text can be represented as a well-formed XML document (or at least as inner XML, i.e. text without a root element), you can probably do:
var doc = new XmlDocument();
doc.LoadXml(String.Concat("<root>", input, "</root>"));
var files = doc.SelectNodes("//file");
or
var doc = new XmlDocument();
doc.AppendChild(doc.CreateElement("root"));
doc.DocumentElement.InnerXml = input;
var nodes = doc.SelectNodes("//file");
Both methods work and are object-oriented, especially the second one. They may also perform better than the regex.
See also - Don't parse (X)HTML using RegEx
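For completeness, here is a small runnable sketch of the LoadXml variant that reads the text of each <file> element. The class name FileTagReader, the method ReadFileTags and the sample input are made up for illustration. It assumes the input really is well-formed once wrapped in a root element; a stray & or < in the message makes LoadXml throw an XmlException.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class FileTagReader
{
    // Wraps the text in a root element and returns the inner text of every <file> node.
    public static List<string> ReadFileTags(string input)
    {
        var doc = new XmlDocument();
        doc.LoadXml(String.Concat("<root>", input, "</root>"));

        var files = new List<string>();
        foreach (XmlNode node in doc.SelectNodes("//file"))
        {
            files.Add(node.InnerText);
        }
        return files;
    }

    public static void Main()
    {
        // Hypothetical sample message.
        string input = @"Please send <file>C:\some\file.txt</file> and <file>d:\some file\with spaces.ext</file> today";
        foreach (string file in ReadFileTags(input))
        {
            Console.WriteLine(file);
        }
    }
}
```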
If you put some constraints on your filename requirements, you can use code similar to this:
string s = @"Hello John
these are the files you have to send us today: C:\Development\Projects 2010\Accounting\file20101130.csv, C:\Development\Projects 2010\Accounting\orders20101130.docx
also we would like you to send C:\Development\Projects 2010\Accounting\customersupdated.xls
thank you";
Regex regexObj = new Regex(
    @"\b[a-z]:\\(?:[^<>:""/\\|?*\n\r\0-\37]+\\)*[^<>:""/\\|?*\n\r\0-\37]+\.[a-z0-9\.]{1,5}",
    RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);
MatchCollection fileNameMatchCollection = regexObj.Matches(s);
foreach (Match fileNameMatch in fileNameMatchCollection)
{
    MessageBox.Show(fileNameMatch.Value);
}
In this case, I limited extensions to a length of 1-5 characters. You can obviously use another value or restrict the characters allowed in filename extensions further. The list of valid characters is taken from the MSDN article Naming Files, Paths, and Namespaces.
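Restricting the extensions further could look like the sketch below, which accepts only a fixed list of extensions. The list (csv, docx, xls), the class name ExtensionFilter and the method FindByExtension are assumptions for illustration. The control-character range is written as \x00-\x1F here, which covers the same characters as \n\r\0-\37 above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ExtensionFilter
{
    // One path component: anything except reserved and control characters.
    private const string Name = @"[^<>:""/\\|?*\x00-\x1F]+";

    // Drive letter, optional directories, file name, then one of the allowed extensions.
    private static readonly Regex Rx = new Regex(
        @"\b[a-z]:\\(?:" + Name + @"\\)*" + Name + @"\.(?:csv|docx|xls)\b",
        RegexOptions.IgnoreCase);

    // Returns every path in the text whose extension is on the list.
    public static List<string> FindByExtension(string text)
    {
        var paths = new List<string>();
        foreach (Match m in Rx.Matches(text))
        {
            paths.Add(m.Value);
        }
        return paths;
    }

    public static void Main()
    {
        // Hypothetical sample; notes.txt is skipped because .txt is not on the list.
        string s = @"Send C:\Development\Projects 2010\Accounting\file20101130.csv, C:\Development\Projects 2010\Accounting\orders20101130.docx and C:\temp\notes.txt";
        foreach (string path in FindByExtension(s))
        {
            Console.WriteLine(path);
        }
    }
}
```

The trailing \b keeps .xls from matching the start of a longer extension such as .xlsx.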