How to Create a Tag Parser
Today's post is non-LINQ related…
Download the Tag Parsing Source Code: Tag Parser Engine.zip (6.93 KB)
Recently, I had the need to create a document with markup tags (similar to how HTML uses <b></b>, for example). I wanted a simple, easy solution for parsing out the tags in the document, something that could be easily reused.
I found one solution on CodeProject (Tag Parser). While the author's idea was very good, I found his implementation to have some problems – most notably with returning each tag found via an event, instead of returning a list of the tags from the function call. So, I heavily modified the original idea (adding comments, reorganizing functions, improving efficiency in functions, returning results from the function call instead of through an event).
This code will allow you to take a string of text and parse out the tags. A tag starts with "<" and ends with "/>" (just like XHTML). Tags have a name, such as "link", and have attributes which hold additional information about the tag (such as "href" or "text"). An example tag would be:
<link href="http://www.google.com" text="Google" />
The tag is "link", and has attributes "href" and "text".
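To make the anatomy of a tag concrete, here is a minimal sketch that pulls the name and the attributes out of the example tag with two small regular expressions. It is only an illustration of the tag structure, not the parsing engine shown later in this post; the class name TagAnatomyDemo and its two helper methods are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, used only to illustrate the parts of a tag.
public static class TagAnatomyDemo
{
    // Returns the tag name: the identifier right after the opening "<".
    public static string TagName(string tag)
    {
        return Regex.Match(tag, @"^<(\w+)").Groups[1].Value;
    }

    // Returns every name="value" pair found in the tag (double-quoted values only).
    public static Dictionary<string, string> Attributes(string tag)
    {
        Dictionary<string, string> result = new Dictionary<string, string>();
        foreach (Match m in Regex.Matches(tag, @"(\w+)=""([^""]*)"""))
        {
            result[m.Groups[1].Value] = m.Groups[2].Value;
        }
        return result;
    }

    public static void Main()
    {
        const string tag = @"<link href=""http://www.google.com"" text=""Google"" />";

        Console.WriteLine("Tag: " + TagName(tag));
        foreach (KeyValuePair<string, string> pair in Attributes(tag))
        {
            Console.WriteLine(pair.Key + " = " + pair.Value);
        }
    }
}
```

For the example tag this reports the name "link" and the two attributes "href" and "text" with their values.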
Note: As of this writing, there is one caveat to using this code – the tag attributes must be in the same order as they were added to the TagAttributeCollection. In the example below, "href" must always come before "text".
The example below will parse the following string:
const string myContent = @"Sample Content <link href=""http://google.com"" text=""Click to Google""/> Some blah text <link href=""http://blog.linqexchange.com"" text=""LINQ Exchange"" />";
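The parser returns one result per link tag, each holding the href and text values found in that tag (the console application at the end of this post prints them).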

We first need to define our TagAttribute class. We also define a TagAttributeCollection, which allows our tag class to handle multiple attributes on a tag.
Tag Attribute Class
using System;

namespace TagParser
{
    /// <summary>
    /// Class to define attribute of a tag
    /// </summary>
    public class TagAttribute
    {
        public string Name { get; set; }
        public string Value { get; set; }

        private UInt16 _ordinal = 1;
        public UInt16 Ordinal
        {
            get { return _ordinal; }
            set { _ordinal = value; }
        }
    }

    /// <summary>
    /// Collection of attributes
    /// </summary>
    public class TagAttributeCollection : System.Collections.Generic.List<TagAttribute>
    {
        // next ordinal to hand out; ordinals start at 1
        private UInt16 ordinal = 1;

        public new void Add(TagAttribute attr)
        {
            attr.Ordinal = ordinal;
            ordinal++;
            base.Add(attr);
        }
    }
}
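The following sketch shows how the collection numbers its attributes. It repeats compact copies of the two classes above so that it compiles on its own. The OrdinalDemo class and its OrdinalsOf helper are hypothetical names used only for this illustration. It also shows one thing to watch for: Add is hidden with "new" rather than overridden, so an attribute added through a plain List<TagAttribute> reference skips the numbering and keeps the default ordinal of 1.

```csharp
using System;
using System.Collections.Generic;

// Compact copies of the classes above, repeated so this sketch compiles alone.
public class TagAttribute
{
    private ushort _ordinal = 1;
    public string Name { get; set; }
    public string Value { get; set; }
    public ushort Ordinal
    {
        get { return _ordinal; }
        set { _ordinal = value; }
    }
}

public class TagAttributeCollection : List<TagAttribute>
{
    private ushort ordinal = 1;

    public new void Add(TagAttribute attr)
    {
        attr.Ordinal = ordinal;
        ordinal++;
        base.Add(attr);
    }
}

// Hypothetical demo class, not part of the original source.
public static class OrdinalDemo
{
    // Returns the ordinals in collection order.
    public static List<ushort> OrdinalsOf(TagAttributeCollection attrs)
    {
        List<ushort> result = new List<ushort>();
        foreach (TagAttribute a in attrs)
        {
            result.Add(a.Ordinal);
        }
        return result;
    }

    public static void Main()
    {
        TagAttributeCollection attrs = new TagAttributeCollection();
        attrs.Add(new TagAttribute { Name = "href" });   // gets ordinal 1
        attrs.Add(new TagAttribute { Name = "text" });   // gets ordinal 2

        // Calling Add through the base type bypasses the hiding method,
        // so this attribute is never numbered and keeps the default of 1.
        List<TagAttribute> asList = attrs;
        asList.Add(new TagAttribute { Name = "target" });

        foreach (TagAttribute a in attrs)
        {
            Console.WriteLine(a.Ordinal + ": " + a.Name);
        }
    }
}
```

Since the ordinal is later used to look up a regex capture group, it is safest to always add attributes through a variable typed as TagAttributeCollection.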
Next, we need to define an Interface to hold the tag information. When we run the tag parser, we will create a custom class using this Interface as the template for storing the tag information.
Tag Parser Interface
using System;

namespace TagParser
{
    /// <summary>
    /// Interface to define your own tag
    /// </summary>
    public interface ITag
    {
        /// <summary>
        /// Property to get the Prefix for your tag.
        /// e.g. <link href="http://google.com"
        /// text="Click to Google"/>
        /// Over here link is your Prefix.
        /// </summary>
        string Prefix
        {
            get;
        }

        /// <summary>
        /// Property to get an indication whether the parsing
        /// should be case sensitive or not.
        /// Return true if you want case sensitive matching
        /// </summary>
        bool IsCaseSensitive
        {
            get;
        }

        /// <summary>
        /// Property to get the Type of tag. Return a small description.
        /// e.g. anchor
        /// </summary>
        string Type
        {
            get;
        }

        /// <summary>
        /// Property to get the Attributes of tag.
        /// In case of <link href="http://google.com"
        /// text="Click to Google"/> href and
        /// text are two attributes.
        /// </summary>
        TagAttributeCollection TagAttributes
        {
            get;
        }

        /// <summary>
        /// Clone this object to another object
        /// </summary>
        ITag Clone();
    }
}
Now that we have the interface defined, let's see how to create the class that holds the tag parsing information. You can create multiple tag classes this way, one per kind of tag, and pass each of them to the same instance of the parsing engine.
Notice the Clone() function. This is used by the tag parsing engine to get a new instance of the tag info, but retain the name/type/attributes data. Since we haven't done any work with the data in this class, we can just return a new instance of the class (which will populate the properties for us).
Example Tag Parsing Info Class
using TagParser;

/// <summary>
/// A sample class showing how you can define your own tag to be parsed.
/// </summary>
public class MyExampleTag : ITag
{
    // One collection per instance. A static collection would be shared by
    // every clone, so each parsed tag would overwrite the values of the others.
    private TagAttributeCollection attrCollection;

    #region ITag Members

    public ITag Clone()
    {
        return new MyExampleTag();
    }

    public string Prefix
    {
        get { return "link"; }
    }

    public bool IsCaseSensitive
    {
        get { return false; }
    }

    public string Type
    {
        get { return "anchor"; }
    }

    public TagAttributeCollection TagAttributes
    {
        get
        {
            // build the attribute list on first use
            if (attrCollection == null)
            {
                attrCollection = new TagAttributeCollection();

                TagAttribute attrHref = new TagAttribute();
                attrHref.Name = "href";
                attrCollection.Add(attrHref);

                TagAttribute attrText = new TagAttribute();
                attrText.Name = "text";
                attrCollection.Add(attrText);

                TagAttribute attrTarget = new TagAttribute();
                attrTarget.Name = "target";
                attrCollection.Add(attrTarget);
            }
            return attrCollection;
        }
    }

    #endregion
}
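As an illustration of defining more than one kind of tag, here is a sketch of a second tag class. The "image" tag, its "src" and "alt" attributes, and the MyImageTag name are all made up for this example. The sketch repeats compact stand-ins for TagAttribute, TagAttributeCollection and ITag so that it compiles on its own. It keeps the attribute collection per instance, so the copy returned by Clone() stores its values separately from the original.

```csharp
using System;
using System.Collections.Generic;

// Compact stand-ins for the types defined earlier, repeated so this compiles alone.
public class TagAttribute
{
    public string Name { get; set; }
    public string Value { get; set; }
    public ushort Ordinal { get; set; }
}

public class TagAttributeCollection : List<TagAttribute>
{
    private ushort ordinal = 1;

    public new void Add(TagAttribute attr)
    {
        attr.Ordinal = ordinal++;
        base.Add(attr);
    }
}

public interface ITag
{
    string Prefix { get; }
    bool IsCaseSensitive { get; }
    string Type { get; }
    TagAttributeCollection TagAttributes { get; }
    ITag Clone();
}

// Hypothetical second tag: <image src="..." alt="..." />
public class MyImageTag : ITag
{
    private readonly TagAttributeCollection attrs = new TagAttributeCollection();

    public MyImageTag()
    {
        attrs.Add(new TagAttribute { Name = "src" });
        attrs.Add(new TagAttribute { Name = "alt" });
    }

    public string Prefix { get { return "image"; } }
    public bool IsCaseSensitive { get { return false; } }
    public string Type { get { return "picture"; } }
    public TagAttributeCollection TagAttributes { get { return attrs; } }

    // A fresh instance has its own, empty attribute values.
    public ITag Clone() { return new MyImageTag(); }
}

public static class MultipleTagsDemo
{
    public static void Main()
    {
        ITag original = new MyImageTag();
        ITag copy = original.Clone();

        // Writing to the copy must not leak into the original.
        copy.TagAttributes[0].Value = "logo.png";
        Console.WriteLine(original.TagAttributes[0].Value == null); // prints True
    }
}
```

A class like this would be passed to the parser in the same way as MyExampleTag.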
Finally, the tag parsing engine. This is where the actual tag parsing is done. Regular Expressions are used heavily in this class.
Tag Parsing Engine
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// same namespace as ITag and TagAttribute so the types resolve
namespace TagParser
{
    public class clsTagParser
    {
        #region Variables

        // matches any HTML-style tag, with optional attributes
        private const string HTML_TAG_REGEX_PATTERN =
            @"<([A-Za-z_:]|[^\x00-\x7F])([A-Za-z0-9_:.-]|" +
            @"[^\x00-\x7F])*([ \n\t\r]+([A-Za-z_:]|[^\x00-\x7F])" +
            @"([A-Za-z0-9_:.-]|[^\x00-\x7F])*([ \n\t\r]+)?=([ \n\t\r]+)" +
            @"?(""[^<""]*""|'[^<']*'))*([ \n\t\r]+)?/?>";

        // {0} = tag prefix, {1} = alternation of attribute patterns
        private const string CONTENT_TAG_REGEX_PATTERN = @"<{0}\b(?>\s+(?:{1})|[^\s>]+|\s+)*>";

        // {0} = attribute name; the value is captured in a group
        private const string CONTENT_TAG_ATTRIBUTE_REGEX_PATTERN = @"{0}=""([^""]*)""";

        private Regex contentTagRegex;
        private ITag contentTag;

        // Variables
        #endregion

        #region Properties

        private static Regex tagRegEx;
        private static Regex TagRegEx
        {
            get
            {
                if (tagRegEx == null)
                {
                    tagRegEx = new Regex(HTML_TAG_REGEX_PATTERN,
                        RegexOptions.Singleline
                        | RegexOptions.IgnoreCase
                        | RegexOptions.CultureInvariant
                        | RegexOptions.IgnorePatternWhitespace);
                }
                return tagRegEx;
            }
        }

        // Properties
        #endregion

        #region Public Functions

        /// <summary>
        /// Parses a string to find matching tags
        /// </summary>
        /// <param name="content">the string to parse</param>
        /// <param name="tag">the tag information to find</param>
        public List<ITag> Parse(string content, ITag tag)
        {
            List<ITag> retList = new List<ITag>();

            // set the tag information, used by other functions
            contentTag = tag;

            // build our regex string
            ConstructHTMLTagRegex();

            foreach (Match m in TagRegEx.Matches(content))
            {
                // read the tag and convert to itag
                ITag matchedTag = ReadTag(m);

                // if we found something, add to the list
                if (matchedTag != null)
                    retList.Add(matchedTag);
            }

            // return the list
            return retList;
        }

        // Public Functions
        #endregion

        #region Private Functions

        /// <summary>
        /// Build the regex used to find our tags in the content string
        /// </summary>
        private void ConstructHTMLTagRegex()
        {
            // build the attribute regex string
            StringBuilder sbAttributeRegex = new StringBuilder();
            foreach (TagAttribute attribute in contentTag.TagAttributes)
            {
                sbAttributeRegex.AppendFormat("{0}|", string.Format(
                    CONTENT_TAG_ATTRIBUTE_REGEX_PATTERN, attribute.Name));
            }

            // build the pattern string
            string tagRegexPattern = string.Format(CONTENT_TAG_REGEX_PATTERN,
                contentTag.Prefix, sbAttributeRegex);

            // build the content regex, which is used for finding our tags
            contentTagRegex = new Regex(tagRegexPattern,
                RegexOptions.Singleline
                | (contentTag.IsCaseSensitive ?
                    RegexOptions.Singleline :
                    RegexOptions.IgnoreCase)
                | RegexOptions.CultureInvariant
                | RegexOptions.IgnorePatternWhitespace);
        }

        /// <summary>
        /// Find tags in the regex match
        /// </summary>
        /// <param name="match">the matched tag</param>
        private ITag ReadTag(Match match)
        {
            // exit if nothing passed in
            if (match == null)
                return null;

            // find the tag itself
            Match matchTag = contentTagRegex.Match(match.Value);

            // read the attributes only if the match was successful
            return (matchTag.Success) ? ReadAttributes(matchTag) : null;
        }

        /// <summary>
        /// Get the attributes of the tag
        /// </summary>
        /// <param name="match">the tag to be read</param>
        private ITag ReadAttributes(Match match)
        {
            // set the return tag information
            ITag rTag = contentTag.Clone();

            int attribCount = contentTag.TagAttributes.Count;

            // loop through the attributes in the original
            // tag's attribute collection. Find any matches in the
            // content string
            for (int i = 0; i < attribCount; i++)
            {
                // get the attribute to test
                TagAttribute attr = contentTag.TagAttributes[i];

                // get the attribute value; the ordinal is the capture group index
                string attrValue = GetGroupCollectionValue(match.Groups, attr.Ordinal);

                // if we have a value, add to the collection
                if (!string.IsNullOrEmpty(attrValue))
                    rTag.TagAttributes[i].Value = attrValue;
            }

            // return the tag
            return rTag;
        }

        /// <summary>
        /// Returns the value for this attribute
        /// </summary>
        /// <param name="gc">the group to search</param>
        /// <param name="gcIndex">the index this attribute can be found</param>
        /// <returns>the captured value, or an empty string if there is none</returns>
        private string GetGroupCollectionValue(GroupCollection gc, int gcIndex)
        {
            string namedItemValue = string.Empty;
            try
            {
                namedItemValue = gc[gcIndex].Captures[0].ToString();
            }
            // explicitly eating up the exception to make this method generic
            // for all replacements.
            catch { }
            return namedItemValue;
        }

        // Private Functions
        #endregion
    }
}
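To see what the engine's content regex looks like once it has been assembled, here is a minimal standalone sketch. It uses the same two format strings as the engine and assumes the backslash escapes (\b, \s) are in place. The PatternDemo class and its BuildPattern helper are hypothetical names for this illustration, not part of the engine. Each attribute contributes one capture group, numbered in the order the attributes were added, which is why the attribute's Ordinal can be used as the group index.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical demo class that mirrors how the engine assembles its pattern.
public static class PatternDemo
{
    // Builds the content regex for a tag prefix and its attribute names.
    public static string BuildPattern(string prefix, string[] attributeNames)
    {
        StringBuilder alternatives = new StringBuilder();
        foreach (string name in attributeNames)
        {
            // one alternative (and one capture group) per attribute
            alternatives.AppendFormat(@"{0}=""([^""]*)""|", name);
        }
        return string.Format(@"<{0}\b(?>\s+(?:{1})|[^\s>]+|\s+)*>", prefix, alternatives);
    }

    public static void Main()
    {
        string pattern = BuildPattern("link", new[] { "href", "text" });
        Console.WriteLine(pattern);

        Match m = Regex.Match(
            @"<link href=""http://google.com"" text=""Click to Google""/>",
            pattern,
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        Console.WriteLine(m.Groups[1].Value); // value of href (first attribute added)
        Console.WriteLine(m.Groups[2].Value); // value of text (second attribute added)
    }
}
```

Printing the pattern this way is also a quick way to debug a tag class whose attributes are not being picked up.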
Now that we have all the code done, here is an example console app which demonstrates how to parse out the tags.
Tag Parsing Example Console Application
using System;
using System.Collections.Generic;
using TagParser;

class Program
{
    static void Main(string[] args)
    {
        const string myContent = @"Sample Content <link href=""http://google.com"" text=""Click to Google""/> Some blah text <link href=""http://blog.linqexchange.com"" text=""LINQ Exchange"" />";

        // There can be multiple tags of different kinds in your application.
        // Repeat the call to tp.Parse
        // with your respective ITag type
        clsTagParser tp = new clsTagParser();
        List<ITag> tagList = tp.Parse(myContent, new MyExampleTag());

        // output results
        foreach (ITag itag in tagList)
        {
            Console.WriteLine(string.Format("Tag Type is: {0}", itag.Type));
            Console.WriteLine("Attributes for the tag are:");

            // output this tag's attributes
            foreach (TagAttribute attr in itag.TagAttributes)
            {
                Console.WriteLine(
                    string.Format(
                        "\tAttribute Name: {0}\tAttribute Value: {1}",
                        attr.Name, attr.Value));
            }
        }

        Console.ReadLine();
    }
}
