Stack Exchange data dump

February 18, 2014

I was looking for an older Area 51 proposal and saw it was closed due to inactivity. Fortunately for me, Stack Exchange provides a data dump of all the questions and answers in XML.

I searched for an existing program that would let me quickly import and view the questions and answers, but I didn't find anything I could get running quickly. So instead, I threw together a few lines of code in LINQPad to view the data myself. It's not clean and it will probably throw an out-of-memory exception on larger files (see the XmlReader sketch at the end of this post), but it works well enough for the data I have.

To view questions sorted by score:

XDocument
	.Load(@"C:\temp\posts.xml")
	.Element("posts")
	.Elements("row")
	.Where(x => x.Attribute("PostTypeId").Value == "1")
	.OrderByDescending(x => Int32.Parse(x.Attribute("Score").Value))
	.Select(x => 
		new 
		{
			Score = x.Attribute("Score").Value,
			Id = x.Attribute("Id").Value,
			Title = x.Attribute("Title").Value,
			Body = Util.RawHtml(WebUtility.HtmlDecode(x.Attribute("Body").Value)),
		})
	.Dump();

To view answers for a specific question:

string parentId = "19422";

XDocument
	.Load(@"C:\temp\posts.xml")
	.Element("posts")
	.Elements("row")
	.Where(x => x.Attribute("ParentId") != null && x.Attribute("ParentId").Value == parentId)
	.OrderByDescending(x => Int32.Parse(x.Attribute("Score").Value))
	.Select(x => 
		new 
		{
			Score = x.Attribute("Score").Value,
			Body = Util.RawHtml(WebUtility.HtmlDecode(x.Attribute("Body").Value)),
		})
	.Dump();

To view a summary of all the questions and answers sorted by score:

void Main()
{
	// Order by PostTypeId so questions ("1") are processed before their answers ("2")
	var rows = XDocument
		.Load(@"C:\temp\posts.xml")
		.Element("posts")
		.Elements("row")
		.OrderBy(x => x.Attribute("PostTypeId").Value);
		
	List<Question> threads = new List<Question>();
	
	foreach (var row in rows)
	{
		if (row.Attribute("PostTypeId").Value == "1")
		{
			var t = new Question
			{
				AcceptedAnswerId = row.Attribute("AcceptedAnswerId") != null ? row.Attribute("AcceptedAnswerId").Value : null,
				Answers = new List<Post>(),
				Id = row.Attribute("Id").Value,
				Body = row.Attribute("Body").Value,
				Title = row.Attribute("Title").Value,
				Score = Int32.Parse(row.Attribute("Score").Value)
			};
			
			threads.Add(t);
		}
		else if (row.Attribute("PostTypeId").Value == "2")
		{
			var parent = threads.FirstOrDefault(x => x.Id == row.Attribute("ParentId").Value);

			// Skip orphaned answers whose question isn't in the dump
			if (parent == null)
				continue;

			var t = new Post
			{
				Id = row.Attribute("Id").Value,
				Body = row.Attribute("Body").Value,
				Score = Int32.Parse(row.Attribute("Score").Value)
			};
			
			if (parent.AcceptedAnswerId == t.Id)
				parent.Answers.Insert(0, t);
			else
				parent.Answers.Add(t);
		}
	}
	
	threads.Sort((x, y) => y.Score.CompareTo(x.Score));
	
	foreach (var thread in threads)
		thread.Answers.Sort((x, y) => y.Score.CompareTo(x.Score));
		
	threads
		.Select(x => new 
		{
			Score = x.Score,
			Title = x.Title,
			Body = Util.RawHtml(WebUtility.HtmlDecode(x.Body)),
			Answers = x.Answers.Select(y => new
			{
				Score = y.Score,
				Body = Util.RawHtml(WebUtility.HtmlDecode(y.Body))
			})
		})
		.Dump();
}

public class Question : Post
{
	public string Title { get; set; }
	public string AcceptedAnswerId { get; set; }
	public List<Post> Answers { get; set; }
}

public class Post
{
	public string Id { get; set; }
	public int Score { get; set; }
	public string Body { get; set; }
}
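
As an aside, if a dump is too large for XDocument.Load, the rows can be streamed with XmlReader instead. A minimal sketch that prints question scores and titles:

using (var reader = XmlReader.Create(@"C:\temp\posts.xml"))
{
	while (reader.Read())
	{
		if (reader.NodeType != XmlNodeType.Element || reader.Name != "row")
			continue;

		// PostTypeId "1" is a question, "2" is an answer
		if (reader.GetAttribute("PostTypeId") == "1")
			Console.WriteLine(reader.GetAttribute("Score") + "\t" + reader.GetAttribute("Title"));
	}
}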

Logging Lync conversations

January 24, 2014

A feature missing in Lync is the ability to log chat conversations to a text file. There is an option to log conversations to the "Conversation History" folder in Outlook, but this option can be disabled by an administrator. I like being able to view and search my conversation history. Since this option will be disabled soon, I need to find a way to continue logging my conversations. Fortunately for me, I can use the Lync SDK to my advantage.

Each time a chat or video session is started, it is encapsulated within a container. These containers are known as Conversations in Lync.

LyncClient client = LyncClient.GetClient();

client.ConversationManager.ConversationAdded += 
	(sender, eventArgs) => Console.WriteLine("Chat window opened.");

client.ConversationManager.ConversationRemoved += 
	(sender, eventArgs) => Console.WriteLine("Chat window closed.");

Conversations can contain multiple modes or “modalities”. In my case, I’m only interested in logging instant messages. On first try, my code looked similar to this:

client.ConversationManager.ConversationAdded += (sender, eventArgs) =>
{
	var instantMessageModality = 
		(InstantMessageModality) eventArgs.Conversation.Modalities[ModalityTypes.InstantMessage];

	instantMessageModality.InstantMessageReceived += (o, data) =>
	{
		var mode = (InstantMessageModality) o;
		var name = (string) mode
			.Participant
			.Contact
			.GetContactInformation(ContactInformationType.DisplayName);

		Console.WriteLine("Received a message from: " + name);
		Console.WriteLine("The message received is: " + data.Text);
	};
};

I noticed the console was only printing messages from myself. When a message was received from another person, nothing happened. It turns out that I need to create an event handler for each person involved in the conversation. Since I was the original person who opened the chat window, the participant was set to myself.

Since multiple chat windows can be opened at the same time, we need to create a handler for each conversation window. Within each conversation window, we need to create a handler for each participant. For each participant, we need to create a handler to handle any messages we receive.

// For each conversation
client.ConversationManager.ConversationAdded += (cSender, cEventArgs) =>
{
	// For each participant
	cEventArgs.Conversation.ParticipantAdded += (pSender, pEventArgs) =>
	{
		var modality = (InstantMessageModality) pEventArgs
			.Participant
			.Modalities[ModalityTypes.InstantMessage];

		// Register for messages
		modality.InstantMessageReceived += (mSender, mEventArgs) =>
		{
		};
	};
};

At this point, we have access to the person sending the message and the message text. We can append it to a file or log it to a database.

// Register for messages
modality.InstantMessageReceived += (mSender, mEventArgs) =>
{
	var instantMessageModality = (InstantMessageModality) mSender;
	var person = (string) instantMessageModality
		.Participant
		.Contact
		.GetContactInformation(ContactInformationType.DisplayName);

	using (var conn = new SQLiteConnection(@"data source=D:\Lync\messages.db"))
	using (var cmd = conn.CreateCommand())
	{
		conn.Open();

		cmd.CommandText = 
			@"insert into messages (conversationid, date, person, message) 
			values (@conversationid, @date, @person, @message)";
			
		cmd.Parameters.AddWithValue("@conversationId", conversationId);
		cmd.Parameters.AddWithValue("@date", DateTime.Now);
		cmd.Parameters.AddWithValue("@person", person);
		cmd.Parameters.AddWithValue("@message", mEventArgs.Text);

		cmd.ExecuteNonQuery();
	}
};

Archiving emails – Part II

January 23, 2014

After giving some more thought to my last post, I came up with a slightly better solution.

Due to the presence of the policy “PSTDisableGrow”, Outlook cannot create new PST files or add mail to existing PST files. This basically means I can’t use Outlook’s archiving feature.

However, that doesn’t stop me from creating an Outlook addin that can do exactly that. Instead of saving emails to my local machine as MSG files, I’ll just move the emails into a new PST file that I create.

I can create a new PST file using the NameSpace.AddStoreEx method. If the PST file does not exist, Outlook will create it.

Application
	.Session
	.AddStoreEx(@"D:\EmailTest\ArchiveTest.pst", OlStoreType.olStoreDefault);

By default, the name of this new data file will be "Outlook Data File". There isn't any obvious way to change the display name, so I need to loop through all the root folders and look for my PST file. Once I find it and cast it to type Folder, I can set the display name.

Folders folders = Application.Session.Folders;

for (int i = 1; i <= folders.Count; i++)
{
	Folder target = (Folder) folders[i];
	
	Store store = target.Store;
	string path = store.FilePath;
	Marshal.ReleaseComObject(store);
	
	if (path == @"D:\EmailTest\ArchiveTest.pst")
	{
		target.Name = "Archive Test";
		Marshal.ReleaseComObject(target);
		
		break;
	}
	
	Marshal.ReleaseComObject(target);
}

Marshal.ReleaseComObject(folders);

Now I can loop through my inbox, make a copy of the email, and save it to my new archive.

for (int i = 1; i <= inboxItems.Count; i++)
{
	var email = inboxItems[i] as MailItem;

	if (email == null)
		continue;

	MailItem copy = (MailItem) email.Copy();
	copy.Move(archiveInbox);

	Marshal.ReleaseComObject(copy);
	Marshal.ReleaseComObject(email);
}

In my previous post, I used the entry ID as the file name when I saved the email to my local machine. I used this ID as a unique identifier to determine which emails I have already saved. However, I cannot use the entry ID as a unique identifier when I move emails from my default mailbox to my archive. MAPI assigns a unique entry ID to each email that comes into a mailbox. However, that entry ID changes when it moves from one store (my default mailbox) to another store (my archive).

Instead, I can utilize user properties on the email itself. Each email contains a collection of user properties, which are just key value pairs. I can add my own custom user property to indicate that an email has already been archived.

for (int i = 1; i <= inboxItems.Count; i++)
{
	var email = inboxItems[i] as MailItem;

	if (email == null)
		continue;

	UserProperties userProperties = email.UserProperties;
	UserProperty archivedProperty = userProperties.Find("_archived");  

	if (archivedProperty == null)
	{      
		MailItem copy = (MailItem) email.Copy();
		copy.Move(archiveInbox);
		Marshal.ReleaseComObject(copy);
		
		UserProperty archived = userProperties.Add("_archived", OlUserPropertyType.olText, false, OlFormatText.olFormatTextText);
		email.Save();
		Marshal.ReleaseComObject(archived);
	}
	else
		Marshal.ReleaseComObject(archivedProperty);

	Marshal.ReleaseComObject(userProperties);
	Marshal.ReleaseComObject(email);
}

Archiving emails… the hard way

January 22, 2014

Email storage can be a problem. Many email providers limit the storage size for a given user. The wrong way to handle email storage is to limit how long an email can be kept.

Unfortunately, this is something I have to deal with. For whatever reason, the powers that be decided people don't need to keep emails longer than three months, so emails older than three months will be automatically deleted. I find the whole situation comical, but that's an entirely different conversation.

Since I use Outlook/Exchange for these emails, a normal person would recommend archiving my emails to a PST file on my local machine. Unfortunately, a group policy was pushed out that added the registry key "PSTDisableGrow" for Outlook. This prevents Outlook from adding emails to PST files, even when they're stored locally.
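
To check whether the policy applies to you, you can read the value straight from the registry (a sketch; the "14.0" in the path assumes Outlook 2010, so adjust it for your Office version):

using Microsoft.Win32;

object value = Registry.GetValue(
	@"HKEY_CURRENT_USER\Software\Policies\Microsoft\Office\14.0\Outlook\PST",
	"PSTDisableGrow",
	null);

Console.WriteLine(value != null && (int) value != 0
	? "PST growth is disabled by policy."
	: "No PSTDisableGrow policy found.");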

So now I’m stuck in a position where I can’t automatically archive my emails without paying for a third party product. I need a way to automatically save all my emails as either MSG or EML files to my hard drive, so at least I have a copy.

There are a couple options that I’m exploring. Be warned that the solutions I’m going to talk about are TERRIBLE. They are very much hacks and something that I would completely avoid if I had a chance. I’m open to any suggestions and/or free products.

The first thing I tried was to create a new rule in Outlook for all incoming emails. This rule would execute a custom VB script that saves a copy of the email to my machine. Unfortunately, I couldn’t get it to work. I fumbled around with it for about an hour before I gave up.

The second option was to utilize Exchange Web Services (EWS). Newer versions of Exchange expose a SOAP web service for anyone to use. Most of the time, the location of the web service can be discovered by going to the address http://webmail.example.com/ews/exchange.asmx, where “example.com” is your domain. Microsoft provides a managed interface called Exchange Web Services Managed API that simplifies access. I was quite surprised at how easy it was to develop a simple solution.

var service = new ExchangeService(ExchangeVersion.Exchange2010_SP1)
{
	Credentials = new WebCredentials("user", "password"),
	Url = new Uri("https://webmail.example.com/ews/exchange.asmx")
};

Folder folder = Folder.Bind(service, WellKnownFolderName.Inbox);
FindItemsResults<Item> emails = folder.FindItems(new ItemView(Int32.MaxValue));

service.LoadPropertiesForItems(emails, new PropertySet(ItemSchema.MimeContent));

string archiveDirectory = Path.Combine(@"D:\EmailArchive", DateTime.Now.ToString("yyyy-MM"));

if (!Directory.Exists(archiveDirectory))
	Directory.CreateDirectory(archiveDirectory);

foreach (Item email in emails)
{
	string path = Path.Combine(archiveDirectory, email.StoreEntryId + ".eml");

	if (!File.Exists(path))
		File.WriteAllBytes(path, email.MimeContent.Content);
}

This code snippet basically downloads my entire inbox and saves it locally. It doesn’t get any easier than that. EWS also supports streaming, push, and pull notifications. This allows me to monitor any incoming/outgoing emails and immediately archive them. I could fall back to iterating over the entire inbox every few days to catch any emails I missed.
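
For example, a streaming subscription for new mail might look something like this (a sketch that reuses the service object from above):

StreamingSubscription subscription = service.SubscribeToStreamingNotifications(
	new FolderId[] { WellKnownFolderName.Inbox },
	EventType.NewMail);

// Connections stay open for at most 30 minutes, so reopen on disconnect
var connection = new StreamingSubscriptionConnection(service, 30);
connection.AddSubscription(subscription);
connection.OnDisconnect += (sender, args) => connection.Open();

connection.OnNotificationEvent += (sender, args) =>
{
	foreach (NotificationEvent notification in args.Events)
	{
		var itemEvent = notification as ItemEvent;

		if (itemEvent == null)
			continue;

		EmailMessage email = EmailMessage.Bind(service, itemEvent.ItemId,
			new PropertySet(ItemSchema.MimeContent));

		// Save email.MimeContent.Content to disk, same as above
	}
};

connection.Open();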

As much as I like how simple this solution is, I can’t depend on it. Unfortunately, EWS can be disabled by an Exchange administrator. Knowing how people have reacted before, this feature of Exchange will probably be disabled once they realize someone is using it.

The final option I’m currently exploring is to create an Outlook addin using VSTO. Unfortunately, it utilizes COM objects for nearly everything. I have very little experience with COM, so I ran into several issues.

Folder inbox = (Folder) Application.Session.GetDefaultFolder(OlDefaultFolders.olFolderInbox);
string archiveDirectory = Path.Combine(@"D:\EmailArchive", DateTime.Now.ToString("yyyy-MM"));

if (!Directory.Exists(archiveDirectory))
	Directory.CreateDirectory(archiveDirectory);

foreach (object item in inbox.Items)
{
	var email = item as MailItem;

	if (email == null)
		continue;

	string path = Path.Combine(archiveDirectory, email.EntryID + ".msg");

	if (!File.Exists(path))
		email.SaveAs(path);
}

There are several things wrong with the code above. Since nearly everything is a COM object, I need to release each one after I'm done with it. The first time I ran this, it worked for the first few hundred emails. At around the 300 mark, I received the exception "Your server administrator has limited the number of items you can open simultaneously." Each iteration of the loop references a new COM object, and the code eventually failed because I never released any of them.

This article mentions a pretty good guideline.

1 dot good, 2 dots bad

This means I need to pay special attention to property chaining. For example:

Folder inbox = (Folder) Application.Session.GetDefaultFolder(OlDefaultFolders.olFolderInbox);

// Bad
inbox.Items.ItemAdd += OnItemAdd;

// Good
Items inboxItems = inbox.Items;
inboxItems.ItemAdd += OnItemAdd;

Since I need to release the COM objects in the opposite order of creation, I used a stack to keep track of all my references.

Stack<object> comObjects = new Stack<object>();

Folder inbox = (Folder) Application.Session.GetDefaultFolder(OlDefaultFolders.olFolderInbox);
comObjects.Push(inbox);

Items inboxItems = inbox.Items;
comObjects.Push(inboxItems);

Folder sent = (Folder) Application.Session.GetDefaultFolder(OlDefaultFolders.olFolderSentMail);
comObjects.Push(sent);

Items sentItems = sent.Items;
comObjects.Push(sentItems);

// 
// Do something
//

while (comObjects.Count != 0)
{
	object obj = comObjects.Pop();

	if (obj != null)
		Marshal.ReleaseComObject(obj);
}

While iterating through the Items collection using a for loop, I immediately received an exception saying the array index was out of bounds. Like any C# developer, I started iterating at index 0. However, the Items collection starts at index 1. MSDN documents it here.

The Items collection also exposes an ItemAdd event that fires for each new item added to the folder. I can utilize this event for both the inbox and sent folders to immediately archive new emails. There is a caveat to using this event that is mentioned in this article: whenever 16 or more items are added at the same time, the event does not fire. I don't need to worry about this limitation most of the time, but I would still need to fall back to iterating over the entire inbox every once in a while to make sure all the emails have been saved.
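
As a sketch, reusing inboxItems and archiveDirectory from the snippets above, the handler could look like this:

inboxItems.ItemAdd += item =>
{
	var email = item as MailItem;

	if (email == null)
		return;

	string path = Path.Combine(archiveDirectory, email.EntryID + ".msg");

	if (!File.Exists(path))
		email.SaveAs(path);

	// Release the reference we were handed
	Marshal.ReleaseComObject(email);
};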

Creating an Outlook addin isn’t as simple as using the EWS Managed API. With the restrictions in place, it seems this is the only option I currently have. Even after I save all the emails to my local machine, I still need to create a separate service to parse and index all the emails.

A few people have started migrating all their old emails into OneNote by highlighting their entire inbox and pressing one button. If things get too complicated, I might have to fall back to this.

This is a lot of work for a simple problem. The wrong way to handle email storage is to limit how long an email can be kept.

Levenshtein distance

December 27, 2013

Imagine a scenario where a single script is deployed to several hundred different locations. Due to various constraints, this script cannot be centralized, so making a change means I’ll need to deploy it to several hundred locations.

But it gets worse. Some of these scripts are customized and include special logic, so I cannot blindly copy the updated script to all locations. In addition to that, most of the existing scripts contain comments such as:

#
# This script was created on 1/1/1970 by John Doe.
#

If these scripts didn’t include their own unique comments, I could have compared the file sizes or generated SHA1 hashes for each script to see which were identical and which contained their own special logic. Since each script contains their own unique comments, generating hashes would mean a different hash for each script.

Instead of reviewing each script individually, I can use the Levenshtein distance to determine how similar the target script is compared to my updated script.

According to Wikipedia, the Levenshtein distance is:

… a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other.
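
The textbook dynamic programming implementation only takes a few lines of C# (a sketch; note the table uses O(n·m) memory, which is fine for scripts of this size):

public static int LevenshteinDistance(string source, string target)
{
	int[,] d = new int[source.Length + 1, target.Length + 1];

	for (int i = 0; i <= source.Length; i++)
		d[i, 0] = i;

	for (int j = 0; j <= target.Length; j++)
		d[0, j] = j;

	for (int i = 1; i <= source.Length; i++)
	{
		for (int j = 1; j <= target.Length; j++)
		{
			int cost = source[i - 1] == target[j - 1] ? 0 : 1;

			d[i, j] = Math.Min(
				Math.Min(
					d[i - 1, j] + 1,       // deletion
					d[i, j - 1] + 1),      // insertion
				d[i - 1, j - 1] + cost);   // substitution
		}
	}

	return d[source.Length, target.Length];
}

Calling LevenshteinDistance(File.ReadAllText(updatedScript), File.ReadAllText(targetScript)) for each location, where the two hypothetical variables hold the file paths, produces the sort key.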

Sorting each script by the Levenshtein distance gives me a good indication of which scripts I can safely copy over and which I need to review manually. Overwriting scripts with a Levenshtein distance close to zero gives me reasonable assurance that I won't break anything. While it's not a bulletproof solution, it's better than reviewing hundreds of scripts manually.

Loading animations using pure CSS

December 26, 2013

Loading animations have traditionally been done using animated GIFs. With the advent of CSS animations, it's now quite easy to create one using just CSS. All it takes is a single div element and a few lines of CSS:

#loading-image
{
	width: 25px;
	height: 25px;
	border-width: 8px;
	border-style: solid;
	border-color: #000;
	border-right-color: transparent;
	border-radius: 50%;
	animation-name: loading;
	animation-duration: 1s;
	animation-timing-function: linear;
	animation-iteration-count: infinite;
}

@keyframes loading
{
	0% { transform: rotate(0deg); }
	100%   { transform: rotate(360deg); }
}

Here is the result in JSFiddle.

I recently participated in a code review for a website, and instead of using animated images, a developer decided to use CSS animations. While this is neat, I believe it's a mistake to use on a customer-facing website. Perhaps my opinion will change in five years, but there are still too many people using older browsers that don't support CSS animations.

Using pure CSS does have some merit. For example, a page might load a tiny bit faster because there is one less image to download, which reduces the number of HTTP requests and the size of the page. You can also use LESS to dynamically change the animation color to match customer-defined themes, background colors, and so on.

While there are some reasons to use CSS animations, there are more reasons not to. The most important reason against using CSS animations at this time is avoiding unnecessary complexity. If you decide to use CSS animations on a customer-facing website, you'll still need to include a fallback for browsers that don't support them. I don't see any reason to complicate things when animated GIFs work perfectly fine.

Unless a website is highly dynamic with ever-changing colors, I don't see a reason to use CSS animations for loading indicators. Again, my opinion might change in five years when more browsers support CSS animations.

Enterprise Development with NServiceBus

December 12, 2013

I’m in Minneapolis Minnesota this week to attend a course on NServiceBus.  Over the next couple weeks, I intend to post about my experiences and impressions as I start developing a system from the ground up.  Most of the posts won’t be interesting to veteran NServiceBus developers since they’ll be “hello world” type posts for various features.  These posts will mainly be notes for myself to remember what I’ve done and what features are available.  I’m actually quite eager to start using what I’ve learned this week.
