I have a Microsoft Word Document (docx) and I use Open XML SDK 2.0 Productivity Tool to generate C# code from it.
I want to programmatically insert some database values.
I do not know of a way to clean up the XML, but I have always used #placeholder for my placeholder text, and it seems to stay in one run more reliably than any other placeholder text I have tried. It seems that the longer the placeholder text, the more likely it is to be split into multiple runs.
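As a rough sketch of how such a placeholder could then be filled in with the Open XML SDK: the class and method names below (PlaceholderFiller, ReplaceInTextElements, FillDocument) are made up for illustration, and the replacement value stands in for whatever comes back from the database. It only finds a placeholder that sits entirely inside a single text element, which is why keeping it in one run matters, and it only looks at the main document part, not headers or footers.

```csharp
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper, not part of the Open XML SDK.
static class PlaceholderFiller
{
    // Replaces the placeholder in every w:t element below root that contains it.
    // Returns the number of text elements that were changed.
    public static int ReplaceInTextElements(OpenXmlElement root, string placeholder, string value)
    {
        int replaced = 0;

        // ToList() takes a snapshot so the loop does not depend on lazy enumeration.
        foreach (Text text in root.Descendants<Text>().ToList())
        {
            if (text.Text.Contains(placeholder))
            {
                text.Text = text.Text.Replace(placeholder, value);
                replaced++;
            }
        }

        return replaced;
    }

    // Opens the file for editing, fills the main document part and saves it.
    public static int FillDocument(string path, string placeholder, string value)
    {
        using (WordprocessingDocument doc = WordprocessingDocument.Open(path, true))
        {
            Document document = doc.MainDocumentPart.Document;
            int replaced = ReplaceInTextElements(document, placeholder, value);
            document.Save();
            return replaced;
        }
    }
}
```

A call such as PlaceholderFiller.FillDocument("Test.docx", "#placeholder", valueFromDatabase) would then do the replacement in place. If it returns 0 although the placeholder is visible in Word, the text has most likely been split across several runs.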
You need to get rid of the Rsid information. According to this page, Rsid information enables merging of two documents that have forked.
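To make it clearer what is being removed, here is a small sketch that strips the w:rsid* attributes from a piece of WordprocessingML using only LINQ to XML. RsidSketch and StripRsidAttributes are names invented for this example, and the sample paragraph in the usage note is hand-written rather than taken from a real document. It only deletes the attributes; merging the runs that Word split apart is a separate step that this sketch does not attempt.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; it works on raw XML, not on the SDK classes.
static class RsidSketch
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Removes every attribute in the w: namespace whose local name starts with "rsid"
    // (w:rsidR, w:rsidRPr, w:rsidRDefault, ...) on root and all of its descendants.
    // Returns the number of attributes removed.
    public static int StripRsidAttributes(XElement root)
    {
        var rsidAttributes = root
            .DescendantsAndSelf()
            .Attributes()
            .Where(a => a.Name.Namespace == W
                     && a.Name.LocalName.StartsWith("rsid", StringComparison.Ordinal))
            .ToList(); // materialize first, because Remove() mutates the tree

        foreach (XAttribute attribute in rsidAttributes)
        {
            attribute.Remove();
        }

        return rsidAttributes.Count;
    }
}
```

For example, a paragraph such as &lt;w:p w:rsidR="00A12B34"&gt; containing a run &lt;w:r w:rsidR="00E90F12"&gt; would come out as plain &lt;w:p&gt; and &lt;w:r&gt; elements, with the text left untouched.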
You need to install the OpenXmlPowerTools package in order to run the sample code below. The easiest way to do that is to run the following in the Package Manager Console:
Install-Package OpenXmlPowerTools
Then you will be all set to run the following code. (This assumes that you already have a "Test.docx" file added to your project. If you are using Visual Studio, make sure that a copy of the file is in either the Debug or the Release folder, depending on your build mode.)
// Sample code to remove Rsid information from a "Test.docx" document.
// Needs these directives at the top of the file:
//   using DocumentFormat.OpenXml.Packaging;
//   using OpenXmlPowerTools;
using (WordprocessingDocument doc = WordprocessingDocument.Open("Test.docx", true))
{
    SimplifyMarkupSettings settings = new SimplifyMarkupSettings
    {
        RemoveRsidInfo = true
    };
    MarkupSimplifier.SimplifyMarkup(doc, settings);
}
This removes the Rsid information that can get in the way when manipulating Word files.
I have found a solution: the Open XML PowerTools Markup Simplifier.
I followed the steps described at http://ericwhite.com/blog/2011/03/09/getting-started-with-open-xml-powertools-markup-simplifier/, but they didn't work exactly as written (maybe because Power Tools is now at version 2.2?). So I compiled PowerTools 2.2 in "Release" mode and added a reference to OpenXmlPowerTools.dll in my TestMarkupSimplifier.csproj. In Program.cs I only changed the path to my DOCX file. I ran the program once, and my document seems to be fairly clean now.
Code quoted from Eric's blog in the link above:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using OpenXmlPowerTools;
using DocumentFormat.OpenXml.Packaging;

class Program
{
    static void Main(string[] args)
    {
        // Open the document for editing (second argument: isEditable = true).
        using (WordprocessingDocument doc = WordprocessingDocument.Open("Test.docx", true))
        {
            SimplifyMarkupSettings settings = new SimplifyMarkupSettings
            {
                RemoveComments = true,
                RemoveContentControls = true,
                RemoveEndAndFootNotes = true,
                RemoveFieldCodes = false,
                RemoveLastRenderedPageBreak = true,
                RemovePermissions = true,
                RemoveProof = true,
                RemoveRsidInfo = true,
                RemoveSmartTags = true,
                RemoveSoftHyphens = true,
                ReplaceTabsWithSpaces = true,
            };
            MarkupSimplifier.SimplifyMarkup(doc, settings);
        }
    }
}
For those looking for a manual, non-programmatic solution:
http://www.translationtribulations.com/2010/06/cleaning-up-superfluous-tags-in-docx.html
I have confirmed that the free trial of memoQ 2014 can indeed be used as a bulky workaround for cleaning up Word's spell-check tags.
I am still looking for an easier tool that works out of the box.