How can I remove any UTF-8 BOM that exists -within- some text, not at the start of some text

Posted by 筅森魡賤 on 2019-12-22 11:45:54

Question


We receive some files that have been concatenated by another party. Some UTF-8 BOMs appear in the middle of these files.

Is there a way to detect these 3 bytes and remove them? I've seen plenty of examples of how to remove the BOM from the -start- of a file ... but not from the middle.


Answer 1:


Assuming that your file is small enough to hold in memory, and that you have a Replace extension method on IEnumerable<T> for replacing subsequences (LINQ's Enumerable class does not provide one), you could use:

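// Requires: using System.IO; using System.Linq;
// Reads the whole file, drops every UTF-8 BOM (EF BB BF),
// including one at the very start, and writes the result back in place.
var bytes = File.ReadAllBytes(filePath);
var bom = new byte[] { 0xEF, 0xBB, 0xBF };
var empty = Enumerable.Empty<byte>();
bytes = bytes.Replace(bom, empty).ToArray();
File.WriteAllBytes(filePath, bytes);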

Here is a simple (inefficient) implementation of the Replace extension method:

using System.Collections.Generic;
using System.Linq;

// Extension methods must be declared in a static class.
public static class EnumerableExtensions
{
    public static IEnumerable<TSource> Replace<TSource>(
        this IEnumerable<TSource> source,
        IEnumerable<TSource> match,
        IEnumerable<TSource> replacement)
    {
        return Replace(source, match, replacement, EqualityComparer<TSource>.Default);
    }

    public static IEnumerable<TSource> Replace<TSource>(
        this IEnumerable<TSource> source,
        IEnumerable<TSource> match,
        IEnumerable<TSource> replacement,
        IEqualityComparer<TSource> comparer)
    {
        // Materialize the inputs once so they are not re-enumerated below.
        var sList = source.ToList();
        var mList = match.ToList();
        var rList = replacement.ToList();
        int sLength = sList.Count;
        int mLength = mList.Count;

        if (sLength < mLength || mLength == 0)
            return sList;

        // Find every index at which the match sequence occurs.
        int[] matchIndexes = (
            from sIndex in Enumerable.Range(0, sLength - mLength + 1)
            where sList.Skip(sIndex).Take(mLength).SequenceEqual(mList, comparer)
            select sIndex
        ).ToArray();

        var result = new List<TSource>();
        int sPosition = 0;
        foreach (int mPosition in matchIndexes)
        {
            // Ignore a match that overlaps one already replaced.
            if (mPosition < sPosition)
                continue;

            result.AddRange(sList.GetRange(sPosition, mPosition - sPosition));
            result.AddRange(rList);
            sPosition = mPosition + mLength;
        }

        // Copy whatever follows the last match.
        result.AddRange(sList.GetRange(sPosition, sLength - sPosition));
        return result;
    }
}
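
Since the implementation above is described as inefficient, here is a minimal sketch of a single-pass alternative that is specific to the three BOM bytes. It is not part of the original answer: the class name BomStripper and the method StripUtf8Boms are hypothetical, and the sketch assumes the input is valid UTF-8, in which case the sequence EF BB BF can only occur as the encoding of U+FEFF.

```csharp
using System;
using System.Collections.Generic;

public static class BomStripper
{
    // UTF-8 encoding of U+FEFF, the byte order mark.
    private static readonly byte[] Bom = { 0xEF, 0xBB, 0xBF };

    // Hypothetical helper: returns a copy of 'input' with every
    // EF BB BF sequence removed, scanning the buffer once.
    public static byte[] StripUtf8Boms(byte[] input)
    {
        var output = new List<byte>(input.Length);
        int i = 0;
        while (i < input.Length)
        {
            // The length check comes first so the indexers never run past the end.
            bool atBom = i + Bom.Length <= input.Length
                && input[i] == Bom[0]
                && input[i + 1] == Bom[1]
                && input[i + 2] == Bom[2];

            if (atBom)
                i += Bom.Length;        // skip the three BOM bytes
            else
                output.Add(input[i++]); // keep every other byte
        }
        return output.ToArray();
    }
}
```

In the snippet that calls Replace, the line `bytes = bytes.Replace(bom, empty).ToArray();` could then become `bytes = BomStripper.StripUtf8Boms(bytes);`. Like the original, this also removes a BOM at the very start of the file and still requires the file to fit in memory.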


Source: https://stackoverflow.com/questions/25228263/how-can-i-remove-any-utf-8-bom-that-exists-within-some-text-not-at-the-start
