Question
I'm having a problem creating a large BSON file with Json.NET. I have the following test code:
Imports System.IO
Imports Newtonsoft.Json

Public Class Region
    Public Property Id As Integer
    Public Property Name As String
    Public Property FDS_Id As String
End Class

Public Class Regions
    Inherits List(Of Region)

    Public Sub New(capacity As Integer)
        MyBase.New(capacity)
    End Sub
End Class

Module Module1
    Sub Main()
        Dim writeElapsed2 = CreateFileBson_Stream(GetRegionList(5000000))
        GC.Collect(0)
    End Sub

    ' Returns Regions (not List(Of Region)) so the result can be passed
    ' to CreateFileBson_Stream without a narrowing conversion.
    Public Function GetRegionList(count As Integer) As Regions
        Dim regions As New Regions(count - 1)
        For lp = 0 To count - 1
            regions.Add(New Region With {.Id = lp, .Name = lp.ToString, .FDS_Id = lp.ToString})
        Next
        Return regions
    End Function

    Public Function CreateFileBson_Stream(regions As Regions) As Long
        Dim sw As New Stopwatch
        sw.Start()
        Dim lp = 0
        Using stream = New StreamWriter("c:\atlas\regionsStream.bson")
            Using writer = New Bson.BsonWriter(stream.BaseStream)
                writer.WriteStartArray()
                For Each item In regions
                    writer.WriteStartObject()
                    writer.WritePropertyName("Id")
                    writer.WriteValue(item.Id)
                    writer.WritePropertyName("Name")
                    writer.WriteValue(item.Name)
                    writer.WritePropertyName("FDS_Id")
                    writer.WriteValue(item.FDS_Id)
                    writer.WriteEndObject()
                    lp += 1
                    ' Flush everything every million records.
                    If lp Mod 1000000 = 0 Then
                        writer.Flush()
                        stream.Flush()
                        stream.BaseStream.Flush()
                    End If
                Next
                writer.WriteEndArray()
            End Using
        End Using
        sw.Stop()
        Return sw.ElapsedMilliseconds
    End Function
End Module
I have used FileStream instead of StreamWriter in the first using statement and it makes no difference.
CreateFileBson_Stream fails at just over 3 million records with an OutOfMemory exception. The memory profiler in Visual Studio shows memory continuing to climb even though I'm flushing everything I can.
The list of 5 million regions comes to about 468 MB in memory.
Interestingly, if I use the following code to produce JSON instead, it works and memory stays steady at 500 MB:
Public Function CreateFileJson_Stream(regions As Regions) As Long
    Dim sw As New Stopwatch
    sw.Start()
    Using stream = New StreamWriter("c:\atlas\regionsStream.json")
        Using writer = New JsonTextWriter(stream)
            writer.WriteStartArray()
            For Each item In regions
                writer.WriteStartObject()
                writer.WritePropertyName("Id")
                writer.WriteValue(item.Id)
                writer.WritePropertyName("Name")
                writer.WriteValue(item.Name)
                writer.WritePropertyName("FDS_Id")
                writer.WriteValue(item.FDS_Id)
                writer.WriteEndObject()
            Next
            writer.WriteEndArray()
        End Using
    End Using
    sw.Stop()
    Return sw.ElapsedMilliseconds
End Function
I'm pretty certain this is a problem with the BsonWriter but can't see what else I can do. Any ideas?
Answer 1:
The reason you are running out of memory is as follows. According to the BSON specification, every object or array - called documents in the standard - must contain at the beginning a count of the total number of bytes comprising the document:
document ::= int32 e_list "\x00" BSON Document. int32 is the total number of bytes comprising the document.
e_list ::= element e_list
| ""
element ::= "\x01" e_name double 64-bit binary floating point
| "\x02" e_name string UTF-8 string
| "\x03" e_name document Embedded document
| "\x04" e_name document Array
| ...
Thus when writing the root object or array, the total number of bytes to be written to the file must be precalculated.
Newtonsoft's BsonDataWriter and underlying BsonBinaryWriter implement this by caching all tokens to be written in a tree, then when the contents of the root token have been finalized, recursively calculating the sizes before writing the tree out. (The alternatives would have been to make the application (i.e. your code) somehow precalculate this information -- practically impossible -- or to seek back and forth in the output stream to write this information, possibly only for those streams for which Stream.CanSeek == true.) You are getting the OutOfMemory exception because your system has insufficient resources to hold the token tree.
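To make the length prefix concrete, here is a minimal sketch (not taken from Json.NET) that hand-encodes the one-element document {"a": 1} according to the grammar above. The class and method names are hypothetical, chosen only for illustration. Note that the total size has to be computed before the first byte is written, which is exactly what becomes expensive for a root array with millions of elements:

```csharp
using System;
using System.IO;
using System.Text;

public static class BsonLengthPrefixSketch
{
    // Hand-encodes a one-element BSON document such as {"a": 1}.
    // Layout per the spec: int32 total size, e_list, then a trailing 0x00.
    public static byte[] EncodeInt32Document(string name, int value)
    {
        byte[] nameBytes = Encoding.UTF8.GetBytes(name);

        // The size must be known up front:
        // 4 (size) + 1 (type) + name + 1 (NUL) + 4 (value) + 1 (terminator).
        int totalSize = 4 + 1 + nameBytes.Length + 1 + 4 + 1;

        using (var stream = new MemoryStream(totalSize))
        using (var writer = new BinaryWriter(stream)) // BinaryWriter emits little-endian integers, as BSON requires.
        {
            writer.Write(totalSize);   // int32: total bytes in the document, including these four
            writer.Write((byte)0x10);  // element type 0x10: 32-bit integer
            writer.Write(nameBytes);   // e_name ...
            writer.Write((byte)0x00);  // ... terminated by NUL (cstring)
            writer.Write(value);       // int32 payload
            writer.Write((byte)0x00);  // end of document
            writer.Flush();
            return stream.ToArray();
        }
    }

    public static void Main()
    {
        byte[] bytes = EncodeInt32Document("a", 1);
        Console.WriteLine(BitConverter.ToString(bytes));
        // prints 0C-00-00-00-10-61-00-01-00-00-00-00
    }
}
```

The leading 0C-00-00-00 is the document size (12 bytes). For a nested structure every embedded document and array carries its own prefix of this kind, so a writer that cannot seek has to know the size of the whole tree before emitting anything.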
For comparison, the JSON standard does not require byte counts or sizes to be written anywhere in the file. Thus JsonTextWriter can stream your serialized array contents out immediately, without the need to cache anything.
As a workaround, based on the BSON spec and BsonBinaryWriter I have created a helper method that incrementally serializes an enumerable to a stream for which Stream.CanSeek == true. It doesn't require caching the entire BSON document in memory, but rather seeks to the beginning of the stream to write the final byte count:
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Text;
using Newtonsoft.Json;
using Newtonsoft.Json.Bson;
using Newtonsoft.Json.Serialization;

public static partial class BsonExtensions
{
    const int BufferSize = 256;

    public static void SerializeEnumerable<TItem>(IEnumerable<TItem> enumerable, Stream stream, JsonSerializerSettings settings = null)
    {
        // Created based on https://github.com/JamesNK/Newtonsoft.Json/blob/master/Src/Newtonsoft.Json/Bson/BsonBinaryWriter.cs
        // And http://bsonspec.org/spec.html
        if (enumerable == null || stream == null)
            throw new ArgumentNullException();
        if (!stream.CanSeek || !stream.CanWrite)
            throw new ArgumentException("!stream.CanSeek || !stream.CanWrite");

        var serializer = JsonSerializer.CreateDefault(settings);
        var contract = serializer.ContractResolver.ResolveContract(typeof(TItem));

        // Determine the BSON element type that each item will be written as.
        BsonType rootType;
        if (contract is JsonObjectContract || contract is JsonDictionaryContract)
            rootType = BsonType.Object;
        else if (contract is JsonArrayContract)
            rootType = BsonType.Array;
        else
            // Arrays of primitives are not implemented yet.
            throw new JsonSerializationException(string.Format("Item type \"{0}\" not implemented.", typeof(TItem)));

        stream.Flush(); // Just in case.

        var initialPosition = stream.Position;

        var buffer = new byte[BufferSize];

        WriteInt(stream, (int)0, buffer); // Placeholder; the real size is written at the end.

        // Each array element is written as: type byte, index as a cstring name, then the value.
        ulong index = 0;
        foreach (var item in enumerable)
        {
            if (item == null)
            {
                stream.WriteByte(unchecked((byte)BsonType.Null));
                WriteString(stream, index.ToString(NumberFormatInfo.InvariantInfo), buffer);
            }
            else
            {
                stream.WriteByte(unchecked((byte)rootType));
                WriteString(stream, index.ToString(NumberFormatInfo.InvariantInfo), buffer);
                using (var bsonWriter = new BsonDataWriter(stream) { CloseOutput = false })
                {
                    serializer.Serialize(bsonWriter, item);
                }
            }
            index++;
        }

        stream.WriteByte((byte)0); // Document terminator.
        stream.Flush();

        // Seek back and overwrite the placeholder with the actual byte count.
        var finalPosition = stream.Position;
        stream.Position = initialPosition;

        var size = checked((int)(finalPosition - initialPosition));
        WriteInt(stream, size, buffer); // Calculated size.

        stream.Position = finalPosition;
    }

    private static readonly Encoding Encoding = new UTF8Encoding(false);

    // Writes s as a NUL-terminated UTF-8 cstring.
    private static void WriteString(Stream stream, string s, byte[] buffer)
    {
        if (s != null)
        {
            if (s.Length < buffer.Length / Encoding.GetMaxByteCount(1))
            {
                var byteCount = Encoding.GetBytes(s, 0, s.Length, buffer, 0);
                stream.Write(buffer, 0, byteCount);
            }
            else
            {
                byte[] bytes = Encoding.GetBytes(s);
                stream.Write(bytes, 0, bytes.Length);
            }
        }

        stream.WriteByte((byte)0);
    }

    // Writes value as a little-endian int32.
    private static void WriteInt(Stream stream, int value, byte[] buffer)
    {
        unchecked
        {
            buffer[0] = (byte)value;
            buffer[1] = (byte)(value >> 8);
            buffer[2] = (byte)(value >> 16);
            buffer[3] = (byte)(value >> 24);
        }
        stream.Write(buffer, 0, 4);
    }

    enum BsonType : sbyte
    {
        // Taken from https://github.com/JamesNK/Newtonsoft.Json/blob/master/Src/Newtonsoft.Json/Bson/BsonType.cs
        // And also http://bsonspec.org/spec.html
        Number = 1,
        String = 2,
        Object = 3,
        Array = 4,
        Binary = 5,
        Undefined = 6,
        Oid = 7,
        Boolean = 8,
        Date = 9,
        Null = 10,
        Regex = 11,
        Reference = 12,
        Code = 13,
        Symbol = 14,
        CodeWScope = 15,
        Integer = 16,
        TimeStamp = 17,
        Long = 18,
        MinKey = -1,
        MaxKey = 127
    }
}
And then call it as follows:
BsonExtensions.SerializeEnumerable(regions, stream);
Notes:
You could use the method above to serialize to a local FileStream or a MemoryStream, but not, say, a DeflateStream, which cannot be repositioned.
Serializing enumerables of primitives is not implemented, but could be.
In Release 10.0.1 Newtonsoft moved BSON processing into a separate NuGet package, Newtonsoft.Json.Bson, and replaced BsonWriter with BsonDataWriter. If you are using an earlier version of Newtonsoft, the answer above applies equally to the old BsonWriter.
Since Json.NET is written in C# and my primary language is C#, the workaround is also in C#. If you need this converted to VB.NET, let me know and I can try.
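As a quick illustration of the first note, the sketch below probes Stream.CanSeek on a MemoryStream and on a DeflateStream. The class and method names are hypothetical; only the standard library is used:

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class CanSeekSketch
{
    // Returns { MemoryStream.CanSeek, DeflateStream.CanSeek }.
    public static bool[] Probe()
    {
        using (var memory = new MemoryStream())
        using (var backing = new MemoryStream())
        using (var deflate = new DeflateStream(backing, CompressionMode.Compress))
        {
            return new[] { memory.CanSeek, deflate.CanSeek };
        }
    }

    public static void Main()
    {
        bool[] result = Probe();
        Console.WriteLine(result[0]); // prints True
        Console.WriteLine(result[1]); // prints False
    }
}
```

If compressed output is needed, one option is to serialize to a temporary seekable stream first and copy the finished bytes into the DeflateStream afterwards.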
Demo fiddle with some simple unit tests here.
Answer 2:
Found it: BsonWriter is trying to be 'intelligent'. Because I am producing the output as an array of regions, it seems to keep the whole array in memory regardless of any flushing that you do.
To prove this I took out the Start and End Array writes and ran the routine. Memory usage stayed at 500 MB and the procedure ran properly.
My guess is that this is a bug that got fixed in the JsonWriter but not in the lesser-used BsonWriter.
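For readers on the newer Newtonsoft.Json.Bson package, the same idea (no enclosing array, so nothing accumulates across items) might be sketched as follows. This is an assumption-laden illustration, not the code from this answer: the class and method names are hypothetical, items are assumed to be non-null and to serialize as objects, and the resulting file is a sequence of independent BSON documents rather than a single BSON array, so whatever reads it must expect that layout:

```csharp
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;
using Newtonsoft.Json.Bson; // Separate NuGet package since Json.NET 10.0.1.

public static class BsonSequenceSketch
{
    // Writes each item as its own root-level BSON document, back to back.
    // NOTE: the output is a concatenation of documents, not one BSON array.
    public static void WriteAsDocumentSequence<TItem>(IEnumerable<TItem> items, Stream stream)
    {
        var serializer = JsonSerializer.CreateDefault();
        foreach (var item in items)
        {
            // A fresh writer per item keeps the cached token tree
            // limited to a single item at a time.
            using (var writer = new BsonDataWriter(stream) { CloseOutput = false })
            {
                serializer.Serialize(writer, item);
            }
        }
    }
}
```

It would be called as BsonSequenceSketch.WriteAsDocumentSequence(regions, stream). Unlike the seek-based helper in Answer 1, this does not need Stream.CanSeek, at the cost of a non-standard container format.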
Source: https://stackoverflow.com/questions/33451929/outofmemory-exception-with-streams-and-bsonwriter-in-json-net