I have a requirement where I need to fetch the entire Users collection from RavenDB and compare the retrieved result set with another set of data.
I like Al Dass's solution of getting IDs to operate on instead of complete large objects, and of getting the IDs directly from the index. However, the recursion scares me a bit (even though I think it might be okay), so I replaced it with a loop and removed the reflection.
public List<string> GetAllIds<T>()
{
    var allIds = new List<string>();
    IDocumentSession session = null;
    try
    {
        session = documentStore.OpenSession();
        int queryCount = 0;
        int start = 0;
        while (true)
        {
            // Project only the document ID from the index instead of
            // loading the full documents.
            var current = session.Advanced.DocumentQuery<T>()
                .Take(1024)
                .Skip(start)
                .SelectFields<string>("__document_id")
                .AddOrder("__document_id")
                .ToList();
            if (current.Count == 0)
                break;

            allIds.AddRange(current);
            queryCount += 1;
            start += current.Count;

            // A session allows only 30 requests by default, so recycle it
            // before we hit the limit.
            if (queryCount == 30)
            {
                queryCount = 0;
                session.Dispose();
                session = documentStore.OpenSession();
            }
        }
    }
    finally
    {
        if (session != null)
        {
            session.Dispose();
        }
    }
    return allIds;
}
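If recycling the session by hand feels awkward, another option (a sketch, assuming the RavenDB 3.x client; the database name below is hypothetical) is to raise the per-session request limit up front instead of counting queries:

```csharp
// Raise the default 30-requests-per-session limit globally via the
// store's conventions (RavenDB 3.x client).
var documentStore = new DocumentStore
{
    Url = "http://localhost:8080",
    DefaultDatabase = "Example" // hypothetical database name
};
documentStore.Conventions.MaxNumberOfRequestsPerSession = 1000;
documentStore.Initialize();

// ...or per session, before issuing the paged queries:
using (var session = documentStore.OpenSession())
{
    session.Advanced.MaxNumberOfRequestsPerSession = 1000;
    // run the paged ID queries here
}
```

Keep in mind the limit exists to catch accidentally chatty code, so raising it trades away that safety net.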
Also, this has been updated for RavenDB 3.
You use paging, and read the items 1,024 at a time:
var allUsers = new List<User>();
int start = 0;
while (true)
{
    var current = session.Query<User>().Take(1024).Skip(start).ToList();
    if (current.Count == 0)
        break;

    start += current.Count;
    allUsers.AddRange(current);
}
With a slight twist on @capaj's post, here is a generic way of getting all the document IDs as a list of strings. Note the use of Advanced.LuceneQuery<T>(), SelectFields<T>(idPropertyName), and GetProperty(idPropertyName) to keep things generic. The default assumes Id is a valid property on the given <T> (which should be the case 99.999% of the time). In the event you have some other property as your ID, you can pass its name in as well.
public static List<string> getAllIds<T>(DocumentStore docDB, string idPropertyName = "Id")
{
    return getAllIdsFrom<T>(0, new List<string>(), docDB, idPropertyName);
}

public static List<string> getAllIdsFrom<T>(int startFrom, List<string> list, DocumentStore docDB, string idPropertyName)
{
    var allIds = list;
    using (var session = docDB.OpenSession())
    {
        int queryCount = 0;
        int start = startFrom;
        while (true)
        {
            // Project only the ID property rather than loading full documents.
            var current = session.Advanced.LuceneQuery<T>()
                .Take(1024)
                .Skip(start)
                .SelectFields<T>(idPropertyName)
                .ToList();
            queryCount += 1;
            if (current.Count == 0)
                break;

            start += current.Count;
            allIds.AddRange(current.Select(t => t.GetType().GetProperty(idPropertyName).GetValue(t, null).ToString()));

            // Recurse with a fresh session before hitting the
            // 30-requests-per-session limit.
            if (queryCount >= 28)
            {
                return getAllIdsFrom<T>(start, allIds, docDB, idPropertyName);
            }
        }
    }
    return allIds;
}
An example of where/how I use this is when making a PatchRequest in RavenDB using the BulkInsert session. In some cases I may have hundreds of thousands of documents and can't afford to load them all into memory just to iterate over them again for the patch operation... thus loading only their string IDs to pass into the Patch command.
void PatchRavenDocs()
{
    var store = new DocumentStore
    {
        Url = "http://localhost:8080",
        DefaultDatabase = "SoMeDaTaBaSeNaMe"
    };
    store.Initialize();

    // >>> here is where I get all the doc IDs for a given type <<<
    var allIds = getAllIds<SoMeDoCuMeNtTyPe>(store);

    // create a new patch to ADD a new int property to my documents
    var patches = new[] { new PatchRequest { Type = PatchCommandType.Set, Name = "SoMeNeWPrOpeRtY", Value = 0 } };

    using (var s = store.BulkInsert())
    {
        int cntr = 0;
        Console.WriteLine("ID Count " + allIds.Count);
        foreach (string id in allIds)
        {
            // apply the patch to my document
            s.DatabaseCommands.Patch(id, patches);

            // spit out a record every 2048 rows as a basic sanity check
            if ((cntr++ % 2048) == 0)
                Console.WriteLine(cntr + " " + id);
        }
    }
}
Hope it helps. :)
Building on Ayende's answer, here is a complete method that overcomes the 30-queries-per-session limit and does return all documents of the supplied class:
public static List<T> getAll<T>(DocumentStore docDB)
{
    return getAllFrom(0, new List<T>(), docDB);
}

public static List<T> getAllFrom<T>(int startFrom, List<T> list, DocumentStore docDB)
{
    var all = list;
    using (var session = docDB.OpenSession())
    {
        int queryCount = 0;
        int start = startFrom;
        while (true)
        {
            var current = session.Query<T>().Take(1024).Skip(start).ToList();
            queryCount += 1;
            if (current.Count == 0)
                break;

            start += current.Count;
            all.AddRange(current);

            // Recurse with a fresh session before exceeding the
            // 30-requests-per-session limit.
            if (queryCount >= 30)
            {
                return getAllFrom(start, all, docDB);
            }
        }
    }
    return all;
}
I hope it is not too hacky to do it like this.
I honestly prefer the following function:
public IEnumerable<T> GetAll<T>()
{
    var list = new List<T>();
    RavenQueryStatistics statistics;

    // The first query returns the default page of 128 results and also
    // reports the total number of results via the statistics.
    list.AddRange(_session.Query<T>().Statistics(out statistics));

    if (statistics.TotalResults > 128)
    {
        int toTake = statistics.TotalResults - 128;
        int taken = 128;
        while (toTake > 0)
        {
            list.AddRange(_session.Query<T>().Skip(taken).Take(toTake > 1024 ? 1024 : toTake));
            toTake -= 1024;
            taken += 1024;
        }
    }
    return list;
}
Cheers!
This question was posted before this feature was available in RavenDB, but in case anyone else stumbles upon this now...
The encouraged way to do this is via the Streaming API. The RavenDB client batches the stream so it can automatically 'page' the requests/responses to/from the server. If you opt in to using the Streaming API, the client assumes you "know what you're doing" and does not check the 128/1024/30 limits that are used for regular queries.
var query = session.Query<User>();
using (var enumerator = session.Advanced.Stream(query))
{
    while (enumerator.MoveNext())
    {
        allUsers.Add(enumerator.Current.Document);
    }
}
var count = allUsers.Count;
Tip: Though this is the encouraged way to solve the problem, as a general rule it is best to avoid the situation in the first place. What if there are a million records? That allUsers list is going to get huge. Maybe an index or transformer could be applied first to filter the data down to what you actually need to display to the user or process? Is this for reporting purposes? Maybe RavenDB should be automatically exporting to a SQL Server instance with Reporting Services on it? Etc...
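To illustrate that tip, here is a minimal sketch (assuming the RavenDB 3.x streaming API and a hypothetical IsActive flag on User) that filters the stream and keeps only the document IDs, so full documents never accumulate in memory:

```csharp
// Sketch: stream only matching documents and retain just their IDs,
// instead of materializing every full User into one giant list.
// "IsActive" is a hypothetical property used for illustration.
var ids = new List<string>();
var query = session.Query<User>().Where(u => u.IsActive);

using (var enumerator = session.Advanced.Stream(query))
{
    while (enumerator.MoveNext())
    {
        // Current.Id comes from the stream result; the full document
        // is discarded as soon as we advance the enumerator.
        ids.Add(enumerator.Current.Id);
    }
}
```

The same idea applies to projections: if you only need two fields for display, select just those fields in the query rather than streaming whole documents.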