Is there a fast and non-fancy C# code/algorithm to compress a string of comma separated digits close to maximum info density?

一曲冷凌霜 提交于 2019-12-07 21:24:58

问题


In a nutshell, I programmed myself into a corner by creating a CLR aggregate that performs row id concatenation, so I say:

select SumKeys(id), name from SomeTable where name='multiple rows named this'

and I get something like:

SumKeys         name
--------        ---------
1,4,495         multiple rows named this

But it dies when SumKeys gets > 8000 chars and I don't think I can do anything about it.

As a quick fix (it's only failing 1% of the time for my application) I thought I might compress the string down and I thought some of you bright people out there might know a slick way to do this.

Something like base64 made for 0-9 and a comma?


回答1:


You'd be much better of if you figure out more reasonable storage for your data (maybe HashSet)...

But for compression try regular System.IO.Compression.GZipStream ( http://msdn.microsoft.com/en-us/library/system.io.compression.gzipstream.aspx ) and convert resulting byte array to base64 string if needed... or store as byte array.




回答2:


How about a hexadecimal representation, where every digit represents a 4-bit half of a character byte (a nibble), with 0xa used as the comma? You will only get a 50% compression, but it is fast and simple.




回答3:


Not sure how "fancy" you'd consider it, but zip/gzip compression is highly effective for any text (sometimes to the tune of 90% reduction or better). Since you're already working with C# and CLR integration, it hopefully wouldn't be too hard to setup/deploy. I haven't tinkered with any C# libraries for compression yet, but it's easy to find them. For example: http://sharpdevelop.net/OpenSource/SharpZipLib/ or http://dotnetzip.codeplex.com/ or even http://msdn.microsoft.com/en-us/library/system.io.compression.gzipstream.aspx

Or an easier option might be to switch your field to text or varchar/nvarchar(max), if that's feasible.




回答4:


You can use a Huffman tree. This is basically an algorithm to compress ascii into binary. I was told that it is basically what WinZIP uses, but I'm not sure if that is really true or not. I did a quick search for huffman coding c# and there seems to be at least one decent implementation out there, though I haven't used any of them.

If your "vocabulary" is just digits and commas, a Hoffman tree will get you very good compression.

http://www.enusbaum.com/blog/2009/05/22/example-huffman-compression-routine-in-c/




回答5:


try:

SELECT name, GROUP_CONCAT(id) FROM SomeTable GROUP BY name WHERE name = 'multiple rows named this'



回答6:


I came across a method that will work with SQL Server:

SELECT
  STUFF((
    SELECT ','+id FROM SomeTable a WHERE a.name = b.name FOR XML PATH('')
  ),1,1,'') AS SumKeys, name
FROM SomeTable b
GROUP BY name
WHERE name = 'multiple rows named this'

The WHERE clause is optional



来源:https://stackoverflow.com/questions/6023117/is-there-a-fast-and-non-fancy-c-sharp-code-algorithm-to-compress-a-string-of-com

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!