
Try to compress strings #160

Open
GoogleCodeExporter opened this issue Mar 24, 2015 · 36 comments

Comments

@GoogleCodeExporter

We could create a separate cluster for strings and try to compress them...

Original issue reported on code.google.com by marianopeck on 29 May 2012 at 1:54

@GoogleCodeExporter
Author

I'd like that very much. We have a graph in which we also store XML data. Not much, but it accumulates. There's a lot of whitespace, for instance (visible when looking at the .fuel file in an editor), so there's a lot of room for compression. Probably even a very fast, low-ratio compression could reduce file sizes greatly if there is a lot of text to be serialized.

Original comment by [email protected] on 20 Feb 2013 at 10:38

@GoogleCodeExporter
Author

I took that idea from last night and implemented quick deflate / inflate logic, so that all that XML data is now stored in a byte array instead. That gives me fuel files of 12.8MB to 35.9MB and an image size of 47.3MB to 70.4MB.

Granted, if we were to integrate something like this into Fuel, we'd have to use some mechanism like a threshold, because
1. we would want as few deflate operations as possible
2. fragments below a certain size actually grow when deflated
(a threshold-guarded variant is sketched right after the example below)

So something like (very roughly):

ByteString >> serializeOn: anEncoder
    | stream |
    stream := ZLibWriteStream on: ByteArray new.
    stream
        nextPutAll: self asByteArray;
        close.
    "delegate serialization of the compressed bytes to ByteArray"
    stream encodedStream contents serializeOn: anEncoder
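
And a minimal sketch of the threshold idea from points 1 and 2 above (the 200-character cutoff is an arbitrary assumption to be tuned by measurement, and the fallback via super assumes a plain, uncompressed path exists):

ByteString >> serializeOn: anEncoder
    | stream |
    "small fragments grow when deflated, so serialize them as-is
     (200 is an arbitrary cutoff, not a measured value)"
    self size < 200 ifTrue: [ ^ super serializeOn: anEncoder ].
    stream := ZLibWriteStream on: ByteArray new.
    stream
        nextPutAll: self asByteArray;
        close.
    stream encodedStream contents serializeOn: anEncoder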

Original comment by [email protected] on 21 Feb 2013 at 9:19

@GoogleCodeExporter
Author

Nice!

I didn't understand this part:
"12.8MB to 35.9MB and an image size of 47.3MB to 70.4MB."

So the fuel file is reduced from 35MB to 12MB using this?
Why is the image size reduced?

What about the time performance? (#timeToRun is enough for me!)

Original comment by [email protected] on 21 Feb 2013 at 10:27

@GoogleCodeExporter
Author

BTW, Mariano: what happened with your LZ4 work?
http://marianopeck.wordpress.com/2012/11/16/lz4-binding-for-pharo/

Original comment by [email protected] on 21 Feb 2013 at 10:32

@GoogleCodeExporter
Author

Yes. To give you an estimate: the model contains 3068 documents with an average source string length of 10463 characters; the average length of the compressed strings is 2922 bytes. (Roughly 3068 × 10463 ≈ 32.1 MB of raw text versus 3068 × 2922 ≈ 9.0 MB compressed; the ~23.1 MB saved matches the 35.9 MB − 12.8 MB difference between the fuel files.)

See below.

Oops, I think you misunderstood me. I didn't implement anything in Fuel for this. I simply compressed all the XML strings in my model and compared the sizes of the fuel files and images with / without compression. But I might quickly implement something like this in Fuel (like in the example) and compare the runtimes. I'll let you know.

Original comment by [email protected] on 21 Feb 2013 at 11:52

@GoogleCodeExporter
Author

I will answer quickly (and later with more details). This issue was meant to add compression to the whole string cluster, that is, to compress ALL strings together (in one compression). I found this quite complicated and never found the time to really implement it.

The other possibility is to compress EACH string... but of course, this gives much smaller compression gains. For particular cases, though, like the bioinformatics one, it is still very useful. And there you don't need anything special from Fuel; just use the substitution hook.

Static way:

ByteString >> fuelAccept: aGeneralMapper
    "substitute DNA sequences by their zipped version at serialization time
     (BioParser and isDNASequence come from a bioinformatics library)"
    ((BioParser tokenizeFasta: self) second isDNASequence)
        ifTrue: [
            aGeneralMapper
                visitSubstitution: self
                by: self zipped
                onRecursionDo: [ super fuelAccept: aGeneralMapper ] ]
        ifFalse: [ super fuelAccept: aGeneralMapper ]



Dynamic way:


objectToSerialize := Array
    with: 'hello'
    with: (FileStream readOnlyFileNamed: 'GGA28.fa') contents.
threshold := 1000.

FileStream forceNewFileNamed: 'demo.fuel' do: [ :aStream |
    aSerializer := FLSerializer newDefault.
    "substitute large, not-yet-zipped strings by their zipped version"
    aSerializer analyzer
        when: [ :o | o isString and: [ o size > threshold and: [ o isZipped not ] ] ]
        substituteBy: [ :o | o zipped ].
    aSerializer
        serialize: objectToSerialize
        on: aStream binary ].

result := FileStream oldFileNamed: 'demo.fuel' do: [ :aStream |
    aMaterialization := FLMaterializer newDefault materializeFrom: aStream binary.
    "unzip the substituted strings and swap them back in place"
    zippedStrings := aMaterialization objects
        select: [ :o | o isString and: [ o isZipped ] ].
    unzippedStrings := zippedStrings collect: [ :o | o unzipped ].
    zippedStrings elementsExchangeIdentityWith: unzippedStrings.
    aMaterialization root ].



And yes, I recommend using LZ4 for this since it gives good enough compression in very little time.
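
Hypothetical usage of the binding, assuming it exposes class-side compress:/uncompress: selectors on byte arrays (check the blog post linked above for the actual API):

"aBigString is a placeholder for a large string to compress"
compressed := LZ4 compress: aBigString asByteArray.
decompressed := (LZ4 uncompress: compressed) asString.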

Original comment by marianopeck on 21 Feb 2013 at 11:59

@GoogleCodeExporter
Author

Hm, what was the problem, do you recall? Because at first glance it seems pretty straightforward:

stream := ZLibWriteStream on: ByteArray new.
cluster objects do: [ :string |
    stream nextPutAll: string asByteArray ].
stream close.

bytesToSerialize := stream encodedStream contents.


Or something like this...

Original comment by [email protected] on 21 Feb 2013 at 12:10

@GoogleCodeExporter
Author

Hi Max. The problem was related to the "indexes". In other words, while the graph is being visited during analysis/serialization, you record certain offsets/indexes/positions for the visited strings; then you compress, so the cluster becomes smaller. Then during materialization, when I needed to materialize a string, it was difficult because the recorded indexes were effectively shifted: they pointed into the uncompressed data.

Maybe there is a workaround....

Original comment by marianopeck on 21 Feb 2013 at 12:15

@GoogleCodeExporter
Author

Ah yes, I see. Maybe there's a need for pre-analysis hooks. But as you wrote, 
most of this can be done manually, especially if you know that you have large 
amounts of uniform data.

Original comment by [email protected] on 21 Feb 2013 at 12:19

@GoogleCodeExporter
Author

Hi Max. Needing a pre-analysis hook is not the big problem. The big complexity is how to compress all the strings of the cluster together rather than compressing each string individually (as they do in bioinformatics and as I posted above).

But if you want to give it a try Max, please be my guest. Sometimes new blood 
just works better :)

Original comment by marianopeck on 21 Feb 2013 at 9:16

@GoogleCodeExporter
Author

I've been thinking about this and I'd like to give it a try. Might be a while 
though, since this is really not a pressing issue. 

Original comment by [email protected] on 22 Feb 2013 at 7:48

@GoogleCodeExporter
Author

Please go ahead. And let me know how it goes :) Basically, the idea is to be able to compress/uncompress the strings of the cluster all together. And the same for symbols.
For the first step, don't worry about the compression algorithm; use whatever. Then, if it works, I will give it a try with LZ4 :)
That would be super cool.

Original comment by marianopeck on 23 Feb 2013 at 4:04

@GoogleCodeExporter
Author

[deleted comment]

@GoogleCodeExporter
Author

I hacked together a very rough version (really a proof of concept only) with an arbitrary encoding strategy. Load all the attachments and try it with:

o := Dictionary new
    add: 1 -> 'bar';
    add: 2 -> { 'foo'. 'baz' };
    yourself.

FLSerializer serialize: o toFileNamed: 'foo'.
FLMaterializer materializeFromFileNamed: 'foo'


Seems to work :)

Note that I simply chose the path of least resistance by subclassing a cluster. There's probably a better way.

Original comment by [email protected] on 24 Feb 2013 at 5:38


@GoogleCodeExporter
Author

Hi Max. I took a look at the code. It looks quite similar to what I did some time ago. But I don't know why yours seems to work while mine didn't :)

What about doing the following (for TRUNK, not 1.9):

1) Add the FLByteStringCluster
2) Make both strings and symbols use ByteStringCluster.
3) Make ByteStringCluster delegate to a CompressorStrategy, to which we send the string and which answers the string to really write to the stream (see the sketch after this list).
4) Make a concrete subclass of CompressorStrategy called NoCompressionStrategy and use it by default. It will just answer the same string.
5) Add a class-side setter to ByteStringCluster to set other compressors, and write a ZLib compressor subclass.
6) Then we can do an LZ4 compressor subclass :)
7) Create one subclass of FLStreamStrategy per compressor type. This way, we can run all tests for a particular compressor and see if it works. Look at previous versions of FuelCompression, and take a look at FLGZipStrategy.
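
A rough sketch of steps 3–5 (all class and selector names here are hypothetical, just to illustrate the shape; Compressor is assumed to be a class variable of the cluster):

Object subclass: #FLCompressorStrategy
    instanceVariableNames: ''
    classVariableNames: ''
    package: 'Fuel-Compression'.

FLCompressorStrategy >> compress: aString
    "answer the bytes to really write to the stream"
    ^ self subclassResponsibility

FLCompressorStrategy subclass: #FLNoCompressionStrategy
    instanceVariableNames: ''
    classVariableNames: ''
    package: 'Fuel-Compression'.

FLNoCompressionStrategy >> compress: aString
    "the default: just answer the same string"
    ^ aString

FLCompressorStrategy subclass: #FLZLibCompressionStrategy
    instanceVariableNames: ''
    classVariableNames: ''
    package: 'Fuel-Compression'.

FLZLibCompressionStrategy >> compress: aString
    ^ aString zipped

FLByteStringCluster class >> compressor: aCompressorStrategy
    "step 5: class-side setter to plug in another compressor"
    Compressor := aCompressorStrategy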

What do you think?

Thanks Max, this was pretty coooooool!!!!

Original comment by marianopeck on 24 Feb 2013 at 7:17

@GoogleCodeExporter
Author

I'm just lucky :)

Sounds good to me.

My pleasure!

Original comment by [email protected] on 24 Feb 2013 at 8:00

@GoogleCodeExporter
Author

Excellent! Sounds very good.
Then I will take a look.
For now, I think everything in "trunk" (i.e. the main repo) should stay 1.9-compatible... so can we add this code to trunk after 1.9 is released? Or alternatively put it in FuelExperiments?

Original comment by [email protected] on 24 Feb 2013 at 8:24

@GoogleCodeExporter
Author

If I start to implement something I'll put it into the experiments repo.

Original comment by [email protected] on 24 Feb 2013 at 9:00

@GoogleCodeExporter
Author

Be careful, because SqueakSource became read-only, and FuelExperiments is on SS. We should create a FuelExperiments repo on SmalltalkHub and migrate it there....

Original comment by marianopeck on 25 Feb 2013 at 12:19

@GoogleCodeExporter
Author

I already committed to experiments. The only thing you can't do is create new 
projects. 

Original comment by [email protected] on 25 Feb 2013 at 7:10

@GoogleCodeExporter
Author

Cool, I'm anxious to benchmark it!

Original comment by [email protected] on 25 Feb 2013 at 10:12

@GoogleCodeExporter
Author

Max, you are right, I am always confused about that :)
I am also very anxious to test. Maybe we can just use the benchmarks/samples for strings/symbols.
I also want to test with LZ4. It is quite easy to test; in fact, the readme explains everything: http://smalltalkhub.com/#!/~marianopeck/LZ4/


Original comment by marianopeck on 25 Feb 2013 at 10:56

@GoogleCodeExporter
Author

Martin made an interesting suggestion yesterday. The compression could also be 
made pluggable by passing different streams to FLSerializer (like we already 
try experimentally with GZip).
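
For illustration, something like this (a sketch only: GZipWriteStream/GZipReadStream are the stock Squeak/Pharo compressing streams, and I'm assuming they can wrap a binary file stream directly, as the old FLGZipStrategy did; objectToSerialize is from the earlier example):

"serialize the whole graph through a compressing stream wrapper"
FileStream forceNewFileNamed: 'demo.fuel.gz' do: [ :file |
    | gzipStream |
    gzipStream := GZipWriteStream on: file binary.
    FLSerializer newDefault serialize: objectToSerialize on: gzipStream.
    gzipStream close ].

"materialize by reading back through the matching decompressor"
FileStream oldFileNamed: 'demo.fuel.gz' do: [ :file |
    FLMaterializer newDefault materializeFrom: (GZipReadStream on: file binary) ].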

Although I like the idea for its simplicity, after having given it some thought I don't think it's flexible enough:

1. The user would have to provide the correct stream if they don't use the class-side methods.
2. A compressing stream would compress *all* contents, which would consume a lot of time and slow Fuel down.
3. If we used a stream wrapper to select which objects to compress (like strings) and which not to, that would be feasible but would put the responsibility in the hands of the wrong objects (in my opinion). Neither streams nor en-/decoders should be concerned with the data they write, only with the writing itself.

I will therefore continue working with Mariano's proposal for the implementation for now.

Original comment by [email protected] on 27 Feb 2013 at 10:08

@GoogleCodeExporter
Author

Well, I also agree with my idea hahahah (thank God haha). What Martin proposes is already available out of the box, since it has nothing to do with Fuel itself: just pass around a compression stream and that's all. In fact, that's what FuelCompression used to do :)

But for the reasons you mention above, I think our other alternative is worth a try!

Original comment by marianopeck on 27 Feb 2013 at 11:26

@GoogleCodeExporter
Author

I don't disagree! On the contrary, I think it's cool to experiment with these ideas.

Original comment by [email protected] on 6 Mar 2013 at 3:52

@GoogleCodeExporter
Author

OK. I followed Max's idea and found a few problems and possible improvements.

- There was a bug with strings bigger than 255 characters, because we used only 1 byte to store the size. We now use either one byte or four (see the sketch below). If someone can improve this even more, cool; most strings fit in 1 byte, so that's already good.

- The [0000] mark was unnecessary and meant 4 extra bytes PER string.

- It now supports both strings and symbols.
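
A sketch of the one-byte-or-four length encoding described above (the selectors and the 255 sentinel are assumptions; only the 1-vs-4-byte scheme comes from the comment):

FLEncoder >> encodeStringLength: aSize
    "lengths below 255 fit in a single byte; otherwise write the
     sentinel 255 followed by the full length in 4 bytes"
    aSize < 255
        ifTrue: [ self encodeByte: aSize ]
        ifFalse: [
            self encodeByte: 255.
            self encodeUint32: aSize ]

FLDecoder >> decodeStringLength
    | first |
    first := self decodeByte.
    ^ first < 255
        ifTrue: [ first ]
        ifFalse: [ self decodeUint32 ]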

Original comment by marianopeck on 2 Oct 2013 at 10:20

@GoogleCodeExporter
Author

BTW, I committed to http://smalltalkhub.com/mc/Fuel/Experiments/main

Original comment by marianopeck on 2 Oct 2013 at 10:21

@stale

stale bot commented May 18, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will remain open but will probably not come into focus. If you still think this should receive some attention, leave a comment. Thank you for your contributions.

@theseion
Owner

@tinchodias @marianopeck We should totally do this! This can be such a big improvement, and with 4.0.0 we can make it configurable very easily.

@tinchodias
Collaborator

Wow, I don't remember much about this feature, but it's great to have the discussion from 2013. I just re-read it.
Would you recover the old code, or implement it from scratch?

@theseion
Owner

theseion commented Nov 6, 2021

I don't know if I still have the code. But I would want to take a look at it; there are probably some good ideas in there.

@stale

stale bot commented Jan 5, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will remain open but will probably not come into focus. If you still think this should receive some attention, leave a comment. Thank you for your contributions.

@stale stale bot added the stale label Jan 5, 2022
@theseion
Owner

theseion commented Jan 5, 2022

Not stale

@stale stale bot removed the stale label Jan 5, 2022
@stale

stale bot commented Mar 6, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will remain open but will probably not come into focus. If you still think this should receive some attention, leave a comment. Thank you for your contributions.

@stale stale bot added the stale label Mar 6, 2022
@theseion
Owner

theseion commented Mar 6, 2022

Not stale.

@stale stale bot removed the stale label Mar 6, 2022
@theseion theseion added the pinned Never mark this issue stale label Mar 6, 2022
@tinchodias
Collaborator

This was good...
