Reducing XML file size by tokenizing?
jamesmcness
02-22-2009, 08:30 AM
I've got an XML file which is sitting at about 300k and will only get bigger.
The tag names are quite verbose so it's frustrating because it's obvious the file size would be a lot smaller if those tag names were only a couple of characters long.
I'm not sure if I'm using the term "tokenizing" right, but does anyone know of a method to convert verbose text into a series of unique shorthand "tokens" or markers... and back again?
I don't want to change the tag names as the logic within flash refers to them so it would mean an annoying rewrite.
Any ideas?
Arif-sama
02-22-2009, 10:36 AM
Annoying rewrite? A replace function will do it in a couple of seconds.
P.S. That doesn't mean there is no way to "tokenize" your XML :)
jamesmcness
02-22-2009, 11:16 AM
Bit beside the point either way as I prefer the verbose name for comprehension.
Even though in my case a simple search/replace wouldn't actually work so straightforwardly, I was just trying to pre-empt suggestions to change the names ;)
wvxvw
02-22-2009, 12:51 PM
You may try something like this:
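// Maps each original element name to its short token.
var hash:Object = {};
// Letter pool: one letter per hex digit (0 to 15).
var generator:String = "ABCDEFGHIJKLMNOP";
// Counter used to mint the next token.
var uidindex:uint;
var xml:XML =
    <data>
        <node attr="foo"/>
        <node>
            <bar>
                <blah>
                    <foo attr="bar"/>
                    <node check="1"/>
                </blah>
            </bar>
        </node>
        <bar/>
    </data>;

// Creates a new token for nodeName and records it in hash.
function generateUID(nodeName:String):String
{
    uidindex++;
    var pin:String = "";
    var upos:uint = uidindex;
    var pos:int;
    while (upos)
    {
        // The lowest hex digit picks the next letter, then gets shifted out.
        pos = upos % 16;
        upos /= 16;
        pin += generator.charAt(pos);
    }
    hash[nodeName] = pin;
    return pin;
}

// Renames inxml and all of its descendant elements in place.
function batchRename(inxml:XML):XML
{
    if (hash[inxml.name()])
    {
        inxml.setName(hash[inxml.name()]);
    } else {
        inxml.setName(generateUID(inxml.name()));
    }
    // The filter expression is only used to call batchRename on each child.
    inxml.*.(batchRename(valueOf()));
    return inxml;
}

trace(batchRename(xml).toXMLString());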
Note that this renames every element in your XML (attributes are left alone), giving each one a shorter name.
jamesmcness
02-22-2009, 12:58 PM
How would you reverse that back to the verbose form?
wvxvw
02-22-2009, 01:10 PM
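// Maps each original element name to its short token.
var hash:Object = {};
// Letter pool: one letter per hex digit (0 to 15).
var generator:String = "ABCDEFGHIJKLMNOP";
// Counter used to mint the next token.
var uidindex:uint;
var xml:XML =
    <data>
        <node attr="foo"/>
        <node>
            <bar>
                <blah>
                    <foo attr="bar"/>
                    <node check="1"/>
                </blah>
            </bar>
        </node>
        <bar/>
    </data>;

// Creates a new token for nodeName and records it in hash.
function generateUID(nodeName:String):String
{
    uidindex++;
    var pin:String = "";
    var upos:uint = uidindex;
    var pos:int;
    while (upos)
    {
        // The lowest hex digit picks the next letter, then gets shifted out.
        pos = upos % 16;
        upos /= 16;
        pin += generator.charAt(pos);
    }
    hash[nodeName] = pin;
    return pin;
}

// Renames inxml and all of its descendant elements in place.
function batchRename(inxml:XML):XML
{
    if (hash[inxml.name()])
    {
        inxml.setName(hash[inxml.name()]);
    } else {
        inxml.setName(generateUID(inxml.name()));
    }
    // The filter expression is only used to call batchRename on each child.
    inxml.*.(batchRename(valueOf()));
    return inxml;
}

// Restores the verbose names by looking each token up in hash.
function reverse(inxml:XML):XML
{
    var oldName:String = inxml.name();
    for (var p:String in hash)
    {
        // hash maps verbose name -> token, so search for the matching token.
        if (hash[p] == oldName)
        {
            inxml.setName(p);
            break;
        }
    }
    inxml.*.(reverse(valueOf()));
    return inxml;
}

trace(batchRename(xml).toXMLString());
trace(reverse(xml).toXMLString());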
:)
jamesmcness
02-22-2009, 01:35 PM
Thanks W, that's pretty much along the lines of what I'm looking for. I'll have to work out a solution for n number of tags though as it could easily be in the hundreds.
wvxvw
02-22-2009, 01:44 PM
This solution gives you 0xFFFFFFFF unique names... I believe that's a lil bit more than hundreds :)
jamesmcness
02-22-2009, 01:55 PM
Maybe I didn't understand it properly. I reduced the number of letters down to 4 and it freaked out. I should have increased the number of nodes instead, I guess.
wvxvw
02-22-2009, 01:58 PM
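With 4 letters the code as posted breaks: it still does % 16 and /= 16, so any digit from 4 to 15 makes generator.charAt(pos) return an empty string, and you get empty or duplicate names. If you shrink the pool you also have to change both 16s to the pool length, and even then names of up to 4 letters from a 4-letter pool only give you 4 ^ 4 - 1 = 255 (0xFF) unique names...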
Just out of curiosity, why do you need to reduce the number of letters?
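One way to make the pool size adjustable is to derive the base from the pool length instead of hard-coding 16. Below is a minimal sketch, written in TypeScript rather than ActionScript so it can be run outside Flash. The names tokenFromPool and distinctTokens are hypothetical helpers for illustration, not part of the code posted above.

```typescript
// Hypothetical helper: the base follows the pool length, so any pool of
// two or more distinct letters works.
function tokenFromPool(id: number, pool: string): string {
  const base = pool.length;
  if (base < 2) {
    throw new Error("pool needs at least 2 letters");
  }
  let rest = Math.floor(id);
  let token = "";
  while (rest > 0) {
    token += pool.charAt(rest % base); // lowest digit picks the next letter
    rest = Math.floor(rest / base);    // drop that digit
  }
  return token;
}

// Hypothetical check: how many distinct tokens do the IDs 1..count produce?
function distinctTokens(count: number, pool: string): number {
  const seen: { [token: string]: boolean } = {};
  let distinct = 0;
  for (let id = 1; id <= count; id++) {
    const token = tokenFromPool(id, pool);
    if (seen[token] !== true) {
      seen[token] = true;
      distinct++;
    }
  }
  return distinct;
}

console.log(tokenFromPool(4, "ABCD"));     // AB
console.log(distinctTokens(1000, "ABCD")); // 1000
```

A smaller pool still yields unique tokens this way; they just get longer sooner.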
jamesmcness
02-22-2009, 02:03 PM
Sorry, it's late. I was trying to work out how many different "tokens" it would create. I was guessing it just created single-character tokens from the hash pool. I figured that the dummy XML in your code wouldn't push it past the 16 options. So instead of adding more nodes, I reduced the number of chars in the hash pool.
Flawed logic I know, as I didn't fully understand your implementation.
Is it actually just counting up in hex numbers?
wvxvw
02-22-2009, 03:43 PM
Yes, kind of... it takes the remainder of the numeric ID divided by 16 for the first letter of the token, then divides the numeric ID by 16 and takes the remainder again for the next letter.
I.e.
0x 12 34 56 78 ---> % 16 = 8 (take the letter #8)
0x 12 34 56 7 ---> % 16 = 7 (take the letter #7)
0x 12 34 56 ---> % 16 = 6 (take the letter #6)
... etc
Thus you'll never get a name longer than 8 characters (the ID is a 32-bit uint, i.e. 8 hex digits), and each name will be unique.
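The digit-by-digit walk described above can be sketched as follows. This is TypeScript rather than ActionScript so it can be run outside Flash, and tokenFor is a hypothetical name; the letter pool and the base of 16 are taken from the posted code.

```typescript
// Letter pool copied from the thread; its length must match BASE.
const GENERATOR = "ABCDEFGHIJKLMNOP";
const BASE = 16;

// Turn a positive integer ID into a token, lowest hex digit first.
function tokenFor(id: number): string {
  let rest = id >>> 0; // treat the ID as an unsigned 32-bit value, like uint
  let token = "";
  while (rest > 0) {
    token += GENERATOR.charAt(rest % BASE); // low digit picks the next letter
    rest = Math.floor(rest / BASE);         // drop that digit
  }
  return token;
}

console.log(tokenFor(1));          // B
console.log(tokenFor(16));         // AB
console.log(tokenFor(0x12345678)); // IHGFEDCB
```

The last call matches the walk-through: digits 8, 7, 6, ... down to 1 select the letters I, H, G, ... down to B.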
jamesmcness
02-23-2009, 01:55 AM
Thanks W, just what I need.
Question...don't you still have a 300KB XML file you need to load? Don't see how shrinking names inside Flash is helping your original problem.
jamesmcness
02-23-2009, 03:43 AM
I'll be saving the shrunken version and then expanding it once loaded (will have to store a legend section within the XML file)
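A rough sketch of that shrink-then-expand round trip with a legend, in TypeScript rather than ActionScript so it can be run outside Flash. Everything here is an assumption for illustration: the names Legend, shortToken, shrink and expand are hypothetical, the legend is kept as a separate object instead of being written into the XML file, and tags are rewritten with a naive regular expression that would also rewrite tag-like text inside comments or CDATA sections.

```typescript
// Hypothetical legend shape: original tag name -> short token.
type Legend = { [name: string]: string };

const POOL = "ABCDEFGHIJKLMNOP";
// Matches the start of an opening or closing tag and captures its name.
const TAG = /<(\/?)([A-Za-z_][\w.\-]*)/g;

function hasOwn(obj: Legend, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(obj, key);
}

// Same counting scheme as in the thread: lowest base-16 digit first.
function shortToken(id: number): string {
  let rest = id;
  let token = "";
  while (rest > 0) {
    token += POOL.charAt(rest % POOL.length);
    rest = Math.floor(rest / POOL.length);
  }
  return token;
}

// Replace every tag name with a token and return the legend alongside.
function shrink(xml: string): { xml: string; legend: Legend } {
  const legend: Legend = {};
  let nextId = 0;
  const out = xml.replace(TAG, (_m: string, slash: string, name: string) => {
    if (!hasOwn(legend, name)) {
      nextId++;
      legend[name] = shortToken(nextId);
    }
    return "<" + slash + legend[name];
  });
  return { xml: out, legend: legend };
}

// Invert the legend once, then restore the verbose tag names.
function expand(xml: string, legend: Legend): string {
  const names: Legend = {};
  for (const name in legend) {
    if (hasOwn(legend, name)) {
      names[legend[name]] = name;
    }
  }
  return xml.replace(TAG, (m: string, slash: string, token: string) =>
    hasOwn(names, token) ? "<" + slash + names[token] : m
  );
}

const verbose =
  '<catalogue><productEntry id="1"><productName>x</productName></productEntry></catalogue>';
const packed = shrink(verbose);
console.log(packed.xml);                                     // <B><C id="1"><D>x</D></C></B>
console.log(expand(packed.xml, packed.legend) === verbose);  // true
```

Inverting the legend once up front avoids scanning the whole table for every element during expansion.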
Ah ok, so you're making a utility out of it. Let us know how much the compression saved you in file size. I'm curious what the result is.
jamesmcness
02-23-2009, 04:09 AM
Sure, will do. I am hoping it will be quite a lot as my file isn't data heavy - it's tag heavy!