Strings

Whilst I understand and appreciate the rationale behind "fixed length" records, and the need to store a string's offset in a field position, I feel that the current scheme may be overcomplicated.

Why couldn't proparse.jar return 3 blobs: Header, Record and Strings?

The Strings blob would simply be a collection of null-terminated strings (no sizes etc.):

FOO\0
BAR\0

Each record field that is a string could simply contain the offset of the string's starting position in the Strings blob. The Progress GET-STRING() function reads a string until it hits a NULL, so we don't need to record the size of the string.
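To make the proposed layout concrete, here is a minimal sketch in Python (not ABL, purely for illustration). The function names are made up for this example and are not part of proparse, and the offsets are 0-based, whereas ABL MEMPTR positions are 1-based.

```python
def build_strings_blob(strings):
    """Pack strings back to back, each followed by a single null byte.

    Returns the blob and the starting offset of each string.
    (Hypothetical helper, not part of proparse.)
    """
    blob = bytearray()
    offsets = []
    for s in strings:
        offsets.append(len(blob))          # where this string starts
        blob += s.encode("utf-8") + b"\x00"
    return bytes(blob), offsets


def read_cstring(blob, offset):
    """Read from offset up to (not including) the next null byte."""
    end = blob.index(b"\x00", offset)
    return blob[offset:end].decode("utf-8")


strings_blob, string_offsets = build_strings_blob(["FOO", "BAR"])
second = read_cstring(strings_blob, string_offsets[1])
```

A record field would then hold only the number from string_offsets, with no size stored anywhere.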

So, instead of having

ASSIGN lv_RecordOffset = GET-LONG(DataBlob, offsetOfIndex + (GET-LONG(DataBlob, FirstOffset + p_Node * 4)) * 4) + 1.
lv_OffSet = GET-LONG(DataBlob, offsetOfIndex + (GET-LONG(DataBlob, lv_RecordOffset + TextOffSet)) * 4) + 2.
[snip]
NodeText = GET-STRING(DataBlob, lv_OffSet + 4, GET-LONG(DataBlob, lv_OffSet)).

we would have

ASSIGN lv_RecordOffset = GET-LONG(DataBlob, offsetOfIndex + (GET-LONG(DataBlob, FirstOffset + p_Node * 4)) * 4) + 1.
[snip]
NodeText = GET-STRING(StringBlob, GET-LONG(DataBlob, lv_RecordOffset + {&TextOffSet})).

Julian


john

Strings, on second thought

After sleeping on this, I'm not so sure that it would really be best to make those changes.

By creating a Strings segment, we would be treating Strings differently from other Object types. Keep in mind that Strings are Objects that can appear in Lists of mixed Object types, and we need to be able to iterate through the index numbers in the List. So Strings do have to have index numbers, and they might as well be treated the same as other Objects.

Since this code is going to be at the lowest level in the API, I'm not too worried about it being a little verbose. Low-level API functions like the ones I wrote in xferblob.i should be used instead. For example, a call to stringFromIndexAt() takes the place of a few lines of code.

Putting the logic inline, rather than using function (method) calls, would be an over-optimization. This is especially true given that, when analyzing the syntax tree, fetching node or token text is certainly not the most common operation. (The most common operations are fetching the node type, and fetching firstChild/nextSibling.)

Now let's consider the performance of null-terminated strings versus strings prefixed by length. With the length prefix, the GET-STRING function is able to read N bytes and create a string from that. With null-terminated strings, it must read one character at a time and check at every character whether it is the terminating null. So I'm not sure that removing the one GET-LONG() call from the ABL code would actually make performance better. IIRC, Java serialization uses a length prefix rather than a null terminator, so it may be that people who have implemented these sorts of things in the past have found that the length prefix is actually faster.
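To illustrate the difference between the two read paths, here is a rough sketch in Python (not ABL). It assumes a 4-byte little-endian length prefix, which is my guess at the layout for the sake of the example and not necessarily what proparse.jar writes. The function names are made up. It only shows the shape of the work each reader does and is not a benchmark.

```python
import struct


def write_length_prefixed(s):
    """Encode a string as a 4-byte length followed by its bytes.
    (Little-endian is an assumption made for this sketch.)"""
    data = s.encode("utf-8")
    return struct.pack("<i", len(data)) + data


def read_length_prefixed(blob, offset):
    """One integer read, then one slice of exactly `size` bytes."""
    (size,) = struct.unpack_from("<i", blob, offset)
    start = offset + 4
    return blob[start:start + size].decode("utf-8")


def read_null_terminated(blob, offset):
    """Scan byte by byte, testing each one for the terminating null."""
    end = offset
    while blob[end] != 0:
        end += 1
    return blob[offset:end].decode("utf-8")


# One string stored each way, back to back in the same buffer.
sample = write_length_prefixed("FOO") + b"BAR\x00"
first = read_length_prefixed(sample, 0)
second = read_null_terminated(sample, 7)   # 4-byte prefix + 3 bytes of "FOO"
```

The length-prefixed reader does one extra integer read but no per-byte test, which is the trade-off described above.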

There are advantages and disadvantages to each approach. I'd be happy to make a change if a new approach were clearly better than the old one, but in this case it's not really clear-cut.


Fair enough. It's your blob

Fair enough. It's your blob :)


john

Re: Strings

That makes sense. In fact, we shouldn't even need the strings to be a separate blob. The offset of the first string can be written to the header, and all string offsets would be relative to that.
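As a rough sketch of that single-blob idea, here it is in Python (not ABL or the Java builder). The one-field header, the 4-byte little-endian integers and the function names are all assumptions made up for this example, and offsets are 0-based.

```python
import struct

HEADER_FORMAT = "<i"                        # hypothetical header: offset of the first string
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)
FIELD_SIZE = 4                              # one 4-byte string field per record


def build_blob(strings):
    """Lay out header, then records, then the string section in one blob."""
    record_section = bytearray()
    string_section = bytearray()
    for s in strings:
        # Each record stores its string's offset relative to the string section.
        record_section += struct.pack("<i", len(string_section))
        string_section += s.encode("utf-8") + b"\x00"
    strings_base = HEADER_SIZE + len(record_section)
    header = struct.pack(HEADER_FORMAT, strings_base)
    return header + bytes(record_section) + bytes(string_section)


def read_record_string(blob, record_index):
    """Resolve a record's relative string offset against the header's base."""
    (strings_base,) = struct.unpack_from(HEADER_FORMAT, blob, 0)
    (relative,) = struct.unpack_from("<i", blob, HEADER_SIZE + record_index * FIELD_SIZE)
    start = strings_base + relative
    end = blob.index(b"\x00", start)
    return blob[start:end].decode("utf-8")


blob = build_blob(["FOO", "BAR"])
texts = [read_record_string(blob, i) for i in range(2)]
```

Because the string offsets are relative, the record section can grow or shrink without rewriting them. Only the base offset in the header changes.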

(A secondary design goal was for this Java Xfer blob builder to be general purpose, so I could use it elsewhere, and for all its data to be self-contained.)

I'll take a look, and if I don't find any problems with it, I'll make the change.