Schema and speed

One thing I think is costing us speed is the schema of each record. For each node, there is a pointer to the node record. Once we have that record, we then need the schema pointers to get to the value of each field. This is done so that we keep the capability of changing the schema around if required.

This means that

a) we need to include the schema in each blob
b) we need to read the schema to create the record and field offsets
c) each field's position has to be calculated and stored in a variable (sketched below)
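
Purely as an illustration of steps (a) to (c), the current approach looks roughly like this (the variable names, offsets and blob layout here are made up, not the actual proparseclient code):

 DEFINE VARIABLE DataBlob        AS MEMPTR  NO-UNDO.
 DEFINE VARIABLE lv_RecordOffset AS INTEGER NO-UNDO INITIAL 1.
 DEFINE VARIABLE NodeTypeOffSet  AS INTEGER NO-UNDO.
 DEFINE VARIABLE NodeTypeNum     AS INTEGER NO-UNDO.

 SET-SIZE(DataBlob) = 64.          /* stand-in for the real blob            */
 PUT-LONG(DataBlob, 1) = 23.       /* pretend the schema stores the node    */
                                   /* type offset in the first four bytes   */
 PUT-LONG(DataBlob, 24) = 42.      /* pretend node-type value               */

 /* b) read the schema to work out where each field lives                   */
 NodeTypeOffSet = GET-LONG(DataBlob, 1).

 /* c) use the calculated offset every time a field is read                 */
 NodeTypeNum = GET-LONG(DataBlob, lv_RecordOffset + NodeTypeOffSet).

 SET-SIZE(DataBlob) = 0.           /* release the memptr                    */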

How often is the schema going to change?

Could we not define the schema as a set of include files as part of the source?

So, instead of having to define variables, read the schema, assign the variables and then, when creating nodes, call GET-LONG(DataBlob,lv_RecordOffset + NodeTypeOffSet),

we could simply pass the pointer to the blob and the record offset to the node, and the node does

{schema/node.i}

type = GET-LONG(DataBlob,lv_RecordOffset + {&NodeTypeOffSet})
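
The generated schema/node.i would then contain nothing more than preprocessor definitions, something like this (the offsets and extra field names are purely illustrative):

 /* schema/node.i - generated from the proparse schema */
 &SCOPED-DEFINE NodeTypeOffSet   23
 &SCOPED-DEFINE NodeLineOffSet   27
 &SCOPED-DEFINE NodeColumnOffSet 31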

There must be some trick that we can play to force a class to recompile if the schema has changed.

Comments?


tamhas's picture

I agree that it should be

I agree that it should be static, but I'm not very keen on it being in an include file. Why isn't it simply a versioned class, or an aspect of a versioned class?


What I wanted to do is embed

What I wanted to do is embed it into the classes where it is needed, and also create a static class for the schema with the offsets defined as properties, so we have the best of both worlds. For simplicity's sake, I just thought that generating an include file and using it in both classes would be easier, but it is no problem either way.


tamhas's picture

The notion of include files

The notion of include files seems rather counter to the notion of encapsulation.


We are trying to define the

We are trying to define the field offsets as static "inline" values for speed (no need to retrieve them from a class, variable, etc.):

 &SCOPED-DEFINE NodeTypeOffSet 23
...
 NodeTypeNum  = GET-LONG(DataBlob,lv_RecordOffset + {&NodeTypeOffSet}) 
...

So we needed some way of maintaining the "&SCOPED-DEFINE NodeTypeOffSet 23" definition.

Either we manually sync the class definitions with the schema changes (that's probably not going to be that often) or we generate the definitions automatically when the schema changes and "slurp" them into the class file. For me, the only way I could see this was to dump the definitions into an include file and use that include file in the class.
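
As a rough sketch of the "generate" option (ttField here is a hypothetical temp-table holding field names and byte offsets read from the proparse schema; it is not part of the current code):

 DEFINE TEMP-TABLE ttField NO-UNDO
   FIELD FieldName  AS CHARACTER
   FIELD ByteOffset AS INTEGER.

 /* dump one &SCOPED-DEFINE per field into the include file */
 OUTPUT TO VALUE("schema/node.i").
 FOR EACH ttField:
   PUT UNFORMATTED
     "&SCOPED-DEFINE " ttField.FieldName "OffSet " STRING(ttField.ByteOffset) SKIP.
 END.
 OUTPUT CLOSE.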

We could create a base class for the schema, which is automatically created (like the include file), and have all other classes inherit the schema class. However, that would require the use of variables:

schema.cls

 DEF PUBLIC PROPERTY NodeTypeOffSet AS INT INIT 23 NO-UNDO GET. PRIVATE SET.

node.cls

CLASS Node INHERITS proparseclient.schema:
...
 NodeTypeNum  = GET-LONG(DataBlob,lv_RecordOffset + THIS-OBJECT:NodeTypeOffSet) 
...

What is interesting is that this mechanism is only about 7% slower (100,000 iterations of x = 1 vs. x = this-object:y give 600 ms vs. 645 ms), so this would probably satisfy the requirements of all in this thread :)
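
For reference, the kind of micro-benchmark behind those numbers might look like the sketch below (illustrative only, not the actual test that produced the 600 ms / 645 ms figures):

 /* Bench.cls - timing literal assignment vs. property access */
 CLASS Bench:
   DEFINE PUBLIC PROPERTY y AS INTEGER INIT 1 NO-UNDO GET. SET.

   METHOD PUBLIC VOID Run():
     DEFINE VARIABLE i       AS INTEGER NO-UNDO.
     DEFINE VARIABLE x       AS INTEGER NO-UNDO.
     DEFINE VARIABLE elapsed AS INTEGER NO-UNDO.

     elapsed = ETIME(TRUE).          /* reset the millisecond timer */
     DO i = 1 TO 100000:
       x = 1.                        /* literal assignment          */
     END.
     MESSAGE "literal:" ETIME "ms" VIEW-AS ALERT-BOX.

     elapsed = ETIME(TRUE).
     DO i = 1 TO 100000:
       x = THIS-OBJECT:y.            /* property access             */
     END.
     MESSAGE "property:" ETIME "ms" VIEW-AS ALERT-BOX.
   END METHOD.
 END CLASS.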


john's picture

how much slower

How much slower is this?:

NodeTypeNum  = GET-LONG(DataBlob,lv_RecordOffset + schema:NodeTypeOffSet)

...where every Node object has a reference to the session Schema object, and the session schema object has a public var for every field.


See my latest thread - I've

See my latest thread - I've created a static schema class and it's very fast ;)
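
For anyone following along, a minimal sketch of what such a "static" schema class might look like (this assumes an OpenEdge release with static members; the property name and value are just examples):

 CLASS proparseclient.schema:
   DEFINE PUBLIC STATIC PROPERTY NodeTypeOffSet AS INTEGER INIT 23 NO-UNDO
     GET.
     PRIVATE SET.
 END CLASS.

 /* caller (fragment) - no instance needed, the offset is resolved via the class */
 NodeTypeNum = GET-LONG(DataBlob, lv_RecordOffset + proparseclient.schema:NodeTypeOffSet).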


John, do you have any

John, do you have any comments, or can I go ahead and convert parseunit and nodes to use the "static" schema?


john's picture

Re: my comments

I think that might be an over-optimization.

I had in mind that extremely transient Objects like Nodes would not actually read the blob directly - they would pass on the request to less transient Objects. By less transient, I mean ones that exist for the duration of the session.

class Node:
  def var offset as int. /* vars assigned in constructor */
  def var pu as ParseUnit.
  method public int getType():
    return pu:getSchema():getNodeType(pu, offset).
  end.
end class.

We shouldn't have to be too concerned that the method Node:getType() makes a call to another (less transient) class. Method calls like that should not be happening in really tight loops, except maybe in one-time scripts where performance isn't such a big consideration.

Instead of calling node:getType() in a tight loop, there should be a query function within a less transient Object which has local variables for all field offsets and that Object would operate directly on the blob. It would build and return a result set. Ex:

class ProparseSchema:
  method public ObjectCollection queryNodeType(
      pu as ParseUnit, nodeType as int):
    ...
  end.
end class.

I have to disagree. For

I have to disagree. For something that will hardly ever change (the schema), I think that there is a level of complication that is simply not needed. The current code is thus:


class Node:
  DEF PUBLIC PROPERTY NodeTypeNum AS INT NO-UNDO GET. PRIVATE SET.
end class.

However, I was not talking about node types, nor any node property - I was talking about the record schema, and the field offsets of the record. At the moment they are all variables, and what I was suggesting was that we turn those into literals so there is no need to refer to the schema class to get the values. Instead of

 NodeTypeNum  = GET-LONG(DataBlob,lv_RecordOffset + ProSchema:getOffset(Classname,"Type")) 

we would have

 NodeTypeNum  = GET-LONG(DataBlob,lv_RecordOffset + {&NodeTypeOffSet}) 

which speeds things up by at least 2-3 times.


jurjen's picture

re: Schema and speed

The record layout won't change often, at least never between proparse version downloads.

Whether this optimization is successful also depends on how many times these calculations are repeated. Currently it happens in ParseUnit:RefreshData(), which is called once each time you parse a source file. That is not very often already, but it could be further reduced to just once in a Progress session. Perhaps consider making blobutils.cls a singleton so it is re-used in every ParseUnit instance, move RefreshData() to blobutils.cls, and have this method called exactly once. Logically, I think it is not even a bad design when blobutils is the class that is responsible for owning all knowledge about the blob, including its schema.
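
A minimal sketch of that singleton idea (illustrative only; the member names are not the actual blobutils.cls API, and it assumes static members are available):

 CLASS proparseclient.blobutils:
   DEFINE PRIVATE STATIC VARIABLE instance AS proparseclient.blobutils NO-UNDO.

   CONSTRUCTOR PRIVATE blobutils():
   END CONSTRUCTOR.

   /* returns the one shared instance, creating it on first use */
   METHOD PUBLIC STATIC proparseclient.blobutils GetInstance():
     IF NOT VALID-OBJECT(instance) THEN
       instance = NEW proparseclient.blobutils().
     RETURN instance.
   END METHOD.

   /* called exactly once per session: read the blob schema and cache the offsets */
   METHOD PUBLIC VOID RefreshData():
   END METHOD.
 END CLASS.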

just my 5 cents..


My point is that when you

My point is that when you are creating a node, either we have to pass all the schema information into the constructor so that the node class knows the field offsets, or we have to do what I am doing now and pass the temp-table buffer across to the constructor. If the offsets were fixed at compile time, the node constructor could simply be passed the record offset and get the field values directly. This would be much more efficient.
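
To make that concrete, a sketch of the kind of constructor I have in mind (the class layout and the schema/node.i include are illustrative, not existing code):

 CLASS proparseclient.node:
   /* compile-time field offsets, e.g. &SCOPED-DEFINE NodeTypeOffSet 23 */
   {schema/node.i}

   DEFINE PUBLIC PROPERTY NodeTypeNum AS INTEGER NO-UNDO GET. PRIVATE SET.

   CONSTRUCTOR PUBLIC node (INPUT DataBlob AS MEMPTR, INPUT pRecordOffset AS INTEGER):
     /* no schema lookup needed - the offset is a literal at compile time */
     NodeTypeNum = GET-LONG(DataBlob, pRecordOffset + {&NodeTypeOffSet}).
   END CONSTRUCTOR.
 END CLASS.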

If speed is our primary concern, then having a separate class to handle and hold the schema details is not very efficient.

As you mentioned, the record layout would only change between proparse versions, so I would suggest that we do away with the schema in the blob entirely, as it can be constructed from include files and therefore compiled "inline" into the appropriate classes.

At the moment we are either having to define and maintain variables for each field of each record, or access blobutilities properties for each field assignment, which is very inefficient when speed is of the essence.


New schema would only be

A new schema would only be available if the proparse.jar file has changed, right?

When the proparse class starts, it could request a schema version number from the server and from the schema class. If these numbers are different, then the schema class requests the new schema from proparse, generates the appropriate schema include files, and recompiles the affected classes.


jurjen's picture

chicken and egg

As soon as you NEW proparse.cls, Progress will compile ParseUnit.cls and the other referenced classes because of strong typing. That means the compile happens before proparse.cls gets a chance to write a new include file to be used in ParseUnit.cls :-)


The point is moot - we both

The point is moot - we both agree that there is no need for "checking the schema", as the new schema would be part of a new proparse version.

However! Being a pedantic person, I was thinking that the SchemaVersion property of the schema class would be an include file:

schema.cls:

DEF PUBLIC PROPERTY foo ...
[snip]
DEF PUBLIC PROPERTY SchemaVersion AS CHAR INIT "{schemaversion.i}" GET. PRIVATE SET.
[snip]

schemaversion.i:
2.0

proparse.cls:

IF <new version> NE ProSchema:SchemaVersion THEN
DO:
  ProSchema:WriteNewSchema(). /* writes new schema include files, changes SchemaVersion to <new version> */
  COMPILE schema.cls SAVE.
  DELETE OBJECT ProSchema.
  ProSchema = NEW Schema().
END.

See, told you I was pedantic.


jurjen's picture

You win! I was hoping to

You win! I was hoping to find some error in your sample, just so I could beat you at being pedantic, but you're the man :-)


LOL. that has really made my

LOL. that has really made my day.


jurjen's picture

I bet users won't just

I bet users won't just download a new proparse.jar but rather a complete proparse product zip including the jar, the API source files you are now writing, maybe some scripts, maybe some changelogs and readmes, the license, etc.
I see no reason why the nodeschema.i could not be distributed along with that, as a static include file right in the proparseclient directory, instead of having to generate it dynamically.


I agree entirely with you -

I agree entirely with you - see my reply to your first comment.