Some interesting timings

Getting a list of nodes with something like parseUnit:queryType("DEFINE") is very common in Prolint.

I just did some interesting calculations:

On 100,000 nodes, a run through the data blob finding all nodes of type 23 (for example) takes approximately 100 ms on my machine, and it is almost linear (50,000 nodes = 51 ms).

If I create a temp-table, it takes 1500 ms to create the records (an initial one-off hit), but then always less than 1 ms to find all records of type 23: almost 100 times faster.

On 50,000 nodes, the timings are:

data blob search: 52 ms
tt: 800 ms to create, < 1 ms to search (at least 50x faster, perhaps 75x)

So, if this type of "getting a list of nodes" query is quite common, then perhaps the class-based API would offer some benefits.
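
The trade-off being measured can be sketched as follows (Python used here as a stand-in for the ABL blob and temp-table; the data is synthetic, not from the actual parse):

```python
# Two ways to answer "all nodes of type 23": a linear scan of the packed
# data, or a one-off index from node type to node positions.
import random
from collections import defaultdict

random.seed(1)
node_types = [random.randint(1, 1000) for _ in range(100_000)]  # stand-in for the blob

def scan(types, wanted):
    """O(n) per query, no setup cost (the data-blob approach)."""
    return [i for i, t in enumerate(types) if t == wanted]

# One-off setup cost, then each query is O(number of matches)
# (the temp-table-index idea).
by_type = defaultdict(list)
for i, t in enumerate(node_types):
    by_type[t].append(i)

assert scan(node_types, 23) == by_type[23]
```

Both strategies return the same node list; the only question is how many queries it takes for the one-off build cost to pay for itself.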


john:

I'd say using a temp-table

I'd say using a temp-table index of node types to node offsets is worthwhile for the back-end implementation for the queries that Prolint does.

I reduced the temp-table to only be indexed on the int node type, and that reduced the creation time. I reduced the test to 20,000 nodes, to look a bit more like an average large-ish parse unit.

I was getting 350ms to create the temp-table, vs. 50ms to scan the blob. So, after 7 queries, the temp-table pays for itself.

(The cost of the FOR EACH tt, over 100 matching records, was zero or one millisecond.)
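
The pay-back arithmetic, using the figures above (350 ms one-off build, 50 ms per blob scan, ~1 ms per indexed query):

```python
# Break-even point: the first query count at which building the index once
# is cheaper than scanning the blob every time.
build_ms, scan_ms, indexed_ms = 350, 50, 1  # figures from the post

def breakeven(build, scan, indexed):
    """Smallest n where build + n * indexed < n * scan."""
    n = 1
    while build + n * indexed >= n * scan:
        n += 1
    return n

# The index is strictly cheaper from the 8th query on, i.e. it has paid
# for itself after 7 queries, matching the post.
assert breakeven(build_ms, scan_ms, indexed_ms) == 8
```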


jurjen:

Cool, although....

These sure are interesting timings!
But let's not forget that we (Prolint) typically don't search "EACH node WHERE nodetype=23"; we often try to scope the search to a particular code block. For example, when an internal procedure starts at node 1500 and has its matching END at node 2100, the query would be more like "EACH node WHERE nodetype=23 AND node >= 1500 AND node <= 2100". (Which index would be best now?)
Now, figuring out where the END node is (e.g. node 2100) may be something you'd have to determine first, by doing a node tree walk. I don't know; maybe you have a smarter way to determine the end. But if a tree walk is necessary, then the walker may as well look at the nodetype while it is there.
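
One way to serve that kind of block-scoped query would be a composite (nodetype, nodenum) index; a minimal sketch (Python stand-in, with invented data, where a sorted list of node numbers per type plays the role of the index):

```python
# A per-type sorted list of node numbers behaves like a composite
# (nodetype, nodenum) index: a block-scoped query becomes a binary
# search plus a short slice.
from bisect import bisect_left, bisect_right
from collections import defaultdict

by_type = defaultdict(list)
for num, ntype in [(10, 23), (1600, 23), (1700, 5), (1800, 23), (2500, 23)]:
    by_type[ntype].append(num)   # node numbers arrive in ascending order

def query(ntype, lo, hi):
    """All nodes of the given type with lo <= nodenum <= hi."""
    nums = by_type[ntype]
    return nums[bisect_left(nums, lo):bisect_right(nums, hi)]

assert query(23, 1500, 2100) == [1600, 1800]
```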

The new feature (searching for nodetext) is a nice one. There have been places where Prolint would have used this feature if it had existed, but nodetext is only interesting where nodetype = "ID" (variable names, field names, methods, functions, procedures).


john:

Re: Cool, although....

Oh, I thought it was a lot more common for Prolint rules to search for all nodes of a given type, in the entire compile unit.

Interesting thought to combine the temp-table indexes so that you could match on a range of node# as well as specify node type.

Since each node has a record of its parent, I think it's possible to use straight looping without recursion to calculate the end node of a block. I've never actually used this trick before, so I don't have an example. I was going to work an example of it into the new tree2text.p I was working on (for showing depth by indenting), but I haven't gotten around to it yet.

It's an interesting question, though, which would be faster: recursive descent, which would skip very quickly to the last node, or looping, which would have to iterate over every node. I've attached the recursive-descent logic for finding the last descendant (below).

Since the scope of the search is smaller (just a block rather than the entire parse unit), the overhead of looping through all the nodes in the block is much smaller, and further optimization would probably be overkill.

/** Find the last child of the last child of the... */
public static JPNode getLastDescendant(JPNode top) {
	JPNode child = top.firstChild();
	if (child==null)
		return top;
	JPNode temp = child;
	while (temp!=null) {
		child = temp;
		temp = temp.nextSibling();
	}
	return getLastDescendant(child);
}
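
For comparison, the same traversal can be written without recursion. A minimal sketch in Python, with a stand-in node class (first_child / next_sibling mirror the firstChild() / nextSibling() calls; JPNode is the real class, this one is just for illustration):

```python
# Iterative equivalent of getLastDescendant: at each level, walk to the
# last sibling, then descend into it; stop at a node with no children.
class Node:
    def __init__(self, name):
        self.name = name
        self.first_child = None
        self.next_sibling = None

def last_descendant(top):
    node = top
    while node.first_child is not None:
        child = node.first_child
        while child.next_sibling is not None:   # walk to the last sibling
            child = child.next_sibling
        node = child                            # descend one level
    return node

# DO -> (stmt, END), END -> (x,): the last descendant of DO is x.
do, stmt, end, x = Node("DO"), Node("stmt"), Node("END"), Node("x")
do.first_child = stmt; stmt.next_sibling = end; end.first_child = x
assert last_descendant(do) is x
```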

Isn't this where John can

Isn't this where John can come in? The parser could store the "start block node" of each "END" (case / procedure / function, etc.) and possibly store the "end block node" of each "start". If that were not possible, wouldn't it be possible to store the "node level" in each record?

level 1: FOR EACH foo: #100
 level 2: DO:          #101
 level n: [stuff]      #102-#120
 level 2: END.         #121
level 1: END.          #122

Therefore, if we know that node #100 is where we want to start, the next node with a level of 1 is the end of the block:

"FIND NEXT node WHERE level EQ 1 AND nodeID > 100" would take you straight to node 122.

Having said all this, isn't the parent node of #100 and #122 the same? And the parent node of #101 and #121 the same? If we indexed the parent, wouldn't this give us a range of nodes for the parent? In other words, is parent the same as level?
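
The "node level" idea above could be sketched like this (toy data taken straight from the example; it assumes the next node at the same level really is the matching END, which holds for block-opening nodes in cleanly nested code):

```python
# Store (node_id, level) per node; the end of the block that starts at a
# given node is the next node at the same level.
nodes = [(100, 1), (101, 2), (110, 3), (121, 2), (122, 1)]  # from the example

def find_block_end(nodes, start_id):
    start_level = next(lvl for nid, lvl in nodes if nid == start_id)
    for nid, lvl in nodes:
        if nid > start_id and lvl == start_level:
            return nid
    return None

assert find_block_end(nodes, 100) == 122   # FOR EACH ... END
assert find_block_end(nodes, 101) == 121   # DO ... END
```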


john:

The shape of the tree

The shape of the tree gives us this information. For example (maybe a little oversimplified), in a DO..END block, the DO is the parent of the END. The END is the last direct child of the DO. To find the END of a DO, we get the DO node's firstChild(), then iterate nextSibling() to the END. The DO will only have as many direct child nodes as there are statements in the DO block (i.e. not all that many nodes; usually only a dozen or two at most).
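
That sibling walk can be sketched as follows (Python stand-in; the class and attribute names are hypothetical, not proparse's API):

```python
# "The END is the last direct child of the DO": start at firstChild and
# iterate nextSibling until there are no more siblings.
class Node:
    def __init__(self, ntype):
        self.ntype = ntype
        self.first_child = None
        self.next_sibling = None

def block_end(block):
    child = block.first_child
    while child is not None and child.next_sibling is not None:
        child = child.next_sibling
    return child   # last direct child; the END node for a DO block

do, s1, end = Node("DO"), Node("stmt"), Node("END")
do.first_child = s1; s1.next_sibling = end
assert block_end(do).ntype == "END"
```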

In fact, thinking more about my earlier post... I'm guessing that finding the last descendent is going to be faster using recursive-descent than with looping, because with recursive-descent, we skip past the vast majority of the nodes in between the start/end of the block. Recursive-descent would only be slower if OE's method-call performance was atrocious to the extreme, and I don't think it's as bad as all that.


I reduced the temp-table to

"I reduced the temp-table to only be indexed on the int node type"

Unfortunately, we have to have the index on node number as well, because the tt is the holder for the node class:

 message parseUnit:Node(2):txt

finds the tt record of node 2, and returns the node class.

Don't forget that the initial creation of the tt records and nodes is a one-off for the parseUnit (when it starts). As the parseUnit is reusable for other compile units (and therefore so are the temp-table and node classes), the benefits increase; i.e. the break-even is not 7 queries per compile unit, it's 7-10 queries per initialised parseUnit.
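
The node-number index described here behaves like a map from node number to node object; a rough Python model (class and method names are illustrative stand-ins, not the actual ABL API):

```python
# parseUnit:Node(2):txt as a single keyed lookup: the temp-table indexed
# on node number is, in effect, a dictionary from node number to node.
class ParseUnit:
    def __init__(self):
        self._nodes = {}              # node number -> node object

    def load(self, records):          # one-off cost when the unit starts
        for num, txt in records:
            self._nodes[num] = {"num": num, "txt": txt}

    def node(self, num):              # mirrors parseUnit:Node(num)
        return self._nodes[num]

pu = ParseUnit()
pu.load([(1, "DEFINE"), (2, "VARIABLE"), (3, "x")])
assert pu.node(2)["txt"] == "VARIABLE"
```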


john:

So that we have a vague idea

So that we have a vague idea of what to budget for, I ran a script on a smallish ERP system:

Num units: 6209
Max nodes: 19996
Avg nodes: 1839
Total nodes: 11419508
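
For what it's worth, the reported average is consistent with the totals:

```python
# Sanity-checking the corpus figures quoted above.
num_units, total_nodes, max_nodes = 6209, 11_419_508, 19_996

assert total_nodes // num_units == 1839   # matches the reported average
assert max_nodes < 20_000                 # largest unit is just under 20k nodes
```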

john:

Interesting. So at about 15

Interesting. So at about 15 queries, the temp-table would pay for itself.
Can you show us the code?


Sure thing

Sure thing:

def var maxn as int init 100000 no-undo.

def var a as memptr no-undo.

def temp-table tt no-undo
  field i as int
  field t as int
  index i is primary i
  index t t.

def var k as int no-undo.
def var j as int no-undo.
def var s as int no-undo.

s = mtime.

do k = 1 to maxn:
 create tt.
 assign tt.i = k
        tt.t = random(1,1000).
end.

s = mtime - s.

message "Creating TT Timing:" s view-as alert-box.

set-size(a) = maxn * 30. /* 30 bytes per record */

do k = 1 to maxn: /* store the same "type" values in the blob */
 find tt where tt.i = k no-lock.
 
 put-long(a,(k - 1) * 30 + 2) = tt.t.
end.

k = 0.
s = mtime.

for each tt where tt.t eq 123 no-lock: /* count matching records */
 k = k + 1.
end.

s = mtime - s.

message "TT Timing:" k skip s view-as alert-box.

s = mtime.

do k = 1 to maxn:
 if get-long(a,(k - 1) * 30 + 2) EQ 123 then assign j = j + 1.
end.

s = mtime - s.

message "Blob Timing:" j skip s view-as alert-box.


set-size(a) = 0.

Searching for a string also

Searching for a string also gives < 1 ms search time on a temp-table. I would expect that searching the blob would take more than 150 ms, as we have to do a few calculations to extract the string before we can compare.
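
The same indexing idea extends to text searches; a sketch (Python stand-in, data invented for illustration):

```python
# Index node text once, then a string query is one dictionary lookup
# instead of decoding every blob record to compare strings.
from collections import defaultdict

text_index = defaultdict(list)
for num, ntype, txt in [(1, "DEFINE", ""), (2, "ID", "custnum"), (7, "ID", "custnum")]:
    if ntype == "ID":              # per jurjen, text mostly matters for ID nodes
        text_index[txt].append(num)

assert text_index["custnum"] == [2, 7]
```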


john:

At the moment, proparse.dll

At the moment, proparse.dll doesn't even have any API function for doing a query based on node.text. So, Prolint really doesn't do much of that. :) Probably none at all.


lol. Hey! a new feature!