Getting a list of nodes with something like parseUnit:queryType("DEFINE") is very common in Prolint.
I just did some interesting calculations:
On 100,000 nodes, a run through the data blob finding all nodes of type 23 (for example) takes approx 100ms on my machine. And it is almost linear (50000 nodes = 51ms).
If I create a temp-table, it takes 1500ms to create the records (as an initial hit), BUT always less than 1ms to find all records of type 23 - almost 100 times faster
on 50,000 nodes, the timings are
datablob search: 52 ms
tt: 800ms to create, < 1ms to search (at least 50x faster, perhaps 75x)
So, if this type of "getting a list of nodes" is quite common, then perhaps the class-based api would offer some benefits ..
I'd say using a temp-table
I'd say using a temp-table index of node types to node offsets is worthwhile for the back-end implementation for the queries that Prolint does.
I reduced the temp-table to only be indexed on the int node type, and that reduced the creation time. I reduced the test to 20,000 nodes, to look a bit more like an average large-ish parse unit.
I was getting 350ms to create the temp-table, vs. 50ms to scan the blob. So, after 7 queries, the temp-table pays for itself.
(The cost of the FOR EACH tt, over 100 matching records, was zero or one millisecond.)
Cool, although....
These sure are interesting timings!
But let's not forget that we (Prolint) typically don't search "EACH node WHERE nodetype=23", but often we try to scope the search to a particular code block. For example, when an internal procedure starts at node 1500 and has its matching END at node 2100, then the query would be more like "EACH node WHERE nodetype=23 AND node>=1500 and node<=2100". (which index would be best now?).
Now figuring out where the END node is (e.g. 2100), may be something you'd have to determine first, by doing a node tree walk. I don't know, maybe you have a smarter way to determine the end. But if a tree walk is necessary, then the walker may as well look at the nodetype while it is there.
The new feature (searching for nodetext) is a nice one. There have been places where Prolint would have used this feature if it would have existed, but nodetext is only interesting where nodetype="ID" (variablenames, fieldnames, methods, functions, procedures).
Re: Cool, although....
Oh, I thought it was a lot more common for Prolint rules to search for all nodes of a given type, in the entire compile unit.
Interesting thought to combine the temp-table indexes so that you could match on a range of node# as well as specify node type.
Since each node has a record of its parent, I think it's possible to use straight looping without recursion to calculate the end node of a block. I've never actually used this trick before, so I don't have an example. I was going to work on an example of it into the new tree2text.p I was working on (for showing depth by indenting), but I haven't got around to it yet.
It's an interesting question though, which would be faster. Recursive-descent, which would skip very quickly to the last node, or looping, which would have to iterate every node. I've attached the recursive-descent logic for finding lastDescendent (below).
Since the scope of the search is smaller (just a block rather than the entire parse unit), the overhead of looping through all the nodes in the block is much smaller, and further optimization would probably be overkill.
Isn't this where John can
Isn't this where John can come in .. the parser could store the "Start Block node" of each "End" (case / procedure / function etc) and possibly store the "End block node" of each "start". If that were not possible, wouldn't it be possible to store the "node level" in each record
therefore, if we know that node #100 is where we want to start, the next node with a level of 1 is the end block
find next node where level eq 1 and nodeID > 100 would take you straight to node 122.
Having said of all this, isn't the Parent node of #100 and #122 the same ? And the parent node of #101 and 121 the same ? If we indexed the parent, wouldn't this give us a range of nodes of the parent ? IOW, is Parent the same as level ?
The shape of the tree tree
The shape of the tree tree gives us this information. For example (maybe a little oversimplified), in a DO..END block, the DO is the parent of the END. The END is the last direct child of the DO. To find the END of a DO, we get the DO node's firstChild(), then iterate nextSibling() to the END. The DO will only have as many direct child nodes as there are statements in the DO block (i.e. not all that many nodes; usually only a dozen or two at most).
In fact, thinking more about my earlier post... I'm guessing that finding the last descendent is going to be faster using recursive-descent than with looping, because with recursive-descent, we skip past the vast majority of the nodes in between the start/end of the block. Recursive-descent would only be slower if OE's method-call performance was atrocious to the extreme, and I don't think it's as bad as all that.
I reduced the temp-table to
Unfortunately we have to have the index on node num as well, as the tt is the holder for the node class
finds the tt record of node 2, and returns the node class.
Don't forget, that the initial creation of the tt record and nodes is a one-off for the parseUnit (when it starts). As the parseUnit is reusable for other cu's (and therefore the temp-table and node classes) then the benefits increase. i.e it's not after 7 queries per cu it's 7-10 queries per parseUnit initialised.
So that we have a vague idea
So that we have a vague idea of what to budget for, I ran a script on a smallish ERP system:
Interesting. So at about 15
Interesting. So at about 15 queries, the temp-table would pay for itself.
Can you show us the code?
Sure thing: def var maxn as
Sure thing:
Searching for a string also
Searching for a string also gives < 1ms search time on a temp-table. I would expect that using the blob would be more than 150ms, as we have to do a few calculations to get the string before we can compare.
At the moment, proparse.dll
At the moment, proparse.dll doesn't even have any API function for doing a query based on node.text. So, Prolint really doesn't do much of that. :) Probably none at all.
lol. Hey! a new feature !
lol. Hey! a new feature !