New blob API early example

I created a new repository called 'codeparse', added a directory called 'wip' for shared efforts that are work-in-progress or experimental, and added a subdirectory called 'proparseclient'.

svn://oehive.org/codeparse/wip/proparseclient

If you're interested in building and using the API, you should be able to run client.p in that directory, which reads the two binary files 'header.bin' and 'blob.bin'. The program 'tree2text.p' is small and reasonably easy to understand as a starting point. If you don't know which API we're talking about, see http://www.oehive.org/node/1247

(SVN is running slow for me today, which is unusual. I hope it's just my local ISP.)

As my next task, I think I'll add a data transfer from Proparse so that we can fetch its internal data about node types and keywords.


john's picture

svn rss

Can we all watch the RSS for 'codeparse' on svn?

http://websvn.oehive.org/rss.php?repname=codeparse

That's easier than having to post here every time we make a code change.


jurjen's picture

please let us split the thread

It's getting too difficult to locate the new comments. Please let's lock this thread and start new ones.


john's picture

larger test blob

I posted a larger 'blob2.bin' test file to svn, along with 'header2.bin' and changes to client.p.


re: larger test blob

I made some tweaks (optimizing out the blobutilities) and can load this blob, with all temp-table nodes created (but not the node classes), in 450ms.

That's 10762 nodes.

We can delay the creation of the temp-table if required, which takes the time down to 30ms. However, it then means we lose the benefit of being able to loop through the records easily.

I am going to make the building of the tt records an option for now - at least then John, Jurjen and Julian are happy ;)

How often are you going to have to rebuild the whole file? Isn't half a second a price worth paying for a file this size?

BTW, John - what was the size of the .p or .w that generated this file?

With the optimizations, the original header.bin file loads in 75ms with all temp-tables created.

I'll upload a little later.


john's picture

re: larger test blob

Interesting - I don't think I've paid attention to the number of nodes in the past. This is still a pretty small sample (!), and it is not uncommon to have much, much larger compile units (10x or more).

Yes, I'm very happy that front-loading the temp-table is optional. :) Half a second is an awful lot of time for just one small compile unit, especially for an interactive tool like Prolint.

Is the temp-table created on demand, and available BY-REFERENCE through a 'getter' method? (This would allow any number of requests for the temp-table, without any additional overhead.)
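
Something like this is what I have in mind - just a sketch, assuming a 'ttNode' temp-table inside a ParseUnit class, with a hypothetical ttBuilt flag and BuildNodeTable() helper:

  /* in ParseUnit: build the temp-table from the blob on first request only */
  METHOD PUBLIC VOID GetNodeTable (OUTPUT TABLE FOR ttNode):
    IF NOT ttBuilt THEN
      THIS-OBJECT:BuildNodeTable().
  END METHOD.

  /* in the client: BY-REFERENCE binds the caller's buffer to ParseUnit's
     table instance, so repeated requests cost no deep copy */
  parseUnit:GetNodeTable(OUTPUT TABLE ttNode BY-REFERENCE).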

I uploaded the .w to svn - it is 'dordlne.w', the biggest sample .w from the old sports2000 application. The .w is 10kb, but I think it references lots of adm include files.


tamhas's picture

Should we perhaps create a

Should we perhaps create a forum for this discussion? It is getting hard to follow what's new.


john's picture

Re: Should we perhaps create a

I thought this was a forum. ;-) Try the comment viewing options. They are pretty cool, and pretty helpful.
Jurjen did wonder about the same thing too; I'm curious what he thinks now that we have one topic with a lot of threaded comments.

Edit: Actually, we don't need a new forum; maybe we should start writing some new topics with their own comment lists, within the forum.


john's picture

UML for 'Xferable'

I added 'xferable.eap.zip' to Jurjen's directory:
svn://oehive.org/codeparse/wip/uml
It's a very quick import from Java source code into Enterprise Architect. I don't know if it will be helpful or not. :)


New versions of proparse.cls

Another new version. This one fixes a problem with the Children, Parent and Sibling nodes (they were out by 1). I've also added an example of how to recursively get a tree of the nodes - see classclient.p.
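
The recursion boils down to something like this (a rough sketch only - the shared proparse variable, the method names and the zero-means-no-node convention are assumptions; see classclient.p for the real thing):

  /* print a node and all its descendants, depth-first */
  PROCEDURE WalkTree:
    DEFINE INPUT PARAMETER p_Node  AS INTEGER NO-UNDO.
    DEFINE INPUT PARAMETER p_Depth AS INTEGER NO-UNDO.
    DEFINE VARIABLE child AS INTEGER NO-UNDO.

    MESSAGE FILL(" ", p_Depth * 2) + proparse:NodeText(p_Node).
    child = proparse:FirstChild(p_Node).
    DO WHILE child > 0:
      RUN WalkTree (child, p_Depth + 1).
      child = proparse:NextSibling(child).
    END.
  END PROCEDURE.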


john's picture

very short comments

Sorry, I wish I had time to get involved in this thread today. I'll be at my office tomorrow and have proper time to review and respond.

Here are some really quick notes. Yes, there is a low level and a higher level. The low level deals with the bytes from the blob; it has to be very fast, and it can run on DLC9 or else be easily ported to it.

The code that I'd written so far was just examples, and very DLC9-like. I was still playing with the lower level, and was looking forward to working on classes and the lazy-load mechanisms. (Object factories, I guess.)

The higher level still needs to be fast, but it's not as critical. Yes, creating thousands of objects is too slow, which is why my intention was for objects to be created in a 'lazy load' fashion - only created when needed.

The reason we need to use Objects rather than temp-table records is that the tree is hierarchical, and programming with the syntax tree in this sort of fashion:
type = node:firstChild():firstChild():nextSibling():type
is very common. Doing the same thing with table records is too painful an API for working with a hierarchy (tree).

Actually, I was thinking that a temp-table would be used just as a list of Node objects that were created, so that we'd have the list for deleting all the objects once done with them.

Yes, there are quite a few classes from Proparse that I/we will want to mirror on the ABL side. See JPNode in the javadoc. (I have to update this too, it's a bit old.)
http://www.joanju.com/analyst/javadoc/
http://www.joanju.com/analyst/javadoc/org/prorefactor/core/JPNode.html
JPNode is the most commonly used node type, but you can see that it has several sub-types as well.
The bytes are written to the blob such that the super-type's fields are written first, which will make for less code on the ABL side as well. (The sub-types will first read bytes for the fields of the super-type.) That's probably clear as mud, but it will make sense once there's a bit of ABL code to look at.
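
To illustrate (with hypothetical names - this is not actual code from svn), a sub-type's constructor might look roughly like this:

  CLASS proparseclient.BlockNode INHERITS proparseclient.Node:

    DEFINE PUBLIC PROPERTY BlockLabel AS CHARACTER NO-UNDO
      GET. PRIVATE SET.

    CONSTRUCTOR PUBLIC BlockNode (pBlob AS MEMPTR, pOffset AS INTEGER):
      /* the super-type's fields come first in the blob, so let the
         Node constructor consume them... */
      SUPER(pBlob, pOffset).
      /* ...then read this sub-type's own fields from the bytes that
         follow; nextOffset is an assumed protected property that the
         Node constructor advances as it reads */
      BlockLabel = GET-STRING(pBlob, nextOffset).
    END CONSTRUCTOR.

  END CLASS.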


re: very short comments

As a small test, I created the node class and ran a client program that created 50,000 nodes. This took around 8 seconds. We would probably need to load all the nodes as classes, because if you use the a:b:c:d mechanism, all 4 nodes *must* be valid objects, otherwise the chain crashes. However, we may also be able to make (for example) FirstChild a property, and map that to an internal variable pointing to the node. If this internal variable is not set, then create the new node ...
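
For example, something like this (sketch only - the blob and childOffset members, and the Node constructor signature, are assumptions about how the node stores its raw data):

  DEFINE PRIVATE VARIABLE firstChildNode AS CLASS proparseclient.Node NO-UNDO.

  DEFINE PUBLIC PROPERTY FirstChild AS CLASS proparseclient.Node NO-UNDO
    GET:
      /* create the child node object only on first access */
      IF NOT VALID-OBJECT(firstChildNode) AND childOffset > 0 THEN
        firstChildNode = NEW proparseclient.Node(blob, childOffset).
      RETURN firstChildNode.
    END GET.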


john's picture

performance and overhead

That's very similar to the results I got. I think it took about 4 seconds on my machine to create 10,000 objects, where each object had several int values assigned from random() to simulate at least a little bit of attribute loading. That's when I decided it would be best to work with lazy-loaded objects, since 4 seconds of overhead per CU would not be appropriate for Prolint.
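
The test looked roughly like this (a sketch, not the exact code - a no-argument Node constructor is assumed, and the objects are never deleted here):

  DEFINE VARIABLE i       AS INTEGER NO-UNDO.
  DEFINE VARIABLE ignored AS INTEGER NO-UNDO.
  DEFINE VARIABLE node    AS CLASS proparseclient.Node NO-UNDO.

  ignored = ETIME(YES).   /* reset the millisecond timer */
  DO i = 1 TO 10000:
    /* the constructor is assumed to assign a few ints from
       RANDOM(1, 1000) to simulate a little attribute loading */
    node = NEW proparseclient.Node().
  END.
  MESSAGE ETIME "ms elapsed" VIEW-AS ALERT-BOX.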

Answering offline questions...

Since we started on this back in 2001, Jurjen and I have played a lot with ways to make Prolint (and the Proparse API) reasonably fast. The indexed blob was Jurjen's idea, after he'd evaluated other methods (like XML) for creating the object hierarchies quickly.

Keys to performance included avoiding large numbers of function calls, and avoiding the creation of large numbers of objects (ex: XML nodes).

Jurjen used Delphi for Prolint rules that could not avoid large numbers of calls to Proparse's DLL API. That shouldn't be necessary with the whole blob there in OpenEdge memory.

Based on this background, I never considered posting each node (or any other objects) to the socket individually as they were created. There's no way it could perform well enough. It is very, very fast to create the blob from all the Java objects, and very fast to fetch the entire blob in one shot over the socket.


jurjen's picture

re: performance and overhead

Hi, you are very productive, I cannot keep up! :-)

Now that we have a good start, we can optimize the code for speed. It will make the code look ugly, but I think that is ok because it happens deep inside the guts of the API implementation...

Like you said, one thing to avoid is a large number of function calls. OE10 may be faster than DLC9, I don't know, but look at the blobutilities: function stringFromIndexAt calls two other functions (stringFromData and recordOffsetFromIndexOffset). I think this can be replaced by one complicated-looking formula without nested function calls.

The current implementation reads data from the memptr into a temp-table record; the buffer is then passed to another method, which reads the data from the buffer using the buffer::fieldname syntax. That is dynamic (run-time) field referencing, which is slower than static (compile-time) referencing: the Node class would work faster if the tt definition were included.

The data is now in memory 3 times: (1) in the memptr, (2) in the temp-table, (3) in the Node objects. The temp-table takes a lot of memory, because of its data and also because of its 2 indexes. Do we really need the TT? I doubt it. If we can skip the TT and let the Node class get its data directly from the memptr instead of from the TT, I think we save a bunch of time and a bunch of memory, which also helps performance.
Also, some of the property data does not need to be copied from the memptr to the class property at all, especially character properties. Instead, the property getter can read the values directly from the memptr when asked.

Right now the Nodes are not lazy-loaded yet. I mean, sure, they are not new'ed before they are referenced, but all their properties are SET before they are GET. The NodeText is set by the constructor, while a typical client program is seldom interested in NodeText anyway, so this property should not be read from the blob until requested by the client.
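
In other words, the getter could do the work - a sketch only, assuming blob and textOffset are private members of the Node class:

  DEFINE PRIVATE VARIABLE cNodeText AS CHARACTER NO-UNDO INITIAL ?.

  DEFINE PUBLIC PROPERTY NodeText AS CHARACTER NO-UNDO
    GET:
      /* read the text from the blob only when first requested */
      IF cNodeText = ? THEN
        cNodeText = GET-STRING(blob, textOffset).
      RETURN cNodeText.
    END GET.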

Do I make sense, so far?


re: performance and overhead

You do make sense - but we also have to try to keep things in perspective. If the process of reading the blob from file, parsing it, and creating all the temp-table records takes 60 milliseconds, is *anyone* going to notice a speed increase of 50%, taking the time down to 30ms?

I suppose what would be useful is if John could supply a blob binary from a large source file, let's say 20x the size of the demo c-win. If the performance is linear, then I would expect it to take around 1.2 seconds to *completely* reload and reparse a huge file.

I want to keep the temp-table for a couple of reasons: if you wanted to find all the nodes of type 12, it would be easy to do. If you wanted to find all nodes with the word "foo" somewhere in the text, it would be easy. You can reorder or resort the nodes however you want.

Once the data is loaded into the temp-table we could zero the blob to reduce the memory consumption.

As for passing the tt as a buffer - sure, I could make that process a _lot_ faster by setting the node properties directly. I'll experiment on that to see what speed improvements it would make. As you say, it would make the code look really ugly ;) UPDATE: No, it doesn't make it faster. It goes slower!

test0: create all node objects as well as tt records: 130ms (this is the parseunit class as it stands)
test1: as test0, but property settings moved from the node class to the GetNode method (i.e. not passing a buffer): 138ms
test2: as test0, but node properties set directly from the memptr: 131ms

So even using direct memptr access, the time taken is roughly the same - I presume that the biggest overhead is the creation of the node classes themselves, rather than the setting of properties, however we do it.

>> the current implementation reads data from the memptr to a temp-table record,

this is only done once. After the initial load the memptr is not needed any more.

>> the buffer is passed to another method which reads the data from the buffer using the buffer::fieldname syntax

this is only done once, when the node object is referenced for the very first time.

>> Right now the Nodes are not lazy-loaded yet. I mean, sure, they are not new'ed before they are referenced, but all their properties are SET before they are GET. The NodeText is set by the constructor, while a typical client program is seldom interested in NodeText anyway, so this property should not be read from the blob until requested by the client.

The setting of the text property 10,000 times takes less than 5ms!!! IOW we could set it 2,000 times per millisecond. What performance goals are you trying to set? :)


john's picture

Jurjen wrote: "Hi, you are

Jurjen wrote:
"Hi, you are every productive..."
By 'you', I'm sure you mean Julian! Since my original post on Saturday, I haven't committed any new client side code at all.

Your comments make sense to me. The attributes of Objects like Nodes can be fetched from the memptr on demand (for example by the getText() method, for node:getText()), and that's what I had in mind. There's no need for the node's text to be stored in the Node object itself, since it can be fetched very quickly from the memptr.

The approach of loading fields into temp-tables does have a certain appeal though: It's a very comfortable environment for OE developers who want to work with code like this:

  FOR EACH nodeRecord
      WHERE nodeRecord.text = "DISPLAY":
    ...
  END.

Done correctly, we should be able to work with a mix of all these, all referencing the same Node objects! There's the very low level that deals with the memptr, there's the class hierarchy, and then at the very top of the API, there could be methods to return temp-tables with all of the nodes:

  DEF TEMP-TABLE nodeRecord NO-UNDO
      FIELD text AS CHARACTER
      FIELD node AS CLASS proparseclient.Node
      ...

The temp-table gives developers a very slow, but very easy to work with, collection of Node objects.


jurjen's picture

Ok maybe, but I would only

OK maybe, but I would only create the temp-table if/when a client programmer asks for it - not always, and not as an intermediary between the blob and the node object tree, because in that case it would have been faster to have Java output a tt.d file immediately.


Jurjen wrote: "Hi, you are

See my reply to Jurjen - I've just proved by testing that it is actually very, very fast doing it this way: nodes accessing their properties from the memptr shows no speed improvement at all ... most of the time goes on the actual creation of the class itself. I really see no need to complicate things by accessing the memptr directly if, for a tiny, tiny performance hit, we can have all the advantages of classes and temp-tables.

The temp-tables are *not* very slow at all. We could expose the temp-table as an XML source or even a dataset through the API.
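
For example, something as simple as this one-liner (using the built-in WRITE-XML method; ttNode is an assumed temp-table name):

  TEMP-TABLE ttNode:WRITE-XML("file", "nodes.xml", TRUE /* formatted */).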


very short comments

One of the problems we face is that ABL does not yet support collections, so we can't have the node:firstChild():firstChild():nextSibling():type thing that you want.

you would have to say

def var node as class node.

node = parseunit:Node(27).
node = node:FirstChild().
node = node:FirstChild().
node = node:NextSibling().

message node:Text view-as alert-box.

which is really no different from

def var node as int.

node = parseunit:FirstChild(27).
node = parseunit:FirstChild(node).
node = parseunit:NextSibling(node).

message parseunit:NodeText(node) view-as alert-box.

the client does not need to know anything about the temp-table.


re: very short comments

Oh ... unless what you mean is that NextSibling() is _not_ an iterator, but literally a simple chain:

Node1:NextSibling points to node 2
Node2:NextSibling points to node 3

etc

rather than

Node1:NextSibling() points to node 2
Node1:NextSibling() points to node 3 (i.e. Node1 maintains a list of siblings)

If it is the first case, it would be simple enough to implement the classes. The code would look like this, though:

node:firstChild:firstChild:nextSibling:type


john's picture

nodes referencing nodes

The Node class needs to implement two patterns. First, nodes are lazy-loaded: they are not created until they are requested. Second, they are a facade: they do not contain any of their own data. OK, three patterns. :) There is a NodeFactory.

  class Node:
    def var thisOffset as int.
    def var nodeFactory as class NodeFactory.
    method public Node firstChild():
      return nodeFactory:firstChildOf(thisOffset).
    end.
  end.

  class NodeFactory:
    def temp-table nodeRecord
        field offset as int
        field node as class Node
        index idx1 offset.
    method public Node firstChildOf(parentOffset as int):
      def var childOff as int.
      childOff = offsetOfChildForNodeAt(parentOffset).
      return getNodeForOffset(childOff).
    end.
    method public Node getNodeForOffset(nodeOffset as int):
      find first nodeRecord
          where nodeRecord.offset = nodeOffset
          no-error.
      if available(nodeRecord) then
          return nodeRecord.node.
      return createAndStoreNodeForOffset(nodeOffset).
    end.
  end. 

Schema Classes

Looking through the schema, it seems as if there are 16 different class names. Are these used at all, or is it just "org.prorefactor.core.JPNode"? What are the other classes used for?


New versions of proparse.cls

I have uploaded a new version of the proparse class and associated demo code. I have not yet moved to the model, as there are a couple of things that need to get sorted (the socket code seems iffy to me).

I just wanted to get the latest code up for people to run and play with. This new class uses temp-table records to manage the collection of nodes.

I did create a class that did not use temp-tables at all - it just used the record offset to find the appropriate text/type for the supplied node. However, this was less than 10% faster (55ms vs 62ms to run the client demo), and it does not give any of the potential benefits that using a temp-table does.

For example, we could create a method called GetNodes with a string parameter that defines a query:

METHOD PUBLIC VOID GetNodes (p_QueryString AS CHAR, p_CallBack AS HANDLE):
[snip]
repeat:
  /* GetNextQuery: advance to the next matching record */
  RUN MatchingNodeFound IN p_CallBack (NodeId, NodeText, NodeType).
end.

END METHOD.

in the client:

ParseUnit:GetNodes("NodeType EQ 27") or
ParseUnit:GetNodes("NodeText BEGINS 'foo'") o
ParseUnit:GetNodes("NodeID < 100")

etc etc
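
Under the hood, GetNodes could just be a dynamic query over the node temp-table - a sketch, assuming the temp-table is called ttNode and has NodeID/NodeText/NodeType fields:

  METHOD PUBLIC VOID GetNodes (p_QueryString AS CHARACTER,
                               p_CallBack    AS HANDLE):
    DEFINE VARIABLE hQuery AS HANDLE NO-UNDO.
    DEFINE VARIABLE hBuf   AS HANDLE NO-UNDO.

    hBuf = TEMP-TABLE ttNode:DEFAULT-BUFFER-HANDLE.
    CREATE QUERY hQuery.
    hQuery:SET-BUFFERS(hBuf).
    hQuery:QUERY-PREPARE("FOR EACH ttNode WHERE " + p_QueryString).
    hQuery:QUERY-OPEN().
    REPEAT:
      hQuery:GET-NEXT().
      IF hQuery:QUERY-OFF-END THEN LEAVE.
      RUN MatchingNodeFound IN p_CallBack
          (hBuf:BUFFER-FIELD("NodeID"):BUFFER-VALUE,
           hBuf:BUFFER-FIELD("NodeText"):BUFFER-VALUE,
           hBuf:BUFFER-FIELD("NodeType"):BUFFER-VALUE).
    END.
    hQuery:QUERY-CLOSE().
    DELETE OBJECT hQuery.
  END METHOD.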


jurjen's picture

added UML for API design

Added to the svn repository: "wip/uml/proparseclient.eap".
You need Enterprise Architect to open this filetype.

I have drawn a very quick and dirty class diagram; of course it is not complete. What I try to illustrate here is that there is one singleton "proparse" which does all the talking to the proparse.jar server and also manages a collection of ParseUnit instances. A ParseUnit instance reads the blob for one parsed compilation unit - a normal application will probably need no more than one ParseUnit instance, but you can have multiple if you need to. A ParseUnit has methods to get the topnode and query functions, and does the management (= construction and destruction) of Node instances. A Node instance reveals data about a node and has methods for navigating to related nodes.

This is just a beginning. I have probably missed parameters, made mistakes in naming, it is very incomplete and who knows what more I have missed. Right now I just wanted to sketch the main structure. Please comment and improve.

Keep in mind that this UML is a binary file, so merging edits from multiple users is impossible. You'd better lock the file with SVN when you are about to edit, and please commit and release the lock asap.


re: added UML for API design

Looking at the diagram, shouldn't the proparse singleton be purely a collection manager of parseunits? Each parseunit should have the read/write methods for its own data structure. Otherwise we will need not only to pass the memptr to proparse, but also to manage the thing as well. That should be the job of parseunit.


jurjen's picture

re: added UML for API design

Purely a collection manager for ParseUnits, yes that was the intention.

That, and also maintaining the connection with the proparse.jar server. That involves trying to connect to proparse.jar, launching a new proparse.jar server if one is not running yet, determining a service port, stuff like that.

I had in mind that proparse.cls would be the only one that communicates with proparse.jar, so proparse.cls would receive the blob and pass it to a ParseUnit instance. Is it better if each ParseUnit has its own socket to proparse.jar?

What does "receive the blob" mean by the way? In the current example, it seems as if the client receives all data on a socket and writes it to a binary file, and then the binary file is copied to a memptr. Why is there a file in between and when is that file deleted? Why doesn't the proparse server just write the file and pass the filename back to the client? I think some optimizations are possible here.


john's picture

intermediate file

Julian is correct - I only stored the intermediate binary files so that I could easily post the downloadable files to test against and play with.


re: added UML for API design

Yes, Proparse should be the only unit talking to the proparse.jar server. I think that John's example was just showing the ways of getting data into the client, and there are a load of optimizations that could be made.

What I would see is:

The client requests a new parseunit from proparse, and gets the class.
The client then asks parseunit to load a source unit ("c-win.w").
parseunit passes the source unit to proparse, which gets the blob and passes it back to parseunit. There is no need for an intermediate file.


john's picture

View diffs?

I bought EA a long time ago, but I don't really use it. Do you know how I would see the differences from one version of the model to the next?


jurjen's picture

re: View diffs

If you have an XMI export of the old version of a package, then you can compare it to the current version: right-click on a package to pop up its context menu, choose the option "Package Control", then the sub-option "Compare with XMI file".


john's picture

Free UML viewer

Sparx has a free Enterprise Architect Viewer (look for "EA Viewer") on their download page:
http://www.sparxsystems.com.au/products/ea/downloads.html


New demo classes uploaded

Converted John's examples over to a class-based system. Still seems to run very fast. Have a look and let me know.


jurjen's picture

proparse:node property?

There is something about the design of proparse.cls that does not feel good to me: the Node property. It sets and reads the current node.

I think there should not be a current node here. The way I see it, a proparse.cls instance manages one particular blob (=parsed compilation unit) and exposes methods to navigate through its nodes.
A client of this proparse.cls instance may need many handles to nodes, and the client may actually be a swarm of classes/persistent procedures which all have a handle to this same proparse.cls instance but each have a different "current node". More realistically, they all have one or more handles to nodes, so even from the perspective of a single client there is no current node.
Anyway, the notion of "current node" should not exist in proparse.cls, because it prevents the class from being re-entrant.

Instead, there could be a "node.cls" class of which many instances can exist, referenced by the clients. Each node.cls instance represents the data of one node. If there is such a class, then there should also be a factory for node class instances, to coordinate that a single node is not represented by more than one node.cls instance, because that would be a waste of memory. A factory is also a convenient place to destroy all instances when the parse job is over. I suppose this factory, or should we say instance manager, is part of proparse.cls or a delegate of it.


proparse:node property?

I added a "node" class, and the associated temp-table collection manager and timed things. It is 4x slower to load. That in itself is not so much of an issue (120 ms instead of 30ms) and the extra memory overhead. But in order to get to the node you want, you still have to run a method to get the node

mynewmode = proparse:GetNode(200).
message mynewnode:nodetext

So now the classclient is responsible for garbage disposal as well.

We also have to manage the collection.

In terms of efficiency, I am sorry to say that I fail to see what advantage this design gives us. It may be better from a purist's point of view, but as each node is only a number, a type and a text, this could also be handled simply by using a temp-table within the proparse class. However, all this information is already stored in the blob, and my initial pass simply extracts the node information from the blob as and when it is needed, rather than duplicating the data in memory.


jurjen's picture

re: proparse:node property

OK, a network of node objects would have been nice, but if it is too slow and memory-consuming then we should probably forget about it, and make navigation methods (getparent, getfirstchild, getnextsibling) much like the "old" proparse API.

By the way, I would not use a temp-table for the Node collection, but a memptr. After all, the max number of nodes is known, so its size can be allocated, and the handle to node N can simply be stored at offset (N - 1) * sizeof(handle) + 1. That would be faster than a temp-table and less memory-consuming.

Having said that, most of my initial comment was about the "current node" property in proparse.cls. I think there can't be a current node, just like a database can't have a current record. You have not responded to that yet - do you agree?


re: proparse:node property

I'm going to make one last attempt at not having a collection of node classes. I think that maintaining a temp-table of nodes is a better idea for a couple of reasons:

1) It is easy to manage a temp-table record in the parseunit
2) It is very easy to run through all temp-table records of a certain type, within a range, having a certain text pattern
3) It is easy to insert / add / remove "nodes"
4) A temp-table of 2000 records is more efficient than 2000 node classes in memory
5) We would not need to have a collection manager for the list of nodes

The temp-table could consist of

INT: NodeID : unique node id
INT: ParentNodeID : parent Node ID of this child
INT: NextSibling : Node ID of next sibling
INT: PrevSibling : Node ID of previous sibling
INT: NodeType : type of the node

CHAR: NodeText : The node text

etc
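
Spelled out as an ABL definition, that would be something like (index choice is just a first guess):

  DEFINE TEMP-TABLE ttNode NO-UNDO
    FIELD NodeID       AS INTEGER   /* unique node id */
    FIELD ParentNodeID AS INTEGER   /* parent node id of this child */
    FIELD NextSibling  AS INTEGER   /* node id of next sibling */
    FIELD PrevSibling  AS INTEGER   /* node id of previous sibling */
    FIELD NodeType     AS INTEGER   /* type of the node */
    FIELD NodeText     AS CHARACTER /* the node text */
    INDEX idxID IS UNIQUE PRIMARY NodeID
    INDEX idxType NodeType.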


re: proparse:node property

I'm not sure that I do agree. ;)

The navigation methods, sure thing. That would be easy to do.

However, I see proparse.cls as equivalent to a database buffer, not the database. The database is the binary file loaded into proparse, and proparse provides the "query" and "navigation" to the database.

So, when we read a database record (find customer where custid eq 200), we do have a current record. We then access the members by custid, name etc. (message customer.name). Why not do the same with proparse (proparse:node = 200, message proparse:nodetext)?

I want to try to avoid having the client deal with classes - the only class it needs to load is the proparse class. All other node information, navigation etc. is handled by proparse.

I do not have a background (quite obviously!) in OO programming, but to me the way proparse.cls is designed at the moment makes getting node information very easy.

Also, please forgive my ignorance, but why would proparse need to be re-entrant?


jurjen's picture

re: proparse:node property

I am not good at explaining things... let's try anyway, with the analogy of databases and buffers. There is only one database, served by only one database server process. Many procedures (persistent procs, internal procs, user-defined functions) will define a buffer. A buffer is actually a cursor into an index; it points to a "current record". The database server process does not have any notion of a "current record", because if it did, it would force all clients to reference that same current record. Instead, each procedure has its own local buffer - in other words, its own notion of what is "current" in its own scope. Some procedures even have more than one buffer, each pointing to a different record.

Likewise, the blob is the database, proparse.cls is the server (because it loads the blob into memory and exposes methods to navigate through it), and many procedures reference proparse.cls. Are all these procedures always watching the same node at the same time? Absolutely not. One procedure may be looking at a Field_ref node while it invokes a function that descends into that Field_ref node to figure out what the fieldname is. If there were only one current node (at the level of the server), then the first proc would no longer be looking at the Field_ref node, because the function would have caused proparse.cls to navigate away from it. A "current node" cannot be shared among procedures, just as a database buffer cannot be shared among procedures...
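
It is the same reason ABL lets you define multiple buffers for one table, each with its own independent cursor (plain standard ABL, using the sports2000 customer table):

  DEFINE BUFFER bufA FOR customer.
  DEFINE BUFFER bufB FOR customer.

  FIND FIRST bufA.   /* bufA's "current record" */
  FIND LAST  bufB.   /* bufB's - completely independent */

The database itself tracks neither; "current" exists only per buffer, and it should be the same with nodes.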

I hope this helps?


re: proparse:node property

Ohhhhhhhhhhhhhh, that makes much more sense ;)

Now, if you would care to comment on my comment regarding the UML, I will redesign and recode the classes. I just don't like the idea of the client procedures having to deal with the node class itself - the scope for memory leaks is a problem, as is having to create and maintain the node class. I want to be able to say:

MyText = ParseUnit:Nodes:Item(27):NodeText
or
ParseUnit:Nodes(27):NodeText

or

NextNodeID = ParseUnit:Node(27):NextSibling

IOW the client only deals with the proparse and parseunit classes, and integer numbers for the nodes.

I can change the above properties to be methods.

MyText = ParseUnit:GetNodeText(27)

and

NextNodeID = ParseUnit:GetNextSibling(27).

etc.

What do you think ?


jurjen's picture

re: proparse:node property

OK, I understand the part after the "IOW". It is exactly what the current API of proparse.dll looks like.
The first part looks nicer, but if I understand the syntax, it looks to me like Node(27) returns an object with a NodeText property - so there is still a Node object... It looks nice and user-friendly, but isn't a Node object just what you were trying to prevent?


re: proparse:node property

Yes, OK, I'm being an idiot. The ParseUnit would be responsible for creating and maintaining the node class - it would just be passed to the client as a reference.

Client:

DEF VAR Node AS CLASS proparseclient.node.

ASSIGN Node = ParseUnit:Node(27).

Message Node:Text view-as alert-box.

ParseUnit would create the new node if required.


re: proparse:node property

>> but if I understand the syntax it looks to me like Node(27) returns an object with a NodeText property - so there is still a Node object.

Yes, but it is not maintained or stored at the client level - the parseunit chains it all together, so the client does not have to declare a variable of class node, "new" it and release it.


jurjen's picture

re: proparse:node property

I suppose both are fine with me. The function-style API is somewhat more low-level and easier to port to DLC9, and someone can always put an OO facade on top of it.
So yes, maybe the function-style API is preferred. I'd like to hear John's opinion!?!?


re proparse:node property?

I originally thought of having a class per node, but the numbers seem to play against it: the simple example has over 400 nodes. The average file would have around 1500-2000 nodes. That is a *huge* number of classes to be hanging around in memory.

From a purist's point of view, I agree it's the best thing. From a practical point of view, it's not.

There is also the view that we work on one node at a time, and it's up to the controlling program (clientclass.p) to track the node data (it's only text, type and node id) if it wants to reference more than one node at once.


jurjen's picture

re: proparse:node property?

Ok, but you do agree that proparse:Node needs to be removed from proparse.cls?

I too have concerns that ABL might not be optimal for large numbers of small objects. Is 2000 objects a lot? I think so, but we won't know for sure until we've tried.

If a file has 2000 nodes, we don't really need 2000 instances of the Node class. An instance would not have to be new'ed until it is first referenced.
Having a Node class is still convenient, because it makes the code in the client program easier - it encapsulates the funny memptr offset calculations. If we don't want too many instances in memory, then the client just needs to "delete object node" more often.


jurjen's picture

does not compile in 10.1C

I am trying it in OE 10.1C, but line 195 of schema.cls does not compile. It says "cannot reference private member AddRecord off of an object reference."


re: does not compile in 10.1C

Check out the latest source. It should compile now.


john's picture

do?

Jurjen asked "What else can I do? Review an API proposal? Or collab on some function/method implementations?"

Most important will be your thoughts and input with regard to Prolint. For a new version of Prolint, using the new proparse.jar, have you decided what you want to do about the OE version?

I think the API is small enough that there could be a few variants for a few major versions of OpenEdge, but I'd want to focus on an API that will be useful both in Prolint and also with other parsing projects.

The server side isn't ready for packaging yet (it's only running in a test environment right now), and I'm still adding bits and pieces between the server and the client. I'm adding new pieces as I find I need them when I work on the examples.

The nice thing is that I was able to dump out some binary data that we can play with and build classes around.

I don't have an API proposal - I'm open to suggestions! It would be great to have your collaboration, both on design and implementation.

Assuming we're going to use classes, we'll need some that can be persistent through the OE session (like the xfer schema), maybe one or two that would be created with each call to the parser, and then there would be a bunch of classes (and sub-classes) that mirror the records coming out of Proparse.


jurjen's picture

re: do?

Concerning Prolint: I think we can't give up Progress version 9 yet, so that would mean an include-file/persistent-procedure API for DLC9 and state-of-the-art OOABL for 10.1C. If there is a DLC9 API and an API optimized for OE10, the two APIs will be very different, and that would lead to 2 different Prolint branches. Hmm, that sucks; in that case it is perhaps better to just keep the *current* Prolint when you're on DLC9 and only upgrade when you're using OE10 - unless you want the current Proparse DLL to disappear. Unfortunately, people often have DLC9 projects (in maintenance) and OE10 projects, so that would mean two different Prolint environments too.

Prolint is not using too many Proparse functions; if the new Proparse API comes fairly close to the old one, then it can't be too difficult to modernize the Prolint code. What I cannot control is custom rules at sites I don't know about, so those will stop working. I don't have any numbers on users, or numbers on custom rules, so I don't know how bad that is. I guess you're not planning to release a tool that refactors old Proparse clients into new Proparse clients, are you :-)


john's picture

re: do?

That's kind of what I thought; there are always older versions of Prolint available that will keep running on the older versions of OpenEdge.

Existing custom rules can continue to run on existing versions of Prolint. It doesn't sound too terribly bad (to me) to run old and new Prolint in parallel during a transition period.

Another possibility is a drop-in replacement for the old proparse.i and proparse.p that came with the DLL. It would have to imitate the API that existed in C++ in the DLL. I keep hoping this won't be necessary, but given the large number of Prolint rules that come with Prolint, it might be.


jurjen's picture

re: do?

Given that Prolint does not use very many Proparse functions, it might actually be possible to create such a drop-in API facade for Prolint. But in that case we're not really moving forward; we're only making configuration management more complicated.
Let's be practical: first design the most desirable API - define what is desirable if you could start all over again from scratch. After that, maybe negotiate compatibility issues if there are any.