Retired (historical) pages

Parent page for retired (no longer used) Proparse pages.


Compiling the C++ DLL

If you have everything set up perfectly, then Proparse can be compiled with a simple GNU make command. Getting everything set up can take a lot of effort, though, so if you are adventurous and want to try compiling it yourself, be prepared for some challenges. The source requires GNU C++, GNU Make, Perl, and probably other things I've forgotten. For performance reasons I usually compile the Windows DLL with MSVC++, but I found that by installing the right packages I can also compile it with Cygwin. Doing a build on Linux isn't too hard, and the GNU and other tools are usually easy to find and install for other Unix platforms too.

Boost
Compiling Proparse also requires the C++ Boost libraries, which are large, so I haven't included them in the SVN repository. Download them from www.boost.org (I'm using version 1.33.1), unzip the archive somewhere, then copy its 'boost' subdirectory into the openproparse directory (alongside antlr, java2, spider, etc.).

Antlr
The Antlr libraries are included, and the sources are under version control because I've had to hack a few of the C++ source files.

JDK
Proparse compiles with a built-in Java Native Interface (JNI), so it requires a couple of C header files from the JDK. The Makefile, by default, looks for the JDK in /progfile/jdk/, but of course you can change that.

Sample Session
Here is an example session compiling Proparse on Linux. If you aren't a 'vi' user, use your own editor. Note the necessary copy of the 'boost' directory. See the Makefile for the necessary (but configurable) JDK directories and files.

$ svn export svn://oehive.org/proparse/archives/proparse-cpp
$ cp -a boost openproparse
$ cd openproparse
$ vi Makefile
$ export LD_LIBRARY_PATH=.
$ make

New proparse.jar API

Proparse has been ported to Java and can be launched as a small server which listens on a TCP port. It receives a program name, finds and parses the program's source code, and then returns a BLOB containing records of all the objects in the syntax tree and symbol tables.

The blobs will be read by ABL routines to generate ABL objects (temp-tables, data sets, Objects) as needed.
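
The wire protocol between the client routines and the server isn't documented on this page, so the following Java sketch only illustrates the general shape of a client: connect to the server's port, send a program name, and read back the blob. The request format and the length-prefixed response framing are assumptions, not the actual protocol.

// Hypothetical client sketch. The real clients are ABL routines, and the
// request/response framing shown here is an assumption for illustration only.
import java.io.DataInputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.net.Socket;

public class ProparseClientSketch {
    public static void main(String[] args) throws IOException {
        try (Socket sock = new Socket("localhost", 55001)) {        // 55001 is the default port
            PrintWriter out = new PrintWriter(sock.getOutputStream(), true);
            DataInputStream in = new DataInputStream(sock.getInputStream());
            out.println("customer/updcust.p");                      // assumed: send a program name
            int length = in.readInt();                               // assumed: length-prefixed blob
            byte[] blob = new byte[length];
            in.readFully(blob);
            // 'blob' now holds the records for the syntax tree and symbol tables
        }
    }
}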

This new Proparse API will be used by another project right away. In the coming days you can look for “GUI BOM” here on the hive.

There isn't any timetable for a new version of Prolint using this new API.

API Goals

The API will be a facade for the byte data in the blob, and it should generate ABL objects only on demand. For example, consider node:firstChild(), which should return a Node object. Internally, firstChild() will check whether the object has been created yet, create it if necessary, and then return it.

As another example, node:getText() (or just node:text using PROPERTIES) won't actually return data from the Node object itself. The method will look up the necessary records in the blob to return the node's text.
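
The API itself will be written in ABL, but the lazy-creation pattern is language neutral. Here is a minimal Java sketch of the idea; the Blob helper and its method names are hypothetical.

// Sketch of the facade pattern: Node objects are created on demand, and text
// is read from the blob rather than stored in the object. Names are hypothetical.
interface Blob {
    int intField(int recordIndex, String fieldName);       // read an integer field of a record
    String stringField(int recordIndex, String fieldName); // resolve a string reference field
    Node nodeFor(int recordIndex);                          // return (or lazily create) a Node
}

class Node {
    private final Blob blob;     // the raw blob plus the cache of created objects
    private final int index;     // this node's record index within the blob
    private Node firstChild;     // created only when first asked for

    Node(Blob blob, int index) { this.blob = blob; this.index = index; }

    Node firstChild() {
        if (firstChild == null) {
            int childIndex = blob.intField(index, "firstChild");
            if (childIndex != 0) firstChild = blob.nodeFor(childIndex);
        }
        return firstChild;
    }

    String getText() {
        return blob.stringField(index, "text");   // looked up in the blob on each call
    }
}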

Another example: functions which query the syntax tree (for example, find all DEFINE nodes) should not operate by recursive descent over all the objects in the tree, because that would force the creation of a Node object for every node in the tree. Instead, such query functions should work by examining the bytes in the blob directly, creating Node objects only for the result set. (The API will have to keep track of which Node objects have already been created.)
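
A hedged Java sketch of such a query: it scans the raw records through the index and collects only the index numbers of the matches, so Node objects need to be created only for the result set. The record tag and field offset constants are assumptions.

// Sketch of a tree query that scans record bytes instead of walking Node objects.
// The record tag and the offset of the node-type field are hypothetical values.
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

class TreeQuerySketch {
    static final int RECORD_TAG_NODE = 1;    // assumed tag marking a node record
    static final int TYPE_FIELD_OFFSET = 4;  // assumed offset of the node-type field

    /** Return the index numbers of all nodes with the wanted type. */
    static List<Integer> findNodesOfType(ByteBuffer blob, int[] index, int wantedType) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 1; i < index.length; i++) {             // walk every record via the index
            int offset = index[i];
            if (blob.getInt(offset) == RECORD_TAG_NODE
                    && blob.getInt(offset + TYPE_FIELD_OFFSET) == wantedType) {
                hits.add(i);                                   // create Node objects later, only for hits
            }
        }
        return hits;
    }
}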

OpenEdge Version

It is probably worth pointing out that you can use older versions of Prolint with older versions of OpenEdge. We will write some of the new API using ABL classes, but it wouldn't be a big task if anyone wanted to back-port the API so it works with earlier releases. Where there's no significant disadvantage, we'll use a plain old .p so that back-porting would be a little less work.


New blob API early example

I created a new repository called 'codeparse', added a directory called 'wip' for shared efforts that are work-in-process or experimental, and added a subdirectory called 'proparseclient'.

svn://oehive.org/codeparse/wip/proparseclient

If you are interested in building and using the API, you should be able to run client.p in that directory, which reads the two binary files 'header.bin' and 'blob.bin'. The program 'tree2text.p' is small and reasonably easy to understand as a starting point. If you don't know which API we're talking about, see http://www.oehive.org/node/1247

(SVN is running slow for me today, which is unusual. I hope it's just my local ISP.)

As my next task, I think I'll add a data transfer from Proparse so that we can fetch its internal data about node types and keywords.


New proparse.jar BLOB Internals

Most developers working with Proparse will use an existing API and will have no need for these notes. They are only for developers working directly with the bytes in the blob, for example, when developing a new API.

The layout for the blob had one overriding design consideration: it had to be indexed for fast random access to the records and fields within it. Loading all of the records into ABL objects would take too long for an interactive tool like Prolint.

As a result, the blob ended up looking somewhat like a miniature read-only database.

In order to understand what is inside the blob, it is probably easiest to start by understanding the goals.

Fixed Length Records (mostly)

One goal was to be able to access record fields (examples: node.text, node.type, node.line) by their offset in the record. As a result, most types of records are fixed length. The exceptions are strings, lists, and maps (i.e.: 'collections'). Collection records all have a similar layout: the collection's size, followed by the data.

So how does a 'node' record, for example, have a fixed length if it contains variable-length data like strings? A string field in the record doesn't contain the string; it only contains a reference to a string record. The same is true for references to lists, maps, and any other record. The only fields whose data is actually stored at the field position in the record are boolean and integer fields.
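
As a small illustration (the field offsets are made up for the example), reading a node record in Java might look like this: the line number is read directly from the record, while the 'text' field yields only an index number that still has to be resolved.

import java.nio.ByteBuffer;

// Hypothetical field offsets, for illustration only: an integer field is stored
// in place, while a string field holds just the index number of a string record.
class NodeRecordSketch {
    static int nodeLine(ByteBuffer blob, int nodeOffset) {
        return blob.getInt(nodeOffset + 8);     // integer field: the value itself
    }
    static int nodeTextIndex(ByteBuffer blob, int nodeOffset) {
        return blob.getInt(nodeOffset + 12);    // string field: a reference, not the characters
    }
}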

The Index

Since records (other than collections) are fixed-length, we couldn't use the usual serialization technique of writing one record right at a field position within a referencing record. But since we don't know the byte offset of each record within the blob until the record is written, how do we reference the record? Each record is referenced with an integer index number.

The index is an array of integer offsets. So, if we know that we want the record with index number 2, we get the offset by looking at position number two in the index.

The index is written at the end of the blob, after all records have been written, and their index numbers (and offsets) have been tallied.
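
A minimal sketch of that lookup, assuming the index has already been read from the end of the blob into an int array: position n holds record n's byte offset, and a string record starts with its size followed by its data.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class IndexSketch {
    // Position n of the index holds the byte offset of record n.
    static int offsetOf(int[] index, int indexNumber) {
        return index[indexNumber];
    }

    // A string record (a 'collection') starts with its size, followed by the data.
    // The 4-byte length prefix and UTF-8 encoding are assumptions in this sketch.
    static String stringRecordAt(ByteBuffer blob, int offset) {
        int length = blob.getInt(offset);
        byte[] bytes = new byte[length];
        blob.position(offset + 4);
        blob.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}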

The Record Schema

A special problem was presented by the fact that the Java code inside Proparse might change. We might want to rename fields. Fields might be inserted into the middle of field lists – especially troublesome in cases where fields get added to super classes.

To avoid problems with field names and positions changing over time, each blob will contain a 'schema' of the records inside it. The schema is safe to re-use from one blob to the next, as long as the same version of Proparse stays running as the server. (In other words, restart your clients if you upgrade the server.)

The schema for a class is very simple: an index to the name of the class in Proparse (e.g. “org.prorefactor.core.JPNode”), followed by a pair of numbers for each field. One number is the field's record type, and the other is the index of the string record holding the field's name.
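
A sketch of walking one class's schema record under those rules; the 4-byte integer layout and the way the field count is obtained are assumptions here.

import java.nio.ByteBuffer;

class SchemaSketch {
    // First an index to the class name string, then one (record type, field-name index)
    // pair per field. The fieldCount parameter is an assumption for this sketch.
    static void dumpClassSchema(ByteBuffer blob, int schemaOffset, int fieldCount) {
        int classNameIndex = blob.getInt(schemaOffset);
        System.out.println("class name: string record #" + classNameIndex);
        for (int f = 0; f < fieldCount; f++) {
            int recordType     = blob.getInt(schemaOffset + 4 + f * 8);
            int fieldNameIndex = blob.getInt(schemaOffset + 8 + f * 8);
            System.out.println("  field: type " + recordType
                    + ", name at string record #" + fieldNameIndex);
        }
    }
}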


Using proparse.jar as a Server

Proparse.jar may be launched as a server which listens on a TCP socket. For example:

rem proparse.bat
set JAVA_PATH= proparse.jar
set JAVA_OPTS= -cp %JAVA_PATH% -Xss2M
java %JAVA_OPTS% proparse.Server

As of early August 2008, the new API for using this server is still work-in-process.

Optional
There are a few server options which can be specified in an optional file named 'proparseserver.properties' in the server's working directory:

# Configuration file for running proparse.jar as a server.
# This file must be in your current working directory.
# (i.e.:  ./proparseserver.properties)
# Remove the leading hash mark '#' to uncomment a setting
# for property = value.


# port (optional)
# The port to listen on. If no port is specified, then 55001 is used.
# port = 55001


# project (optional)
# The project settings directory to load from ./prorefactor/projects/.
# If no project is specified, then Proparse will simply load
# from the first directory it finds.
# There does not have to be a project configured before launching the
# server, because the client can take care of that.
# project = sports2000 

Using Proparse

Download

You can find pre-built Proparse libraries on Joanju's Proparse page at joanju.com/proparse/.

Configuration for Project Settings

See Project Config. Dump for a description of how to generate the configuration files needed by Proparse.


New AST

These pages are for exploring ideas and designing a new AST and/or "Intermediate Representation" tree for further semantic analysis of 4GL/ABL source code.

The output of Proparse and ProRefactor will be used to generate an IR tree which is more appropriate for control flow, data flow, and other kinds of analysis.

Proparse creates a syntax tree which is great for lint, code documentation, and search-and-replace tasks. It is also very usable for resolving symbols and determining scopes - which is exactly what ProRefactor does.

It is common in compilers to find the use of an Intermediate Representation (IR), which is part way between parsing and generating target code (machine code, byte code, etc). Before the target code is written, the IR undergoes a series of "semantic preserving" transformations, such as constant propagation, dead code removal, and other optimizations.

Many of these optimizations on the IR require control flow and data flow analysis to be performed, and those analyses are exactly what we are interested in.

This topic belongs to the Codeparse group, and we will use its forum/mailing-list for discussions.


Links and Reference Materials

Appel's book Modern Compiler Implementation in Java (second edition) is a likely candidate as a guideline for our new IR. It also leads directly into some of the analysis techniques that we will want to get into, including the use of SSA form (see below). The MiniJava project source code, especially the "Tree" package, may be a useful starting point.

Static Single Assignment (SSA) form is an IR form which is ideal for the kinds of flow analysis that we want to perform.

Parrot is a virtual machine project intended to become the virtual machine for Perl 6, as well as a viable target for many other languages. Some of that project's documentation may be of interest to us, as they also plan to provide an IR for compiler writers.

phc is a project to build a PHP compiler with native Linux (x86) binaries as its target. Although its target is of no interest to us, some of the parallels between that project and our own projects are interesting. Their list of spinoffs looks like our marketing pages for Proparse. :) Note, especially, that "semantic checker" is just another term for "lint". Their whatsinstore page dives right into IR and what they plan to do with it. They plan to design their own IR, which may or may not be of interest to us. See this post in their mailing list for their motivation.

The GNU Compiler Collection uses a tree SSA intermediate form for its compilers. This would be of great interest to us if it weren't for the GPL. Anything we do here will have to be business friendly. Er, let me rephrase that. Anything *I* do here will be business friendly. Others, of course, are welcome to use these efforts for playing with GCC as a target if they want. :)


Motivation

So why, exactly, do we want to do all this weird stuff?

The motivation comes from a surprisingly large number of goals, all of which depend on more thorough analysis of control flow and data flow in existing applications.

Prolint certainly comes to mind. There are many types of static code checking that could be done with more complete control and data flow analysis. For example, checking for problematic transaction scopes that span multiple compile units is difficult or impossible without a complete grasp of control flow.

Reverse engineering existing 4GL/ABL applications can hardly be considered complete without full data flow analysis. The flow of data through an application is key to how well that application can be understood, yet there just aren't any tools which gather this information in a complete and detailed fashion.

Finally, re-engineering projects cannot be automated to any significant extent without better semantic understanding. When an OpenEdge application needs that "complete rewrite", there aren't yet enough tools out there to make this task anything less than enormous. The existing application needs to be reverse-engineered, optimized and cleaned, re-architected, and then finally re-generated for the new OpenEdge framework (OO, OERA, etc). There are plenty of techniques which already exist for optimizing and cleaning, but most of those depend on the existence of a suitable Intermediate Representation.


Statements and Functions

This section is speculative, and subject to change.

Here, for discussion's sake, I'd like to describe the semantics of an OpenEdge application in terms of two broad categories.

The first is basic expressions and control flow. These are things that can be represented in any language, including looping and conditional branching. They will be represented explicitly in our IR.

The second is platform runtime support. Most statements and functions in 4GL/ABL are not simple expressions or control flow, but instead are direct references to the OpenEdge platform runtime support. Those statements and functions will likely be represented in our IR by calls to stubs, that is, calls to unimplemented subroutines.

To some extent, the two are intertwined. Consider the FOR EACH construct, which provides basic expression evaluation and looping, but also uses platform runtime support for buffer iteration over rows in tables. In cases such as these, our tree transform will have to tease the two apart, so that basic expressions and control flow are represented explicitly, and the use of platform runtime support is represented by calls to subroutines.
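
Purely as an illustration of that split (the IR design is not settled, and these node classes and stub names are invented for the sketch), a FOR EACH might come apart into an explicit loop whose 'advance' step is a call to a runtime stub:

import java.util.List;

// Illustrative only: hypothetical IR node classes and runtime-stub names.
interface IrNode {}
record CallStub(String runtimeRoutine, List<IrNode> args) implements IrNode {} // platform runtime support
record Loop(IrNode condition, List<IrNode> body) implements IrNode {}          // explicit control flow
record BufferRef(String bufferName) implements IrNode {}

class ForEachSketch {
    // FOR EACH customer: ... END. becomes a loop that is explicit in the IR,
    // while fetching the next row stays a call to an unimplemented runtime stub.
    static IrNode lowerForEach(String bufferName, List<IrNode> body) {
        IrNode advance = new CallStub("rt_find_next", List.of(new BufferRef(bufferName)));
        return new Loop(advance, body);
    }
}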

Statements in 4GL/ABL can be enormous. In OpenEdge itself, the statement is compiled to an AST, and that AST is interpreted by the runtime. If, in our IR, we treat most statements as calls to subroutines, it is tricky to come up with an appropriate representation for passing all of the data contained in the statement's AST branch to the subroutine. There needs to be a happy medium between implementing the semantics of the subroutine itself, which we don't want to do, and just passing the entire AST as an argument to the subroutine stub, which would defeat our goal of using the IR for data flow analysis. Update: I'm pretty sure I've come up with a good way to handle this. More to come soon!