Idea for XREF and Code Analysis Project in ABL

People working with 4GL/ABL are always looking for better documentation and XREF data about their source code.

A number of thoughts came together recently which might spark interest in a new project written in ABL, or in extensions to existing projects.

I found that it is trivial to dump from ProRefactor to XML: the entire syntax tree, the symbol tables, all of the include file and preprocessor information, etc. This is for one compile unit (CU) at a time.

The XML for a single CU, as you might expect, is huge – on the scale of tens of megabytes. Interestingly though, I found that one sample 59MB XML file compressed down to 900KB. Wow.

Let's call it 'cu.xml', and let's also assume that we store cu.xml.zip on the disc, and we only unzip it when we want to look at it again.

The general idea is to augment existing tools which slurp XREF into a database. There would be a process which would scan cu.xml (for each CU in the application), and store additional index information in the database.

Additional tools and scripts would allow for more detailed analysis. Let's say you found 200 compile units which matched your initial query against the xref database. The scripts would then open up cu.xml for each of those, and perform additional search or documentation tasks.
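
To make that concrete, here is a minimal ABL sketch of the two-stage idea. The xrefCU table and its fields are hypothetical stand-ins for whatever the index schema ends up being, and the unzip step shells out to an external unzip command, since ABL has no built-in zip support:

    DEFINE VARIABLE hDoc AS HANDLE NO-UNDO.

    /* Stage 1 would already have populated xrefCU (say, one row per
       compile unit per table referenced).  Stage 2: re-open cu.xml for
       each compile unit that matched the initial query. */
    FOR EACH xrefCU NO-LOCK
        WHERE xrefCU.tableName = "customer":   /* the initial query */

        /* Unzip only the compile units that matched. */
        OS-COMMAND SILENT VALUE("unzip -o " + xrefCU.cuName + ".xml.zip").

        CREATE X-DOCUMENT hDoc.
        hDoc:LOAD("file", xrefCU.cuName + ".xml", FALSE).

        /* ... detailed search or documentation work against the tree ... */

        DELETE OBJECT hDoc.
        OS-DELETE VALUE(xrefCU.cuName + ".xml").
    END.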

Such a project would involve: some XML and DB work in ABL, some brief reference to the Proparse docs and the ProRefactor javadocs, and some questions sent my way.

The bottom line benefit is that all the useful information generated by ProRefactor would become accessible from ABL scripts.

If you are interested, please post here. I would build the scripts which dump from ProRefactor to XML. I wouldn't be working on the ABL part, but I'd be here to answer questions about the content of the XML.


tamhas

What to Collect

It seems to me that there are two sides to the what-to-collect question: one is what is available, and the other is what we want.

On the available side there is going to be a lot of detail which could be of interest at some point, but indexing absolutely every possible node and tag seems to be of questionable utility. Barring a tool for converting code into abstracted business logic, there is a point where one will just have to go to the code for it to make much sense.

On the what-do-we-want side, let me throw out a quick list and see what others have to suggest for expanding or refining it. Not all of this is possible from just the ProRefactor source, of course, but I think it is useful to lay out what we would like, if we could get it. (A rough sketch of how a couple of these might come out as tables follows the list.)

1. Schema
1.1. Tables and Columns and their Properties
1.2. Indexes
1.3. Joins and Foreign Keys
1.4. Physical Layout
2. Code Modules
2.1. Variables
2.2. Internal Runnable Units
2.3. Block Structure
2.4. Transaction Scope
2.5. Control Flow Branching
2.6. Actual Business Logic
3. Code and Data Relationships
3.1. Tables and Columns – Where Used & Usage Mode
3.2. Table Access – Actual Where Clauses
3.3. Table Access – Index Usage
4. Code to Code Relationships
4.1. Control Flow – Procedure, IP, Function, Super, Persistent
4.2. Super and Persistent, Where Instantiated
4.3. Shared Variable/Frame/Buffer, etc. Relationships
4.4. Include Files – Where Used
4.5. Parameters Passed
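
As a rough illustration only (names and fields are guesses, not a design), entries 3.1 and 4.4 above might come out as something like the following temp-tables, or their database equivalents:

    DEFINE TEMP-TABLE ttFieldRef NO-UNDO
        FIELD cuName    AS CHARACTER   /* compile unit (relative path)      */
        FIELD tableName AS CHARACTER   /* schema table referenced           */
        FIELD fieldName AS CHARACTER   /* blank for whole-record references */
        FIELD usageMode AS CHARACTER   /* "read", "update", "create", ...   */
        FIELD lineNum   AS INTEGER
        INDEX ixWhereUsed tableName fieldName.

    DEFINE TEMP-TABLE ttIncludeRef NO-UNDO
        FIELD cuName      AS CHARACTER
        FIELD includeFile AS CHARACTER
        FIELD lineNum     AS INTEGER
        INDEX ixInclude includeFile.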

p.s., Should this discussion get moved to successive forum entries instead of comments on one forum entry? Should it get its own node and become a project?


tamhas

Difference tool

I wonder if another project that might get bundled in here would be a difference engine based on the syntax tree, so that it could accurately indicate differences in code that has been reformatted.


john

Code to XML

OK, I've re-arranged some of the ProRefactor docs a little bit, and I've posted Code to XML as a new node with a new Groovy script. Seems to work a treat, at least for me.

Now what we need is an ABL programmer to build a tool to read the *.xml.zip files, and turn that confusing XML into some simpler ABL data structures like TEMP-TABLEs or whatever.
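
To give a feel for what that might look like, here is a minimal, untested sketch that flattens one (already unzipped) cu.xml into a temp-table of nodes using the DOM handles. It assumes nothing about the dump schema beyond it being well-formed XML; the temp-table fields are just illustrative:

    DEFINE TEMP-TABLE ttNode NO-UNDO
        FIELD nodeId   AS INTEGER
        FIELD parentId AS INTEGER
        FIELD nodeType AS CHARACTER     /* the element (tag) name */
        INDEX ixNode IS PRIMARY UNIQUE nodeId.

    DEFINE VARIABLE hDoc  AS HANDLE  NO-UNDO.
    DEFINE VARIABLE hRoot AS HANDLE  NO-UNDO.
    DEFINE VARIABLE iSeq  AS INTEGER NO-UNDO.

    CREATE X-DOCUMENT hDoc.
    CREATE X-NODEREF  hRoot.
    hDoc:LOAD("file", "cu.xml", FALSE).
    hDoc:GET-DOCUMENT-ELEMENT(hRoot).

    RUN loadNode (hRoot, 0).

    DELETE OBJECT hRoot.
    DELETE OBJECT hDoc.

    PROCEDURE loadNode:
        DEFINE INPUT PARAMETER hNode     AS HANDLE  NO-UNDO.
        DEFINE INPUT PARAMETER iParentId AS INTEGER NO-UNDO.

        DEFINE VARIABLE hChild AS HANDLE  NO-UNDO.
        DEFINE VARIABLE i      AS INTEGER NO-UNDO.
        DEFINE VARIABLE iThis  AS INTEGER NO-UNDO.

        ASSIGN iSeq  = iSeq + 1
               iThis = iSeq.

        CREATE ttNode.
        ASSIGN ttNode.nodeId   = iThis
               ttNode.parentId = iParentId
               ttNode.nodeType = hNode:NAME.

        /* Recurse into child elements, preserving parent/child links. */
        CREATE X-NODEREF hChild.
        DO i = 1 TO hNode:NUM-CHILDREN:
            IF hNode:GET-CHILD(hChild, i) AND hChild:SUBTYPE = "element" THEN
                RUN loadNode (hChild, iThis).
        END.
        DELETE OBJECT hChild.
    END PROCEDURE.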


tamhas

Some other things to consider

After an exchange with Carlo (Minollo) Innocenti of DataDirect, I have a couple of pieces to throw into the stew.

1. DataDirect's XQuery product does not support the OpenEdge database as either a relational or XML data source. Bummer.

2. XQuery does support a number of XML-specific databases (not recommended) and a number of databases like Oracle that have an XML datatype. Were the former recommended, they might be interesting, but as it is it seems like none of this is useful.

3. The most likely scenario, then, is storage of the XML in the filesystem, indexed by the OpenEdge database, which is used to resolve all the queries we thought of in advance. XQuery *will* handle zipped XML files, but unless storage is at a premium that isn't particularly recommended, since there is a performance hit: XQuery needs to unzip the files anyway.

4. An alternative might be encoding the XML files as Fast Infoset, which XQuery will process directly.


tamhas

One more

One possible XML database is Berkeley DB XML (Sleepycat), which is now an Oracle open source product. It does support indexing and XQuery.


Count me in

I've already got a working XREF -> DB system going, although finishing it has been held up by other work. I think a proparse -> db configuration is an excellent idea, and I'd like to participate at some level.


tamhas

What to collect

With COMPILE XREF, there is only so much data to collect, so it is fairly apparent that one should collect it all and do what one can to fudge trying to collect more. In the case of this proposal, though, the potential amount of information is huge ... not that it covers everything one might want to know, particularly thanks to dynamic references, but still there is a huge amount of data available. So, it seems to me that one of the first steps is to compile a list of all possible information and then to make a pass at subdividing that into priority categories -- obvious must have, might be useful, imaginably useful, probably not useful, are you kidding?, etc. That would then lead to selection of what to do in an initial pass and a data structure design.

The other big question I have is about storing the full XML, regardless of what is extracted from it. On the one hand, it seems like one might be able to do some interesting things with XQuery and this dataset, but the dataset is likely to be huge. As John notes, it compresses well, but I don't know if it can be used in a compressed state and I don't know how long it would take to recreate it from scratch.

And, of course, one of the questions is whether the XQuery-type inquiries would be directed at a subset identified by querying the database, or whether one would need to use XQuery on the whole thing.


john

XML stored on disc

I knew there was a reason for the XML to be stored on disc, and I just remembered what that is. The XML is being generated from a Java process, which means that if an OE process wanted to generate the XML on the fly (rather than read from stored XML files on the disc) then it would have to spawn a java process to write temporary XML files. That would be an expensive operation. (One might suggest servers and TCP communications, but that would be overkill.)
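
Just to put that in perspective, generating on the fly would mean the OE session doing something like the following for every compile unit (dumpcu.groovy and myproc.p are hypothetical stand-ins, not the actual script), and starting a JVM per call is what makes it expensive:

    /* One JVM start-up per compile unit, just to write a temporary cu.xml. */
    OS-COMMAND SILENT VALUE("groovy dumpcu.groovy myproc.p").

Reading a pre-built cu.xml.zip from the disc avoids all of that.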


john

Re: Count me in

Right on. I will put together a write-up and a script for generating XML from source code. If you get a chance to set up the ProRefactor standalone, let me know.


Is proparse available for HP-UX?

Is proparse available for HP-UX? If not, which platforms is it available on?


john

Proparse binaries

There is a Windows build here:
joanju.com/proparse
and there are various unix builds here:
joanju.com/proparse/unix/

Now that it's open source, you can use GNU Make and GCC (C++) to compile it on whatever platform you want.

Contributed builds are always welcome. :)