Gorp.NET - a new library for creating reversible templates to extract data from structured text

Gorp.NET is a new library for creating reversible templates that extract data from structured text. It is based on the existing Salesforce Gorp code base.

In this article I will briefly describe how to parse structured text with Gorp, one example of the tools sometimes called reversible template systems.
What is a reversible template in general? Suppose we have a system that generates the text we need from some initial data, following strict rules defined by a template syntax. Now imagine the opposite task: we have a text whose structure could have been produced by such a template-based system, and our goal is to recover the source data from which it was formed. A generalized syntax for describing this task, supplied to a parser that splits the input text into separate elements, is an example of a syntax that implements the concept of reversible templates.
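The idea of the two directions can be sketched in a few lines of plain Java. This is only an illustration built on java.util.regex, not Gorp code: the class name, the method names and the "Connected to host:port" text format are all assumptions made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: one fixed "template" used in both directions.
// All names here are hypothetical and are not part of Gorp.
class ReversibleTemplateDemo {

    // Forward direction: data -> text.
    static String render(String host, int port) {
        return "Connected to " + host + ":" + port;
    }

    // Reverse direction: text -> data, via named capture groups.
    private static final Pattern REVERSE =
            Pattern.compile("Connected to (?<host>[A-Za-z0-9_.\\-]+):(?<port>\\d+)");

    static Map<String, String> extract(String text) {
        Matcher m = REVERSE.matcher(text);
        if (!m.matches()) {
            return null; // the text does not fit this template
        }
        Map<String, String> data = new LinkedHashMap<>();
        data.put("host", m.group("host"));
        data.put("port", m.group("port"));
        return data;
    }

    public static void main(String[] args) {
        String text = render("example.org", 8080);
        System.out.println(text);
        System.out.println(extract(text));
    }
}
```

A reversible template system generalizes this: instead of hand-writing the reverse regular expression, you describe the text structure once in a dedicated syntax.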

Why did I decide to write specifically about Gorp? I took this particular system as the basis for my own project. You can read about the history of that project, including details of all the changes I made to the original Gorp, in the previous article. Here we will focus on the technical side, including the use of the modified version of the engine. For convenience I will keep calling it Gorp.NET, although in reality it is not a port of Gorp to .NET but only a slightly polished and extended version of it, still written in Java. However, the add-on over the Gorp library (in my version), a managed DLL called BIRMA.NET, uses its own special assembly: the very Gorp.NET that you can easily obtain yourself by running the source code (its repository is at https://github.com/S-presso/gorp/) through the IKVM.NET utility.

I will note right away that for all kinds of data extraction from structured text, the Gorp.NET tools themselves will be quite enough, at least if you know a little Java, or know how to call methods from external Java modules in your .NET Framework projects and how to include various types from the standard JVM libraries there (I achieved this through the same IKVM.NET, which, however, now has the status of an unsupported project). What you do next with the extracted data is up to you. Gorp and Gorp.NET alone provide only a bare framework. Some groundwork for further processing of such data is contained in the aforementioned BIRMA.NET, but a description of the BIRMA.NET functionality is a topic for a separate publication (although I already mentioned a few things in my previous comparative historical review of BIRMA technologies). Looking ahead, I will allow myself a somewhat bold statement: the technology for describing reversible templates used in Gorp.NET (and, accordingly, in BIRMA.NET) is rather unique among other handicraft projects of this kind. I say "handicraft" because I have not yet seen large companies promoting their own frameworks for these purposes, except perhaps Salesforce itself with its original implementation of Gorp.

For the most complete account of the concept and the technical aspects behind the template description system used in Gorp, I will simply leave a link here to the original documentation in English. Everything stated there applies equally to Gorp.NET. Now I will briefly describe the essentials.

So, a template description is a text document (it may even be a single long string that can be passed to the corresponding method for processing). It consists of three parts that describe, in order, the three main entities: patterns, templates and extractions.

The lowest-level blocks here are patterns: they can consist only of regular expressions and references to other patterns. The next level of the hierarchy is occupied by templates, whose descriptions contain references to patterns (which can also be named), as well as text literals and references to nested templates and extractors. There are also parametric patterns, which I will not touch on here (the original documentation has few examples of their use). Finally, there are extractions, which specify the concrete syntax rules that associate named patterns with specific occurrences in the source text.
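To make the idea of patterns referencing other patterns more tangible, here is a minimal sketch of how %name references could be expanded into one ordinary regular expression. It is an assumption about the general principle, not Gorp's actual implementation: the class, the expand method and the reference syntax handling are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: expands %name references recursively.
// It does not detect cyclic definitions, which would recurse forever.
class PatternExpansionDemo {
    private static final Pattern REF = Pattern.compile("%([A-Za-z_][A-Za-z0-9_]*)");

    static String expand(String definition, Map<String, String> patterns) {
        StringBuilder out = new StringBuilder();
        Matcher m = REF.matcher(definition);
        int last = 0;
        while (m.find()) {
            String body = patterns.get(m.group(1));
            if (body == null) {
                throw new IllegalArgumentException("Unknown pattern: %" + m.group(1));
            }
            // Copy the literal part, then the referenced pattern in a non-capturing group.
            out.append(definition, last, m.start());
            out.append("(?:").append(expand(body, patterns)).append(')');
            last = m.end();
        }
        out.append(definition, last, definition.length());
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> patterns = new LinkedHashMap<>();
        patterns.put("phrase", "\\S+");
        patterns.put("num", "\\d+");
        patterns.put("ts", "%phrase"); // a pattern defined through another pattern
        String regex = expand("<%num>%ts", patterns);
        System.out.println(regex);
        System.out.println(Pattern.matches(regex, "<86>2015-05-12T20:57:53.302858+00:00"));
    }
}
```

Wrapping each expansion in a non-capturing group keeps alternations inside a referenced pattern from leaking into the surrounding expression.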

As I understand it, the original goal of Gorp's creators was to parse sequences of data contained in report files (or log files). Consider a simple example of the system in use.

Suppose we have a report containing the following line:

<86>2015-05-12T20:57:53.302858+00:00 10.1.11.141 RealSource: "10.10.5.3"


Let's compose an example template for parsing it using Gorp tools:

pattern %phrase \S+
pattern %num \d+
pattern %ts %phrase
pattern %ip %phrase

extract interm {
template <%num>$eventTimeStamp(%ts) $logAgent(%ip) RealSource: "$logSrcIp(%ip)"
}


Note that the template declaration block is omitted here altogether, since everything needed is already included in the final extraction. The captured values are named: each $name is followed, in parentheses, by the pattern it must match. As a result, a set of text values will be created with the names eventTimeStamp, logAgent and logSrcIp.
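For comparison, roughly the same extraction can be written by hand as a single regular expression with named groups. This is an approximation in plain java.util.regex under the assumption that %phrase is \S+ and %num is \d+, as in the template above; it says nothing about how Gorp compiles templates internally, and the class name is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hand-written approximation of the "interm" extraction; not produced by Gorp.
class LogLineRegexDemo {
    static final Pattern LINE = Pattern.compile(
            "<\\d+>(?<eventTimeStamp>\\S+) (?<logAgent>\\S+) RealSource: \"(?<logSrcIp>\\S+)\"");

    // Returns a matcher positioned on a full match, or null if the line does not fit.
    static Matcher match(String line) {
        Matcher m = LINE.matcher(line);
        return m.matches() ? m : null;
    }

    public static void main(String[] args) {
        Matcher m = match(
                "<86>2015-05-12T20:57:53.302858+00:00 10.1.11.141 RealSource: \"10.10.5.3\"");
        if (m != null) {
            System.out.println(m.group("eventTimeStamp"));
            System.out.println(m.group("logAgent"));
            System.out.println(m.group("logSrcIp"));
        }
    }
}
```

The template version is easier to read and maintain because the building blocks (%ts, %ip) carry names and can be reused across several extractions.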

We will now write a simple program for extracting the necessary data. Suppose that the template we created is already contained in a file called extractions.xtr.

import java.io.File;
import java.util.Map;

import com.salesforce.gorp.DefinitionReader;
import com.salesforce.gorp.ExtractionResult;
import com.salesforce.gorp.Gorp;

// ...

// Read the pattern/template/extraction definitions from the file
DefinitionReader r = DefinitionReader.reader(new File("extractions.xtr"));
Gorp gorp = r.read();

final String TEST_INPUT = "<86>2015-05-12T20:57:53.302858+00:00 10.1.11.141 RealSource: \"10.10.5.3\"";
ExtractionResult result = gorp.extract(TEST_INPUT);
if (result == null) { // no match, handle
    throw new IllegalArgumentException("no match!");
}
// Extracted values keyed by their names (eventTimeStamp, logAgent, logSrcIp)
Map<String,Object> properties = result.asMap();
// and then use extracted property values


Another example of a simple parsing template:

# Patterns
pattern %num \d+
pattern %hostname [a-zA-Z0-9_\-\.]+
pattern %status \w+

# Templates
template @endpoint $srcHost(%hostname):$srcPort(%num)

# Extraction
extract HostDefinition {
template @endpoint $status(%status)
}


Well, I think the point is clear. It is also worth mentioning that the extract method has an overload with two parameters, the second of which is a boolean. If you set it to true, the method iterates over all potential data sets until it meets a suitable one (you can also replace this call with extractSafe, which takes no second parameter). The default value is false, and in that case the method may complain when the input data does not fit the template used.
Note also that Gorp.NET introduces a new, extended implementation of extract: there is now a version with two trailing boolean parameters. The shorthand call extractAllFound sets both of them to true. A true value for the third parameter gives even more room for variation: we can now analyze text with arbitrary characters interspersed between the desired, structured fragments (those containing the sets of extracted data).
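The difference between strict matching and scanning through text with arbitrary inclusions can be illustrated with plain java.util.regex. This is only an analogy, not the Gorp.NET implementation of extract or extractAllFound: the class, its methods and the sample input are assumptions for the sake of the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Analogy only: strict whole-input matching versus scanning for all fragments.
class ScanAllDemo {
    private static final Pattern ENDPOINT =
            Pattern.compile("(?<srcHost>[a-zA-Z0-9_\\-.]+):(?<srcPort>\\d+)");

    // Strict mode: the whole input must fit the pattern.
    static boolean matchesWhole(String input) {
        return ENDPOINT.matcher(input).matches();
    }

    // Scanning mode: collect every fitting fragment, skipping whatever lies between.
    static List<String> findAll(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = ENDPOINT.matcher(input);
        while (m.find()) {
            found.add(m.group("srcHost") + " -> " + m.group("srcPort"));
        }
        return found;
    }

    public static void main(String[] args) {
        String noisy = "noise alpha:80 ### more noise beta.local:8080 end";
        System.out.println(matchesWhole(noisy));
        System.out.println(findAll(noisy));
    }
}
```

In strict mode the noisy input is rejected as a whole, while the scanning loop still recovers both endpoints from it.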

So, the time has come to answer the question: what exactly is unique about this modification of the basic version of Gorp, apart from the extended extract method?
Several years ago I created a tool of my own for extracting required data from text (it was also based on processing templates with their own specific syntax), and it worked on slightly different principles. The main difference from the approach implemented in Gorp and all frameworks derived from it was that each text element to be extracted was specified simply by listing its left and right borders (each of which could either be part of the element itself or merely separate it from the preceding or following text). In the general case, the structure of the source text as a whole was not analyzed, as it is in Gorp; only the required pieces were singled out. The text enclosed between them did not have to lend itself to any structural analysis at all (it could well be an incoherent set of characters).
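The border-based approach described above can be sketched as follows. This is a reconstruction of the general idea only, under the simplifying assumption that both borders are excluded from the result; the class and method names are hypothetical and have nothing to do with the original tool's code.

```java
// Hypothetical sketch of border-based extraction: only the borders matter,
// the text around and between elements is never analyzed.
class BoundaryExtractDemo {

    // Returns the text between the first `left` found at or after `from`
    // and the next `right` after it, or null if either border is missing.
    static String between(String text, String left, String right, int from) {
        int start = text.indexOf(left, from);
        if (start < 0) {
            return null;
        }
        start += left.length();
        int end = text.indexOf(right, start);
        if (end < 0) {
            return null;
        }
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String text = "@@junk## <b>Alice</b> ~~~ <i>5</i> %%";
        System.out.println(between(text, "<b>", "</b>", 0));
        System.out.println(between(text, "<i>", "</i>", 0));
    }
}
```

Nothing in this code cares what "@@junk##" or "~~~" are, which is exactly the property that plain Gorp templates lack.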

Is it possible to achieve a similar effect in Gorp? In its initial version, perhaps not (correct me if I am mistaken). If we simply write an expression like (.*) followed immediately by a mask for the left border of the next element to be found, the greedy quantifier will capture the entire subsequent text. And we cannot use regular expressions with non-greedy syntax in the existing Gorp implementations.
Gorp.NET lets you get around this problem smoothly by introducing two special kinds of patterns, (%all_before) and (%all_after). The first is, in effect, an alternative to the non-greedy version of (.*), suitable for use in your own templates. As for (%all_after), it also scans the source text up to the first occurrence of the next part of the described pattern, but relies on the search result of the previous pattern. Everything between them also falls into the extracted substring of the current element. In a sense, (%all_after) "looks back", while (%all_before), on the contrary, "looks forward". I note that in the first version of BIRMA a missing left border in an element's description served as a kind of analogue of (%all_before), and an empty right border served as the analogue of (%all_after). If neither border was set when describing an element, the parser would obviously capture all the subsequent text! However, that implementation of BIRMA is now of purely historical interest (you can read a little more about it in my report of that time).
(The source code of that implementation was never published anywhere because of its extremely low quality; in truth, it could serve as a monument to poor software design.)
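The greedy capture problem mentioned above is easy to reproduce in plain java.util.regex, where a non-greedy (reluctant) quantifier is available. The sketch below only demonstrates that difference; it does not show how (%all_before) is implemented in Gorp.NET, and the class, the method and the sample string are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Demonstrates greedy versus reluctant skipping before a left border.
class GreedyVsReluctantDemo {

    // Returns {skipped text, captured word} for the first match, or null.
    static String[] capture(String input, boolean greedy) {
        String skip = greedy ? "(.*)" : "(.*?)";
        Matcher m = Pattern.compile(skip + "<b>(\\w+)</b>").matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String input = "x <b>one</b> y <b>two</b>";
        // Greedy: the skip swallows the first element and stops before the last one.
        System.out.println(String.join(" | ", capture(input, true)));
        // Reluctant: the skip stops at the first left border.
        System.out.println(String.join(" | ", capture(input, false)));
    }
}
```

With the greedy quantifier the first element is lost inside the skipped text, which is why a "skip up to the next border" construct is needed when several elements follow one another.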


Let's look at how the service patterns (%all_before) and (%all_after) are used, taking as an example the task of extracting specific user data from a specific website. We will parse the Amazon site, specifically this page: https://www.amazon.com/B06-Plus-Bluetooth-Receiver-Streaming/product-reviews/B078J3GTRK/
The example is taken from a test task for a developer vacancy specializing in data parsing, sent to me by a company that, unfortunately, never responded to my proposed solution. True, they only asked me to describe the general solution process, without a specific algorithm, and in my answer I already tried to refer to Gorp templates, although my own extensions at that time existed only, as they say, "on paper".
For the sake of curiosity, I will quote one fragment from my reply letter, which is apparently the first mention of Gorp.NET, albeit a private one.
"To make the above list of regular expressions that I used for solving this problem more visual, I have compiled a ready-made template based on it (attached to this letter), which can be used to extract all the necessary data by applying a more universal development of my own, designed precisely for this type of problem. Its code is based on the github.com/salesforce/gorp project, and the same page has a general description of the rules for composing such templates. Generally speaking, such a description implies specifying both the concrete regular expressions and the logic for processing them. The most difficult point is that for each data sample we must fully describe, through regular expressions, the whole structure of the text containing it, and not just the individual elements themselves (as could be done when writing our own program that searches sequentially in a loop, as I described earlier)."

The initial task was to collect the following data for each review on the above page: the user name, the rating, the review title, the date and the review text.

Now I will simply give the template I compiled, which handles this task quickly and efficiently. I think the general meaning should be fairly obvious; perhaps you can offer a more concise solution yourself.
It was on this example that I debugged my own extensions for Gorp (no longer with any aim of employment, but rather as a proof of concept).


pattern %optspace ( *)
pattern %space ( +)

pattern %cap_letter [A-Z]
pattern %small_letter [a-z]
pattern %letter (%cap_letter|%small_letter)
pattern %endofsentence (\.|\?|\!)+
pattern %delim (\.|\?|\!|\,|\:|\;)
pattern %delim2 (\(|\)|\'|\")

pattern %word (%letter|\d)+
pattern %ext_word (%delim2)*%word(%delim)*(%delim2)*

pattern %text_phrase %optspace%ext_word(%space%ext_word)+
pattern %skipped_tags <([^>]+)>

pattern %sentence (%text_phrase|%skipped_tags)+(%endofsentence)?

pattern %start <div class=\"a-fixed-right-grid view-point\">

pattern %username_start <div class=\"a-profile-content\"><span class=\"a-profile-name\">
pattern %username [^\s]+
pattern %username_end </span>

pattern %user_mark_start <i data-hook=\"review-star-rating\"([^>]+)><span class=\"a-icon-alt\">
pattern %user_mark [^\s]+
pattern %user_mark_end ([^<]+)</span>

pattern %title_start data-hook=\"review-title\"([^>]+)>(%skipped_tags)*
pattern %title [^<]+
pattern %title_end </span>

pattern %span_class <span class=\"[^\"]*\">

pattern %date_start <span data-hook="review-date"([^>]+)>
pattern %date ([^<]+)
pattern %date_end </span>

pattern %content_start <span data-hook=\"review-body\"([^>]+)>(%skipped_tags)*
pattern %content0 (%sentence)+
pattern %content (%all_after)
pattern %content_end </span>

template @extractUsernameStart (%all_before)%username_start
template @extractUsername $username(%username)%username_end
template @extractUserMarkStart (%all_before)%user_mark_start
template @extractUserMark $user_mark(%user_mark)%user_mark_end
template @extractTitleStart (%all_before)%title_start
template @extractTitle $title(%title)%title_end
template @extractDateStart (%all_before)%date_start
template @extractDate $date(%date)%date_end
template @extractContentStart (%all_before)%content_start
template @extractContent $content(%content)%content_end

extract ToCEntry {
template @extractUsernameStart@extractUsername@extractUserMarkStart@extractUserMark@extractTitleStart@extractTitle@extractDateStart@extractDate@extractContentStart@extractContent
}



That's probably all for today. I may tell you another time about the third-party tools I have built in which this framework is already fully used.

Source: https://habr.com/ru/post/476778/

