Gorp.NET is a new library for building reversible templates that extract data from structured text, based on the existing Salesforce Gorp code base.
In this article I will talk a little about how to parse structured text with Gorp, one example of the tools that are sometimes called reversible template systems.
What is a reversible template in general? Suppose we have a system that generates the text we need from given source data, following strict rules defined by a template syntax. Now imagine the opposite task: we have a text with a regular structure, such as one that could have been produced by a template-based system like the one above. Our goal is to extract from this text the source data from which it was formed. If we devise a generalized syntax for solving this problem and feed it to a parser that splits the input text into separate elements, that syntax is an example of reversible templates.
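To make the idea concrete, here is a minimal sketch in plain Java. It does not use Gorp at all; the class name `ReversibleTemplateDemo` and the greeting format are made up for illustration. `render` is the forward direction (data to text), and `extract` is the reverse direction (text back to data), written as a regular expression with named groups.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReversibleTemplateDemo {
    // The same layout that render() produces, expressed as a regex with named groups.
    private static final Pattern REVERSE =
            Pattern.compile("Hello, (?<name>\\w+)! You are (?<age>\\d+)\\.");

    // Forward direction: data -> text.
    static String render(String name, int age) {
        return "Hello, " + name + "! You are " + age + ".";
    }

    // Reverse direction: text -> data, or null when the text does not fit the layout.
    static Map<String, String> extract(String text) {
        Matcher m = REVERSE.matcher(text);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("name", m.group("name"));
        fields.put("age", m.group("age"));
        return fields;
    }

    public static void main(String[] args) {
        String text = render("Alice", 30);
        System.out.println(text);
        System.out.println(extract(text));
    }
}
```

A reversible template system generalizes the `extract` half: instead of hand-writing one regex per layout, you describe the layout in a dedicated syntax and let the engine do the matching.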
Why did I decide to write about Gorp specifically? Because I took this particular system as the basis for my own project. You can read about the history of that project, including some details of all the changes I made to the original Gorp project, in the previous article. Here we will focus on the technical side, including the use of the modified version of the engine. For convenience, I will keep calling it Gorp.NET, although in reality it is not a port of Gorp to .NET but only a slightly polished and extended version of it, still in Java. That said, the add-on on top of the Gorp library (in my version), a managed DLL called BIRMA.NET, uses its own special assembly, that very Gorp.NET, which you can easily obtain yourself by running the source code (the repository is at https://github.com/S-presso/gorp/) through the IKVM.NET utility.
I will note right away that for all kinds of data extraction from structured text, the Gorp.NET tools by themselves will be quite enough for you, at least if you know a little Java or know how to call methods from external Java modules in your .NET Framework projects and how to include types from the standard JVM libraries there (I achieved this through the same IKVM.NET, which, however, now has the status of an unsupported project). What you do next with the extracted data is, as they say, your own business. Gorp and Gorp.NET by themselves provide only a bare framework. Some groundwork for further processing of such data is contained in the aforementioned BIRMA.NET. But a description of the BIRMA.NET functionality is a topic for a separate publication (although I have already mentioned some of it in my previous comparative historical review of BIRMA technologies). Here, looking ahead, I will allow myself a somewhat bold statement: the technology for describing reversible templates used in Gorp.NET (and, accordingly, in BIRMA.NET) is somewhat unique among other homemade tools of this kind (I say "homemade" because I have not yet seen large companies promoting their own frameworks for these purposes, except perhaps Salesforce itself with its original implementation of Gorp).
For the fullest account of the concept and the technical aspects underlying the template description system used in Gorp, I simply leave a link here to the original documentation in English. Everything stated there applies equally to Gorp.NET. And now I will briefly describe the essence.
So, a template description is a text document (possibly even a single long string that can be passed to the corresponding method for processing). It consists of three parts that describe, in order, the three main entities: patterns, templates and extractions.
The lowest-level building blocks are patterns: they can consist only of regular expressions and references to other patterns. The next level of the hierarchy is occupied by templates, whose descriptions contain references to patterns (which can be named), as well as text literals and references to nested templates and extractors. There are also parametric templates, which I will not cover here (the original documentation has few examples of their use). And finally, there are extractions, which specify the concrete syntax rules that associate named patterns with specific occurrences in the source text.
As I understand it, the original goal of Gorp's creators was to parse sequences of data contained in report files (or log files). Consider a simple example of the system in use.
Suppose we have a report containing the following line:
<86>2015-05-12T20:57:53.302858+00:00 10.1.11.141 RealSource: "10.10.5.3"
Let's compose an example template for parsing it using
Gorp tools:
pattern %phrase \S+
pattern %num \d+
pattern %ts %phrase
pattern %ip %phrase
extract interm {
  template <%num>$eventTimeStamp(%ts) $logAgent(%ip) RealSource: "$logSrcIp(%ip)"
}
Note that the template definition block is omitted here entirely, since everything needed is already written inline in the final extraction. All the captured elements here are named; the pattern for each is given in parentheses after its name. As a result, a set of text values is produced with the names eventTimeStamp, logAgent and logSrcIp.
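For readers who think in regular expressions, the sketch below shows roughly what this extraction amounts to, written by hand with `java.util.regex` and named groups. It is only an approximation for illustration: Gorp is not used, and the class name `LogLineRegexDemo` is made up. The group names mirror the names in the template.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogLineRegexDemo {
    // Hand-written counterpart of:
    // <%num>$eventTimeStamp(%ts) $logAgent(%ip) RealSource: "$logSrcIp(%ip)"
    // where %num is \d+ and %ts, %ip are both \S+.
    private static final Pattern LINE = Pattern.compile(
            "<\\d+>(?<eventTimeStamp>\\S+) (?<logAgent>\\S+) RealSource: \"(?<logSrcIp>\\S+)\"");

    // Returns the three named values, or null if the line does not match.
    static Map<String, String> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (String name : new String[] {"eventTimeStamp", "logAgent", "logSrcIp"}) {
            result.put(name, m.group(name));
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "<86>2015-05-12T20:57:53.302858+00:00 10.1.11.141 RealSource: \"10.10.5.3\"";
        System.out.println(parse(line));
    }
}
```

The Gorp version keeps the building blocks (`%num`, `%ts`, `%ip`) reusable across many extractions, which a single monolithic regex does not.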
We will now write a simple program for extracting the necessary data. Suppose that the template we created is stored in a file called extractions.xtr.
import com.salesforce.gorp.DefinitionReader;
import com.salesforce.gorp.ExtractionResult;
import com.salesforce.gorp.Gorp;
Another example of a simple parsing template:
# Patterns
pattern %num \d+
pattern %hostname [a-zA-Z0-9_\-\.]+
pattern %status \w+
# Templates
template @endpoint $srcHost(%hostname): $srcPort(%num)
# Extraction
extract HostDefinition {
template @endpoint $status(%status)
}
Well, I think the point is clear. It is also worth mentioning that the extract method has an overload with two parameters, the second of which is a boolean. If you set it to true, the method iterates over all potential data sets until it finds a suitable one (you can also replace the call with extractSafe, which takes no second parameter). The default value is false, in which case the method may "complain" about a mismatch between the input data and the template used.
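The difference between the two failure policies can be sketched as follows. This is not Gorp's code and does not reproduce its signatures or exception types: `StrictVsSafeDemo`, its two toy extractions and the choice of `IllegalArgumentException` are all assumptions made for illustration. The sketch only contrasts "throw on mismatch" with "return null on mismatch".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class StrictVsSafeDemo {
    // Hypothetical stand-ins for named extractions: name -> regex.
    private static final Map<String, Pattern> EXTRACTIONS = new LinkedHashMap<>();
    static {
        EXTRACTIONS.put("HostDefinition", Pattern.compile("[\\w.-]+:\\d+ \\w+"));
        EXTRACTIONS.put("Number", Pattern.compile("\\d+"));
    }

    // Tries each extraction in turn and returns the name of the first one
    // that matches the whole input.
    // safe == false: a mismatch is reported with an exception.
    // safe == true:  a mismatch yields null.
    static String extract(String input, boolean safe) {
        for (Map.Entry<String, Pattern> e : EXTRACTIONS.entrySet()) {
            if (e.getValue().matcher(input).matches()) {
                return e.getKey();
            }
        }
        if (safe) {
            return null;
        }
        throw new IllegalArgumentException("No extraction matches: " + input);
    }

    // Shorthand for the safe mode, without the boolean parameter.
    static String extractSafe(String input) {
        return extract(input, true);
    }

    public static void main(String[] args) {
        System.out.println(extract("example.com:8080 up", false));
        System.out.println(extractSafe("???"));
    }
}
```

The safe variant is convenient when scanning a log in which only some lines are expected to match.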
Note also that Gorp.NET introduces a new, extended version of the extract method with two trailing boolean parameters. The shorthand call extractAllFound sets both of them to true. A true value for the third parameter gives even more room to maneuver: we can now analyze text with arbitrary characters interspersed between the structured fragments we are looking for (each containing a set of extracted data).
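The effect of tolerating arbitrary text between the fragments of interest resembles repeated searching rather than whole-string matching. The sketch below shows that idea with `Matcher.find()` from the standard library; it is an analogy under my own assumptions, not the implementation of extractAllFound, and the class name `FindAllDemo` and the `host:port` format are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAllDemo {
    // A structured fragment we are looking for: host:port.
    private static final Pattern ENDPOINT =
            Pattern.compile("(?<host>[a-zA-Z][\\w.-]*):(?<port>\\d+)");

    // Collects every fragment found anywhere in the text,
    // skipping whatever lies between the fragments.
    static List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = ENDPOINT.matcher(text);
        while (m.find()) {
            found.add(m.group("host") + " -> " + m.group("port"));
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "noise alpha.example:80 ### more noise beta:8080 tail";
        System.out.println(findAll(text));
    }
}
```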
So, the time has come to answer the question: what exactly is unique about this modification of the basic Gorp, apart from the extended extract method?
The thing is that several years ago I had already created my own tool for extracting data from text (also based on processing templates with their own specific syntax), and it worked on somewhat different principles. The main difference from the approach implemented in Gorp and all frameworks derived from it is that each text element to be extracted was specified simply by listing its left and right borders (each of which could either be part of the element itself or merely separate it from the preceding or following text). In the general case, the structure of the source text was not analyzed at all, as it is in Gorp; only the needed pieces were singled out. The text between them did not have to lend itself to any structural analysis (it could just as well be incoherent sets of characters).
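The border-based approach is easy to sketch with plain string searching. The code below is a reconstruction of the idea only, not the code of the old tool; the class name `BorderExtractDemo`, the method `between` and the sample text are hypothetical. Here both borders are treated as separators that are not part of the element.

```java
class BorderExtractDemo {
    // Returns the text between the first occurrence of `left` at or after `from`
    // and the next occurrence of `right`; null if either border is missing.
    static String between(String text, String left, String right, int from) {
        int start = text.indexOf(left, from);
        if (start < 0) {
            return null;
        }
        start += left.length();
        int end = text.indexOf(right, start);
        if (end < 0) {
            return null;
        }
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        // Whatever lies between the elements is never analyzed.
        String text = "junk <b>Alice</b> #!@ <i>5 stars</i> junk";
        System.out.println(between(text, "<b>", "</b>", 0));
        System.out.println(between(text, "<i>", "</i>", 0));
    }
}
```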
Is it possible to achieve a similar effect in Gorp? In its original version, probably not (correct me if I am mistaken). If we simply write an expression like (.*) immediately followed by a mask for the left border of the next element, the greedy quantifier will capture all the subsequent text. And we cannot use regular expressions with non-greedy syntax in the existing Gorp implementations.
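The greedy versus non-greedy difference is easy to see in plain `java.util.regex`, outside of Gorp. In this small sketch (the class name `GreedyVsLazyDemo` and the sample text are made up), the greedy `(.*)` runs to the last closing tag, while the reluctant `(.*?)` stops at the first one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazyDemo {
    // Returns the first capturing group of the first match, or null if there is none.
    static String firstGroup(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "<b>one</b> filler <b>two</b>";
        // Greedy: swallows everything up to the LAST </b>.
        System.out.println(firstGroup("<b>(.*)</b>", text));   // one</b> filler <b>two
        // Reluctant: stops at the FIRST </b>.
        System.out.println(firstGroup("<b>(.*?)</b>", text));  // one
    }
}
```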
Gorp.NET lets you neatly sidestep this problem by introducing two special kinds of patterns, (%all_before) and (%all_after). The first is essentially an alternative to the non-greedy version of (.*), suitable for use in your own templates. As for (%all_after), it also scans the source text up to the first occurrence of the next part of the described template, but it relies on the search result of the previous pattern. Everything between them also falls into the extracted substring of the current element. In a sense, (%all_after) "looks back", while (%all_before), on the contrary, "looks forward". I note that in the first version of BIRMA a missing left border in an element's description served as a kind of analogue of (%all_before), and an empty right border served as the analogue of (%all_after). If neither border was set for an element, the parser would obviously capture all the subsequent text! However, that implementation of BIRMA is now of purely historical interest (you can read a little more about it in my report of that time).
The source code was never published anywhere because of its extremely low quality; in truth, it could serve as a monument to poor software design.
Let's look at how the service patterns (%all_before) and (%all_after) are used, taking as an example the task of extracting specific user data from a specific website. We will parse the Amazon site, specifically this page:
https://www.amazon.com/B06-Plus-Bluetooth-Receiver-Streaming/product-reviews/B078J3GTRK/
The example is taken from a test task for a developer vacancy specializing in data parsing, sent to me by a company that, unfortunately, never responded to my proposed solution. True, they only asked me to describe the general approach, without a specific algorithm, and in my reply I already tried to refer to Gorp templates, while my own extensions at that time existed only, as they say, "on paper".
Out of curiosity, I will quote one fragment from my reply, which is apparently the first mention of Gorp.NET, albeit in private correspondence.
“To make the above list of regular expressions that I used for this task more illustrative, I compiled a ready-made template from them (attached to this letter), which can be used to extract all the necessary data with my own, more universal tool designed precisely for this type of problem. Its code is based on the github.com/salesforce/gorp project, and the same page gives a general description of the rules for composing such templates. Generally speaking, such a description implies specifying both the concrete regular expressions and the logic of their processing. The hardest part is that for each data sample we must describe, through regular expressions, the whole structure of the text containing it, and not just the individual elements (as could be done when writing our own program that searches sequentially in a loop, as I described earlier).”
The initial task was to collect the following data from the above page:
- Username
- Rating
- Review Title
- Date
- Text
Well, now I will simply give the template I compiled, which completes this task quickly and efficiently. I think the general meaning should be fairly obvious; perhaps you can offer a more concise solution yourself.
It was on this example that I debugged my own extensions for Gorp (no longer with employment in mind, but rather as a proof of concept).
pattern %optspace ( *)
pattern %space ( +)
pattern %cap_letter [A-Z]
pattern %small_letter [a-z]
pattern %letter (%cap_letter|%small_letter)
pattern %endofsentence (\.|\?|\!)+
pattern %delim (\.|\?|\!|\,|\:|\;)
pattern %delim2 (\(|\)|\'|\")
pattern %word (%letter|\d)+
pattern %ext_word (%delim2)*%word(%delim)*(%delim2)*
pattern %text_phrase %optspace%ext_word(%space%ext_word)+
pattern %skipped_tags <([^>]+)>
pattern %sentence (%text_phrase|%skipped_tags)+(%endofsentence)?
pattern %start <div class=\"a-fixed-right-grid view-point\">
pattern %username_start <div class=\"a-profile-content\"><span class=\"a-profile-name\">
pattern %username [^\s]+
pattern %username_end </span>
pattern %user_mark_start <i data-hook=\"review-star-rating\"([^>]+)><span class=\"a-icon-alt\">
pattern %user_mark [^\s]+
pattern %user_mark_end ([^<]+)</span>
pattern %title_start data-hook=\"review-title\"([^>]+)>(%skipped_tags)*
pattern %title [^<]+
pattern %title_end </span>
pattern %span_class <span class=\"[^\"]*\">
pattern %date_start <span data-hook=\"review-date\"([^>]+)>
pattern %date ([^<]+)
pattern %date_end </span>
pattern %content_start <span data-hook=\"review-body\"([^>]+)>(%skipped_tags)*
pattern %content0 (%sentence)+
pattern %content (%all_after)
pattern %content_end </span>
template @extractUsernameStart (%all_before)%username_start
template @extractUsername $username(%username)%username_end
template @extractUserMarkStart (%all_before)%user_mark_start
template @extractUserMark $user_mark(%user_mark)%user_mark_end
template @extractTitleStart (%all_before)%title_start
template @extractTitle $title(%title)%title_end
template @extractDateStart (%all_before)%date_start
template @extractDate $date(%date)%date_end
template @extractContentStart (%all_before)%content_start
template @extractContent $content(%content)%content_end
extract ToCEntry {
template @extractUsernameStart@extractUsername@extractUserMarkStart@extractUserMark@extractTitleStart@extractTitle@extractDateStart@extractDate@extractContentStart@extractContent
}
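To show the shape of what this template does, here is a rough plain-Java counterpart using reluctant quantifiers in place of (%all_before). Everything in it is an assumption for illustration: the class name `ReviewRegexDemo`, the simplified regex, and above all the HTML fragment in `sample()`, which is invented and much simpler than the real Amazon markup. It does not use Gorp.NET.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReviewRegexDemo {
    // Each ".*?" skips arbitrary markup up to the next start marker,
    // which is the job (%all_before) does in the template.
    private static final Pattern REVIEW = Pattern.compile(
            "<span class=\"a-profile-name\">(?<username>[^<]+)</span>"
            + ".*?<i data-hook=\"review-star-rating\"[^>]*><span class=\"a-icon-alt\">(?<rating>\\S+)[^<]*</span>"
            + ".*?data-hook=\"review-title\"[^>]*>(?:<[^>]+>)*(?<title>[^<]+)</span>"
            + ".*?<span data-hook=\"review-date\"[^>]*>(?<date>[^<]+)</span>"
            + ".*?<span data-hook=\"review-body\"[^>]*>(?:<[^>]+>)*(?<content>.*?)</span>",
            Pattern.DOTALL);

    // Extracts the five fields of the first review found, or null if none is found.
    static Map<String, String> parse(String html) {
        Matcher m = REVIEW.matcher(html);
        if (!m.find()) {
            return null;
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (String name : new String[] {"username", "rating", "title", "date", "content"}) {
            fields.put(name, m.group(name));
        }
        return fields;
    }

    // Invented, heavily simplified fragment; real pages contain far more markup.
    static String sample() {
        return "<div class=\"a-profile-content\"><span class=\"a-profile-name\">Alice</span></div>\n"
                + "<i data-hook=\"review-star-rating\" class=\"a-icon\"><span class=\"a-icon-alt\">5.0 out of 5 stars</span></i>\n"
                + "<a data-hook=\"review-title\" class=\"a-link\"><span>Great sound</span></a>\n"
                + "<span data-hook=\"review-date\" class=\"a-size-base\">on May 1, 2019</span>\n"
                + "<span data-hook=\"review-body\" class=\"a-size-base\"><span>Works well. Easy to pair!</span></span>\n";
    }

    public static void main(String[] args) {
        System.out.println(parse(sample()));
    }
}
```

Compared with this monolithic regex, the template keeps every start marker, value and end marker as a separately named piece, which makes it easier to adjust when the page markup changes.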
That's probably all for today. I may tell you another time about the third-party tools I have built in which this framework is already fully used.