gmSL Service Classes: Parser Character String Analysis

The Parser Service Class

The service class Parser analyses character strings in a known computer language, according to the rules of that language. At present this class processes two language types: Basic, and the class of contemporary OOP languages referred to here simply as Java.

The field commun

Prototype

UBYTE* commun;

The commun field contains the current statement being parsed.

The field fileType

Prototype

int fileType;

The fileType field specifies the type of file being processed.

The field icol

Prototype

int icol;

The icol field specifies the current character column.

The field ierc

Prototype

int ierc;

The ierc field specifies the starting character column of the last token.

The field iorec

Prototype

int iorec;

The iorec field specifies the current record number within the source file.

The field iostate

Prototype

int iostate;

The iostate field specifies the record number at which the current statement starts.

The field lcol

Prototype

int lcol;

The lcol field specifies the last position in the current input record.

The field token

Prototype

UBYTE token[];

The token field contains the current input token in character form. The maximum length of a token is 8190 characters.

The field tokenType

Prototype

int tokenType;

The tokenType field specifies the type of the current input token.

The field Parser_scriptType

Prototype

int Parser_scriptType;

The Parser_scriptType field specifies the dialect of the language currently being processed. It is initialized with the value of LNG_BASIC.

The method Parser_CompareBounded

Prototype

int Parser_CompareBounded(CONST char* oper,int length);

The Parser_CompareBounded method compares the current input to a bounded symbol. Given a particular bounded symbol (i.e., any sequence of nonblank characters with an explicit length specified), this method checks the next sequence of nonblank characters at the current location in the coded record to determine if those characters match the symbol. If the SkipWhiteSpace attribute is on, then any whitespace characters encountered in the input record are skipped. All alphabetic characters in the symbol must be specified in lower case. The criteria for a match vary depending upon the setting of the flag WhiteSpaceBoundary. If this flag is off, a match does not occur with a symbol ending in an identifier character if the immediately following character in the coded record is also an identifier character. If there is an AbbreviationSymbol defined and used in the symbol being checked for, then a match up to that point is sufficient. Its parameters are:

Parameter	Description
oper	Contains the symbol being checked for. It need not be null-terminated.
length	Specifies the length of the symbol.

If a match is found, then this method returns the length of the symbol and updates the current location in the coded record so that it points to the first character beyond the end of the match. If no match is found, then a zero is returned and the current location is left unchanged. If a match is found but there are additional identifier characters following immediately after then a minus one is returned.
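The three return values can be illustrated with a minimal standalone sketch. It is not the gmSL implementation: the function name, the explicit record and pos parameters, and the choice of letters and underscore as identifier characters are all assumptions made for illustration. It models only the case where SkipWhiteSpace is on and WhiteSpaceBoundary is off, ignores the AbbreviationSymbol, and assumes the location is left unchanged when minus one is returned.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for the identifier character class. */
static int sketch_is_ident(int c)
{
    return isalpha(c) || c == '_';
}

/*
 * Sketch of a bounded-symbol comparison (WhiteSpaceBoundary off).
 * record : null-terminated input record (hypothetical parameter)
 * pos    : current location, advanced only on a successful match
 * oper   : symbol in lower case, not necessarily null-terminated
 * length : number of characters in oper
 * Returns length on a match, 0 on no match, and -1 when the symbol
 * matches but identifier characters follow it immediately.
 */
int sketch_compare_bounded(const char *record, int *pos,
                           const char *oper, int length)
{
    int start;
    int i;

    assert(record != NULL && pos != NULL && oper != NULL && length > 0);

    start = *pos;
    /* Assumed SkipWhiteSpace behavior: step over leading whitespace. */
    while (record[start] != '\0' && isspace((unsigned char)record[start]))
        start++;

    for (i = 0; i < length; i++) {
        unsigned char c = (unsigned char)record[start + i];
        /* Fold the input to lower case; stop at the null byte. */
        if (c == '\0' || tolower(c) != (unsigned char)oper[i])
            return 0;
    }

    /* A symbol ending in an identifier character must not run on. */
    if (sketch_is_ident((unsigned char)oper[length - 1]) &&
        sketch_is_ident((unsigned char)record[start + length]))
        return -1;

    *pos = start + length;
    return length;
}
```

With the record "  End If" and the symbol "end", the sketch skips the two blanks, matches regardless of case, and moves the location to the blank after "End". With the record "endif" the same symbol yields minus one, because "i" continues the identifier.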

The method Parser_CompareSymbol

Prototype

int Parser_CompareSymbol(CONST char* oper);

The Parser_CompareSymbol method compares the current input to a null-terminated symbol. Given a particular symbol — i.e., any sequence of nonblank characters — this method checks the next sequence of nonblank characters at the current location in the coded record to determine if those characters match the symbol. All alphabetic characters in the symbol must be specified in lower case. This method is a simplified interface to the method Parser_CompareBounded. See the discussion there for the criteria for a match. Its parameter is:

Parameter	Description
oper	Contains the symbol being checked for in null-terminated form.

The method Parser_Expression

Prototype

int Parser_Expression(int* parents,int expType,int isSet,int level);

The Parser_Expression method is the primary control method for compiling expressions in the source languages. It uses a generalized table-driven recursive descent algorithm driven by an internal operation information table.

           Token                                                   Precedence
Lexeme     VB    JS    Opcode   Subcode         Type         Args  VB  JS
------     ----  ----  -------  --------------  -----------  ----  --  --
MIN        -     -     OPC_NEG  BIN_Arithmetic  TYP_VOID     1     10  11
NOT        not   !     OPC_NOT  BIN_Arithmetic  TYP_VOID     1      3  11
POW        ^           OPC_EXP  BIN_Arithmetic  TYP_VOID     2     11   0
MUL        *     *     OPC_MUL  BIN_Arithmetic  TYP_VOID     2      9  10
DIV        /     /     OPC_DIV  BIN_Arithmetic  TYP_VOID     2      9  10
IDV        \           OPC_IDV  BIN_Arithmetic  TYP_VOID     2      8   0
MOD        mod   %     OPC_MOD  BIN_Arithmetic  TYP_VOID     2      7  10
ADD        +     +     OPC_ADD  BIN_Arithmetic  TYP_VOID     2      6   9
SUB        -     -     OPC_SUB  BIN_Arithmetic  TYP_VOID     2      6   9
CAT        &     +     OPC_CAT  BIN_Arithmetic  TYP_VOID     2      5   9
NEQ        <>    !=    OPC_NEQ  BIN_Arithmetic  TYP_BOOLEAN  2      4   6
GTE        >=    >=    OPC_GTE  BIN_Arithmetic  TYP_BOOLEAN  2      4   7
LTE        <=    <=    OPC_LTE  BIN_Arithmetic  TYP_BOOLEAN  2      4   7
EQL        =     ==    OPC_EQL  BIN_Arithmetic  TYP_BOOLEAN  2      4   6
GTR        >     >     OPC_GTR  BIN_Arithmetic  TYP_BOOLEAN  2      4   7
LTH        <     <     OPC_LTH  BIN_Arithmetic  TYP_BOOLEAN  2      4   7
IOR        or    ||    OPC_IOR  BIN_Arithmetic  TYP_VOID     2      2   1
AND        and   &&    OPC_AND  BIN_Arithmetic  TYP_VOID     2      2   2
LIKE       like  ===   OPC_LIK  BIN_Arithmetic  TYP_VOID     2      4   6
ISA        is    i-of  OPC_ISA  BIN_Arithmetic  TYP_VOID     2      4   7
XOR        xor   ^     OPC_XOR  BIN_Arithmetic  TYP_VOID     2      2   4
EQV        equ         OPC_XOR  BIT_Equiv       TYP_VOID     2      2   0
IMP        imp         OPC_XOR  BIT_Implies     TYP_VOID     2      2   0
BWA        &           OPC_AND  BIN_BitWise     TYP_VOID     2      0   5
BWO        |           OPC_IOR  BIN_BitWise     TYP_VOID     2      0   3

Though hardwired here, this table could easily be specified and stored in a language file. Each row corresponds to an operator symbol defined via its lexeme code. The table assumes that unary minus has the lowest lexeme code and that the codes for the other operators follow it sequentially. For each operator the table contains the opcode and subcode to be emitted; the type of its result (relational operators produce a boolean result, while other result types depend upon the types of the arguments); the number of operator arguments, one for unary and two for binary; and its precedence order in each language.
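How a precedence column can drive a recursive parser is easiest to see in a small evaluator. The sketch below uses precedence climbing, one common table-driven form of recursive descent; whether Parser_Expression uses exactly this scheme is not stated here, so treat it as an illustration only. It borrows the VB precedences of the MIN, POW, MUL, ADD and SUB rows from the table above, evaluates integers directly instead of emitting opcodes, and treats every operator as left-associative. All names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* One row per binary operator: symbol and Basic (VB) precedence. */
struct sketch_op {
    char symbol;
    int  precedence;
};

/* Subset of the operator table, VB precedence column only. */
static const struct sketch_op sketch_ops[] = {
    { '^', 11 },   /* POW */
    { '*',  9 },   /* MUL */
    { '+',  6 },   /* ADD */
    { '-',  6 },   /* SUB */
};

#define SKETCH_UNARY_MINUS 10   /* MIN row, VB precedence */

/* Look up the precedence of a binary operator, -1 if not in the table. */
static int sketch_precedence(char c)
{
    size_t i;

    for (i = 0; i < sizeof sketch_ops / sizeof sketch_ops[0]; i++)
        if (sketch_ops[i].symbol == c)
            return sketch_ops[i].precedence;
    return -1;
}

static long sketch_apply(char op, long lhs, long rhs)
{
    long result = 1;

    switch (op) {
    case '+': return lhs + rhs;
    case '-': return lhs - rhs;
    case '*': return lhs * rhs;
    default:  /* '^': integer power, nonnegative exponents only */
        while (rhs-- > 0)
            result *= lhs;
        return result;
    }
}

static void sketch_skip_blanks(const char **p)
{
    while (isspace((unsigned char)**p))
        (*p)++;
}

/* Parse and evaluate operators whose precedence is at least min_prec. */
static long sketch_expression(const char **p, int min_prec)
{
    long lhs = 0;

    sketch_skip_blanks(p);
    if (**p == '-') {                   /* unary minus */
        (*p)++;
        lhs = -sketch_expression(p, SKETCH_UNARY_MINUS);
    } else if (**p == '(') {            /* parenthetical expression */
        (*p)++;
        lhs = sketch_expression(p, 0);
        sketch_skip_blanks(p);
        if (**p == ')')
            (*p)++;
    } else {                            /* integer constant */
        while (isdigit((unsigned char)**p))
            lhs = lhs * 10 + (*(*p)++ - '0');
    }

    for (;;) {
        char op;
        int  prec;

        sketch_skip_blanks(p);
        op = **p;
        prec = sketch_precedence(op);
        if (prec < min_prec)            /* also stops on non-operators */
            break;
        (*p)++;
        /* Passing prec + 1 keeps equal-precedence operators left-associative. */
        lhs = sketch_apply(op, lhs, sketch_expression(p, prec + 1));
    }
    return lhs;
}

long sketch_evaluate(const char *text)
{
    assert(text != NULL);
    return sketch_expression(&text, 0);
}
```

Because unary minus sits at precedence 10 and exponentiation at 11 in the VB column, sketch_evaluate("-2^2") yields -4 rather than 4. Adding an operator to the sketch means adding a table row and a case in sketch_apply, with no change to the parsing routine.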

The parameters of the method are:

Parameter	Description
parents	Contains the number of parents controlling the current expression and the root offsets of those parents.
expType	Specifies the expected binary type of the expression.
isSet	Specifies the language context of the current expression.
level	Specifies the current hierarchy level within the expression. It controls the recursive descent of this method and must always be set to -1 when the method is called externally.

The method returns the status of the factor and of the operator either preceding or following it when the end of the expression was reached:

Code	Meaning
+i	The operation code of a unary operator encountered immediately, or the operation code of a binary operator immediately following a processed quantity expression.
0	The expression consisted of a simple l-value not followed by an operator.
-1	The expression consisted of a constant not followed by an operator.
-2	The expression consisted of a complex “parenthetical” expression not followed by an operator.

The method Parser_ExtractToken

Prototype

int Parser_ExtractToken(char** Position,UBYTE* Lexeme);

The Parser_ExtractToken method extracts the next lexeme from a character string and returns its token. A lexeme is a string of connected characters that is the lowest-level syntactic unit in a programming language, and a token is a syntactic category that can encompass many different lexemes but often defines only a single one. Lexemes are the words and the punctuation of the programming language. There is a set of tokens that are identified generically, based primarily on the character types defined within the Character class. Languages are distinguished by assigning different characters different types using methods like Character_SetIdent, Character_SetQuote, etc. The generic tokens are defined in the ParserToken enumeration and are as follows:

Token	Description of meaning
EndOfRecord	The end of the current record was encountered — i.e., a null-byte was encountered. In this case the lexeme-length is set to zero.
Identifier	The lexeme read was a valid identifier or keyword — i.e., it begins with an identifier character and continues until a character that is neither an identifier nor a digit is encountered.
Integer	The lexeme read was an integer constant.
Float	The lexeme read was a floating-point constant. At this time ANSI-C format is assumed for floating point constants.
Quoted	The lexeme was a quoted string.
Special	The lexeme was some character that could not be classified as part of one of the above lexemes. In this case the lexeme-length is one and the lexeme-value is the character value.

Other tokens are defined via the two standard lists ParserReservedWords and ParserReservedSymbols. The parameters of the method are:

Parameter	Description
Position	Contains a pointer to the pointer to the current position in the character string being parsed. When this method returns this parameter is updated to point to the character position immediately after the end of the lexeme, or to the null-byte in the case where the end-of-record is encountered.
Lexeme	Receives the actual character form of the lexeme encountered in n-string form — i.e., Lexeme[0] receives the length and Lexeme[1..] receive the characters making up the lexeme. To ensure compatibility the character sequence is then null-terminated as well. Lexemes longer than 255 characters are returned with a length of 255. In this case it is up to the caller to compute the actual length of the lexeme.

The method returns the token value of the lexeme which might either be one of the generic values or might be a value taken from one of the two lists.
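The n-string convention for the Lexeme parameter and the position update can be sketched as follows. This is an illustration under stated assumptions, not the gmSL code: the function name, the token codes, the UBYTE typedef and the use of the C library's character classes are all hypothetical, the position parameter is declared const for convenience, blanks between lexemes are assumed to be skipped, and the Float, Quoted and reserved-list cases are left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef unsigned char UBYTE;   /* assumed definition of the UBYTE type */

/* Hypothetical token codes; the real values live in ParserToken. */
enum sketch_token {
    SKETCH_END_OF_RECORD,
    SKETCH_IDENTIFIER,
    SKETCH_INTEGER,
    SKETCH_SPECIAL
};

/*
 * Extract one lexeme starting at *position and store it in n-string
 * form: lexeme[0] is the length (capped at 255), lexeme[1..] holds the
 * characters, followed by a null byte. The buffer needs 257 bytes.
 */
int sketch_extract_token(const char **position, UBYTE *lexeme)
{
    const char *p;
    const char *start;
    size_t length;
    size_t i;
    int token;

    assert(position != NULL && *position != NULL && lexeme != NULL);
    p = *position;

    /* Assumption: blanks between lexemes are skipped. */
    while (*p == ' ' || *p == '\t')
        p++;
    start = p;

    if (*p == '\0') {
        token = SKETCH_END_OF_RECORD;      /* zero-length lexeme */
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        token = SKETCH_IDENTIFIER;         /* identifier chars, then digits too */
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
    } else if (isdigit((unsigned char)*p)) {
        token = SKETCH_INTEGER;
        while (isdigit((unsigned char)*p))
            p++;
    } else {
        token = SKETCH_SPECIAL;            /* single unclassified character */
        p++;
    }

    length = (size_t)(p - start);
    if (length > 255)
        length = 255;                      /* caller must recompute the rest */
    lexeme[0] = (UBYTE)length;
    for (i = 0; i < length; i++)
        lexeme[1 + i] = (UBYTE)start[i];
    lexeme[1 + length] = '\0';

    *position = p;                         /* just past the lexeme */
    return token;
}
```

Calling the sketch repeatedly on "count = 42" produces an identifier of length 5, a special character "=", an integer of length 2, and finally the end-of-record token with the position resting on the null byte.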

The method Parser_FindSymbol

Prototype

int Parser_FindSymbol(UBYTE* oper);

The Parser_FindSymbol method finds the current input in a StandardList of symbols. Often the next symbol in the coded record may be one of many symbols which are specified in a standard list representation. This method checks the next sequence of characters at the current location in the coded record to determine if those characters match one of a list of symbols. This method is a simplified interface to the method Parser_CompareBounded. See the discussion there for the criteria for a match. Its parameter is:

Parameter	Description
oper	Contains the list of symbols in the StandardList form.

If a match is found, then this method returns the offset in the symbol list of the start of the symbol information associated with the matched symbol and updates the current location in the coded record so that it points to the first character beyond the end of the match. If no match is found, then a zero is returned and the current location is left unchanged.

The method Parser_GetReservedWords

Prototype

UBYTE* Parser_GetReservedWords(int active);

The Parser_GetReservedWords method returns the handle to the standard list of reserved words associated with a specified language type. Its parameter is:

Parameter	Description
active	Specifies the language. A setting of LNG_BASIC specifies VB6, Visual Basic; a setting of zero specifies none; and any other setting specifies the generic OOP, Java here.

The method Parser_GetSymbol

Prototype

int Parser_GetSymbol(char* Record);

The Parser_GetSymbol method gets the next symbol from a character string. Once the caller has found the start of a symbol in a character string, this method can be used to find its extent. In this context a symbol is defined as a sequence of non-null characters, none of which are defined as being whitespace characters. Its parameter is:

Parameter	Description
Record	Contains a null-terminated string which is assumed to begin with a non-whitespace character.

The method returns the offset in the string of the first character which is null or which has the whitespace attribute. A return value of zero indicates that the string does not begin with a valid non-whitespace character.

The method Parser_GetIdentifier

Prototype

int Parser_GetIdentifier(UBYTE* Record,int nRecord);

The Parser_GetIdentifier method isolates an identifier in a character string. Once the caller has found the start of an identifier in a string, this method can be used to find its extent. In this context an identifier is defined as a sequence of characters all of which are classified as being identifier characters. Its parameters are:

Parameter	Description
Record	Contains a string which may begin with an identifier character.
nRecord	Specifies the length of the string.

The method returns the offset in the string of the first character which does not have the identifier attribute. A return value of zero indicates that the string does not begin with an identifier character.
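Parser_GetSymbol and Parser_GetIdentifier share the same shape: scan forward until the character class changes and return the offset reached. A minimal sketch of both follows. The function names and the UBYTE typedef are hypothetical, the C library's isspace stands in for the whitespace attribute, and letters plus underscore stand in for the identifier attribute, which in gmSL is configurable through the Character class.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef unsigned char UBYTE;   /* assumed definition of the UBYTE type */

/* Hypothetical stand-in for the identifier character attribute. */
static int sketch_is_ident(int c)
{
    return isalpha(c) || c == '_';
}

/* Offset of the first character that is null or whitespace. */
int sketch_get_symbol(const char *record)
{
    int offset = 0;

    assert(record != NULL);
    while (record[offset] != '\0' &&
           !isspace((unsigned char)record[offset]))
        offset++;
    return offset;
}

/* Offset of the first non-identifier character within nRecord bytes. */
int sketch_get_identifier(const UBYTE *record, int nRecord)
{
    int offset = 0;

    assert(record != NULL && nRecord >= 0);
    while (offset < nRecord && sketch_is_ident(record[offset]))
        offset++;
    return offset;
}
```

In both sketches a result of zero means the string did not start with a character of the expected class, which mirrors the zero return described for the two methods. The identifier scan is bounded by nRecord, so the string passed to it need not be null-terminated.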

The method Parser_GetToken

Prototype

int Parser_GetToken(void);

The Parser_GetToken method gets the next token from the current statement and stores its value in the global token value buffer. The method returns the type code of the token. This method is simply a specialized access point to the Parser_ExtractToken method which does the bulk of the processing.

The method Parser_LookAhead

Prototype

int Parser_LookAhead(void);

The Parser_LookAhead method gets the following token in the current statement without actually changing the cursor position within the statement. The value of the following token is stored in the global token value buffer. The method returns the type code of that token. This method is simply a specialized access point to the Parser_GetToken method.
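One way to build a look-ahead on top of a get-token routine is to save the cursor, read a token, and restore the cursor. The sketch below shows that pattern with a hypothetical state structure loosely modeled on the commun, icol and token fields. It is an assumption about the general technique, not a description of the gmSL internals, and for brevity it splits tokens on blanks and returns the token length where the real methods return a type code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parser state, loosely modeled on the global fields. */
struct sketch_parser {
    const char *statement;  /* current statement (cf. commun)        */
    int  icol;              /* current character column              */
    char token[64];         /* current token text (shortened buffer) */
};

/* Read one blank-delimited token; return its length, 0 at the end. */
int sketch_get_token(struct sketch_parser *ps)
{
    int n = 0;

    assert(ps != NULL && ps->statement != NULL);
    while (ps->statement[ps->icol] == ' ')
        ps->icol++;
    while (ps->statement[ps->icol] != '\0' &&
           ps->statement[ps->icol] != ' ' &&
           n < (int)sizeof ps->token - 1)
        ps->token[n++] = ps->statement[ps->icol++];
    ps->token[n] = '\0';
    return n;
}

/* Peek at the next token: read it, then restore the cursor. */
int sketch_look_ahead(struct sketch_parser *ps)
{
    int saved;
    int result;

    assert(ps != NULL);
    saved = ps->icol;
    result = sketch_get_token(ps);
    ps->icol = saved;       /* the token buffer keeps the peeked value */
    return result;
}
```

After a look-ahead the cursor is back where it started, but the token buffer holds the peeked token, which matches the statement above that the following token is stored in the global token value buffer.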

The method Parser_GetBuffer

Prototype

UBYTE* Parser_GetBuffer(void);

The Parser_GetBuffer method returns the value of the global field token, which contains the current input token in character form. It is primarily intended for use by gmSL and gmNI which do not have access to the global fields. The method has no parameters.

The method Parser_ResetInput

Prototype

int Parser_ResetInput(void);

The Parser_ResetInput method resets the starting position of the next token to the starting position of the current token. It is primarily intended for use by gmSL and gmNI which do not have access to the global fields. The method has no parameters and returns the new starting position.

The method Parser_SetDoubleQuotes

Prototype

void Parser_SetDoubleQuotes(int status);

The Parser_SetDoubleQuotes method sets the double quote status for the statements currently being processed. Its parameter is:

Parameter	Description
status	Specifies if the status is to be on or off. A zero value, or LBC_False, sets the status off, while a nonzero value, or LBC_True sets it on.

The method Parser_SetExternalLanguage

Prototype

void Parser_SetExternalLanguage(int status);

The Parser_SetExternalLanguage method sets the external language status for the statements currently being processed. Its parameter is:

Parameter	Description
status	Specifies if the status is to be on or off. A zero value, or LBC_False, sets the status off, while a nonzero value, or LBC_True sets it on.

The method Parser_SetNumericIdentifiers

Prototype

void Parser_SetNumericIdentifiers(int status);

The Parser_SetNumericIdentifiers method sets the numeric identifiers status for the statements currently being processed. Its parameter is:

Parameter	Description
status	Specifies if the status is to be on or off. A zero value, or LBC_False, sets the status off, while a nonzero value, or LBC_True sets it on.

The method Parser_SetInput

Prototype

void Parser_SetInput(int position);

The Parser_SetInput method sets the starting positions of the current and next token to a specified value. It is primarily intended for use by gmSL and gmNI which do not have access to the global fields. Its parameter is:

Parameter	Description
position	Specifies the starting position for the current and next input token.

The method Parser_SetReserved

Prototype

void Parser_SetReserved(int active);

The Parser_SetReserved method sets the tokens to be associated with the reserved words and symbols in the particular language being parsed. Most contemporary languages contain reserved words and symbols which have a special, unique meaning in the language and which may not be used for any other purpose. Recognizing these when the language statements are initially parsed simplifies the later work of the parser. Its parameter is:

Parameter	Description
active	Specifies the type of language being processed. A setting of LNG_BASIC specifies VB6, Visual Basic; a setting of zero specifies none; and any other setting specifies the generic OOP, Java here.

Each of the two language types has two lists associated with it: words and symbols. The words list contains the reserved words to be recognized. A word must begin with an identifier character and may contain only identifier and digit characters. In this implementation reserved words are not case sensitive. The symbols list contains the reserved symbols to be recognized. A symbol must begin with a non-identifier character. There are no other restrictions, but they are case sensitive if they contain alphabetic characters (which most do not).
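The split between the two lists can be sketched as a lookup that picks the list from the first character of the lexeme and folds case only for words. The entries, the token codes and the function names below are invented for illustration; the real lists are the ParserReservedWords and ParserReservedSymbols standard lists, whose storage format is not shown here.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

struct sketch_reserved {
    const char *text;
    int token;
};

/* Hypothetical entries and token codes. */
static const struct sketch_reserved sketch_words[] = {
    { "dim", 101 }, { "if", 102 }, { "then", 103 },
};
static const struct sketch_reserved sketch_symbols[] = {
    { "<>", 201 }, { "<=", 202 }, { ">=", 203 },
};

/* Compare two strings, optionally folding case. */
static int sketch_same(const char *a, const char *b, int fold_case)
{
    for (; *a != '\0' && *b != '\0'; a++, b++) {
        int ca = (unsigned char)*a;
        int cb = (unsigned char)*b;

        if (fold_case) {
            ca = tolower(ca);
            cb = tolower(cb);
        }
        if (ca != cb)
            return 0;
    }
    return *a == *b;   /* both strings must end together */
}

/* Return the token code of a reserved word or symbol, or 0 if none. */
int sketch_reserved_token(const char *lexeme)
{
    const struct sketch_reserved *list = sketch_symbols;
    size_t count = sizeof sketch_symbols / sizeof sketch_symbols[0];
    int fold_case = 0;              /* symbols are case sensitive */
    size_t i;

    assert(lexeme != NULL);
    if (isalpha((unsigned char)lexeme[0]) || lexeme[0] == '_') {
        list = sketch_words;        /* words start with an identifier char */
        count = sizeof sketch_words / sizeof sketch_words[0];
        fold_case = 1;              /* words are not case sensitive */
    }
    for (i = 0; i < count; i++)
        if (sketch_same(lexeme, list[i].text, fold_case))
            return list[i].token;
    return 0;
}
```

Here "Dim" and "THEN" resolve to their word tokens regardless of case, "<>" resolves to a symbol token, and an ordinary identifier such as "count" returns zero so that the caller can fall back to the generic Identifier token.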

The method Parser_SetStatement

Prototype

void Parser_SetStatement(char* record);

The Parser_SetStatement method sets the content of the global communications buffer. It is primarily intended for use by gmSL and gmNI which do not have access to the global fields. Its parameter is:

Parameter	Description
record	Contains the string in null-terminated form to be copied into the communications buffer.

The method Parser_SetWhiteSpaceBoundary

Prototype

void Parser_SetWhiteSpaceBoundary(int status);

The Parser_SetWhiteSpaceBoundary method sets the whitespace boundary status for the statements currently being processed. Its parameter is:

Parameter	Description
status	Specifies if the status is to be on or off. A zero value, or LBC_False, sets the status off, while a nonzero value, or LBC_True sets it on.

The method Parser_SetSkipWhiteSpace

Prototype

void Parser_SetSkipWhiteSpace(int status);

The Parser_SetSkipWhiteSpace method sets the skip whitespace status for the statements currently being processed. Its parameter is:

Parameter	Description
status	Specifies if the status is to be on or off. A zero value, or LBC_False, sets the status off, while a nonzero value, or LBC_True sets it on.

The method Parser_SetAbbreviationSymbol

Prototype

void Parser_SetAbbreviationSymbol(char symbol);

The Parser_SetAbbreviationSymbol method sets the abbreviation symbol for the statements currently being processed. Its parameter is:

Parameter	Description
symbol	Specifies the abbreviation symbol value.

The method Parser_StringExpression

Prototype

void Parser_StringExpression(int* Parents,int exp,char* strValue,int langType);

The Parser_StringExpression method is used to process nested expressions within other complex contexts like default value specifications or gmPL attribute values. Its parameters are:

Parameter	Description
Parents	Contains the number of parents controlling the current expression and the root offsets of those parents.
exp	Specifies the expected binary type of the expression.
strValue	Contains the actual expression to be parsed.
langType	Specifies the dialect of the statements currently being compiled.
