Lexical Analysis


Lexical analysis is the first stage of the compiler design process. The modified source code, written in the form of statements, is fed into a lexer. The lexer transforms the input, which is originally just a stream of characters, into a list of distinct tokens such as string and number constants, variable names, and programming-language keywords. While producing this list of tokens, the lexical analyzer also eliminates whitespace and comments from the source code.

If the lexical analyzer determines that a token is invalid, it produces an error. The source of a lex program is essentially a table of regular expressions and their associated actions, which are written as C code fragments. When an identifier is matched in the program, the action associated with identifiers is performed; in this situation, some details might be added to the symbol table, for example. If a keyword such as ‘if’ is matched instead, a different action is taken.
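
To make this concrete, here is a minimal sketch of what such a pattern-action table can look like as a lex (flex) specification. It is only an illustration: the abbreviated keyword list and the printf-based actions are placeholders, not part of any particular compiler.

%{
#include <stdio.h>
%}

%%
"if"|"else"|"for"|"int"|"return"  { printf("KEYWORD: %s\n", yytext); }
[A-Za-z_][A-Za-z0-9_]*            { printf("IDENTIFIER: %s\n", yytext); /* e.g. enter it into the symbol table */ }
[0-9]+                            { printf("NUMBER: %s\n", yytext); }
"//".*                            { /* discard comments */ }
[ \t\n]+                          { /* discard whitespace */ }
.                                 { printf("Lexical error: illegal character '%s'\n", yytext); }
%%

int yywrap(void) { return 1; }

int main(void) { yylex(); return 0; }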

The lexical analyzer works closely with the syntax analyzer. Whenever the syntax analyzer requests a token, the lexical analyzer reads characters from the source code, checks that they form a legal token, and passes that token back.
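
A hand-written scanner serves the parser in the same on-demand way. The following C sketch is purely illustrative; the Token type, the get_next_token function, and the very crude token classification are assumptions made for the example, not part of any particular compiler.

    #include <ctype.h>
    #include <stdio.h>

    typedef enum { TOK_WORD, TOK_NUMBER, TOK_SYMBOL, TOK_EOF } TokenType;

    typedef struct {
        TokenType type;
        char      text[64];   /* the lexeme itself */
    } Token;

    /* Reads just enough characters from *src to form the next token. */
    static Token get_next_token(const char **src)
    {
        Token tok = { TOK_EOF, "" };
        const char *p = *src;
        size_t n = 0;

        while (isspace((unsigned char)*p))       /* skip whitespace */
            p++;

        if (*p == '\0') {                        /* end of input */
            *src = p;
            return tok;
        }

        if (isalpha((unsigned char)*p)) {        /* word: a letter, then letters/digits */
            tok.type = TOK_WORD;
            while (isalnum((unsigned char)*p) && n < sizeof tok.text - 1)
                tok.text[n++] = *p++;
        } else if (isdigit((unsigned char)*p)) { /* number: one or more digits */
            tok.type = TOK_NUMBER;
            while (isdigit((unsigned char)*p) && n < sizeof tok.text - 1)
                tok.text[n++] = *p++;
        } else {                                 /* anything else: a single symbol */
            tok.type = TOK_SYMBOL;
            tok.text[n++] = *p++;
        }
        tok.text[n] = '\0';
        *src = p;
        return tok;
    }

    int main(void)
    {
        const char *src = "if (x > 10) return x;";
        Token tok;
        /* The "parser" here simply prints each token it pulls from the scanner. */
        while ((tok = get_next_token(&src)).type != TOK_EOF)
            printf("%s\n", tok.text);
        return 0;
    }

Running this prints one lexeme per line (if, (, x, >, 10, ), return, x, ;), which is essentially the stream a syntax analyzer would consume.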

Roles of Lexical Analyzer

The Lexical Analyzer completes the following tasks:

  • Helps to identify tokens and enter them into the symbol table.
  • Eliminates blank lines and comments from the original program.
  • Correlates error messages with positions in the source program.
  • Expands macros if they are present in the source program.
  • Reads input characters from the source program.

Tokens

A lexeme is the sequence of (alphanumeric) characters that forms a single token. To be recognized as a legitimate token, every lexeme must conform to a set of predefined rules. The grammar establishes these rules by means of patterns: a pattern describes what can constitute a token, and patterns are defined using regular expressions.

Programming-language tokens include keywords, constants, identifiers, strings, numbers, operators, and punctuation symbols.

After reading the input, the analyzer returns tokens to your program, which lets your program focus on the meaning of the input rather than on the individual characters that make it up. For instance, a simple lexical analyzer might return only tokens that are correctly formatted numbers; your application would then not need to validate the input before trying to convert it.

Regular expressions make it possible to define precisely which values the tokens may take. Some tokens, such as if, else, and for, are simply keywords. Others, such as identifiers, can consist of any combination of letters and digits, as long as they do not match a keyword and do not begin with a digit.
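
For instance, a scanner can apply that rule by first matching the identifier pattern and then checking the lexeme against the keyword list. The C sketch below is only illustrative, and its keyword list is abbreviated:

    #include <string.h>

    /* A lexeme that matches the identifier pattern is reported as an
       identifier only if it is not one of the reserved keywords. */
    static const char *keywords[] = { "if", "else", "for", "int", "return" };

    static int is_keyword(const char *lexeme)
    {
        for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
            if (strcmp(lexeme, keywords[i]) == 0)
                return 1;
        return 0;
    }

Here is_keyword("if") returns 1, so "if" becomes a keyword token, while is_keyword("maximum") returns 0, so "maximum" is reported as an identifier.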

Here is an example of code the analyzer might receive:

#include <stdio.h>

    int maximum(int x, int y) {
        // This will compare 2 numbers
        if (x > y)
            return x;
        else {
            return y;
        }
    }

 

It will create the following tokens:

Lexeme      Token
int         Keyword
maximum     Identifier
(           Operator
int         Keyword
x           Identifier
,           Operator
int         Keyword
y           Identifier
)           Operator
{           Operator
if          Keyword

Lexical Errors

Lexical errors are character sequences that cannot be translated into any legitimate token. Some relevant details about lexical errors:

  • Lexical errors are not very common, but a scanner should still handle them.
  • Misspelled identifiers, operators, and keywords are regarded as lexical errors.
  • A lexical error typically results from the occurrence of an illegal character, usually at the start of a token, as in the example below.
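
For example, in a C-like language the character '@' cannot begin any token, so a scanner would report a lexical error as soon as it reaches it (the line below is purely illustrative):

    int result = value @ 2;   /* lexical error: illegal character '@' */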

Error Recovery in Lexical Analyzer

Here are a handful of the most popular error recovery methods:

  • Deleting one character from the remaining input.
  • Panic mode: successive characters are discarded until a well-formed token can be formed again (see the sketch after this list).
  • Inserting a missing character into the remaining input.
  • Replacing one character with another.
  • Transposing two adjacent characters.
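
As a rough idea of what panic mode can look like in a hand-written C scanner, here is a minimal sketch (the function name skip_to_next_token is only illustrative):

    #include <ctype.h>

    /* Panic-mode recovery: once an illegal character has been reported,
       discard characters until one that could start a new token is found. */
    static const char *skip_to_next_token(const char *p)
    {
        while (*p != '\0' && !isalnum((unsigned char)*p) && !isspace((unsigned char)*p))
            p++;                 /* throw the offending characters away */
        return p;                /* scanning resumes from here */
    }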

Token specifications

Let’s examine how language theory handles the following terms.

Alphabets

An alphabet is any finite set of symbols. Examples include the binary alphabet (0, 1), the hexadecimal alphabet (0–9, A–F), and the English alphabet (a–z, A–Z).

Strings

A string is any finite sequence of symbols (characters) drawn from an alphabet. The length of a string is the total number of symbols it contains; for example, the string "tutorialspoint" has a length of 14, written |tutorialspoint| = 14. The empty string, a string of zero length containing no symbols, is denoted by ε.

Language

A set of strings over some finite alphabet is referred to as a language. Computer languages are treated as finite sets, and set operations can be applied to them theoretically. Finite languages can be characterized using regular expressions.

Regular Expressions

The lexical analyzer needs to scan and identify only a finite set of valid strings, tokens, and lexemes that belong to the language at hand. It searches for the patterns that the language rules have established.

By specifying a pattern for finite strings of symbols, regular expressions can express finite languages. Regular grammar is the grammar that regular expressions define. A regular language is a language that conforms to regular grammar.

A regular expression is a crucial notation for describing patterns. Since each pattern matches a particular set of strings, regular expressions act as names for sets of strings. Programming-language tokens can be described by regular languages. The definition of regular expressions itself is an example of a recursive definition.

Operations

The various operations on languages include:

Union of two languages L and M is written as

L ∪ M = { s | s is in L or s is in M }

Concatenation of two languages L and M is written as

LM = { st | s is in L and t is in M }

The Kleene closure of a language L is written as

L* = Language L used zero or more times.
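
For instance, taking the small example languages L = {a, b} and M = {0, 1}:

L ∪ M = {a, b, 0, 1}

LM = {a0, a1, b0, b1}

L* = {ε, a, b, aa, ab, ba, bb, aaa, ...} (every string of a's and b's, including the empty string)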

Using regular expressions to represent a language’s legitimate tokens

In regular expression notation, the valid tokens of a language are represented as follows.

If x is a regular expression, then:

x* denotes zero or more occurrences of x, i.e., it can generate ε, x, xx, xxx, xxxx, and so on.

x+ denotes one or more occurrences of x, i.e., it can generate x, xx, xxx, ... (equivalently, x.x*).

x? denotes at most one occurrence of x, i.e., it can generate either {x} or {ε}.

[a-z] denotes all lower-case letters of the English language.

[A-Z] denotes all upper-case letters of the English language.

[0-9] denotes all the decimal digits used in mathematics.
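
Combining these operators gives the patterns a scanner actually uses. For instance, for a C-like language (the exact patterns vary from language to language):

    identifier : [a-zA-Z_][a-zA-Z0-9_]*   (a letter or underscore, then zero or more letters, digits, or underscores)
    integer    : [0-9]+                   (one or more digits)
    float      : [0-9]+\.[0-9]+           (digits, a decimal point, then more digits)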

 

Advantages of Lexical Analyzer

  • Programs such as compilers use the lexical analyzer method, taking the parsed data from a programmer’s code to build a binary executable.
  • Web browsers use it to format and display a web page from data parsed out of JavaScript, HTML, and CSS.
  • A separate lexical analyzer lets you build a specialized, potentially more efficient processor for the task.

Moreover, the following aspects make the lexical analyzer preferable to doing the same work in the parser:

  • Simplicity of design: removing irrelevant input early simplifies both the lexical analysis and the syntax analysis stages.
  • Compiler efficiency: a dedicated scanner helps increase the compiler’s efficiency.
  • Specialization: specialized techniques can be applied to improve the lexical analysis process.
  • Portability: only the scanner needs to communicate with the outside world.
  • Greater portability: input-device-specific quirks are confined to the lexer.

Disadvantages of Lexical Analyzer

  • Significant time is spent reading the source code and splitting it into tokens.
  • Some regular expressions are harder to understand than equivalent PEG or EBNF rules.
  • Extra effort is needed to develop and test the lexer and its token descriptions.
  • Generating the lexer tables and producing the tokens adds runtime overhead.

Conclusion

Lexical analysis feeds the modified source code into a lexer, which transforms the input into tokens. Here, we discussed the lexical analyzer, its main roles, and its advantages and disadvantages. Furthermore, we explained lexical errors and how to recover from them. For more guidance on similar processes crucial for your secure software development, check out our other articles on the topic.
