Open Access. © 2019 Azanzi Jiomekong et al., published by De Gruyter. This work is licensed under the Creative Commons
Attribution alone 4.0 License.
Open Computer Science 2019; 9:181–199
Research Article Open Access
Azanzi Jiomekong*, Gaoussou Camara, and Maurice Tchuente
Extracting ontological knowledge from Java
source code using Hidden Markov Models
https://doi.org/10.1515/comp-2019-0013
Received April 29, 2019; accepted July 25, 2019
Abstract: Ontologies have been a key element of information systems for decades, for example in the epidemiological surveillance domain. Building domain ontologies requires access to domain knowledge owned by domain experts or contained in knowledge sources. However, domain experts are not always available for interviews. There is therefore a lot of value in ontology learning, which consists of the automatic or semi-automatic extraction of ontological knowledge from structured or unstructured knowledge sources such as texts, databases, etc. Many techniques have been used, but they are all limited to the extraction of concepts, properties and terminology, leaving axioms and rules behind. Source code, which naturally embeds domain knowledge, is rarely used. In this paper, we propose an approach based on Hidden Markov Models (HMMs) for learning concepts, properties, axioms and rules from Java source code. We experiment with this approach on the source code of EPICAM, an epidemiological platform developed in Java and used in Cameroon for tuberculosis surveillance. Domain experts involved in the evaluation judged the extracted knowledge relevant to the domain. In addition, we performed an automatic evaluation of the relevance of the extracted terms to the medical domain by aligning them with ontologies hosted on the BioPortal platform through the Ontology Recommender tool. The results were encouraging: 82.9% of the extracted terms were covered by well-established biomedical ontologies such as NCIT, SNOMEDCT and ONTOPARON.
Keywords: Knowledge Extraction, Ontology Learning,
Hidden Markov Models, Java Source Code, Viterbi
*Corresponding Author: Azanzi Jiomekong: University of Yaounde I, Faculty of Science, Yaounde, Cameroon; IRD, Sorbonne Université, UMMISCO, F-93143, Bondy, France; E-mail: [email protected]
Gaoussou Camara: LIMA, Université Alioune Diop de Bambey, Sénégal; IRD, Sorbonne Université, UMMISCO, F-93143, Bondy, France; E-mail: [email protected]
Maurice Tchuente: University of Yaounde I, Faculty of Science, Yaounde, Cameroon; IRD, Sorbonne Université, UMMISCO, F-93143, Bondy, France; E-mail: [email protected]
1 Introduction
Studer et al. [1] defined an ontology as "a formal, explicit specification of a shared conceptualization". In the context of domain ontologies, conceptualization refers to the abstract model of the domain which is machine readable, and in which all the elements are explicitly defined and accepted by the members of a group. Several domain ontologies define and organize relevant knowledge about activities, processes, organizations and strategies in order to facilitate information exchange between machines and between a human and a machine [2, 3]. Building domain ontologies requires access to domain knowledge owned by domain experts or contained in knowledge sources [2, 4]. However, domain experts are not always available for interviews, and when they are available, the knowledge provided is often incomplete and subjective. In addition, as the domain evolves, the knowledge provided by the experts is likely to become out of date. Therefore, there is a lot of added value in creating domain ontologies from existing knowledge sources such as structured and unstructured documents of the domain: texts [5–8], databases [9–12], XML files [13], existing ontologies [14–16], UML/meta-model diagrams [17–19], and source code [12, 20–24]. Although source code is often used to extract concepts and relations, its full potential is not exploited to extract, for example, axioms and rules [21, 22]. Indeed, source code is any fully executable description of a software system designed for a specific domain such as medical, industrial, military, communication, aerospace, commercial or scientific. It can be used for the collection, organization, storage and communication of information, and it is designed to facilitate repetitive tasks or to process information quickly. During the software design process, a body of knowledge related to the domain is captured and integrated into the source code.
The extraction of knowledge from structured (relational databases, XML) and unstructured (text, documents, images) sources is also known as ontology learning [25–27], which consists of applying statistical techniques, symbolic techniques or both to (semi-)automatically extract ontological knowledge from knowledge sources. Several authors have proposed the use of symbolic techniques [12, 20, 28] and statistical techniques [23, 29], generally to extract concepts and properties from source code.
In this paper, we propose an approach for extracting ontological knowledge from Java source code using Hidden Markov Models (HMMs). Our approach is tested on the EPICAM source code. The EPICAM project¹ aims at building an integrated platform for the epidemiological surveillance of tuberculosis in Cameroon. The project started in 2012 and involves partners from different areas: academia (University of Yaounde 1 in Cameroon), clinical practice (fifty hospitals in Cameroon), epidemiology (the Epidemiology and Public Health department of the Centre Pasteur of Cameroon, and the National Tuberculosis Control Program), and industry (MEDES in France).
The rest of this paper is organized as follows. In section 2, we present an overview of ontology learning. Our approach is detailed in section 3. In section 4, we provide the results of the experimentation. Section 5 presents the evaluation of the knowledge extracted. Related work is discussed in section 6. We conclude and present future work in section 7.
2 Ontology Learning
Acquiring knowledge for building an ontology from scratch, or for refining an existing ontology, is costly in time and resources. Ontology learning techniques are used to reduce this cost during the knowledge acquisition process. Ontology learning refers to the extraction of ontological knowledge from unstructured, semi-structured or fully structured knowledge sources in order to build an ontology from them with little human intervention [3, 25, 26, 30]. In this section, we present the basic ontological knowledge, the knowledge sources generally used for ontology learning, some ontology learning techniques, and ontology learning evaluation.
2.1 Basic ontological knowledge
An ontology is composed of these basic components [2]:
Concept, also called Class, represents a category of objects. For instance, Health_facility is the concept of all health facilities, including health centers and clinics;
Individual is an instance of a concept and corresponds to a concrete object. For example, from the concept Person, Bob is an individual;
Property is used to describe the characteristics of the individuals of a concept. Properties are composed of DataProperties and ObjectProperties. DataProperties are properties whose values are data types. For instance, age of type Integer can be a property of an instance of the concept Person. ObjectProperties are special attributes whose values are individuals of concepts. For instance, examined_in defines a relationship between the concept Person and the concept Health_facility ("A person is examined in a health facility");
Class/Property hierarchy is one of the most important relations used to organize concepts and properties in the ontology, and the one through which inheritance mechanisms can be applied. For instance, Patient subClassOf Person is a hierarchical relation between these two classes. Class/Property taxonomies are generally used to construct so-called lightweight ontologies or taxonomies;
Axiom is used to model statements that are always true. Heavyweight ontologies add axioms and constraints to lightweight ontologies. Axioms and constraints clarify the intended meaning of the terms in the ontology. For example, the assertion "the concepts Men and Women are disjoint" is an axiom;
Rule is a statement of the form P_1, ..., P_n → P, meaning that if the statements P_1, ..., P_n are true, then the statement P is true. Rules are used for knowledge inference purposes.
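To make the axiom component concrete, the following minimal sketch states the disjointness example above programmatically. The use of the OWL API and all the names in it (the ontology IRI in particular) are illustrative assumptions, not part of any specific tooling discussed in this paper.

import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;

public class AxiomExample {
    public static void main(String[] args) throws OWLOntologyCreationException {
        String ns = "http://example.org/demo#"; // hypothetical namespace
        OWLOntologyManager man = OWLManager.createOWLOntologyManager();
        OWLDataFactory df = man.getOWLDataFactory();
        OWLOntology ont = man.createOntology(IRI.create("http://example.org/demo"));
        // Concepts (classes)
        OWLClass men = df.getOWLClass(IRI.create(ns + "Men"));
        OWLClass women = df.getOWLClass(IRI.create(ns + "Women"));
        // Axiom: "the concepts Men and Women are disjoint"
        man.addAxiom(ont, df.getOWLDisjointClassesAxiom(men, women));
    }
}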
2.2 Knowledge sources for ontology
learning
The process of developing an ontology requires knowledge
acquisition from any relevant sources. There are several
possible sources of knowledge: domain experts or unstruc-
tured, semi-structured, and structured sources [4].
2.2.1 Domain experts
A domain expert is a person knowledgeable about a domain. To get knowledge from domain experts, a knowledge engineer conducts interviews. This process might lead to knowledge loss or, even worse, introduce errors, because of the misunderstandings that arise frequently in human communication.
2.2.2 Unstructured knowledge sources
Unstructured knowledge sources contain knowledge that does not have a pre-defined organization. These are all kinds of textual resources (Web pages, manuals, discussion forum postings, specifications, analysis and design documents, source code comments) and multimedia contents (videos, photos, audio files) [3, 5, 6, 8, 23, 25, 26]. Unstructured sources are the most common and can allow more complete knowledge to be extracted. However, unstructured sources are easily accessible to human information processing only. For example, extracting formal specifications from arbitrary texts is still considered a hard problem, because sentences might be ambiguous and, in some cases, no unique correct syntactic analysis is possible [31].
2.2.3 Structured knowledge sources
Structured knowledge sources contain knowledge described by a schema. It is advantageous to use these knowledge sources because they contain directly accessible knowledge [31]. Some structured knowledge sources include:
Ontologies: Before constructing an ontology from scratch, one may look at other ontologies that could be reused [4, 15, 16];
Knowledge bases: From knowledge bases, one can generate discovered rules as input to develop a domain ontology [25, 32];
Databases: Terms to be used to build an ontology can be extracted from a database schema [9–12, 25].
2.2.4 Semi-structured knowledge sources
Semi-structured knowledge sources contain knowledge whose structure already reflects part of the semantic interdependencies. This structure facilitates the extraction of a schema [31]. Some examples of semi-structured knowledge sources are:
Folksonomies/thesauri: It is advantageous to extract knowledge from folksonomies and/or thesauri to build an ontology because they reflect the vocabulary of their users [33, 34];
XML (Extensible Markup Language): The aim of converting XML data to ontologies is the indexing, integration and enrichment of existing ontologies with knowledge acquired from XML documents [13];
UML/meta-model: To learn an ontology from UML and/or a meta-model, one approach is to extract OWL classes and properties from diagrams, or to use the Ontology UML Profile (OUP) which, together with the Ontology Definition Meta-model (ODM), enables the usage of Model Driven Architecture (MDA) standards in ontological engineering [18];
Entity-relation diagrams: They can be used to learn ontologies because they describe the information managed by databases [35];
Source code [12, 21–23, 28]: Generally, in source code, the names of data structures, variables and functions are close to the terms of the domain.
A lot of work has been done on the extraction of ontological knowledge from texts, databases, XML files and vocabularies, and on the use of ontologies to build or enrich other ontologies. This has resulted in a wide range of models, techniques and tools for the generation of knowledge structures that can be considered as an intermediate step when constructing ontologies. It should be noted that few works go beyond extracting concepts and properties from source code, whereas axioms and rules are also key elements of ontologies.
2.3 Ontology learning techniques
To extract knowledge from knowledge sources, many techniques are used [3, 25, 26, 36]. Shamsfard and Barforoush [26] proposed a classification of these techniques into symbolic techniques, statistical techniques and multi-strategy learning.
2.3.1 Symbolic techniques
In symbolic techniques, the extraction process consists of examining text fragments that match some predefined rules, looking for lexico-syntactic patterns corresponding, for instance, to taxonomic relations, or scanning for various types of templates related to ontological knowledge. A symbolic method can be rule-based, linguistic-based or pattern-based.
1. Rule-based models are represented as a set of rules where each rule consists of a condition and an action [30].
Logical rules may be used to discover new knowledge by deduction (deducing new knowledge from existing knowledge) or induction (synthesizing new knowledge from experience). For example, inductive logic programming can be used to learn new concepts from knowledge sources [5, 25, 26, 37];
Association rules aim at finding correlations between items in a dataset. This technique is generally used to learn relations between concepts [5, 8, 25, 26] and can be used to recognize a taxonomy of relations [25] or to discover gaps in conceptual definitions [5, 26, 38].
2. Linguistic approaches (syntactic analysis, morpho-syntactic analysis, lexico-syntactic pattern parsing, semantic processing and text understanding) are used to derive knowledge from text corpora [25, 26]. This technique can be used to derive an intentional description of concepts in the form of a natural language description [38].
3. Pattern/Template-driven approaches search for predefined keywords, templates or patterns. Indeed, a large class of entity extraction tasks can be accomplished by the use of carefully constructed regular expressions [39].
Although very powerful for particular domains, symbolic techniques are inflexible because of their strong dependency on the structure of the data. Symbolic techniques are precise and robust, but can be complex to implement and difficult to generalize [26].
2.3.2 Statistic-based techniques
Statistical analysis for ontology learning builds a statistical model from input data [3, 25, 26, 30]. Several statistical methods for extracting ontological knowledge have been identified in the literature:
1. Co-occurrence or collocation detection identifies the occurrence of some words in the same sentence, paragraph or document. Such occurrences hint at a potential direct relation between the words [40]. These techniques can be used to discover terms that are siblings of each other [24].
2. Clustering can be used to create groups of similar words (clusters) which can be regarded as representing concepts, and to further organize these clusters hierarchically. This technique is generally used for learning concepts, by considering clusters of related terms as concepts, and for learning taxonomies, by organizing these groups hierarchically [5]. Ontology alignment can use agglomerative clustering to find candidate groups of similar entities in ontologies [38].
3. Hidden Markov Models (HMMs) define generative statistical models that are able to generate data sequences according to rather complex probability distributions and that can be used for classifying sequential patterns [41–43]. Zhou and Su [44] used HMMs for Named Entity Recognition; Maedche and Staab [8] used n-gram models based on HMMs to process documents at the morphological level before supplying them to term extraction tools. Labsky et al. [29] present the use of HMMs to extract information on products offered by companies from HTML files.
2.3.3 Multi-strategy learning
Multi-strategy learning techniques leverage the strengths of the above techniques to extract a wide range of ontological knowledge from different types of knowledge sources [25, 26, 30]. For example, Maedche and Staab [8] present the use of clustering for concept learning and of association rules to learn the relations between these concepts.
2.4 Ontology learning evaluation
After the extraction process, the evaluation phase makes it possible to determine whether the knowledge extracted is accurate and to conclude on the quality of the knowledge source. The evaluation of ontological knowledge is addressed by several authors in the literature [45, 46]. Dellschaft and Staab [46] proposed two ways to evaluate ontological knowledge: (1) manual evaluation by human experts, in which the knowledge is presented to one or more domain experts who have to judge to what extent it is correct; (2) comparison of the knowledge to existing reference vocabularies/ontologies to ensure that it covers the studied domain.
3 Ontology learning from Java
source code using Hidden Markov
Models
Source code contains well-defined words in a language that everyone understands (for example, the elements generally found on the user interface), together with statements written in a lexicon specific to the programming language and to the programmer. For example, in the Java programming language, the term "class" is used to define a class, and the terms "if", "else", "switch" and "case" are used to define business rules (candidates to become rules). Other terms defined by the programmer, such as "PatientTuberculeux", represent the names of classes (candidates to become concepts); the term "examenATB" defines the relation (ObjectProperty) with cardinality (candidate to become an axiom) between the classes "PatientTuberculeux" and "Examen"; and the group of terms "int agePatient" defines a property (DataProperty) of the class "PatientTuberculeux". This section shows how to define, train and use Hidden Markov Models (HMMs) for knowledge extraction from Java source code.
3.1 Hidden Markov Models
A Markov Chain is a random process with a finite set of states, in which only the current state influences where the process goes next [41]. Hidden Markov Models are a particular type of Markov Chain, composed of a finite state automaton whose edges between pairs of states are labeled with transition probabilities. A HMM describes a two-stage statistical process in which the behavior of the process at a given time t depends only on the immediately preceding state. It is characterized by the transition probability between states, P(q_t | q_1, q_2, ..., q_{t-1}) = P(q_t | q_{t-1}), and for every state at time t an output or observation o_t is generated. The associated probability distribution depends only on the current state q_t and not on any previous states or observations: P(o_t | o_1, ..., o_{t-1}, q_1, ..., q_t) = P(o_t | q_t) [41, 43, 47–49]. HMMs are generally used for pattern recognition, automatic speech processing, automatic natural language processing and character recognition [41].
A rst order HMM perfectly describes the source code
because it can be seen as a string sequence typed by a pro-
grammer in which the current word (corresponding to an
assign hidden state) depends on the previous word. In this
HMM, the observed symbol depends only on the current
state [41–43]. Equation 1 presents the joint probability of a
series of observations O
1:T
given a series of hidden states
Q
1:T
. The HMM of Fig. 1 shows how the source code can be
modeled using a HMM. In this gure, the observations are
the words ("public", "class", "Patient", etc.) typed by the
programmers and each of these words are labeled by the
hidden states "PRE", "TARGET", "POST", and "OTHER".
P(O
1:T
, Q
1:T
) =
P(q
1
) P(o
1
| q
1
)
t=2
P(q
t
| q
t1
) P(o
t
| q
t
) (1)
Filtering, smoothing, prediction and the most likely explanation are four uses of HMMs. The probability that a string O is emitted by a HMM M is calculated as the sum over all possible paths, given by equation 2:

P(O | M) = Σ_{q_1,...,q_l} ∏_{k=1}^{l+1} P(q_{k-1} → q_k) P(q_k → o_k)   (2)

where q_0 and q_{l+1} are fixed to q_I and q_N respectively, and o_{l+1} is an end-of-word marker. The observable output of the system is the sequence of symbols emitted by the states; the underlying state sequence itself is hidden.
In the most likely explanation, the goal is to find the sequence of hidden states V(O | M) that best explains the sequence of observations (equation 3) [41–43]. To this end, the sequence of states V(O | M) which has the greatest probability of producing the observation sequence is searched for. For example, in automatic translation, one may want the most probable word sequence corresponding to the string to be translated. In this case, instead of taking the sum of the probabilities, the maximum is taken (equation 3):

V(O | M) = argmax_{q_1,...,q_l ∈ Q^l} ∏_{k=1}^{l+1} P(q_{k-1} → q_k) P(q_k → o_k)   (3)
Before using the model, its parameters (transition probabilities, emission probabilities and initial probabilities) must be estimated, using statistical learning, the Baum-Welch algorithm or Viterbi training [41].
3.2 Source code versus HMM
During software development, it is recommended to write
the source code according to good programming practices,
including naming conventions [50]. These practices in-
form programmers on how to name variables, organize
and present the source code. This organization can be used
to model source code using HMMs (see Fig. 1). For exam-
ple, from Java source code, we can say that at a time t, the
programmer enters a word (e.g. "public" at the beginning
of a Java source le). Thus, the keyword "public" at time t
conditions the next word at time t+1 which in this case can
be "class", "int", etc. We can say that PRE and TARGET are
the hidden states and "public" and "class" are respectively
their observations.
Figure 1: An example of HMM modeling the Java source code
Source code contains several types of files: files describing data, files processing data, user interface files and configuration files.
3.2.1 Files describing data
These les describe the data to be manipulated and
equally, some constraints on this data (e.g., data types).
In Java EE for example, there are entities whose names are
close to the terms of the domain that will be transformed
into tables in the database. These les often contain cer-
tain rules to verify the reliability of the data. Thus, from
these les, we can retrieve concepts, properties, axioms
and rules.
3.2.2 Files containing data processing
Located between the user interface files and the data description files are the data processing files of the source code, consisting of:
Control: for example, restricting certain data to certain users (e.g., only the attending physician has the right to access the data), or checking the validity of a field (checking whether the data entered in an "age" field is of type integer);
Calculation: for example, converting a date of birth into an age, determining the date of the next appointment of a patient, or calculating the body mass index of a patient based on his/her weight and height.
These are the algorithms implementing the business rules to be applied to the data. They are thus good candidates for axiom and rule extraction.
3.2.3 User interface files
The user interfaces are composed of files which describe the information that will be presented to users for data viewing or recording. Unlike the first two file types, these files contain the words of a human-readable vocabulary that can be found in a dictionary. User interfaces usually provide:
Translations allowing navigation from one language to another, and controls helping users to enter correct data;
Help allowing users to know, for example, the role of a data entry field.
User interfaces are therefore good candidates for the extraction of concepts and their definitions, properties, axioms and rules.
3.2.4 Configuration files
These les allow developers to specify certain information
such as the type and path of a data source, dierent lan-
guages used by users, etc. For instance, from these les,
the languages labels (e.g. English, French, Spanish) for
terms can be extracted.
The les we just presented generally contain com-
ments that can be useful for knowledge extraction or on-
tology documentation. Knowledge extraction from user in-
terfaces/web interfaces has already been addressed in [12,
24], knowledge extraction from text has been presented in
[5, 7, 8, 23]. In this article, we will focus on knowledge ex-
traction from les describing data and their processing.
3.3 Knowledge extraction process
To extract knowledge from Java source code, we designed a method divided into five main steps: data collection, data preprocessing, entity labeling, formal language translation, and knowledge validation.
3.3.1 Data collection
The data collection step consists of extracting the dataset needed by the subsequent steps. In Java files, statements importing third-party libraries, as well as comments, are deleted. We defined a regular expression that allows them to be identified.
3.3.2 Data preprocessing
The purpose of data preprocessing is to put the data in a form compatible with the tools to be used in the next steps. During this phase, potentially relevant knowledge is identified and retrieved, and some entities are recoded. The problem of extracting knowledge from source code is thereby reduced to the problem of syntactic labeling, which consists in determining the syntactic label of the words of a text [42]. In our case, it is a matter of assigning a label to all the words of the source code and extracting the words marked as target words. This problem can be solved using HMMs [42, 43]. In the following paragraphs, we first present the HMM structure for source code modeling. Then, we show how this HMM is trained and, finally, how it is used to extract knowledge from Java source code.
HMMs structure definition. To define the structure of the HMMs, we manually studied the organization of Java source code. Generally, data structures, attributes and conditions are surrounded by one or more specific words, some of which are predefined by the programming language. To label the source code, we defined four labels, corresponding to the four hidden states of the HMM:
PRE: corresponds to the preamble of the knowledge. This preamble is usually defined in advance;
TARGET: the target (i.e., the knowledge sought) may be preceded by one or more words belonging to the PRE set. The knowledge we are looking for consists of the names of classes, attributes and methods, and the relationships between classes. They are usually preceded by meta-knowledge which describes them. For example, the meta-knowledge "class" allows for concept identification;
POST: any information that follows the knowledge sought. In some cases, POST is a punctuation character or braces;
OTHER: any other word in the source code that neither precedes nor follows the knowledge sought.
An example of a HMM annotated with labels is given in Fig. 1. Concepts, properties, axioms and rules are usually arranged differently in the source code. We therefore define two HMMs to identify them: one to identify concepts, properties and axioms, and another one to identify rules.
Learning model parameters. There are several techniques to determine the parameters of a HMM: statistical learning on data, or specialized algorithms such as Baum-Welch or Viterbi training [41, 42]. In this paper, we chose statistical learning on data to train the HMMs modeled in the previous paragraphs. We thus assume that we have access to T source code files labeled f_t, where f_t is not just a sequence of words but a sequence of word pairs, each consisting of a word and its label (see Fig. 1), as modeled by equation 4. To train the model, we assume that we can define the order in which the different words are entered by the programmer: before entering the first word, the programmer decides on the label of that word and, as a function of it, defines the label of the next word, and so on. For example, before entering the word public, the programmer knows that its label is PRE and that the label of the next word is TARGET. Thus, the current word depends only on the current label, the following label depends on the previous label, and so on. The process continues until the end of the file.
f_t = [(w_1^t, e_1^t), ..., (w_d^t, e_d^t)],
words(f_t) = [w_1^t, ..., w_d^t],
labels(f_t) = [e_1^t, ..., e_d^t].   (4)
In equation 4, w_i and e_i are respectively the words and the labels of the files f_t. In practice, the w_i are words contained in the source code (observations) and the e_i are the labels of the w_i, used as hidden states.
From the training data, we can extract statistics on:
The rst label P(q
1
) (equation 5). A priori probability
that the rst label is equal to the word
a
is the num-
ber of times the rst label in each le of the source code
is the word
a
divided by the number of source code
les.
P(Q
1
= a) =
t
freq(e
t
1
= a, f
t
)
T
(5)
The relation between a word and its label, P(O_k | q_k) (equation 6). The conditional probability that the k-th word is w, knowing that the label is b, corresponds to the number of times the word w is associated with the label b in the source code files f_t, normalized by the number of times the label b is associated with any word in the files f_t. For example, "Patient" can be a concept or an attribute, but cannot be a rule.

P(O_k = w | q_k = b) = (α + Σ_t freq((w, b), f_t)) / (β + Σ_t freq((*, b), f_t))   (6)

To avoid zero probabilities for observations that do not occur in the training data, we added the smoothing terms α and β.
The relation between adjacent syntactic labels, P(q_{k+1} | q_k) (equation 7). The probability that q_{k+1} equals label a knowing that q_k equals label b (the previous hidden state) is the number of times a follows b in the training data, divided by the number of times b is followed by any label:

P(q_{k+1} = a | q_k = b) = (α + Σ_t freq((b, a), labels(f_t))) / (β + Σ_t freq((b, *), labels(f_t)))   (7)

To avoid zero probabilities for transitions that do not occur in the training data, we added the smoothing terms α and β.
Let us consider the HMM in Fig. 1. Training data to identify concepts and attributes would then be: [("public", PRE), ("class", TARGET), ("Patient", TARGET), ("extends", TARGET), ("ImogEntityImpl", TARGET), ("{", OTHER), (...), ("int", TARGET), ("age", TARGET), ...]. Tab. 1 presents the initial vector, which gives the probability that the first label is PRE, TARGET, POST or OTHER; Tab. 2 presents the transition vector containing the frequencies with which a state follows another state; and Tab. 3 presents the emission vector containing the frequencies with which a state emits an observation.
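As an illustration of how these statistics are collected, the following Java sketch estimates one of them, the smoothed transition probability of equation 7, by counting over the label sequences of the training files. The class name, method name and the concrete values of the smoothing terms α and β are hypothetical.

import java.util.List;

/** Sketch (hypothetical names): estimating P(q_{k+1} = a | q_k = b)
 *  of equation 7 by counting over the label sequences labels(f_t). */
class TransitionEstimator {
    static final double ALPHA = 1.0; // smoothing terms; values assumed
    static final double BETA = 4.0;  // (4 = number of hidden states)

    static double transition(List<List<String>> labelSequences, String b, String a) {
        double followedByA = 0, followedByAny = 0;
        for (List<String> labels : labelSequences)
            for (int k = 0; k + 1 < labels.size(); k++)
                if (labels.get(k).equals(b)) {
                    followedByAny++;                                // freq((b, *), labels(f_t))
                    if (labels.get(k + 1).equals(a)) followedByA++; // freq((b, a), labels(f_t))
                }
        return (ALPHA + followedByA) / (BETA + followedByAny);
    }
}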
Knowledge extraction. The model previously defined and trained can be applied to any Java source code in order to identify TARGET elements. It is necessary to find, for the files f_1, ..., f_n, a plausible sequence of states q_1, ..., q_n. For this, equation 3 is used to determine the most plausible state sequence. From this sequence, the hidden states are identified and the targets (words labeled TARGET) are extracted. In our approach, we used the Viterbi algorithm, which provides an efficient way of finding the most plausible sequence of hidden states [51, 52]. Algorithm 1 gives an overview of the Viterbi algorithm; more details can be found in [41].
Any source code can then be submitted to the trained HMM, and a table similar to Tab. 10, containing the probability for the hidden states to emit each word of the source code, is built.
Let M = (π, A, B) be our HMM, with π the vector of start probabilities, A the matrix of state-transition probabilities, and B the matrix of observation probabilities.
Let δ_t(i) = max_{q_1,...,q_{t-1}} P(O_1, ..., O_t, q_1, ..., q_{t-1}, q_t = i | M)
1. Initialization:
   δ_1(i) := π_i b_i(O_1),  ψ_1(i) := 0
2. Recursion: for all times t, t = 1, ..., T-1:
   δ_{t+1}(j) := max_i {δ_t(i) a_{ij}} b_j(O_{t+1})
   ψ_{t+1}(j) := argmax_i {δ_t(i) a_{ij}}
3. Termination:
   P*(O | M) = P(O, q* | M) = max_i δ_T(i)
   q*_T := argmax_j δ_T(j)
4. Back-tracking of the optimal path: for all times t, t = T-1, ..., 1:
   q*_t = ψ_{t+1}(q*_{t+1})
Algorithm 1: The Viterbi algorithm [41, 52]
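For concreteness, here is a compact Java sketch of the algorithm above. It is not the authors' implementation (which is available online, see note 5); pi, a and b correspond to π, A and B, and the observations are assumed to be encoded as integer indices. In practice, log-probabilities should be used to avoid numerical underflow on long files.

class ViterbiDecoder {
    /** Returns the most likely hidden-state sequence for the observations. */
    static int[] viterbi(double[] pi, double[][] a, double[][] b, int[] obs) {
        int n = pi.length, T = obs.length;
        double[][] delta = new double[T][n]; // delta[t][i]: best score ending in state i
        int[][] psi = new int[T][n];         // psi[t][j]: best predecessor of state j
        for (int i = 0; i < n; i++) delta[0][i] = pi[i] * b[i][obs[0]];
        for (int t = 1; t < T; t++)
            for (int j = 0; j < n; j++) {
                int best = 0; // argmax over predecessor states
                for (int i = 1; i < n; i++)
                    if (delta[t - 1][i] * a[i][j] > delta[t - 1][best] * a[best][j]) best = i;
                psi[t][j] = best;
                delta[t][j] = delta[t - 1][best] * a[best][j] * b[j][obs[t]];
            }
        int[] path = new int[T]; // back-tracking of the optimal path
        for (int j = 1; j < n; j++)
            if (delta[T - 1][j] > delta[T - 1][path[T - 1]]) path[T - 1] = j;
        for (int t = T - 1; t > 0; t--) path[t - 1] = psi[t][path[t]];
        return path;
    }
}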
Recoding variables. Programmers usually use expressions made up of words from a specific lexicon, sometimes encoded as "ad hoc" expressions, which require specific processing to assign them a new name or a human-understandable label before use. These names are generally composed of words or groups of words joined according to the naming conventions of the programming language. For example, we can have "PatientTuberculeux" → "Patient tuberculeux", "agePatient" → "Age Patient", "listeExamens" → "liste Examens", etc. Therefore, during recoding, these names are split in order to recover their real meaning in human-understandable language.
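A minimal sketch of such a recoding step, assuming Java camel-case naming conventions, can rely on a single regular expression; the class and method names are hypothetical.

class IdentifierRecoder {
    /** Splits a Java identifier on camel-case boundaries, e.g.
     *  "agePatient" -> "age Patient", "PatientTuberculeux" -> "Patient Tuberculeux". */
    static String recode(String identifier) {
        return String.join(" ", identifier.split("(?<=[a-z0-9])(?=[A-Z])"));
    }
}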
3.3.3 Entities labeling
The extraction of relevant terms yields knowledge and meta-knowledge. This knowledge and meta-knowledge allow us to identify the ontological components to which they may belong. For example, the code "class Patient extends Person int age", submitted to a HMM trained to identify concepts and relations, will yield three pieces of meta-knowledge ("class", "extends" and "int") that are used to identify two concepts (Patient and Person), one attribute of type integer, and a hierarchical relation between "Patient" and "Person". Among the extracted knowledge, two candidate concepts are related if one is declared in the structure of the other. One may identify three types of relations:
Table 1: The initial vector - the probability of having a state as the first label
f(PRE) f(TARGET) f(POST) f(OTHER)
Table 2: An example of a transition table
States PRE TARGET POST OTHER
PRE f(PRE,PRE) f(PRE,TARGET) f(PRE,POST) f(PRE,OTHER)
TARGET f(TARGET,PRE) f(TARGET,TARGET) f(TARGET,POST) f(TARGET,OTHER)
POST f(POST,PRE) f(POST,TARGET) f(POST,POST) f(POST,OTHER)
OTHER f(OTHER,PRE) f(OTHER,TARGET) f(OTHER,POST) f(OTHER,OTHER)
ObjectProperty: if two classes A and B are candidate concepts and b of type B is declared as an attribute of class A, then classes A and B are related. The attribute b is an ObjectProperty having A as domain and B as range;
DatatypeProperty: if a class A is a candidate concept and contains attributes a and b of basic data types (integer, string, boolean, etc.), then a and b are DatatypeProperties having the class A as domain;
Taxonomy (subClassOf): if two classes A and B are candidate concepts and the class B extends the class A (in Java, the keyword "extends" is used), then one can define a taxonomic relation between the classes B and A.
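The first two of these labeling rules can be sketched in code as follows; the method name and the concrete set of basic types are hypothetical choices.

import java.util.Set;

class AttributeClassifier {
    /** An attribute of a basic data type becomes a DatatypeProperty, while an
     *  attribute typed by another candidate concept becomes an ObjectProperty. */
    static String classifyAttribute(String attributeType, Set<String> candidateConcepts) {
        Set<String> basicTypes = Set.of("int", "long", "float", "double",
                                        "boolean", "String", "Date");
        if (basicTypes.contains(attributeType)) return "DatatypeProperty";
        if (candidateConcepts.contains(attributeType)) return "ObjectProperty";
        return "unknown";
    }
}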
3.3.4 Translation into a formal language
Once all the relevant knowledge has been identified in the previous phase, it is automatically translated into a machine-readable language. We use OWL for concepts, properties and axioms, and SWRL for rules.
3.3.5 Knowledge evaluation
After the extraction process, the evaluation phase allows us to know whether this knowledge is relevant to the related domain and to conclude on the relevance of using source code as a knowledge source. Given that the knowledge extracted is ontological knowledge, two evaluation techniques are used: (1) manual evaluation by human experts, in which the knowledge extracted is presented to one or more domain experts who have to judge to what extent this knowledge is correct; (2) comparison (alignment) of the knowledge extracted to gold standards, which here are existing ontologies.
3.4 HMMs definition, training and use
To extract knowledge from Java source code, two HMMs have to be defined and trained: a HMM for the identification of concepts, properties and axioms, and a HMM for the identification of rules. All the algorithms for HMM training and usage have been coded in Java².
3.4.1 HMM structure for concepts, properties and axioms
The HMM used to identify concepts, properties and axioms is defined by:
1. PRE = {public, private, protected, static, final}: the set of words that precede TARGET;
2. TARGET = {package, class, interface, extends, implements, abstract, enum, w_i}, where ∀i, w_{i-1} ∈ PRE, or w_{i-2} ∈ PRE and w_{i-1} ∈ PRE: the set of all words that we are seeking;
3. POST = {"{", ";", "}"}: the set of words that follow TARGET;
4. OTHER = {w_i}, with w_i ∉ PRE, w_i ∉ TARGET, w_i ∉ POST: the set of all other words.
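Written out as Java constants, the enumerable parts of these sets could look as follows (a sketch; as explained next, the TARGET and OTHER observations themselves are open-ended and cannot be fully enumerated).

import java.util.Set;

class ObservationSets {
    static final Set<String> PRE =
            Set.of("public", "private", "protected", "static", "final");
    static final Set<String> TARGET_KEYWORDS =
            Set.of("package", "class", "interface", "extends",
                   "implements", "abstract", "enum");
    static final Set<String> POST = Set.of("{", ";", "}");
}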
Each HMM state emits a term corresponding to a word from the source code. We have seen that the observations emitted by the PRE set can be enumerated. However, the observations of the TARGET and OTHER sets cannot be enumerated, because they depend on the programmer. We therefore considered data to stand for all the observations emitted by TARGET, and other to stand for all the observations emitted by OTHER. We obtained the HMM given by an initial vector (e.g., Tab. 4), a transition vector (e.g., Tab. 5) and an observation vector (e.g., Tab. 6).
Table 3: An example of an observation table
States package pac ; public class patient ...
PRE f(PRE,package) f(PRE,pac) f(PRE,;) f(PRE,public) f(PRE,class) f(PRE,patient) ...
TARGET f(TARGET,package) f(TARGET,pac) f(TARGET,;) f(TARGET,public) f(TARGET,class) f(TARGET,patient) ...
POST f(POST,package) f(POST,pac) f(POST,;) f(POST,public) f(POST,class) f(POST,patient) ...
OTHER f(OTHER,package) f(OTHER,pac) f(OTHER,;) f(OTHER,public) f(OTHER,class) f(OTHER,patient) ...
3.4.2 HMM structure for rules
Rules can be contained in conditions, so we exploit the structure of the source code to extract them. For example, the portion of code (if (agePatient > 21) {Patient = Adult}) is a rule determining whether a patient is an adult or not. It must therefore be extracted.
The HMM used to identify the rules is composed of:
1. PRE = {"}", ";", "{"}: the set of words that precede one or more TARGET;
2. TARGET = {if, else, switch, w_i}, where ∃k, r ∈ N such that w_{i-k} ∈ PRE and w_{i+r} ∈ POST: the set of all words that follow PRE and precede POST;
3. POST = {"}"}: the end of the condition;
4. OTHER = {w_i}, with w_i ∉ PRE, TARGET, POST: the set of all other words.
We can identify the beginning and the end of a condition, represented here by the sets PRE and POST respectively. Note that the observations emitted by the TARGET and OTHER sets cannot be fully enumerated. Therefore, we considered data to stand for all the observations emitted by TARGET, and other to stand for all the observations emitted by OTHER.
3.4.3 Statistical learning of the HMMs
The LearnJava source code (composed of 59 files and 2663 statements) was downloaded from GitHub³, and from this source code we used the statistical learning on data presented in section 3.3.2 to calculate the values of the HMM parameters⁴. Tabs 4, 5, 6, 7, 8 and 9 present the initial, transition and observation vectors of the two HMMs obtained after the training step.
3.4.4 Knowledge extraction
Once the HMMs are built, we can apply them to the source code of any Java application in order to extract knowledge. To do this, the most likely state sequence (equation 3) that produced this source code is calculated. To calculate it, we implemented the Viterbi algorithm [41, 51, 52] in Java⁵. In fact, we exploited the structure of the HMM in the context of dynamic programming: the calculation is broken down into intermediate calculations which are structured in a table. An example of a Viterbi table is given in Tab. 10. Every element of the table is calculated using the previous ones. From this table, the Viterbi path is retrieved by taking the frame with the highest probability in the last column and, given this frame, searching for all the frames that were used to build it. All the elements whose labels are TARGET are extracted as candidates.
4 Experimentation
This section presents the experimentation of the approach described in section 3. This experimentation consists in extracting ontological knowledge from the EPICAM source code, composed of 1254 Java files and 271782 instructions. Fig. 2 presents a screenshot of some concepts from the EPICAM source code.
4.1 Knowledge extraction from EPICAM
To extract ontological knowledge from the EPICAM source code, we proceeded step by step using the method presented in section 3.
4.1.1 Data collection
The source les of EPICAM platform are composed of state-
ments, imported libraries and comments. Data collection
involves removing the imported libraries and comments.
To this end, we dened the regular expression
import[u0000 u ]
*
?; |(.)
*
n|(
*
[ u0000u ]
*
?
*
) to identify them. Once
identied, we wrote a Java program to delete them.
Table 4: The initial vector of the HMM for concepts, properties and axioms extraction
PRE TARGET POST OTHER
0.0 1.0 0.0 0.0
Table 5: Transition vector of the HMM for concepts, properties and axioms extraction
PRE TARGET POST OTHER
PRE 0.1686 0.8260 0.0027 0.0027
TARGET 0.0008 0.7523 0.2461 0.0008
POST 0.0603 0.0033 0.0234 0.9130
OTHER 0.7364 0.1133 0.0025 0.1478
Table 6: Observation vector of the HMM for concepts, properties and axioms extraction
public private protected static final data { ; } other
PRE 0.6417 0.1684 0.0053 0.1124 0.0722 0.0 0.0 0.0 0.0 0.0
TARGET 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
POST 0.0 0.0 0.0 0.0 0.0 0.0 0.6678 0.3256 0.0066 0.0
OTHER 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
Table 7: The initial vector of the HMM for rules extraction
PRE TARGET POST OTHER
0.0 0.0 0.0 1.0
Table 8: Transition vector of the HMM for rules extraction
PRE TARGET POST OTHER
PRE 0.0667 0.7999 0.0667 0.0667
TARGET 0.0010 0.9321 0.0659 0.0010
POST 0.0172 0.0172 0.0172 0.9484
OTHER 0.0072 0.0001 0.0001 0.9926
Table 9: Observation vector of the HMM for rules extraction
{ } ; if else switch data other
PRE 0.8462 0.0769 0.0769 0.0 0.0 0.0 0.0 0.0
TARGET 0.0 0.0 0.0 0.0185 0.0031 0.0010 0.9774 0.0
POST 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
OTHER 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
4.1.2 Data preprocessing
Data preprocessing consists in extracting the elements likely to be relevant from the source code and recoding them if necessary. We used the HMMs defined and trained in section 3.4. These HMMs were applied to the source code of EPICAM by calculating the values of the Viterbi table (see Tab. 10). Once the table was built, we searched for the Viterbi path by taking the frame with the highest probability in the last column and, using this frame, searching for all the frames that were used to build it. Once the Viterbi path was identified, all the elements labeled TARGET were extracted.
Fig. 3 presents the set of candidates identified for concepts, properties and axioms, and Fig. 4 presents the set of candidates identified for rules.
Figure 2: An overview of the Java source code of the EPICAM project
Table 10: The Viterbi table (α table) built using EPICAM source code
package org.epicam ; public ... }
PRE 0 α(PRE, 2) α(PRE, 3) α(PRE, 4) ... α(PRE, t)
TARGET 1 α(TARGET, 2) α(TARGET, 3) α(TARGET, 4) ... α(TARGET, t)
OTHER 0 α(OTHER, 2) α(OTHER, 3) α(OTHER, 4) ... α(OTHER, t)
4.1.3 Recoding terms and rules
To recode the extracted candidates, we used the Java naming conventions. All the candidates were browsed and, for the candidates containing keywords of the programming language or framework, these keywords were removed. For example, consider the term CasTuberculoseEditorWorkflow extracted from the source code; the terms Editor and Workflow are keywords of the Google Web Toolkit, the technology used to build the EPICAM platform. The terms Editor and Workflow are therefore removed, and the term CasTuberculose is retained as a candidate.
After the recoding, we moved to the next step, which is the translation into a formal language.
4.1.4 Entities identification and translation into OWL
The data preprocessing phase produced a file containing only the meta-knowledge (e.g., "package", "class", "extends", "if", "switch") and the knowledge (e.g., "patientManagement.Patient", "Patient" or "serology"). We wrote a Java program to browse these files in order to identify the knowledge that may be useful. The meta-knowledge allows the candidates to be identified as concepts, properties and axioms. For example, if the string "package minHealth.Region.District.hospitals.patientRecord ... class Patient extends Person ... int age ... List<Exam> listExam" is extracted, then the following ontological knowledge is identified:
"package minHealth.Region.District.hospitals.patientRecord": this is used to identify the class hierarchy;
"class Patient extends Person": this expression means that "Patient" and "Person" are candidates that will become concepts, and that there is a hierarchical relation between the concepts "Patient" and "Person";
"int age; List<Exam> listExam": this expression means that "age" and "listExam" are properties of the concept "Patient"; the following axiom is also defined: a patient has only a single age (i.e., age is a functional property).
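The sketch below illustrates how this example could be translated into OWL axioms. The choice of the OWL API and the namespace are assumptions made for illustration; this is not the authors' OWLHelper program (see note 6).

import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;

class OwlTranslator {
    public static void main(String[] args) throws OWLOntologyCreationException {
        String ns = "http://example.org/epicam#"; // hypothetical namespace
        OWLOntologyManager man = OWLManager.createOWLOntologyManager();
        OWLDataFactory df = man.getOWLDataFactory();
        OWLOntology ont = man.createOntology(IRI.create("http://example.org/epicam"));

        OWLClass patient = df.getOWLClass(IRI.create(ns + "Patient"));
        OWLClass person = df.getOWLClass(IRI.create(ns + "Person"));
        // "class Patient extends Person" -> Patient subClassOf Person
        man.addAxiom(ont, df.getOWLSubClassOfAxiom(patient, person));

        // "int age" -> a functional DatatypeProperty with domain Patient
        OWLDataProperty age = df.getOWLDataProperty(IRI.create(ns + "age"));
        man.addAxiom(ont, df.getOWLDataPropertyDomainAxiom(age, patient));
        man.addAxiom(ont, df.getOWLFunctionalDataPropertyAxiom(age));
    }
}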
Figure 3: An excerpt of the candidates extracted for concepts, properties and axioms
Figure 4: An excerpt of the candidates extracted for rule identification
After the identification of the entities, we wrote a second Java program⁶ to automatically translate them into an OWL ontology⁷. In the same way, the extracted rules were translated into the Semantic Web Rule Language⁸. An example of a rule specifying the rights of a doctor on patient data is given by:
doctorsRule = "Personnel(?pers) ∧ personnel_login(?pers, login) ∧ personnel_passwd(?pers, passwd) ∧ Patient(?p) ∧ RendezVous(?rdv) ∧ hasRDV(?rdv, ?p) ∧ patient_nom(?p, ?nom) ∧ patient_age(?p, ?age) ∧ patient_sexe(?p, ?sexe) ∧ patient_telephoneUn(?p, ?telephone) ∧ rendezVous_dateRendezVous(?rdv, ?datardv) ∧ rendezVous_honore(?rdv, ?honore) ∧ rendezVous_honore(?rdv, Non) → sqwrl:select(?nom, ?age, ?sexe, ?telephone, ?datardv, ?honore)";
4.2 Analysis of the elements extracted
The extraction process produced a set of candidates (Figs 3 and 4), but also false positives (Tab. 11 presents the statistics). The false positives consist of candidates belonging to the PRE, POST or OTHER sets that should not have been extracted as observations of TARGET. We wrote a Java program to identify and delete them.
Tab. 11 presents the statistics on the candidates/groups of candidates that were extracted. After the extraction process, we obtained different types of candidates/groups of candidates:
Irrelevant candidates/groups of candidates: these are utility classes and temporary variables. Utility classes are classes that the programmer defines to perform certain operations. These classes usually contain constants and methods, and their names are usually not related to the domain. Temporary variables (e.g., the variables used in a loop) are used temporarily in the source code and are not related to the domain.
Relevant candidates/groups of candidates: these are the knowledge sought. These candidates include synonyms (candidates with identical meanings) and redundancies (candidates that come up several times). We wrote a Java program to identify and remove the redundant candidates automatically.
We also extracted candidate conditions for rules. As we did with the candidate concepts, properties and axioms, false positives were identified and deleted. Among the rules extracted, we found:
Irrelevant conditions: conditions that are not really important, for example testing whether a temporary variable is positive or equal to a certain value. These conditions were the most numerous;
Relevant conditions: conditions corresponding to a business rule (e.g., testing whether a user has the right to access certain data).
Table 11: Statistics on candidates extracted
Candidates Relevant Irrelevant
Concepts 1840 (72.87%) 685 (27.13%)
Properties 38355 (81.42%) 8755 (18.58%)
Axioms 3397 (83.22%) 685 (16.78%)
Rules 1484 (07.89%) 17332 (92.11%)
5 Evaluation
The concepts, properties and axioms extracted were translated into an OWL ontology, and the extracted rules were represented in SWRL. We used the Protégé editor to provide a graphical visualization of the ontology and the rules to human experts for their evaluation. Fig. 5 presents an overview of the ontology obtained.
Three experts from the tuberculosis surveillance domain involved in the EPICAM project were invited to evaluate the knowledge extracted. They come from three different organizations in Cameroon (the Centre Pasteur of Cameroon, the National Tuberculosis Control Program and a hospital in Yaounde). The domain experts were asked to check, first, whether the extracted terms are relevant from the tuberculosis clinical or epidemiological perspectives. Second, they analyzed the axioms and rules. Overall, they found the terminology relevant to tuberculosis. However, they suggested correcting some typos caused by the names of the classes and attributes given by the programmers. The axioms and rules were generally correct. Some rules were suggested for updating, as the business rules have evolved (e.g., user access to patient data has been refined to take into account the user's position, such as epidemiologist, physician, nurse or administrative staff).
In line with the experts' validation, we evaluated the coverage of the ontology terms by taking as reference other ontologies in the biomedical domain. We used BioPortal [53] as a biomedical ontology repository. BioPortal contains more than 300 ontologies, including a large number of medical terminologies such as SNOMED (Systematized Nomenclature of Medicine) [54]. BioPortal has an Ontology Recommender module that is used to find the best ontologies for a biomedical text or a set of keywords [55]. This is done according to four criteria: (1) the extent to which the ontology covers the input data; (2) the acceptance of the ontology in the biomedical community; (3) the level of detail of the ontology classes that cover the input data; and (4) the specialization of the ontology to the domain of the input data. We gave the set of terms (concepts and properties) of the ontology extracted by our HMM as input keywords to the Recommender.

Figure 5: An overview of the generated OWL ontology

Fig. 6 shows that the ontology terms are covered by many biomedical ontologies. In the first line of the recommended ontologies, we can see that NCIT, SNOMEDCT and ONTOPARON (accepted by the community with a score of 75.6%) cover the terms from our ontology with a score of 82.9%, with a level of detail of 64% and a level of specialization of 40%. We came to the conclusion that the terms extracted by our HMM are relevant to the biomedical domain.
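A query of this kind can also be sent to the Recommender's REST service. The sketch below is an assumption based on BioPortal's public API documentation; the endpoint, the input_type parameter and the API key handling should be checked against the current documentation before use.

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class RecommenderClient {
    public static void main(String[] args) throws Exception {
        String apiKey = "YOUR_BIOPORTAL_API_KEY";       // hypothetical placeholder
        String terms = "tuberculosis, patient, serology"; // extracted terms
        String url = "https://data.bioontology.org/recommender"
                + "?input=" + URLEncoder.encode(terms, StandardCharsets.UTF_8)
                + "&input_type=2" // assumed: 2 = keyword list
                + "&apikey=" + apiKey;
        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON ranking of covering ontologies
    }
}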
At the end of the evaluation, we conclude that the EPICAM source code contains ontological knowledge that can be used as a relevant basis to build and/or enrich an ontology for the tuberculosis surveillance domain.
6 Related work
Despite the large amount of available source code, and the fact that it may contain relevant knowledge of the domain [12, 21–23] addressed by the software, the number of existing works on knowledge extraction from this knowledge source is quite low. Parser-based approaches and machine learning techniques are the ones commonly used for knowledge extraction from source code.
6.1 Parser-based approach
A straightforward solution for extracting knowledge from source code is to use a parser. There are works in this direction for generating knowledge bases (RDF triples) or extracting ontological knowledge (concepts and properties) from source code using parsers. For instance, the CodeOntology [20, 56] parser is able to analyze Java source code and serialize it into RDF triples. From these triples, highly expressive queries in SPARQL (SPARQL Protocol and RDF Query Language) can be executed for different software engineering purposes, including the search for specific software components for reuse. Ganapathy and Sagayaraj [28] used the QDox⁹ generator to generate an ontology that further enables developers to reuse source code efficiently. QDox is a parser that can be used to extract classes, attributes, interfaces and method definitions from Java source code.

Figure 6: The Ontology Recommender output from the extracted ontology terms

In the approach proposed by [12], the authors defined the component parts of the source code and broke the source code down into these components. The source code is browsed and the different components are analyzed in order to take an appropriate action, namely the extraction of the knowledge sought. This knowledge can be used to supplement and assist ontology development from database schemas.
Beyond the extraction of RDF triples, terms, concepts and properties, existing parsers do not provide services for axiom and rule extraction. To overcome these limits, they would need to be improved. However, building and/or updating parsers for programming languages is a non-trivial, laborious and time-consuming task [57, 58].
6.2 Machine learning-based approach
Machine learning approaches are also proposed to extract
knowledge from source code.
Kalina Bontcheva and Marta Sabou [23] presented an approach for ontology learning from software artifacts such as software documentation, discussion forums and source code, using the language processing facilities provided by the GATE 2 platform¹⁰. GATE 2 is an open-source software developed in Java for building and deploying Human Language Technology applications such as parsers, morphology, tagging, Information Retrieval tools, Information Extraction components, etc. To extract concepts from source code, Kalina Bontcheva and Marta Sabou used the GATE key phrase extractor, which is based on TF.IDF (term frequency/inverse document frequency). TF.IDF is an unsupervised machine learning technique which consists of finding words/phrases that are characteristic of a given text, while ignoring phrases that occur frequently in the text simply because they are common in the language as a whole. When using TF.IDF on source code, high-frequency terms specific to the programming language can be eliminated, and only terms specific to the given software project are selected as relevant to the domain (ontology concepts). This approach is used to extract concepts. However, ontological knowledge is also made up of properties, axioms and rules.
Labsky et al. [29] presented an approach for extracting information on the products offered by companies from their websites. To extract information from HTML documents, they used Hidden Markov Models to annotate these documents. The tokens modelled by this HMM include words, formatting tags and images. The HMM is modelled using four states: the target state (T), which is the slot to extract; the prefix and suffix states (P, S), which constitute the slot's context; and the irrelevant tokens, modelled by a single background state (B). This approach permits the extraction of slots and of the relations between nearby slots; for example, a product image often follows its name. Unlike the authors' approach, which consists of term extraction, our approach uses the meta-data extracted from the source code to identify the ontological component to which every term/group of terms corresponds.
7 Conclusion and future work
In this paper, we proposed an approach for knowledge extraction from Java source code using Hidden Markov Models (HMMs). We experimented with this approach by extracting ontological knowledge from EPICAM, a tuberculosis epidemiological surveillance platform developed in Java. Evaluation by domain experts (clinicians and epidemiologists) allowed us to show the relevance of the knowledge extracted. In line with the experts' validation, we evaluated the coverage of the extracted terms by reference ontologies in the biomedical domain, using the Ontology Recommender from the BioPortal repository. The results of the evaluation show that the terms are well covered by many biomedical ontologies (e.g., NCIT, SNOMEDCT, ONTOPARON).
Our goal in this paper was twofold: (1) to show that source code contains ontological knowledge that could be used in domain ontology engineering, and (2) to show how to define, train and use HMMs to extract this knowledge. Since we used statistical learning from data to calculate the parameters of the HMMs, our future work consists of experimenting with the Baum-Welch and Viterbi training approaches. The performance of these three approaches will be evaluated and compared to the parser approach.
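For reference, statistical learning from labelled data amounts to relative-frequency estimation; the sketch below (a simplified illustration, assuming fully labelled training state sequences) shows how transition probabilities can be obtained by counting events and normalizing:

```java
import java.util.*;

// Simplified maximum-likelihood estimation of HMM transition probabilities
// from labelled state sequences: count transitions, then normalize each row.
public class CountEstimator {
    public static double[][] transitions(List<int[]> stateSequences, int numStates) {
        double[][] counts = new double[numStates][numStates];
        for (int[] seq : stateSequences)
            for (int t = 1; t < seq.length; t++)
                counts[seq[t - 1]][seq[t]] += 1.0;
        for (int i = 0; i < numStates; i++) {
            double rowSum = 0;
            for (double c : counts[i]) rowSum += c;
            if (rowSum > 0)
                for (int j = 0; j < numStates; j++)
                    counts[i][j] /= rowSum; // row now holds P(state_j | state_i)
        }
        return counts;
    }
}
```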
Notes
1. http://www.medes.fr/fr/nos-metiers/la-e-sante-et-l-epidemiologie/la-tele-epidemiologie/projet-epicam.html
2. https://github.com/jiodelus/source2onto
3. https://github.com/mafudge/LearnJava
4. https://github.com/jiodelus/source2onto/blob/master/code2onto-model/src/main/java/cm/uy1/training/HMMTrainingData.java
5. https://github.com/jiodelus/source2onto/blob/master/code2onto-model/src/main/java/cm/uy1/modelUse/KnowledgeExtractionHMM.java
6. https://github.com/jiodelus/source2onto/blob/master/code2onto-model/src/main/java/cm/uy1/helper/OWLHelper.java
7. https://github.com/jiodelus/ontologies/blob/master/epicam/epicam.owl
8. https://github.com/jiodelus/ontologies/blob/master/epicam/epicamrules.owl
9. https://github.com/paul-hammant/qdox
10. https://gate.ac.uk/
References
[1] Studer R., Benjamins V.R., Fensel D., Knowledge Engineering:
Principles and Methods, Data Knowl. Eng., 1998, 25(1-2), 161–
197, 10.1016/S0169-023X(97)00056-6
[2] Gómez-Pérez A., Fernández-López M., Corcho Ó., Ontologi-
cal Engineering: With Examples from the Areas of Knowl-
edge Management, e-Commerce and the Semantic Web, Ad-
vanced Information and Knowledge Processing, Springer, 2004,
10.1007/b97353
[3] Konys A., Knowledge systematization for ontology learning
methods, in Knowledge-Based and Intelligent Information &
Engineering Systems, Proceedings of the 22nd International
Conference KES-2018, Belgrade, Serbia, 3-5 September 2018.,
2018, 2194–2207, 10.1016/j.procs.2018.07.229
[4] Suárez-Figueroa M.C., Gómez-Pérez A., Fernández-López M.,
The NeOn Methodology framework: A scenario-based method-
ology for ontology development, Applied Ontology, 2015, 10(2),
107–145, 10.3233/AO-150145
[5] Cimiano P., Ontology learning and population from text - al-
gorithms, evaluation and applications, Springer US, 2006,
10.1007/978-0-387-39252-3
[6] Ghosh M.E., Naja H., Abdulrab H., Khalil M., Ontology Learning
Process as a Bottom-up Strategy for Building Domain-specific Ontology from Legal Texts, In Proceedings of the 9th International Conference on Agents and Artificial Intelligence, ICAART
2017, Volume 2, Porto, Portugal, February 24-26, 2017., 2017,
473–480, 10.5220/0006188004730480
[7] Alexander M., Raphael V., The Ontology Extraction & Mainte-
nance Framework Text-To-Onto, In International Conference on
Data Mining (ICDM), San Jose, USA, November 29 - December 2,
2001, IEEE, Los Alamitos (CA), 2001
[8] Alexander M., Steen S., Semi-automatic engineering of ontolo-
gies from text, Proceedings of the 12th Internal Conference on
Software and Knowledge Engineering. Chicago, USA, 2000
[9] Cerbah F., Lammari N., Ontology Learning from Databases:
Some Ecient Methods to Discover Semantic Patterns in Data,
in A..I.P. Serie, ed., Perspectives in Ontology Learning, 2014, 30
[10] Cullot N., Ghawi R., Yétongnon K., DB2OWL: A Tool for Automatic
Database-to-Ontology Mapping, In Proceedings of the Fifteenth
Italian Symposium on Advanced Database Systems, SEBD 2007,
17-20 June 2007, Torre Canne, Fasano, BR, Italy, 2007, 491–494
[11] Idrissi B.E., Baïna S., Baïna K., Ontology Learning from Rela-
tional Database: How to Label the Relationships Between Con-
cepts?, In Beyond Databases, Architectures and Structures -
11th International Conference, BDAS 2015, Ustroń, Poland, May
26-29, 2015, Proceedings, 2015, 235–244, 10.1007/978-3-319-
18422-7_21
[12] Zhao S., Chang E., Dillon T.S., Knowledge extraction from web-
based application source code: An approach to database re-
verse engineering for ontology development, In Proceedings of
the IEEE International Conference on Information Reuse and In-
tegration, IRI 2008, 13-15 July 2008, Las Vegas, Nevada, USA,
2008, 153–159, 10.1109/IRI.2008.4583022
[13] Hacherouf M., Bahloul S.N., Cruz C., Transforming XML docu-
ments to OWL ontologies: A survey, Journal of Information Sci-
ence, 2015, 41(2), 242–259, 10.1177/0165551514565972
[14] Leung N.K.Y., Lau S.K., Tsang N., Reuse existing ontologies in
an ontology development process - an integration-oriented on-
tology development methodology, International Journal of Web
Science, 2014, 2(3), 159–180, 10.1504/IJWS.2014.066435
[15] Pinto H., Gómez-Pérez A., Martins J., Some Issues on Ontology
Integration, In Proceedings of the 16th International Joint Con-
ference on Articial Intelligence (IJCAI 99) Workshop: KRR5: On-
tologies and Problem-Solving Methods: Lesson Learned and Fu-
ture Trends, volume 18, 1999
[16] Smith B., Ashburner M., Rosse C., Bard J., Bug W., Ceusters
W., al., The OBO Foundry: coordinated evolution of ontologies
to support biomedical data integration, Nature biotechnology,
2007, 25(11), 1251–1255, 10.1038/nbt1346
[17] Bouihi B., Bahaj M., A UML to OWL based approach for extracting Moodle's Ontology for Social Network Analysis, Procedia Computer Science, 2019, 148, 313–322, 10.1016/j.procs.2019.01.039, the Second International Conference on Intelligent Computing in Data Sciences, ICDS2018
[18] Djuric D., Gasevic D., Devedzic V., Ontology Modeling and
MDA, Journal of Object Technology, 2005, 4(1), 109–128,
10.5381/jot.2005.4.1.a3
[19] Xu Z., Ni Y., He W., Lin L., Yan Q., Automatic extraction
of OWL ontologies from UML class diagrams: a semantics-
preserving approach, World Wide Web, 2012, 15(5-6), 517–545,
10.1007/s11280-011-0147-z
[20] Atzeni M., Atzori M., CodeOntology: RDF-ization of Source Code,
In The Semantic Web - ISWC 2017 - 16th International Semantic
Web Conference, Vienna, Austria, October 21-25, 2017, Proceed-
ings, Part II, 2017, 20–28, 10.1007/978-3-319-68204-4_2
[21] Azanzi F.J., Camara G., Knowledge Extraction from Source Code
Based on Hidden Markov Model: Application to EPICAM, In 14th
IEEE/ACS International Conference on Computer Systems and
Applications, AICCSA 2017, Hammamet, Tunisia, October 30 -
Nov. 3, 2017, 2017, 1478–1485, 10.1109/AICCSA.2017.99
[22] Azanzi F.J., Camara G., An Approach for Knowledge Extraction
from Source Code (KNESC) of Typed Programming Languages, In
Trends and Advances in Information Systems and Technologies
- Volume 1 [WorldCIST’18, Naples, Italy, March 27-29, 2018].,
2018, 122–131, 10.1007/978-3-319-77703-0_12
[23] Bontcheva K., Sabou M., Learning Ontologies from Software Artifacts: Exploring and Combining Multiple Choices, In J.Z. Pan, Y. Zhao, eds., Semantic Web Enabled Software Engineering, volume 17 of Studies on the Semantic Web, IOS Press, 2014, 235–250
[24] Brunzel M., The XTREEM Methods for Ontology Learning from
Web Documents., In P. Buitelaar, P. Cimiano, eds., Ontology
Learning and Population: Bridging the Gap between Text and
Knowledge, volume 167 of Frontiers in Artificial Intelligence and
Applications, IOS Press, 2008, 3–26
[25] Asim M.N., Wasim M., Khan M.U.G., Mahmood W., Abbasi H.M.,
A survey of ontology learning techniques and applications,
Database, 2018, 2018, bay101, 10.1093/database/bay101
[26] Shamsfard M., Barforoush A.A., The state of the art
in ontology learning: a framework for comparison, The
Knowledge Engineering Review, 2003, 18(4), 293–316,
10.1017/S0269888903000687
[27] Unbehauen J., Hellmann S., Auer S., Stadler C., Knowledge Ex-
traction from Structured Sources, in S. Ceri, M. Brambilla, eds.,
Search Computing: Broadening Web Search, volume 7538 of
Lecture Notes in Computer Science, Springer Berlin Heidelberg,
Berlin, Heidelberg, 2012, 34–52, 10.1007/978-3-642-34213-4_3
[28] Ganapathy G., Sagayaraj S., To Generate the Ontol-
ogy from Java Source Code, International Journal of Ad-
vanced Computer Science and Applications, 2011, 2(2),
10.14569/IJACSA.2011.020218
[29] Labský M., Svátek V., Sváb O., Praks P., Krátký M., Snásel V.,
Information Extraction from HTML Product Catalogues: From
Source Code and Images to RDF, in 2005 IEEE / WIC /
ACM International Conference on Web Intelligence (WI 2005),
19-22 September 2005, Compiegne, France, 2005, 401–404,
10.1109/WI.2005.78
[30] Zhou L., Ontology learning: state of the art and open issues, In-
formation Technology and Management, 2007, 8(3), 241–252,
10.1007/s10799-007-0019-5
[31] Hitzler P., Krötzsch M., Rudolph S., Foundations of Semantic
Web Technologies, Chapman and Hall/CRC Press, 2010
[32] Kharbat F., El-Ghalayini H., Building Ontology from Knowledge
Base Systems, Data Mining in Medical and Biological Research,
2008, 10.5772/6407
[33] García-Silva A., García-Castro L.J., Castro A.G., Corcho Ó., Build-
ing Domain Ontologies Out of Folksonomies and Linked Data,
International Journal on Articial Intelligence Tools, 2015, 24(2),
10.1142/S021821301540014X
[34] Wang S., Wang W., Zhuang Y., Fei X., An ontology evolution
method based on folksonomy, Journal of Applied Research and
Technology, 2015, 13(2), 177–187
[35] Fahad M., ER2OWL: Generating OWL Ontology from ER Dia-
gram, In Intelligent Information Processing IV, 5th IFIP Interna-
tional Conference on Intelligent Information Processing, Octo-
ber 19-22, 2008, Beijing, China, 2008, 28–37, 10.1007/978-0-
387-87685-6_6
[36] Hazman M., El-Beltagy S.R., Rafea A., A Survey of Ontology
Learning Approaches, International Journal of Computer Appli-
cations, 2011, 22(8), 36–43
[37] Lisi F.A., Learning Onto-Relational Rules with Inductive Logic
Programming, CoRR, 2012, abs/1210.2984
[38] Wróblewska A., Podsiadly-Marczykowska T., Bembenik R., Pro-
taziuk G., Rybinski H., Methods and Tools for Ontology Build-
ing, Learning and Integration Application in the SYNAT Project,
in R. Bembenik, L. Skonieczny, H. Rybinski, M. Niezgodka,
eds., Intelligent Tools for Building a Scientific Information Plat-
form, volume 390 of Studies in Computational Intelligence,
Springer Berlin Heidelberg, Berlin, Heidelberg, 2012, 121–151,
10.1007/978-3-642-24809-2_9
[39] Li Y., Krishnamurthy R., Raghavan S., Vaithyanathan S., Ja-
gadish H.V., Regular Expression Learning for Information Extrac-
tion, in 2008 Conference on Empirical Methods in Natural Lan-
guage Processing, EMNLP 2008, Proceedings of the Conference,
25-27 October 2008, Honolulu, Hawaii, USA, A meeting of SIG-
DAT, a Special Interest Group of the ACL, 2008, 21–30
[40] Kolesnikova O., Survey of Word Co-occurrence Measures for
Collocation Detection, Computación y Sistemas, 2016, 20(3),
327–344
[41] Fink G.A., Markov Models for Pattern Recognition: From The-
ory to Applications, Advances In Computer Vision and Pattern
Recognition, Springer-Verlag, London, 2 edition, 2014
[42] Russell S.J., Norvig P., Artificial Intelligence - A Modern Ap-
proach, Third International Edition, Pearson Education, 2010
[43] Seymore K., McCallum A., Rosenfeld R., Learning Hidden Markov Model Structure for Information Extraction, In AAAI 99 Workshop
on Machine Learning for Information Extraction, 1999, 37–42
[44] Zhou G., Su J., Named Entity Recognition using an HMM-based
Chunk Tagger, In Proceedings of the 40th Annual Meeting of
the Association for Computational Linguistics, July 6-12, 2002,
Philadelphia, PA, USA., 2002, 473–480
[45] Amith M., He Z., Bian J., Lossio-Ventura J.A., Tao C., Assessing
the practice of biomedical ontology evaluation: Gaps and op-
portunities, Journal of Biomedical Informatics, 2018, 80, 1–13,
10.1016/j.jbi.2018.02.010
[46] Dellschaft K., Staab S., Strategies for the Evaluation of Ontol-
ogy Learning, In Proceedings of the 2008 Conference on Ontol-
ogy Learning and Population: Bridging the Gap Between Text
and Knowledge, IOS Press, Amsterdam, The Netherlands, The
Netherlands, 2008, 253–272
[47] Eddy S.R., What is a hidden Markov model?, Nature Biotechnol-
ogy, 2004, 22(10), 1315, 10.1038/nbt1004-1315
[48] Franzese M., Iuliano A., Hidden Markov Models, in S. Ranganathan, M. Gribskov, K. Nakai, C. Schönbach, eds., Encyclopedia of Bioinformatics and Computational Biology, Academic Press, Oxford, 2019, 753–762, 10.1016/B978-0-12-809633-8.20488-3
[49] Kouemou G.L., History and Theoretical Basics of Hidden Markov
Models, Hidden Markov Models, Theory and Applications, 2011,
10.5772/15205
[50] Binkley D., Davis M., Lawrie D., Morrell C., To camel-
case or under_score, in 2009 IEEE 17th International
Conference on Program Comprehension, 2009, 158–167,
10.1109/ICPC.2009.5090039
[51] Forney G.D., The Viterbi Algorithm: A Personal History, CoRR,
2005, abs/cs/0504020
[52] Viterbi A.J., Viterbi algorithm, Scholarpedia, 2009, 4(1), 6246,
10.4249/scholarpedia.6246
[53] Whetzel P.L., Noy N.F., Shah N.H., Alexander P.R., Nyulas C., Tu-
dorache T., Musen M.A., BioPortal: enhanced functionality via
new Web services from the National Center for Biomedical On-
tology to access and use ontologies in software applications,
Nucleic Acids Research, 2011, 39(Web-Server-Issue), 541–545,
10.1093/nar/gkr469
[54] Silva T.S.D., MacDonald D., Paterson G.I., Sikdar K.C., Cochrane
B., Systematized nomenclature of medicine clinical terms
(SNOMED CT) to represent computed tomography procedures,
Computer Methods and Programs in Biomedicine, 2011, 101(3),
324–329, 10.1016/j.cmpb.2011.01.002
[55] Romero M.M., Jonquet C., O’Connor M.J., Graybeal J., Pazos A.,
Musen M.A., NCBO Ontology Recommender 2.0: an enhanced
approach for biomedical ontology recommendation, Journal of
Biomedical Semantic, 2017, 8(1), 21:1–21:22, 10.1186/s13326-
017-0128-y
[56] Atzeni M., Atzori M., CodeOntology: Querying Source Code in a
Semantic Framework, In Proceedings of the ISWC 2017 Posters
& Demonstrations and Industry Tracks co-located with 16th In-
ternational Semantic Web Conference (ISWC 2017), Vienna, Austria, October 23-25, 2017, 2017
[57] Fenwick M., Weatherby G., Ellis H.J.C., Gryk M.R., Parser Com-
binators: A Practical Application for Generating Parsers for NMR
Data, In Tenth International Conference on Information Technol-
ogy: New Generations, ITNG 2013, 15-17 April, 2013, Las Vegas,
Nevada, USA, 2013, 241–246, 10.1109/ITNG.2013.39
[58] Nierstrasz O., Kurs J., Parsing for agile modeling, Sci-
ence of Computer Programming, 2015, 97, 150–156,
10.1016/j.scico.2013.11.011