Open Access. © 2019 Azanzi Jiomekong et al., published by De Gruyter. This work is licensed under the Creative Commons
Attribution alone 4.0 License.
Open Computer Science 2019; 9:181–199
Research Article Open Access
Azanzi Jiomekong*, Gaoussou Camara, and Maurice Tchuente
Extracting ontological knowledge from Java
source code using Hidden Markov Models
https://doi.org/10.1515/comp-2019-0013
Received April 29, 2019; accepted July 25, 2019
Abstract: Ontologies have been a key element of information systems for decades, for example in the epidemiological surveillance domain. Building domain ontologies requires access to domain knowledge owned by domain experts or contained in knowledge sources. However, domain experts are not always available for interviews. There is therefore a lot of value in ontology learning, which consists of the automatic or semi-automatic extraction of ontological knowledge from structured or unstructured knowledge sources such as texts, databases, etc. Many techniques have been used, but they are all limited to the extraction of concepts, properties and terminology, leaving axioms and rules behind. Source code, which naturally embeds domain knowledge, is rarely used. In this paper, we propose an approach based on Hidden Markov Models (HMMs) for learning concepts, properties, axioms and rules from Java source code. We experiment with this approach on the source code of EPICAM, an epidemiological platform developed in Java and used in Cameroon for tuberculosis surveillance. Domain experts involved in the evaluation judged the extracted knowledge relevant to the domain. In addition, we performed an automatic evaluation of the relevance of the extracted terms to the medical domain by aligning them with ontologies hosted on the BioPortal platform through the Ontology Recommender tool. The results were encouraging: 82.9% of the extracted terms were covered by well-established biomedical ontologies such as NCIT, SNOMEDCT and ONTOPARON.
Keywords: Knowledge Extraction, Ontology Learning,
Hidden Markov Models, Java Source Code, Viterbi
*Corresponding Author: Azanzi Jiomekong: University of Yaounde I, Faculty of Science, Yaounde, Cameroon; IRD, Sorbonne Université, UMMISCO, F-93143, Bondy, France; E-mail: [email protected]
Gaoussou Camara: LIMA, Université Alioune Diop de Bambey, Sénégal; IRD, Sorbonne Université, UMMISCO, F-93143, Bondy, France; E-mail: [email protected]
Maurice Tchuente: University of Yaounde I, Faculty of Science, Yaounde, Cameroon; IRD, Sorbonne Université, UMMISCO, F-93143, Bondy, France; E-mail: [email protected]
1 Introduction
Studer et al. [1] defined an ontology as "a formal, explicit specification of a shared conceptualization". In the context of domain ontologies, conceptualization refers to the abstract model of the domain which is machine readable, and in which all the elements are explicitly defined and accepted by the members of a group. Several domain ontologies define and organize relevant knowledge about activities, processes, organizations and strategies in order to facilitate information exchange between machines and between a human and a machine [2, 3]. Building domain ontologies requires access to domain knowledge owned by domain experts or contained in knowledge sources [2, 4]. However, domain experts are not always available for interviews, and when they are available, the knowledge provided is often incomplete and subjective. In addition, as the domain evolves, the knowledge provided by the experts is likely to become out of date. Therefore, there is a lot of added value in creating domain ontologies from existing knowledge sources such as structured and unstructured documents of the domain: texts [5–8], databases [9–12], XML files [13], existing ontologies [14–16], UML/meta-model diagrams [17–19], and source code [12, 20–24]. Although source code is often used to extract concepts and relations, its full potential is not exploited to extract, for example, axioms and rules [21, 22]. Indeed, source code is any fully executable description of a software system designed for a specific domain such as medical, industrial, military, communication, aerospace, commercial or scientific. It can be used for the collection, organization, storage and communication of information, and it is designed to facilitate repetitive tasks or to process information quickly. During the software design process, a body of knowledge related to the domain is captured and integrated into the source code.
The extraction of knowledge from structured (relational databases, XML) and unstructured (text, documents, images) sources is also known as ontology learning [25–27], which consists of applying statistical techniques, symbolic techniques or both to (semi-)automatically extract ontological knowledge from knowledge sources. Several authors have proposed the use of symbolic techniques [12, 20, 28] and statistical techniques [23, 29], generally to extract concepts and properties from source code.
In this paper, we propose an approach for extracting ontological knowledge from Java source code using Hidden Markov Models (HMMs). Our approach is tested on the EPICAM source code. The EPICAM project¹ aims at building an integrated platform for the epidemiological surveillance of tuberculosis in Cameroon. The project started in 2012 and involves partners from different areas: academia (University of Yaounde 1 in Cameroon), clinical practice (fifty hospitals in Cameroon), epidemiology (the Epidemiology and Public Health department of the Centre Pasteur of Cameroon, and the National Tuberculosis Control Program), and industry (MEDES in France).
The rest of this paper is organized as follows. In section 2, we present an overview of ontology learning. Our approach is detailed in section 3. In section 4, we provide the results of the experimentation. Section 5 presents the evaluation of the knowledge extracted. Related work is discussed in section 6. We conclude and present future work in section 7.
2 Ontology Learning
Acquiring knowledge for building an ontology from scratch, or for refining an existing ontology, is costly in time and resources. Ontology learning techniques are used to reduce this cost during the knowledge acquisition process. Ontology learning refers to the extraction of ontological knowledge from unstructured, semi-structured or fully structured knowledge sources in order to build an ontology from them with little human intervention [3, 25, 26, 30]. In this section, we present the basic ontological knowledge, the knowledge sources generally used for ontology learning, some ontology learning techniques, and ontology learning evaluation.
2.1 Basic ontological knowledge
An ontology is composed of these basic components [2]:
Concept, also called Class, represents a category of objects. For instance, Health_facility is the concept of all health facilities, including health centers and clinics;
Individual is an instance of a concept and corresponds to a concrete object. For example, from the concept Person, Bob is an individual;
Property is used to describe the characteristics of the individuals of a concept. Properties are composed of DataProperties and ObjectProperties. DataProperties are properties whose values are data types. For instance, age of type Integer can be a property of an instance of the concept Person. ObjectProperties are special attributes whose values are individuals of concepts. For instance, examined_in defines a relationship between the concept Person and the concept Health_facility ("A person is examined in a health facility");
Class/Property hierarchy is one of the most important relations used to organize concepts and properties in the ontology, and the one through which inheritance mechanisms can be applied. For instance, Patient subClassOf Person is a hierarchical relation between these two classes. Class/Property taxonomies are generally used to construct so-called lightweight ontologies or taxonomies;
Axiom is used to model statements that are always true. Heavyweight ontologies add axioms and constraints to lightweight ontologies. Axioms and constraints clarify the intended meaning of the terms in the ontology. For example, the assertion "the concepts Men and Women are disjoint" is an axiom;
Rule is a statement of the form P_1, ..., P_n → P, meaning that if the statements P_1, ..., P_n are true, then the statement P is true. Rules are used for knowledge inference purposes.
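To make the axiom component concrete, the following minimal sketch states the disjointness example above programmatically. The use of the OWL API and all the names in it (the ontology IRI in particular) are illustrative assumptions, not part of any specific tooling discussed in this paper.

import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;

public class AxiomExample {
    public static void main(String[] args) throws OWLOntologyCreationException {
        String ns = "http://example.org/demo#"; // hypothetical namespace
        OWLOntologyManager man = OWLManager.createOWLOntologyManager();
        OWLDataFactory df = man.getOWLDataFactory();
        OWLOntology ont = man.createOntology(IRI.create("http://example.org/demo"));
        // Concepts (classes)
        OWLClass men = df.getOWLClass(IRI.create(ns + "Men"));
        OWLClass women = df.getOWLClass(IRI.create(ns + "Women"));
        // Axiom: "the concepts Men and Women are disjoint"
        man.addAxiom(ont, df.getOWLDisjointClassesAxiom(men, women));
    }
}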
2.2 Knowledge sources for ontology
learning
The process of developing an ontology requires knowledge
acquisition from any relevant sources. There are several
possible sources of knowledge: domain experts or unstruc-
tured, semi-structured, and structured sources [4].
2.2.1 Domain experts
A domain expert is a person knowledgeable about a domain. To get knowledge from domain experts, a knowledge engineer conducts interviews. This process might lead to knowledge loss or, even worse, introduce errors, because of the misunderstandings that arise frequently in human communication.
2.2.2 Unstructured knowledge sources
Unstructured knowledge sources contain knowledge that does not have a pre-defined organization. These are all kinds of textual resources (Web pages, manuals, discussion forum postings, specifications, analysis and design documents, source code comments) and multimedia contents (videos, photos, audio files) [3, 5, 6, 8, 23, 25, 26]. Unstructured sources are the most common and can allow more complete knowledge to be extracted. However, unstructured sources are easily accessible to human information processing only. For example, extracting formal specifications from arbitrary texts is still considered a hard problem, because sentences might be ambiguous and, in some cases, no unique correct syntactic analysis is possible [31].
2.2.3 Structured knowledge sources
Structured knowledge sources contain knowledge described by a schema. It is advantageous to use these knowledge sources because they contain directly accessible knowledge [31]. Some structured knowledge sources include:
Ontologies: Before constructing an ontology from scratch, one may look at other ontologies that could be reused [4, 15, 16];
Knowledge bases: From knowledge bases, one can generate discovered rules as input to develop a domain ontology [25, 32];
Databases: Terms to be used to build an ontology can be extracted from a database schema [9–12, 25].
2.2.4 Semi-structured knowledge sources
Semi-structured knowledge sources contain knowledge whose structure already reflects part of the semantic interdependencies. This structure facilitates the extraction of a schema [31]. Some examples of semi-structured knowledge sources are:
Folksonomies/thesauri: It is advantageous to extract knowledge from folksonomies and/or thesauri to build an ontology because they reflect the vocabulary of their users [33, 34];
XML (Extensible Markup Language): The aim of converting XML data to ontologies is the indexing, integration and enrichment of existing ontologies with knowledge acquired from XML documents [13];
UML/meta-model: To learn an ontology from UML and/or a meta-model, one approach is to extract OWL classes and properties from diagrams, or to use the Ontology UML Profile (OUP) which, together with the Ontology Definition Meta-model (ODM), enables the usage of Model Driven Architecture (MDA) standards in ontological engineering [18];
Entity-relation diagrams: They can be used to learn ontologies because they describe the information managed by databases [35];
Source code [12, 21–23, 28]: Generally, in source code, the names of data structures, variables and functions are close to the terms of the domain.
A lot of work has been done on the extraction of ontological knowledge from texts, databases, XML files and vocabularies, and on the use of ontologies to build or enrich other ontologies. This has resulted in a wide range of models, techniques and tools for the generation of knowledge structures that can be considered as an intermediate step when constructing ontologies. It should be noted that few works go beyond extracting concepts and properties from source code, whereas axioms and rules are also key elements of ontologies.
2.3 Ontology learning techniques
To extract knowledge from knowledge sources, many techniques are used [3, 25, 26, 36]. Shamsfard and Barforoush [26] proposed a classification of these techniques into symbolic techniques, statistical techniques and multi-strategy learning.
2.3.1 Symbolic techniques
In symbolic techniques, the extraction process consists of examining text fragments that match some predefined rules, looking for lexico-syntactic patterns corresponding, for instance, to taxonomic relations, or scanning for various types of templates related to ontological knowledge. A symbolic method can be rule-based, linguistic-based or pattern-based.
1. Rule-based models are represented as a set of rules where each rule consists of a condition and an action [30].
Logical rules may be used to discover new knowledge by deduction (deducing new knowledge from existing knowledge) or induction (synthesizing new knowledge from experience). For example, inductive logic programming can be used to learn new concepts from knowledge sources [5, 25, 26, 37];
Association rules aim at finding correlations between items in a dataset. This technique is generally used to learn relations between concepts [5, 8, 25, 26] and can be used to recognize a taxonomy of relations [25] or to discover gaps in conceptual definitions [5, 26, 38].
2. Linguistic approaches (syntactic analysis, morpho-syntactic analysis, lexico-syntactic pattern parsing, semantic processing and text understanding) are used to derive knowledge from text corpora [25, 26]. This technique can be used to derive an intentional description of concepts in the form of a natural language description [38].
3. Pattern/Template-driven approaches search for predefined keywords, templates or patterns. Indeed, a large class of entity extraction tasks can be accomplished by the use of carefully constructed regular expressions [39].
Although very powerful for particular domains, symbolic techniques are inflexible because of their strong dependency on the structure of the data. Symbolic techniques are precise and robust, but can be complex to implement and difficult to generalize [26].
2.3.2 Statistic-based techniques
Statistical analysis for ontology learning builds a statistical model from input data [3, 25, 26, 30]. Several statistical methods for extracting ontological knowledge have been identified in the literature:
1. Co-occurrence or collocation detection identifies the occurrence of some words in the same sentence, paragraph or document. Such occurrences hint at a potential direct relation between the words [40]. These techniques can be used to discover terms that are siblings of each other [24].
2. Clustering can be used to create groups of similar words (clusters) which can be regarded as representing concepts, and to further organize these clusters hierarchically. This technique is generally used for learning concepts, by considering clusters of related terms as concepts, and for learning taxonomies, by organizing these groups hierarchically [5]. Ontology alignment can use agglomerative clustering to find candidate groups of similar entities in ontologies [38].
3. Hidden Markov Models (HMMs) define generative statistical models that are able to generate data sequences according to rather complex probability distributions and that can be used for classifying sequential patterns [41–43]. Zhou and Su [44] used HMMs for Named Entity Recognition; Maedche and Staab [8] used n-gram models based on HMMs to process documents at the morphological level before supplying them to term extraction tools. Labsky et al. [29] present the use of HMMs to extract information on products offered by companies from HTML files.
2.3.3 Multi-strategy learning
Multi-strategy learning techniques leverage the strengths of the above techniques to extract a wide range of ontological knowledge from different types of knowledge sources [25, 26, 30]. For example, Maedche and Staab [8] present the use of clustering for concept learning and of association rules to learn the relations between these concepts.
2.4 Ontology learning evaluation
After the extraction process, the evaluation phase makes it possible to determine whether the knowledge extracted is accurate and to conclude on the quality of the knowledge source. The evaluation of ontological knowledge is addressed by several authors in the literature [45, 46]. Dellschaft and Staab [46] proposed two ways to evaluate ontological knowledge: (1) manual evaluation by human experts, in which the knowledge is presented to one or more domain experts who have to judge to what extent it is correct; (2) comparison of the knowledge to existing reference vocabularies/ontologies to ensure that it covers the studied domain.
3 Ontology learning from Java
source code using Hidden Markov
Models
Source code contains well-defined words in a language that everyone understands (for example, the elements generally found on the user interface), together with statements written in a lexicon specific to the programming language and to the programmer. For example, in the Java programming language, the term "class" is used to define a class, and the terms "if", "else", "switch" and "case" are used to define business rules (candidates to become rules). Other terms defined by the programmer, such as "PatientTuberculeux", represent the names of classes (candidates to become concepts); the term "examenATB" defines the relation (ObjectProperty) with cardinality (candidate to become an axiom) between the classes "PatientTuberculeux" and "Examen"; and the group of terms "int agePatient" defines a property (DataProperty) of the class "PatientTuberculeux". This section shows how to define, train and use Hidden Markov Models (HMMs) for knowledge extraction from Java source code.
3.1 Hidden Markov Models
A Markov Chain is a random process with a finite set of states, in which only the current state influences where the process goes next [41]. Hidden Markov Models are a particular type of Markov Chain, composed of a finite state automaton whose edges between pairs of states are labeled with transition probabilities. A HMM describes a two-stage statistical process in which the behavior of the process at a given time t depends only on the immediately preceding state. It is characterized by the transition probability between states, P(q_t | q_1, q_2, ..., q_{t-1}) = P(q_t | q_{t-1}), and for every state at time t an output or observation o_t is generated. The associated probability distribution depends only on the current state q_t and not on any previous states or observations: P(o_t | o_1, ..., o_{t-1}, q_1, ..., q_t) = P(o_t | q_t) [41, 43, 47–49]. HMMs are generally used for pattern recognition, automatic speech processing, automatic natural language processing and character recognition [41].
A rst order HMM perfectly describes the source code
because it can be seen as a string sequence typed by a pro-
grammer in which the current word (corresponding to an
assign hidden state) depends on the previous word. In this
HMM, the observed symbol depends only on the current
state [41–43]. Equation 1 presents the joint probability of a
series of observations O
1:T
given a series of hidden states
Q
1:T
. The HMM of Fig. 1 shows how the source code can be
modeled using a HMM. In this gure, the observations are
the words ("public", "class", "Patient", etc.) typed by the
programmers and each of these words are labeled by the
hidden states "PRE", "TARGET", "POST", and "OTHER".
P(O
1:T
, Q
1:T
) =
P(q
1
) P(o
1
| q
1
)
t=2
P(q
t
| q
t1
) P(o
t
| q
t
) (1)
Filtering, smoothing, prediction and the most likely explanation are four uses of HMMs. The probability that a string O is emitted by a HMM M is calculated as the sum over all possible paths, given by equation 2:

P(O | M) = Σ_{q_1,...,q_l} ∏_{k=1}^{l+1} P(q_{k-1} → q_k) P(q_k → o_k)   (2)

where q_0 and q_{l+1} are fixed to q_I and q_N respectively, and o_{l+1} is an end-of-word marker. The observable output of the system is the sequence of symbols emitted by the states; the underlying state sequence itself is hidden.
In the most likely explanation, the goal is to find the sequence of hidden states V(O | M) that best explains the sequence of observations (equation 3) [41–43]. To this end, the sequence of states V(O | M) which has the greatest probability of producing the observation sequence is searched for. For example, in automatic translation, one may want the most probable word sequence corresponding to the string to be translated. In this case, instead of taking the sum of the probabilities, the maximum is taken (equation 3):

V(O | M) = argmax_{q_1,...,q_l ∈ Q^l} ∏_{k=1}^{l+1} P(q_{k-1} → q_k) P(q_k → o_k)   (3)
Before using the model, its parameters (transition probabilities, emission probabilities and initial probabilities) must be estimated, using statistical learning, the Baum-Welch algorithm or Viterbi training [41].
3.2 Source code versus HMM
During software development, it is recommended to write
the source code according to good programming practices,
including naming conventions [50]. These practices in-
form programmers on how to name variables, organize
and present the source code. This organization can be used
to model source code using HMMs (see Fig. 1). For exam-
ple, from Java source code, we can say that at a time t, the
programmer enters a word (e.g. "public" at the beginning
of a Java source le). Thus, the keyword "public" at time t
conditions the next word at time t+1 which in this case can
be "class", "int", etc. We can say that PRE and TARGET are
the hidden states and "public" and "class" are respectively
their observations.
Figure 1: An example of HMM modeling the Java source code
Source code contains several types of files: files describing data, files processing data, user interface files and configuration files.
3.2.1 Files describing data
These les describe the data to be manipulated and
equally, some constraints on this data (e.g., data types).
In Java EE for example, there are entities whose names are
close to the terms of the domain that will be transformed
into tables in the database. These les often contain cer-
tain rules to verify the reliability of the data. Thus, from
these les, we can retrieve concepts, properties, axioms
and rules.
3.2.2 Files containing data processing
Located between the user interface files and the data description files are the data processing files of the source code, consisting of:
Control: for example, restricting certain data to certain users (e.g., only the attending physician has the right to access the data), or checking the validity of a field (checking whether the data entered in an "age" field is of type integer);
Calculation: for example, converting a date of birth into an age, determining the date of the next appointment of a patient, or calculating the body mass index of a patient based on his/her weight and height.
These are the algorithms implementing the business rules to be applied to the data. They are thus good candidates for axiom and rule extraction.
3.2.3 User interface files
The user interfaces are composed of files which describe the information that will be presented to users for data viewing or recording. Unlike the first two file types, these files contain the words of a human-readable vocabulary that can be found in a dictionary. User interfaces usually provide:
Translations allowing navigation from one language to another, and controls helping users to enter correct data;
Help allowing users to know, for example, the role of a data entry field.
User interfaces are therefore good candidates for the extraction of concepts and their definitions, properties, axioms and rules.
3.2.4 Configuration files
These les allow developers to specify certain information
such as the type and path of a data source, dierent lan-
guages used by users, etc. For instance, from these les,
the languages labels (e.g. English, French, Spanish) for
terms can be extracted.
The les we just presented generally contain com-
ments that can be useful for knowledge extraction or on-
tology documentation. Knowledge extraction from user in-
terfaces/web interfaces has already been addressed in [12,
24], knowledge extraction from text has been presented in
[5, 7, 8, 23]. In this article, we will focus on knowledge ex-
traction from les describing data and their processing.
3.3 Knowledge extraction process
To extract knowledge from Java source code, we designed a method divided into five main steps: data collection, data preprocessing, entity labeling, formal language translation, and knowledge validation.
3.3.1 Data collection
The data collection step consists of extracting the dataset needed by the subsequent steps. In Java files, statements importing third-party libraries, as well as comments, are deleted. We defined a regular expression that allows them to be identified.
3.3.2 Data preprocessing
The purpose of data preprocessing is to put the data in a form compatible with the tools to be used in the next steps. During this phase, potentially relevant knowledge is identified and retrieved, and some entities are recoded. The problem of extracting knowledge from source code is thereby reduced to the problem of syntactic labeling, which consists in determining the syntactic label of the words of a text [42]. In our case, it is a matter of assigning a label to all the words of the source code and extracting the words marked as target words. This problem can be solved using HMMs [42, 43]. In the following paragraphs, we first present the HMM structure for source code modeling. Then, we show how this HMM is trained and, finally, how it is used to extract knowledge from Java source code.
HMMs structure definition. To define the structure of the HMMs, we manually studied the organization of Java source code. Generally, data structures, attributes and conditions are surrounded by one or more specific words, some of which are predefined by the programming language. To label the source code, we defined four labels, corresponding to the four hidden states of the HMM:
PRE: corresponds to the preamble of the knowledge. This preamble is usually defined in advance;
TARGET: the target (i.e., the knowledge sought) may be preceded by one or more words belonging to the PRE set. The knowledge we are looking for consists of the names of classes, attributes and methods, and the relationships between classes. They are usually preceded by meta-knowledge which describes them. For example, the meta-knowledge "class" allows for concept identification;
POST: any information that follows the knowledge sought. In some cases, POST is a punctuation character or braces;
OTHER: any other word in the source code that neither precedes nor follows the knowledge sought.
An example of a HMM annotated with labels is given in Fig. 1. Concepts, properties, axioms and rules are usually arranged differently in the source code. We therefore define two HMMs to identify them: one to identify concepts, properties and axioms, and another one to identify rules.
Learning model parameters. There are several techniques to determine the parameters of a HMM: statistical learning on data, or specialized algorithms such as Baum-Welch or Viterbi training [41, 42]. In this paper, we chose statistical learning on data to train the HMMs modeled in the previous paragraphs. We thus assume that we have access to T source code files labeled f_t, where f_t is not just a sequence of words but a sequence of word pairs, each consisting of a word and its label (see Fig. 1), as modeled by equation 4. To train the model, we assume that we can define the order in which the different words are entered by the programmer: before entering the first word, the programmer decides on the label of that word and, as a function of it, defines the label of the next word, and so on. For example, before entering the word public, the programmer knows that its label is PRE and that the label of the next word is TARGET. Thus, the current word depends only on the current label, the following label depends on the previous label, and so on. The process continues until the end of the file.
f_t = [(w_1^t, e_1^t), ..., (w_d^t, e_d^t)],
words(f_t) = [w_1^t, ..., w_d^t],
labels(f_t) = [e_1^t, ..., e_d^t].   (4)
In equation 4, w_i and e_i are respectively the words and the labels of the files f_t. In practice, the w_i are words contained in the source code (observations) and the e_i are the labels of the w_i, used as hidden states.
From the training data, we can extract statistics on:
The rst label P(q
1
) (equation 5). A priori probability
that the rst label is equal to the word
a
is the num-
ber of times the rst label in each le of the source code
is the word
a
divided by the number of source code
les.
P(Q
1
= a) =
t
freq(e
t
1
= a, f
t
)
T
(5)
The relation between a word and its label, P(O_k | q_k) (equation 6). The conditional probability that the k-th word is w, knowing that the label is b, corresponds to the number of times the word w is associated with the label b in the source code files f_t, normalized by the number of times the label b is associated with any word in the files f_t. For example, "Patient" can be a concept or an attribute, but cannot be a rule.

P(O_k = w | q_k = b) = (α + Σ_t freq((w, b), f_t)) / (β + Σ_t freq((*, b), f_t))   (6)

To avoid zero probabilities for observations that do not occur in the training data, we added the smoothing terms α and β.
The relation between adjacent syntactic labels, P(q_{k+1} | q_k) (equation 7). The probability that q_{k+1} equals label a knowing that q_k equals label b (the previous hidden state) is the number of times a follows b in the training data, divided by the number of times b is followed by any label:

P(q_{k+1} = a | q_k = b) = (α + Σ_t freq((b, a), labels(f_t))) / (β + Σ_t freq((b, *), labels(f_t)))   (7)

To avoid zero probabilities for transitions that do not occur in the training data, we added the smoothing terms α and β.
Let us consider the HMM in Fig. 1. Training data to identify concepts and attributes would then be: [("public", PRE), ("class", TARGET), ("Patient", TARGET), ("extends", TARGET), ("ImogEntityImpl", TARGET), ("{", OTHER), (...), ("int", TARGET), ("age", TARGET), ...]. Tab. 1 presents the initial vector, which gives the probability that the first label is PRE, TARGET, POST or OTHER; Tab. 2 presents the transition vector containing the frequencies with which a state follows another state; and Tab. 3 presents the emission vector containing the frequencies with which a state emits an observation.
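As an illustration of how these statistics are collected, the following Java sketch estimates one of them, the smoothed transition probability of equation 7, by counting over the label sequences of the training files. The class name, method name and the concrete values of the smoothing terms α and β are hypothetical.

import java.util.List;

/** Sketch (hypothetical names): estimating P(q_{k+1} = a | q_k = b)
 *  of equation 7 by counting over the label sequences labels(f_t). */
class TransitionEstimator {
    static final double ALPHA = 1.0; // smoothing terms; values assumed
    static final double BETA = 4.0;  // (4 = number of hidden states)

    static double transition(List<List<String>> labelSequences, String b, String a) {
        double followedByA = 0, followedByAny = 0;
        for (List<String> labels : labelSequences)
            for (int k = 0; k + 1 < labels.size(); k++)
                if (labels.get(k).equals(b)) {
                    followedByAny++;                                // freq((b, *), labels(f_t))
                    if (labels.get(k + 1).equals(a)) followedByA++; // freq((b, a), labels(f_t))
                }
        return (ALPHA + followedByA) / (BETA + followedByAny);
    }
}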
Knowledge extraction. The model previously defined and trained can be applied to any Java source code in order to identify TARGET elements. It is necessary to find, for the files f_1, ..., f_n, a plausible sequence of states q_1, ..., q_n. For this, equation 3 is used to determine the most plausible state sequence. From this sequence, the hidden states are identified and the targets (words labeled TARGET) are extracted. In our approach, we used the Viterbi algorithm, which provides an efficient way of finding the most plausible sequence of hidden states [51, 52]. Algorithm 1 gives an overview of the Viterbi algorithm; more details can be found in [41].
Any source code can then be submitted to the trained HMM, and a table similar to Tab. 10, containing the probability for the hidden states to emit each word of the source code, is built.
Let M = (π, A, B) be our HMM, with π the vector of start probabilities, A the matrix of state-transition probabilities, and B the matrix of observation probabilities.
Let δ_t(i) = max_{q_1,...,q_{t-1}} P(O_1, ..., O_t, q_1, ..., q_{t-1}, q_t = i | M)
1. Initialization:
   δ_1(i) := π_i b_i(O_1),  ψ_1(i) := 0
2. Recursion: for all times t, t = 1, ..., T-1:
   δ_{t+1}(j) := max_i {δ_t(i) a_{ij}} b_j(O_{t+1})
   ψ_{t+1}(j) := argmax_i {δ_t(i) a_{ij}}
3. Termination:
   P*(O | M) = P(O, q* | M) = max_i δ_T(i)
   q*_T := argmax_j δ_T(j)
4. Back-tracking of the optimal path: for all times t, t = T-1, ..., 1:
   q*_t = ψ_{t+1}(q*_{t+1})
Algorithm 1: The Viterbi algorithm [41, 52]
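For concreteness, here is a compact Java sketch of the algorithm above. It is not the authors' implementation (which is available online, see note 5); pi, a and b correspond to π, A and B, and the observations are assumed to be encoded as integer indices. In practice, log-probabilities should be used to avoid numerical underflow on long files.

class ViterbiDecoder {
    /** Returns the most likely hidden-state sequence for the observations. */
    static int[] viterbi(double[] pi, double[][] a, double[][] b, int[] obs) {
        int n = pi.length, T = obs.length;
        double[][] delta = new double[T][n]; // delta[t][i]: best score ending in state i
        int[][] psi = new int[T][n];         // psi[t][j]: best predecessor of state j
        for (int i = 0; i < n; i++) delta[0][i] = pi[i] * b[i][obs[0]];
        for (int t = 1; t < T; t++)
            for (int j = 0; j < n; j++) {
                int best = 0; // argmax over predecessor states
                for (int i = 1; i < n; i++)
                    if (delta[t - 1][i] * a[i][j] > delta[t - 1][best] * a[best][j]) best = i;
                psi[t][j] = best;
                delta[t][j] = delta[t - 1][best] * a[best][j] * b[j][obs[t]];
            }
        int[] path = new int[T]; // back-tracking of the optimal path
        for (int j = 1; j < n; j++)
            if (delta[T - 1][j] > delta[T - 1][path[T - 1]]) path[T - 1] = j;
        for (int t = T - 1; t > 0; t--) path[t - 1] = psi[t][path[t]];
        return path;
    }
}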
Recoding variables. Programmers usually use expressions made up of words from a specific lexicon, sometimes encoded as "ad hoc" expressions, which require specific processing to assign them a new name or a human-understandable label before use. These names are generally composed of words or groups of words joined according to the naming conventions of the programming language. For example, we can have "PatientTuberculeux" → "Patient tuberculeux", "agePatient" → "Age Patient", "listeExamens" → "liste Examens", etc. Therefore, during recoding, these names are split in order to recover their real meaning in human-understandable language.
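A minimal sketch of such a recoding step, assuming Java camel-case naming conventions, can rely on a single regular expression; the class and method names are hypothetical.

class IdentifierRecoder {
    /** Splits a Java identifier on camel-case boundaries, e.g.
     *  "agePatient" -> "age Patient", "PatientTuberculeux" -> "Patient Tuberculeux". */
    static String recode(String identifier) {
        return String.join(" ", identifier.split("(?<=[a-z0-9])(?=[A-Z])"));
    }
}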
3.3.3 Entities labeling
The extraction of relevant terms yields knowledge and meta-knowledge. This knowledge and meta-knowledge allow us to identify the ontological components to which they may belong. For example, the code "class Patient extends Person int age", submitted to a HMM trained to identify concepts and relations, will yield three pieces of meta-knowledge ("class", "extends" and "int") that are used to identify two concepts (Patient and Person), one attribute of type integer, and a hierarchical relation between "Patient" and "Person". Among the extracted knowledge, two candidate concepts are related if one is declared in the structure of the other. One may identify three types of relations:
Table 1: The initial vector - the probability of having a state as the first label
f(PRE) f(TARGET) f(POST) f(OTHER)
Table 2: An example of a transition table
States PRE TARGET POST OTHER
PRE f(PRE,PRE) f(PRE,TARGET) f(PRE,POST) f(PRE,OTHER)
TARGET f(TARGET,PRE) f(TARGET,TARGET) f(TARGET,POST) f(TARGET,OTHER)
POST f(POST,PRE) f(POST,TARGET) f(POST,POST) f(POST,OTHER)
OTHER f(OTHER,PRE) f(OTHER,TARGET) f(OTHER,POST) f(OTHER,OTHER)
ObjectProperty: if two classes A and B are candidate concepts and b of type B is declared as an attribute of class A, then classes A and B are related. The attribute b is an ObjectProperty having A as domain and B as range;
DatatypeProperty: if a class A is a candidate concept and contains attributes a and b of basic data types (integer, string, boolean, etc.), then a and b are DatatypeProperties having the class A as domain;
Taxonomy (subClassOf): if two classes A and B are candidate concepts and the class B extends the class A (in Java, the keyword "extends" is used), then one can define a taxonomic relation between the classes B and A.
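The first two of these labeling rules can be sketched in code as follows; the method name and the concrete set of basic types are hypothetical choices.

import java.util.Set;

class AttributeClassifier {
    /** An attribute of a basic data type becomes a DatatypeProperty, while an
     *  attribute typed by another candidate concept becomes an ObjectProperty. */
    static String classifyAttribute(String attributeType, Set<String> candidateConcepts) {
        Set<String> basicTypes = Set.of("int", "long", "float", "double",
                                        "boolean", "String", "Date");
        if (basicTypes.contains(attributeType)) return "DatatypeProperty";
        if (candidateConcepts.contains(attributeType)) return "ObjectProperty";
        return "unknown";
    }
}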
3.3.4 Translation into a formal language
Once all the relevant knowledge has been identified in the previous phase, it is automatically translated into a machine-readable language. We use OWL for concepts, properties and axioms, and SWRL for rules.
3.3.5 Knowledge evaluation
After the extraction process, the evaluation phase allows us to know whether this knowledge is relevant to the related domain and to conclude on the relevance of using source code as a knowledge source. Given that the knowledge extracted is ontological knowledge, two evaluation techniques are used: (1) manual evaluation by human experts, in which the knowledge extracted is presented to one or more domain experts who have to judge to what extent this knowledge is correct; (2) comparison (alignment) of the knowledge extracted to gold standards, which here are existing ontologies.
3.4 HMMs definition, training and use
To extract knowledge from Java source code, two HMMs have to be defined and trained: a HMM for the identification of concepts, properties and axioms, and a HMM for the identification of rules. All the algorithms for HMM training and usage have been coded in Java².
3.4.1 HMM structure for concepts, properties and axioms
The HMM used to identify concepts, properties and axioms is defined by:
1. PRE = {public, private, protected, static, final}: the set of words that precede TARGET;
2. TARGET = {package, class, interface, extends, implements, abstract, enum, w_i}, where ∀i, w_{i-1} ∈ PRE, or w_{i-2} ∈ PRE and w_{i-1} ∈ PRE: the set of all words that we are seeking;
3. POST = {"{", ";", "}"}: the set of words that follow TARGET;
4. OTHER = {w_i}, with w_i ∉ PRE, w_i ∉ TARGET, w_i ∉ POST: the set of all other words.
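Written out as Java constants, the enumerable parts of these sets could look as follows (a sketch; as explained next, the TARGET and OTHER observations themselves are open-ended and cannot be fully enumerated).

import java.util.Set;

class ObservationSets {
    static final Set<String> PRE =
            Set.of("public", "private", "protected", "static", "final");
    static final Set<String> TARGET_KEYWORDS =
            Set.of("package", "class", "interface", "extends",
                   "implements", "abstract", "enum");
    static final Set<String> POST = Set.of("{", ";", "}");
}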
Each HMM state emits a term corresponding to a word from the source code. We have seen that the observations emitted by the PRE set can be enumerated. However, the observations of the TARGET and OTHER sets cannot be enumerated, because they depend on the programmer. We therefore considered data to stand for all the observations emitted by TARGET, and other to stand for all the observations emitted by OTHER. We obtained the HMM given by an initial vector (e.g., Tab. 4), a transition vector (e.g., Tab. 5) and an observation vector (e.g., Tab. 6).
Table 3: An example of an observation table
States package pac ; public class patient ...
PRE f(PRE,package) f(PRE,pac) f(PRE,;) f(PRE,public) f(PRE,class) f(PRE,patient) ...
TARGET f(TARGET,package) f(TARGET,pac) f(TARGET,;) f(TARGET,public) f(TARGET,class) f(TARGET,patient) ...
POST f(POST,package) f(POST,pac) f(POST,;) f(POST,public) f(POST,class) f(POST,patient) ...
OTHER f(OTHER,package) f(OTHER,pac) f(OTHER,;) f(OTHER,public) f(OTHER,class) f(OTHER,patient) ...
3.4.2 HMM structure for rules
Rules can be contained in conditions, so we exploit the structure of the source code to extract them. For example, the portion of code (if (agePatient > 21) {Patient = Adult}) is a rule determining whether a patient is an adult or not. It must therefore be extracted.
The HMM used to identify the rules is composed of:
1. PRE = {"}", ";", "{"}: the set of words that precede one or more TARGET;
2. TARGET = {if, else, switch, w_i}, where ∃k, r ∈ N such that w_{i-k} ∈ PRE and w_{i+r} ∈ POST: the set of all words that follow PRE and precede POST;
3. POST = {"}"}: the end of the condition;
4. OTHER = {w_i}, with w_i ∉ PRE, TARGET, POST: the set of all other words.
We can identify the beginning and the end of a condition, represented here by the sets PRE and POST respectively. Note that the observations emitted by the TARGET and OTHER sets cannot be fully enumerated. Therefore, we considered data to stand for all the observations emitted by TARGET, and other to stand for all the observations emitted by OTHER.
3.4.3 Statistical learning of the HMMs
The LearnJava source code (composed of 59 files and 2663 statements) was downloaded from GitHub³, and from this source code we used the statistical learning on data presented in section 3.3.2 to calculate the values of the HMM parameters⁴. Tabs 4, 5, 6, 7, 8 and 9 present the initial, transition and observation vectors of the two HMMs obtained after the training step.
3.4.4 Knowledge extraction
Once the HMMs are built, we can apply them to the source code of any Java application in order to extract knowledge. To do this, the most likely state sequence (equation 3) that produced this source code is calculated. To calculate it, we implemented the Viterbi algorithm [41, 51, 52] in Java⁵. In fact, we exploited the structure of the HMM in the context of dynamic programming: the calculation is broken down into intermediate calculations which are structured in a table. An example of a Viterbi table is given in Tab. 10. Every element of the table is calculated using the previous ones. From this table, the Viterbi path is retrieved by taking the frame with the highest probability in the last column and, given this frame, searching for all the frames that were used to build it. All the elements whose labels are TARGET are extracted as candidates.
4 Experimentation
This section presents the experimentation of the approach described in section 3. This experimentation consists in extracting ontological knowledge from the EPICAM source code, composed of 1254 Java files and 271782 instructions. Fig. 2 presents a screenshot of some concepts from the EPICAM source code.
4.1 Knowledge extraction from EPICAM
To extract ontological knowledge from the EPICAM source code, we proceeded step by step using the method presented in section 3.
4.1.1 Data collection
The source les of EPICAM platform are composed of state-
ments, imported libraries and comments. Data collection
involves removing the imported libraries and comments.
To this end, we dened the regular expression
import[u0000 u ]
*
?; |(.)
*
n|(
*
[ u0000u ]
*
?
*
) to identify them. Once
identied, we wrote a Java program to delete them.
Table 4: The initial vector of the HMM for concepts, properties and axioms extraction
PRE TARGET POST OTHER
0.0 1.0 0.0 0.0
Table 5: Transition vector of the HMM for concepts, properties and axioms extraction
PRE TARGET POST OTHER
PRE 0.1686 0.8260 0.0027 0.0027
TARGET 0.0008 0.7523 0.2461 0.0008
POST 0.0603 0.0033 0.0234 0.9130
OTHER 0.7364 0.1133 0.0025 0.1478
Table 6: Observation vector of the HMM for concepts, properties and axioms extraction
public private protected static final data { ; } other
PRE 0.6417 0.1684 0.0053 0.1124 0.0722 0.0 0.0 0.0 0.0 0.0
TARGET 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
POST 0.0 0.0 0.0 0.0 0.0 0.0 0.6678 0.3256 0.0066 0.0
OTHER 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
Table 7: The initial vector of the HMM for rules extraction
PRE TARGET POST OTHER
0.0 0.0 0.0 1.0
Table 8: Transition vector of the HMM for rules extraction
PRE TARGET POST OTHER
PRE 0.0667 0.7999 0.0667 0.0667
TARGET 0.0010 0.9321 0.0659 0.0010
POST 0.0172 0.0172 0.0172 0.9484
OTHER 0.0072 0.0001 0.0001 0.9926
Table 9: Observation vector of the HMM for rules extraction
{ } ; if else switch data other
PRE 0.8462 0.0769 0.0769 0.0 0.0 0.0 0.0 0.0
TARGET 0.0 0.0 0.0 0.0185 0.0031 0.0010 0.9774 0.0
POST 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
OTHER 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
4.1.2 Data preprocessing
Data preprocessing consists in extracting the elements likely to be relevant from the source code and recoding them if necessary. We used the HMMs defined and trained in section 3.4. These HMMs were applied to the source code of EPICAM by calculating the values of the Viterbi table (see Tab. 10). Once the table was built, we searched for the Viterbi path by taking the frame with the highest probability in the last column and, using this frame, searching for all the frames that were used to build it. Once the Viterbi path was identified, all the elements labeled TARGET were extracted.
Fig. 3 presents the set of candidates identified for concepts, properties and axioms, and Fig. 4 presents the set of candidates identified for rules.
Figure 2: An overview of the Java source code of the EPICAM project
Table 10: The Viterbi table (α table) built using EPICAM source code
package org.epicam ; public ... }
PRE 0 α(PRE, 2) α(PRE, 3) α(PRE, 4) ... α(PRE, t)
TARGET 1 α(TARGET, 2) α(TARGET, 3) α(TARGET, 4) ... α(TARGET, t)
OTHER 0 α(OTHER, 2) α(OTHER, 3) α(OTHER, 4) ... α(OTHER, t)
4.1.3 Recoding terms and rules
To recode the extracted candidates, we used the Java naming conventions. All the candidates were browsed and, for the candidates containing keywords of the programming language or framework, these keywords were removed. For example, consider the term CasTuberculoseEditorWorkflow extracted from the source code; the terms Editor and Workflow are keywords of the Google Web Toolkit, the technology used to build the EPICAM platform. The terms Editor and Workflow are therefore removed, and the term CasTuberculose is retained as a candidate.
After the recoding, we moved to the next step, which is the translation into a formal language.
4.1.4 Entities identification and translation into OWL
The data preprocessing phase produced a file containing only the meta-knowledge (e.g., "package", "class", "extends", "if", "switch") and the knowledge (e.g., "patientManagement.Patient", "Patient" or "serology"). We wrote a Java program to browse these files in order to identify the knowledge that may be useful. The meta-knowledge allows the candidates to be identified as concepts, properties and axioms. For example, if the string "package minHealth.Region.District.hospitals.patientRecord ... class Patient extends Person ... int age ... List<Exam> listExam" is extracted, then the following ontological knowledge is identified:
"package minHealth.Region.District.hospitals.patientRecord": this is used to identify the class hierarchy;
"class Patient extends Person": this expression means that "Patient" and "Person" are candidates that will become concepts, and that there is a hierarchical relation between the concepts "Patient" and "Person";
"int age; List<Exam> listExam": this expression means that "age" and "listExam" are properties of the concept "Patient"; the following axiom is also defined: a patient has only a single age (i.e., age is a functional property).
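The sketch below illustrates how this example could be translated into OWL axioms. The choice of the OWL API and the namespace are assumptions made for illustration; this is not the authors' OWLHelper program (see note 6).

import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;

class OwlTranslator {
    public static void main(String[] args) throws OWLOntologyCreationException {
        String ns = "http://example.org/epicam#"; // hypothetical namespace
        OWLOntologyManager man = OWLManager.createOWLOntologyManager();
        OWLDataFactory df = man.getOWLDataFactory();
        OWLOntology ont = man.createOntology(IRI.create("http://example.org/epicam"));

        OWLClass patient = df.getOWLClass(IRI.create(ns + "Patient"));
        OWLClass person = df.getOWLClass(IRI.create(ns + "Person"));
        // "class Patient extends Person" -> Patient subClassOf Person
        man.addAxiom(ont, df.getOWLSubClassOfAxiom(patient, person));

        // "int age" -> a functional DatatypeProperty with domain Patient
        OWLDataProperty age = df.getOWLDataProperty(IRI.create(ns + "age"));
        man.addAxiom(ont, df.getOWLDataPropertyDomainAxiom(age, patient));
        man.addAxiom(ont, df.getOWLFunctionalDataPropertyAxiom(age));
    }
}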
Figure 3: An excerpt of the candidates extracted for concepts, properties and axioms
Figure 4: An excerpt of the candidates extracted for rule identification
After the identification of the entities, we wrote a second Java program⁶ to automatically translate them into an OWL ontology⁷. In the same way, the extracted rules were translated into the Semantic Web Rule Language⁸. An example of a rule specifying the rights of a doctor on patient data is given by:
doctorsRule = "Personnel(?pers) ∧ personnel_login(?pers, login) ∧ personnel_passwd(?pers, passwd) ∧ Patient(?p) ∧ RendezVous(?rdv) ∧ hasRDV(?rdv, ?p) ∧ patient_nom(?p, ?nom) ∧ patient_age(?p, ?age) ∧ patient_sexe(?p, ?sexe) ∧ patient_telephoneUn(?p, ?telephone) ∧ rendezVous_dateRendezVous(?rdv, ?datardv) ∧ rendezVous_honore(?rdv, ?honore) ∧ rendezVous_honore(?rdv, Non) → sqwrl:select(?nom, ?age, ?sexe, ?telephone, ?datardv, ?honore)";
4.2 Analysis of the elements extracted
The extraction process produced a set of candidates (Figs 3 and 4), but also false positives (Tab. 11 presents the statistics). The false positives consist of candidates belonging to the PRE, POST or OTHER sets that should not have been extracted as observations of TARGET. We wrote a Java program to identify and delete them.
Tab. 11 presents the statistics on the candidates/groups of candidates that were extracted. After the extraction process, we obtained different types of candidates/groups of candidates:
Irrelevant candidates/groups of candidates: these are utility classes and temporary variables. Utility classes are classes that the programmer defines to perform certain operations. These classes usually contain constants and methods, and their names are usually not related to the domain. Temporary variables (e.g., the variables used in a loop) are used temporarily in the source code and are not related to the domain.
Relevant candidates/groups of candidates: these are the knowledge sought. These candidates include synonyms (candidates with identical meanings) and redundancies (candidates that come up several times). We wrote a Java program to identify and remove the redundant candidates automatically.
We also extracted candidate conditions for rules. As we did with the candidate concepts, properties and axioms, false positives were identified and deleted. Among the rules extracted, we found:
Irrelevant conditions: conditions that are not really important, for example testing whether a temporary variable is positive or equal to a certain value. These conditions were the most numerous;
Relevant conditions: conditions corresponding to a business rule (e.g., testing whether a user has the right to access certain data).
Table 11: Statistics on candidates extracted
Candidates Relevant Irrelevant
Concepts 1840 (72.87%) 685 (27.13%)
Properties 38355 (81.42%) 8755 (18.58%)
Axioms 3397 (83.22%) 685 (16.78%)
Rules 1484 (07.89%) 17332 (92.11%)
5 Evaluation
The concepts, properties and axioms extracted were translated into an OWL ontology, and the extracted rules were represented in SWRL. We used the Protégé editor to provide a graphical visualization of the ontology and the rules to human experts for their evaluation. Fig. 5 presents an overview of the ontology obtained.
Three experts from the tuberculosis surveillance domain involved in the EPICAM project were invited to evaluate the knowledge extracted. They come from three different organizations in Cameroon (the Centre Pasteur of Cameroon, the National Tuberculosis Control Program and a hospital in Yaounde). The domain experts were asked to check, first, whether the extracted terms are relevant from the tuberculosis clinical or epidemiological perspectives. Second, they analyzed the axioms and rules. Overall, they found the terminology relevant to tuberculosis. However, they suggested correcting some typos caused by the names of the classes and attributes given by the programmers. The axioms and rules were generally correct. Some rules were suggested for updating, as the business rules have evolved (e.g., user access to patient data has been refined to take into account the user's position, such as epidemiologist, physician, nurse or administrative staff).
In line with the experts' validation, we evaluated the coverage of the ontology terms by taking as reference other ontologies in the biomedical domain. We used BioPortal [53] as a biomedical ontology repository. BioPortal contains more than 300 ontologies, including a large number of medical terminologies such as SNOMED (Systematized Nomenclature of Medicine) [54]. BioPortal has an Ontology Recommender module that is used to find the best ontologies for a biomedical text or a set of keywords [55]. This is done according to four criteria: (1) the extent to which the ontology covers the input data; (2) the acceptance of the ontology in the biomedical community; (3) the level of detail of the ontology classes that cover the input data; and (4) the specialization of the ontology to the domain of the input data. We gave the set of terms (concepts and properties) of the ontology extracted by our HMM as input keywords to the Recommender.

Figure 5: An overview of the generated OWL ontology

Fig. 6 shows that the ontology terms are covered by many biomedical ontologies. In the first line of the recommended ontologies, we can see that NCIT, SNOMEDCT and ONTOPARON (accepted by the community with a score of 75.6%) cover the terms from our ontology with a score of 82.9%, with a level of detail of 64% and a level of specialization of 40%. We came to the conclusion that the terms extracted by our HMM are relevant to the biomedical domain.
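A query of this kind can also be sent to the Recommender's REST service. The sketch below is an assumption based on BioPortal's public API documentation; the endpoint, the input_type parameter and the API key handling should be checked against the current documentation before use.

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class RecommenderClient {
    public static void main(String[] args) throws Exception {
        String apiKey = "YOUR_BIOPORTAL_API_KEY";       // hypothetical placeholder
        String terms = "tuberculosis, patient, serology"; // extracted terms
        String url = "https://data.bioontology.org/recommender"
                + "?input=" + URLEncoder.encode(terms, StandardCharsets.UTF_8)
                + "&input_type=2" // assumed: 2 = keyword list
                + "&apikey=" + apiKey;
        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON ranking of covering ontologies
    }
}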
At the end of the evaluation, we conclude that the EPICAM source code contains ontological knowledge that can be used as a relevant basis to build and/or enrich an ontology for the tuberculosis surveillance domain.
6 Related work
Despite the large amount of available source code, and the fact that it may contain relevant knowledge of the domain [12, 21–23] addressed by the software, the number of existing works on knowledge extraction from this knowledge source is quite low. Parser-based approaches and machine learning techniques are the ones commonly used for knowledge extraction from source code.
6.1 Parser-based approach
A straightforward solution for extracting knowledge from source code is to use a parser. There are works in this direction for generating knowledge bases (RDF triples) or extracting ontological knowledge (concepts and properties) from source code using parsers. For instance, the CodeOntology [20, 56] parser is able to analyze Java source code and serialize it into RDF triples. From these triples, highly expressive queries in SPARQL (SPARQL Protocol and RDF Query Language) can be executed for different software engineering purposes, including the search for specific software components for reuse. Ganapathy and Sagayaraj [28] used the QDox⁹ generator to generate an ontology that further enables developers to reuse source code efficiently. QDox is a parser that can be used to extract classes, attributes, interfaces and method definitions from Java source code.

Figure 6: The Ontology Recommender output from the extracted ontology terms

In the approach proposed by [12], the authors defined the component parts of the source code and broke the source code down into these components. The source code is browsed and the different components are analyzed in order to take an appropriate action, namely the extraction of the knowledge sought. This knowledge can be used to supplement and assist ontology development from database schemas.
Beyond the extraction of RDF triples, terms, concepts and properties, existing parsers do not provide services for axiom and rule extraction. To overcome these limits, they would need to be improved. However, building and/or updating parsers for programming languages is a non-trivial, laborious and time-consuming task [57, 58].
6.2 Machine learning-based approach
Machine learning approaches are also proposed to extract
knowledge from source code.
Kalina Bontcheva and Marta Sabou [23] presented an approach for ontology learning from software artifacts such as software documentation, discussion forums and source code, using the language processing facilities provided by the GATE 2 platform¹⁰. GATE 2 is an open-source software developed in Java for building and deploying Human Language Technology applications such as parsers, morphology, tagging, Information Retrieval tools, Information Extraction components, etc. To extract concepts from source code, Kalina Bontcheva and Marta Sabou used the GATE key phrase extractor, which is based on TF.IDF (term frequency/inverse document frequency). TF.IDF is an unsupervised machine learning technique which consists of finding words/phrases that are characteristic of a given text, while ignoring phrases that occur frequently in the text simply because they are common in the language as a whole. When using TF.IDF on source code, high-frequency terms specific to the programming language can be eliminated, and only terms specific to the given software project are selected as relevant to the domain (ontology concepts). This approach is used to extract concepts. However, ontological knowledge is also made up of properties, axioms and rules.
Labsky et al. [29] presented an approach for extracting information on the products offered by companies from their websites. To extract information from HTML documents, they used Hidden Markov Models to annotate these documents. The tokens modelled by this HMM include words, formatting tags and images. The HMM is modelled using four states: the target state (T), which is the slot to extract; the prefix and suffix states (P, S), which constitute the slot's context; and the irrelevant tokens, modelled by a single background state (B). This approach permits the extraction of slots and of the relations between nearby slots; for example, a product image often follows its name. Unlike the authors' approach, which consists of term extraction, our approach uses the meta-data extracted from the source code to identify the ontological component to which every term/group of terms corresponds.
7 Conclusion and future work
In this paper, we proposed an approach for knowledge extraction from Java source code using Hidden Markov Models (HMMs). We experimented with this approach by extracting ontological knowledge from EPICAM, a tuberculosis epidemiological surveillance platform developed in Java. Evaluation by domain experts (clinicians and epidemiologists) allowed us to show the relevance of the knowledge extracted. In line with the experts' validation, we evaluated the coverage of the extracted terms by reference ontologies in the biomedical domain, using the Ontology Recommender from the BioPortal repository. The results of the evaluation show that the terms are well covered by many biomedical ontologies (e.g., NCIT, SNOMEDCT, ONTOPARON).
Our goal in this paper was twofold: (1) to show that source code contains ontological knowledge that could be used in domain ontology engineering, and (2) to show how to define, train and use HMMs to extract this knowledge. Since we used statistical learning from data to calculate the parameters of the HMMs, our future work consists of experimenting with the Baum-Welch and Viterbi training approaches. The performance of these three approaches will be evaluated and compared to the parser approach.
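For reference, statistical learning from labelled data amounts to relative-frequency estimation; the sketch below (a simplified illustration, assuming fully labelled training state sequences) shows how transition probabilities can be obtained by counting events and normalizing:

```java
import java.util.*;

// Simplified maximum-likelihood estimation of HMM transition probabilities
// from labelled state sequences: count transitions, then normalize each row.
public class CountEstimator {
    public static double[][] transitions(List<int[]> stateSequences, int numStates) {
        double[][] counts = new double[numStates][numStates];
        for (int[] seq : stateSequences)
            for (int t = 1; t < seq.length; t++)
                counts[seq[t - 1]][seq[t]] += 1.0;
        for (int i = 0; i < numStates; i++) {
            double rowSum = 0;
            for (double c : counts[i]) rowSum += c;
            if (rowSum > 0)
                for (int j = 0; j < numStates; j++)
                    counts[i][j] /= rowSum; // row now holds P(state_j | state_i)
        }
        return counts;
    }
}
```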
Notes
1. http://www.medes.fr/fr/nos-metiers/la-e-sante-et-l-epidemiologie/la-tele-epidemiologie/projet-epicam.html
2. https://github.com/jiodelus/source2onto
3. https://github.com/mafudge/LearnJava
4. https://github.com/jiodelus/source2onto/blob/master/code2onto-model/src/main/java/cm/uy1/training/HMMTrainingData.java
5. https://github.com/jiodelus/source2onto/blob/master/code2onto-model/src/main/java/cm/uy1/modelUse/KnowledgeExtractionHMM.java
6. https://github.com/jiodelus/source2onto/blob/master/code2onto-model/src/main/java/cm/uy1/helper/OWLHelper.java
7. https://github.com/jiodelus/ontologies/blob/master/epicam/epicam.owl
8. https://github.com/jiodelus/ontologies/blob/master/epicam/epicamrules.owl
9. https://github.com/paul-hammant/qdox
10. https://gate.ac.uk/
References
[1] Studer R., Benjamins V.R., Fensel D., Knowledge Engineering:
Principles and Methods, Data Knowl. Eng., 1998, 25(1-2), 161–
197, 10.1016/S0169-023X(97)00056-6
[2] Gómez-Pérez A., Fernández-López M., Corcho Ó., Ontologi-
cal Engineering: With Examples from the Areas of Knowl-
edge Management, e-Commerce and the Semantic Web, Ad-
vanced Information and Knowledge Processing, Springer, 2004,
10.1007/b97353
[3] Konys A., Knowledge systematization for ontology learning
methods, in Knowledge-Based and Intelligent Information &
Engineering Systems, Proceedings of the 22nd International
Conference KES-2018, Belgrade, Serbia, 3-5 September 2018.,
2018, 2194–2207, 10.1016/j.procs.2018.07.229
[4] Suárez-Figueroa M.C., Gómez-Pérez A., Fernández-López M.,
The NeOn Methodology framework: A scenario-based method-
ology for ontology development, Applied Ontology, 2015, 10(2),
107–145, 10.3233/AO-150145
[5] Cimiano P., Ontology learning and population from text - al-
gorithms, evaluation and applications, Springer US, 2006,
10.1007/978-0-387-39252-3
[6] Ghosh M.E., Naja H., Abdulrab H., Khalil M., Ontology Learning
Process as a Bottom-up Strategy for Building Domain-specific Ontology from Legal Texts, In Proceedings of the 9th International Conference on Agents and Artificial Intelligence, ICAART
2017, Volume 2, Porto, Portugal, February 24-26, 2017., 2017,
473–480, 10.5220/0006188004730480
[7] Alexander M., Raphael V., The Ontology Extraction & Mainte-
nance Framework Text-To-Onto, In International Conference on
Data Mining (ICDM), San Jose, USA, November 29 - December 2,
2001, IEEE, Los Alamitos (CA), 2001
[8] Alexander M., Steen S., Semi-automatic engineering of ontolo-
gies from text, Proceedings of the 12th Internal Conference on
Software and Knowledge Engineering. Chicago, USA, 2000
[9] Cerbah F., Lammari N., Ontology Learning from Databases:
Some Ecient Methods to Discover Semantic Patterns in Data,
in A..I.P. Serie, ed., Perspectives in Ontology Learning, 2014, 30
[10] Cullot N., Ghawi R., Yétongnon K., DB2OWL: A Tool for Automatic
Database-to-Ontology Mapping, In Proceedings of the Fifteenth
Italian Symposium on Advanced Database Systems, SEBD 2007,
17-20 June 2007, Torre Canne, Fasano, BR, Italy, 2007, 491–494
[11] Idrissi B.E., Baïna S., Baïna K., Ontology Learning from Rela-
tional Database: How to Label the Relationships Between Con-
cepts?, In Beyond Databases, Architectures and Structures -
11th International Conference, BDAS 2015, Ustroń, Poland, May
26-29, 2015, Proceedings, 2015, 235–244, 10.1007/978-3-319-
18422-7_21
[12] Zhao S., Chang E., Dillon T.S., Knowledge extraction from web-
based application source code: An approach to database re-
verse engineering for ontology development, In Proceedings of
the IEEE International Conference on Information Reuse and In-
tegration, IRI 2008, 13-15 July 2008, Las Vegas, Nevada, USA,
2008, 153–159, 10.1109/IRI.2008.4583022
[13] Hacherouf M., Bahloul S.N., Cruz C., Transforming XML docu-
ments to OWL ontologies: A survey, Journal of Information Sci-
ence, 2015, 41(2), 242–259, 10.1177/0165551514565972
[14] Leung N.K.Y., Lau S.K., Tsang N., Reuse existing ontologies in
an ontology development process - an integration-oriented on-
tology development methodology, International Journal of Web
Science, 2014, 2(3), 159–180, 10.1504/IJWS.2014.066435
[15] Pinto H., Gómez-Pérez A., Martins J., Some Issues on Ontology
Integration, In Proceedings of the 16th International Joint Con-
ference on Articial Intelligence (IJCAI 99) Workshop: KRR5: On-
tologies and Problem-Solving Methods: Lesson Learned and Fu-
ture Trends, volume 18, 1999
[16] Smith B., Ashburner M., Rosse C., Bard J., Bug W., Ceusters
W., al., The OBO Foundry: coordinated evolution of ontologies
to support biomedical data integration, Nature biotechnology,
2007, 25(11), 1251–1255, 10.1038/nbt1346
[17] Bouihi B., Bahaj M., A UML to OWL based approach for extracting Moodle's Ontology for Social Network Analysis, Procedia Computer Science, 2019, 148, 313–322, 10.1016/j.procs.2019.01.039, the Second International Conference on Intelligent Computing in Data Sciences, ICDS2018
[18] Djuric D., Gasevic D., Devedzic V., Ontology Modeling and
MDA, Journal of Object Technology, 2005, 4(1), 109–128,
10.5381/jot.2005.4.1.a3
[19] Xu Z., Ni Y., He W., Lin L., Yan Q., Automatic extraction
of OWL ontologies from UML class diagrams: a semantics-
preserving approach, World Wide Web, 2012, 15(5-6), 517–545,
10.1007/s11280-011-0147-z
[20] Atzeni M., Atzori M., CodeOntology: RDF-ization of Source Code,
In The Semantic Web - ISWC 2017 - 16th International Semantic
Web Conference, Vienna, Austria, October 21-25, 2017, Proceed-
ings, Part II, 2017, 20–28, 10.1007/978-3-319-68204-4_2
[21] Azanzi F.J., Camara G., Knowledge Extraction from Source Code
Based on Hidden Markov Model: Application to EPICAM, In 14th
IEEE/ACS International Conference on Computer Systems and
Applications, AICCSA 2017, Hammamet, Tunisia, October 30 -
Nov. 3, 2017, 2017, 1478–1485, 10.1109/AICCSA.2017.99
[22] Azanzi F.J., Camara G., An Approach for Knowledge Extraction
from Source Code (KNESC) of Typed Programming Languages, In
Trends and Advances in Information Systems and Technologies
- Volume 1 [WorldCIST’18, Naples, Italy, March 27-29, 2018].,
2018, 122–131, 10.1007/978-3-319-77703-0_12
[23] Bontcheva K., Sabou M., Learning Ontologies from Software Artifacts: Exploring and Combining Multiple Choices, In J.Z. Pan, Y. Zhao, eds., Semantic Web Enabled Software Engineering, volume 17 of Studies on the Semantic Web, IOS Press, 2014, 235–250
[24] Brunzel M., The XTREEM Methods for Ontology Learning from
Web Documents., In P. Buitelaar, P. Cimiano, eds., Ontology
Learning and Population: Bridging the Gap between Text and
Knowledge, volume 167 of Frontiers in Artificial Intelligence and
Applications, IOS Press, 2008, 3–26
[25] Asim M.N., Wasim M., Khan M.U.G., Mahmood W., Abbasi H.M.,
A survey of ontology learning techniques and applications,
Database, 2018, 2018, bay101, 10.1093/database/bay101
[26] Shamsfard M., Barforoush A.A., The state of the art
in ontology learning: a framework for comparison, The
Knowledge Engineering Review, 2003, 18(4), 293–316,
10.1017/S0269888903000687
[27] Unbehauen J., Hellmann S., Auer S., Stadler C., Knowledge Ex-
traction from Structured Sources, in S. Ceri, M. Brambilla, eds.,
Search Computing: Broadening Web Search, volume 7538 of
Lecture Notes in Computer Science, Springer Berlin Heidelberg,
Berlin, Heidelberg, 2012, 34–52, 10.1007/978-3-642-34213-4_3
[28] Ganapathy G., Sagayaraj S., To Generate the Ontol-
ogy from Java Source Code, International Journal of Ad-
vanced Computer Science and Applications, 2011, 2(2),
10.14569/IJACSA.2011.020218
[29] Labský M., Svátek V., Sváb O., Praks P., Krátký M., Snásel V.,
Information Extraction from HTML Product Catalogues: From
Source Code and Images to RDF, in 2005 IEEE / WIC /
ACM International Conference on Web Intelligence (WI 2005),
19-22 September 2005, Compiegne, France, 2005, 401–404,
10.1109/WI.2005.78
[30] Zhou L., Ontology learning: state of the art and open issues, In-
formation Technology and Management, 2007, 8(3), 241–252,
10.1007/s10799-007-0019-5
[31] Hitzler P., Krötzsch M., Rudolph S., Foundations of Semantic
Web Technologies, Chapman and Hall/CRC Press, 2010
[32] Kharbat F., El-Ghalayini H., Building Ontology from Knowledge
Base Systems, Data Mining in Medical and Biological Research,
2008, 10.5772/6407
[33] García-Silva A., García-Castro L.J., Castro A.G., Corcho Ó., Build-
ing Domain Ontologies Out of Folksonomies and Linked Data,
International Journal on Articial Intelligence Tools, 2015, 24(2),
10.1142/S021821301540014X
[34] Wang S., Wang W., Zhuang Y., Fei X., An ontology evolution
method based on folksonomy, Journal of Applied Research and
Technology, 2015, 13(2), 177–187
[35] Fahad M., ER2OWL: Generating OWL Ontology from ER Dia-
gram, In Intelligent Information Processing IV, 5th IFIP Interna-
tional Conference on Intelligent Information Processing, Octo-
ber 19-22, 2008, Beijing, China, 2008, 28–37, 10.1007/978-0-
387-87685-6_6
[36] Hazman M., El-Beltagy S.R., Rafea A., A Survey of Ontology
Learning Approaches, International Journal of Computer Appli-
cations, 2011, 22(8), 36–43
[37] Lisi F.A., Learning Onto-Relational Rules with Inductive Logic
Programming, CoRR, 2012, abs/1210.2984
[38] Wróblewska A., Podsiadly-Marczykowska T., Bembenik R., Pro-
taziuk G., Rybinski H., Methods and Tools for Ontology Build-
ing, Learning and Integration Application in the SYNAT Project,
in R. Bembenik, L. Skonieczny, H. Rybinski, M. Niezgodka,
eds., Intelligent Tools for Building a Scientific Information Plat-
form, volume 390 of Studies in Computational Intelligence,
Springer Berlin Heidelberg, Berlin, Heidelberg, 2012, 121–151,
10.1007/978-3-642-24809-2_9
[39] Li Y., Krishnamurthy R., Raghavan S., Vaithyanathan S., Ja-
gadish H.V., Regular Expression Learning for Information Extrac-
tion, in 2008 Conference on Empirical Methods in Natural Lan-
guage Processing, EMNLP 2008, Proceedings of the Conference,
25-27 October 2008, Honolulu, Hawaii, USA, A meeting of SIG-
DAT, a Special Interest Group of the ACL, 2008, 21–30
[40] Kolesnikova O., Survey of Word Co-occurrence Measures for
Collocation Detection, Computación y Sistemas, 2016, 20(3),
327–344
[41] Fink G.A., Markov Models for Pattern Recognition: From The-
ory to Applications, Advances In Computer Vision and Pattern
Recognition, Springer-Verlag, London, 2 edition, 2014
[42] Russell S.J., Norvig P., Artificial Intelligence - A Modern Ap-
proach, Third International Edition, Pearson Education, 2010
[43] Seymore K., McCallum A., Rosenfeld R., Learning Hidden Markov Model Structure for Information Extraction, In AAAI 99 Workshop
on Machine Learning for Information Extraction, 1999, 37–42
[44] Zhou G., Su J., Named Entity Recognition using an HMM-based
Chunk Tagger, In Proceedings of the 40th Annual Meeting of
the Association for Computational Linguistics, July 6-12, 2002,
Philadelphia, PA, USA., 2002, 473–480
[45] Amith M., He Z., Bian J., Lossio-Ventura J.A., Tao C., Assessing
the practice of biomedical ontology evaluation: Gaps and op-
portunities, Journal of Biomedical Informatics, 2018, 80, 1–13,
10.1016/j.jbi.2018.02.010
[46] Dellschaft K., Staab S., Strategies for the Evaluation of Ontol-
ogy Learning, In Proceedings of the 2008 Conference on Ontol-
ogy Learning and Population: Bridging the Gap Between Text
and Knowledge, IOS Press, Amsterdam, The Netherlands, The
Netherlands, 2008, 253–272
[47] Eddy S.R., What is a hidden Markov model?, Nature Biotechnol-
ogy, 2004, 22(10), 1315, 10.1038/nbt1004-1315
[48] Franzese M., Iuliano A., Hidden Markov Models, in S. Ranganathan, M. Gribskov, K. Nakai, C. Schönbach, eds., Encyclopedia of Bioinformatics and Computational Biology, Academic Press, Oxford, 2019, 753–762, 10.1016/B978-0-12-809633-8.20488-3
[49] Kouemou G.L., History and Theoretical Basics of Hidden Markov
Models, Hidden Markov Models, Theory and Applications, 2011,
10.5772/15205
[50] Binkley D., Davis M., Lawrie D., Morrell C., To camel-
case or under_score, in 2009 IEEE 17th International
Conference on Program Comprehension, 2009, 158–167,
10.1109/ICPC.2009.5090039
[51] Forney G.D., The Viterbi Algorithm: A Personal History, CoRR,
2005, abs/cs/0504020
[52] Viterbi A.J., Viterbi algorithm, Scholarpedia, 2009, 4(1), 6246,
10.4249/scholarpedia.6246
[53] Whetzel P.L., Noy N.F., Shah N.H., Alexander P.R., Nyulas C., Tu-
dorache T., Musen M.A., BioPortal: enhanced functionality via
new Web services from the National Center for Biomedical On-
tology to access and use ontologies in software applications,
Nucleic Acids Research, 2011, 39(Web-Server-Issue), 541–545,
10.1093/nar/gkr469
[54] Silva T.S.D., MacDonald D., Paterson G.I., Sikdar K.C., Cochrane
B., Systematized nomenclature of medicine clinical terms
(SNOMED CT) to represent computed tomography procedures,
Computer Methods and Programs in Biomedicine, 2011, 101(3),
324–329, 10.1016/j.cmpb.2011.01.002
[55] Romero M.M., Jonquet C., O’Connor M.J., Graybeal J., Pazos A.,
Musen M.A., NCBO Ontology Recommender 2.0: an enhanced
approach for biomedical ontology recommendation, Journal of
Biomedical Semantic, 2017, 8(1), 21:1–21:22, 10.1186/s13326-
017-0128-y
[56] Atzeni M., Atzori M., CodeOntology: Querying Source Code in a
Semantic Framework, In Proceedings of the ISWC 2017 Posters
& Demonstrations and Industry Tracks co-located with 16th In-
ternational Semantic Web Conference (ISWC 2017), Vienna, Austria, October 23-25, 2017, 2017
[57] Fenwick M., Weatherby G., Ellis H.J.C., Gryk M.R., Parser Com-
binators: A Practical Application for Generating Parsers for NMR
Data, In Tenth International Conference on Information Technol-
ogy: New Generations, ITNG 2013, 15-17 April, 2013, Las Vegas,
Nevada, USA, 2013, 241–246, 10.1109/ITNG.2013.39
[58] Nierstrasz O., Kurs J., Parsing for agile modeling, Sci-
ence of Computer Programming, 2015, 97, 150–156,
10.1016/j.scico.2013.11.011