Results of the i-project
of the Ministry of Science, Education and Sports of the Republic of Croatia

Text mining system - system for automatic indexing,
categorization and semantic text exploration

Project manager: Prof. dr. sc. Bojana Dalbelo Bašić
E-mail: bojana.dalbelo@fer.hr


Extract:

Within this project we have developed several systems for automatic or semi-automatic indexing and for automatic classification of text documents and web sites, based on a dozen different methods.

Since the primary data preprocessing step in all text indexing and classification systems has to deal with Croatian morphology, which is rather complex, a system for approximate automatic lemmatization of the Croatian language was developed. Particular attention was paid to the development of efficient algorithms for this system, and these results represent the first experimental data and algorithms applied to the Croatian language in text classification systems. This procedure could be of significant importance for further automatic text classification procedures in Croatian.
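To illustrate what such a preprocessing step does, here is a minimal sketch in Python. It is not the lemmatization system developed in this project: the tiny lexicon, the suffix list and the function names are all hypothetical, and the suffix-stripping fallback is a crude stand-in for an approximate method. It only shows how different inflected forms of a Croatian word can be mapped to one index term before indexing or classification.

```python
import re

# Hypothetical mini-lexicon: inflected word form -> lemma (illustration only).
LEXICON = {
    "kuće": "kuća", "kući": "kuća", "kuću": "kuća", "kućom": "kuća",
    "grada": "grad", "gradu": "grad", "gradovi": "grad",
}

# Illustrative inflectional endings, longest first; not a complete list.
SUFFIXES = ("ovima", "ima", "ama", "om", "em", "a", "e", "i", "u")


def normalize(token):
    """Return the lexicon lemma if known, else strip one common ending."""
    word = token.lower()
    if word in LEXICON:
        return LEXICON[word]
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are not destroyed.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def preprocess(text):
    """Tokenize the text and normalize every token."""
    return [normalize(token) for token in re.findall(r"\w+", text)]
```

With this sketch, the forms "kuće", "kući" and "kuću" all map to the single term "kuća", so a classifier counts them as the same feature.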

Besides the development of systems for text indexing and classification, which have value as stand-alone applications (PEI for HIDRA), test data bases for Croatian text classification have been developed within this project in close cooperation with the Institute of Linguistics, Faculty of Philosophy, University of Zagreb. The basic knowledge for further research in that direction has been collected, which will enable the further development of automatic computational methods for the Croatian language.

Students of the Faculty of Electrical Engineering and Computing in Zagreb and the Faculty of Organization and Informatics in Varaždin worked jointly on this project. Collaboration has also been established with the Institute of Linguistics, Faculty of Philosophy, University of Zagreb, the Croatian Information Documentation Referral Agency (HIDRA) and the Joint Research Centre (Ispra, Milano) of the European Commission.

The project manager gave a presentation at an international conference. The collaborators on this project, prof. dr. sc. Marko Tadić and mr. sc. Maja Cvitaš, participated in an international workshop. A paper related to deep text analysis has been published in a scientific journal. Four diploma theses were completed within this project, and one of them was awarded the "Stanko Turk" prize for outstanding work.

A web site with a detailed description of the results of this i-project was designed, and it is used as a portal for the domain of deep data and text analysis (data and text mining). The web site is also the principal venue for discussing prospective activities of the group of teachers and students from the Faculty of Electrical Engineering and Computing who work on data and text mining.


Content:

1. SOFTWARE SYSTEMS
2. EXPERIMENTAL STANDARD DATA BASES FOR TEXT CLASSIFICATION IN CROATIAN LANGUAGE
3. PARTICIPATION AT INTERNATIONAL CONFERENCES
4. PAPERS
5. COLLABORATION WITH OTHER INSTITUTIONS AND PROJECTS
6. SAS DONATION AND EXPERIMENTS MADE WITH SAS TEXT MINER
7. PRIZES
8. FURTHER PROJECT DEVELOPMENT

1. SOFTWARE SYSTEMS

1.1 DOCUMENT INDEXING SYSTEM WITH EUROVOC DESCRIPTORS
Authors: Prof. dr. sc. Bojana Dalbelo Bašić, prof. dr. sc. Marko Tadić (from the Faculty of Philosophy, University of Zagreb), mr.sc. Maja Cvitaš (Croatian Information Documentation Referral Agency), Jan Šnajder, assistant, students: Hrvoje Eklić, Matija Jančec, Goran Jovanov, Mladen Kolar, Jure Mijić, Frane Šarić, Igor Vukmirović

     
[program]   [PEI documentation]   [PEP documentation]

1.2 AUTOMATIC INDEXING AND CATEGORIZATION SYSTEM FOR WEB SITES BASED ON CROATIAN INTERNET DOMAIN
Authors: Mr.sc. Jasminka Dobša, subproject coordinator, Mr.sc. Danijel Radošević, collaborator, Zlatko Stapić, student, Marinko Zubac, student, from the Faculty of Organization and Informatics of the University of Zagreb, Varaždin


[program]   [documentation]

1.3 SYSTEMS FOR AUTOMATIC DOCUMENT CLASSIFICATION BASED ON SEVERAL DIFFERENT METHODS WITH EMPHASIS ON THE DIFFERENCES BETWEEN CROATIAN AND ENGLISH LANGUAGE
Marko Antonić
Automatic document classification system implementing Support Vector Machines and Bayes classifier methods

Diploma thesis

[program]   [documentation]   [screenshots]

Zvonimir Szorsen
Automatic document classification system implementing decision trees

Diploma thesis

[program]   [documentation]   [screenshots]

Rene Ahel
Automatic document classification system implementing a Bayes classifier and the k-nn algorithm
Diploma thesis
[program]   [documentation]   [screenshots]

Domagoj Tominac
Automatic document classification system implementing k-nn algorithm

Seminar paper

[program]   [documentation]   [screenshots]

Stjepan Buljat
Automatic document classification system implementing Fuzzy ARTMAP algorithm

Seminar paper

[program]   [documentation]   [screenshots]
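As an illustration of the k-nn approach named in the entries above, here is a minimal sketch in Python. It is not taken from any of the listed systems: the bag-of-words representation, the cosine similarity measure, the function names and the toy training set are all assumptions made for the example.

```python
import math
from collections import Counter


def vectorize(text):
    """Represent a document as a bag of lowercase whitespace-separated terms."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def knn_classify(document, training, k=3):
    """Label a document by majority vote among its k most similar training documents."""
    query = vectorize(document)
    ranked = sorted(training,
                    key=lambda item: cosine(query, vectorize(item[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy training set of (text, category) pairs.
TRAINING = [
    ("vlada sabor zakon ministar", "politics"),
    ("sabor zakon izbori stranka", "politics"),
    ("utakmica gol momčad trener", "sport"),
    ("liga utakmica pobjeda gol", "sport"),
]
```

In this sketch terms match only when their surface forms are identical, which is why morphological normalization matters for a highly inflected language such as Croatian.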

2. EXPERIMENTAL STANDARD DATA BASES FOR TEXT CLASSIFICATION IN CROATIAN LANGUAGE

Croatian language databases:

1. The database contains over 92,000 newspaper articles from the "Vjesnik" newspaper, divided into testing, validation and learning sets (the database was constructed by prof. dr. sc. Marko Tadić within the project 0130418 MZOŠ, where the Croatian National Corpus http://www.hnk.ffzg.hr is being developed.)

2. Croatian-English Parallel Newspaper Database: the database contains over 4780 newspaper articles from "Croatia Weekly" (the database was constructed by prof. dr. sc. Marko Tadić within the project 0130418 MZOŠ, where the Croatian-English Parallel Corpus http://www.hnk.ffzg.hr is being developed.)

3. Croatian Legal Texts Database: contains texts of legal documents published by Narodne novine

4. ISIS EUROVOC thesaurus database
[documentation]


3. PARTICIPATION AT INTERNATIONAL CONFERENCES

Jasminka Dobša, Bojana Dalbelo Bašić: Comparison of the Text Mining Methods Based on a Vector Space Model
XXIInd International Biometric Conference (IBC 2004), held in parallel with the Australian Statistical Conference (ASC 2004), Cairns, Australia, 11-16 July 2004.
(http://www.ozaccom.com.au/cairns2004/contsess_mon.html#Mon1)

Addressing the Language Barrier Problem in the Enlarged EU, Automating Eurovoc Descriptor Assignment, JRC Ispra, Italy, 16-17 September 2004.
(Participants: Prof. dr. sc. Marko Tadić, mr.sc. Maja Cvitaš)
(http://www.jrc.cec.eu.int/langtech/Eurovoc/Eurovoc-Workshop_Sept2004.html#Worksh)


4. PAPERS

Dobša, Jasminka; Dalbelo Bašić, Bojana: Comparison of Information Retrieval Techniques: Latent Semantic Indexing and Concept Indexing.
// Journal of Information and Organizational Sciences. 28 (2004), 1-2; 1-17
[paper]


5. COLLABORATION WITH OTHER INSTITUTIONS AND PROJECTS

Institute of Linguistics, Faculty of Philosophy, University of Zagreb, Project 0130418 "Development of Croatian Language Resources"

http://www.ffzg.hr/oling/
Prof. dr. sc. Marko Tadić

All kinds of consultation and assistance related to Croatian language issues, particularly the lemmatization problem


Croatian Information Documentation Referral Agency (HIDRA)
http://www.hidra.hr
Prof. Neda Erceg, director
Mr.sc. Maja Cvitaš

In cooperation with HIDRA, a Semiautomatic Document Indexing System with EUROVOC Descriptors has been developed (see section 1.1 for details).
Further collaboration with HIDRA has been initiated with the purpose of developing and implementing a fully Automatic Document Indexing System using EUROVOC descriptors.
This collaboration will continue between the Faculty of Electrical Engineering and Computing (FER), the Croatian Information Documentation Referral Agency (HIDRA) and the Faculty of Philosophy (FFZG), which are starting a new project organized around that task.
http://www.hidra.hr/hidra/hidran.htm


European Commission
Joint Research Centre - Ispra site
Institute for the Protection and Security of the Citizen (IPSC)

http://www.jrc.cec.eu.int/langtech/index.html#Projects
Dr. Ralf Steinberger
http://www.jrc.cec.eu.int/langtech/RS.html

Cooperation is based on the problems regarding the automatic indexing using EUROVOC descriptors.
Dr. Steinberger will be an invited lecturer at ITI2005, with a topic related to text retrieval.
http://iti.srce.hr


SAS Institute d.o.o.

http://www.sas.com

Marijana Brajac
E-mail: marijana.brajac@slo.sas.com
Maja Škrjanc Lapajne
Customer Relations Manager
Detelova ulica 2
SI-1000 Ljubljana, Slovenija
Tel.: +386 1 230 86 00
Fax: +386 1 230 86 20


Institut Ruđer Bošković
http://knjiznica.irb.hr/hrv/index.html

Mr.sc. Jadranka Stojanovski
Head of the Library
"Ruđer Bošković" Institute
Bijenička cesta 54, P.O.Box 180
10002 ZAGREB


6. SAS DONATION AND EXPERIMENTS MADE WITH SAS TEXT MINER

SAS Adriatic donated the SAS® Text Miner module to this project for experimental purposes. Several experiments were made on Croatian and English text databases. The results showed that SAS text preprocessing is adequate for deep analysis of Croatian text, but that better results are achieved using lemmatized databases. Experiment descriptions and results are given in the attached report.

Bereček Boris, Cvitaš Ana: "Deep text analysis on Newspaper Vjesnik Database and Croatian-English Parallel Corpus Croatia Weekly using SAS® Text Miner", FER Zagreb, 2005.
[report]


7. PRIZES

The "Stanko Turk" prize for an outstanding diploma thesis in computer science in the academic year 2003/2004 was awarded to
Mislav Malenica for the diploma thesis "Text categorization appliance core methods", Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, October 2004.

(Mentor: prof. dr. sc. B. Dalbelo Bašić. The diploma thesis was completed with an emphasis on scientific research)
[diploma thesis]


8. FURTHER PROJECT DEVELOPMENT

Through the established collaborations described in section 5, further project development is planned, including experiments in automatic document and web-site classification as well as automatic document indexing.

All further activities will be published on this web site: http://www.zemris.fer.hr/projects/textmining/.

At least three papers are expected to be published in the near future.



Project manager:

Prof. dr. sc. Bojana Dalbelo Bašić


February, 2005.


© All rights reserved. FER 2004