Results of the i-project
of the Ministry of Science, Education and Sports of the Republic of Croatia
Text mining system - system for automatic indexing,
categorization and semantic text exploration
Project manager: Prof. dr. sc. Bojana Dalbelo Bašić
E-mail: bojana.dalbelo@fer.hr
Within this project we developed several systems for automatic or semi-automatic indexing
and for automatic classification of text documents and web sites, based on a dozen different methods.
Since the primary preprocessing step in any text indexing or classification system has to deal with
Croatian morphology, which is rather complex, a system for approximate automatic lemmatization of the Croatian language
was developed. Particular attention was paid to developing efficient algorithms for this system, and these
results represent the first experimental data and algorithms applied to the Croatian language in text-classification systems.
This procedure could be of significant importance for further automatic text classification procedures in Croatian.
Besides the systems for text indexing and classification, which have value as stand-alone applications (PEI for HIDRA),
test databases for Croatian text classification were developed within this project in close cooperation with the Institute of Linguistics,
Faculty of Philosophy, University of Zagreb. Basic knowledge for further research in this direction has been collected, which will
enable the further development of automatic computational methods for the Croatian language.
Students of the Faculty of Electrical Engineering and Computing in Zagreb and the Faculty
of Organization and Informatics in Varaždin worked jointly on this project. Collaboration was also established with the
Institute of Linguistics, Faculty of Philosophy, University of Zagreb, the Croatian Information
Documentation Referral Agency (HIDRA) and the Joint Research Center (Ispra, Italy) of the European Commission.
The project manager gave a presentation at an international conference. The collaborators on
this project, prof. dr. sc. Marko Tadić and mr. sc. Maja Cvitaš, participated in an international workshop.
A paper related to deep text analysis has been published in a scientific journal. Four diploma theses were completed
within this project, and one of them was awarded the "Stanko Turk" prize for outstanding work.
A web site with a detailed description of the results of this i-project was designed, and it has been used as a portal for
the domain of deep data and text analysis (data and text mining). This web site is also the principal venue for discussing
prospective activities of the group of teachers and students from the Faculty of Electrical Engineering and Computing who work
on data and text mining.
1.1 DOCUMENT INDEXING SYSTEM WITH EUROVOC DESCRIPTORS
Authors: Prof. dr. sc. Bojana Dalbelo Bašić, prof. dr. sc. Marko Tadić (from the Faculty of Philosophy, University of Zagreb),
mr.sc. Maja Cvitaš (Croatian Information Documentation Referral Agency), Jan Šnajder, assistant, students: Hrvoje Eklić, Matija Jančec, Goran Jovanov,
Mladen Kolar, Jure Mijić, Frane Šarić, Igor Vukmirović
[program]
[PEI documentation]
[PEP documentation]
1.2 AUTOMATIC INDEXING AND CATEGORIZATION SYSTEM FOR WEB SITES BASED ON THE CROATIAN INTERNET DOMAIN
Authors: Mr. sc. Jasminka Dobša, subproject coordinator, Mr. sc. Danijel Radošević, collaborator,
students Zlatko Stapić and Marinko Zubac, from the Faculty of Organization and Informatics of the University of Zagreb, Varaždin
[program]
[documentation]
Croatian language databases:
1. Database containing over 92,000 newspaper articles from the "Vjesnik" newspaper (testing, validation and learning sets).
(The database was constructed by prof. dr. sc. Marko Tadić within MZOŠ project 0130418, in which the Croatian National Corpus
http://www.hnk.ffzg.hr is being developed.)
2. Croatian-English Parallel Newspaper Database: contains over 4,780 newspaper articles from "Croatia Weekly". (The database was constructed by prof. dr. sc. Marko Tadić within MZOŠ project 0130418, in which the Croatian-English Parallel Corpus
http://www.hnk.ffzg.hr is being developed.)
3. Croatian Legal Texts Database: contains text from legal documents published by Narodne novine
4. ISIS EUROVOC thesaurus database
[documentation]
Jasminka Dobša, Bojana Dalbelo Bašić: Comparison of the Text Mining Methods Based on a Vector Space Model
XXIInd International Biometric Conference (IBC 2004), held in parallel with the Australian Statistical Conference (ASC 2004),
Cairns, Australia, 11-16 July 2004.
(http://www.ozaccom.com.au/cairns2004/contsess_mon.html#Mon1)
Addressing the Language Barrier Problem in the Enlarged EU, Automating Eurovoc Descriptor Assignment, JRC Ispra,
Italy, 16-17 September 2004.
(Participants: Prof. dr. sc. Marko Tadić, mr. sc. Maja Cvitaš)
(http://www.jrc.cec.eu.int/langtech/Eurovoc/Eurovoc-Workshop_Sept2004.html#Worksh)
Dobša, Jasminka; Dalbelo Bašić, Bojana: Comparison of Information Retrieval Techniques: Latent Semantic Indexing
and Concept Indexing.
// Journal of Information and Organizational Sciences. 28 (2004), 1-2; 1-17
[paper]
Institute of Linguistics, Faculty of Philosophy, University of Zagreb, Project 0130418 "Development of Croatian Language Resources"
http://www.ffzg.hr/oling/
Prof. dr. sc. Marko Tadić
All kinds of consultation and assistance related to Croatian language issues, particularly the lemmatization problem
Croatian Information Documentation Referral Agency (HIDRA)
http://www.hidra.hr
Prof. Neda Erceg, director
Mr.sc. Maja Cvitaš
Within the cooperation with HIDRA, a Semiautomatic Document Indexing System with EUROVOC Descriptors has been developed (see the details here ###html-link).
Further collaboration with HIDRA has been initiated in order to develop and implement a fully Automatic Document Indexing System using
EUROVOC descriptors.
This collaboration will continue between the Faculty of Electrical Engineering and Computing (FER),
the Croatian Information Documentation Referral Agency (HIDRA) and the Faculty of Philosophy (FFZG), which are starting a new project
organized around that task.
http://www.hidra.hr/hidra/hidran.htm
SAS Adriatic donated the SAS® Text Miner module to this project for experimental purposes. Several experiments
were carried out on Croatian and English text databases. The results showed that SAS text preprocessing is good enough
for deep analysis of Croatian-language text, but that better results are achieved using lemmatized databases. Descriptions
of the experiments and their results are given in the attached report.
Bereček Boris, Cvitaš Ana: "Deep text analysis on Newspaper Vjesnik Database and Croatian-English Parallel Corpus
Croatia Weekly using SAS® Text Miner", FER Zagreb, 2005.
[report]
The "Stanko Turk" award for an outstanding diploma thesis in computer science in the academic year 2003/2004 was given to
Mislav Malenica for the diploma thesis "Text categorization appliance core methods", Faculty of Electrical
Engineering and Computing, University of Zagreb, Zagreb, October 2004.
(Mentor: prof. dr. sc. B. Dalbelo Bašić. The diploma thesis was completed with an emphasis on scientific research.)
[diploma thesis]
Through the established collaborations mentioned in section 5, further project development is planned, including
experiments in automatic document and web-site classification as well as automatic document indexing.
All further activities will be published on this web site: http://www.zemris.fer.hr/projects/textmining/.
At least three papers are expected to be published in the near future.
Project manager:
Prof. dr. sc. Bojana Dalbelo Bašić
February 2005