The main goal of the PRiSMHA project is to demonstrate that
a rich semantic representation of the content of archival historical documents
can significantly improve access to archival resources, and that building it is sustainable,
thanks to collaborative semantic annotation,
supported by automatic Information Extraction techniques.
THE SEMANTIC MODEL
HERO (Historical Event Representation Ontology)
is a modular ontology that covers the different aspects of historical events:
the event type, the place where it occurred,
the time when it took place, and the participants in the event,
with their roles.
HERO is available at w3id.org/hero/HERO.
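As a rough illustration of the aspects listed above (type, place, time, participants and their roles), an event description can be pictured as a small set of subject-predicate-object statements. The sketch below is not HERO itself: the `ex:` identifiers, the property names, and the sample values are hypothetical and chosen only for readability, and the tiny in-memory store stands in for a real RDF library.

```python
from typing import List, NamedTuple, Optional, Tuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical identifiers for illustration; none are taken from HERO.
EVENT = "ex:event_01"
PARTICIPATION = "ex:participation_01"

TRIPLES = [
    Triple(EVENT, "ex:hasType", "ex:Strike"),
    Triple(EVENT, "ex:hasPlace", "ex:Turin"),
    Triple(EVENT, "ex:hasTime", "1969"),
    # An intermediate node ties a participant to the role it plays.
    Triple(EVENT, "ex:hasParticipation", PARTICIPATION),
    Triple(PARTICIPATION, "ex:hasParticipant", "ex:workers_group_01"),
    Triple(PARTICIPATION, "ex:hasRole", "ex:Organizer"),
]


def match(store: List[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> List[Triple]:
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [t for t in store
            if (s is None or t.subject == s)
            and (p is None or t.predicate == p)
            and (o is None or t.obj == o)]


def participants_with_roles(store: List[Triple], event: str) -> List[Tuple[str, str]]:
    """Collect (participant, role) pairs attached to an event."""
    pairs = []
    for link in match(store, s=event, p="ex:hasParticipation"):
        participant = match(store, s=link.obj, p="ex:hasParticipant")[0].obj
        role = match(store, s=link.obj, p="ex:hasRole")[0].obj
        pairs.append((participant, role))
    return pairs
```

The intermediate "participation" node is one common way to attach a role to a participant within a specific event; whether HERO models roles this way is an assumption of this sketch.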
HERO-900 is a domain ontology that refines HERO by introducing notions relevant to
the history of the 20th century, and in particular to the
student and worker protests of the years 1968-1969 in Italy,
which is the domain selected for the PRiSMHA project.
An application version of HERO has been implemented in OWL 2 DL and
drives the annotation system, representing its domain knowledge.
This version of HERO currently contains 447 classes, 380 properties, 161 individuals and 4,661 logical axioms.
THE ANNOTATION PLATFORM
The figure shows the main modules of the PRiSMHA overall architecture.
The ontology (HERO+HERO-900) drives the user interaction of both the
Crowdsourcing Platform UI and the Final UI, and represents the "vocabulary" of the
Semantic KB, which is implemented as an RDF triplestore and
contains assertions about domain entities.
PRiSMHA implements a hybrid strategy, by integrating user-generated content
and automatic techniques:
user-generated content is provided through the Crowdsourcing Platform,
while automatic techniques are represented by the Information Extraction (IE) module and
the LOD linking module.
The IE module supports user annotation by identifying relevant entities
within the document and providing information about them.
The LOD linking module offers suggestions retrieved from external datasets
(Wikidata, in the current prototype), and also makes it possible to link a PRiSMHA entity to an external one.
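To give an idea of what retrieving suggestions from Wikidata can involve, the sketch below builds a request URL for Wikidata's public `wbsearchentities` API action. How the PRiSMHA module actually queries Wikidata is not described here, so the function name, the default language, the result limit, and the sample label are all assumptions made for illustration. The code only builds the URL and performs no network call.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_wikidata_search_url(label: str, language: str = "it", limit: int = 5) -> str:
    """Build a wbsearchentities request URL for a label (no network call)."""
    params = {
        "action": "wbsearchentities",  # entity search by label or alias
        "search": label,
        "language": language,          # assumed default: Italian labels
        "limit": limit,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


# Hypothetical label picked by an annotator.
url = build_wikidata_search_url("Antonio Gramsci")
```

Each candidate returned by such a search carries a Wikidata identifier, which is the kind of external reference a local entity could be linked to.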
Data in the Semantic KB are accessible through both a SPARQL endpoint and a
Final User Interface (UI).
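For readers unfamiliar with SPARQL endpoints, the following minimal sketch prepares a query request according to the SPARQL 1.1 Protocol (a GET request carrying a `query` parameter). The endpoint URL, the `ex:` vocabulary, and the query are hypothetical placeholders, not the actual PRiSMHA endpoint or the HERO vocabulary. The request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and vocabulary; the real ones are not given here.
ENDPOINT = "https://example.org/prismha/sparql"

QUERY = """
PREFIX ex: <https://example.org/vocab#>
SELECT ?event ?place WHERE {
  ?event ex:hasType ex:Strike ;
         ex:hasPlace ?place .
}
LIMIT 10
"""


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Prepare a SPARQL 1.1 Protocol GET request asking for JSON results."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
```

Passing the prepared request to `urllib.request.urlopen` would submit it; against a real endpoint, the JSON response would list one binding per matching event.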
DIGITIZATION AND OCR OF ARCHIVAL DOCUMENTS
At the beginning of the project, we selected 200 documents from the archive collections of the
Fondaz. Ist. piemontese A. Gramsci,
containing newspaper and review articles, leaflets, and a few images,
as well as some textual biographies of prominent figures in 20th-century Italian history,
specifically those involved in the historical period in focus.
The 200 documents from the archives were digitized and uploaded to the
Polo del '900 archival platform
9centRo, together with their standard metadata.
These documents were analyzed to select candidates for OCR:
only 13 were considered unsuitable, and the process yielded a reasonably satisfactory text
for the remaining 187 documents.
Beyond the standard advantages of having the full text available,
in PRiSMHA the full text is essential for applying Information Extraction techniques
that support users in the annotation process.