The methodology adopted to carry out the research is the workflow defined and used by the mythLOD project. As described by Pasqual and Tomasi, the process is a series of “consequential and iterative steps” that can be summarised as follows: (1) analysis of input data; (2) data management, which can be broken down into (2.1) data modelling and (2.2) data cleaning, entity linking and dataset production; (3) dataset testing and implementation of visualisation tools (Pasqual and Tomasi 2022). The analysis of input data (1) examines the source data in order to assess the consistency of the database and the quality of its metadata. From this analysis, three main considerations emerged: the presence of a significant amount of qualitative data, i.e. non-numeric data consisting of character strings; the lack of reference models, both for the structure (ontologies) and for the data (for example, controlled vocabularies or taxonomies); and the presence of a notably high number of “complex authoriality” records. In the data management phase (2), as stated by Pasqual and Tomasi (2022), “the whole process of the conversion of the [...] collection from a tabular format into LOD” is performed on the basis of the analysis described in (1). The conceptual model design (2.1) recommends the reuse of ontologies and semantic structures for the domain representation, with FRBRoo and CIDOC-CRM as the main reference models; data cleaning, data alignment and RDF production via RDFlib (2.2) are recommended for the standardisation of ambiguous data and the creation of valid RDF URIs. The mythLOD methodology also suggests data alignment (e.g. with authorities); in DHLOD the internal Collection Taxonomy has been aligned with the AAT68 taxonomy, cf. section 3.4.2.2 - Data cleaning, Data alignment and RDF dataset production. With regard to data validation and testing (3), the testing was conducted via competency questions defined with domain experts from Fondazione Carisbo: the corresponding SPARQL queries were run over the Knowledge Base and the experts’ feedback was collected. As a result, the data visualisation step proposed by the mythLOD methodology was not carried out, since the query results already provided the necessary feedback.
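As an illustration of the testing step (3), the sketch below shows how a competency question could be translated into a SPARQL query and run over the Knowledge Base with RDFlib. The query, the file name dhlod.ttl, and the use of rdfs:label are hypothetical assumptions made only to exemplify the kind of check performed; it assumes artworks are typed with the CIDOC-CRM class E22_Man-Made_Object.

```python
from rdflib import Graph

# Load the RDF dataset (the file name is illustrative).
g = Graph()
g.parse("dhlod.ttl", format="turtle")

# Hypothetical competency question: "Which artworks are recorded in the
# dataset, and what are their titles?" expressed over CIDOC-CRM terms.
query = """
PREFIX crm: <http://www.cidoc-crm.org/cidoc-crm/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?artwork ?label WHERE {
    ?artwork a crm:E22_Man-Made_Object ;
             rdfs:label ?label .
}
"""

# Print the results so that domain experts can review them.
for row in g.query(query):
    print(row.artwork, row.label)
```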
The source data are a partial extraction from the database of the Digital Humanities Fondazione Carisbo project. The extraction was made available by the database manager as a JSON file. Each database item is described by up to 134 metadata fields (expressed as key-value pairs) covering descriptive artwork metadata, metadata about the database entry, an internal taxonomy for categorising the items, and operational metadata. While analysing the source materials, I selected a set of metadata fields relevant to the description of the artworks. The result of this second data analysis is reported in table 3.1, where each row corresponds to one selected metadata field. Each row specifies: the field name (metadata field column), with its original form in the relational database in brackets; an illustrative original record (original record column), usually with one example, or two when different syntax options are present; a definition of the drawbacks (problem definition column) that must be solved during the data cleaning activity; and the proposed solutions for addressing the identified problems (possible solution column). Finally, the first column clusters the metadata fields into “groups”: this is useful both for the model design and for the data manipulation process.
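As a minimal sketch of this first inspection, the snippet below loads the JSON extraction and counts how often each metadata key occurs across the items. The file name extraction.json and the assumption that the file contains a list of key-value objects are hypothetical.

```python
import json
from collections import Counter

# Load the JSON extraction provided by the database manager
# (file name and structure are illustrative: a list of item objects).
with open("extraction.json", encoding="utf-8") as f:
    items = json.load(f)

# Count how many items use each metadata field, to spot sparse or unused keys.
field_usage = Counter(key for item in items for key in item)

for field, count in field_usage.most_common(20):
    print(f"{field}: {count}")
```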
The modelling activity was carried out by reusing existing ontologies, as proposed by (Pasqual and Tomasi 2022). Although mythLOD was modelled on the Digital Hermeneutics conceptual model (Daquino, Pasqual, and Tomasi 2020), that model was not suited to the DHLOD data: the DHLOD data do not involve subjective information (e.g. attributions, titles, and dates), and the provenance information available in the Digital Humanities Fondazione Carisbo Database is not considered sufficiently precise. For example, the database records provenance information about the annotator who performed the digitisation of each catalographic record, but no information is provided about the cataloguer. As a result, the Digital Hermeneutics conceptual structure was excluded and a plain RDF triple structure was adopted.
Throughout the data modelling activity, a set of mind maps was designed without employing any term from existing ontologies, in order to arrange all the pieces of content in a triple-like structure. I then reviewed the literature on existing ontologies (presented in sections 2.1 - Models and ontologies for cultural heritage domain and 2.2 - Conversion from a simple format to LOD) and selected terms from existing ontologies to refactor the model. (1) FRBRoo69, (2) the CIDOC-CRM conceptual model70, (3) its LinkedArt application71 and (4) the GND ontology72 have been reused to model the descriptive metadata. Additionally, the representation of assertions (e.g. uncertain attributions, dates and titles) has been expressed in a triple-like fashion by reusing the CIDOC-CRM conceptual model, for the sake of model consistency.
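As a sketch of how terms from the reused ontologies can be combined, the snippet below declares a CIDOC-CRM namespace in RDFlib and types a sample artwork together with its production event (FRBRoo, LinkedArt and GND terms would be declared analogously). The resource URIs, the base namespace and the specific properties chosen are illustrative assumptions, not the actual DHLOD model.

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS

# Namespace of one of the reused models (CIDOC-CRM).
CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
BASE = Namespace("https://example.org/dhlod/")  # illustrative base namespace

g = Graph()
g.bind("crm", CRM)

artwork = BASE["artwork/1"]       # hypothetical resource URI
production = BASE["production/1"]

# Descriptive metadata expressed with CIDOC-CRM terms.
g.add((artwork, RDF.type, CRM["E22_Man-Made_Object"]))
g.add((artwork, RDFS.label, Literal("Illustrative artwork title")))
g.add((artwork, CRM["P108i_was_produced_by"], production))
g.add((production, RDF.type, CRM["E12_Production"]))

print(g.serialize(format="turtle"))
```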
The data cleaning operations were performed in order to produce strings that could be correctly transformed into valid URIs in the next phase of the process (RDF production). Following the analysis illustrated in the previous paragraph (cf. 3.4.1 - Source data analysis and competency questions definition), the proposed solutions were put into practice through the following, mostly semi-automatic, data cleaning operations:
As anticipated in section 3.3 - Methodology, data alignment was carried out only for the values of the “technique” metadata field and for the internal database category taxonomy (Collection Taxonomy). The technique used by an author to create an artwork is an objective concept; however, it can be expressed in multiple ways and interpreted subjectively, especially when no common taxonomy has been established in the annotation process. For this reason, it was necessary to add further detail and to increase the consistency of the concepts. This goal was reached by manually aligning the technique values with the corresponding terms in the Art & Architecture Thesaurus (AAT) by the Getty Research Institute. On the practical side, after the conversion of the strings into machine-readable terms (cf. Data cleaning), each label was assigned a term from the controlled vocabulary, as reported in table 3.2.
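A minimal sketch of how such an alignment could be recorded programmatically is given below. The cleaned technique labels, the AAT identifiers in the mapping and the use of skos:exactMatch are hypothetical placeholders and assumptions; the actual alignment was curated manually and is reported in table 3.2.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import SKOS

AAT = Namespace("http://vocab.getty.edu/aat/")
BASE = Namespace("https://example.org/dhlod/")  # illustrative base namespace

# Hypothetical mapping from cleaned technique labels to AAT identifiers;
# the identifiers here are placeholders, not real AAT terms.
technique_to_aat = {
    "etching": "300000001",
    "oil_painting": "300000002",
}

g = Graph()
g.bind("skos", SKOS)
for label, aat_id in technique_to_aat.items():
    technique_uri = BASE[f"technique/{label}"]
    # Link the local technique concept to the corresponding AAT term.
    g.add((technique_uri, SKOS.exactMatch, AAT[aat_id]))

print(g.serialize(format="turtle"))
```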
The second alignment operation was performed on the terms of the first hierarchical level of the Collections taxonomy, which was created to formalise the internal organisation of the database (cf. 4.1.1 - The collection taxonomy).
The last data manipulation operation of the workflow is the actual conversion of the data into RDF triples. This process was performed automatically via the Python library RDFlib75. For every triple, the subject, predicate and object were defined using terms from the ontologies selected in the model or by creating URIs through the URIRef class. Where the model requires blank nodes, the corresponding identifiers were generated via the BNode class. The resulting file, in Turtle format, is presented in section 4.2 - The RDF dataset.
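The sketch below illustrates this kind of triple production with RDFlib, combining URIs, a blank node and a literal, and serialising the graph to Turtle. The resource URIs, the base namespace and the use of a blank node for a title are illustrative assumptions rather than the actual DHLOD conversion script.

```python
from rdflib import Graph, Namespace, Literal, BNode
from rdflib.namespace import RDF, RDFS

CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
BASE = Namespace("https://example.org/dhlod/")  # illustrative base namespace

g = Graph()
g.bind("crm", CRM)

# Subject URI minted from a cleaned identifier (illustrative).
artwork = BASE["artwork/1"]
g.add((artwork, RDF.type, CRM["E22_Man-Made_Object"]))

# A blank node used where the model does not require a stable identifier,
# here for a title treated as a CIDOC-CRM E35_Title (illustrative pattern).
title = BNode()
g.add((artwork, CRM["P102_has_title"], title))
g.add((title, RDF.type, CRM["E35_Title"]))
g.add((title, RDFS.label, Literal("Illustrative artwork title")))

# Serialise the graph to a Turtle file, as done for the DHLOD dataset.
g.serialize(destination="dhlod.ttl", format="turtle")
```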