Exploring and Cleaning Data in Biomedical Databases
Room: Meeting Room A56 (1st floor)
Data cleaning, defined as the process of detecting and correcting inconsistencies and errors in data, is an essential pre-processing step in many database and bioinformatics tasks. Curated, valid data is a prerequisite for the work of academic researchers and of practitioners outside academia alike. The need for clean data is therefore present across scientific activity and in today's economy.
There are several existing data cleaning tools, with a varying degree of success in dealing with the challenges of this process.
In this thesis, I present the development and functionality of a complete data cleaning tool. The tool is a user-friendly web application that offers an advanced (semi-)automatic data cleaning process for large volumes of heterogeneous data. It runs on top of the madIS system, which provides data processing and analysis functionality via an extended relational database system.
Type errors and numeric outliers are detected automatically during the data profiling process. The user is offered an extensive suite of data analysis, constraint satisfaction, interactive data mining and statistical visualization results to help identify potential errors, outliers, misspellings and violations. In addition, the tool suggests corrections that the user can easily accept or reject.
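To illustrate what profiling a single column for type errors and numeric outliers might involve, here is a minimal Python sketch. It is not the tool's implementation: the function name profile_column is hypothetical, and the interquartile-range (IQR) rule is an assumed outlier criterion, since the abstract does not say which method the tool uses.

```python
import statistics


def profile_column(values):
    """Return (type_error_indices, outlier_indices) for a column of raw strings.

    Hypothetical helper for illustration only. Assumes the column is meant
    to be numeric and uses the 1.5 * IQR rule as the outlier criterion.
    """
    numeric = []       # (row index, parsed value) for cells that parse as numbers
    type_errors = []   # row indices of cells that do not parse as numbers
    for idx, raw in enumerate(values):
        try:
            numeric.append((idx, float(raw)))
        except (TypeError, ValueError):
            type_errors.append(idx)

    # Quartiles are not meaningful on very small samples.
    if len(numeric) < 4:
        return type_errors, []

    q1, _, q3 = statistics.quantiles([v for _, v in numeric], n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [idx for idx, v in numeric if v < low or v > high]
    return type_errors, outliers


# Example: body temperatures with one non-numeric cell and one suspicious value.
column = ["36.6", "37.1", "36.9", "n/a", "37.0", "36.8", "370"]
type_errors, outliers = profile_column(column)
```

In this example the cell "n/a" is reported as a type error and "370" as an outlier. A tool of this kind could then propose a correction for each flagged cell, leaving the user to accept or reject it.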
As part of its data curation functionality, the tool also supports extending the data: row and aggregate operations are available for computing new derived variables. Finally, the tool keeps a history of users' actions, allowing them to undo and redo actions, extract workflows and re-execute them on different or additional data.
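The history mechanism described above can be sketched with two stacks of actions. This is only an illustrative sketch under stated assumptions: the class History and its methods are hypothetical names, and actions are modelled as simple per-value functions rather than the operations the tool actually records.

```python
class History:
    """Hypothetical action log supporting undo, redo and workflow replay."""

    def __init__(self, rows):
        self._initial = list(rows)   # untouched copy of the original data
        self._done = []              # applied actions as (name, function)
        self._undone = []            # undone actions, available for redo

    def apply(self, name, func):
        """Record a new action; a new action invalidates the redo stack."""
        self._done.append((name, func))
        self._undone.clear()

    def undo(self):
        if self._done:
            self._undone.append(self._done.pop())

    def redo(self):
        if self._undone:
            self._done.append(self._undone.pop())

    def workflow(self):
        """Names of the currently applied actions, in order."""
        return [name for name, _ in self._done]

    def replay(self, rows=None):
        """Re-execute the workflow on the original data or on new rows."""
        data = list(self._initial if rows is None else rows)
        for _, func in self._done:
            data = [func(value) for value in data]
        return data


history = History([" a ", "B"])
history.apply("strip", str.strip)
history.apply("lower", str.lower)
cleaned = history.replay()            # both actions applied
history.undo()
after_undo = history.replay()         # only "strip" applied
history.redo()
on_new_data = history.replay([" C "])  # same workflow, different data
```

Because the original data is never modified and the result is recomputed from the action list, undo and redo reduce to moving an action between the two stacks, and the same list doubles as an extractable workflow.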
This talk is the public presentation of the speaker's diploma thesis.