Yun Huang, Alina Lungeanu, and Chuang Zhang (SONIC lab), together with their collaborators from NICO (Mike Stringer, Jonathan Haynes) and AMARAL Lab (Dan McClary, Xiaohan Zeng), won the data pre-processing challenge at the Mining the Digital Traces of Science (MDTS11) International Workshop with their submission “Structured and Relational Information Extraction”.

Working from a dataset provided by Thomson ISI Web of Science that covers embryology and embryonic science from 1956 to 2010, the team developed AWK and Python scripts that extract more than 30 attributes related to articles, issues, and authors and construct 16 relational tables in MySQL.
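
As a rough illustration of this kind of extraction step, the sketch below parses a field-tagged plain-text export into one dictionary per record. It is not the team's script. It assumes the input resembles the tagged export format commonly used by Web of Science (a two-letter tag at the start of each field, indented continuation lines, and an "ER" line closing each record); the function name parse_records and the sample records are hypothetical.

```python
from collections import defaultdict

def parse_records(text):
    """Parse field-tagged records into a list of dicts mapping tag -> list of values.

    Assumed layout (not confirmed by the source): each field starts with a
    two-letter tag, continuation lines are indented, and "ER" ends a record.
    """
    records = []
    current = defaultdict(list)
    tag = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank separator lines
        if line.startswith("ER"):
            # End of record: store it and start a fresh one
            records.append(dict(current))
            current = defaultdict(list)
            tag = None
        elif line[:2].strip():
            # New field: two-letter tag, a space, then the value
            tag = line[:2]
            current[tag].append(line[3:].strip())
        elif tag is not None:
            # Indented continuation line: belongs to the previous tag
            current[tag].append(line.strip())
    return records

# Hypothetical sample data, made up for illustration only
SAMPLE = """PT J
AU Smith, J
   Lee, K
TI A hypothetical title
PY 1999
ER

PT J
AU Lee, K
TI Another hypothetical title
PY 2001
ER
"""

records = parse_records(SAMPLE)
for record in records:
    print(record["AU"], record["PY"])
```

Keeping every field as a list of values makes multi-valued fields such as authors straightforward to load into a separate relational table later.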

Using SQL stored procedures, users can easily extract author-publication, author-citation, co-authorship, and citation similarity relations, as well as related author keywords, keyword plus, addresses, publication years, and subject categories, for a subset of authors or for all authors.
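
To make the co-authorship relation concrete, the sketch below derives it from an author-publication table with a self-join. The team's actual implementation uses MySQL stored procedures, which are not shown here; this example substitutes Python's built-in sqlite3 module with an in-memory database so that it is self-contained. The table name author_publication, its columns, and the sample rows are assumptions made for illustration.

```python
import sqlite3

# In-memory database with a hypothetical author-publication table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE author_publication (author_id INTEGER, publication_id INTEGER)"
)
conn.executemany(
    "INSERT INTO author_publication VALUES (?, ?)",
    [(1, 100), (2, 100), (3, 100), (1, 101), (2, 101), (3, 102)],
)

def coauthorship(conn):
    """Return (author_a, author_b, shared_publications) for each co-author pair."""
    query = """
        SELECT a.author_id, b.author_id, COUNT(*) AS shared
        FROM author_publication AS a
        JOIN author_publication AS b
          ON a.publication_id = b.publication_id
         AND a.author_id < b.author_id  -- count each unordered pair once
        GROUP BY a.author_id, b.author_id
        ORDER BY a.author_id, b.author_id
    """
    return conn.execute(query).fetchall()

pairs = coauthorship(conn)
print(pairs)
```

The condition a.author_id < b.author_id keeps each pair once and excludes an author paired with themselves; restricting the query to a subset of authors would only require an additional WHERE clause.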

The data pre-processing scripts make it easier to collaborate on designing and developing innovative tools for accessing scientific publication databases (such as ISI Web of Science), giving users new methods of navigation, interaction, and data visualization for databases of this kind.

