I have been working at FEUP InfoLab since 2015, always focusing on entity-oriented search. I began this chapter of my career as a researcher on the ANT search engine. I then became a MAP-i PhD student, working towards my thesis on "Graph-Based Entity-Oriented Search". My doctoral work explores the viability of graphs and hypergraphs as a joint representation for text and knowledge, and as a way to generalize entity-oriented search tasks.
I have also done research for nearly two years at Laboratório SAPO/U.Porto, at the Faculty of Engineering of the University of Porto. I've worked across many areas, but my main focus has always been network science. I'm deeply interested in the relationships between real-world entities and their grouping behavior.
I have also worked for a year at CRACS/INESC TEC, at the Faculty of Sciences of the University of Porto, pursuing a similar line of work in the context of the Breadcrumbs Project, a computational journalism platform that takes advantage of social networking behavior to organize news fragments. In addition to community detection and data visualization, I was also able to devote some time to developing skills in machine learning and data mining.
Much of my work has been done in an information retrieval setting, where I focused on the link analysis task. I participated in the TREC 2010 Blog Track, experimenting with the h-index, a metric typically used to measure the impact of a scientist's work, as a query-independent feature for the blog distillation task. Continuously working with blog networks and other semantically rich networks has brought to my attention the importance of developing better methodologies for the study of this kind of heterogeneous network.
I devoted a couple of years to the study of community detection methodologies and to the analysis of groups induced by social behavior in different types of data (e.g. networks of people, places, tags, documents, etc.). This is a subject I began exploring in my Master's thesis, where I studied the link ecosystem of the Portuguese blogosphere. I have always been very interested in understanding human relationships and in using networks and their community structure to solve problems in different areas. I have used community detection to organize documents, to disambiguate between people who share a name, to integrate multidimensional information, and as an enabler for topic detection through the clustering of similar documents.
I have done some research in the area of machine learning, having a general understanding of supervised classification, but mainly focusing on topic models, frequently working with methodologies such as Latent Dirichlet Allocation. I have also experimented with alternative topic modeling techniques based on networks of words and community detection.
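As a minimal illustration of the kind of topic modeling described above, the sketch below fits a tiny LDA model with scikit-learn (an assumption for illustration only; it was not necessarily the tooling used in this work):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A toy corpus with two obvious themes: networks and music.
docs = [
    "community detection in social networks",
    "graph mining and network communities",
    "music recommendation for user playlists",
    "playlist generation and music discovery",
]

# Bag-of-words term counts, the usual input for LDA.
counts = CountVectorizer().fit_transform(docs)

# Fit a two-topic model; each row of components_ holds one topic's
# word weights.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Document-topic mixtures: one probability distribution per document.
doc_topics = lda.transform(counts)
print(doc_topics.shape)  # (4, 2)
```

Each row of `doc_topics` sums to one, so a document can be assigned to its dominant topic or kept as a soft mixture.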
After working with so much data, I felt a strong interest in creating rich visualizations, so I worked for six months on real-time data summarization and visualization. I had the chance to face the problems that a high-flow data stream presents, both in storage and in visualization. With this knowledge I built a prototype for Laboratório SAPO, codenamed Ciclope, where I used the SAPO Blogs clickstream to create a set of chart and tree visualizations that allow an author to monitor the traffic of his or her blog. I have also developed two visualization systems capable of displaying a multidimensional network of news clips, with relationships based on the coreference of entities of different types, and defining three dimensions corresponding to three of the five Ws of journalism: who, where and when. These are available as two widgets in the user dashboard of the Breadcrumbs system.
2003-2011, 2012-2013, 2015-Present
I worked at the Faculty of Engineering of the University of Porto as an external researcher for Laboratório SAPO/U.Porto during a one-year period, from June 2010 to June 2011. I got my MSc in Informatics and Computing Engineering from this same institution, where I was also accepted as a PhD candidate for ProDEI, the Doctoral Program in Informatics Engineering, but unfortunately had to drop out for lack of funding.
The Faculty of Engineering, and more specifically the Department of Informatics Engineering, is increasingly investing in research, having inaugurated new labs and improved existing ones during these last few years. A great effort has been put into the people of this institution, giving them the conditions, the motivation and the guidance necessary to pursue their research interests, in an attempt to positively contribute to the international engineering and scientific community.
I then returned to Laboratório SAPO/U.Porto to pursue new research goals in the area of music retrieval and recommendation, while attempting to maintain a relationship with the Department of Computer Science at the Faculty of Sciences, through PDCC, the Doctoral Program in Computer Science.
For a brief period, I worked in the industry, as a software engineer and data scientist, but have recently come back to the Department of Informatics Engineering, to work as a researcher on entity-oriented search at FEUP InfoLab.
2011-2012
I have worked on the Breadcrumbs project, at the Center for Research in Advanced Computing Systems, an associate unit of INESC TEC, that operates at the Faculty of Sciences of the University of Porto.
The Faculty of Sciences has a privileged location, surrounded by activity and covered in a wide range of tree species, nurtured by the presence of the Botanical Garden, one of the most beautiful green areas of the city of Porto and an attraction for students of plant biology and landscape architecture.
Devezas, J. (2010). Link Ecosystem of the Portuguese Blogosphere. Master's Thesis, Faculdade de Engenharia da Universidade do Porto, Porto, Portugal.

Devezas, J. (2011). An Overview of the Graph Database Paradigm. Breadcrumbs Project Report, CRACS/INESC TEC, Faculdade de Ciências da Universidade do Porto, Porto, Portugal.

Devezas, J., S. Nunes, and C. Ribeiro (2011). Overlapping Community Detection. Labs SAPO/UP Report, Laboratório SAPO/U.Porto, Universidade do Porto, Porto, Portugal.
As a researcher at Laboratório SAPO/U.Porto, during the one-year period from June 2010 to June 2011, I was able to work on two main projects and participate in an information retrieval competition. From November 2011 to November 2012, I worked on the Breadcrumbs Project, researching and implementing several algorithms for community detection, topic modeling, event detection, text mining and search. I also developed two visualization tools to display and explore the produced results. From December 2012 to November 2013, I went back to Laboratório SAPO/U.Porto to work on music information retrieval and recommender systems, namely on the Juggle project, for music discovery and location-based recommendation to groups.
The Juggle project aimed at improving music discovery through a hybrid large-scale recommender system, capable of handling and combining different types of data: text and audio content, context from elements such as tags or location, and collaborative information from user profiles.
We tackled the challenges of multimodality and scale by developing a graph-based recommender system built on Neo4j, a popular and robust graph database that facilitated the modeling of content, context and collaborative information as nodes and edges in a graph. One of the biggest challenges was the translation of audio content into relationships in the graph, specifically the pairwise comparison of the audio features of a million songs, which we solved using an approximate search algorithm borrowed from image retrieval.
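The specific approximate search algorithm is not detailed above; one common family used for exactly this pairwise-comparison problem is random-projection locality-sensitive hashing, sketched here as an illustrative stand-in (toy data, NumPy only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for per-song audio feature vectors.
n_songs, n_features, n_bits = 1000, 32, 16
features = rng.normal(size=(n_songs, n_features))

# Random hyperplanes; the sign pattern of the projections gives each
# song a compact binary signature.
planes = rng.normal(size=(n_features, n_bits))
signatures = (features @ planes) > 0

# Bucket songs by signature: similar vectors tend to share a bucket,
# so each song is only compared against its bucket, never all songs.
buckets = {}
for i, sig in enumerate(signatures):
    buckets.setdefault(sig.tobytes(), []).append(i)

def candidates(song_id):
    """Approximate neighbor candidates for one song."""
    return [j for j in buckets[signatures[song_id].tobytes()] if j != song_id]

print(len(candidates(0)))
```

Exact similarity is then computed only within each small candidate set, turning a quadratic all-pairs comparison into something tractable for a million songs.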
Our recommendation algorithm was mainly based on neighborhood methods for collaborative filtering, but we also used metrics from text retrieval to boost the relevance of tags in the long tail, without completely disregarding tag popularity, in order to build playlists that better supported the discovery of new music.
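A minimal sketch of the neighborhood idea, with toy data and a plain cosine user-neighborhood (the tag-weighting component is not shown):

```python
import numpy as np

# Toy user-item rating matrix (users x songs); zeros are unknowns.
R = np.array([
    [5, 4, 0, 0],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# Cosine similarity between users: the "neighborhood".
norms = np.linalg.norm(R, axis=1, keepdims=True)
sim = (R / norms) @ (R / norms).T
np.fill_diagonal(sim, 0.0)

# Predict each user's scores as a similarity-weighted average of the
# other users' ratings.
pred = sim @ R / sim.sum(axis=1, keepdims=True)

# Recommend user 0's best unseen item.
user = 0
unseen = np.where(R[user] == 0)[0]
print(unseen[np.argmax(pred[user, unseen])])  # 2
```

Item 2 wins because user 0's nearest neighbor (user 1) rated it, while item 3 is only endorsed by dissimilar users.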
Juggle Mobile was developed as a new branch of the Juggle project and, while the name indicates that it targets mobile devices, since we decided to build it using responsive design it was also made available on the desktop, as a web application. Juggle Mobile aimed at delivering an artist-based experience on mobile devices, supporting both individual and group-based discovery of artists through their biographies.
Juggle Mobile lets users create an account and fill in their taste profiles, either through our random artist rating system or by importing their existing music information from Facebook or Last.fm. The data from these different sources is combined using our weighting model and used to provide recommendations to an individual user or to a group of nearby users.
Our experiments were based on a linear algebra approach where, instead of a graph, we used a user-item matrix, applying singular value decomposition to build a latent factor model that supported individual and group recommendations. For groups, we proposed a rating aggregation method that gave every group member an equal chance of having a relevant influence on the outcome of the recommendations.
As a Breadcrumbs researcher, I was able to contribute in several different areas. I implemented a language-independent named entity recognition system based on DBpedia entity lists. This system identified three types of entities — people, places and dates — tied to three of the five dimensions (the five Ws) of journalism: who, where and when. Using this data, a multidimensional entity coreference network was built, connecting news clips that cited the same entities. Next, I implemented the community detection methodologies for multidimensional networks proposed by Tang et al., including the dimension integration strategies based on their unified view of four traditional community detection methodologies. These algorithms were integrated into the system, along with the Louvain method, one of the state-of-the-art algorithms for community detection.
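The list-based, language-independent recognition step can be sketched as a simple gazetteer matcher; the entity lists and the date pattern below are tiny illustrative stand-ins for the DBpedia-derived resources:

```python
import re

# Tiny stand-ins for the DBpedia-derived entity lists.
GAZETTEER = {
    "who":   ["José Saramago", "Eusébio"],
    "where": ["Porto", "Lisboa"],
}
# "when": a simple Portuguese date pattern, e.g. "25 de janeiro de 1942".
DATE = re.compile(r"\b\d{1,2} de \w+ de \d{4}\b")

def annotate(text):
    """Return (dimension, surface form) pairs found in a news clip."""
    hits = [(dim, name) for dim, names in GAZETTEER.items()
            for name in names if name in text]
    hits += [("when", m.group()) for m in DATE.finditer(text)]
    return hits

clip = "Eusébio nasceu a 25 de janeiro de 1942 e brilhou no Porto."
print(annotate(clip))
```

Clips sharing an annotation then become linked in the coreference network, with one edge type (dimension) per W.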
Next, two visualization tools were developed to display and explore the acquired data. The first was analogous to a map, where communities were visualized as countries resulting from the aggregation of a node population. The second enabled the exploration of the multidimensional network based on the three identified dimensions: who, where and when. Some simple chart visualizations were also created to display statistics about the top user and system tags and entities.
We used a topic model, based on Latent Dirichlet Allocation, to suggest titles for each collection of news clips; a simple event detection system was also created to find relevant peaks of activity in a time series of entity frequencies. Some supporting components, such as an administration panel capable of scheduling tasks and a widget dashboard, were also implemented.
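The peak-finding step can be sketched as a trailing z-score test over the entity-frequency series; the window size and threshold below are illustrative, not the values used in the system:

```python
import statistics

def detect_peaks(counts, window=7, z=2.0):
    """Flag positions whose count exceeds the trailing mean by z stdevs."""
    peaks = []
    for t in range(window, len(counts)):
        past = counts[t - window:t]
        mean = statistics.mean(past)
        std = statistics.pstdev(past) or 1.0  # avoid div-by-zero on flat runs
        if (counts[t] - mean) / std > z:
            peaks.append(t)
    return peaks

# Daily mention counts for one entity; day 9 is a burst.
series = [2, 3, 2, 2, 3, 2, 3, 2, 2, 20, 3, 2]
print(detect_peaks(series))  # [9]
```

A flagged day marks a candidate event for the entity, which can then be paired with the clips published in that window.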
These algorithms were all exposed through a web services architecture, communicating using either XML or JSON. Several scientific papers were published as a result of the described research. Below are some screenshots of the Breadcrumbs modules I contributed to.
From June 2010 to December 2010, I worked on Ciclope, a real-time data visualization project aimed at gathering information from the SAPO Blogs clickstream and displaying it in a useful way, allowing blog owners to understand how the traffic of their blogs behaves.
From January 2011 to June 2011, I focused on graph mining and community detection. As part of my work, I developed Unite, a Java library with the goal of providing an agile platform for link extraction and graph mining. Unite was built with out-of-the-box support for the most common graph building use cases, while providing an extensible and highly modular interface that allows developers to adapt the framework to their own needs. Below is an example of how to read content from a MySQL database, parse it and write the resulting graph to a Blueprints-enabled graph database, in this case Neo4j.
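Since Unite is a Java library whose API is not reproduced here, the pipeline can only be sketched conceptually; the Python sketch below uses sqlite3 as a stand-in for MySQL and emits Cypher statements in place of a Blueprints/Neo4j writer, with every name being illustrative:

```python
import re
import sqlite3

# Stand-in for the MySQL source: a table of blog posts with HTML bodies.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (url TEXT, body TEXT)")
db.executemany("INSERT INTO posts VALUES (?, ?)", [
    ("http://a.blogs.sapo.pt/1", 'See <a href="http://b.blogs.sapo.pt/2">this</a>'),
    ("http://b.blogs.sapo.pt/2", 'Back to <a href="http://a.blogs.sapo.pt/1">that</a>'),
])

HREF = re.compile(r'href="([^"]+)"')

# Read -> parse (link extraction) -> write graph (as Cypher statements).
statements = []
for url, body in db.execute("SELECT url, body FROM posts"):
    for target in HREF.findall(body):
        statements.append(
            f"MERGE (a:Post {{url: '{url}'}}) "
            f"MERGE (b:Post {{url: '{target}'}}) "
            f"MERGE (a)-[:LINKS_TO]->(b)"
        )

print(len(statements))  # 2
```

In the real pipeline, the writer step would hand each edge to the Blueprints API backed by Neo4j instead of printing Cypher.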
In 2010, I was presented with the opportunity to participate in the Blog Track of the Text REtrieval Conference (TREC). TREC is an information retrieval conference that holds several competitive tracks every year, providing its own datasets (at a price) together with human relevance assessments for each resource, given the search topics of the competition. This allows participants to calculate metrics such as mean average precision in order to evaluate their retrieval systems. For a novice in the area, the experience gained through participating in TREC is immense.
Our work focused on using some of the structural properties of the blog graph as query-independent features for the blog distillation process. Specifically, we ranked each blog according to its in-degree and compared that to the ranking according to the h-index, a metric commonly used in bibliometrics to measure the scientific output of a researcher. We then introduced a weighted, normalized value for each of these link-based metrics into the final document score, obtaining improved results for the h-index, but not for the in-degree.
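Transferred to blogs, the metric reads the in-link counts of a blog's posts the way it normally reads a researcher's citation counts; a minimal sketch:

```python
def h_index(citations):
    """Largest h such that h items each have at least h citations.

    For blog distillation, `citations` are the in-link counts of a
    blog's posts rather than the citation counts of a researcher's
    papers.
    """
    counts = sorted(citations, reverse=True)
    h = 0
    while h < len(counts) and counts[h] >= h + 1:
        h += 1
    return h

# In-links per post for one blog: four posts have at least 4 links each.
print(h_index([10, 8, 5, 4, 3, 1]))  # 4
```

The resulting per-blog score is query-independent, so it can be precomputed once over the whole collection and blended into each document's retrieval score.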
Through the participation in the TREC 2010 Blog Track, I acquired skills in link extraction, graph mining and blog distillation, additionally learning about methodologies for indexing, searching and parsing large-scale collections.