Leveraging MongoDB Atlas and Databricks To Perform Reddit Posts Sentiment Analysis — Part 1

A few months back, Databricks and MongoDB Inc announced an integration between the two platforms, which opens up new avenues for data processing and analysis. When I saw the announcement, I decided to experiment with both technologies by running sentiment analysis on social media data. My goal was to understand how MongoDB, a widely used document database, is perceived by the people who use it daily: developers. In full transparency, I’m an employee at MongoDB, and that familiarity made the MongoDB side of the project easier for me to navigate.

The primary objective of this project was to explore and illustrate how MongoDB Atlas and Databricks can be used together for sentiment analysis, specifically to understand developers’ sentiment towards MongoDB, with Reddit as the primary data source for now.

I wanted to achieve the following through this project:

1. Scalable data extraction and storage: MongoDB Atlas’s scalability and flexibility make it ideal for handling a large volume of data that we can extract from Reddit.

2. Effective sentiment analysis: Using Databricks and TextBlob, I planned to analyze the sentiments expressed in Reddit posts and comments.

3. Visual insight generation: With MongoDB Atlas Charts, I aimed to convert the analyzed data into visual insights, making the information more digestible and easier to understand.

This blog post walks through how I did it, presenting the steps I took and the results I obtained.

High-Level Architecture

The solution consists of the following components:

- Reddit API: For data extraction.

- MongoDB Atlas: Provides a hosted/managed MongoDB database cluster, used to store and retrieve Reddit data.

- Databricks: Used to perform sentiment analysis.

- TextBlob: A Python library used within Databricks to analyze text for sentiment.

- MongoDB Atlas Charts: A component of the MongoDB Atlas platform used for visualizing sentiment scores.


A Deeper Look into the Key Technologies

- MongoDB Atlas: MongoDB Atlas is a fully managed, cloud-based MongoDB database platform developed by MongoDB Inc. It handles operational tasks for you and provides a reliable environment for storing, retrieving, and analyzing data, in our case Reddit data. MongoDB’s document-oriented model, known for its flexibility and scalability, makes it a good fit for this kind of data. To further optimize data management, I used MongoDB’s Time Series collections, which were introduced in MongoDB 5.0 and are specifically designed for time-stamped data. These collections organize data around a specified time field, which makes storage, retrieval, and time-based analysis more efficient. This blog provides more details on MongoDB’s Time Series collection.

- Databricks and TextBlob: Databricks is a unified data analytics platform that is widely used for data analysis. In this project, Databricks hosts our Python scripts, which use the TextBlob library for sentiment analysis. TextBlob provides a simple API for a variety of natural language processing (NLP) tasks, including part-of-speech tagging, noun phrase extraction, and sentiment analysis. Being a beginner in machine learning, I chose an off-the-shelf library like TextBlob, which let me run sentiment analysis without training a model of my own and saved me a huge amount of time.

- MongoDB Atlas Charts: MongoDB Atlas Charts is a powerful visualization tool that provides real-time visual insights from our MongoDB data. In the context of this project, visualizing sentiment analysis over time can prove invaluable to decision-makers in many areas, such as marketing or strategic planning. By revealing clear trends and patterns in public sentiment, it helps to inform key business decisions.
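To make the Time Series collection mentioned above more concrete, here is a minimal Python sketch of how such a collection could be created with PyMongo. The collection name `reddit_posts` and the field names `timestamp` and `subreddit` are assumptions for illustration, not names taken from the project; `db` is expected to be a PyMongo `Database` object.

```python
from typing import Any, Dict


def timeseries_options(time_field: str = "timestamp",
                       meta_field: str = "subreddit",
                       granularity: str = "hours") -> Dict[str, Any]:
    """Build the options document for a time series collection.

    The field names are hypothetical; use whatever your documents contain.
    """
    if granularity not in {"seconds", "minutes", "hours"}:
        raise ValueError(f"unsupported granularity: {granularity}")
    return {
        "timeField": time_field,     # required: the date field of each document
        "metaField": meta_field,     # optional: label that identifies a series
        "granularity": granularity,  # rough spacing between measurements
    }


def ensure_timeseries_collection(db, name: str = "reddit_posts"):
    """Create the time series collection if it does not exist yet.

    `db` is a pymongo Database; nothing is imported here so the sketch
    can be read and tested without a running cluster.
    """
    if name not in db.list_collection_names():
        db.create_collection(name, timeseries=timeseries_options())
    return db[name]
```

Time series collections need MongoDB 5.0 or later, and the `timeField` of every inserted document must hold a date value.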

Building the Solution

The detailed steps to build this solution can be found in this Git repository.

To perform sentiment analysis on the Reddit data, I created a Java Maven project to interact with the Reddit API focusing on posts related to “mongodb”. This project allowed me to retrieve the data in a structured format and store it directly in a time series collection in MongoDB Atlas.
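The extraction step itself was written in Java, and that code lives in the repository. Purely as an illustration of the data shaping involved, the following Python sketch turns a Reddit listing response into documents suitable for a time series collection. It assumes the standard listing layout (`data.children[].data`) and Reddit's `created_utc` epoch timestamp; the output field names `post_id`, `body`, and `timestamp` are hypothetical.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def listing_to_documents(listing: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Convert a Reddit listing (already parsed from JSON) into documents.

    Output field names are illustrative, not the ones used in the project.
    """
    documents = []
    for child in listing.get("data", {}).get("children", []):
        post = child.get("data", {})
        if "created_utc" not in post:
            continue  # a time series document needs a timestamp
        documents.append({
            "post_id": post.get("id"),
            "subreddit": post.get("subreddit"),
            "title": post.get("title", ""),
            "body": post.get("selftext", ""),
            # created_utc is seconds since the epoch; store it as a UTC datetime
            "timestamp": datetime.fromtimestamp(post["created_utc"], tz=timezone.utc),
        })
    return documents
```

Storing the timestamp as a real datetime rather than a number matters here, because the time field of a time series collection must be a date.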

To facilitate data transfer between MongoDB Atlas and Databricks, I used PyMongo, the official Python driver for MongoDB. This allowed me to connect to the MongoDB Atlas cluster and pull Reddit data directly within Databricks. More details on this setup can be found in the Git repository, and this blog provides the steps to start a Databricks free trial on AWS, as I did.
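A minimal sketch of that read path might look like the following. The database name `reddit`, the collection name `reddit_posts`, the projected field names, and the `MONGODB_URI` environment variable are all assumptions for illustration; the actual notebook is in the repository.

```python
from typing import Any, Dict, List


def connect(uri: str, db_name: str = "reddit", coll_name: str = "reddit_posts"):
    """Return a collection handle for an Atlas cluster (names are hypothetical)."""
    from pymongo import MongoClient  # third-party: pip install pymongo
    return MongoClient(uri)[db_name][coll_name]


def fetch_posts(collection, limit: int = 1000) -> List[Dict[str, Any]]:
    """Read the most recent posts, keeping only the fields needed for scoring."""
    projection = {"_id": 0, "post_id": 1, "title": 1, "body": 1, "timestamp": 1}
    cursor = collection.find({}, projection).sort("timestamp", -1).limit(limit)
    return list(cursor)


# Example usage inside a notebook cell (requires a reachable cluster):
#   import os
#   posts = fetch_posts(connect(os.environ["MONGODB_URI"]), limit=500)
```

Projecting only the text and timestamp fields keeps the amount of data pulled into the notebook small.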

Finally, TextBlob came into play, helping me perform sentiment analysis within Databricks. By defining a function that used TextBlob to score each post, I was able to gain valuable insights from the Reddit data. You can find the complete code in the Git repository.
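Such a function can be very small. The sketch below is not the repository's code; it shows one plausible shape, using TextBlob's `sentiment.polarity` value. The `label` helper and its 0.1 threshold are assumptions added for illustration.

```python
def polarity(text: str) -> float:
    """Return TextBlob's polarity for `text`, a float between -1.0 and 1.0."""
    from textblob import TextBlob  # third-party: pip install textblob
    return TextBlob(text).sentiment.polarity


def label(score: float, threshold: float = 0.1) -> str:
    """Map a polarity score to a coarse label (the threshold is an arbitrary choice)."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


# Example usage (requires textblob to be installed):
#   score = polarity("MongoDB Atlas made this project easy to set up")
#   print(score, label(score))
```

Keeping the numeric score and the label separate means the raw score can still be stored and charted, while the label is only a convenience for reading results.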

After running the TextBlob function over the data, I wrote the results back to MongoDB Atlas. Each document now includes a sentiment score between -1 and 1, where 1 represents a very positive sentiment and -1 a highly negative one. These scores were then used to create detailed visualizations in MongoDB Atlas Charts.
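As a rough sketch of this write-back step, the code below attaches a score to each post and inserts the results into a collection. It is written under a few assumptions: the field name `sentiment_score` is hypothetical, the scorer is passed in as a plain function, and the scored documents go to a separate target collection, because time series collections place restrictions on in-place updates. How the repository handles this may differ.

```python
from typing import Any, Callable, Dict, Iterable, List


def add_scores(posts: Iterable[Dict[str, Any]],
               scorer: Callable[[str], float]) -> List[Dict[str, Any]]:
    """Return copies of `posts` with a `sentiment_score` field in [-1, 1]."""
    scored = []
    for post in posts:
        text = f"{post.get('title', '')} {post.get('body', '')}".strip()
        # Clamp defensively so that stored scores always stay in [-1, 1].
        score = max(-1.0, min(1.0, scorer(text)))
        scored.append({**post, "sentiment_score": round(score, 4)})
    return scored


def store_scores(collection, scored_posts: List[Dict[str, Any]]) -> int:
    """Insert scored documents; returns how many were written."""
    if not scored_posts:
        return 0  # insert_many rejects an empty list
    result = collection.insert_many(scored_posts)
    return len(result.inserted_ids)
```

Any function that maps text to a float can serve as `scorer`, for example one that returns `TextBlob(text).sentiment.polarity`.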

By leveraging MongoDB Atlas as the data storage solution and integrating it with Databricks, I can benefit from the scalability, flexibility, and powerful querying capabilities of MongoDB, while also taking advantage of Databricks’ advanced analytics and machine learning capabilities.
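As one example of those querying capabilities, the over-time view that Atlas Charts draws from the sentiment scores can be approximated with an aggregation pipeline. The sketch below builds a pipeline that averages scores per day; the field names `timestamp` and `sentiment_score` are assumptions, and `$dateTrunc` requires MongoDB 5.0 or later.

```python
from typing import Any, Dict, List


def daily_sentiment_pipeline(time_field: str = "timestamp",
                             score_field: str = "sentiment_score") -> List[Dict[str, Any]]:
    """Build an aggregation pipeline that averages sentiment per calendar day."""
    return [
        # Skip documents that have not been scored yet.
        {"$match": {score_field: {"$exists": True}}},
        {"$group": {
            # Truncate each timestamp to the start of its day.
            "_id": {"$dateTrunc": {"date": f"${time_field}", "unit": "day"}},
            "avg_sentiment": {"$avg": f"${score_field}"},
            "posts": {"$sum": 1},
        }},
        {"$sort": {"_id": 1}},  # oldest day first
    ]


# Example usage with a pymongo collection:
#   for row in collection.aggregate(daily_sentiment_pipeline()):
#       print(row["_id"], row["avg_sentiment"], row["posts"])
```

Including the post count alongside the average helps when reading the trend, since a day with very few posts can swing the average sharply.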

Conclusion

Using MongoDB Atlas and Databricks for sentiment analysis proved to be highly productive. MongoDB Atlas was able to manage a large amount of Reddit data, and Databricks, together with TextBlob, made sentiment analysis quite straightforward.

A key outcome was the ability to track sentiment trends over time using MongoDB Atlas Charts. This temporal analysis revealed the ups and downs of sentiments towards MongoDB, which could be strategically correlated with marketing campaigns or new version releases to assess their impact.

Overall, this project showcased how MongoDB Atlas and Databricks can be integrated for sentiment analysis. In this first part, I focused on creating a simplified version of the project. In the next part, I will dive deeper into real-time analysis using Apache Kafka and make heavier use of charts and visualizations to get a deeper understanding of developers’ sentiments towards MongoDB. If you’re interested, simply click follow to get notified when it’s live. Stay tuned for more updates!



Leveraging MongoDB Atlas, Kafka Confluent Cloud, and Databricks to Perform Reddit and StackOverflow Posts Sentiment Analysis — Part 2