Unveiling Sentiments in Elon Musk’s Tweets: An Exploration through Sentiment Analysis and Text Mining

Azhar Muhammad Fikri Fuadi
9 min read · Nov 24, 2023

Introduction

Elon Musk, the renowned entrepreneur and innovator, frequently shares his thoughts and ideas on Twitter, where his posts often make headlines and can significantly shape public opinion. In this article, we apply sentiment analysis to Elon Musk’s tweets, training a machine learning model to determine the sentiment each tweet carries.

Sentiment analysis, also known as opinion mining, is a process that involves employing advanced natural language processing and machine learning techniques to determine the sentiment of text data. In simpler terms, it’s the art of gauging whether a piece of text carries a positive, negative, or neutral tone.

In this case, sentiment analysis can determine whether his tweets are positive or negative, which can provide insights into how people are reacting to his tweets and whether his tweets are impacting public opinion. Additionally, sentiment analysis can help identify the potential impact of his tweets on the stock market, particularly in the electric vehicle and space industries.

Data Collection

For this analysis, we collected two distinct datasets to accomplish different objectives:

1. Labeled tweets dataset. This dataset serves as the training foundation for the sentiment analysis model. Each tweet carries a manually annotated sentiment label (positive, negative, litigious, or uncertain), giving the model the ground truth it needs to learn the patterns associated with each sentiment.

2. Elon Musk’s tweets dataset. This is the dataset the trained model is applied to: his tweets are preprocessed to match the training format, fed to the model, and labeled with the predicted sentiments.
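
To make the setup concrete, here is a minimal loading sketch with pandas. The file names and the df_musk name are hypothetical, introduced here for illustration; the column names “Text” and “Label” match the code later in the post.

import pandas as pd

# labeled training data: one tweet per row, manually annotated sentiment
# (file name is hypothetical -- substitute your own path)
df = pd.read_csv("labeled_tweets.csv")         # columns: "Text", "Label"

# unlabeled application data: Elon Musk's tweets, to be scored by the model
df_musk = pd.read_csv("elon_musk_tweets.csv")  # column: "Text"

# the four manual labels: positive, negative, litigious, uncertain
df["Label"].value_counts()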

Tweets Cleaning

Before delving into the sentiment analysis, we performed extensive data preprocessing and cleaning. This included handling missing values, removing duplicates, and cleaning tweets by eliminating usernames, hashtags, links, newline characters, and HTML entities such as ampersands. Additionally, we normalized curly quotes and en dashes to their ASCII equivalents, ensuring the text was ready for analysis.

import re
from html import unescape

# remove usernames (e.g. @elonmusk)
def remove_usernames(tweet):
    return re.sub(r"@\w+", "", tweet)

# remove hashtags
def remove_hashtags(tweet):
    return re.sub(r"#\w+", "", tweet)

# remove all links
def remove_links(tweet):
    return re.sub(r"(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?", "", tweet)

# remove newlines
def remove_newlines(tweet):
    return re.sub(r"\n", "", tweet)

# decode HTML entities, e.g. '&amp;' back to '&'
def remove_ampersands(tweet):
    return unescape(tweet)

# replace curly quotes with straight quotes
def replace_quotes(tweet):
    return tweet.replace("’", "'")

# replace en dashes with hyphens
def replace_strips(tweet):
    return tweet.replace("–", "-")

# tweets cleaning
df["Text"] = df["Text"].apply(remove_usernames)
df["Text"] = df["Text"].apply(remove_hashtags)
df["Text"] = df["Text"].apply(remove_links)
df["Text"] = df["Text"].apply(remove_newlines)
df["Text"] = df["Text"].apply(remove_ampersands)
df["Text"] = df["Text"].apply(replace_quotes)
df["Text"] = df["Text"].apply(replace_strips)
df

Sentiment Proportion

To understand the distribution of sentiments in the labeled dataset, we visualized the frequency of each sentiment using a countplot. The sentiments included positive, negative, litigious, and uncertain. To focus on a binary sentiment classification, we removed tweets labeled as “litigious” and “uncertain.”
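
A sketch of both steps, assuming the labels live in a “Label” column (as the encoding step later in the post suggests) and seaborn is available:

import seaborn as sns
import matplotlib.pyplot as plt

# visualize how often each of the four labels occurs
sns.countplot(x="Label", data=df)
plt.title("Sentiment Distribution")
plt.show()

# keep only the two classes needed for binary classification
df = df[df["Label"].isin(["positive", "negative"])].reset_index(drop=True)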

Text Preprocessing

To prepare the text data for modeling, we performed several preprocessing steps. These included converting text to lowercase, expanding contractions, removing numbers, punctuation, stopwords, and extra whitespaces, and lemmatizing words. These steps aimed to standardize the text and enhance the model’s ability to extract meaningful features.

# case folding
def case_folding(tweet):
    return tweet.lower()

# expand contraction
contractions_dict = {
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he had",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how is",
"i'd": "i had",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it had",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "iit will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she had",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so is",
"that'd": "that had",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there had",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they had",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we had",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what'll've": "what will have",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"when's": "when is",
"when've": "when have",
"where'd": "where did",
"where's": "where is",
"where've": "where have",
"who'll": "who will",
"who'll've": "who will have",
"who's": "who is",
"who've": "who have",
"why's": "why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you had",
"you'd've": "you would have",
"you'll": "you will",
"you'll've": "you will have",
"you're": "you are",
"you've": "you have"
}

def expand_contractions(text, contractions_dict):
    # build one regex alternation over every contraction key
    contractions_pattern = re.compile('({})'.format('|'.join(contractions_dict.keys())),
                                      flags=re.IGNORECASE | re.DOTALL)
    # re.IGNORECASE: match contractions regardless of casing
    # re.DOTALL: make '.' also match newline characters

    def expand_match(contraction):
        match = contraction.group(0)
        expanded_contraction = contractions_dict.get(match) \
            if contractions_dict.get(match) \
            else contractions_dict.get(match.lower())
        return expanded_contraction

    expanded_text = contractions_pattern.sub(expand_match, text)  # replace each contraction with its expansion
    expanded_text = re.sub("'", "", expanded_text)  # remove any remaining apostrophes
    return expanded_text

def main_contraction(text):
    return expand_contractions(text, contractions_dict)

# remove numbers
def remove_numbers(tweet):
    return "".join([char for char in tweet if not char.isdigit()])

# remove punctuation
def remove_punctuations(tweet):
    return re.sub(r"[^\w\s]", " ", tweet)

# remove stopwords (keep "not", since it flips sentiment)
import nltk
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
stop_words.extend(["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "vs"])
stop_words.remove("not")

def remove_stopwords(tweet):
    return " ".join([word for word in nltk.wordpunct_tokenize(tweet) if word not in stop_words])

# remove extra whitespaces
def remove_whitespaces(text):
    return " ".join(text.split())

# lemmatization: reduce each word to its dictionary form
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemmatization(tweet):
    list_hasil = []
    for word in nltk.wordpunct_tokenize(tweet):
        list_hasil.append(lemmatizer.lemmatize(word))
    return " ".join(list_hasil)

# text preprocessing
df["preprocessing result"] = df["Text"].apply(case_folding)
df["preprocessing result"] = df["preprocessing result"].apply(main_contraction)
df["preprocessing result"] = df["preprocessing result"].apply(remove_numbers)
df["preprocessing result"] = df["preprocessing result"].apply(remove_punctuations)
df["preprocessing result"] = df["preprocessing result"].apply(remove_stopwords)
df["preprocessing result"] = df["preprocessing result"].apply(remove_whitespaces)
df["preprocessing result"] = df["preprocessing result"].apply(lemmatization)
df

Encoding the Target

In this step, the categorical labels in the “Label” column of the DataFrame were converted into numerical values. The “negative” sentiment was encoded as 0, while the “positive” sentiment was encoded as 1.

df["Label"] = df["Label"].apply(lambda x : 0 if x == "negative" else 1)

Define Features (X) and Target (y) and Data Splitting

The data was separated into features (X) and the target variable (y). The “preprocessing result” column contains the features (X), and the “Label” column contains the target variable (y). Also, the data was split into training and testing sets using the train_test_split function from scikit-learn. The stratify parameter ensures that the distribution of labels in the training and testing sets remains similar to the original dataset.

from sklearn.model_selection import train_test_split

# define features and target
X = df["preprocessing result"]
y = df["Label"]

# data splitting: 80% train, 20% test, stratified on the label
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)

Preparing for Embedding Layer (Padding)

Tokenization of text data and padding were performed to ensure that all sequences have the same length, which is essential when working with neural networks.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# fit the tokenizer on the training texts only
word_tokenizer = Tokenizer()
word_tokenizer.fit_on_texts(X_train)

# map each word to its integer index
X_train = word_tokenizer.texts_to_sequences(X_train)
X_test = word_tokenizer.texts_to_sequences(X_test)

vocab_length = len(word_tokenizer.word_index) + 1  # +1 for the padding index 0
maxlen = 100

# pad (or truncate) every sequence to exactly 100 tokens
X_train = pad_sequences(X_train, maxlen=maxlen, padding="post")
X_test = pad_sequences(X_test, maxlen=maxlen, padding="post")

Make Embedding Layer with GloVe

An embedding matrix was created using pre-trained 100-dimensional GloVe word vectors. This matrix serves as the fixed weights for the embedding layer in the neural network.

from numpy import asarray, zeros

# load the pre-trained GloVe vectors into a dictionary: word -> 100-d vector
embedding_dictionary = dict()
with open("glove.6B.100d.txt", encoding="utf8") as glove_file:
    for line in glove_file:
        records = line.split()
        word = records[0]
        vector_dimensions = asarray(records[1:], dtype="float32")
        embedding_dictionary[word] = vector_dimensions

# build the embedding matrix: row i holds the GloVe vector of word index i
embedding_matrix = zeros((vocab_length, 100))

for word, index in word_tokenizer.word_index.items():
    embedding_vector = embedding_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector

Creating LSTM Model

For sentiment analysis, we employed a Long Short-Term Memory (LSTM) neural network, a type of recurrent neural network (RNN), which is well-suited for sequence data like text. The model architecture includes an embedding layer with pre-trained weights, an LSTM layer with 32 units, a dropout layer for regularization, a flattening layer, and a dense layer with a sigmoid activation function for binary classification.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Flatten, Dense
from tensorflow.keras.optimizers import Adam

model = Sequential(name="SentimentAnalysis")
model.add(Embedding(
    input_dim=embedding_matrix.shape[0],
    output_dim=embedding_matrix.shape[1],
    weights=[embedding_matrix],
    input_length=maxlen,
    trainable=False))  # freeze the pre-trained GloVe weights
model.add(LSTM(32, return_sequences=True))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(1, activation="sigmoid"))
opt = Adam(learning_rate=0.001)
model.compile(loss="binary_crossentropy", optimizer=opt, metrics=["accuracy"])
model.summary()
Output:

Model: "SentimentAnalysis"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 embedding_5 (Embedding)     (None, 100, 100)          13756100

 lstm_7 (LSTM)               (None, 100, 32)           17024

 dropout_4 (Dropout)         (None, 100, 32)           0

 flatten_2 (Flatten)         (None, 3200)              0

 dense_5 (Dense)             (None, 1)                 3201

=================================================================
Total params: 13776325 (52.55 MB)
Trainable params: 20225 (79.00 KB)
Non-trainable params: 13756100 (52.48 MB)
_________________________________________________________________

Model Training

The model was trained on the training data for six epochs with a batch size of 128 and a 20% validation split, reaching a training accuracy of 98.87%.

history = model.fit(X_train, y_train, epochs=6, verbose=1, batch_size=128, validation_split=0.2)
Epoch 1/6
2452/2452 [==============================] - 105s 42ms/step - loss: 0.1007 - accuracy: 0.9616 - val_loss: 0.0484 - val_accuracy: 0.9827
Epoch 2/6
2452/2452 [==============================] - 88s 36ms/step - loss: 0.0435 - accuracy: 0.9840 - val_loss: 0.0379 - val_accuracy: 0.9853
Epoch 3/6
2452/2452 [==============================] - 91s 37ms/step - loss: 0.0358 - accuracy: 0.9863 - val_loss: 0.0338 - val_accuracy: 0.9870
Epoch 4/6
2452/2452 [==============================] - 95s 39ms/step - loss: 0.0325 - accuracy: 0.9875 - val_loss: 0.0326 - val_accuracy: 0.9875
Epoch 5/6
2452/2452 [==============================] - 90s 37ms/step - loss: 0.0308 - accuracy: 0.9880 - val_loss: 0.0336 - val_accuracy: 0.9871
Epoch 6/6
2452/2452 [==============================] - 90s 37ms/step - loss: 0.0291 - accuracy: 0.9887 - val_loss: 0.0313 - val_accuracy: 0.9881

Model Evaluation

We evaluated the model’s performance on the held-out test set, achieving a test accuracy of roughly 98.8%, and created a confusion matrix to visualize the classification results. The confusion matrix reveals the number of true positive, true negative, false positive, and false negative predictions.

score = model.evaluate(X_test, y_test, verbose=1)
print("Test score:", score[0])
print("Test accuracy:", score[1])
Output:

Test score: 0.03294483944773674
Test accuracy: 0.987897515296936
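
The snippet above reports only the aggregate metrics; one way to build and plot the confusion matrix, assuming scikit-learn and matplotlib are available, is:

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# sigmoid output -> binary labels with a 0.5 threshold
y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()

# rows: true labels, columns: predicted labels
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm, display_labels=["negative", "positive"]).plot()
plt.show()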

Live Prediction on Elon Musk’s Tweets

After training the model, we applied it to predict the sentiment of Elon Musk’s tweets. We preprocessed Elon Musk’s tweets in the same manner as the labeled dataset and fed them into the model. The predictions were then analyzed and visualized to understand the distribution of sentiments in Elon Musk’s recent tweets.
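
A minimal sketch of that pipeline, reusing the cleaning and preprocessing functions defined earlier and the hypothetical df_musk DataFrame from the data-collection sketch:

# run Musk's tweets through the same cleaning + preprocessing chain
for step in (remove_usernames, remove_hashtags, remove_links, remove_newlines,
             remove_ampersands, replace_quotes, replace_strips,
             case_folding, main_contraction, remove_numbers,
             remove_punctuations, remove_stopwords, remove_whitespaces,
             lemmatization):
    df_musk["Text"] = df_musk["Text"].apply(step)

# reuse the tokenizer fitted on the training data, then pad to the same length
musk_sequences = word_tokenizer.texts_to_sequences(df_musk["Text"])
musk_sequences = pad_sequences(musk_sequences, maxlen=maxlen, padding="post")

# sigmoid output > 0.5 -> positive, otherwise negative
predictions = (model.predict(musk_sequences) > 0.5).astype(int).ravel()
df_musk["Sentiment"] = ["positive" if p == 1 else "negative" for p in predictions]
df_musk["Sentiment"].value_counts()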

Word Cloud

To visually represent the prevalent themes in Elon Musk’s tweets, word clouds were generated for both negative and positive sentiments. Stopwords and specific terms were excluded to focus on meaningful content, and a distinctive color palette for each sentiment category added visual flair to the word clouds.
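
A sketch with the wordcloud package, assuming the predicted labels live in df_musk["Sentiment"] as in the previous sketch:

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# one word cloud per predicted sentiment, each with its own color palette
for sentiment, colormap in [("positive", "Greens"), ("negative", "Reds")]:
    text = " ".join(df_musk.loc[df_musk["Sentiment"] == sentiment, "Text"])
    wordcloud = WordCloud(width=800, height=400, background_color="white",
                          colormap=colormap, stopwords=STOPWORDS).generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(sentiment.capitalize() + " Tweets")
    plt.show()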

Conclusion

Sentiment analysis of Elon Musk’s tweets provides valuable insights into public perception. Understanding the sentiment can help anticipate market reactions and public sentiment shifts, particularly in industries influenced by Musk’s ventures.

This analysis serves as a foundational exploration, and further enhancements, such as fine-tuning the model and incorporating more sophisticated techniques, could be explored for more nuanced sentiment analysis.

In conclusion, sentiment analysis unveils the sentiments hidden within Elon Musk’s tweets, offering a glimpse into the collective mood and reactions of the Twitterverse towards one of the most influential figures in the business and technology world.
