Text Sentiment Analysis and Its Application Based on Natural Language Processing



Muhammad Waleed Asif

Sun Yat-sen University

Abstract: Text sentiment analysis is the systematic evaluation, processing, abstraction, and inference of subjective text laden with emotional undertones; in particular, it examines consumers' perceptions of and feelings about products, services, and other entities. Using Natural Language Processing (NLP) models, we interpret and analyze previously collected reviews, and the resulting findings are visualized in Excel. To refine the methodology, machine learning and the Long Short-Term Memory (LSTM) neural network model are employed to build and tune structural mappings for word-vector training. This approach not only avoids the labor of manually sifting through voluminous data but also improves the precision of the sentiment analysis.

Keywords: Natural Language Processing; Text; Sentiment Analysis.

  1. Introduction

In 2022, the number of internet application users in China continued its steady growth. As of December 2022, the usage rates among netizens for instant messaging, online videos, and short videos were 97.5%, 94.5%, and 90.5% respectively, corresponding to user bases of approximately 1.007 billion, 975 million, and 934 million. The active user base on these platforms is evidently growing rapidly. People communicate in natural language, and text, as the predominant medium on these platforms, holds immense potential. If we can extract emotional predilections from these messages, it could significantly aid in understanding societal mood dynamics and prove invaluable for developing entertainment-oriented applications. Initially, gauging the emotional intensity and direction of such texts relied on subjective judgment. Today, however, with the explosive proliferation of online data, the sheer volume of information and its diverse modes of expression have made it impractical to rely solely on traditional, labor- and time-intensive methods. Hence, computerized solutions for automated sentiment analysis of review data have gained considerable attention in recent years and have emerged as a hot topic in academia. Online information has become deeply integrated into daily life, influencing individual choices and decision-making. By analyzing review data, we can forecast future trends and offer informed recommendations, underscoring the vast potential and value of text sentiment analysis in contemporary contexts.

  2. Methods and Steps in Natural Language Processing

1)Text Acquisition

The quality of the sample directly determines the accuracy of results derived from machine learning or statistical methods, the mainstream research methodologies in contemporary practice. The paramount issue is therefore how to procure large amounts of data: the accuracy of both methodologies hinges on the quality, size, and distribution uniformity of the samples, all of which exert considerable influence on algorithmic performance.

One approach to corpus acquisition is to search online for language resources provided by third parties, such as Wikipedia. However, in real-world applications, many research and development systems operate within specific domains, which may not always align with users' requirements. In such instances, an alternative method is necessitated. Data collectors can be employed to gather the desired information. Additionally, Python frameworks like PySpider or Scrapy enable the facile scripting of web crawlers, facilitating the automated collection of voluminous data sets, thus paving the way for subsequent analytical endeavors.
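Beyond a full Scrapy project, the extraction step can be sketched with Python's standard library alone. The `<p class="review">` structure below is hypothetical and would have to match the actual target page:

```python
from html.parser import HTMLParser

class ReviewExtractor(HTMLParser):
    """Collects the text inside <p class="review"> tags (hypothetical page structure)."""

    def __init__(self):
        super().__init__()
        self._in_review = False
        self.reviews = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "p" and ("class", "review") in attrs:
            self._in_review = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_review = False

    def handle_data(self, data):
        if self._in_review:
            self.reviews.append(data.strip())

html = '<div><p class="review">Great phone!</p><p>nav</p><p class="review">Too slow.</p></div>'
parser = ReviewExtractor()
parser.feed(html)
print(parser.reviews)  # ['Great phone!', 'Too slow.']
```

In a production crawler, Scrapy would handle request scheduling, retries, and politeness policies; only the parsing logic would resemble this sketch.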

2)Tokenization

Chinese and English call for different tokenization strategies owing to their inherent linguistic characteristics. English text can largely be segmented on whitespace, for example:

Figure 1: English Tokenization Results
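The whitespace approach for English amounts to a single split call:

```python
sentence = "Natural language processing turns raw text into data"
tokens = sentence.split()  # whitespace is the delimiter in English
print(tokens)
# ['Natural', 'language', 'processing', 'turns', 'raw', 'text', 'into', 'data']
```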

However, for Chinese, due to its more intricate grammar, third-party libraries are often employed for tokenization. An example of such a library is "jieba". For instance:

Figure 2: Chinese Tokenization Using the Jieba Library

3)Text Cleaning

Collected texts often contain a plethora of unnecessary content, such as embedded HTML code and CSS tags. It is also beneficial to eliminate redundant stop words and punctuation marks to ensure text clarity. Below is a common cleaning technique for removing punctuation:

Figure 3: Removal of Punctuation Marks
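A minimal punctuation-removal sketch using the standard library's `string.punctuation` (ASCII punctuation only; full-width Chinese punctuation would need an extended translation table):

```python
import string

text = "Great phone!!! Battery life, however, is short..."
# Map every ASCII punctuation character to None, then apply the table
cleaned = text.translate(str.maketrans("", "", string.punctuation))
print(cleaned)  # Great phone Battery life however is short
```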

4)Normalization

Normalization typically involves lemmatization and stemming. Let us first examine the distinction between these two methodologies.

Stemming:

Figure 4: Lemmatization and Stemming

Printed Result: 'wolv'

Lemmatization:

Figure 5: Lemmatization Results

Printed Result: 'wolf'

Theoretically, the process of stemming can be described as "reduction," wherein words are truncated to their root form. For instance, "effective" is transformed to "effect," and "cats" becomes "cat." On the other hand, lemmatization involves "transformation," converting words back to their base or canonical form. For example, "driving" is treated as "drive," and "drove" is also rendered as "drive."

In terms of complexity, stemming is relatively straightforward. It merely reduces words to their rudimentary forms. Lemmatization, in contrast, demands a return to the original form, necessitating morphological parsing. This entails not only the transformation of affixes but also the identification of word classes, distinguishing between the base form of a word and other grammatical variants. The precision of part-of-speech tagging can influence the accuracy of lemmatization. Consequently, lemmatization presents a more intricate challenge.

The outputs of stemming and lemmatization also differ. For instance, after stemming, the word “airliner” is reduced to “airlin,” and “revival” is truncated to “reviv.” By contrast, the words derived through lemmatization retain their full semantic integrity and are typically valid, dictionary-recognized terms.
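The “reduction” character of stemming can be illustrated as pure suffix stripping; the suffix list below is made up for demonstration, whereas real systems use the full Porter algorithm:

```python
def toy_stem(word):
    """Crude suffix stripping in the spirit of the Porter stemmer (illustration only)."""
    for suffix in ("ing", "er", "al", "s"):
        # Only strip when a reasonable stem remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(toy_stem("airliner"))  # airlin
print(toy_stem("revival"))   # reviv
print(toy_stem("cats"))      # cat
```

Lemmatization, by contrast, cannot be done by string surgery alone: mapping “wolves” to “wolf” or “drove” to “drive” requires a dictionary and part-of-speech information, which is why libraries back it with a lexical database such as WordNet.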

Within their respective application domains, these distinctions persist. Both methods are employed for text processing and information retrieval, yet their focal points diverge. In information search, stemming is increasingly favored, as evidenced by its adoption in tools such as Lucene and Solr. Lemmatization, on the other hand, has gained widespread traction in fields like natural language processing and text mining. Consequently, standards should be established based on practical considerations and the specific context of application.

In the development of systems, paramount attention must be paid to performance, usability, and scalability. To ensure stable system performance, it is advisable to integrate Redis as a caching middleware to alleviate the concurrency on the MySQL database. The user interface should epitomize simplicity and minimalism to enhance user-friendliness. Furthermore, the incorporation of security measures, such as firewalls and data backups, is crucial to safeguard system integrity. Employing a layered design complemented by principles of low coupling and high cohesion will pave the way for system extensibility.

5)Feature Extraction

Typically, operations involve methodologies like Word2Vec, TF-IDF, and CountVectorizer:

a)Word2Vec:

Word2Vec is an Estimator that trains the Word2Vec model using a set of words. This model maps each word to a fixed-size vector. The Word2Vec model averages the words in a document into a single vector, which can serve as a predictive feature to gauge document similarity.
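The averaging step can be sketched with toy, made-up word vectors (real Word2Vec embeddings are learned from a corpus and have hundreds of dimensions):

```python
# Toy "pre-trained" word vectors; the values are invented for illustration
vectors = {
    "good":  [1.0, 0.0],
    "phone": [0.0, 1.0],
    "bad":   [-1.0, 0.0],
}

def doc_vector(tokens):
    """Average the word vectors of a document into a single fixed-size vector."""
    known = [vectors[t] for t in tokens if t in vectors]
    return [sum(dim) / len(known) for dim in zip(*known)]

print(doc_vector(["good", "phone"]))  # [0.5, 0.5]
```

Two documents whose averaged vectors lie close together (e.g. by cosine distance) can then be treated as similar.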

b)TF-IDF: 

Given a term t, a document d, and a corpus D, the term frequency TF(t,d) is the number of occurrences of t in d, and the document frequency DF(t,D) is the number of documents in D that contain t. If we gauged the importance of a word by term frequency alone, we would unduly emphasize words that appear across most documents, such as "of," "the," and "a." A word that surfaces frequently throughout the corpus carries little information specific to any particular document, so its weight is discounted accordingly.
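Following these definitions, the weighting can be sketched in a few lines; the +1 smoothing below is one common variant, not necessarily the exact formula of any particular library:

```python
from math import log

corpus = [
    ["the", "phone", "is", "good"],
    ["the", "battery", "is", "bad"],
    ["good", "battery"],
]

def tf(t, d):
    """TF(t, d): occurrence count of term t in document d."""
    return d.count(t)

def df(t, D):
    """DF(t, D): number of documents in D that contain t."""
    return sum(1 for d in D if t in d)

def tfidf(t, d, D):
    """Smoothed TF-IDF: terms common across the corpus are down-weighted."""
    idf = log((len(D) + 1) / (df(t, D) + 1))
    return tf(t, d) * idf

print(tfidf("the", corpus[0], corpus))    # low: "the" appears in most documents
print(tfidf("phone", corpus[0], corpus))  # higher: specific to this document
```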

c)CountVectorizer:

CountVectorizer and CountVectorizerModel are tools designed to convert documents into vectors. In the absence of a pre-existing dictionary, CountVectorizer can operate akin to an Estimator, extracting words and yielding a CountVectorizerModel. This model produces a sparse representation of words within a document, a representation which can be further employed by algorithms like LDA.
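The vocabulary-learning and counting behavior can be sketched in pure Python (a dense list is returned here for readability, whereas CountVectorizerModel yields a sparse representation):

```python
from collections import Counter

docs = [["good", "phone", "good"], ["bad", "battery"]]

# Learn the vocabulary from the corpus, as CountVectorizer does
# when no pre-existing dictionary is supplied
vocab = sorted({w for d in docs for w in d})

def count_vector(doc):
    """Count vector over the learned vocabulary, one slot per vocabulary word."""
    counts = Counter(doc)
    return [counts[w] for w in vocab]

print(vocab)                  # ['bad', 'battery', 'good', 'phone']
print(count_vector(docs[0]))  # [0, 0, 2, 1]
```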

  3. Application of Text Sentiment Classification Methods in Natural Language Processing

1)Neural Network Model

The model introduced here is the LSTM, or Long Short-Term Memory network. LSTM is essentially a derivative of the RNN, designed primarily to address the RNN's shortcomings. A single LSTM unit comprises a forget gate, an input gate, and an output gate. Although the network's structure is modified, its training methodology remains the same as that of other neural networks, principally backpropagation and gradient-based techniques. Below is a schematic representation of an individual neural unit:

Figure 6: Architecture of a single LSTM neural unit

This architecture inherits the memory functionality of RNN while also addressing the latter's inability to discard obsolete information. In text processing, it's plausible that the current text segment has no correlation with content from a distant past. However, RNNs persistently retain such long-ago information and fail to discard it. LSTM, with its incorporation of a forget gate, can gradually "forget" previous irrelevant information, providing a nuanced mechanism for memory management.
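One LSTM step can be illustrated with scalar states and invented weights; the numbers in `w` are arbitrary, chosen only to show how the three gates combine (real implementations use weight matrices plus bias terms):

```python
from math import exp, tanh

def sigmoid(x):
    return 1 / (1 + exp(-x))

def lstm_cell(x, h_prev, c_prev, w):
    """One LSTM step with scalar state; w maps each gate to its (w_x, w_h) pair."""
    f = sigmoid(w["f"][0] * x + w["f"][1] * h_prev)  # forget gate: how much old memory to keep
    i = sigmoid(w["i"][0] * x + w["i"][1] * h_prev)  # input gate: how much new info to write
    g = tanh(w["g"][0] * x + w["g"][1] * h_prev)     # candidate memory content
    o = sigmoid(w["o"][0] * x + w["o"][1] * h_prev)  # output gate
    c = f * c_prev + i * g                           # new cell state
    h = o * tanh(c)                                  # new hidden state
    return h, c

# Made-up gate weights and a toy 3-step input sequence
w = {"f": (0.5, 0.1), "i": (0.6, 0.2), "g": (0.9, 0.3), "o": (0.7, 0.1)}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_cell(x, h, c, w)
print(h, c)
```

The forget gate `f` multiplies the previous cell state, so values near 0 discard obsolete information while values near 1 carry it forward; this is precisely the mechanism the plain RNN lacks.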

2)Machine Processing of Text

It is widely acknowledged that machines can only process numerical data. Hence the conundrum: how can textual information be converted into numerical form? This stands as one of the most fundamental and crucial challenges in natural language processing.

Typically, a sentence is composed of numerous words, colloquially referred to as "tokens" or simply "words". If one masters the technique of numerically representing these tokens, it logically follows that representing entire sentences or phrases becomes attainable.
Contemporary research in natural language processing predominantly commences by training the desired word vectors, meaning each word is associated with a specific vector. For instance, the word "机器" might be represented by a multi-dimensional vector like [1,2,3,4,5,6,4,7,8,…]. Such a representation renders the word processable by machines. The act of obtaining these word vectors is technically termed as "word2vec."
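The lookup step described above can be sketched as follows; the vector values are hypothetical stand-ins for trained embeddings:

```python
# Hypothetical trained embeddings: each word maps to a fixed-size vector
embeddings = {
    "机器": [0.12, -0.40, 0.88],
    "学习": [0.05, 0.73, -0.21],
}

def embed(tokens, dim=3):
    """Replace each token with its vector; unknown words fall back to a zero vector."""
    return [embeddings.get(t, [0.0] * dim) for t in tokens]

vecs = embed(["机器", "学习", "未知词"])
print(vecs)
```

Once every token has a vector, a whole sentence becomes a sequence of vectors, which is exactly the input format an LSTM consumes.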

Conclusion

In light of the above discussion, it is evident that within the domain of natural language processing, sentiment analysis of textual data yields a commendable degree of accuracy and enhances efficiency. However, the linguistic system entrenched in the human brain is profoundly intricate. Both language and emotion are human constructs, making human speech and sentiments exceptionally complex entities. A sentence isn't merely a rudimentary amalgamation of subject, verb, and object; it isn't just a collection of words. Variations in word combinations, sequences, and contexts can yield disparate interpretations. In the era of big data, sentiment analysis through natural language processing holds significant merit and value, particularly in the realm of online public opinion monitoring.
