GitHub Repo
https://github.com/leenasuva/Topic-Analysis---BERT-Tokenizer
leenasuva/Topic-Analysis---BERT-Tokenizer
This project uses machine learning to categorize comments scraped from an online platform and predict the topics associated with them. There are 40 topics in total. Although it looks like a straightforward classification problem, a closer look at the data shows that the real task is to make sense of the comments and only then assign categories. Because the number of topics/classes is far larger than in a typical classification problem, the expected accuracy will not be very high.

Topic modeling and classification have become enormously popular: brands use them to analyze feedback on products and services, campaigns use them to measure popularity during elections, and researchers use them to gauge public sentiment on various issues. Deriving meaningful topics from raw comments is hard because of variations in language, emojis, and partial or profane text. It is essential to choose a scheme that translates the comments into word embeddings so that similarity between comments can be computed and relevant topics assigned; it is equally important to capture the context and meaning of the comments and cluster them into topics.

There are several established approaches to topic modeling, such as Latent Dirichlet Allocation (LDA) and Probabilistic Latent Semantic Analysis (pLSA), and these benchmark techniques give viable results for problems like this one. The initial approach here was to vectorize the comments with TF-IDF and Word2Vec and then apply standard classification techniques to assign topics to those vectors. However, bag-of-words with TF-IDF and word embeddings with Word2Vec both hide a significant problem: they assign identical representations to the same word regardless of its meaning, with no context attached.
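The TF-IDF baseline can be sketched in a few lines with scikit-learn. The comments and topic names below are made up purely for illustration (the real dataset has 40 topics and is not reproduced here):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical stand-ins for scraped comments and their topic labels.
comments = [
    "great phone battery lasts all day",
    "the camera quality is amazing",
    "delivery was late and packaging damaged",
    "refund took two weeks to arrive",
]
topics = ["electronics", "electronics", "shipping", "payments"]

# Vectorize with TF-IDF, then fit a standard linear classifier on top.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(comments, topics)

# Predict the topic of a new, unseen comment.
print(clf.predict(["my battery died after a week"])[0])
```

Note that the `TfidfVectorizer` step assigns each vocabulary word a single fixed column, which is exactly the context-blindness discussed below.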
For example, the word “bank” in “Peter is fishing near the bank.” and in “Two people robbed the state bank on Monday.” receives the same vector under these representations. That leads to misleading results, so to improve prediction performance we need a method that captures the context of each word. Transformers, a relatively new modeling technique introduced by Google researchers in the seminal paper “Attention Is All You Need,” tackle exactly this problem. Google’s BERT (Bidirectional Encoder Representations from Transformers) combines ELMo-style contextual embeddings with stacked Transformer encoders, and it is deeply bidirectional, which was a major novelty for Transformers. The vector BERT assigns to a word is a function of the entire sentence, so the same word can have different vectors depending on its context. (ELMo itself is a contextual word-embedding technique that uses bidirectional LSTMs to read each sentence before assigning embeddings.)
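The two “bank” sentences make this concrete. A minimal sketch using the Hugging Face `transformers` library with the `bert-base-uncased` checkpoint (the `word_vector` helper is our own illustration, not part of the repo) shows that BERT gives the same word different vectors in different contexts:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    """Return BERT's hidden state for `word`'s first occurrence."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, 768)
    ids = enc["input_ids"][0].tolist()
    idx = ids.index(tok.convert_tokens_to_ids(word))
    return hidden[idx]

v1 = word_vector("Peter is fishing near the bank.", "bank")
v2 = word_vector("Two people robbed the state bank on Monday.", "bank")

# With a static embedding (TF-IDF, Word2Vec) this similarity would be
# exactly 1.0; with BERT the two "bank" vectors differ by context.
sim = torch.cosine_similarity(v1, v2, dim=0).item()
print(round(sim, 2))
```

The printed cosine similarity is well below 1.0, confirming that the embedding for “bank” depends on the whole sentence.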
GitHub Repo
https://github.com/sadaf-ali/Ear-recognition-in-3D-using-2D-curvilinear-features