START Vector Product

GitHub Repo https://github.com/Mizanur888/VectorDotProductWith_MPI_Scatter

Mizanur888/VectorDotProductWith_MPI_Scatter

In this assignment we parallelize the matrix-vector product to calculate dot product more efficiently. To reducing the calculation time, we send chunks to array to different processes, so that each process can calculate the dot product of chunks of array. Also, we calculate the start, end and total time to determine how long it taken by each process.

GitHub Repo https://github.com/suy1968/Sentiment_analysis-using-nltk

suy1968/Sentiment_analysis-using-nltk

This project was aimed to used sentiment analysis to determined product recommendation. We started with the data engineering and text mining, which cover change text into tokens, remove punctuation, numbers, stop words and normalization them by using lemmatization. Following we used bag of words model to convert the text into numerical feature vectors.

GitHub Repo https://github.com/qferbs/vector-math

qferbs/vector-math

This is a learning project to do vector math such as vector addition, cross product, and dot product. Run vectorGUI to start.

GitHub Repo https://github.com/aditya2005-code/E-Commerce_Product_Recommender

aditya2005-code/E-Commerce_Product_Recommender

A content-based e-commerce product recommendation system that uses NLP techniques, TF-IDF vectorization, and cosine similarity to suggest similar products based on product metadata. Designed to handle cold-start scenarios and easily extensible for real-world recommendation use cases.

GitHub Repo https://github.com/Jai-Agarwal-04/Sentiment_Analysis_with_Insights

Jai-Agarwal-04/Sentiment_Analysis_with_Insights

Sentiment Analysis with Insights using NLP and Dash This project show the sentiment analysis of text data using NLP and Dash. I used Amazon reviews dataset to train the model and further scrap the reviews from Etsy.com in order to test my model. Prerequisites: Python3 Amazon Dataset (3.6GB) Anaconda How this project was made? This project has been built using Python3 to help predict the sentiments with the help of Machine Learning and an interactive dashboard to test reviews. To start, I downloaded the dataset and extracted the JSON file. Next, I took out a portion of 7,92,000 reviews equally distributed into chunks of 24000 reviews using pandas. The chunks were then combined into a single CSV file called balanced_reviews.csv. This balanced_reviews.csv served as the base for training my model which was filtered on the basis of review greater than 3 and less than 3. Further, this filtered data was vectorized using TF_IDF vectorizer. After training the model to a 90% accuracy, the reviews were scrapped from Etsy.com in order to test our model. Finally, I built a dashboard in which we can check the sentiments based on input given by the user or can check the sentiments of reviews scrapped from the website. What is CountVectorizer? CountVectorizer is a great tool provided by the scikit-learn library in Python. It is used to transform a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text. This is helpful when we have multiple such texts, and we wish to convert each word in each text into vectors (for using in further text analysis). CountVectorizer creates a matrix in which each unique word is represented by a column of the matrix, and each text sample from the document is a row in the matrix. The value of each cell is nothing but the count of the word in that particular text sample. What is TF-IDF Vectorizer? TF-IDF stands for Term Frequency - Inverse Document Frequency and is a statistic that aims to better define how important a word is for a document, while also taking into account the relation to other documents from the same corpus. This is performed by looking at how many times a word appears into a document while also paying attention to how many times the same word appears in other documents in the corpus. The rationale behind this is the following: a word that frequently appears in a document has more relevancy for that document, meaning that there is higher probability that the document is about or in relation to that specific word a word that frequently appears in more documents may prevent us from finding the right document in a collection; the word is relevant either for all documents or for none. Either way, it will not help us filter out a single document or a small subset of documents from the whole set. So then TF-IDF is a score which is applied to every word in every document in our dataset. And for every word, the TF-IDF value increases with every appearance of the word in a document, but is gradually decreased with every appearance in other documents. What is Plotly Dash? Dash is a productive Python framework for building web analytic applications. Written on top of Flask, Plotly.js, and React.js, Dash is ideal for building data visualization apps with highly custom user interfaces in pure Python. It's particularly suited for anyone who works with data in Python. Dash apps are rendered in the web browser. You can deploy your apps to servers and then share them through URLs. Since Dash apps are viewed in the web browser, Dash is inherently cross-platform and mobile ready. Dash is an open source library, released under the permissive MIT license. Plotly develops Dash and offers a platform for managing Dash apps in an enterprise environment. What is Web Scrapping? Web scraping is a term used to describe the use of a program or algorithm to extract and process large amounts of data from the web. Running the project Step 1: Download the dataset and extract the JSON data in your project folder. Make a folder filtered_chunks and run the data_extraction.py file. This will extract data from the JSON file into equal sized chunks and then combine them into a single CSV file called balanced_reviews.csv. Step 2: Run the data_cleaning_preprocessing_and_vectorizing.py file. This will clean and filter out the data. Next the filtered data will be fed to the TF-IDF Vectorizer and then the model will be pickled in a trained_model.pkl file and the Vocabulary of the trained model will be stored as vocab.pkl. Keep these two files in a folder named model_files. Step 3: Now run the etsy_review_scrapper.py file. Adjust the range of pages and product to be scrapped as it might take a long long time to process. A small sized data is sufficient to check the accuracy of our model. The scrapped data will be stored in csv as well as db file. Step 4: Finally, run the app.py file that will start up the Dash server and we can check the working of our model either by typing or either by selecting the preloaded scrapped reviews.

GitHub Repo https://github.com/Daniblit/Ensemble-Predictive-Model-Forecasting-AMGEN-stock-price-at-year-end-31s

Daniblit/Ensemble-Predictive-Model-Forecasting-AMGEN-stock-price-at-year-end-31s

The basis of this project involves analyzing Amgen future profitability based on its current business environment and financial performance. Technical Analysis, on the other hand, includes reading the charts and using statistical figures to identify the trends in the stock market. The dataset used for this analysis was downloaded from Yahoo finance for year 2009 to 2019. There are multiple variables in the dataset – date, open, high, low, volume. Adjusted close. The columns Open and Close represent the starting and final price at which the stock is traded on a day. High and Low represent the maximum, minimum price of the share for the day. The profit or loss calculation is usually determined by the closing price of a stock for the day, I used the adjusted closing price as the target variable. I downloaded data on the inflation rate, unemployment rate, Industrial Production Index, Consumer Price Index for All Urban Consumers: All Items and Real Gross Domestic Product as independent variables, Quarterly Financial Report: U.S. Corporations: Cash Dividends Charged to Retained Earnings All Manufacturing: All Nondurable Manufacturing: Chemicals: Pharmaceuticals and Medicines Industry, Producer Price Index by Industry: Pharmaceutical Preparation Manufacturing, 30-Year Treasury Constant Maturity Rate, and Producer Price Index by Industry: Pharmaceutical and Medicine Manufacturing Index. The independent variables are economic parameters which was obtained from Federal Reserve Economic Data (FRED) website. Methodology 1. Linear Regression: The linear regression model returns an equation that determines the relationship between the independent variables and the dependent variable. I used linear regression tool in Alteryx with ARIMA tool to forecast the stock prices for the year. The algorithm was trained with the historical data to see how the variables impact on the dependent variable. The test data was used to predict the adjusted closing price for the year and predicted a stock price of $193.38. 2. Support Vector Machines (SVM): Support Vector Networks (SVN), are a popular set of supervised learning algorithms originally developed for classification (categorical target) problems and can be used for regression (numerical target) problems. SVMs are memory efficient and can address many predictor variables. This model finds the best equation of one predictor, a plane (two predictors) or a hyperplane (three or more predictors) that maximally separates the groups of records, based on a measure of distance into different groups based on the target variable. A kernel function provides the measure of distance that causes to records to be placed in the same or different groups and involves taking a function of the predictor variables to define the distance metric. I used the SVM tool in Alteryx with ARIMA tool to forecast the stock prices for the year and predicted a stock price of $189.44. 3. Spline Model: The Spline Model tool was used because it provides the multivariate adaptive regression splines (or MARS) algorithm of Friedman. This statistical learning model self-determines which subset of fields best predict a target field of interest and can capture highly nonlinear relationships and interactions between fields. I used the Spline tool in Alteryx with ARIMA tool to forecast the stock prices for the year and predicted a stock price of $201.84. The results from the models was weighted by comparing the RMSE of each model. A lower RMSE indicates that the model’s predictions were closer to the actual values. However, a simpler model with the same RMSE as a more complex model is generally better, as simpler models are less likely to be overfit. Though the Spline model had a lower RMSE, the Linear Regression model had fewer variables. Thus, we combined the 3 models with the ARIMA forecast in a model ensemble, which allows us to use the results of multiple models. The forecasted stock price is $197.99 with 1.5% increase for 31st December 2019. Apart from economic parameters, stock price is affected by the news about the company and other factors like demonetization or merger/demerger of the companies. There are certain intangible factors which can often be impossible to predict beforehand hence the model predicts that the stock price of Amgen will continue to rise except there is a drastic downturn of the company.

GitHub Repo https://github.com/ayush-ak-725/production-grade-rag

ayush-ak-725/production-grade-rag

A local rag pipeline using ollama server with qwen3:8B and chroma DB for vector store, we are using antiaging strategies and remedies doc to start with

GitHub Repo https://github.com/arpit3043/Extractive-Text-Summerization

arpit3043/Extractive-Text-Summerization

Summarization systems often have additional evidence they can utilize in order to specify the most important topics of document(s). For example, when summarizing blogs, there are discussions or comments coming after the blog post that are good sources of information to determine which parts of the blog are critical and interesting. In scientific paper summarization, there is a considerable amount of information such as cited papers and conference information which can be leveraged to identify important sentences in the original paper. How text summarization works In general there are two types of summarization, abstractive and extractive summarization. Abstractive Summarization: Abstractive methods select words based on semantic understanding, even those words did not appear in the source documents. It aims at producing important material in a new way. They interpret and examine the text using advanced natural language techniques in order to generate a new shorter text that conveys the most critical information from the original text. It can be correlated to the way human reads a text article or blog post and then summarizes in their own word. Input document → understand context → semantics → create own summary. 2. Extractive Summarization: Extractive methods attempt to summarize articles by selecting a subset of words that retain the most important points. This approach weights the important part of sentences and uses the same to form the summary. Different algorithm and techniques are used to define weights for the sentences and further rank them based on importance and similarity among each other. Input document → sentences similarity → weight sentences → select sentences with higher rank. The limited study is available for abstractive summarization as it requires a deeper understanding of the text as compared to the extractive approach. Purely extractive summaries often times give better results compared to automatic abstractive summaries. This is because of the fact that abstractive summarization methods cope with problems such as semantic representation, inference and natural language generation which is relatively harder than data-driven approaches such as sentence extraction. There are many techniques available to generate extractive summarization. To keep it simple, I will be using an unsupervised learning approach to find the sentences similarity and rank them. One benefit of this will be, you don’t need to train and build a model prior start using it for your project. It’s good to understand Cosine similarity to make the best use of code you are going to see. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. Since we will be representing our sentences as the bunch of vectors, we can use it to find the similarity among sentences. Its measures cosine of the angle between vectors. Angle will be 0 if sentences are similar. All good till now..? Hope so :) Next, Below is our code flow to generate summarize text:- Input article → split into sentences → remove stop words → build a similarity matrix → generate rank based on matrix → pick top N sentences for summary.

GitHub Repo https://github.com/gawrkai/django-opensearch-starter

gawrkai/django-opensearch-starter

Django + OpenSearch product search API with tag filtering and semantic vector support

GitHub Repo https://github.com/teuwa/Leap_StepMotorcontrol

teuwa/Leap_StepMotorcontrol

Starting project on using the Leap SDK with Java or Python to control a step motor. Hope to advance to Multiple motors with each directional vector of the hand then move to movement of the fingers. Final product would be to have an Beaglebone Black or a Raspberry Pi using the Leap to control a motorized hand.