By Harshit Tyagi, Data Science Instructor | Mentor | YouTuber


Building a Structured Financial Newsfeed Using Python, SpaCy and Streamlit

One of the most interesting and widely used applications of NLP is Named Entity Recognition (NER).

Getting insights from raw and unstructured data is of vital importance. Uploading a document and getting the important bits of information from it is called information retrieval.

Information retrieval has long been a major challenge in NLP, and NER (along with NEL, Named Entity Linking) is used in several domains (finance, drugs, e-commerce, etc.) for information retrieval purposes.

In this tutorial post, I’ll show you how you can leverage NEL to develop a custom stock market news feed that lists down the buzzing stocks on the internet.

 

Pre-requisites

 
 
There are no hard pre-requisites. Some familiarity with Python and the basic tasks of NLP, like tokenization, POS tagging, and dependency parsing, will help.

I’ll cover the important bits in more detail, so even if you’re a complete beginner you’ll be able to wrap your head around what’s going on.

So, let’s get on with it. Follow along and you’ll have a minimal stock news feed that you can start researching.

 

Tools/setup you’ll need:

 
 

  1. Google Colab for initial testing and exploration of data and the SpaCy library.
  2. VS Code(or any editor) to code the Streamlit application.
  3. Source of stock market information(news) on which we’ll perform NER and later NEL.
  4. A virtual Python environment (I am using conda) along with libraries like Pandas, spaCy, Streamlit, and spacy-streamlit (if you want to show some spaCy renders).

 

Objective

 
 
The goal of this project is to learn and apply Named Entity Recognition to extract important entities(publicly traded companies in our example) and then link each entity with some information using a knowledge base(Nifty500 companies list).

We’ll get the textual data from RSS feeds on the internet, extract the names of buzzing stocks, and then pull their market price data to test the authenticity of the news before taking any position in those stocks.

Note: NER may not be a cutting-edge research problem, but it has many applications in the industry.

Moving on to Google Colab for experimentation and testing:

 

Step 1: Extracting the trending stocks news data

 
 
To get some reliable authentic stock market news, I’ll be using Economic Times and Money Control RSS feeds for this tutorial but you can also use/add your country’s RSS feeds or Twitter/Telegram(groups) data to make your feed more informative/accurate.

The opportunities are immense. This tutorial should serve as a stepping stone to apply NEL to build apps in different domains solving different kinds of information retrieval problems.

If you go on to look at the RSS feed, it looks something like this:

https://economictimes.indiatimes.com/markets/rssfeeds/1977021501.cms


 

Our goal is to get the textual headlines from this RSS feed, and then we’ll use spaCy to extract the main entities from the headlines.

The headlines are present inside the <title> tag of the XML here.
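To make the structure concrete, here is a minimal sketch that pulls headline text out of a tiny RSS snippet. The snippet is made-up sample data (not taken from the real feed), and the sketch uses the standard library's ElementTree instead of BeautifulSoup so it runs without extra packages; the helper name item_titles is only illustrative.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the general shape of an RSS 2.0 feed; not real feed data.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Markets Feed</title>
    <item>
      <title>Example Corp shares rise after quarterly results</title>
      <link>https://example.com/a</link>
    </item>
    <item>
      <title>Sample Bank to raise funds via bonds</title>
      <link>https://example.com/b</link>
    </item>
  </channel>
</rss>"""


def item_titles(rss_text):
    """Return the text of the <title> that sits inside each <item>."""
    root = ET.fromstring(rss_text)
    return [item.findtext("title", default="").strip() for item in root.iter("item")]


titles = item_titles(SAMPLE_RSS)
print(titles)
```

Note that the feed itself also carries a channel-level <title>. Searching for every <title> tag in the document returns that one too, so the first result may be the feed's name rather than a headline.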

Firstly, we need to capture the entire XML document and we can use the requests library to do that. Make sure you have these packages installed in your runtime environment in colab.

You can run the following command to install almost any package right from a colab’s code cell:

!pip install <package_name>

 

Send a GET request at the provided link to capture the XML doc.

import requests
resp = requests.get("https://economictimes.indiatimes.com/markets/stocks/rssfeeds/2146842.cms")

 

Run the cell to check what you get in the response object.

It should give you a successful response with HTTP code 200 as follows:



Now that you have this response object, we can pass its content to the BeautifulSoup class to parse the XML document as follows:

from bs4 import BeautifulSoup
soup = BeautifulSoup(resp.content, features="xml")
headlines = soup.find_all('title')  # kept in `headlines` for the next step
headlines

 

This should give you all the headlines inside a Python list:



Image by author

 

Awesome, we have the textual data out of which we will extract the main entities(which are publicly traded companies in this case) using NLP.

It’s time to put NLP into action.

 

Step 2: Extracting entities from the headlines

 
 
This is the exciting part. We’ll be using a pre-trained core language model from the spaCy library to extract the main entities in a headline.

A little about spaCy and the core models.

spaCy is an open-source NLP library that processes textual data at high speed. It is widely used both in NLP research and in enterprise-grade applications at scale. It supports more than 64 languages and works well with both TensorFlow and PyTorch.

Talking about core models, spaCy has two major classes of pretrained language models that are trained on different sizes of textual data to give us state-of-the-art inferences.

  1. Core Models — for general-purpose basic NLP tasks.
  2. Starter Models — for niche applications that require transfer learning. We can leverage the model’s learned weights to fine-tune our custom models without having to train the model from scratch.

Since our use case is basic in this tutorial, we are going to stick with the en_core_web_sm core model pipeline.

So, let’s load this into our notebook:

import spacy
nlp = spacy.load("en_core_web_sm")

 

Note: Colab already has this downloaded for us, but if you run it on your local system, you’ll have to download the model first using the following command:

python -m spacy download en_core_web_sm

en_core_web_sm is basically an English pipeline optimized for CPU which has the following components:

  • tok2vec: the token-to-vector layer. It turns each token into a vector that downstream components such as the tagger and parser can share. Tokenization itself is done by the tokenizer, which runs before the pipeline components.
  • tagger: adds part-of-speech (POS) metadata to each token, predicted by a statistical model. More in the documentation.
  • parser: the dependency parser establishes relationships among the tokens.
  • Other components include senter, ner, attribute_ruler, lemmatizer.

Now, to test what this model can do for us, I’ll pass a single headline through the instantiated model and then check the different parts of a sentence.

# make sure you extract the text out of <title> tags
processed_hline = nlp(headlines[4].text)

 

The pipeline performs all the tasks from tokenization to NER. Here we have the tokens first:



Image by author

 

You can look at the tagged part of speech using the pos_ attribute.



Image by author

 

Each token is tagged with some metadata. For example, Trade is a proper noun, Setup is a noun, : is punctuation, and so on. The entire list of tags is given here.

And then, you can look at how they are related by looking at the dependency graph using dep_ attribute:



Image by author

 

Here, Trade is a Compound, Setup is Root, Nifty is appos(Appositional modifier). Again, all the syntactic tags can be found here.
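Since the screenshots only show one attribute at a time, a small helper can list them side by side. This is a minimal sketch: the helper names token_table and print_token_table are illustrative, and the code only relies on the token attributes text, pos_, dep_, and head discussed above.

```python
def token_table(doc):
    """Collect (text, part of speech, dependency label, head text) per token.

    `doc` is expected to be a spaCy Doc, for example nlp("some headline").
    """
    return [(tok.text, tok.pos_, tok.dep_, tok.head.text) for tok in doc]


def print_token_table(doc):
    """Print one aligned row per token."""
    for text, pos, dep, head in token_table(doc):
        print(f"{text:<12} {pos:<6} {dep:<10} {head}")


# Usage with the pipeline loaded earlier in the article:
#   print_token_table(processed_hline)
```

The head column shows which token each word attaches to; the root of the sentence is its own head. If a tag is unfamiliar, spacy.explain(tag) returns a short description of it.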

You can also visualize the relationship dependencies among the tokens using the following displacy render() method:

spacy.displacy.render(processed_hline, style="dep", jupyter=True, options={'distance': 120})

 

which will give this graph:



Image by author

 

 

Entity extraction

 
 
And to look at the important entities of the sentence, you can pass style="ent" in the same code:



Image by author — I used another headline because the one we used above didn’t have any entities.

 

Different entities get different tags: the day is tagged DATE, and Glasscoat is tagged GPE, which covers countries, cities, and states. We are mainly looking for entities with the ORG tag, which covers companies, agencies, institutions, etc.

We are now capable of extracting entities from the text. Let’s get down to extracting the organizations from all the headlines using ORG entities.
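The extraction loop could look roughly like the sketch below. It is not the exact code from the notebook: the function name extract_orgs is illustrative, and it takes the loaded pipeline and a list of headline strings as arguments so it can be reused in the web app later.

```python
def extract_orgs(nlp, headline_texts):
    """Return ORG entity strings from all headlines, in order, without duplicates.

    `nlp` is a loaded spaCy pipeline; `headline_texts` is an iterable of strings.
    """
    companies = []
    seen = set()
    for text in headline_texts:
        doc = nlp(text)
        for ent in doc.ents:
            # keep only organizations, and only the first time we see each one
            if ent.label_ == "ORG" and ent.text not in seen:
                seen.add(ent.text)
                companies.append(ent.text)
    return companies


# Usage with the objects from the earlier steps:
#   companies = extract_orgs(nlp, [h.text for h in headlines])
```

The small model will miss some companies and mislabel others, so treat the output as candidates to be confirmed against the knowledge base in the next step.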

This will return a list of all the companies as follows:



Image by author

 

So easy, right?

That’s the magic of spaCy now!

The next step is to look up all these companies in a knowledge base to find the right stock symbol for each company, and then use a library like yfinance to pull their market details such as price, return, etc.

 

Step 3: Named Entity Linking

 
 
The goal of this project is to learn which stocks are buzzing in the market and to get their details on your dashboard.

We have the company names but in order to get their trading details, we’ll need the company’s trading stock symbol.

Since I am extracting the details and news of Indian Companies, I am going to use an external database of Nifty 500 companies(a CSV file).

For every company, we’ll look it up in the list of companies using pandas, and then we’ll capture the stock market statistics using the yfinance library.

Image by author
 

One thing that you should notice here is that I’ve added a “.NS” after each stock symbol before passing it to the Ticker class of the yfinance library. This is because Indian NSE stock symbols are stored with a .NS suffix in yfinance.
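The linking step can be sketched as a lookup from entity text to a suffixed ticker. Everything specific here is an assumption for illustration: the rows are made-up, the column names "Company Name" and "Symbol" are a guess at the CSV header, the prefix match is deliberately naive, and the standard library's csv module stands in for pandas so the sketch runs on its own.

```python
import csv
import io

# Made-up rows in the shape of a company list; the column names
# "Company Name" and "Symbol" are assumptions about the CSV header.
SAMPLE_KB = """Company Name,Symbol
Example Motors Ltd.,EXMOTORS
Sample Bank Ltd.,SAMPLEBANK
"""


def load_symbol_lookup(csv_text):
    """Map lower-cased company names to their stock symbols."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["Company Name"].strip().lower(): row["Symbol"].strip() for row in reader}


def link_entity(entity, lookup, suffix=".NS"):
    """Return the suffixed ticker of the first company whose name starts
    with the entity text, or None when nothing matches."""
    needle = entity.strip().lower()
    if not needle:
        return None
    for name, symbol in lookup.items():
        if name.startswith(needle):
            return symbol + suffix
    return None


lookup = load_symbol_lookup(SAMPLE_KB)
tickers = {ent: link_entity(ent, lookup) for ent in ["Example Motors", "Sample Bank", "Unknown Co"]}
print(tickers)
```

Entities that map to None are simply not in the knowledge base and can be skipped. Each remaining ticker string is what gets passed to the Ticker class. A prefix match is easy to fool (short entity names can match the wrong company), so a stricter or fuzzy match may be worth adding for real use.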

And the buzzing stocks would turn up in a dataframe like below:



Image by author

 

Voila! Isn’t this great? Such a simple yet useful app that could point you in the right direction with the right stocks.

Now to make it more accessible, we can create a web application out of the code that we have just written using Streamlit.

 

Step 4: Building a web app using Streamlit

 
 
It’s time to move to an editor and create a new project and virtual environment for the NLP application.

Getting started with Streamlit is super easy for such demo data applications. Make sure you have Streamlit installed:

pip install streamlit

Now, let’s create a new file called app.py and start writing functional code to get the app ready.

Import all the required libraries at the top.

import pandas as pd
import requests
import spacy
import streamlit as st
from bs4 import BeautifulSoup
import yfinance as yf

 

Add a title to your application:

st.title('Buzzing Stocks :zap:')

 

Test your app by running streamlit run app.py in your terminal. It should open up an app in your web browser.

I have added some extra functionality to capture data from multiple sources. Now, you can add an RSS feed URL of your choice into the application and the data will be processed and the trending stocks will be displayed in a dataframe.

To get access to the entire code base, you can check out my repository here:

 
GitHub – dswh/NER_News_Feed
 

You can add multiple styling elements, different data sources, and other types of processing to make it more efficient and useful.

My app in its current state looks like the image in the banner.

If you want to follow me step-by-step, watch me code this application here:

 

Next Steps!

 
 
Instead of picking a financial use case, you can pick any other application of your choice, such as healthcare, e-commerce, or research. All industries require documents to be processed and important entities to be extracted and linked. Try out another idea.

A simple idea is extracting all the important entities of a research paper and then creating a knowledge graph of it using the Google Search API.

Also, if you want to take the stock news feed app to another level, you can add some trading algorithms to generate buy and sell signals as well.

I encourage you to go wild with your imagination.

 

How you can connect with me!

 
 
If you liked this post and would like to see more of such content, you can subscribe to my newsletter or my YouTube channel where I’ll keep sharing such useful and quick projects that one can build.

If you’re someone who is just getting started with programming or want to get into data science or ML, you can check out my course at WIP Lane Academy.

Thanks to Elliot Gunn.

 
Bio: Harshit Tyagi is an engineer with amalgamated experience in web technologies and data science(aka full-stack data science). He has mentored over 1000 AI/Web/Data Science aspirants, and is designing data science and ML engineering learning tracks. Previously, Harshit developed data processing algorithms with research scientists at Yale, MIT, and UCLA.

Original. Reposted with permission.
