
Understanding Python with NLP
Python is a high-level, object-oriented programming language that has experienced a meteoric rise in popularity. It is an open source programming language, meaning anyone with a working computer can download and start using Python. Python's syntax is simple, which aids code readability and eases debugging, and its support for modules encourages modularity and scalability.
In addition, it has many other features that add to its appeal and make it an extremely popular language in the developer community. A drawback often attributed to Python is its relatively slow execution speed compared to compiled languages. However, for many tasks Python's performance is comparable to that of other languages, and it can be improved considerably by employing efficient programming techniques or by using libraries built on top of compiled languages.
If you are a Python beginner, you may consider downloading the Python Anaconda distribution (https://www.anaconda.com/products/individual), which is a free and open source distribution for scientific computing and data science. Anaconda ships with a number of very useful libraries and allows very convenient package management (installing/uninstalling packages). It also ships with multiple Integrated Development Environments (IDEs), such as Spyder and Jupyter Notebook, which are some of the most widely used development environments due to their versatility and ease of use. Once downloaded, the Anaconda suite can be installed easily. You can refer to the installation documentation for this (https://docs.anaconda.com/anaconda/install/).
After installing Anaconda, you can use the Anaconda Prompt (this is similar to the Command Prompt, but it lets you run Anaconda commands) to install any Python library using a package manager. pip and conda are the two most popular package managers, and you can use either of them to install libraries.
The following is a screenshot of the Anaconda Prompt with the command to install the pandas library using pip:

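In text form, the command is as follows (the conda equivalent, conda install pandas, works just as well):
pip install pandas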
Now that we know how to install libraries, let's explore the simplicity of Python in terms of carrying out reasonably complex tasks.
We have a CSV file containing flight arrival and departure data for major US airports for July 2019. The data was sourced from the US Department of Transportation website (https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236).
The file extends to more than half a million rows. The following is a partial snapshot of the dataset:

We want to find out which airport had the longest average delay in terms of flight departure. This task can be completed using the following three lines of code. The pandas library can be installed using pip, as discussed previously:
import pandas as pd
data = pd.read_csv("flight_data.csv")
data.groupby("ORIGIN")["DEP_DELAY"].mean().idxmax()
Here's the output:
Out[15]: 'PPG'
So, it turns out that a remote airport (Pago Pago International Airport) in American Samoa had the longest average departure delay recorded in July 2019. As you can see, a relatively complex task was completed using only three lines of code. The simple-looking, almost English-like code helped us read a large dataset and perform a quantitative analysis in a fraction of a second. This sample code also showcases the power of Python's modular programming.
In the first line of the code, we imported the pandas library, which is one of the most important libraries of the Python data science stack. Please refer to the pandas documentation page (https://pandas.pydata.org/docs/) for more information; it is quite helpful and easy to follow. By importing the pandas library, we were able to avoid writing code to read the CSV file line by line and parse the data. It also let us use pre-built pandas functions such as idxmax(), which returns the index of the maximum value along a row or column of a DataFrame (a pandas object that stores data in tabular format). Using these functions significantly reduced the effort needed to get the required information.
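To make the roles of groupby() and idxmax() concrete, here is a minimal, self-contained sketch using a toy DataFrame; the airport codes and delay values are made up purely for illustration:
import pandas as pd
# Toy data: one row per flight, with the origin airport and departure delay in minutes
toy = pd.DataFrame({
    "ORIGIN": ["JFK", "JFK", "LAX", "LAX", "PPG"],
    "DEP_DELAY": [5, 15, 0, 10, 90],
})
# Average delay per airport, then the index (airport code) of the largest average
avg_delay = toy.groupby("ORIGIN")["DEP_DELAY"].mean()
print(avg_delay.idxmax())  # prints 'PPG' in this made-up example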
Other programming languages have powerful libraries too, but the difference is that Python libraries tend to be very easy to install, import, and deploy. In addition, the large, open source developer community of Python translates into a vast number of very useful libraries being made available regularly. It's no surprise, then, that Python is particularly popular in the data science and machine learning domain and that the most widely used Machine Learning (ML) and data science libraries are either Python-based or Python wrappers. Since NLP draws a lot from the data science, ML, and Deep Learning (DL) disciplines, Python is very much becoming the lingua franca for NLP as well.
Python's utility in NLP
Learning a new language is not easy. For an average person, it can take months or even years to attain intermediate-level fluency in a new language. Gaining confidence in a language requires understanding its syntax (grammar), memorizing its vocabulary, and so on. Likewise, it is quite challenging for computers to learn natural language, since it is impractical to code every single rule of that language.
Let's assume we want to build a virtual assistant that reads queries submitted by a website's users and then directs them to the appropriate section of the website. Let's say the virtual assistant receives a request stating, "How do we change the payment method and payment frequency?"
If we want to train our virtual assistant the human way, then we will need to upload an English dictionary into its memory (the easy part), find a way to teach it English grammar (parts of speech, clauses, sentence structure, and so on), and teach it logical interpretation. Needless to say, this approach is going to require a herculean effort. However, what if we could transform the sentence into mathematical objects so that the computer can apply mathematical or logical operations and make some sense out of it? That mathematical construct could be a vector, a matrix, and so on.
For example, what if we assume an N-dimensional space in which each dimension (axis) corresponds to a word from the English vocabulary? With this, we can represent the preceding statement as a vector in that space, with its coordinate along each axis being the count of the word representing that axis. So, for the given sentence, the sentence vector's magnitude along the payment axis will be 2, along the frequency axis it will be 1, and so on. The following is some sample code we can use to achieve this vectorization in Python. We will use the scikit-learn library to perform the vectorization, which can be installed by running the following command in the Anaconda Prompt:
pip install scikit-learn
Using the CountVectorizer class from Python's scikit-learn library, we can vectorize the preceding sentence and generate the output matrix containing the vector (we will go into the details of this in subsequent chapters):
from sklearn.feature_extraction.text import CountVectorizer
# Wrap the query in a list because fit_transform() expects an iterable of documents
sentence = ["How to change payment method and payment frequency"]
# Remove common English stop words ('how', 'to', 'and') before counting word occurrences
vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit_transform(sentence).todense()
Here is the output:
matrix([[1, 1, 1, 2]], dtype=int64)
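If you want to check which column of this matrix corresponds to which word, you can inspect the vocabulary the vectorizer learned. A quick way to do this (assuming a reasonably recent scikit-learn release that provides get_feature_names_out()) is as follows:
print(vectorizer.get_feature_names_out())
# Expected output: ['change' 'frequency' 'method' 'payment'], matching the counts [1, 1, 1, 2]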
This vector can now be compared with other sentence vectors in the same N-dimensional space and we can derive some sort of meaning or relationship between these sentences by applying vector principles and properties. This is an example of how a sentence comprehension task could be transformed into a linear algebra problem. However, as you may have already noticed, this approach is computationally intensive as we need to transform sentences into vectors, apply vector principles, and perform calculations. While this approach may not yield a perfectly accurate outcome, it opens an avenue for us to explore by leveraging mathematical theorems and established bodies of research.
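As a brief sketch of what applying vector principles can look like in practice, the following compares our query against a second, hypothetical sentence using cosine similarity; the second sentence is invented here purely for illustration, and the exact similarity value will depend on the sentences you choose:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# The user's query and a hypothetical candidate sentence to compare it against
sentences = [
    "How to change payment method and payment frequency",
    "Update your payment method",
]
# Vectorize both sentences in the same vector space
vectorizer = CountVectorizer(stop_words='english')
vectors = vectorizer.fit_transform(sentences)
# A cosine similarity close to 1 means the two sentences share many of the same words
print(cosine_similarity(vectors[0], vectors[1]))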
Expecting humans to use this approach for sentence comprehension may be impractical, but computers can do these tasks fairly easily, and that's where programming languages such as Python become very useful in NLP research. Please note that the example in this section is just one example of transforming an NLP problem into a mathematical construct in order to facilitate processing. There are many other methods that will be discussed in detail in this book.