Computing with Language: Texts and Words
We’re all very familiar with text, since we read and write it every day. Here we will treat text as raw data for the programs we write, programs that manipulate and analyse it in a variety of interesting ways. But before we can do this, we have to get started with the Python interpreter. The volume of data is increasing at an exponential rate: almost all kinds of institutions, organizations, and businesses now store their data electronically, and a huge amount of text flows over the internet in the form of digital libraries, repositories, and other textual sources such as blogs, social media networks, and e-mails. Finding meaningful patterns and trends in this large volume of data, and extracting valuable knowledge from it, is a challenging task, and traditional data mining tools are poorly suited to textual data because extracting information from it takes considerable time and effort.
OBJECTIVES:
• To extract text from the Anaconda cloud
• To analyse the extracted text
• To learn the commands used to analyse the text in all possible ways and obtain meaningful output
• To understand the importance of analysing text
• To understand how the Python interpreter works and which algorithms are used to analyse text
• To plot a dispersion plot from the analysed text
Text mining
Text mining, also referred to as text data mining and roughly equivalent to text analytics, is the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods.
Getting Started with Python
One of the friendly things
about Python is that it allows you to type directly into the interactive interpreter—the
program that will be running your Python programs. You can access the Python
interpreter using a simple graphical interface called the Interactive
Development Environment (IDLE). On a Mac you can find this under Applications→
MacPython, and on Windows under All Programs→Python. Under Unix you can run
Python from the shell by typing idle (if this is not installed, try typing
python). The interpreter will print a blurb about your Python version; simply check which version you are running (the NLTK book assumes Python 2.4 or 2.5, but here I am using Python 3.6 with Anaconda's Spyder).
The >>> prompt
indicates that the Python interpreter is now waiting for input. When
copying examples from this book,
don’t type the “>>>” yourself.
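For example, if you type a simple expression at the prompt and press Enter, the interpreter evaluates it and prints the answer on the next line (a minimal illustration of the prompt, not a command you need for the rest of this post):
>>> 1 + 5 * 2 - 3
8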
Getting Started with NLTK
Before going further
you should install NLTK, downloadable for free from http://www.nltk.org/.
Follow the instructions there to download the version required for your platform.
Once you’ve installed
NLTK, start up the Python interpreter as before, and install the data required
for the book by typing the following two commands at the Python prompt, then
selecting the book collection as shown
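For reference, the two commands in question are the standard ones for opening the NLTK downloader (the window that appears may look slightly different depending on your platform):
>>> import nltk
>>> nltk.download()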
Downloading the NLTK Book Collection: Browse the available packages
using nltk.download(). The Collections tab on the
downloader shows how the packages are grouped into sets, and you should select
the line labeled book to obtain all data required for the examples and
exercises in this book. It consists of about 30 compressed files requiring
about 100Mb disk space. The full collection of data (i.e., all in the
downloader) is about five times this size (at the time of writing)
and continues to expand.
Once
the data is downloaded to your machine, you can load some of it using the
Python interpreter. The first step is to type a special command at the Python
prompt, which tells the interpreter to load some texts for us to explore: from
nltk.book import *. This says “from NLTK’s book module, load all items.” The book
module contains all the data you will need as you read this chapter. After
printing a welcome message, it loads the text of several books (this will take
a few seconds). Here’s the command again, together with the output that you
will see. Take care to get spelling and punctuation right, and remember that
you don’t type the >>>.
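The command and the kind of output to expect look roughly like this (the exact wording of the welcome message may vary between NLTK versions):
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
...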
Any time we want to find out about these texts, we just have to
enter their names at
the Python prompt:
>>> text1
<Text: Moby Dick by Herman Melville 1851>
>>> text2
<Text: Sense and Sensibility by Jane Austen 1811>
>>>
Now that we can use the Python interpreter, and have some data to
work with, we are ready to get started.
Searching Text
There are many ways to examine the context of a text apart from
simply reading it. A concordance view shows us every occurrence of a given
word, together with some context. Here we look up the word monstrous in Moby
Dick by entering text1 followed by a period, then the term concordance, and
then placing "monstrous" in parentheses:
>>> text1.concordance("monstrous")
Try searching for other words; to save re-typing, you might be
able to use up-arrow, Ctrl-up-arrow, or Alt-p to access the previous command
and modify the word being searched. You can also try searches on some of the
other texts we have included. For example, search Sense and Sensibility for the
word affection, using text2.concordance("affection"). Search the
book of Genesis to find out how long some people lived, using:
text3.concordance("lived"). You could look at text4, the Inaugural
Address Corpus, to see examples of English going back to 1789, and search for
words like nation, terror, god to see how these words have been used
differently over time. We’ve also included text5, the NPS Chat Corpus: search
this for unconventional words like im, ur, lol. (Note that this corpus is
uncensored!)
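Written out in full, those searches look like this (remember not to type the >>>, and note there is no space before the parentheses):
>>> text2.concordance("affection")
>>> text3.concordance("lived")
>>> text4.concordance("terror")
>>> text5.concordance("lol")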
Once you’ve spent a little while examining these texts, we hope
you have a new sense of the richness and diversity of language. Next you will
learn how to access a broader range of text, including text in languages other
than English.
A concordance permits us to see words in context. For example, we
saw that monstrous occurred in contexts such as the ___ pictures and the ___
size. What other words appear in a similar range of contexts? We can find out
by appending the term similar to the name of the text in question, then
inserting the relevant word in parentheses:
>>>
text1.similar("monstrous")
>>>
text2.similar("monstrous")
Observe that we get different results for different texts. Austen
uses this word quite differently from Melville; for her, monstrous has positive
connotations, and sometimes functions as an intensifier like the word very.
The term common_contexts allows us to examine just the contexts
that are shared by two or more words, such as monstrous and very. We have to
enclose these words by square brackets as well as parentheses, and separate
them with a comma:
>>>
text2.common_contexts(["monstrous", "very"])
DISPERSION PLOT:
It is one thing to automatically detect that a particular word
occurs in a text, and to display some words that appear in the same context.
However, we can also determine the location of a word in the text: how many
words from the beginning it appears. This positional information can be
displayed using a dispersion plot.
Each stripe represents an instance of a word, and each row represents the
entire text.
Term frequency
Suppose we have a set of English text documents and wish to determine which document is most relevant to the query "the brown cow". A simple way to start out is by eliminating documents that do not contain all three words "the", "brown", and "cow", but this still leaves many documents. To further distinguish them, we might count the number of times each term occurs in each document; the number of times a term occurs in a document is called its term frequency. However, when the lengths of documents vary greatly, adjustments are often made (see the definitions below).
The first form of term weighting is due to Hans Peter Luhn (1957) and is based on the Luhn Assumption:
•The weight of a term that occurs in a document is simply proportional to the term frequency.
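As a small illustration of raw term frequency (a plain-Python sketch, not an NLTK function), the query terms can simply be counted in each document of a toy collection:

# Toy documents and the query terms from the example above
docs = [
    "the brown cow jumped over the brown fence",
    "the quick brown fox",
    "the cat sat on the mat",
]
query = ["the", "brown", "cow"]

# Raw term frequency: how many times each query term occurs in each document
for d in docs:
    words = d.split()
    print({t: words.count(t) for t in query})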
Inverse document frequency
Because the term "the" is so common, term frequency will tend to incorrectly emphasize documents which happen to use the word "the" more frequently, without giving enough weight to the more meaningful terms "brown" and "cow". The term "the" is not a good keyword to distinguish relevant and non-relevant documents and terms, unlike the less common words "brown" and "cow". Hence an inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.
Karen Spärck Jones (1972) conceived a statistical interpretation of term specificity called Inverse Document Frequency (IDF), which became a cornerstone of term weighting:
•The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs.
tf–idf is the product of two statistics, term frequency and inverse document frequency. Various ways for determining the exact values of both statistics exist.
Term frequency
In the case of the term frequency tf(t,d), the simplest choice is to use the raw count of a term in a document, i.e. the number of times that term t occurs in document d. If we denote the raw count by f(t,d), then the simplest tf scheme is tf(t,d) = f(t,d). Other possibilities include:
• Boolean "frequencies": tf(t,d) = 1 if t occurs in d and 0 otherwise;
• term frequency adjusted for document length: f(t,d) ÷ (number of words in d);
• logarithmically scaled frequency: tf(t,d) = 1 + log f(t,d), or zero if f(t,d) is zero;
• augmented frequency, to prevent a bias towards longer documents, e.g. the raw frequency divided by the raw frequency of the most frequently occurring term in the document: tf(t,d) = 0.5 + 0.5 × f(t,d) ÷ max{ f(t′,d) : t′ ∈ d }.
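The variants above can be written out as small Python functions (a sketch that follows the definitions just listed; f stands for the raw count f(t,d) and doc_len for the number of words in the document):

import math

def tf_raw(f):
    # raw count of the term in the document
    return f

def tf_boolean(f):
    # 1 if the term occurs at all, 0 otherwise
    return 1 if f > 0 else 0

def tf_length_adjusted(f, doc_len):
    # raw count divided by the number of words in the document
    return f / doc_len

def tf_log(f):
    # logarithmically scaled frequency, zero if the term is absent
    return 1 + math.log(f) if f > 0 else 0

def tf_augmented(f, max_f):
    # raw count divided by the raw count of the most frequent term in the document
    return 0.5 + 0.5 * f / max_f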
Inverse document frequency:
The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.
In symbols, idf(t, D) = log( N / |{ d ∈ D : t ∈ d }| ), with
• N: the total number of documents in the corpus, N = |D|;
• |{ d ∈ D : t ∈ d }|: the number of documents in which the term t appears (i.e., tf(t,d) ≠ 0). If the term is not in the corpus, this leads to a division by zero, so it is common to adjust the denominator to 1 + |{ d ∈ D : t ∈ d }|.
Term frequency–Inverse document frequency
Then tf–idf is calculated as tfidf(t, d, D) = tf(t, d) × idf(t, D).
A high weight in tf–idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. Since the ratio inside the idf's log function is always greater than or equal to 1, the value of idf (and tf-idf) is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf-idf closer to 0.
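Putting the pieces together, a minimal tf–idf computation over a toy collection might look like this (a plain-Python sketch of the formulas above, not an NLTK routine; it reuses the length-adjusted tf scheme):

import math

def tf(term, words):
    # term frequency adjusted for document length
    return words.count(term) / len(words)

def idf(term, docs):
    # log of (total documents / documents containing the term); the denominator
    # is often adjusted to 1 + count to avoid division by zero for unseen terms
    n_containing = sum(1 for words in docs if term in words)
    return math.log(len(docs) / n_containing) if n_containing else 0.0

def tfidf(term, words, docs):
    # the product of the two statistics
    return tf(term, words) * idf(term, docs)

docs = [d.split() for d in [
    "the brown cow jumped over the brown fence",
    "the quick brown fox",
    "the cat sat on the mat",
]]

for term in ["the", "brown", "cow"]:
    print(term, [round(tfidf(term, words, docs), 3) for words in docs])

# "the" scores 0 in every document because it appears in all of them, while
# "brown" and "cow" receive positive weight only where they actually occur.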
Returning to the dispersion plot described above, you can produce one in Python with the following command:
>>>text4.dispersion_plot(["citizens",
"democracy", "freedom", "duties",
"America"])
INTERPRETATION:
We will see some striking patterns of word usage over the last 220
years (in an artificial text constructed by joining the texts of the Inaugural
Address Corpus end-to-end). You can produce this plot with the command shown above. You might
like to try more words (e.g., liberty, constitution) and different texts. Can
you predict the dispersion of a word before you view it? As before, take care
to get the quotes, commas, brackets, and parentheses exactly right.
Important: You need to have Python’s NumPy and Matplotlib packages installed
in order to produce the graphical plots used. Please see http://www.nltk.org/
for installation instructions.
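Both packages are included in the Anaconda distribution used here; on a plain Python installation they can usually be added from the command line with pip:
pip install numpy matplotlib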
CONCLUSION:
The huge volume of text-based data now available needs to be examined in order to extract valuable information. Text mining techniques make it possible to analyse interesting and relevant information effectively and efficiently in large amounts of unstructured data. Specific patterns and sequences are applied in order to extract useful information, eliminating irrelevant details for predictive analysis. Selecting the right techniques and tools helps make the text mining process easy and efficient. Knowledge integration, varying concept granularity, multilingual text refinement, and natural language processing ambiguity remain major issues and challenges in the text mining process. In future work, we will focus on designing algorithms that help to resolve the issues presented here.
Author: G POONURAJ NADAR