Assalamualaikum…
This fourth posting discusses the concordance and how to use it in text analysis. We were given this task to explain how concordance software can help us analyze a text, depending on the purpose of the text analysis.
What is a Concordance?
According to the Encyclopædia Britannica, a concordance is defined as, literally, agreement or harmony; hence, derivatively, a citation of parallel passages; and, specifically, an alphabetical arrangement of the words contained in a book with citations of the passages in which they occur.
A concordance’s function is basically to bring together, or ‘concord’, passages of text which show the use of a word. It is a type of index arrangement, working in a similar way to the verbal index found at the back of a textbook: it searches for instances of a word or phrase and returns each occurrence. Whereas an index at the back of a book lists words in alphabetical order but only refers to them, a concordance shows each instance of each word together with the context from which it came.
The term ‘concordance’ is usually applied to literary and linguistic studies, but it is an extremely useful tool that enables students to access a piece of text non-sequentially or to study the ways in which it uses language. Concordances are not only used for literary purposes, but also as cross-reference systems for computer programmers, enabling teams of programmers working together to keep track of all references to, for example, a variable name across all the files which make up a project.
A concordance is particularly useful for studying a piece of literature in terms of a particular word, phrase or theme. It will show exactly how often a word occurs, or even whether it occurs at all, and so can be extremely helpful in building up an idea of how different themes recur within a poem and how they relate to the rest of it. So if a question is asked about a certain theme or the use of certain words in a poem and their place within it, it is easy to see how often they come up by looking at the concordance. This is particularly easy with computer concordances: all you have to do is click on the relevant letter of the alphabet, a list of words will come up, and you can scroll down until you find the right one. From there, you can find all occurrences of that word in the poem in question.
So, a concordance is a bit like a computer: it will find things for you, but it will not do the thinking; you have to do that. What it will do is get you on the right track when it comes to deeper analysis of a piece of text. Most useful are interactive concordances, such as the one here in English. An interactive concordance will also find answers to specific queries and produce lists of all instances of words or phrases, but the advantage is the speed, and the fact that the concordance will find all words, which is not necessarily possible in manual textual concordancing.
Using Concordance for Content Analysis
Software for Content Analysis – A Review
Will Lowe
Introduction
Software for content analysis divides, according to its intended function, into three major categories. The first set of programs perform dictionary-based content analysis. They have the ‘basic handful’ of text analysis functions, involving word counting, sorting, and simple statistical tests. The basic handful are described in the next section. The second set contains development environments. These programs are designed to partially automate the construction of dictionaries, grammars, and other text analysis tools, rather than being analyzers themselves. Development environments are more similar to high-level text-specific programming languages than to freestanding content analysis packages. The third category contains annotation aids. While an annotation aid can often perform some automatic content analysis, it is intended more as an electronic version of the set of marginal notes, cross-references and notepad jottings that a researcher will generate when analyzing a set of texts by hand. The next section describes the basic handful of text analysis functions, and the rest of the paper provides brief descriptions of twenty-one content analysis programs. Some recommendations are made in the conclusion.
1.1 The Basic Handful
The basic handful of functions consists of word frequency counts and analysis, category frequency counts and analysis, and visualization.
Word Frequency Analysis
Word frequency analysis provides a list of all the words that occur in a text and the number of times they occur. More sophisticated methods split the text into subparts, e.g. chapters, and create frequency lists for each part. Lists can be compared either visually, or using a statistical test such as χ², to see if there are significantly more mentions of particular words in one part than another. Another common use for the subpart procedure is to compare different sources addressing the same substantive question, to measure how different their treatment of it is on the basis of the sorts of words they use. Statistically this procedure can sometimes be reasonable because the counts from one source are compared with the total counts for all words over all the sources; significant differences may then track differences of emphasis across sources. Some packages make use of synonym lists or lemmatize before the analysis in order to merge word counts. Lemmatization removes the grammatical structure from the surface form of a word, leaving only the stem; words are then counted as identical when they share a stem. For example, a lemmatizing frequency count would treat ‘steal’ and ‘stole’ as the same word. Lists of lemmas and synonyms are naturally language specific. Word frequency analysis is the simplest form of content analysis. In fact most operating systems (e.g. Unix/Linux, Mac OS X, and recent versions of Windows) have utilities to perform basic word counting and sorting built in.
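To make the procedure concrete, here is a minimal Python sketch of a word frequency comparison between two subparts, assuming scipy is available for the χ² test; the sample texts and function names are illustrative, not taken from the review.

import re
from collections import Counter
from scipy.stats import chi2_contingency  # assumes scipy is installed

def word_counts(text):
    # Lower-case the text, split on letter runs, and count each word.
    return Counter(re.findall(r"[a-z']+", text.lower()))

# Two hypothetical subparts of a document (e.g. two chapters).
part_a = word_counts("The tax rose. Taxes rose again while spending fell.")
part_b = word_counts("Spending rose. The budget and spending grew further.")

def compare_word(word, a, b):
    # 2x2 chi-square table: counts of the word vs. all other words,
    # in each of the two subparts.
    table = [[a[word], sum(a.values()) - a[word]],
             [b[word], sum(b.values()) - b[word]]]
    chi2, p, _, _ = chi2_contingency(table)
    return chi2, p

chi2, p = compare_word("spending", part_a, part_b)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")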
Category Frequency Analysis
Content analysis programs almost all allow the user to specify a dictionary. ‘Dictionary’ in this context means a mapping from a set of words or phrases to one word; the one word is the label of a substantive category, and the set describes the words or phrases that indicate the tokening of the category in text. As an example, the Linguistic Inquiry and Word Count (LIWC) dictionary maps the word set {ashes, burial*, buried, bury, casket*, cemet*, coffin*, cremat*, dead, death*, decay*, decease*, deteriorat*, die, died, dies, drown*, dying, fatal, funeral*, grave*, grief, griev*, kill*, mortal*, mourn*, murder*, suicid*, terminat*} to LIWC category 59, death. The asterisks are ‘wild-card’ characters telling the program to treat ‘cremating’, ‘cremated’ and ‘cremate’ as all matching cremat*, and thus all mapping to category 59. Category counts allow a slightly more sophisticated analysis because they allow the user to provide a more explicit model of latent content in text. The implicit model of text generation is that the author of the text has some message expressed in terms of categories, and that this message is ‘coded’ into English when she writes. Coding entails picking any one of a set of English words that represent that concept, perhaps constrained by grammatical or pragmatic criteria. If the content analyst can recover or construct the word set used by the author, it can be placed in a dictionary and used to decode other texts. According to the LIWC scheme the sentence “Her suicide caused him to consider his own mortality” refers to the categories of ‘death’ and ‘metaphysics’ twice, ‘social’ three times, and ‘causation’ once: Her–SOCIAL suicide–DEATH/METAPH caused–CAUSE him–SOCIAL to consider–COGMECH his–SOCIAL own mortality–DEATH/METAPH. But according to the implicit model of LIWC, “He thought of his own death only because she killed herself” is an equally good instantiation of the underlying content, because it tokens the same categories the same number of times. Of course many other sentences token these categories too, and many of them are quite unrelated in meaning. When a text is reduced to its category tokens with respect to some dictionary, the same statistical analysis can be performed as with word counts. For most applications of automated content analysis, a text is reduced to a vector of category counts. Different texts can be compared either within each category or, more usefully, by looking at high-dimensional distance measures between the complete vectors associated with each text. Most information retrieval programs, e.g. Google, make use of a similar vector representation of texts: each query is converted into a sparse category vector by coding it as if it were a very short text, and this vector is compared geometrically to all other available vectors to find the nearest, that is, most relevant, text to the query.
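As a rough illustration of the dictionary mechanism, the following Python sketch counts category tokens using shell-style wild-card matching; the word lists are a small excerpt of the LIWC sets quoted above, but the implementation itself is my own assumption, not LIWC’s actual code.

import re
from collections import Counter
from fnmatch import fnmatch  # shell-style wild-cards, e.g. 'cremat*'

# Toy dictionary mapping category labels to wild-card word patterns,
# using a few entries from the LIWC 'death' set quoted above.
DICTIONARY = {
    "DEATH": ["bury", "casket*", "cremat*", "dead", "death*", "die",
              "died", "dies", "kill*", "mortal*", "murder*", "suicid*"],
    "SOCIAL": ["her", "him", "his", "she", "he"],
}

def category_counts(text):
    # Reduce a text to a vector of category counts.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for word in words:
        for category, patterns in DICTIONARY.items():
            if any(fnmatch(word, p) for p in patterns):
                counts[category] += 1
    return counts

print(category_counts("Her suicide caused him to consider his own mortality"))
# Counter({'SOCIAL': 3, 'DEATH': 2})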
Visualization
When a text has been reduced to vector form, either by counting words or categories, it can be visualized. Two standard methods provided by most content analysis programs are clustering and multidimensional scaling. Cluster analysis is no doubt familiar, but multidimensional scaling bears some discussion. It appears that most scaling procedures packaged for content analysis perform metric rather than non-metric multidimensional scaling. This means that the programs are looking for the linear mapping (for visualization purposes it will be a plane) that passes through the vectors and captures most variation in their positions when they are projected onto it. Metric methods therefore enforce linear structure, which may or may not be reasonable. More computationally intensive methods are non-metric, and consider not the positions of the vectors but their distance ranking to one another. Non-metric methods attempt to preserve ranked distances in their mapping to the plane, and thus allow more non-linear structure to appear in the final visualization. Why does this difference matter? It might appear that visualization functions are an advantage in a content analysis program, and this may be true for preliminary data exploration. But researchers will most likely end up putting their data into a regular statistics package at some point, perhaps to get a more sophisticated statistical analysis. Since most modern statistics packages have very sophisticated visualization functions, the visualization will almost certainly be better performed there. This will also be desirable in the case where the content analysis package does not (or will not) document the exact clustering or visualization routine being performed.
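For readers who want to see the two kinds of scaling side by side, the sketch below uses scikit-learn’s MDS on some made-up count vectors; the data and parameter choices are assumptions for illustration only, not tied to any of the packages reviewed here.

import numpy as np
from sklearn.manifold import MDS

# Hypothetical category-count vectors for five texts.
rng = np.random.default_rng(0)
vectors = rng.poisson(3.0, size=(5, 10)).astype(float)

# Metric MDS: seeks an embedding in the plane that preserves the
# actual pairwise distances between the vectors.
metric_map = MDS(n_components=2, metric=True,
                 random_state=0).fit_transform(vectors)

# Non-metric MDS: preserves only the *ranking* of pairwise distances,
# allowing more non-linear structure to appear in the plane.
nonmetric_map = MDS(n_components=2, metric=False,
                    random_state=0).fit_transform(vectors)

print(metric_map.shape, nonmetric_map.shape)  # (5, 2) (5, 2)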
Other Basic Functions
Several programs can generate concordances, sometimes described as KWIC (‘key word in context’) analysis. The table below is a selection of lines from a small-window full concordance for the word ‘content’ in the paragraphs preceding this one.
— — Software for             | content | analysis divides according to
can perform dictionary based | content | analysis They have the
often perform some automatic | content | analysis it is intended
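Generating such a KWIC display is straightforward; the following is a minimal sketch in Python, with the window size and alignment chosen arbitrarily rather than modeled on any particular package.

import re

def kwic(text, keyword, window=4):
    # Print each occurrence of keyword with `window` words of context
    # on either side, aligned in key-word-in-context style.
    words = re.findall(r"\w+", text)
    for i, w in enumerate(words):
        if w.lower() == keyword.lower():
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            print(f"{left:>35} | {w} | {right}")

kwic("Software for content analysis divides, according to its intended "
     "function, into three major categories.", "content")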
Although computing concordances is not really a method of automated content analysis, it can be a very fruitful way to examine the data in the process of designing a content analysis. One example use would be to discover quickly, without having to read the entire text, that a particular word occurs only in a subset of its possible substantive roles, even when we might expect it to be more broadly distributed on purely linguistic grounds (e.g. that taxes are only mentioned when the text is talking about lowering them).
Concordances are also a useful representation for discovering sets of words that co-occur reliably with the keyword, and thus might be natural choices for dictionary word sets.
Finally, with the addition of some minor annotation capability, the researcher may manually code each instance as being of a particular category, either as part of a ‘training set’ for subsequent automated analysis, or simply as quick confirmation that, say, 75% of mentions are of a particular type. The principal advantage of concordances in all these roles is that they lighten the reading burden of the researcher, so she can work with a larger volume of text.
2 Content Analysis Programs
This section describes twenty-one content analysis packages. They are divided into dictionary-based programs and development environments. A final section describes the two most popular annotation aids. Where possible each section states the platforms that the software runs on, the licensing scheme, the accessibility of the code-base and whether it is able to work with non-English language text.
Licensing cost has been distinguished from the accessibility of the code-base because although many packages are free to use, their code is not available. Being able to see the code is useful if one needs to know exactly what is going on when the program performs more complex analysis. In this respect the software is effectively proprietary. However, since there is no tradition among Windows and Mac users to make their code available even when the software is written to be given away, it may only be convention that makes the code-base inaccessible. That is, individual authors of free software may happily provide code details on request. This will certainly not be the case for the commercial packages.
2.1 Dictionary-based Content Analysis
CATPAC
—Homepage: http://www.terraresearch.com/catpac.cfm
—Operating Systems: Windows
—License:
Commercial $595
Academic $295
Student $49
—Code base: Proprietary (executable only)
—Languages: English (ASCII only)
Despite the bold claims of the manufacturer:
“CATPAC is an intelligent program that can read any text and summarize its main ideas. It needs no pre-coding and makes no linguistic assumptions.”
CATPAC performs only the basic handful of functions. Visualization involves cluster analysis and multidimensional scaling. Cluster analysis can be interactive. CATPAC also apparently allows three-dimensional visualizations with appropriately colored glasses.
CATPAC seems adequate to the basic handful. However, the user interface is weak and the manual (http://www.galileoco.com/pdf/catman.pdf) is atrocious.
Computer Programs for Text Analysis
—Homepage: http://www.dsu.edu/~johnsone/ericpgms.html
—Operating Systems: MS-DOS
—License: Freeware
—Codebase: Proprietary (executable only)
—Languages: English (ASCII only)
These are a set of utility programs run from the DOS command line. They cover the basic handful except for visualization, and are designed primarily for literary analysis.
Concordance
—Homepage: http://www.rjcw.freeserve.co.uk
—Operating Systems: Windows
—License: $89 + $10 handling fee. $40 per subsequent license.
—Codebase: Proprietary (executable only)
—Languages: English, Chinese (see http://deall.ohio-state.edu/chan.9/conc/concordance.htm)
Concordance is marketed as a way of producing and publishing concordances for literary texts (see for example http://www.dundee.ac.uk/english/wics/wics.htm). However, the program also performs a superset of the basic handful of word analysis and category analysis functions, including regular expressions and lemmatization. (Lemmatization involves reducing all instances of a word to its stem.) There appears to be no visualization option. The most appealing aspect of Concordance is its potential for processing text in languages other than English (see http://deall.ohio-state.edu/chan.9/conc/concordance.htm for more detail).
It is not clear from the manufacturer’s information whether the reason Concordance can deal with Chinese is that it processes all text in Unicode, or that it has been specifically designed for Chinese scripts. If the underlying processing model uses Unicode, then it is reasonable to expect support for other languages. If, on the other hand, it is an ad-hoc extension, then Concordance is likely to be less generally useful.
Understanding Concordance as a Content Analysis Tool
Concordance software can be used to perform text analysis on a wide variety of texts. Most analyses are carried out to find the word order and the kinds of sentence structure used in each text. There is a lot of such software that can be ordered, or downloaded online for free.
Concordances are important nowadays in many areas of use. For example, in the education sector, it is easy for students to carry out an analysis for their assignments or project papers; they do not have to search manually, but can simply click search in the concordance software and everything that is needed will come out in seconds. The same goes for other sectors, especially in managing and organizing data.
-This task was done in pair with Syahirah Bt Said