CRTC Hearing Text Analysis, Continued
This is a continuation of the text analysis performed on a series of transcripts published by the CRTC between April 11 and April 28, 2016.
This analysis will look at the following words:
- broadband
- fibre
- service
- digital literacy
Key Findings of the Analysis
BROADBAND
The word broadband and its variations appear with the greatest frequency in the transcript dated April 11.
Here is a KWIC concordance from April 11 that shows its use. View the complete file.
After removing stopwords and lemmatizing the text, broadband occurs a total of 1143 times across all of the transcripts.
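To make the counting step concrete, here is a minimal sketch of stopword removal, lemmatization, and a total count across transcripts. It is not the code used for this analysis: the stopword set, the lemma lookup table, and the sample sentences are all invented for illustration, and a real run would use a full stopword list and a proper lemmatizer.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny resources (assumptions, not the lists
# used in the original analysis).
STOPWORDS = {"the", "a", "of", "to", "and", "is", "in", "for"}
LEMMAS = {"broadbands": "broadband", "services": "service", "fibres": "fibre"}


def preprocess(text):
    """Lowercase, tokenize, drop stopwords, and map each token to its lemma."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [LEMMAS.get(t, t) for t in tokens if t not in STOPWORDS]


def count_word(transcripts, word):
    """Total occurrences of `word` across all preprocessed transcripts."""
    totals = Counter()
    for text in transcripts:
        totals.update(preprocess(text))
    return totals[word]


# Invented sample sentences standing in for the transcripts.
sample = [
    "Broadband is a basic service for the North.",
    "Access to broadband and to fibre services is uneven.",
]
print(count_word(sample, "broadband"))
```

Because the lemma table folds services into service, the same function reports a combined total for both forms.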
Here is a list of the most significant collocated words appearing with broadband using the log-likelihood ratio.
The word most significantly collocated with broadband is national, followed by strategy and access. Note that a word can appear twice in the following list. Because a collocate can occur both before and after broadband, duplicate entries in the list below must be identified and their totals combined. The list below is only a sample. Please refer to this file for the complete list, and calculate accordingly.
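Combining the duplicate entries can be sketched as follows. This assumes the complete list can be read as rows of (first word, second word, count); that layout, the function name, and the counts in the example are hypothetical and are not taken from the actual file.

```python
from collections import defaultdict


def merge_collocates(rows, node):
    """Sum the counts for each collocate of `node`, whether the collocate
    appeared before or after it in the bigram."""
    totals = defaultdict(int)
    for first, second, count in rows:
        if first == node:
            collocate = second
        elif second == node:
            collocate = first
        else:
            continue  # the row does not involve the node word at all
        totals[collocate] += count
    return dict(totals)


# Invented counts, for illustration only.
rows = [
    ("national", "broadband", 3),
    ("broadband", "national", 2),
    ("broadband", "strategy", 4),
]
print(merge_collocates(rows, "broadband"))
```

Summing makes sense for raw counts. If the list holds association scores rather than counts, the scores would need to be recomputed from the combined counts instead of added together.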
FIBRE
The word fibre and its variations appear with the greatest frequency in the transcript dated April 20. Fibre is by far the most common of these occurrences, with ‘fibres’ occurring only four times in total throughout all of the transcripts. Here is a KWIC concordance from April 20 that shows its use. View the complete file.
After removing stopwords and lemmatizing the text, fibre occurs a total of 476 times across all of the transcripts.
Here is a list of the most significant collocated words appearing with fibre using the log-likelihood ratio.
The word most significantly collocated with fibre is optic, followed by premise and build. The list below is a sample. Please refer to this file for the complete list, and calculate accordingly.
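For readers curious how a log-likelihood score for a pair such as fibre and optic might be computed, here is a rough sketch of the G² statistic over a 2×2 contingency table, the formulation commonly used for collocations. The tool used for this analysis may compute it differently, and the counts in the example are invented.

```python
import math


def log_likelihood(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 contingency table.

    k11: bigrams made of the node word and the collocate
    k12: bigrams with the node word but a different partner
    k21: bigrams with the collocate but a different partner
    k22: bigrams containing neither word
    """
    observed = ((k11, k12), (k21, k22))
    row_totals = (k11 + k12, k21 + k22)
    col_totals = (k11 + k21, k12 + k22)
    total = sum(row_totals)
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                expected = row_totals[i] * col_totals[j] / total
                g2 += o * math.log(o / expected)
    return 2 * g2


# Invented counts: a strongly associated pair and an independent pair.
print(log_likelihood(30, 5, 5, 1000))
print(log_likelihood(10, 10, 10, 10))
```

A pair that co-occurs no more often than chance would predict scores zero, and the score grows as the observed counts move away from the expected ones.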
SERVICE
The word service and its variations appear with the greatest frequency in the transcript dated April 18. Service is the more common of these occurrences, though ‘services’ occurs many times throughout the transcripts as well. Here is a KWIC concordance from April 18 that shows the use of service. View the complete file.
Here is another concordance from April 18 showing the use of services. View the complete file.
The differences in usage between service, which appears to occur primarily in the phrase service providers, and services, which occurs in a variety of contexts, might make a case against lemmatization.
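One way to check this impression before lemmatizing is to count which words follow each surface form separately. The sketch below uses an invented token sequence, not the transcripts, and the function name is hypothetical.

```python
from collections import Counter


def right_neighbours(tokens, word):
    """Count the words that immediately follow each occurrence of `word`."""
    return Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == word)


# Invented text, for illustration only.
tokens = (
    "service providers offer services to homes and "
    "service providers bill for services in the north"
).split()

print(right_neighbours(tokens, "service"))
print(right_neighbours(tokens, "services"))
```

If one form is dominated by a single neighbour while the other is spread across many, merging the two under one lemma hides that difference.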
Regardless, after removing stopwords and lemmatizing the text, service occurs a total of 2435 times across all of the transcripts.
Here is a list of the most significant collocated words appearing with service using the log-likelihood ratio.
The word most significantly collocated with service is provider, followed by basic and telecommunication. The list below is a sample. Please refer to this file for the complete list, and calculate accordingly.
DIGITAL LITERACY
Counting the frequency of a two-word phrase requires a slightly different approach: counting bigrams rather than individual words. The bigram digital literacy occurs a total of 100 times across all of the transcripts.
Counting the frequency of the bigram in each individual transcript takes a little more work. Since the text processing has reduced the transcripts to a comma-separated list of individual words, we will first count the frequency of the words as they appear separately in the transcripts. The highest separate frequency of digital and literacy occurs on April 21.
Counting the frequency of the bigrams for April 21 yields 45 occurrences of digital literacy as a word pair.
Counting the frequency of the bigrams for April 28 yields 10 occurrences of digital literacy as a word pair.
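The two-step count described above (separate word frequencies first, then the word pair) can be sketched like this. The sketch assumes each processed transcript is a single comma-separated string of words. The transcript names and contents are invented, so the numbers it prints are not the figures reported here.

```python
from collections import Counter


def parse_tokens(csv_text):
    """Split a comma-separated word list into clean tokens."""
    return [t.strip() for t in csv_text.split(",") if t.strip()]


def count_bigram(tokens, first, second):
    """Count how often `first` is immediately followed by `second`."""
    return sum(1 for a, b in zip(tokens, tokens[1:]) if (a, b) == (first, second))


# Hypothetical processed transcripts, for illustration only.
transcripts = {
    "day_a": "digital, literacy, program, digital, economy, digital, literacy",
    "day_b": "literacy, digital, service, digital, literacy",
}

for name, text in transcripts.items():
    tokens = parse_tokens(text)
    singles = Counter(tokens)
    pairs = count_bigram(tokens, "digital", "literacy")
    print(name, singles["digital"], singles["literacy"], pairs)
```

The separate counts only set an upper bound: the pair can occur no more often than the rarer of the two words, and word order matters, so literacy followed by digital is not counted.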
And finally, here is a sample of the concordance for digital literacy from April 21. It’s important to note that concordances can only be generated for an individual word. Since there are fewer instances of literacy than digital, the concordance was generated for literacy alone. That’s why a few of the results don’t include digital. View the complete file.
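If only the lines containing the full phrase are wanted, a single-word concordance can be filtered after the fact. Below is a minimal sketch of a keyword-in-context function with such a filter. It is not the tool used to produce the concordance above, and the token list is invented.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            lines.append((left, tok, right))
    return lines


def preceded_by(lines, word):
    """Keep only the hits whose left neighbour is `word`."""
    return [line for line in lines if line[0] and line[0][-1] == word]


# Invented tokens, for illustration only.
tokens = (
    "improve digital literacy skill adult literacy "
    "program support digital literacy"
).split()

hits = kwic(tokens, "literacy")
phrase_hits = preceded_by(hits, "digital")
print(len(hits), len(phrase_hits))
for left, word, right in phrase_hits:
    print(" ".join(left), f"[{word}]", " ".join(right))
```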