Readn

A very quick and dirty text reading difficulty analysis library

This library provides a few different algorithms for estimating how difficult a given text is to read. It is not accurate enough to be used in research; it is a good estimation tool, nothing more. For basically every index/level it produces, you want to aim for roughly 7-10 in non-academic work intended for adults, though this is not a hard rule. The library is also single-threaded, meaning it is quite slow on large texts; this will be addressed at some point when I get time.

Warning

The effectiveness of this library is massively limited by its accuracy in:

  • counting sentences; this uses a heuristic, since real text often breaks standard grammatical formatting. It also makes all of the measurements worse for text like poetry, where sentences often don't end with punctuation.
  • counting syllables; a heuristic is used here as well, since there is no known perfect way of doing it.

The error percentages go up the further the text strays from simple plaintext with standard sentence formatting (i.e. sentences ending with proper punctuation); a minimal sketch of the kind of heuristics involved is shown after the list below. Likewise, there are major limitations in what types of text can be analyzed:

  • Unicode characters are completely ignored
  • Non-English text written with English characters will be parsed incorrectly
  • Numbers are ignored
  • Non-sentence text data will cause errors (e.g. math formulas in Markdown text)
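
To make the warning concrete, here is a minimal sketch of the kind of naive counting heuristics described above. This is not readn's actual implementation, just an illustration of why syllable and sentence counts are approximate:

```go
package main

import (
	"fmt"
	"strings"
)

// countSyllables is a naive vowel-group heuristic: every maximal run of
// vowels counts as one syllable. Silent "e", diphthongs, and proper nouns
// all break it, which is where much of the error comes from.
func countSyllables(word string) int {
	const vowels = "aeiouy"
	count := 0
	prevWasVowel := false
	for _, r := range strings.ToLower(word) {
		isVowel := strings.ContainsRune(vowels, r)
		if isVowel && !prevWasVowel {
			count++
		}
		prevWasVowel = isVowel
	}
	if count == 0 {
		count = 1 // every word has at least one syllable
	}
	return count
}

// countSentences naively counts terminal punctuation, which is why poetry
// or Markdown without ".", "!" or "?" skews every metric built on it.
func countSentences(text string) int {
	count := strings.Count(text, ".") + strings.Count(text, "!") + strings.Count(text, "?")
	if count == 0 {
		count = 1
	}
	return count
}

func main() {
	fmt.Println(countSyllables("difficulty")) // 4 (correct)
	fmt.Println(countSyllables("cake"))       // 2 (wrong: the silent "e" gets counted)
	fmt.Println(countSentences("roses are red\nviolets are blue")) // 1 fallback, despite two lines of verse
}
```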

Installation

You can add the package to your project using:

go get github.com/Descent098/readn
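
If the install worked, a quick smoke test might look like the following (a sketch that simply combines the three functions documented in the usage sections below):

```go
package main

import (
	"fmt"

	"github.com/Descent098/readn"
)

func main() {
	text := `some reasonably long text here`

	// Flesch-Kincaid returns both an ease score and a grade level.
	fk := readn.FleschKincaid(text)
	fmt.Printf("Flesch-Kincaid: ease ~%.2f, level ~%.2f\n", fk.Ease, fk.Level)

	// ARI returns a single grade-level index.
	fmt.Printf("ARI: ~%.2f\n", readn.AutomatedReadabilityIndex(text))

	// SMOG needs at least 30 sentences, so it can return an error.
	if smog, err := readn.SimpleMeasureOfGobbledygook(text); err != nil {
		fmt.Println("SMOG: text too short (needs 30+ sentences)")
	} else {
		fmt.Printf("SMOG: ~%.2f\n", smog)
	}
}
```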

Flesch-Kincaid

This method is the most variable of the three (most affected by error rates), but it is typically the most widely used in literature. It tends to UNDERESTIMATE difficulty (lower level, higher ease): the level tends to come out 0.5-2 years lower than it should depending on formatting, and the ease tends to land 1-2 bands higher than it should (e.g. you may get 40, putting the text at the college band, when it should likely be college graduate or professional). The method returns two values, FleschKincaidResult.Ease and FleschKincaidResult.Level. Level is essentially the number of years of education needed to understand the text; Ease is the opposite, a score where the higher the ease, the easier the text is to read. Ease breaks down approximately as:

| Score | School Level (US) | Notes |
|-----------|--------------------|-------|
| 100.0-90.0 | 5th grade | Very easy to read. Easily understood by an average 11-year-old student. |
| 90.0-80.0 | 6th grade | Easy to read. Conversational English for consumers. |
| 80.0-70.0 | 7th grade | Fairly easy to read. |
| 70.0-60.0 | 8th & 9th grade | Plain English. Easily understood by 13- to 15-year-old students. |
| 60.0-50.0 | 10th to 12th grade | Fairly difficult to read. |
| 50.0-30.0 | College | Difficult to read. |
| 30.0-10.0 | College graduate | Very difficult to read. Best understood by university graduates. |
| 10.0-0.0 | Professional | Extremely difficult to read. Best understood by university graduates. |
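
If you want to turn an ease score back into one of the bands above programmatically, a small helper like this works (a sketch; the cut-offs simply mirror the table and are not part of the library):

```go
package main

import "fmt"

// easeBand maps a Flesch-Kincaid ease score onto the rough US
// school-level bands from the table above. The cut-offs are
// approximate by design.
func easeBand(ease float64) string {
	switch {
	case ease >= 90:
		return "5th grade (very easy)"
	case ease >= 80:
		return "6th grade (easy)"
	case ease >= 70:
		return "7th grade (fairly easy)"
	case ease >= 60:
		return "8th & 9th grade (plain English)"
	case ease >= 50:
		return "10th to 12th grade (fairly difficult)"
	case ease >= 30:
		return "college (difficult)"
	case ease >= 10:
		return "college graduate (very difficult)"
	default:
		return "professional (extremely difficult)"
	}
}

func main() {
	fmt.Println(easeBand(40)) // "college (difficult)"
}
```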

Usage

```go
package main

import (
	"fmt"

	"github.com/Descent098/readn"
)

func main() {
	text := `some text here`

	res := readn.FleschKincaid(text)

	fmt.Printf("Your Flesch-Kincaid ease is ~%.2f, your education index is ~%.2f years of education", res.Ease, res.Level)
}
```

Automated Readability Index (ARI)

This algorithm is a good alternative to Flesch-Kincaid for business use cases. The index works out to roughly index minus 1 years of education needed to read the text:

| Score | Age | Grade Level |
|-------|-------|-------------|
| 1 | 5-6 | Kindergarten |
| 2 | 6-7 | First Grade |
| 3 | 7-8 | Second Grade |
| 4 | 8-9 | Third Grade |
| 5 | 9-10 | Fourth Grade |
| 6 | 10-11 | Fifth Grade |
| 7 | 11-12 | Sixth Grade |
| 8 | 12-13 | Seventh Grade |
| 9 | 13-14 | Eighth Grade |
| 10 | 14-15 | Ninth Grade |
| 11 | 15-16 | Tenth Grade |
| 12 | 16-17 | Eleventh Grade |
| 13 | 17-18 | Twelfth Grade |
| 14+ | 18-22 | College/University |

ARI is the most accurate method in this package because it does not rely on syllables, opting instead for character counts, which are trivial to compute; its error rate therefore depends only on the accuracy of sentence counting. It was also designed specifically for technical documents and works best in that context, though it works for more general use cases as well.

Usage

```go
package main

import (
	"fmt"

	"github.com/Descent098/readn"
)

func main() {
	text := `some text here`

	index := readn.AutomatedReadabilityIndex(text)

	fmt.Printf("Your ARI is ~%.2f years of education", index)
}
```

Formula

$4.71(\frac{\textbf{characters}}{\textbf{words}})+0.5(\frac{\textbf{words}}{\textbf{sentences}})-21.43$
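
For example, with hypothetical counts of 600 characters, 120 words, and 10 sentences:

$4.71(\frac{600}{120})+0.5(\frac{120}{10})-21.43 = 23.55 + 6.00 - 21.43 \approx 8.12$

which rounds to a score of 8, roughly a seventh-grade level per the table above.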

Simple Measure Of Gobbledygook (SMOG)

SMOG tends to be best suited to the medical field and is the recommended choice for most paper publishers. It was designed primarily for medical writing, but it does work outside that context. Note, however, that it does not work on text with fewer than 30 sentences. The number returned is essentially the number of years of education you should have to be able to read the text.

Usage

```go
package main

import (
	"fmt"
	"log"

	"github.com/Descent098/readn"
)

func main() {
	text := `some text here`

	val, err := readn.SimpleMeasureOfGobbledygook(text)

	if err != nil {
		log.Fatal("Text was too short for SMOG analysis")
	}

	fmt.Printf("Your SMOG score is ~%.2f years of education", val)
}
```

Formula

$grade=1.0430 \sqrt{\textbf{number of polysyllabic words} \times \frac{30}{\textbf{number of sentences}}} +3.1291$

A polysyllabic word is defined as any word with 3 or more syllables.
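
As a worked example with hypothetical counts, 45 polysyllabic words spread across exactly 30 sentences gives:

$grade=1.0430\sqrt{45 \times \frac{30}{30}}+3.1291 \approx 1.0430 \times 6.71 + 3.1291 \approx 10.13$

i.e. roughly ten years of education.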

Flesch-Kincaid Formula

For reference, the ease and level values returned by the Flesch-Kincaid method above are computed as:

$ease=206.835-1.015(\frac{\textbf{total words}}{\textbf{total sentences}})-84.6(\frac{\textbf{total syllables}}{\textbf{total words}})$

$level=0.39(\frac{\textbf{total words}}{\textbf{total sentences}})+11.8(\frac{\textbf{total syllables}}{\textbf{total words}})-15.59$
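
As a worked example with hypothetical counts of 120 words, 10 sentences, and 180 syllables:

$ease=206.835-1.015(12)-84.6(1.5)\approx 67.76$

$level=0.39(12)+11.8(1.5)-15.59\approx 6.79$

which places the text around the middle-school range of the ease table above.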

References

Testing data is taken from https://www.gutenberg.org/, which requires the inclusion of the following notice in relation to each file in /testdata:

This eBook is for the use of anyone anywhere in the United States and most
other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online
at www.gutenberg.org. If you
are not located in the United States, you will have to check the laws
of the country where you are located before using this eBook.

Additionally, reference values to test against were adapted from the following sources:
