An open-source spellchecker for Central Kurdish Wikipedia, delivered as a MediaWiki gadget and powered by a Python/Flask service hosted on Toolforge.

KurdishWikipedia/bijar

Bijar CKB Spellchecker

Important

This project is under development. The dictionary is incomplete, features are subject to change, and occasional bugs are expected.

Bijar is a spellchecking tool for the Central Kurdish Wikipedia, delivered as an open-source Flask webservice and a MediaWiki gadget. It is designed to help editors improve article quality by identifying and correcting spelling errors.

The name "Bijar" (بژار) is a Kurdish word for "weeding," reflecting the tool's purpose of cleaning mistakes from text.

Screenshot of the Bijar spellchecker gadget in action

The Bijar gadget integrated with Wikipedia's 2010 wikitext editor, showing its options and a list of misspelled words with suggestions.

Screenshot by the project author. Licensed under CC BY-SA 4.0 via Wikimedia Commons.

Features

  • Spellcheck Engine: Identifies potential spelling errors in Central Kurdish text.
  • Kurdish Morphology: Recognizes complex verb tenses, conjugations, and affixes to improve accuracy.
  • Correction Suggestions: Provides a list of suggestions for each identified error.
  • Community Dictionary: Allows users to request new words to be added.
  • Wikipedia Gadget: Integrates directly into the ckb.wikipedia.org editing interface for eligible users.
  • Public Database: Data can be queried directly using Wikimedia's Quarry and Superset tools (database: s57137__bijar_p). See, for example, a query for all simple verbs with their stems and properties.
  • Public API: Offers endpoints for integration with other applications.

Usage on Wikipedia

This tool is used as a gadget on the Central Kurdish Wikipedia. To learn how to enable and use it, please read the official documentation on Wikipedia.

Note

The gadget is currently available only in the 2010 wikitext editor.

How It Works (Backend + Gadget)

For eligible users on ckb.wikipedia.org, the tool provides a complete, semi-automatic workflow.

  1. Activation: An eligible editor enables the Bijar gadget in their MediaWiki preferences.
  2. Analysis: The user clicks a button in the editor, which sends the article's wikitext to the Bijar backend service.
  3. Response: The backend analyzes the text, identifies potential errors, and returns a structured list of these errors, along with correction suggestions, to the user.
  4. Review and Correction: The gadget displays the results in a window below the editor. The user can then interact with this list to make corrections:
    • Clicking a misspelled word in the list automatically finds and selects it in the main editor.
    • A dropdown menu next to each word provides a list of correction suggestions to choose from.

Gadget Options

The official gadget has several features and behaviors:

  • Positional Awareness: The gadget identifies words by their start and end positions. If the text is edited manually, these positions can become incorrect. The gadget will show a notification prompting the user to refresh. If the live update option is enabled, it refreshes automatically.
  • User Settings: The gadget UI allows users to configure several options:
    • Live Update: If enabled, the spellcheck is triggered automatically after key presses or edits, keeping results constantly updated, but increasing API requests.
    • Safe Mode: Enabled by default, this mode skips text inside templates, link targets, file names, and categories to avoid breaking them. Trusted users can disable it, but should do so with caution.
    • Suggestion Controls: Users can set the maximum number of suggestions (1-10) and the Levenshtein distance (1-3) for finding matches.
    • Other Options: Includes toggles for handling bad words and grouping duplicate errors.
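
To illustrate what the Levenshtein distance and suggestion limit settings control, here is a minimal sketch of the standard dynamic-programming edit distance, plus a filter that keeps only candidates within a given distance. The function names and the candidate list are illustrative assumptions; how the Bijar backend actually generates or ranks suggestions is not described here.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def suggest(word, candidates, distance=2, limit=5):
    """Hypothetical helper: keep candidates within `distance` edits,
    closest first, capped at `limit` results."""
    scored = [(levenshtein(word, candidate), candidate) for candidate in candidates]
    close = sorted((d, c) for d, c in scored if 0 < d <= distance)
    return [c for _, c in close[:limit]]
```

With a larger distance more candidates qualify, which is why the setting is capped at a small range.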

Public API

The Bijar webservice provides public API endpoints which can be used in other projects or custom user scripts.

Get Suggestions

This endpoint returns a JSON object containing a list of suggestions for a given word.

URL: GET https://bijar.toolforge.org/api/get_suggestions

Parameters:

Parameter   Type      Description
word        string    Required. The word to check.
limit       integer   Optional. Max number of suggestions. Range: 1-10. Default: 5.
distance    integer   Optional. Levenshtein distance. Range: 1-3. Default: 2.

Example Request:

https://bijar.toolforge.org/api/get_suggestions?word=کورشی&limit=10&distance=2

Example Response:

{
  "word": "کورشی",
  "distance_used": 2,
  "limit_used": 10,
  "suggestions": [
    "کوردی",
    "کورسی",
    "کورتی",
    "کوشتی",
    "کوێری",
    "کرێشی",
    "کەوشی",
    "کورد",
    "کوردەشی",
    "کورتەشی"
  ]
}
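
As a rough sketch, the same request could be made from Python using only the standard library. The helper names and the User-Agent string are assumptions for illustration; the parameter ranges are the ones documented in the table above.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://bijar.toolforge.org/api/get_suggestions"


def build_suggestions_url(word, limit=5, distance=2):
    """Build the request URL, enforcing the documented parameter ranges."""
    if not word:
        raise ValueError("word is required")
    if not 1 <= limit <= 10:
        raise ValueError("limit must be between 1 and 10")
    if not 1 <= distance <= 3:
        raise ValueError("distance must be between 1 and 3")
    # urlencode percent-encodes the Kurdish word as UTF-8.
    query = urllib.parse.urlencode(
        {"word": word, "limit": limit, "distance": distance}
    )
    return f"{API_URL}?{query}"


def get_suggestions(word, limit=5, distance=2, timeout=30):
    """Call the endpoint and return the "suggestions" list from the JSON reply."""
    request = urllib.request.Request(
        build_suggestions_url(word, limit, distance),
        # Hypothetical User-Agent; replace it with one that identifies your client.
        headers={"User-Agent": "Bijar-API-Client-Example"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["suggestions"]
```

Calling get_suggestions("کورشی", limit=10, distance=2) would send the same request as the example above.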

Check Text Block

This endpoint analyzes a block of plain text and returns a JSON list of all found issues.

URL: POST https://bijar.toolforge.org/api/check_text_block

Request Body: (Content-Type: application/json)

Parameter   Type     Description
text        string   Required. The block of text to be analyzed.

Usage Examples

JavaScript (fetch)

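// Send plain text to the Bijar API and resolve with the list of issues.
async function checkText(text) {
  const url = 'https://bijar.toolforge.org/api/check_text_block';
  const response = await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text })
  });
  // Surface HTTP errors instead of parsing an error page as JSON.
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}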

checkText('چەم بێ چقەڵ نابێت.').then(console.log);

Python (requests)

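import requests

def check_text(text):
    """Send plain text to the Bijar API and return the list of issues."""
    url = 'https://bijar.toolforge.org/api/check_text_block'
    # A timeout keeps the call from hanging if the service is slow.
    response = requests.post(url, json={'text': text}, timeout=30)
    # Raise on HTTP errors instead of parsing an error page as JSON.
    response.raise_for_status()
    return response.json()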

print(check_text('چەم بێ چقەڵ نابێت.'))

PHP (file_get_contents)

function check_text($text) {
    $url = 'https://bijar.toolforge.org/api/check_text_block';
    $options = [
        'http' => [
            'method'  => 'POST',
            'header'  => "Content-Type: application/json\r\n" .
                         "User-Agent: Bijar-API-Client\r\n", // A User-Agent is required by Toolforge.
            'content' => json_encode(['text' => $text]),
        ],
    ];
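    $context  = stream_context_create($options);
    $response = file_get_contents($url, false, $context);
    // file_get_contents returns false on failure; do not try to decode it.
    if ($response === false) {
        return null;
    }
    return json_decode($response, true);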
}

print_r(check_text('چەم بێ چقەڵ نابێت.'));

Example Response:

[
    {
        "word": "چقەڵ",
        "type": "misspelled",
        "start": 7,
        "end": 11
    }
]
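
A response like this can be applied back to the text through its offsets. The sketch below assumes, based on this example, that start is inclusive, end is exclusive, and offsets count characters the way Python string indices do. The apply_corrections helper and its replacements mapping are hypothetical and not part of the API. Replacements are applied from the end of the text backwards so that earlier offsets stay valid.

```python
def apply_corrections(text, issues, replacements):
    """Return text with each flagged word replaced by its chosen correction.

    issues: list of dicts shaped like the API response (word, type, start, end).
    replacements: hypothetical mapping from a flagged word to its correction.
    """
    # Work right to left: replacing a later span never shifts an earlier one.
    for issue in sorted(issues, key=lambda item: item["start"], reverse=True):
        start, end = issue["start"], issue["end"]
        # Skip stale offsets, e.g. if the text changed after it was checked.
        if text[start:end] != issue["word"]:
            continue
        correction = replacements.get(issue["word"])
        if correction is not None:
            text = text[:start] + correction + text[end:]
    return text


text = "چەم بێ چقەڵ نابێت."
issues = [{"word": "چقەڵ", "type": "misspelled", "start": 7, "end": 11}]
# "XXXX" is a placeholder, not a real suggestion from the service.
print(apply_corrections(text, issues, {"چقەڵ": "XXXX"}))
```

The stale-offset check mirrors the reason the gadget asks for a refresh after manual edits: once the text changes, the stored positions may no longer point at the flagged word.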

Notes:

  • Each object in the response array represents a single found issue.
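  • start and end are the character offsets of the word in the original text. In the example above, offsets 7 and 11 select the four characters of چقەڵ, so start is inclusive and end is exclusive.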
  • The type field indicates the nature of the issue (e.g., misspelled, bad).
  • Wikitext Handling: The API analyzes plain text. For best results when checking wiki articles, it is recommended to first mask syntax (templates, links, etc.) on the client-side before sending the text. The official ckbwiki gadget is a robust reference for this.
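
One possible way to mask wikitext before sending it is to overwrite markup with spaces of the same length, so the offsets returned by the API still line up with the original text. The sketch below is a simplified assumption and is not taken from the official gadget: it only covers templates, namespaced links such as categories and files, and the target part of piped links. Real wikitext (tables, HTML tags, references) needs more care.

```python
import re

# Simplified patterns for illustration only.
MASK_PATTERNS = [
    re.compile(r"\{\{[^{}]*\}\}"),              # innermost templates: {{...}}
    re.compile(r"\[\[[^\[\]|]*:[^\[\]]*\]\]"),  # namespaced links (categories, files)
    re.compile(r"\[\[[^\[\]|]*\|"),             # the target part of [[target|label]]
]


def _blank(match):
    """Return a run of spaces as long as the matched markup."""
    return " " * len(match.group())


def mask_wikitext(text):
    """Replace markup spans with spaces, keeping the text length unchanged."""
    previous = None
    # Repeat until stable so nested templates are masked from the inside out.
    while previous != text:
        previous = text
        for pattern in MASK_PATTERNS:
            text = pattern.sub(_blank, text)
    return text
```

Because the masked string has the same length as the input, a start or end offset returned for the masked text points at the same position in the original wikitext.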

Setup

This repository contains the source code for the webservice (backend). Follow the instructions below to set it up for local development or for production on Toolforge.


Local Development

Prerequisites

Before you begin, ensure you have the following software installed on your local machine:

  • Git: A version control system. Download Git
  • Python: Version 3.10 or newer. Download Python
  • MySQL/MariaDB: A local database server (e.g., XAMPP, WAMP, MAMP, or a direct installation).

Instructions

1. Clone the Repository

git clone https://github.com/KurdishWikipedia/bijar.git

2. Create Virtual Environment & Install Dependencies

Open a terminal in the repository's root directory (cd bijar) and create a virtual environment inside the www/python/ directory.

python -m venv www/python/venv

Next, activate the environment.

  • Windows (Command Prompt): .\www\python\venv\Scripts\activate.bat
  • Windows (PowerShell): .\www\python\venv\Scripts\Activate.ps1
  • macOS & Linux (bash/zsh): source www/python/venv/bin/activate

Now, install the required packages:

pip install -r requirements.txt

Finally, deactivate the environment:

deactivate

3. Set Up the Database

  • Start your local MySQL/MariaDB server.
  • Create a new database (e.g., local_database). You can do this through your database program's GUI or with a command-line client:
    CREATE DATABASE local_database CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
  • Database File: The database dump (.sql file) is required. To obtain it, please contact the project maintainer on their Wikipedia user talk page. Once you have the file, import it into the database you just created.

4. Configure Environment Variables

The application requires a local .env file for settings and secrets.

Navigate to the application's source directory:

cd www/python/src

Create the file by copying the sample for your operating system:

# On Windows
copy .env.sample .env

# On macOS & Linux
cp .env.sample .env

Open the new .env file in a text editor and follow the instructions inside to add your local configuration.

5. Generate Word Statistics

Note: This script pre-caches word statistics in a JSON file for the home page, preventing a startup timeout on Toolforge and ensuring the application runs efficiently in both local and production environments.

Activate the virtual environment:

  • Windows (Command Prompt): .\www\python\venv\Scripts\activate.bat
  • Windows (PowerShell): .\www\python\venv\Scripts\Activate.ps1
  • macOS & Linux (bash/zsh): source www/python/venv/bin/activate

With your local database server running, execute the script:

python run.py generate_stats.py

After it completes, deactivate the environment:

deactivate
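
The pre-caching idea behind this step can be sketched as follows: compute the statistics once, write them to a JSON file, and let the web application read that file instead of querying the database at startup. This is not the project's generate_stats.py; the function names and the statistic keys used in the usage note are assumptions made for illustration.

```python
import json
import os
import tempfile


def write_stats(stats, path):
    """Write stats to path as JSON, replacing any old file atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file first so readers never see a half-written cache.
    with tempfile.NamedTemporaryFile(
        "w", encoding="utf-8", dir=directory, delete=False, suffix=".tmp"
    ) as handle:
        json.dump(stats, handle, ensure_ascii=False, indent=2)
        temp_path = handle.name
    os.replace(temp_path, path)


def read_stats(path, default=None):
    """Return the cached stats, or default if the cache does not exist yet."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return default
```

For example, a generator script might call write_stats({"total_words": 3}, "word_stats.json") after querying the database, and the home page view would call read_stats("word_stats.json", default={}).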

6. Get the Gadget Source Code

The gadget's JS/CSS source code is hosted on the Central Kurdish Wikipedia (see Gadget Source Code below).

To test changes locally, you can use a browser extension like Tampermonkey to inject your local JS/CSS files into live Wikipedia pages.

Important Browser Security Note: When developing locally, the gadget runs on https://ckb.wikipedia.org while your Flask server runs on http://127.0.0.1. Browsers may block these cross-origin requests by default (CORS policy). If flask-cors alone does not resolve this, you may need to temporarily relax your browser's web security settings. Do this for local development only, and restore the defaults afterwards.

7. Run the Application

  1. Start your database: Ensure your local database server is running.

  2. Run the development server: From the project's root directory (bijar/), execute the appropriate script for your operating system. This will automatically manage the virtual environment and start the Flask server.

    • On Windows (cmd/PowerShell):
      .\run.bat
    • On macOS, Linux, or Git Bash:
      ./run.sh

The application will now be running at http://127.0.0.1:5000 and http://localhost:5000.

Toolforge Production

(Replace <username>, <tool_name>, and <database_name> with your credentials.)

1. Connect to Toolforge

ssh <username>@login.toolforge.org
become <tool_name>

2. Clone the Repository

git clone https://github.com/KurdishWikipedia/bijar.git .

3. Create Virtual Environment & Install Dependencies

See the official documentation on Python Virtual Environments and Packages on Toolforge for more details.

toolforge webservice python3.13 shell
mkdir -p $HOME/www/python
python3 -m venv $HOME/www/python/venv
source $HOME/www/python/venv/bin/activate
pip install --upgrade pip wheel
pip install -r $HOME/requirements.txt
exit

4. Set Up the Database on Toolforge

See the Toolforge ToolsDB documentation for more information.

# Connect to MariaDB
sql tools
# Create the database
CREATE DATABASE <database_name>;
# Verify creation
SHOW DATABASES;
# Exit the MariaDB prompt
exit

Upload your local_database.sql file from your computer to your tool's home directory. In a new local terminal:

scp local_database.sql <username>@login.toolforge.org:/data/project/<tool_name>/

From your Toolforge shell, verify the upload and import the data:

# Verify the file exists
ls -l *.sql
# Import the SQL file into your tool's database
mysql --defaults-file=$HOME/replica.my.cnf -h tools.db.svc.wikimedia.cloud <database_name> < /data/project/<tool_name>/local_database.sql
# No output indicates success.

Check that the tables were imported successfully:

sql tools
USE <database_name>;
SHOW TABLES;
# You should see all tables now.
exit

(Optional but recommended) Remove the SQL file after import. Toolforge advises against storing backups permanently on the platform.

cd ~
rm local_database.sql

5. Configure Environment Variables

Create and edit the .env file using the provided sample. The file contains all necessary instructions.

cd www/python/src
cp .env.sample .env
# Edit .env using the instructions inside the file.

Finally, secure the file:

chmod 600 .env

6. Generate Word Statistics

Note: To prevent a startup timeout on Toolforge, this script pre-caches word statistics in a JSON file for the home page.

Activate the virtual environment:

source $HOME/www/python/venv/bin/activate

Run the script manually to generate the statistics. This process can be slow on Toolforge, depending on the size of the database.

python run.py generate_stats.py

After the script finishes, deactivate the environment:

deactivate

TIP: Since this script's execution is slow on Toolforge, it is more efficient to automate it with a scheduled cron job rather than running it manually.

7. Start the Webservice

toolforge webservice python3.13 start

Backing Up the Database from Toolforge

See the official documentation about backups for details.

Note: Toolforge does not recommend storing backups on the platform permanently.

1. Export: Run this command to create a private SQL dump. The file will be saved in your tool's home ($HOME) directory.

# use umask to make the dump private (use unless the database is public)
toolforge jobs run --command 'umask o-r; ( mariadb-dump --defaults-file=$TOOL_DATA_DIR/replica.my.cnf --host=tools-readonly.db.svc.wikimedia.cloud <database_name> > $TOOL_DATA_DIR/<database_name>-$(date -I).sql )' --image mariadb backup --wait

Verify the file was created using ls -l *.sql from the $HOME directory.

2. Download: From your local PC's terminal, use scp to download the file.

scp <username>@login.toolforge.org:/data/project/<tool_name>/<database_name>-YYYY-MM-DD.sql .

Gadget Source Code

The source code for the user interface (the gadget) is hosted directly on the Central Kurdish Wikipedia.

Contributing

The best place to report bugs, request features, or discuss ideas is the project's talk page on ckb.wikipedia.org.

Alternatively, you can open an issue or submit a pull request on GitHub.

License

This project is licensed under the MIT License. See the LICENSE file for details.
