A TikTok-sourced Shan/Tai proverbs corpus with automated NLP processing via Telegram bot, Google Sheets, n8n Docker, and ShanNLP.
This project automates the entire workflow of collecting Shan/Tai proverbs from TikTok creators through a Telegram bot, storing them in Google Sheets, processing with NLP tools, and pushing cleaned data back to the repository. It enables corpus building, language preservation, and downstream NLP analysis for low-resource Tai languages.
```
TikTok Creators
      ↓
Telegram Bot (Collection)
      ↓
Google Sheets (Raw Data Storage)
      ↓
n8n Docker Workflow
  ├── Download from Google Sheets
  ├── Process with ShanNLP Library
  │     ├── Tokenization (word_tokenize)
  │     ├── Text normalization
  │     ├── Digit/date conversion
  │     └── Unicode standardization
  ├── Enrich & Clean Proverbs
  └── Git Push to Repository
      ↓
shan-proverbs Repository (Final Corpus)
      ↓
NLP Analysis & Research
```
```
.
├── README.md                   # This file
├── proverbs.json               # Main proverbs corpus (processed by n8n)
├── shan_proverbs_extra.json    # Extended corpus with metadata
├── docker-compose.yml          # Docker setup for n8n
├── n8n-workflows/              # n8n workflow JSON files
│   └── telegram-to-repo.json   # Main workflow: Google Sheets → ShanNLP → Git Push
└── data/
    └── exports/
        ├── shan_proverbs.csv
        └── shan_proverbs_clean.json
```
The Telegram bot collects Shan proverbs shared by TikTok creators and stores the raw data in Google Sheets with these fields:
- Proverb text (original Shan)
- Creator name/ID
- TikTok link
- Timestamp
- Optional notes
Data is stored in a Google Sheet with structured columns for easy import into n8n.
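As an illustrative sketch, one collected entry could be assembled into a row matching those fields before being appended to the sheet (`build_sheet_row` and the field order are assumptions for illustration, not the bot's actual code):

```python
from datetime import datetime, timezone

def build_sheet_row(proverb, creator, link, notes=""):
    """Assemble one spreadsheet row: proverb | creator | link | timestamp | notes."""
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return [proverb.strip(), creator.strip(), link.strip(), timestamp, notes]

row = build_sheet_row("  example proverb text ", "creator_one",
                      "https://www.tiktok.com/@creator_one/video/123")
print(row[0])  # example proverb text
```

The row list can then be handed to whatever Sheets client the bot uses (e.g. an append-row call).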
Before n8n processes your data, ensure quality with these best practices in your Google Sheet:
- Use data validation to accept only Shan Unicode characters
- Go to Data → Data validation and create a custom formula:
  `=REGEXMATCH(A2, "[\x{1000}-\x{109F}]+")` to check for Shan characters
  (Google Sheets regexes use RE2, which writes Unicode escapes as `\x{…}`)
- Set the error alert to: "Please enter valid Shan text"
- Check for mixed scripts (avoid English/Burmese in the Shan proverb column)
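The same character-range check can be mirrored in Python, e.g. for offline spot checks (a sketch; note that U+1000–U+109F, the range used in the sheet, does not cover the Myanmar Extended blocks that some Shan text may use):

```python
import re

# Myanmar Unicode block, matching the sheet's validation range
SHAN_CHARS = re.compile(r"[\u1000-\u109F]")

def looks_like_shan(text: str) -> bool:
    """True if text contains at least one character in U+1000-U+109F."""
    return bool(SHAN_CHARS.search(text))

print(looks_like_shan("ၶေႃႈၵႂၢမ်း"))  # True
print(looks_like_shan("hello"))        # False
```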
- Mark these columns as mandatory (no empty cells):
- Proverb text (original Shan)
- Creator name/TikTok username
- TikTok link (must start with `https://www.tiktok.com/`)
- Timestamp
- Use conditional formatting to highlight empty required cells in red
- Validate TikTok links with a data validation formula:
  `=AND(NOT(ISBLANK(C2)), ISNUMBER(SEARCH("tiktok.com", C2)))`
- Ensure URLs are clickable and formatted consistently
- Remove any shortened URLs; use full TikTok video links
- Regularly run Data → Data cleanup → Remove duplicates
- Check for:
- Identical proverb text from same creator
- Duplicate rows with minor spacing differences
- Same content with different timestamps
- Use a helper column with the formula: `=TRIM(CLEAN(A2))`
- This removes:
- Leading/trailing spaces
- Extra spaces between words
- Non-printing characters
- Copy cleaned values back to original column using paste special (values only)
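A rough Python equivalent of `=TRIM(CLEAN(...))`, useful for batch cleanup outside Sheets (a sketch):

```python
import re

def clean_cell(text: str) -> str:
    """Rough Python mirror of =TRIM(CLEAN(...))."""
    text = re.sub(r"[\x00-\x1f\x7f]", "", text)   # CLEAN: drop non-printing chars
    return re.sub(r" {2,}", " ", text).strip()    # TRIM: collapse and strip spaces

print(clean_cell("  two   words\t\n "))  # two words
```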
- Timestamps: Use the standard ISO 8601 format `YYYY-MM-DDTHH:MM:SSZ`
  - Google Sheets formula: `=TEXT(NOW(), "yyyy-mm-dd""T""HH:mm:ss""Z""")`
- Creator names: Use Title Case (e.g., "Creator Name" not "CREATOR NAME")
- TikTok links: Remove query parameters such as `?is_copy_url=1&is_from_webapp=v1`
Create a helper column to rate data quality (1-5 scale):

`=COUNTIF(B2:F2,"<>")/COLUMNS(B2:F2)*5`

Filter to show only scores ≥ 4 before n8n processing.
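The score can be reproduced in Python for spot-checking (the field names below are assumptions for illustration):

```python
def quality_score(row: dict) -> float:
    """Score 0-5: fraction of required fields that are non-empty, times 5."""
    required = ("proverb", "creator", "link", "timestamp", "notes")
    filled = sum(1 for field in required if str(row.get(field, "")).strip())
    return filled / len(required) * 5

entry = {"proverb": "...", "creator": "x", "link": "...", "timestamp": "...", "notes": ""}
print(quality_score(entry))  # 4.0
```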
- Proverb should be 10-500 characters (avoid single word entries)
- Use the formula: `=AND(LEN(A2)>=10, LEN(A2)<=500)`
- Creator name: 2-50 characters
- Use conditional formatting to highlight outliers
Before n8n syncs:
- Sort by `created_at` to identify recent entries
- Read through for context and accuracy
- Mark suspicious entries with a "Review" flag column
- Fix or delete flagged entries before processing
- Keep a backup sheet of "raw unreviewed data"
- Freeze header row: View → Freeze → 1 row
- Alternate row colors: Select data → Format → Alternating colors for readability
- Data validation dropdown: For creator names, use a list of approved creators
- Comments: Add notes next to questionable entries for context
- Version history: Enable to track changes over time
Create a "Cleaned Data" sheet with these formulas:
```
# Column A - Cleaned Proverb Text
=IF(AND(LEN(TRIM(A_raw))>=10, REGEXMATCH(TRIM(A_raw), "[\x{1000}-\x{109F}]+")), TRIM(A_raw), "")

# Column B - Creator (Title Case)
=IF(LEN(B_raw)>0, PROPER(TRIM(B_raw)), "")

# Column C - TikTok URL (strip query parameters)
=IF(AND(NOT(ISBLANK(C_raw)), ISNUMBER(SEARCH("tiktok.com", C_raw))),
    REGEXREPLACE(C_raw, "\?.*", ""), "")

# Column D - Timestamp (ISO format)
=IF(ISDATE(D_raw), TEXT(D_raw, "yyyy-mm-dd""T""HH:mm:ss""Z"""), "")
```
Then copy cleaned data back as values before n8n imports.
Before importing to n8n, add this validation in a "Status" column:
```
=IF(COUNTIF(A2:D2,"<>")<>4, "MISSING_FIELDS",
 IF(LEN(A2)<10, "PROVERB_TOO_SHORT",
  IF(ISERROR(SEARCH("tiktok.com", C2)), "INVALID_URL",
   "READY_TO_PROCESS")))
```
Filter to show only "READY_TO_PROCESS" rows before n8n export.
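The same three checks can also be mirrored outside Sheets, e.g. in an n8n Code node, with a small Python function (a sketch; parameter names are illustrative):

```python
def row_status(proverb: str, creator: str, url: str, timestamp: str) -> str:
    """Python mirror of the Status-column formula: same checks, same labels."""
    if not all(v.strip() for v in (proverb, creator, url, timestamp)):
        return "MISSING_FIELDS"
    if len(proverb) < 10:
        return "PROVERB_TOO_SHORT"
    if "tiktok.com" not in url:
        return "INVALID_URL"
    return "READY_TO_PROCESS"

print(row_status("a long enough proverb", "creator_one",
                 "https://www.tiktok.com/@c/video/1", "2025-01-04T15:30:00Z"))
# READY_TO_PROCESS
```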
- Daily: Review newly added entries in Google Sheets
- Weekly: Run duplicate check and whitespace cleanup
- Before n8n sync: Apply all validation formulas and filter for "READY_TO_PROCESS"
- After n8n processing: Compare results with original to catch processing errors
- Monthly: Audit data quality and update validation rules as needed
The n8n Docker container runs a scheduled workflow that:
- Authenticates with Google Sheets API
- Downloads all new proverbs from the spreadsheet
- Processes using ShanNLP library:
- Text normalization and Unicode standardization
- Word tokenization using maximal matching or newmm algorithm
- Digit/date/keyboard conversion utilities
- Cleans and deduplicates data
- Enriches with metadata
- Exports to JSON/CSV formats
- Commits & Pushes to GitHub repository
Processed proverbs are stored in this repository for version control and collaboration.
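For instance, the deduplication step could be sketched in plain Python, keyed on whitespace-normalized proverb text (a simplification for illustration, not the workflow's actual code):

```python
def dedupe_proverbs(records: list[dict]) -> list[dict]:
    """Keep only the first record for each whitespace-normalized proverb text."""
    seen, unique = set(), []
    for record in records:
        key = " ".join(record["proverb_shan"].split())  # collapse whitespace
        if key and key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

records = [
    {"id": "001", "proverb_shan": "same  text"},
    {"id": "002", "proverb_shan": "same text"},  # duplicate after normalization
    {"id": "003", "proverb_shan": "other text"},
]
print([r["id"] for r in dedupe_proverbs(records)])  # ['001', '003']
```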
```json
{
  "id": "unique-id-001",
  "proverb_shan": "",
  "definition": "",
  "english_translation": "Meaning in English",
  "tokens": ["", "", ""],
  "source_creator": "TikTok username",
  "source_url": "https://www.tiktok.com/@creator/video/...",
  "collected_at": "2025-01-04T15:30:00Z",
  "processed_at": "2025-01-04T16:45:00Z",
  "tags": ["wisdom", "culture", "tradition"]
}
```

- Docker and Docker Compose
- n8n image
- Google Sheets API credentials
- GitHub personal access token (for git push)
- ShanNLP library installed in n8n environment
```bash
docker volume create n8n_data
```

Create a `docker-compose.yml`:
```yaml
version: '3.8'

services:
  n8n:
    image: docker.n8n.io/n8nio/n8n
    container_name: shan-proverbs-n8n
    ports:
      - "5678:5678"
    volumes:
      - n8n_data:/home/node/.n8n
      - ./data:/home/node/data
    environment:
      - N8N_EDITOR_BASE_URL=http://localhost:5678
      - WEBHOOK_TUNNEL_URL=http://localhost:5678/
    networks:
      - n8n-network
    restart: unless-stopped

volumes:
  n8n_data:
    external: true  # reuse the volume created with `docker volume create n8n_data`

networks:
  n8n-network:
    driver: bridge
```

Start the service:

```bash
docker-compose up -d
```

Open http://localhost:5678 in your browser.
Google Sheets API
- In n8n, create a new credential: Google Sheets OAuth2
- Authenticate with your Google account
- Grant permission to access Google Sheets
GitHub API
- Create a GitHub Personal Access Token with `repo` and `workflow` scopes
- Add the token to n8n GitHub credentials
- Import `n8n-workflows/telegram-to-repo.json`
- Configure node parameters:
- Google Sheet ID
- Column mappings
- Repository URL and branch
- ShanNLP processing options
- Set trigger to run on schedule (e.g., daily at 2 AM)
- Or trigger manually from n8n UI
The n8n workflow uses the ShanNLP library for text processing:
```python
from shannlp import word_tokenize

text = ""

# Method 1: Maximal Matching (fast)
tokens = word_tokenize(text, engine='mm')

# Method 2: newmm (PyThaiNLP-based)
tokens = word_tokenize(text, engine='newmm')
```

Other ShanNLP utilities:

- `digit_to_text()` - Convert digits to Shan words
- `num_to_shanword()` - Convert numbers to Shan text
- `shanword_to_date()` - Parse Shan date formats
- `convert_years()` - Convert between calendar systems (AD, BE, GA, MO)
- `eng_to_shn()` - Keyboard conversion (English to Shan)
The main workflow includes these node types:
- Trigger Node: Schedule or manual webhook
- Google Sheets Node: Read data from spreadsheet
- Function Node: JavaScript for data transformation
- HTTP Request Node: Call ShanNLP API (if exposed)
- Code Node: Python execution for ShanNLP processing
- File Write Node: Save processed data
- Git Node: Clone, commit, and push to repository
- Webhook Node: Optional notifications
To use ShanNLP in n8n, you have two options:
Add ShanNLP to the n8n environment by extending the Docker image:
```dockerfile
FROM docker.n8n.io/n8nio/n8n:latest

USER root
# The official n8n image is Alpine-based, so install Python with apk, not apt-get
RUN apk add --no-cache python3 py3-pip
RUN pip3 install shannlp
USER node
```

Build and run:

```bash
docker build -t n8n-shannlp .
docker run -it --rm -p 5678:5678 n8n-shannlp
```

Alternatively, run ShanNLP as a microservice in another container and call it via HTTP from n8n.
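A minimal sketch of such a microservice using only the Python standard library (the `/tokenize` endpoint and the whitespace-split fallback are illustrative assumptions; the real container would call `shannlp.word_tokenize` instead):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def tokenize(text: str) -> list[str]:
    # Placeholder: whitespace split; the real container would call shannlp.word_tokenize
    return text.split()

class TokenizeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Expect a JSON body like {"text": "..."}; reply with {"tokens": [...]}
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps({"tokens": tokenize(payload.get("text", ""))}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # silence per-request logging

# Port 0 picks a free port; in the container, bind a fixed port and call serve_forever()
server = HTTPServer(("127.0.0.1", 0), TokenizeHandler)
```

n8n's HTTP Request node would then POST proverb text to this endpoint and receive the token list back.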
Create a `.env` file for sensitive data:

```bash
# Google Sheets
GOOGLE_SHEET_ID=your-sheet-id-here
GOOGLE_API_KEY=your-api-key

# GitHub
GITHUB_TOKEN=your-personal-access-token
GITHUB_REPO_URL=https://github.com/Alexpeain/shan-proverbs.git
GIT_USER_NAME=Alexpeain
GIT_USER_EMAIL=your-email@example.com

# n8n
N8N_EDITOR_BASE_URL=http://localhost:5678
```

- ShanNLP - NLP tools for Shan language processing
- Repository: https://github.com/Alexpeain/ShanNLP
- Features: Tokenization, digit conversion, date conversion, keyboard conversion
- Inspired by PyThaiNLP
- Telegram Bot - Collects proverbs from TikTok creators
- Stores data in Google Sheets
- Validates Shan text input
- Open the n8n UI at http://localhost:5678
- Edit the imported workflow
- Test individual nodes
- Export the updated workflow as JSON
- Save it to the `n8n-workflows/` directory
- Use ShanNLP utilities for additional text processing
- Add filtering/validation nodes
- Integrate with other APIs or services
- Export to different formats (Parquet, SQLite, etc.)
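As one sketch of the SQLite export option using only the standard library (the `proverbs` table name and column subset are assumptions based on the JSON record format shown earlier):

```python
import sqlite3

def export_to_sqlite(records: list[dict], db_path: str) -> None:
    """Write corpus records into a SQLite table for SQL-based analysis."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS proverbs (
        id TEXT PRIMARY KEY, proverb_shan TEXT, english_translation TEXT,
        source_creator TEXT, source_url TEXT, collected_at TEXT)""")
    conn.executemany(
        "INSERT OR REPLACE INTO proverbs VALUES (?, ?, ?, ?, ?, ?)",
        [(r.get("id"), r.get("proverb_shan"), r.get("english_translation"),
          r.get("source_creator"), r.get("source_url"), r.get("collected_at"))
         for r in records])
    conn.commit()
    conn.close()

# Demo with an in-memory database; pass a real path to keep the file
export_to_sqlite([{"id": "001", "proverb_shan": "..."}], ":memory:")
```

In practice you would `json.load` the exports file and write to a path such as `data/exports/shan_proverbs.db` (an assumed location).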
```python
import json
from shannlp import word_tokenize

# Load processed data
with open('data/exports/shan_proverbs_clean.json') as f:
    proverbs = json.load(f)

# Test tokenization on the first five entries
for proverb in proverbs[:5]:
    tokens = word_tokenize(proverb['proverb_shan'])
    print(f"{proverb['id']}: {tokens}")
```
```python
import json
from collections import Counter
from shannlp import word_tokenize

with open('data/exports/shan_proverbs_clean.json') as f:
    proverbs = json.load(f)

# Collect all tokens
all_tokens = []
for proverb in proverbs:
    tokens = word_tokenize(proverb['proverb_shan'])
    all_tokens.extend(tokens)

# Get word frequencies
freq = Counter(all_tokens)
print("Top 20 most common words:")
for word, count in freq.most_common(20):
    print(f"{word}: {count}")
```

Contributions are welcome! Please:
- Fork the repository
- Create a feature branch: `git checkout -b feature/improve-workflow`
- Make changes (to JSON files, workflows, or documentation)
- Commit: `git commit -m 'feat: improve n8n workflow for ShanNLP integration'`
- Push: `git push origin feature/improve-workflow`
- Open a Pull Request
Ideas for contributions:

- Improving tokenization accuracy
- Adding more ShanNLP utilities to the workflow
- Expanding the proverbs corpus
- Creating analysis notebooks
- Improving documentation
This project is for language preservation and educational purposes.
- Code: MIT License
- Data: Respect TikTok creator rights and terms of service
- Processed Data: Available for research and non-commercial use
- n8n Documentation
- Docker Documentation
- ShanNLP Repository
- Google Sheets API
- PyThaiNLP - Parent project inspiration
- Shan Language
Alexpeain - Self-taught developer focused on NLP for Shan/Tai languages and language preservation.
- GitHub: @Alexpeain
- Projects: Myanmar Book Reviews, ShanNLP, DSA NagBot, Accountability Board
- Location: Shan State, Myanmar
Last updated: January 2026
For issues, questions, or suggestions:
- Open an issue on GitHub
- Check existing documentation
- Review n8n workflow logs for debugging
- Consult ShanNLP project for NLP-specific questions