Wikilacra

JT Laune, PhD

[Header image: By Arbitrarily0 - Own work, CC BY-SA 3.0, Wikimedia Commons.]

What can we tell about what's going on out in reality based upon its reflection in the world's most up-to-date encyclopedia? I built an end-to-end machine learning project, Wikilacra, that listens to Wikipedia's edit stream and predicts which pages are being edited in response to contemporaneous events.

You can see the code on my GitHub here. I used DVC and Git for data and code versioning, MLflow for model training, experiment tracking, and deployment, and Docker for containerization. To track the server-sent events (SSE) edit stream, I set up a local SQL database and deployed the project to AWS. I'll detail the dataset and models below. An "event" refers to something that is happening currently, like a storm, a death, or a sporting event. A "non-event" is all other editing behavior, like adding details to an article about a historical person, reorganizing and copyediting an article, or even vandalism.
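
As a rough sketch of the listening piece: the daemon consumes Wikimedia's public EventStreams recentchange feed and appends each edit to a local database. The table schema, field selection, and file path below are illustrative placeholders, not the project's actual setup.

```python
import json
import sqlite3

import requests

STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"
DB_PATH = "edits.db"  # placeholder path

def listen() -> None:
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS edits (page TEXT, user TEXT, ts INTEGER, wiki TEXT)"
    )
    # EventStreams is a server-sent events feed; payload lines start with "data: ".
    with requests.get(STREAM_URL, stream=True, headers={"Accept": "text/event-stream"}) as resp:
        for line in resp.iter_lines(decode_unicode=True):
            if not line or not line.startswith("data: "):
                continue
            event = json.loads(line[len("data: "):])
            if event.get("type") != "edit":
                continue  # skip log actions, page creations, etc.
            conn.execute(
                "INSERT INTO edits VALUES (?, ?, ?, ?)",
                (event.get("title"), event.get("user"), event.get("timestamp"), event.get("wiki")),
            )
            conn.commit()

if __name__ == "__main__":
    listen()
```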

Training Data

Wikipedia releases dumps of its analytics history on a monthly basis. These were a great fit for my purposes, because the data was already organized into structured CSV files and I didn't have to make an enormous month-long call to Wikipedia's SSE endpoint. But they gave me a major headache later on!

I needed a reasonable number (500-1500) of candidates to label by hand, chosen so that the proportions of events and non-events would likely be relatively balanced. I used the August 2025 dump, binned the edit counts into discrete hourly timestamps, and selected pages with $\geq 15$ edits in an hour from $\geq 2$ users. This gave me about 1100 candidates with a fairly even split between events and non-events.
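
The selection step is essentially a groupby over hourly bins. A minimal sketch, assuming a dataframe of individual edits with page, user, and timestamp columns (the names are illustrative, not the dump's actual schema):

```python
import pandas as pd

def select_candidates(edits: pd.DataFrame, min_edits: int = 15, min_users: int = 2) -> pd.DataFrame:
    """Bin edits into (page, hour) buckets and keep the busy ones as candidate events."""
    edits = edits.assign(hour=edits["timestamp"].dt.floor("h"))
    hourly = (
        edits.groupby(["page", "hour"])
        .agg(n_edits=("user", "size"), n_users=("user", "nunique"))
        .reset_index()
    )
    return hourly[(hourly["n_edits"] >= min_edits) & (hourly["n_users"] >= min_users)]
```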

Modeling

The data are time series, so look-ahead bias was a concern. But I found that treating each page's candidate event as independent didn't degrade cross-validation (CV) or test set performance, so using only historical data from that page was enough to prevent look-ahead. This implies that the background editing activity did not play much of a role in prediction.
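
Concretely, "historical only" means that a candidate's features are built looking backward from its hour. A sketch with placeholder feature names: shift() only ever pulls from preceding rows of the same page, so nothing from the future can leak in.

```python
import pandas as pd

def add_history_features(hourly: pd.DataFrame, lags=(1, 2, 3)) -> pd.DataFrame:
    """Add per-page lag features that only look backward from each candidate hour."""
    hourly = hourly.sort_values(["page", "hour"])
    for lag in lags:
        # shift(lag) takes the value from `lag` hours earlier on the same page
        hourly[f"n_edits_lag{lag}"] = hourly.groupby("page")["n_edits"].shift(lag)
    return hourly
```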

This is a classification problem with quantitative and categorical features and only a modest number of labeled data points. I started with random forests (RF, scikit-learn), linear/non-linear support vector classifiers (SVCs, scikit-learn), and gradient-boosted trees (GBT, xgboost). Then I tried a neural net (NN, pytorch). For the methods that required normalization (SVCs and NNs), I used $\log$ normalization on counts and $\tanh$ normalization on slopes.
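
For the scale-sensitive models, the preprocessing looked roughly like the following sketch; the column names are placeholders, and the exact tanh scaling I settled on may differ.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.svm import SVC

count_cols = ["n_edits", "n_users"]        # placeholder column names
slope_cols = ["edit_slope", "user_slope"]  # placeholder column names

preprocess = ColumnTransformer([
    # log(1 + x) tames the heavy-tailed count features
    ("log_counts", FunctionTransformer(np.log1p), count_cols),
    # standardize, then squash slopes into (-1, 1) with tanh
    ("tanh_slopes", make_pipeline(StandardScaler(), FunctionTransformer(np.tanh)), slope_cols),
])

svc = Pipeline([("prep", preprocess), ("clf", SVC(kernel="rbf"))])
```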

I performed grid-search K-fold CV on each method and selected the best-performing model by accuracy for each algorithm. I didn't use time-series CV since I treat each candidate event as independent. I also tracked the F1 score, precision, recall, and false positive rate, but I selected on accuracy since the classes are balanced and I care equally about false negatives and false positives.
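
A minimal sketch of that selection step for the random forest, with accuracy as the refit metric and the other metrics tracked alongside it (the data and parameter grid here are stand-ins):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data; in the project this is the labeled candidate-event table.
X, y = make_classification(n_samples=1100, n_features=20, random_state=0)

param_grid = {"max_depth": [3, 6, 9, None], "n_estimators": [100, 300, 500]}  # illustrative grid

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,  # plain K-fold splits, since candidates are treated as independent
    scoring={"accuracy": "accuracy", "f1": "f1", "precision": "precision", "recall": "recall"},
    refit="accuracy",  # select by accuracy, but keep the other metrics in cv_results_
)
search.fit(X, y)

print(search.best_params_)
print(search.best_estimator_.feature_importances_)
```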

Most of the models ended up around 85% accuracy, so I went with the random forest for its simplicity, fast training time, and built-in feature importances. The best hyperparameters were deep trees (max_depth=9) with only 100 estimators. I was happy with this performance given the relatively small training set and the somewhat ambiguous class definitions.

Unsurprisingly, the most important feature was whether the page title contains the current year. The second most important feature was the 6-hour rolling mean of the user entropy, which is calculated per hour via

$$\text{(user entropy)}_h = -\sum_{i} p_i \log p_i$$

where $h$ denotes the current hour, $p_i = n_i/N$, $n_i$ is the number of edits made by user $i$ in hour $h$, and $N$ is the total number of edits the page received in hour $h$. In other words, $p_i$ is the proportion of that hour's edits made by user $i$. The user entropy is high whenever edits are spread evenly among the editors. For newsworthy events, it makes sense that entropy would be high, since many editors are making many edits simultaneously.
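
In code, the per-hour entropy and its trailing 6-hour mean look roughly like this (placeholder column names again, with the conventional minus sign so that evenly spread edits give high entropy). The sketch assumes each page's hourly rows are present and sorted; real data with gaps would need reindexing.

```python
import numpy as np
import pandas as pd

def hourly_user_entropy(edits: pd.DataFrame) -> pd.Series:
    """Shannon entropy of the per-user edit distribution in each (page, hour) bucket."""
    per_user = edits.groupby(["page", "hour", "user"]).size()

    def entropy(counts: pd.Series) -> float:
        p = counts / counts.sum()
        return float(-(p * np.log(p)).sum())

    return per_user.groupby(level=["page", "hour"]).apply(entropy)

def rolling_entropy(hourly_entropy: pd.Series, window: int = 6) -> pd.Series:
    """Trailing mean of the hourly entropy, computed separately for each page."""
    return hourly_entropy.groupby(level="page").transform(
        lambda s: s.rolling(window, min_periods=1).mean()
    )
```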

Deployment

I deployed to an AWS EC2 instance. I used Docker Compose to run the always-on listening daemon and the MLflow server, plus the classifier that runs once per hour, which I orchestrated with cron. The classifier service spits its classifications out into a JSON file.
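
The hourly job boils down to: pull recent edits from the database, build the same features used in training, score them with the model loaded from MLflow, and dump the results to JSON. A sketch in which the model URI, paths, and the build_features helper are all hypothetical:

```python
import json
import sqlite3
from datetime import datetime, timezone

import mlflow
import pandas as pd

MODEL_URI = "models:/wikilacra-rf/Production"  # hypothetical registered-model name
DB_PATH = "edits.db"                           # hypothetical path
OUT_PATH = "classifications.json"              # hypothetical path

def classify_latest_hour() -> None:
    model = mlflow.pyfunc.load_model(MODEL_URI)
    with sqlite3.connect(DB_PATH) as conn:
        edits = pd.read_sql("SELECT * FROM edits", conn)  # in practice, restrict to a recent window
    features = build_features(edits)  # hypothetical helper shared with the training pipeline
    preds = model.predict(features)
    out = {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "pages": [
            {"page": page, "is_event": bool(pred)}
            for page, pred in zip(features.index, preds)
        ],
    }
    with open(OUT_PATH, "w") as f:
        json.dump(out, f, indent=2)

# Run hourly with cron, e.g.:  0 * * * * python classify_latest_hour.py
if __name__ == "__main__":
    classify_latest_hour()
```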

I build my website with Astro and host the static site on GitHub Pages, which posed a problem for a dynamic web app like Wikilacra. To solve this, I had ChatGPT vibe-code a front end in JavaScript/HTML that reads data from the algorithm's JSON output, which I could then drop in as a page in the project. Then I just rebuild the website (which is very fast, thanks to Astro) and push to the Pages repository.

Issues

  1. Ambiguous labels
    "In response to a real event" is a fairly vague specification, and this made both labeling and prediction challenging. Even though I specified concrete rules about timeframes regarding event classification, there were some that did not fit neatly into the box. For example, take the pages for sports seasons. Is the months-long season itself an event? Or only the games? In this specific instance, I chose the season itself to count as the event, but in future projects, I'd stick to a more well-defined response variable.

  2. Training set
    The training data came from monthly Wikipedia data dumps (see Training Data above). The monthly dumps saved me a lot of time at the beginning of the project and gave me access to a relatively long interval of data. But the live SSE stream has a different structure, and as I started to build the live part of the project, I realized these early choices had built up a large technical debt. Using different data structures for training and prediction ended up being a headache and delayed the project; in the future, I'll build around the production data format from the very beginning of a project.

What's next?

During the ML development process, I experimented with adding NLP capabilities via a pre-trained LLM: I trained various configurations of BERT classifiers on the page titles and revision comments of the candidate events and stacked their outputs into the tree/SVC/NN classifiers. In my initial testing this did not improve model performance by much, so finding a way to combine NLP/LLM features effectively with the other models would be a natural next step.
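
For reference, the stacking idea was roughly to turn a fine-tuned BERT's output into one extra tabular feature. A sketch using Hugging Face transformers, with the checkpoint path hypothetical:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "path/to/finetuned-title-bert"  # hypothetical fine-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)
model.eval()

def title_event_probability(titles: list[str]) -> list[float]:
    """P(event) estimated from page titles alone, to stack alongside the tabular features."""
    batch = tokenizer(titles, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits
    return torch.softmax(logits, dim=-1)[:, 1].tolist()
```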