Code review comments on GitHub contain a wealth of signal about what engineers care about — naming, logic, performance, test coverage — but that signal is buried in unstructured free text. This post builds an SVM classifier to categorize over 30,000 GitHub pull request review comments by the main technical topic each addresses. The dataset, feature engineering approach, and model evaluation are walked through in a Jupyter notebook available on GitHub. The results reveal which topics dominate code review discussions and how that distribution shifts across different types of repositories.
As part of the code review process on GitHub, developers can leave comments on portions of the unified diff of a GitHub pull request. These comments are extremely valuable in facilitating technical discussion amongst developers, and in allowing developers to get feedback on their code submissions.
But what do code reviewers usually discuss in these comments?
In an effort to better understand code review discussions, we’re going to create an SVM classifier to classify over 30,000 GitHub review comments based on the main code-related topic each addresses (e.g. naming, readability, etc.).
Grab the Jupyter Notebook for this experiment on GitHub.

The list of classifications we’re going to incorporate into our classifier is summarized in the table below. It was developed from a manual survey I performed of approximately 2,000 GitHub review comments drawn from randomly selected, highly forked Java repositories on GitHub.
The selected categories reflect the most frequently occurring topics in the surveyed review comments. The majority of the categories relate to code-level concepts (e.g. variable naming, exception handling); review comments that did not naturally fall into any existing category and were unrelated to the overall goal of code reviewing were placed in the “other” category. When a review comment discussed more than one subject, I classified it according to the topic it spent the most words discussing.
| Category | Label | Further Explanation | Sample Comment |
| --- | --- | --- | --- |
| Readability | 1 | Comments related to readability, style, general project conventions. | “This code looks very convoluted to me” |
| Naming | 2 | | “I think foo would be a more appropriate name” |
| Documentation | 3 | Comments related to licenses, package info, module documentation, commenting. | “Please add a comment here explaining this logic” |
| Error/Resource Handling | 4 | Comments related to exception/resource handling, program failure, termination analysis. | “Forgot to catch a possible exception here” |
| Control Structures/Program Flow | 5 | Comments related to usage of loops, if-statements, placement of individual lines of code. | “This if-statement should be moved after the while loop” |
| Visibility/Access | 6 | Comments related to access level for classes, fields, methods and local variables. | “Make this final” |
| Efficiency/Optimization | 7 | | “Many unnecessary calls to foo() here” |
| Code Organization/Refactoring | 8 | Comments related to extracting code from methods and classes, moving large chunks of code around. | “Please extract this logic into a separate method” |
| Concurrency | 9 | Comments related to threads, synchronization, parallelism. | “This class does not look thread safe” |
| High Level Method Semantics & Design | 10 | Comments relating to method design and semantics. | “This method should return a String” |
| High Level Class Semantics & Design | 11 | Comments relating to class design and semantics. | “This should extend Foo” |
| Testing | 12 | | “is there a test for this?” |
| Other | 13 | Comments not relating to categories 1–12. | “Looks good”, “done”, “thanks” |
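For later use in code, the label scheme in the table above can be captured as a simple Python mapping (the dictionary name is my own, not from the notebook):

```python
# Mapping from numeric label to category name, per the classification table.
CATEGORIES = {
    1: "Readability",
    2: "Naming",
    3: "Documentation",
    4: "Error/Resource Handling",
    5: "Control Structures/Program Flow",
    6: "Visibility/Access",
    7: "Efficiency/Optimization",
    8: "Code Organization/Refactoring",
    9: "Concurrency",
    10: "High Level Method Semantics & Design",
    11: "High Level Class Semantics & Design",
    12: "Testing",
    13: "Other",
}

print(CATEGORIES[12])  # -> Testing
```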
Loading The Data Set
Now we’ll discuss our SVM text classifier implementation. This experiment represents a typical supervised learning classification exercise.
We’ll start by loading our training data, which consists of two files representing 2,000 manually labeled comment-classification pairs. The first file contains one review comment per line, while the second file contains the manually determined classification for each corresponding review comment.
```python
# Load the review comments and their manually assigned labels (one per line).
with open('review_comments.txt') as f:
    review_comments = f.readlines()
with open('review_comments_labels.txt') as g:
    classifications = g.readlines()
```
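The training step itself happens in the notebook; as one possible sketch of how such a pipeline can look (assuming scikit-learn, with a tiny inline dataset standing in for the files loaded above, and with details like n-gram range chosen for illustration rather than taken from the notebook):

```python
# Sketch of an SVM text classifier: TF-IDF features feeding a linear SVM.
# The toy comments and labels below stand in for review_comments /
# classifications loaded from disk; labels follow the table above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

comments = [
    "I think foo would be a more appropriate name",      # Naming -> 2
    "Please add a comment here explaining this logic",   # Documentation -> 3
    "is there a test for this?",                         # Testing -> 12
    "Looks good",                                        # Other -> 13
]
labels = [2, 3, 12, 13]

# Vectorize comments into TF-IDF unigram/bigram features, then fit the SVM.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(comments, labels)

# Classify an unseen comment.
print(clf.predict(["could you add a test covering this case?"]))
```

With the real 2,000-pair training set, the same pipeline shape applies: fit on the labeled comments, then predict labels for the remaining unlabeled review comments.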
Swagger UI communicates endpoint details well but fails at conveying the shape of a complex API at a glance. When an API exposes dozens of resources and hundreds of operations, developers need a high-level map before they can navigate the detail. This post proposes a table-based documentation format that presents resources, operations, and their relationships in a compact, scannable structure. The approach is complementary to existing spec-driven tooling and can be generated directly from an OpenAPI definition — a live demo built against the GitHub API is included.
Most OOP languages claim the label but miss what Alan Kay actually meant when he coined the term. Polymorphism, encapsulation, and inheritance are frequently cited as its pillars — but these exist in functional languages too. Kay’s actual vision was rooted in biology: autonomous objects communicating exclusively through message passing, with no direct access to each other’s internal state. This post traces that original conception through Kay’s early work and asks why almost no modern software actually practices it, and what we lose as a result.
Disk space exhaustion is a quiet failure mode — systems degrade gradually until they stop working entirely, often at the worst possible moment. This post shares a bash script that monitors local disk storage levels and reports them to a Slack channel at a configurable interval, color-coded by usage severity. The script uses standard Unix tooling with no additional dependencies and can be dropped onto any instance in minutes. It is designed to be deployed across multiple machines simultaneously and supports configurable alert thresholds so teams can act before things become critical.
Integrations are what take Slack from a normal online instant-messaging and collaboration system to a solution that lets you centralize all your notifications, from sales to tech support to social media, into one searchable place where your team can discuss and act on each one. In this article, I’ll share a simple bash script that reports local disk storage levels to Slack at a regular time interval. It is easily deployable to multiple instances, highly configurable, and can help teams take proactive measures in maintaining the operational well-being of their systems.

The script is available on GitHub and can be dropped anywhere on the instance you want to monitor. At a specified interval, it posts disk-storage information to Slack as illustrated above. The drive information is retrieved using the `df -h` command on Unix systems, and the listed drives are color coded based on how much storage capacity they have left. Two quick steps are required to get the integration up and running.
1 - Create a Slack Webhook Notification:
This will allow the script to post as a bot/integration instead of as yourself (which would require your personal credentials). First, ensure the Incoming WebHooks app is installed in your Slack organization. Next, click Add Configuration and follow the instructions to configure the integration settings as desired. Copy the value of the Webhook URL field, which will be required in the next step.

2 - Use a time-based job scheduler to run the script:
The job scheduler will execute the script at a regular interval based on how often we want to view the reports. On a Linux environment, the crontab command, which is used to schedule commands for periodic execution, is the perfect tool for the job. To create a new cron job, simply type crontab -e at a command prompt. New jobs can be installed by adding an entry to the file with the following syntax, where the five leading fields are minute, hour, day of month, month, and day of week:

```
1 2 3 4 5 /path/to/command arg1 arg2
```
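As a concrete sketch (the crontab entry, script path, and webhook URL below are placeholders of my own, and the real script on GitHub does more, such as color coding and thresholds), a job running every 30 minutes and the core of a report script might look like:

```shell
# Hypothetical crontab entry: run the report script every 30 minutes.
# */30 * * * * /opt/scripts/slack_disk_report.sh

# Placeholder for the Webhook URL value copied in step 1.
WEBHOOK_URL="https://hooks.slack.com/services/PLACEHOLDER"

# Capture per-drive usage, skipping the df header row.
usage=$(df -h | tail -n +2 | awk '{print $1, $5, "used"}')

# Wrap the report in a message payload (a production script would
# JSON-escape the embedded newlines before posting).
payload="{\"text\": \"Disk usage report: $usage\"}"

# Post to Slack (left commented out so this sketch runs without a real webhook).
# curl -s -X POST -H 'Content-Type: application/json' -d "$payload" "$WEBHOOK_URL"
echo "$payload"
```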
Relational databases were not designed for time series data — as write volumes grow, table cardinality climbs and query performance degrades in ways that are hard to tune around. Purpose-built time series databases like InfluxDB handle this workload efficiently by design, with compression, downsampling, and retention policies built in from the start. This post explains when that tradeoff is worth making and walks through the practical steps of migrating existing Postgres or MySQL records into InfluxDB using Python. It covers schema mapping, batching strategies, and the key differences in querying that will affect any application sitting on top of the new store.
Clarpse is a multi-language source code analysis tool designed for extracting deep structural relationships between entities in a codebase — classes, methods, fields, imports, and the connections between them. It exposes these relationships through a clean, language-agnostic API that decouples downstream tooling from any particular compiler or parser. Features like jump-to-definition, find-usages, type inference, and documentation generation can be built on top of Clarpse without re-implementing the underlying language analysis for each supported language. The library currently supports Java and Go, with a design that makes adding additional language backends straightforward.