Is it SqlManager or SQLManager? Which stereotypical postfix best describes the purpose of my class? Choosing a good class name sometimes feels more challenging than programming.

This article looks at how the most popular Java repositories on GitHub apply class naming conventions in practice. By analyzing over 1.3 million class names from over 2,300 repositories, we can draw new inspiration (e.g., for postfixes) and back up arguments in naming controversies with statistical figures.

The Sample Data 

For the following analysis, we use the most popular GitHub repositories (based on the number of awarded stars) because they form the backbone of the Java ecosystem (e.g., Spring, Apache Commons, Log4J, Jackson). Many, if not almost all, Java projects depend on these repositories and are inspired by their coding principles. Therefore, the selected sample should be reasonably representative.

The GitHub search API makes it easy to get a JSON Array of all repositories which are using Java in descending order of their popularity:

GET https://api.github.com/search/repositories?sort=stars&order=desc&q=language:java&page=1&per_page=100

{"total_count": 10816023, "incomplete_results": true, "items":
 [ { "id": 20300177,
      "full_name": "google/guava",
      "trees_url": "https://api.github.com/repos/google/guava/git/trees{/sha}",
      "default_branch": "master",
      ... },
    ...
  ] }
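As a minimal sketch, the search request above could be built in Java as follows. The class and method names (RepositorySearch, searchUrl, searchRequest) are hypothetical and not part of the original analysis; the query parameters are copied from the request shown above, and the Accept header is an assumption on our part. The request is only constructed here, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper for paging through the repository search shown above.
final class RepositorySearch {

    // Query parameters are copied from the request above; only the page varies.
    static String searchUrl(int page) {
        return "https://api.github.com/search/repositories"
                + "?sort=stars&order=desc&q=language:java"
                + "&page=" + page + "&per_page=100";
    }

    // Builds (but does not send) the GET request for one result page.
    // The Accept header is an assumption, not taken from the article.
    static HttpRequest searchRequest(int page) {
        return HttpRequest.newBuilder(URI.create(searchUrl(page)))
                .header("Accept", "application/vnd.github+json")
                .GET()
                .build();
    }
}
```

Sending the request (for example, with java.net.http.HttpClient) and parsing the JSON response are left out of this sketch.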

With the trees_url, we can get all the files in a repository for its default_branch:

GET https://api.github.com/repos/google/guava/git/trees/master?recursive=1

{ "tree": [
  { "path": "guava/src/com/google/common/util/concurrent/SmoothRateLimiter.java" },
  { "path": "guava/src/com/google/common/util/concurrent/Striped.java" },
  ...
] }
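The step from trees_url to the actual request can be sketched like this, assuming the template always ends with the {/sha} placeholder seen in the sample response. TreeUrls and recursiveTreeUrl are hypothetical names introduced for illustration.

```java
// Hypothetical helper that turns a trees_url template into a concrete request URL.
final class TreeUrls {

    // Expands the "{/sha}" placeholder with a branch name and asks for the
    // full recursive listing, as in the request above.
    static String recursiveTreeUrl(String treesUrlTemplate, String branch) {
        return treesUrlTemplate.replace("{/sha}", "/" + branch) + "?recursive=1";
    }
}
```

For the guava example, passing the trees_url from the search result and the default_branch master yields the URL used in the request above.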

Using this method, we can obtain 2,830,728 filenames from 2,353 repositories. (Side note: these repositories contain ~63 GB of data.)

From the set of all filenames, we want those that belong to Java sources. Therefore, we filter the names for the ones with the extension .java. The result is a reduced set of 1,387,857 files.

Since we are only working at the filename level and cannot look at the contents of a file, we need other indicators of whether it is a Java source file. A good indicator is to validate that the filename is a valid Java class name. We can check this requirement by using the following two Java methods:

SourceVersion.isIdentifier(fileName) && !SourceVersion.isKeyword(fileName)
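Put together, the two filter steps might look like the following sketch. It assumes that fileName in the check above means the last path segment without the .java extension; ClassNameFilter and classNameOf are hypothetical names.

```java
import java.util.Optional;
import javax.lang.model.SourceVersion;

// Hypothetical filter: maps a repository path to a class name, if it has one.
final class ClassNameFilter {

    static Optional<String> classNameOf(String path) {
        // Step 1: keep only files with the .java extension.
        if (!path.endsWith(".java")) {
            return Optional.empty();
        }
        // Assumption: the class name is the last path segment without ".java".
        String fileName = path.substring(
                path.lastIndexOf('/') + 1, path.length() - ".java".length());
        // Step 2: the same check as above.
        boolean valid = SourceVersion.isIdentifier(fileName)
                && !SourceVersion.isKeyword(fileName);
        return valid ? Optional.of(fileName) : Optional.empty();
    }
}
```

With this sketch, a file such as package-info.java is dropped because the hyphen makes it an invalid identifier.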

The final data set consists of 1,369,558 class names (48% of all files in the repositories).

Java Source Sets 

A convention is to organize Java sources into source sets using the directory format /src/$sourceSetName/java/, for example, to distinguish production code from test code.

In the sample, 76% of the Java files are structured in source sets, and 94% are located under /src/.
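One way to detect the source set of a file is a regular expression over its path. This is only a sketch: the pattern is our own reading of the /src/$sourceSetName/java/ format (it also accepts paths that start with src/), and SourceSets and sourceSetOf are hypothetical names.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that extracts the source set name from a file path.
final class SourceSets {

    // Assumption: a source set is the single directory between "src/" and "/java/".
    private static final Pattern SOURCE_SET =
            Pattern.compile("(?:^|/)src/([^/]+)/java/");

    static Optional<String> sourceSetOf(String path) {
        Matcher m = SOURCE_SET.matcher(path);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

A path such as guava/src/com/google/common/util/concurrent/Striped.java lies under /src/ but yields no source set, because no java directory follows the first folder.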

The following table shows the most popular source set names (the full list can be found here):

| Source Set Name | Occurrences |
| --- | --- |
| main | 747396 |
| test | 267670 |
| gen | 10511 |
| generated | 7945 |
| androidTest | 5510 |
| net | 1862 |
| distributedTest | 1271 |
| internalClusterTest | 1116 |
| integrationTest | 1108 |
| it | 1060 |
| testIntegration | 875 |
| com | 797 |
| test.slow | 668 |
| jmh | 559 |
| testFixtures | 526 |
| test-fast | 488 |
| test-integration | 376 |
| integration-test | 314 |
| ext | 249 |
| acceptanceTest | 199 |

Unsurprisingly, main and test are at the top. However, the wide variation in the test source sets is interesting.

Package Names 

Let’s first look at the packages of the classes in the sample. We use only the subset of classes organized in source sets so that we do not confuse the package name with the outer folders.

The package names have a median of 31 characters (50% are below or above this point) and an average of 32.2. The following diagram shows the overall distribution:

[Figure: distribution of the number of characters in package names]

A package name usually consists of several subpackages. The sample has a median of 5 subpackages and an average of 4.7. We can see a more detailed distribution in the following diagram:

[Figure: distribution of the number of subpackages]

The convention is that the first two subpackages represent a domain (e.g., org.company), and each further subpackage makes the categorization more fine-grained. If we take this format as a basis, the distribution shows either no categorization or at least three categories. Interestingly, package names with only one or two categories are rare.
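For files inside a source set, the package name and its number of subpackages could be derived from the path roughly as follows. The sketch assumes that everything between /java/ and the filename is the package; PackageNames, packageOf, and subpackageCount are hypothetical names.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that derives the package name from a source set path.
final class PackageNames {

    // Group 1 holds the directories between "/java/" and the filename
    // (it is missing for files in the default package).
    private static final Pattern IN_SOURCE_SET =
            Pattern.compile("(?:^|/)src/[^/]+/java/(?:(.+)/)?[^/]+$");

    static Optional<String> packageOf(String path) {
        Matcher m = IN_SOURCE_SET.matcher(path);
        if (!m.find()) {
            return Optional.empty(); // not organized in a source set
        }
        String directories = m.group(1);
        return Optional.of(directories == null ? "" : directories.replace('/', '.'));
    }

    static int subpackageCount(String packageName) {
        return packageName.isEmpty() ? 0 : packageName.split("\\.").length;
    }
}
```

For app/src/main/java/org/company/util/Foo.java, this gives the package org.company.util with three subpackages, that is, a domain plus one category.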

Class Name Lengths 

The lengths of the class names range from 1 character (here, we find the whole alphabet, mostly in test sources) up to 136 characters. The prize for the longest name goes to the Kubernetes Java Client with:

V1alpha2IssuerSpecAcmeHttp01IngressPodTemplateSpecAffinityNodeAffinityRequiredDuringSchedulingIgnoredDuringExecutionNodeSelectorTerms

If we look at the class names over 60 characters, they are almost exclusively names of test classes containing a detailed test case description. For example:

TomcatSessionBackwardsCompatibilityTomcat8WithOldModulesMixedWithCurrentCanDoPutFromCurrentModuleTest
AtomicReferenceArrayAssert_usingRecursiveFieldByFieldElementComparator_with_RecursiveComparisonConfiguration_Test
WildFlyActivationRaWithWMElytronSecurityDomainWorkManagerElytronDisabledTestCase
Assertions_sync_assertThatRuntimeException_with_BDDAssertions_and_WithAssertions_Test

Next, let’s take a look at the distribution of lengths. The median length is 18.0 characters, and the average is 18.72. Plotting the distribution shows a shape that is close to a normal distribution:

[Figure: distribution of the number of characters in class names]

The peak at 15 characters is interesting. This length could be a sweet spot for a class name.
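The two figures can be computed from a list of class names with a few lines of Java. This is an illustrative sketch, not the code used for the analysis; LengthStats and its methods are hypothetical names.

```java
import java.util.List;

// Hypothetical helper for the length statistics of class names.
final class LengthStats {

    static double mean(List<String> names) {
        return names.stream().mapToInt(String::length).average().orElse(Double.NaN);
    }

    static double median(List<String> names) {
        int[] sorted = names.stream().mapToInt(String::length).sorted().toArray();
        if (sorted.length == 0) {
            return Double.NaN;
        }
        int mid = sorted.length / 2;
        // For an even number of names, average the two middle lengths.
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }
}
```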

Character Set 

The Java specification allows the use of almost the entire Unicode spectrum for class names (an exception is, for example, the - character). In contrast, it is interesting that in our sample, 98% use only Latin characters and numbers (matching the regex ^[a-zA-Z0-9]+$). And if we include the two characters _ and $, we get 99.99%.

The obvious interpretation of these numbers is that the most popular Java repositories on GitHub are maintained mainly by an English-speaking developer base.

(It would be interesting to analyze the character set on Gitee, the Chinese GitHub equivalent. A cursory look at the top Java repositories shows that Chinese is used for the documentation (e.g., JavaDoc), but the code itself also uses mostly Latin characters.)
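The shares mentioned in this section can be measured by matching every class name against a pattern. In the sketch below, the first pattern is the regex from the text; the second one, which additionally allows _ and $, is our assumption about how the extended check could look. CharacterSetCheck and share are hypothetical names.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper that measures which share of names matches a pattern.
final class CharacterSetCheck {

    // Regex from the text: Latin letters and digits only.
    static final Pattern LATIN_AND_DIGITS = Pattern.compile("^[a-zA-Z0-9]+$");

    // Assumption: the same character class extended by '_' and '$'.
    static final Pattern WITH_UNDERSCORE_AND_DOLLAR = Pattern.compile("^[a-zA-Z0-9_$]+$");

    static double share(List<String> names, Pattern pattern) {
        if (names.isEmpty()) {
            return 0.0;
        }
        long hits = names.stream().filter(n -> pattern.matcher(n).matches()).count();
        return (double) hits / names.size();
    }
}
```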

Pascal Case 

A widely used convention is to use the Pascal case notation for class names. In this format, the individual words are concatenated, and each word is marked by capitalizing its first letter. In our sample data, 97.3% of class names follow the strict Pascal case convention (matching the regular expression ^[A-Z][a-z0-9]*([A-Z][a-z0-9]*)*$).

The exceptions can be found mainly in the test sources (primarily names using the postfix _Test) and in the generated sources (these class names often start with a lowercase character).
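The check against the strict Pascal case regex from above is a one-liner in Java. PascalCase and isStrictPascalCase are hypothetical names used for this sketch.

```java
import java.util.regex.Pattern;

// Hypothetical helper that applies the strict Pascal case regex from the text.
final class PascalCase {

    static final Pattern STRICT =
            Pattern.compile("^[A-Z][a-z0-9]*([A-Z][a-z0-9]*)*$");

    static boolean isStrictPascalCase(String name) {
        return STRICT.matcher(name).matches();
    }
}
```

Names such as UserService and Http2Handler pass this check, while a name containing an underscore or starting with a lowercase letter does not.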

Number of Words 

The class names that follow the Pascal notation have a median of 3 words and an average of 3.36. The full distribution can be seen in the following diagram:

[Figure: distribution of the number of words in class names]
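A simple way to count the words of a Pascal case name is to split it in front of every capital letter. This is a sketch of one possible approach; the article does not state how the words were counted. PascalWords and words are hypothetical names.

```java
import java.util.List;

// Hypothetical helper that splits a Pascal case name into its words.
final class PascalWords {

    // The lookahead splits in front of each capital letter without consuming it.
    static List<String> words(String pascalCaseName) {
        return List.of(pascalCaseName.split("(?=[A-Z])"));
    }
}
```

Note that with this approach, every letter of an uppercase abbreviation counts as a separate word, so SQLManager would yield four words.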

Plurals 

Do we call a class that contains static utility methods XyzUtil or XyzUtils? In our sample set, 11,771 class names have the postfix Util and 16,075 the plural Utils.

The next question is whether we should use the plural form for a component that manages multiple instances of an entity. For example, is it UserManager or UsersManager?

We use the following methodology to make a statistical survey of this difference. First, we create a new class name by inserting an s before each capital letter in the original name. Then we check if the new name occurs more than once in the sample set. This approach gives us a pretty strong result: the singular form of an entity gets used much more frequently than its plural counterpart.

The following table shows a selection of the results:

| Singular | Occurrences | Plural | Occurrences |
| --- | --- | --- | --- |
| UserController | 340 | UsersController | 4 |
| UserService | 306 | UsersService | 2 |
| UserRepository | 167 | UsersRepository | 1 |
| UserServiceImpl | 141 | UsersServiceImpl | 1 |
| FileUtil | 132 | FilesUtil | 2 |
| UserApi | 119 | UsersApi | 4 |
| DateUtils | 95 | DatesUtils | 1 |
| OrderService | 95 | OrdersService | 2 |
| TestHelper | 72 | TestsHelper | 4 |
| ViewUtils | 65 | ViewsUtils | 1 |
| OrderServiceImpl | 60 | OrdersServiceImpl | 2 |
| ImageUtils | 59 | ImagesUtils | 1 |
| EventHandler | 56 | EventsHandler | 1 |
| PersonRepository | 56 | PersonsRepository | 2 |
| ResourceUtils | 56 | ResourcesUtils | 4 |
| CommonUtils | 54 | CommonsUtils | 1 |
| BookRepository | 49 | BooksRepository | 2 |
| ColorUtils | 48 | ColorsUtils | 1 |
| ResourceLoader | 47 | ResourcesLoader | 3 |
| ArrayUtil | 46 | ArraysUtil | 2 |
| PluginManager | 46 | PluginsManager | 3 |
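The generation of plural candidates described in the methodology can be sketched as follows. We assume here that the first capital letter is skipped (an s in front of the whole name would not be a plural) and that each candidate contains exactly one inserted s. PluralCandidates is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that generates plural candidates for a class name.
final class PluralCandidates {

    // Creates one candidate per capital letter (except the first letter)
    // by inserting an 's' directly in front of it.
    static List<String> of(String name) {
        List<String> candidates = new ArrayList<>();
        for (int i = 1; i < name.length(); i++) {
            if (Character.isUpperCase(name.charAt(i))) {
                candidates.add(name.substring(0, i) + "s" + name.substring(i));
            }
        }
        return candidates;
    }
}
```

For UserServiceImpl, this produces UsersServiceImpl and UserServicesImpl; each candidate can then be looked up in the sample set.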

Abbreviations 

An ongoing controversy with class names is the upper and lower case spelling of abbreviations. If we interpret the Pascal Case notation strictly and consider an abbreviation as one word, we would have to capitalize only the first letter (e.g., Sql instead of SQL).

It would require a more significant effort to approach a general statement about the ratio of both spellings in the dataset since we would have to work with a more complex statistical model. Therefore, we will limit the analysis to getting a rough overview of the distribution in the sample.

First, we need a list of all abbreviations in all class names. To find them, we need to search for all the class names’ substrings which contain at least three upper case characters. We then drop the last letter (because this is the beginning of a new word) and get the abbreviation. For example, from MySQLManager, we would find the substring SQLM which then results in the abbreviation SQL. Finally, we reduce the resulting list to the commonly known terms to focus on more accurate results.
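The extraction step can be sketched with a regular expression that finds runs of at least three uppercase letters. This follows the description literally; UppercaseAbbreviations and candidates are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that extracts uppercase abbreviation candidates.
final class UppercaseAbbreviations {

    private static final Pattern UPPERCASE_RUN = Pattern.compile("[A-Z]{3,}");

    static List<String> candidates(String className) {
        List<String> result = new ArrayList<>();
        Matcher m = UPPERCASE_RUN.matcher(className);
        while (m.find()) {
            String run = m.group();
            // Drop the last letter, which is taken to start the next word.
            result.add(run.substring(0, run.length() - 1));
        }
        return result;
    }
}
```

A limitation of taking the description literally: an abbreviation at the very end of a name (e.g., MySQL) also loses its last letter. The final reduction to commonly known terms would filter out such fragments.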

We start by counting how many times an uppercase abbreviation occurs in all class names and in how many repositories.

Next, for the lowercase variant, we have to put in more effort. First, we use the regular expression ^.*LOWER_APPR[s]?([A-Z0-9_$]+.*)?$ (where LOWER_APPR stands for the lowercase spelling of the abbreviation) to find all occurrences of an abbreviation that are followed by a capital letter (e.g., ApiService), a number (e.g., Http2Handler), or a special character. We also allow an optional plural form of the abbreviation, e.g., ApisService. The restriction imposed by the regular expression is necessary because we want to find Ui in UiManager but not in Uint8Array. However, it is also too restrictive since, e.g., Sqlite is filtered out as well. Therefore, we have to manually go through all class names that contain a lowercase abbreviation but do not match the regular expression and add the legitimate ones to the result.
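Applied in Java, the regular expression for the lowercase variant could be built per abbreviation as in this sketch. The placeholder in the pattern is replaced by the quoted abbreviation; LowercaseAbbreviations, patternFor, and contains are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical helper that applies the lowercase-abbreviation regex from the text.
final class LowercaseAbbreviations {

    // Substitutes the abbreviation (e.g., "Api") for the placeholder in the regex.
    static Pattern patternFor(String lowerAbbreviation) {
        return Pattern.compile(
                "^.*" + Pattern.quote(lowerAbbreviation) + "[s]?([A-Z0-9_$]+.*)?$");
    }

    static boolean contains(String className, String lowerAbbreviation) {
        return patternFor(lowerAbbreviation).matcher(className).matches();
    }
}
```

With this sketch, Ui is found in UiManager but not in Uint8Array, and Sql is not found in Sqlite, which is exactly the case that needs the manual review.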

This approach yields the following overview of total occurrences and in how many repositories they occur:

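| Uppercase | Occurrences | Repositories | Lowercase | Occurrences | Repositories |
| --- | --- | --- | --- | --- | --- |
| SQL | 5964 | 228 | Sql | 6752 | 246 |
| HTTP | 628 | 99 | Http | 10479 | 621 |
| API | 1886 | 223 | Api | 6839 | 478 |
| DB | 5457 | 358 | Db | 2607 | 277 |
| XML | 2757 | 173 | Xml | 3140 | 315 |
| UI | 4730 | 374 | Ui | 571 | 134 |
| DAO | 899 | 64 | Dao | 2938 | 211 |
| IO | 2877 | 431 | Io | 710 | 116 |
| JDBC | 967 | 77 | Jdbc | 2579 | 170 |
| URL | 1521 | 252 | Url | 1756 | 403 |
| DTO | 1861 | 91 | Dto | 1277 | 81 |
| VM | 2382 | 147 | Vm | 342 | 37 |
| OS | 2027 | 222 | Os | 414 | 98 |
| SSL | 1449 | 162 | Ssl | 957 | 130 |
| URI | 482 | 114 | Uri | 1591 | 212 |
| HTML | 695 | 86 | Html | 1251 | 188 |
| AST | 1179 | 63 | Ast | 423 | 43 |
| LDAP | 251 | 32 | Ldap | 873 | 64 |
| JMS | 379 | 16 | Jms | 730 | 34 |
| CSV | 383 | 84 | Csv | 705 | 95 |
| JVM | 291 | 59 | Jvm | 661 | 103 |
| CI | 794 | 134 | Ci | 48 | 16 |
| FTP | 315 | 21 | Ftp | 497 | 32 |
| HDFS | 216 | 30 | Hdfs | 556 | 44 |
| UDF | 661 | 19 | Udf | 106 | 11 |
| CDI | 273 | 20 | Cdi | 358 | 21 |
| AWS | 192 | 23 | Aws | 410 | 38 |
| ARM | 393 | 16 | Arm | 69 | 15 |
| AES | 209 | 62 | Aes | 151 | 49 |
| APPLE | 15 | 1 | Apple | 320 | 34 |
| JAXRS | 121 | 10 | Jaxrs | 190 | 15 |
| APK | 15 | 8 | Apk | 294 | 60 |
| ISO | 191 | 42 | Iso | 99 | 38 |
| AMQP | 61 | 8 | Amqp | 175 | 16 |
| ASCII | 77 | 29 | Ascii | 155 | 71 |
| CRUD | 53 | 22 | Crud | 159 | 36 |
| AWT | 182 | 29 | Awt | 29 | 15 |
| MIDI | 50 | 2 | Midi | 129 | 9 |
| HANA | 69 | 3 | Hana | 100 | 4 |
| HDR | 78 | 13 | Hdr | 38 | 15 |
| CDC | 25 | 11 | Cdc | 87 | 6 |
| JAXWS | 75 | 4 | Jaxws | 15 | 4 |
| IAM | 69 | 9 | Iam | 19 | 6 |
| GMS | 53 | 4 | Gms | 14 | 8 |
| UTC | 30 | 15 | Utc | 37 | 16 |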

(When interpreting the numbers, it should be noted again that the methodology used is somewhat imprecise and not an exact science.)

Interestingly, the ratio between the two spellings varies greatly from one abbreviation to the next. On the one hand, we have abbreviations like SQL/Sql that are almost in equilibrium, and on the other hand, some like HTTP/Http show a glaring imbalance.

Prefixes and Postfixes

It’s common to give class names stereotypical prefixes or postfixes.

We can extract over 4700 prefixes using the regular expression ^([A-Z][a-z0-9]+)[A-Z]+.*$ from the class names that follow the Pascal notation. The following table shows the most common ones:

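| Prefix | Occurrences | Repositories |
| --- | --- | --- |
| Test | 41507 | 866 |
| Abstract | 17321 | 822 |
| Default | 11733 | 820 |
| Base | 6852 | 981 |
| User | 6690 | 579 |
| File | 6533 | 782 |
| My | 6462 | 573 |
| Simple | 6299 | 808 |
| Data | 5784 | 624 |
| Get | 5330 | 344 |
| Http | 5118 | 518 |
| String | 4338 | 714 |
| Web | 4105 | 493 |
| Custom | 3989 | 680 |
| Client | 3957 | 343 |
| Ifc | 3853 | 1 |
| Class | 3832 | 416 |
| Multi | 3616 | 518 |
| Input | 3596 | 298 |
| Spring | 3567 | 261 |
| Grid | 3437 | 146 |
| Json | 3373 | 477 |
| Message | 3170 | 423 |
| Local | 3139 | 455 |
| Mock | 3040 | 306 |
| Type | 3038 | 344 |
| Commerce | 3021 | 2 |
| Job | 2993 | 153 |
| Server | 2984 | 355 |
| Java | 2945 | 352 |
| Main | 2935 | 1084 |
| Query | 2930 | 307 |
| Map | 2911 | 417 |
| App | 2865 | 448 |
| Event | 2851 | 376 |
| Service | 2851 | 365 |
| Basic | 2808 | 413 |
| List | 2792 | 515 |
| Cache | 2780 | 342 |
| Resource | 2778 | 390 |
| Rest | 2739 | 189 |
| Sql | 2737 | 187 |
| Config | 2735 | 380 |
| No | 2722 | 514 |
| Create | 2684 | 291 |
| Method | 2652 | 317 |
| Open | 2650 | 278 |
| Object | 2613 | 435 |
| Application | 2587 | 570 |
| Request | 2586 | 398 |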

Similarly, we can extract over 2200 postfixes using the regular expression ^.*[A-Z][a-z0-9]+([A-Z][a-z0-9]+)$. The most common ones can be seen in the following table:

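| Postfix | Occurrences | Repositories |
| --- | --- | --- |
| Test | 182596 | 1605 |
| Impl | 28125 | 818 |
| Tests | 24745 | 386 |
| Factory | 21902 | 975 |
| Exception | 19745 | 1018 |
| Service | 17703 | 908 |
| Utils | 14418 | 1241 |
| Handler | 13984 | 939 |
| Provider | 13835 | 773 |
| Type | 12641 | 803 |
| Activity | 11357 | 1203 |
| Manager | 11182 | 968 |
| Util | 10691 | 1025 |
| Builder | 10355 | 726 |
| Listener | 10348 | 1055 |
| Controller | 9119 | 594 |
| Config | 9056 | 770 |
| Action | 8363 | 349 |
| Info | 7958 | 757 |
| View | 7461 | 917 |
| Event | 7351 | 511 |
| Adapter | 7159 | 1007 |
| Filter | 6920 | 658 |
| Helper | 6866 | 893 |
| Configuration | 6838 | 506 |
| Request | 6729 | 482 |
| Context | 6534 | 583 |
| Mapper | 6289 | 412 |
| Case | 6279 | 225 |
| Node | 6063 | 384 |
| Processor | 5766 | 516 |
| Description | 5701 | 147 |
| Fragment | 5659 | 563 |
| Parser | 5523 | 616 |
| Model | 5059 | 473 |
| Response | 4986 | 401 |
| Repository | 4958 | 290 |
| Data | 4827 | 647 |
| Converter | 4650 | 442 |
| Resource | 4617 | 292 |
| Generator | 4545 | 543 |
| Function | 4534 | 309 |
| Command | 4512 | 271 |
| Application | 4490 | 602 |
| Bean | 4344 | 314 |
| Client | 4340 | 553 |
| Task | 4293 | 495 |
| Reader | 4162 | 418 |
| Source | 4009 | 475 |
| Result | 3974 | 603 |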
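Both regular expressions from this section can be applied with a small helper like the following sketch, which returns the first capture group if the name matches. Affixes, prefixOf, and postfixOf are hypothetical names.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that applies the prefix and postfix regexes from the text.
final class Affixes {

    private static final Pattern PREFIX =
            Pattern.compile("^([A-Z][a-z0-9]+)[A-Z]+.*$");
    private static final Pattern POSTFIX =
            Pattern.compile("^.*[A-Z][a-z0-9]+([A-Z][a-z0-9]+)$");

    static Optional<String> prefixOf(String className) {
        return firstGroup(PREFIX, className);
    }

    static Optional<String> postfixOf(String className) {
        return firstGroup(POSTFIX, className);
    }

    private static Optional<String> firstGroup(Pattern pattern, String className) {
        Matcher m = pattern.matcher(className);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

Single-word names such as Main have neither a prefix nor a postfix under these patterns, and a name that ends with an uppercase abbreviation (e.g., UserDAO) has no postfix.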

Recurring Class Names 

Besides commonly used prefixes and postfixes, seeing how many class names reoccur in multiple repositories is also interesting.
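Counting is a matter of simple aggregation. The sketch below assumes the class names are available as a map from repository name to the list of class names found in it; this input shape and the names RecurringNames, occurrences, and repositories are our own assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper that counts how often class names recur.
final class RecurringNames {

    // Total number of times each class name occurs across all repositories.
    static Map<String, Integer> occurrences(Map<String, List<String>> namesByRepository) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> names : namesByRepository.values()) {
            for (String name : names) {
                counts.merge(name, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Number of distinct repositories in which each class name occurs.
    static Map<String, Integer> repositories(Map<String, List<String>> namesByRepository) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> names : namesByRepository.values()) {
            for (String name : new HashSet<>(names)) {
                counts.merge(name, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```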

The following table shows the most frequently recurring class names across all repositories in the sample data (a complete list of all class names occurring more than ten times in more than ten repositories can be found here):

| Class name | Occurrences | Repositories |
| --- | --- | --- |
| Solution | 2375 | 14 |
| MainActivity | 1862 | 872 |
| User | 1225 | 314 |
| Main | 1078 | 231 |
| Application | 1064 | 147 |
| ExampleUnitTest | 853 | 323 |
| Utils | 836 | 424 |
| Person | 723 | 149 |
| ApplicationTest | 687 | 313 |
| Constants | 674 | 341 |
| Test | 546 | 106 |
| Util | 531 | 230 |
| ExampleInstrumentedTest | 489 | 199 |
| HelloController | 487 | 43 |
| App | 486 | 171 |
| Order | 464 | 125 |
| Client | 408 | 124 |
| A | 401 | 55 |
| Foo | 401 | 81 |
| UserController | 340 | 121 |
| Node | 338 | 190 |
| Address | 338 | 100 |
| Message | 320 | 195 |
| TestUtils | 318 | 187 |
| UserService | 306 | 132 |

About the Author

Marcel Kliemannel

Software engineer, JVM enthusiast, and technical writer focused on architecture, backend development, security, automation, DevOps, monitoring, and performance.