Advancing data center networking through open access
October 27, 2022
By Ian Evans
A team from UCL creates an open access tool to benchmark and investigate data network systems
In the image above: Christopher Parsonson is a PhD candidate in AI & Networks at University College London (UCL). As part of a team led by Dr Georgios Zervas, Associate Professor of Optical Networked Systems, he produced an open access framework called TrafPy, which was recently published in the Elsevier journal Optical Switching and Networking.
When Christopher Parsonson, a PhD candidate in AI & Networks at UCL, was researching optical data center networks, the team he was part of knew they wanted to publish their work in an open access journal. Led by Dr Georgios Zervas, Professor of Optical Networked Systems at UCL, they set out to create an OA tool in an area that would have a massive impact on large-scale artificial intelligence models, and therefore on much of modern living. Chris explained:
We realized there was a lack of open access research around data center networks in general, but in particular the starting point for everything in this area is being able to benchmark different systems. There wasn’t an open access tool for doing that. We knew that creating one was key to accelerating progress, validating new systems, and making everything reproducible.
As part of Prof Zervas’s team, Chris led a project to produce an open access framework called TrafPy. Their paper was recently published in the Elsevier journal Optical Switching and Networking. TrafPy is compatible with any simulation, emulation or experimentation environment, and it can be used for standardized benchmarking and for investigating the properties and limitations of network systems. It provides a way to benchmark systems against those developed by other research teams, a crucial element for understanding whether progress is being made.
One of the reasons optical data center networks are so important is that they open the door to a large number of emerging applications, from AI models to genome processing systems. They have the potential to accelerate research in everything from health and technology to entertainment.
“Just recently we were hearing about BigScience’s new large language model, BLOOM, which can generate text in 46 different languages and 13 programming languages,” Chris said. “That’s made up of 176 billion parameters.”
To train a model of that scale, researchers use multiple computers. In the case above, the model is distributed across 1,024 different devices.
“That requires communication across all these devices, which increases the more devices you use,” Chris explained. “That means that the bottleneck is no longer about individual computers but about the network that connects them. Over the last 18 years, there’s been an 8-factor decrease in the number of bytes communicated per flop.
“In other words, the performance of computers is increasing much faster than the rate at which we’re increasing the performance of our data center networks.”
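To put that figure in yearly terms, the short sketch below converts an 8-fold drop over 18 years into an average annual rate. It assumes the decline was smooth and compounded at a constant rate, which the quote does not claim, and the function names are illustrative only.

```python
import math


def annual_gap_factor(total_factor: float, years: float) -> float:
    """Constant yearly factor that compounds to `total_factor` over `years`.

    Assumption: the gap between compute and network performance widened
    at a steady rate, which is a simplification for illustration.
    """
    return total_factor ** (1.0 / years)


def halving_time_years(total_factor: float, years: float) -> float:
    """Years needed for bytes per flop to halve at that constant rate."""
    return years * math.log(2) / math.log(total_factor)


if __name__ == "__main__":
    factor = annual_gap_factor(8, 18)
    print(f"Compute outpaces the network by about {(factor - 1) * 100:.1f}% per year")
    print(f"Bytes per flop halve roughly every {halving_time_years(8, 18):.1f} years")
```

Under that assumption, an 8-fold decrease is three successive halvings, so bytes communicated per flop would halve about every six years, with compute performance pulling ahead of network performance by roughly 12% each year.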
The source of that limitation is that current data centers use electronic switching, meaning they have poor scalability, low bandwidth and high latency. What’s more, they’re energy intensive. Data centers and data transmission networks each account for about 1% of the world’s energy consumption, a share predicted to rise to 10% to 15% of total energy consumed by 2030. In some cases, Chris said, more than 50% of that power consumption is from the network performing communication tasks between devices:
If we want to scale to next generation applications with brain-scale neural network models, which might have 100 trillion parameters, we need next generation data center networks where the interconnects are all optical. They have much lower latency, take up less space and — because they are passive devices that don’t need cooling — use considerably less power.
Chris’s team is already working with industry giants such as Microsoft on applications for the technology, and collaborating with people in multiple fields. Because data analysis underpins such a vast range of disciplines, Chris finds himself in conversation with people from a diverse range of areas, from physics to electrical engineering and computer science. That broad reach was one of the reasons they were keen to publish open access:
The more you publish open access, the more likely it is that people in different fields are going to find your work, see the findings you’ve published and apply them in their own research.
Chris noted that the area he works in has lagged behind computer science in how much of its research is published open access. One of the team’s goals was to redress that balance:
Computer science has a strong culture of open access, and over the past 10 years, it’s clear that that has been a big driver of progress in areas such as machine learning.
Chris pointed to AlexNet, an open-source neural network tool for image recognition that competed in the ImageNet 2012 Challenge. As Chris puts it, AlexNet “absolutely smashed” the other competitors:
That spurred the next decade of interest and investment in machine learning and turned it into one of the biggest fields of research.
“In data center networking, you haven’t seen so much of that development of open access benchmarks, simulations and systems,” he continued. “Even when research is published, the researchers often hold back from sharing the back-end algorithms or datasets they were using, so it’s hard to reproduce the results, to test against old systems and develop new ideas.
“Our work is to try and redress the balance and tackle some of the shortcomings in open access in data center networking.”
With open access as a must, Chris was also looking for a platform that would bring his work to the people who could use it, which was one of his reasons for choosing to publish with Elsevier.
Obviously Elsevier — through ScienceDirect — has a really big audience. And Optical Switching and Networking is a journal that caters to the kind of application areas we’re interested in — next-generation optical data networking centers. They had the open access option, a swift review process and high-quality papers that seemed to get a lot of traction. That was very appealing to us because we needed to make sure that wherever we publish, people end up seeing the work and using the tools that we’ve developed.
Of course, to maintain the high quality of its papers, a journal like Optical Switching and Networking has to have a rigorous peer review process. For some, that can be a slightly nerve-wracking or even frustrating process, but Chris was impressed with the speed and quality of feedback:
I’ve published in a few places recently, and this was extremely, extremely good. Usually, the slightly painful thing about the review process is the length of time it takes to get things done. Here, we submitted and then two-and-a-half months later, we had the reviews back, which is really good.
We changed the paper following that feedback, and then I think a day later it was approved and was then immediately available online for people to read. That’s much faster than my previous experience.
Chris also found the process useful in improving the paper:
Throughout the process, we were in communication with the editors, and they were very helpful in helping us realize how to frame our work. That really helped accelerate the review process.
Now that the work is published and available, Chris and the team see it as potentially providing the same kind of boost for data center research as ImageNet did for machine learning in 2012:
Our hope is that TrafPy will be used as the foundation with which to establish rigorous benchmarks and facilitate the next decade of development in data center networks. We’ve made it completely open-source, so anyone can download it and use it any way they wish. That is the key to accelerating progress.