My Google Internship
Everyone on my team was great to work with. Some of my team members were in other offices. Looking back, I am surprised by how easy it was to work together despite the lack of frequent face-to-face contact. Video conferencing made a big difference, though; conference calls do not quite have the same effect.
It would be hard to talk about my internship without talking about the perks. I had a latte pretty much every day. Google also serves free meals (it's not a myth). I also often joined a group of Googlers who played Ultimate Frisbee on the beach a couple of times a week.
The best part would be the people I worked with. I went to Switzerland to present a conference paper (not related to Google). I asked if I could work from the Zurich office for a couple days and it was no problem at all. It was really easy to get things done and get 'yeses' from people.
Since this was my first internship in four years, Google has spoiled my expectations for future jobs. When I go back to Iowa State, I know I am going to be annoyed by any bureaucracy and paperwork. I had to complete a make-work essay to pass the course I have to sign up for to take this internship (link). Overall, I feel incredibly lucky to have had this opportunity.
Three Minute Thesis
On April 4th, 2014, I presented a Three Minute Thesis at the first ISU GPSS Graduate And Professional Student Research Conference. You may have heard about the competition: it started at the University of Queensland and spread to other universities. You can read more about it at threeminutethesis.org. PhD Comics also had a 2 Minute Thesis Contest. The rules of the competition are that you only have 3 minutes to talk (obviously) and you can have one static slide.
My approach was to hit the major points of what my thesis will actually be. Some of the direction of my thesis was still up in the air when I made this presentation. Fortunately, it's been decided that it's the same as my video suggests: pushing the performance of sparse matrix vector multiplication on reconfigurable platforms.
Overall, I am OK with how the presentation went. I know I need to work on my public speaking a bit: I pace the stage, awkwardly put my hands in my pockets, and hold invisible objects. I have a couple of stutters and false starts, mostly from nerves and trying to talk quickly. The competition helped me learn to quickly tell someone what I do and, hopefully, convey why I think FPGAs (reconfigurable chips) are such a cool computing platform.
By the way, you can check out the other videos here. The winners were: First Place: Anthony Guerra, Second Place: Rosemary Bulyaba, and Popular Choice: Michael Curtis. Anthony's presentation becomes pure gold at the end.
April Fool's Day
Last week I played a prank on my professor. I created a fake conference, the High Performance Reconfigurable Computing conference (HPRC), and pretended to submit a paper to it. The message my advisor got was as follows:
Dear Joseph Zambreno,
Kevin Townsend <firstname.lastname@example.org> submitted the following
paper to HPRC 2014:
NOTCRAP: Nearest neighbOrs Text Classification on ReconfigurAble Platforms
You are listed as one of the authors of this paper. To view the
details of HPRC and the submission visit HPRCc.org.
The HPRC Automated Messenger.
The website was not too difficult to create. I used a template created by AJ at HTML5UP. I also stole and paraphrased information from legitimate conferences: Memocode, FCCM, and HPCA.
This continued a tradition from last year when I covered Dr. Zambreno's door with a fake wall. I recently found out that April 1 is also recognized by some as International Fun at Work Day. I think it's important to have some fun at work.
Update: I have since taken down the hprcc website.
DAC Testing Environment
I have been preparing for an interview this Tuesday, going over some previous projects I have worked on. I interviewed myself about one recent project from last year. Here is that interview:
Could you tell me about a project you worked on previously?
Sure! About a year ago, I built an interface for testing a digital-to-analog converter, or DAC for short. This chip took its 16-bit digital input in parallel for higher bandwidth. An FPGA was chosen to interface with the chip because FPGAs have lots of high-speed I/O pins. The FPGA was connected to a computer by USB. I also wrote the Python program that communicated with the FPGA. In the end Tao, the one who made the chip, could completely test it, and he published a paper ("A 15-bit binary-weighted current-steering DAC with ordered element matching").
What was the most challenging aspect of the project?
The most challenging aspect of the DAC project was time management. The project involved an FPGA connected to a custom ASIC, and the PCB was being designed at the same time. The first board design was done by Yuan Ji and the second by Jingbo Duan. By the time the board was done, I had DC operation ready. From then on, as Tao needed features, I would implement them on the fly, usually delivering a solution or a new feature within 24 hours. For example, he wanted an external clock rather than the internal one. Originally I was not sure how to get the clock selector working, so I made two bitfiles: one with the external clock and one with the internal clock. Eventually, all the features Tao wanted were included in the project.
What did you learn?
What I learned in the DAC project was time management, and that it's hard to gauge other people. I was not sure about Tao's ability to design a working DAC. Some of this was because, in my mind, analog chip design is harder than digital chip design and programming. I took an analog design class at Iowa State my first semester (I got a B+), and it was one of the hardest courses I have ever taken. I failed to realize that people have different strengths and weaknesses. Looking back, I think Tao did a good job with the project and found the right people to help the project get completed.
What was the most interesting part of the project?
The most interesting part of the design was working with the Lattice board. Usually I deal with a machine one building away that I rarely see (the Convey HC-2), and I rarely deal with the specific non-computation features of FPGAs like dynamic clock selection, phase-locked loops, and differential signals.
What was the hardest bug you faced during the project?
None of the bugs were too difficult to solve. Getting the clock selection working was challenging. One bug I found was that the clock selector would never switch away from a clock signal that was stuck low, so I had to set up the system so that this could never happen. The problem would not appear if the external clock was connected; however, if the external clock pin was unconnected (floating), the problem would appear.
What did you enjoy most about the project?
I enjoyed creating the interface between the Python program and the FPGA. Basically, the Python program sent opcodes/instructions to the FPGA, and these opcodes controlled the state of the FPGA. The opcodes matched the state values, so I could jump from the idle state to any other state the FPGA could go into. It was fun to make the FPGA side as simple as possible and do everything complex on the Python side. There were a couple of times I couldn't get a complex instruction working, so I would use a bunch of simple instructions to do the same thing, just slower.
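The opcode-equals-state idea can be sketched in Python. Everything here is illustrative: the opcode names, encodings, and the link object are assumptions for the sake of the sketch, not the original interface, and the real program talked to the FPGA over USB rather than to an in-memory stub.

```python
# Hypothetical sketch of the protocol described above: the PC sends
# single-byte opcodes whose values equal the FPGA state machine's state
# encodings, so any state is reachable directly from idle.
IDLE, SET_DC, LOAD_WAVEFORM, SELECT_CLOCK = 0x00, 0x01, 0x02, 0x03

class FpgaLink:
    """Stand-in for the USB link; records the bytes that would be sent."""
    def __init__(self):
        self.sent = []

    def write(self, data: bytes):
        self.sent.extend(data)

def send_opcode(link, opcode, payload=b""):
    # Because opcode == state value, the FPGA jumps from idle straight
    # to the requested state, then consumes the payload bytes.
    link.write(bytes([opcode]) + payload)
    link.write(bytes([IDLE]))  # return to idle after the operation

link = FpgaLink()
send_opcode(link, LOAD_WAVEFORM, payload=bytes([0x12, 0x34]))
```

Keeping all sequencing logic on the PC side like this is what makes the FPGA state machine trivial: it only ever decodes one byte and moves to that state.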
Did you have any conflicts with Tao during the project?
Good question! Um, yes, the conflict I had with Tao was over how much time to spend on the project. I could not make this project a priority, and I said so, because it wasn't part of my research or my research assistantship. So some of my work was a little late or not completely polished.
One particular disagreement was over how much work I should do before the basic functionality of the chip was tested. It is tricky to test a QFN chip before making a complete PCB. Looking back, I was probably a little too skeptical about the chip working.
I have been working on a text-classification engine to run on Convey. Instead of starting from scratch, I began with a software package called the Bow Toolkit. When I tried to compile it, I was sad that it did not compile. After searching around the web and seeing that the code had not been changed in 10 years, I realized that GCC had changed enough (gotten stricter) to cause issues. However, I was able to fix all the "bugs" and compile the software so I could use it. There were about a dozen of these issues, most of them little things like functions declared inside functions and changes in how #defines are handled.
Rochester Chess Tournament
I recently played in a chess tournament in Rochester. The following was probably my most interesting game:
1. e4 c5 2. Nf3 d6 3. d4 cxd4 4. Nxd4 Nf6 5. Nc3 a6 6. f3 e5 7. Nb3 Be6 8. Nd5 Nxd5 9. exd5 Bf5 10. Bd3 Qf6 11. O-O Bxd3 12. Qxd3 Qg6 13. Qc4 Qf5 14. Na5 b5 15. Qc7 Qd7 16. Qc3 Be7 17. f4 Bf6 18. Qf3 O-O 19. c3 Qc7 20. Nb3 Nd7 21. Be3 exf4 22. Bxf4 Ne5 23. Qh3 Ng6 24. Bd2 Rfe8 25. Rxf6 gxf6 26. Nd4 Qc8 27. Nf5 Re5 28. Nh6+ Kg7 29. Qf3 Qc5+ 30. Kh1 Qxd5 31. Qf2 Rae8 32. Rf1 R8e6 33. Ng4 Re2 34. Bh6+ Kh8 35. Qg3 Qf5 36. Kg1 Qe4 37. h4 Re1
I lost, but I liked my 25th move: I sacrificed a rook for a bishop. The idea was to weaken the kingside and the pawn structure. I should have been able to get the pawn back, but I played badly afterward. After 26... Qc8 I should have played 27. Qf3. That would have been more interesting than Nf5, which lost a pawn.
Reduce Reuse Recycle: a Conference Publication Methodology
I just came back from the ASAP conference, where I presented my work “Reduce Reuse Recycle (R^3): a Design Methodology for Sparse Matrix Vector Multiplication on Reconfigurable Platforms.” A few presentations interested me: “A Practical Measure of FPGA Floating Point Acceleration of High Performance Computing”, “Reconfigurable Computing Middleware for Application Portability and Productivity”, and “FARHAD: a Fault-Tolerant Power-Aware Hybrid Adder for Add Intensive Applications”.
If you're interested in my work, you can find the paper here. I also have a poster up in Coover Hall at ISU. The other papers in the session are similar: “A Practical Measure of FPGA Floating Point Acceleration of High Performance Computing” and “Sparse Matrix-Vector Multiply on the Texas Instruments C6678 DSP”. The first one is from the same group that published the results we compare against: Jason Bakos's lab at the University of South Carolina.
The most interesting part of the project was designing the Intermediator. The Intermediator holds 32 intermediate y-vector values. This was needed to fix latency issues, but it turned out to be more useful than that: the design allows the values in every 16 rows to be traversed in any order. If you like hardware designs, you may want to read that section (III-C).
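A rough software analogue of that idea (a sketch of my own, not the paper's design; the window size and flush policy here are simplified assumptions) is a small buffer of partial y sums that lets products for nearby rows arrive in any order:

```python
# Toy analogue of the Intermediator: a small buffer of partial y sums,
# so products for rows inside the window can arrive out of order.
# The window size and flush policy are simplified assumptions, not the
# actual hardware design.
WINDOW = 32  # number of intermediate y values held at once

def accumulate(products, y):
    """products: iterable of (row, a_ij * x_j) pairs, possibly out of
    order; y: output list, accumulated in place."""
    partial = {}
    for row, prod in products:
        partial[row] = partial.get(row, 0.0) + prod
        if len(partial) > WINDOW:
            # Flush the lowest-numbered row to make room.
            oldest = min(partial)
            y[oldest] += partial.pop(oldest)
    for row, acc in partial.items():  # drain what is left
        y[row] += acc
    return y

y = accumulate([(0, 1.0), (2, 5.0), (0, 3.0), (1, 2.0)], [0.0] * 4)
```

The point of the buffer, in hardware as in this sketch, is that a product does not have to wait for the previous product on the same row to finish accumulating before the next row can be touched.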
I want to share the paper that is being presented at MEMOCODE 2012: "Shepard: a fast exact match short read aligner." The two major contributors to the paper are Chad Nelson and I. Shepard is a 100-base-pair DNA aligner, and it is the fastest exact-matching implementation to date. Of course, you need a $50,000 HC-1 machine to run it. It also needs 32 or 64 GB of RAM, because it uses a 24 GB hash table to look up matches. The implementation is also featured at Convey Computer.
Chad Nelson came up with the idea of using a hash table and the idea behind getting it to work, and he wrote the code to create the hash table. I wrote the alignment architecture and software, which mostly perform hash table lookups. This program took less than 1 second to process ~300 million reads.
What is interesting is that this platform is the fastest platform for hash tables with small entries (less than 64 bytes). Most RAM systems use cache lines that are about 64 bytes long. The HC-1 we have has specially designed SG-DIMMs for accessing any set of 8 bytes, so with a hash table of 8-byte entries this system will win. There are several cases where the system won't win; for example, if the hash table fits in the processor's on-chip cache (~4 MB), then RAM access is no longer the bottleneck.
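The 8-byte-entry point can be made concrete with a toy Python table (this is an illustrative sketch, not Shepard's code: the table size, hash function, and lack of collision handling are all simplifications). Each probe reads exactly one 8-byte word, which is the access granularity the SG-DIMMs are built for.

```python
# Illustrative fixed-size hash table with 8-byte entries: every probe
# touches exactly one 8-byte word, never a surrounding 64-byte cache
# line. Toy sizes and toy hash; collision handling is omitted.
import struct
import zlib

TABLE_SIZE = 1 << 10            # toy size; the real table was ~24 GB
table = bytearray(8 * TABLE_SIZE)

def slot(key: bytes) -> int:
    # Toy hash; the real design's hash function is not described here.
    return zlib.crc32(key) % TABLE_SIZE

def store(key: bytes, value: int):
    struct.pack_into("<Q", table, 8 * slot(key), value)

def lookup(key: bytes) -> int:
    # One 8-byte read per probe.
    (value,) = struct.unpack_from("<Q", table, 8 * slot(key))
    return value

store(b"ACGTACGT", 42)
```

On a cache-line-based memory system, each of these probes would still drag in 64 bytes; that wasted 56 bytes per probe is exactly the overhead the 8-byte SG-DIMM access avoids.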
Another reason we were able to be this fast is the fact we wrote the hardware (Verilog). Our algorithm requires getting data off chip and processing it. In CPU architectures this is a cache miss and then a dependency on that cache miss. So our code looked like:
Loop 1 billion times:
Dependency on cache miss
I believe this can be fixed to an extent by reorganizing the code so the cache miss and the dependency are farther apart, but I am no expert on how an Intel processor will execute instructions. A similar effect happens in other applications, for example sparse matrix vector multiplication: without any optimizations, processors have a hard time reaching even 10% of their capability, and even with optimization it is hard to reach 50%. This is the project I am working on now. Look out for it in the coming months!
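The dependent-load pattern above can be mimicked with pointer chasing through a shuffled permutation. This models only the access structure (each address comes from the previous load, so accesses cannot overlap); Python will not expose the actual cache stalls, and the names here are illustrative.

```python
# Dependent vs. independent memory access, structurally.
import random

random.seed(0)
N = 1 << 12
perm = list(range(N))
random.shuffle(perm)

# Build a single N-long cycle: following next_idx visits every index.
next_idx = [0] * N
for i in range(N - 1):
    next_idx[perm[i]] = perm[i + 1]
next_idx[perm[-1]] = perm[0]

def chase(start: int, steps: int) -> int:
    # Dependent: each load's address is the result of the previous
    # load, like a cache miss feeding the next iteration.
    i = start
    for _ in range(steps):
        i = next_idx[i]
    return i

def scan(steps: int) -> int:
    # Independent: all addresses are known up front, so hardware
    # (CPU prefetchers, or an FPGA pipeline) can overlap the misses.
    return sum(next_idx[i] for i in range(steps))
```

Writing the hardware ourselves meant the Verilog could issue many independent hash-table reads in flight at once, instead of serializing on each result the way `chase` does.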
My first year in grad school has gone well. This summer I am planning on getting some papers published. I am aiming to get my name on 4! We will see how that goes...
Edit: That proved too ambitious.
This week, I worked on the FlowMap algorithm by Cong and Ding from "Practical Problems in VLSI Physical Design Automation." I finished my applications to graduate school, and I had an interview with Vision Solutions.
Next week, I hope to learn the hMETIS algorithm, and hopefully continue to post on this blog weekly.
Master Habit is a recurring timer written in Java that runs on any phone supporting the MIDP 2.0 profile, like most Nokia and Symbian phones. I was unable to find an adequate program to do this, so I made my own. I use it to "build an Exoself," as described in Ron Hale-Evans' book "Mind Performance Hacks". The code can be downloaded here, and just the jar and jad files here.
Senior Design Project
My senior design project was a cross-section analyzer. A block was placed on a rotating platform. The platform had 8 sets of infrared LEDs and sensors, which recorded the width at 8 angles. This information was then stored and shown in a simple raw-data and shape display. My responsibility was to create that display.
[image of the PCB]
This project was completed when I was at the National University of Singapore. It is an 8051 processor written in VHDL. The project works in ModelSim and synthesizes to fit a Xilinx FPGA. The program is hardcoded in the "int_rom" module. I believe I used port 1 as input and port 2 as output in the program.
I wrote 90% of the code. Some of the code was given to me by my professor, Tay Teng Tiow, including the adder and divider; I am not sure where he got that code.
I use very few tricks in my code (no pipeline). I do not have an exhaustive test of all the instructions due to time, so I am sure a few instructions have errors. The architecture uses a bus to transport data; however, that was a pretty bad choice, because the synthesizer changed all of my tri-state buffers into muxes, so the timing is really bad. I do use some tricks to get the direct and indirect data from the RAM on the first cycle of an instruction (the fetch cycle).
My 8051 code
The course website is only viewable by students, but here is some information about it.
Computer Architecture course
Tay Teng Tiow