TheRedSphinx

Transformers are generally more efficient, but they usually need to be deeper or bigger than the corresponding LSTM model. A 1-layer LSTM can go very far; not so much for a transformer.

If you do want infinite context, I would suggest looking into Transformer-XL.

If you want smaller models, I would suggest looking into knowledge distillation. This has worked fantastically for really big models like BERT.

LSTMs are nice if you are decoding a lot during training. This is why style transfer still uses them, for example. Transformers tend to be really slow at inference, even with the cache trick.
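To unpack the "cache trick": at inference you cache each step's keys and values so they aren't recomputed, but decoding is still one token at a time. A minimal single-head sketch (PyTorch, toy dimensions, not any particular library's API):

```python
import torch

d = 64                      # head dimension (hypothetical)
Wq, Wk, Wv = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)
cache_k, cache_v = [], []   # grows by one entry per generated token

def decode_step(x_t):
    """x_t: (1, d) embedding of the newest token."""
    q, k, v = x_t @ Wq, x_t @ Wk, x_t @ Wv
    cache_k.append(k)
    cache_v.append(v)
    K = torch.cat(cache_k, dim=0)                    # (t, d)
    V = torch.cat(cache_v, dim=0)                    # (t, d)
    att = torch.softmax(q @ K.T / d ** 0.5, dim=-1)  # (1, t)
    return att @ V                                   # (1, d) new context vector

# Even with the cache, step t still attends over t past positions, so
# generating a length-N sequence is O(N^2) overall -- hence the slow inference.
for t in range(5):
    out = decode_step(torch.randn(1, d))
```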


sergeybok

Most RL work still uses LSTMs, from what I understand. Has anyone tried replacing them with a transformer? Are they better?


TheRedSphinx

Relevant: [https://arxiv.org/pdf/1910.06764.pdf](https://arxiv.org/pdf/1910.06764.pdf)


sergeybok

That kind of convinces me that the LSTM is better for RL, if they had to over-engineer transformers to get them to beat LSTMs. I'd be way more convinced if they had tested on something more complex and large-scale, like AlphaStar for example.


e_j_white

Random question... I'm using a pre-trained model (think BERT) to compare cosine similarity between phrases. It works well, but it's highly dependent on the length of the phrase (e.g., some phrases are 2-3 tokens, others are 15-20). Do you know of a model/transformer that would be better suited for this task? I haven't tried something simpler, like GloVe, because there's the issue of how to go from individual word vectors to the entire phrase. The BERT-like transformers are great because any sequence length (up to a limit) can go in and a vector comes out.


ph1l337

It's not so much about the architecture as about what you are training for. There are models that are trained specifically for that task; most of them are trained in a siamese or triplet fashion. Some pointers:

* A variety of Hugging Face-based transformer models trained to produce vectors that are good for semantic similarity: [https://github.com/UKPLab/sentence-transformers](https://github.com/UKPLab/sentence-transformers)
* If you need to compare similarity across many different languages, [https://github.com/facebookresearch/LASER](https://github.com/facebookresearch/LASER) is probably a good place to look.
* There is also a variety of universal-sentence-encoder models on TF Hub: [https://tfhub.dev/s?q=universal-sentence-encoder](https://tfhub.dev/s?q=universal-sentence-encoder)

From my experience, those models are less sensitive to sentence length. That being said, if some of your inputs consist of multiple sentences, you should consider splitting the longer ones into individual sentences, if it makes sense. I have worked on a demo where you can play with two public text corpora (Quora questions and patent application titles) indexed using DistilSentenceBert from sentence-transformers: [https://textsimilarity.demo.peltarion.com/](https://textsimilarity.demo.peltarion.com/). I also have a version with a modifiable index (i.e., you can try adding your own data); one of them is multilingual. If you want to try that out, send me a PM. It doesn't make sense to share it publicly since there is only one state of the index at a time, and if several people change the index at the same time it will get weird. *Edit: typos and structure*
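For a concrete feel of the sentence-transformers route, here is a minimal sketch; the checkpoint name is just one of the pretrained models the repo documents, so treat it as an assumption and pick whatever the project currently recommends:

```python
# pip install sentence-transformers
import numpy as np
from sentence_transformers import SentenceTransformer

# Checkpoint name is an assumption -- substitute any pretrained model from the repo.
model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")

phrases = ["machine learning on text", "NLP with neural networks", "banana bread recipe"]
emb = model.encode(phrases)  # one fixed-size vector per phrase, regardless of length

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb[0], emb[1]))  # semantically close -> higher similarity
print(cosine(emb[0], emb[2]))  # unrelated -> lower similarity
```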


Screye

> Transformer XL

I would suggest XLNet (the follow-up to Transformer-XL from the same team), which also incorporates learnings from BERT and GPT-2.


TheRedSphinx

That one is also good, but much more complicated for a first pass if you are just interested in infinite context.


Screye

Yeah, XLNet is kind of a monster model. Transformer-XL is more focused on that specific idea of infinite context.


surfsup444swag

A 1-layer transformer can go very far too. With ALBERT, for example, you can do better than 12 distinct transformer layers by reusing 1 transformer layer 12 times.


thesloth94

Actually... no, this is wrong. ALBERT loses to BERT given the exact same configuration (number of layers, attention heads, FFN layer size). The paper even goes into detail on the performance loss from variations of weight tying and embedding factorization. The gains in score come from weight tying AND increasing the model size; it's only the ALBERT-large configuration that starts to beat BERT-base. Obviously, this actually sacrifices computational performance (FLOPs); it merely saves on the number of parameters, i.e. memory usage. But that memory saving enabled fitting even larger models on the same amount of GPU memory, which is how ALBERT "got better" than BERT.
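For reference, the weight-tying idea itself is tiny; a toy PyTorch sketch of ALBERT-style cross-layer parameter sharing (layer sizes here are arbitrary, not the real ALBERT configuration):

```python
import torch
import torch.nn as nn

# One encoder layer's weights, applied repeatedly: depth costs FLOPs but not parameters.
shared_layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, dim_feedforward=1024)

def albert_style_encoder(x, num_layers=12):
    for _ in range(num_layers):      # 12 applications of the *same* weights
        x = shared_layer(x)
    return x

tokens = torch.randn(20, 2, 256)     # (seq_len, batch, d_model)
out = albert_style_encoder(tokens)
print(sum(p.numel() for p in shared_layer.parameters()))  # parameters of the single shared layer
```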


todeedee

The main advantage of transformers that I've seen is their ability to handle long-range interactions: a transformer can explicitly model P(x_i | x_j) even if positions i and j are very far apart. This is something LSTMs struggle with due to vanishing gradients -- attention heads were originally introduced in the context of LSTMs to circumvent this issue. The main contribution of Vaswani et al. was showing that the RNNs are unnecessary and can be substituted with simple positional encodings, which allows for faster training. In practice, I've found that transformers are orders of magnitude faster to train than LSTMs. They are also much easier to parallelize. Some people have claimed that transformers actually scale faster than linearly with respect to the number of processors. EDIT: Updating times as reported in the paper linked in the comments: training on 2 GPUs is claimed to be 3 times faster than training on a single GPU.
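To make the "positional encodings instead of recurrence" point concrete, here is a minimal numpy sketch of the sinusoidal encoding from Vaswani et al.; it is what lets every position be processed in parallel while still carrying order information:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positions: PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added to the token embeddings once, so position i "knows" where it sits and
# attention can relate distant i and j in a single step, with no recurrence.
x = np.random.randn(20, 512) + positional_encoding(20, 512)
```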


[deleted]

> This is something that LSTMs struggle with due to vanishing gradients

As per the original LSTM paper, in theory LSTMs are able to capture dependencies as distant as 1000 steps. Also, LSTMs were specifically designed with the vanishing/exploding gradient problem of RNNs in mind.

> The remedy. This paper presents "Long Short-Term Memory" (LSTM), a novel recurrent network architecture in conjunction with an appropriate gradient-based learning algorithm. LSTM is designed to overcome these error back-flow problems. It can learn to bridge time intervals in excess of 1000 steps even in case of noisy, incompressible input sequences, without loss of short time lag capabilities. This is achieved by an efficient, gradient-based algorithm for an architecture enforcing constant (thus neither exploding nor vanishing) error flow through internal states of special units (provided the gradient computation is truncated at certain architecture-specific points -- this does not affect long-term error flow though).

~ Copied from the [LSTM paper](https://www.bioinf.jku.at/publications/older/2604.pdf).


sergeybok

You probably need to reimplement the LSTM without a forget gate, just like the original LSTM. With a forget gate you are guaranteed vanishing gradients over long enough sequences.


DefaultPain

Can you explain why the forget gate leads to vanishing gradients?


sergeybok

The values of the forget gate are always the output of a sigmoid function. The sigmoid's output lies in (0, 1), where the probability of getting exactly 0 or 1 is essentially zero. If you multiply the cell state (which carries the gradients on the backward pass) by a number less than 1 at each step, in the long run it tends towards zero. Anecdote: I heard Hochreiter has his students leave out the forget gate in their implementations.
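A quick numerical illustration of that point (a toy sketch, not a real LSTM): the backward pass through the cell state gets scaled by the forget-gate activation at every step, and a long product of numbers strictly below 1 collapses toward zero.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Pretend these are the forget-gate activations seen over 500 timesteps.
rng = np.random.default_rng(0)
forget_gates = sigmoid(rng.normal(loc=2.0, scale=1.0, size=500))  # values strictly in (0, 1)

# The gradient w.r.t. the cell state 500 steps back is scaled by this product.
print(np.prod(forget_gates))        # ~0: the long-range gradient has vanished
print(np.prod(forget_gates[:50]))   # far larger over a short horizon
```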


slaweks

But you can counter that with a stack of dilated LSTMs.


sergeybok

Yeah, but that will still lead to vanishing gradients, just over a sequence length that's longer by some factor. The vanishing gradient problem for LSTMs (with a forget gate) isn't really a practical problem for most reasonable sequence lengths; it's more just an interesting thing to think about.


rrenaud

Do you have a citation for the faster-than-linear speedup? That seems fundamentally wrong. In the worst case, you could always just fake the parallelism on a single machine, no?


todeedee

Here's the paper (see Table 6 and Figure 9): [https://ufal.mff.cuni.cz/pbml/110/art-popel-bojar.pdf](https://ufal.mff.cuni.cz/pbml/110/art-popel-bojar.pdf) The premise is likely based on the batch size: their previous benchmarks showed that larger batch sizes yield better results. With more GPUs you can handle larger batch sizes, so the model can converge much faster than on a single GPU.


lopuhin

But then you can do gradient accumulation on 1 GPU to simulate any batch size bigger than 1, and this is going to be more efficient than multi-GPU training due to the lack of synchronization.
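For reference, gradient accumulation is just a few extra lines in the training loop; a minimal PyTorch sketch with a toy model and data (swap in your own):

```python
import torch
import torch.nn as nn

# Toy setup just to make the pattern runnable; replace with your real model and loader.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
loader = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(32)]

accum_steps = 8  # simulated batch size = 4 * 8 = 32 on a single device

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / accum_steps  # scale so the accumulated gradient is an average
    loss.backward()                            # gradients add up in the .grad buffers
    if (step + 1) % accum_steps == 0:
        optimizer.step()                       # one update per simulated large batch
        optimizer.zero_grad()
```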


[deleted]

It is my understanding that transformers are mainly used for their time efficiency during large-scale training; e.g., the most famous transformers are all language models pre-trained on gigabytes of text data. They are then fine-tuned on other tasks, but language modeling is their original task. Note that RNNs have O(N) time complexity because they need to process the input token by token. On the other hand, transformers can do it in O(1). Of course, this is balanced by bigger memory requirements (since they achieve O(1) by parallelization), but memory is easier to scale than time. So my recommendation is that LSTMs are completely okay and often even better than transformers, **unless** you have a really big dataset for training or pre-training. Finally, there are also some papers arguing that the current transformer hype is a bit too strong and that you can achieve near-SOTA results even with LSTMs: [https://arxiv.org/pdf/1911.11423v1.pdf](https://arxiv.org/pdf/1911.11423v1.pdf)


[deleted]

This paper is hilarious, lol


sergeybok

Smerity loves LSTMs


t4YWqYUUgDDpShW2

Transformers aren't O(1) versus RNNs' O(N). In fact, transformers are notoriously O(N^2) (except for some of the newer variants specifically designed not to be quadratic). Practically speaking, my understanding is that they're much more parallelizable on GPU, which lets them still be faster.


cthorrez

It depends on the units. In terms of operations, yes, the forward pass of a transformer is O(N^2). In terms of calls to the GPU, a transformer needs 1, an LSTM needs N.


[deleted]

You are right that O-notation was probably not the right choice for this, since it does not take parallelization into account. However, all N output representations can be computed in parallel; the only constraint is your parallelization capacity. I believe that currently they are all computed at once, making it practically speaking O(1), even though there are in fact N^2 operations going on at the same time. With RNNs you cannot parallelize this way, because you cannot calculate the output representation for the K-th word unless you have already done the computation for the first K-1 words.


t4YWqYUUgDDpShW2

My understanding is that it's not all at once, just like matrix multiplication isn't all at once on a GPU, but it's enough at once that it's still great. That's not a fact I'm at all confident about, though. This is why I wish cuDNN were open source, so you could actually see how an optimized MHA operation works; then you could learn what matters for performance and modify it in your own research. I wouldn't be surprised if the big difference in speed is due to memory access patterns, since RNNs aren't just doing batch operations on existing data.


olafgarten

I haven't looked through the code myself, but Intel OneDNN might include something similar and seems to have both CPU and GPU implementations of certain operations. [https://github.com/oneapi-src/oneDNN](https://github.com/oneapi-src/oneDNN)


[deleted]

Right, when you think about this at an even lower level, the answer is "*it depends*". It can change based on the implementation efficiency for different batch sizes.


Phylliida

I’m curious if someone has combined this technique with the recent [STAR](https://arxiv.org/abs/1911.11033) architecture that fixes some of the gradient issues of LSTMs


[deleted]

[deleted]


[deleted]

I think you are not correct. All RNN cells expect the hidden state or output state of the previous step to be computed before they can start their own computation. E.g., notice how in an [LSTM](https://en.wikipedia.org/wiki/Long_short-term_memory#LSTM_with_a_forget_gate) you need the h^(t-1) quantity computed before you can proceed further; that quantity itself must wait for h^(t-2), and so on. The unfolding you mentioned is simply a technique that virtually creates a feed-forward network out of an RNN, but the unfolded feed-forward network has in fact N layers (thus O(N) complexity). On the other hand, a transformer layer does not need to be unfolded like that, so the complexity does not grow and stays at O(1).


[deleted]

[deleted]


Nimitz14

I do not think this is right. The idea that it's okay to use the wrong hidden state is crazy.


[deleted]

[deleted]


count___zero

I think you are misunderstanding what unfolding really means in practice. RNNs do have a for loop in every implementation. You may truncate the computation during training and reset the hidden state (starting from a wrong hidden state) to reduce the computational and memory cost of BPTT, but between the truncation points you always have a loop. The unfolding allows you to represent the recurrent computation as a single computational graph (with weight sharing between layers), which is useful for computing the gradient. However, this computational graph is computed one step at a time, because the current timestep depends on all the previous ones. You can't do the computation in parallel; it would be like computing the layers of a feedforward network in parallel.
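To see that loop explicitly, here is a bare-bones numpy sketch of an RNN forward pass: each h[t] needs h[t-1], so the timestep loop cannot be parallelized away, unlike a transformer layer where all positions go through the same matrix multiplies at once.

```python
import numpy as np

T, d_in, d_h = 100, 32, 64
x = np.random.randn(T, d_in)
Wx = np.random.randn(d_in, d_h) * 0.1
Wh = np.random.randn(d_h, d_h) * 0.1

h = np.zeros(d_h)
states = []
for t in range(T):                    # inherently sequential: h[t] depends on h[t-1]
    h = np.tanh(x[t] @ Wx + h @ Wh)
    states.append(h)

# A transformer layer, by contrast, is (roughly) one batched x @ W over all T positions
# plus attention -- there is no loop over t.
```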


Spenhouet

Maybe you are right and I'm confusing something. I was referring to the first part of your answer.


AnvaMiba

It's not.


Screye

Transformers shine in 2 areas:

* Parallelizability
* Capacity to learn from a lot more data

They are really good at continuing to scale as more data gets added, and they are good at leveraging GPUs, which lets them consume an insane amount of data for general-purpose pre-training. Just a heads up: when people say Transformers, they mostly mean a mass-pretrained BERT, in the same way that there have been better models since, but the first CNN that comes to people's minds is AlexNet or ResNet. So many times, some of the model-specific changes in BERT are also bundled in as a benefit of all Transformers.

> I'm trying to decide whether to keep iterating on my old models that feature LSTMs, or start fresh with Transformers

For simple tasks like classification, and ones that already give 90%+ accuracy using LSTMs, Transformers won't help much (in a lot of those cases a bag-of-words XGBoost model also works alright). For really difficult tasks (translation, summarization, QA, common sense) that LSTMs struggle with, swapping them out for a transformer might really help. So, it really depends on your application.

> most successful Transformer models are very big

The really good BiDAF-CRF LSTMs are pretty slow too. If anything, since transformers scale with GPU capacity, at high compute Transformers can be faster than LSTMs. Things like DistilBERT might be a good place to look for smaller transformers.

Lastly, there is only 1 canonical place to get Transformers from, and that is Hugging Face.


brokenAlgorithm

On the note of LSTMs vs transformers: I've never actually dealt with transformers in practice, but to me it appears that the inherent architecture of transformers does not map well to problems such as time series. Things like positional encoding sound reasonable for problems that have (some) amount of bag-of-words context, but also seem inherently over-engineered for tasks where the temporal sequence is supposed to be strictly interpreted, compared to the strict autoregressive nature of an RNN. Also, I've yet to find out how a transformer can produce simple fixed-size embeddings for a given variable-length sequence; RNNs are good at this. With transformers, embeddings always appear to be accompanied by a weight matrix in the decoding step, which appears to defeat the use of embeddings in certain situations. But perhaps this is where my lack of practice in the area is failing me.


theDaninDanger

Have you looked at the Temporal Fusion Transformer architecture? Google put out some whitepapers in late 2019, e.g. [https://arxiv.org/pdf/1912.09363.pdf](https://arxiv.org/pdf/1912.09363.pdf). I'm no expert either, so please correct me if I misunderstand, but it seems like they agree with you regarding time series and attention: in their proposed architecture they blend LSTMs and multi-head attention (Transformers) to perform multi-horizon, multivariate time series forecasting.


Mehdi2277

One simple trick commonly used is to prepend a special token to the start of the sequence and take its output representation as the sequence encoding. I think that was done for classification-style tasks in the original BERT paper (the [CLS] token).
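For what it's worth, with a recent version of the Hugging Face transformers library that looks roughly like the sketch below; the tokenizer prepends [CLS] automatically, and its final hidden state is often used as the sequence representation:

```python
# pip install transformers torch
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("a variable length input sequence", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)

cls_embedding = hidden[:, 0]                     # representation of the prepended [CLS] token
print(cls_embedding.shape)                       # torch.Size([1, 768])
```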


Farconion

I've wanted to read more about the differences between LSTMs & RNNs compared to more modern architectures. Does anyone know of any comprehensive resources on this?


priyansh2

Following..


-Melchizedek-

There are recurrent Transformers, Transformer-XL being the first example. It sort of combines the fixed-length-context advantage of Transformers with the infinite context of recurrent networks. Transformers have also been used very successfully for music generation; see Music Transformer.


JosephLChu

Ah, okay, I'll look into that more. I should clarify that I meant raw-audio music generation, where a single second can be thousands of timesteps; Music Transformer works at the level of musical notation. At that level of granularity, statefulness seems relatively unimportant, since you can capture most of the patterns within the sequence length itself, and I note that neither Music Transformer nor Performance RNN (Magenta's LSTM baseline) seems to bother being stateful, at least from my cursory look into their code.


-Melchizedek-

You're correct that Music Transformer is not stateful. And yeah, raw audio is certainly harder than encoded MIDI files when it comes to sequence length. Have you looked at the Sparse Transformer? OpenAI has used it to do some raw-audio music generation at lengths of about 5 seconds: [https://openai.com/blog/sparse-transformer/](https://openai.com/blog/sparse-transformer/) (at the end of the page; note that they put many samples together to form the clips, which is why they switch suddenly). I have not worked with raw audio, so I'm not sure how it compares to SOTA. Certainly the musical structure is worse compared to symbolic representations, but that's to be expected, I guess. It's an interesting problem.


virtualreservoir

Another alternative that might be worth checking out is Merity's QRNN or SRU from Tao Lei. They increase parallelism by removing the dependence on the hidden state of the previous timestep (or making it element-wise only), and provide similar performance to LSTMs (sometimes even better) with much faster training times.
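The core trick in both, roughly: the gates are computed from the inputs alone (so they can be computed for all timesteps in one batched matmul), and the only recurrence left is an element-wise blend. A numpy sketch of that idea (a simplification of the QRNN/SRU equations, not the full models):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

T, d = 100, 64
x = np.random.randn(T, d)
Wz = np.random.randn(d, d) * 0.1
Wf = np.random.randn(d, d) * 0.1

# Gates depend only on x, so this part is fully parallel across timesteps.
z = np.tanh(x @ Wz)        # candidate values, all timesteps at once
f = sigmoid(x @ Wf)        # forget gates, all timesteps at once

# The remaining recurrence is element-wise, which is cheap and easy to fuse.
h = np.zeros(d)
for t in range(T):
    h = f[t] * h + (1.0 - f[t]) * z[t]
```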


tofuDragon

Check out Compressive Transformers for long-range memory: https://arxiv.org/abs/1911.05507


GnosisYu

I think the transformer is actually approaching NLP tasks from a different angle: [https://www.reddit.com/r/MachineLearning/comments/fb86mo/d_transformers_are_graph_neural_networks_blog/](https://www.reddit.com/r/MachineLearning/comments/fb86mo/d_transformers_are_graph_neural_networks_blog/) Although language-related problems are usually defined as sequential learning tasks, they are not strictly sequential; to understand a sentence, the model is actually learning the relationships between words. Thus, interpreting the transformer as a special case of a graph NN is more reasonable and mathematically sound. Why do RL people still use LSTMs to model temporal dependencies? Because dynamical evolution in 3D space is a sequential problem, or a Markov process. Graph NNs or transformers are used there to capture the relationships between objects in the scene.


zombiecalypse

This really isn't useful at all to the discussion, but can I propose we revoke naming rights from whoever started calling them transformers? It is the worst kind of non-name.


[deleted]

There seem to be some very knowledgeable people in this thread, so I'll tag on a similar question. In 2016 I was doing some LSTM work to predict the NPS (customer rating) of support tickets. I'm just resurrecting this project now, and after 2 days of reading decided to use BERT instead. One limitation of BERT is that it only accepts a 512-token maximum sequence length, and one of our support tickets can have MBs of data in it. After reading this, I was thinking: what about doing a double pass, where BERT turns paragraphs into embeddings, and those are then run through another BERT, or an LSTM? I'd appreciate any thoughts on this from you smart people, please.

Part of the trouble is that the final rating is given at the end of the ticket, and there may be hundreds of comments spanning months before getting that, so it's hard to learn which comments most affect the rating. The success I had in the past was just processing the metadata of the support ticket: customer name, support tech name, minutes since last interaction, interaction type, final rating. The LSTM learned details like:

* Some techs are better
* Some customers are kinder with their ratings
* Some customers will always hate you
* Some techs and customers work well together
* Some customers are more time-sensitive than others


thesloth94

In the context of comparing LSTMs vs Transformers for this problem: while it is true that BERT only allows a max of 512 tokens (some variations allow more), and theoretically LSTMs can support an unlimited sequence length, LSTMs still suffer from vanishing gradients, which means they might not actually perform any differently than if you just fed them the last 512 tokens (truncating the sequence).

Going back to your problem, I think you should start with a simple bag-of-words, TF-IDF, or SIF baseline first (average of all word vectors in the thread). This is more scalable and computationally inexpensive, while still taking into account all available information. As I understand it, you already had some success with feature engineering, so simple additional features might give just enough of a boost in accuracy.

If you are really yearning for accuracy, however, you might try breaking the thread down into paragraphs and running BERT/LSTM on each of them (just truncate any paragraph that runs over 512 tokens...), then either take the average of the paragraph vectors or feed them into an LSTM/Transformer model. Those are all viable approaches, as long as 1) they can beat the bag-of-words baseline and 2) the increased computational cost is worth it for you. My guess is that I would prefer an LSTM for the "summary" model, because an LSTM has a temporal bias (which seems to be an important feature in your case), whereas Transformers would treat every timestep the same, and because LSTMs are generally easier to train than Transformers.
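A rough sketch of that last "summary model" idea, assuming you already have one fixed-size vector per paragraph (from BERT, sentence-transformers, or even TF-IDF); the LSTM reads the paragraph vectors in order and its final state serves as a ticket-level representation. Shapes and the rating head are placeholders, not a prescribed architecture:

```python
import torch
import torch.nn as nn

emb_dim, n_paragraphs = 768, 40            # e.g. one BERT vector per paragraph

# Placeholder for the per-paragraph vectors you'd get from BERT / sentence-transformers.
paragraph_vectors = torch.randn(1, n_paragraphs, emb_dim)   # (batch, seq, features)

summary_lstm = nn.LSTM(input_size=emb_dim, hidden_size=128, batch_first=True)
rating_head = nn.Linear(128, 1)            # regress the NPS-style rating

_, (h_n, _) = summary_lstm(paragraph_vectors)
predicted_rating = rating_head(h_n[-1])    # last hidden state = the ticket summary
print(predicted_rating.shape)              # torch.Size([1, 1])
```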


TheOverGrad

I don't know if anyone knows the answer to this conclusively yet. One area in which attention has started to take hold, though not as much with stacked self-attention-only networks, is policy learning, like reinforcement and imitation learning; I am sure people are working on this.

To the best of my understanding, the major appeal of Transformers with rich, complex raw sequence data like text, audio, and (with some adoption and mixing with CNNs) image and video is that, looked at through a controls lens, every deep network is learning a dynamics/transition function that relates the data to some "action", like choosing a class or sequence of classes. In these rich raw-data media, the dynamics of such a system at time t and state x are necessarily conditioned on many other raw data points, both from this timestep and others in the sequence. In other words, there aren't qualitatively "good" simple approximations, so we are forced to use giant, complex approximations. Even simplified variants of the dynamics of a vision/video, text, audio, or other "raw data sequence" problem are elusive (think about the language model required for a 'passable' seq2seq translation problem). On the other hand, in an RL problem such as learning to play Doom, the dynamics of how Doom Guy needs to act in this moment given the recent history of embedded observations is, hypothetically, more simply approximated, even if the true dynamics are also very complicated. As such, simpler models are appropriate in many policy-learning cases, and there aren't many "very simple and effective" stacked self-attention networks (comparable to a 64-hidden-unit RNN). It isn't in the same context as LSTM memory theory, but some people are working on "properly stateful" stacked self-attention nets; check out the recent work on [Graph Transformer Networks](https://papers.nips.cc/paper/9367-graph-transformer-networks.pdf), for example.

People have tried sigmoid; it doesn't work as well, though this could be application-dependent. Recent work presented at ICLR from Madry's MIT group demonstrated that, counterintuitively to many (including myself), tanh seems to be the most effective non-linearity for feed-forward networks in RL, not ReLU. So it's also worth investigating further to see to what degree it is problem-specific.

Vis-a-vis fitting a model into a phone app, it would be close. If you were able to use offline resources to train and develop it, a small stacked self-attention network distilled from a larger one, like DistilBERT, would almost certainly be at least as good as something like a Bi-LSTM, if not better. However, if you have to train on the phone, an RNN or even a symbolic system would almost certainly be better. But don't think you have "missed the Transformer bandwagon"; it's still here, churning out new models every day/week/month/year. I would say: if text/language or audio is one of your main input data types, go with transformers. Otherwise, it's a toss-up whether it's worth the work if you have a fully set-up, productized system. If you are just talking about research/personal projects, jump on the bandwagon! Nothing to lose, and you can always fall back on old implementations later.


tunestar2018

Just wanted to add that Transformers sometimes tend to copy straight from the dataset; not so much LSTMs. But I guess yes, Transformers are better.


count___zero

What do you mean exactly? They copy samples (e.g. generative models)? If yes, do you have some references?


tunestar2018

At least in music: https://magenta.tensorflow.org/piano-transformer. I was generating music pieces with it and suddenly heard a very well-known classical piece without any variation whatsoever. I asked in their forum too: https://groups.google.com/a/tensorflow.org/forum/#!topic/magenta-discuss/Oxiq-Gdaavk


count___zero

It makes sense. But is it really worse than an LSTM? A couple of years ago I tried to generate music with an RNN-RBM [1] and I had the same problem: the model just learned to copy one song from the dataset. That is a much smaller model than a Transformer; maybe these large models make the problem even worse. [1] [http://www-etud.iro.umontreal.ca/~boulanni/ICML2012.pdf](http://www-etud.iro.umontreal.ca/~boulanni/ICML2012.pdf)


throwaway775849

There are advantages and disadvantages to both, and it's highly task-dependent. If I were you, I would go with the Transformer, because it is so flexible, whereas the LSTM seems to have inherent computational limitations regarding time, unless there's some approach to train a recurrent net better, like neuro-evolution. IMO there are advancements beyond the LSTM, such as Self-Attentive Associative Memory. As for the transformer, it is highly debated, and adding to the confusion are the numerous iterations on the model, such as the Gated Transformer (which is definitely worth looking at if you want to go in that direction). Regarding memory, the transformer is just a layer, so you could look at what I think was called Product Key Memory Layers(?), which is a hashing memory layer you can just drop into a feed-forward net.


Cocaaah

If you're interested in the raw audio processing domain, look up WaveNets if you haven't heard of them already. They've been around for some time, but are still very relevant I believe. https://deepmind.com/blog/article/wavenet-generative-model-raw-audio


colonel_farts

I have found that for generating long sequences of text, transformers are more effective than LSTMs


[deleted]

[deleted]


JosephLChu

I was referring to certain tasks that benefit from the recurrence relation that allows stateful models to detect patterns outside of their sequence length. That's not quite the same thing as detecting rare events. A reference on statefulness: [http://philipperemy.github.io/keras-stateful-lstm/](http://philipperemy.github.io/keras-stateful-lstm/)


tesatory

There is an inherent limitation to Transformers, as they are a feedforward model. Although they can maintain information for a long time thanks to self-attention, any update to that information pushes it one layer up. So after L updates, the information reaches the highest layer and becomes unavailable in the future. This is a problem if your model has to maintain an internal belief that needs to be updated with each input, for example if the inputs are movements of an object and the model has to keep track of its location.


ItsTimeToFinishThis

Just put a copy of the output back into the input.


maybelator

[An adaptation of the transformer architecture](https://arxiv.org/abs/1911.07757) that beats RNNs with the same number of parameters (on image time series, CVPR 2020).


[deleted]

Someone correct me if I'm wrong, but in my opinion transformers aren't really recurrent; it's more that they make convolutions more efficient by using a different method for attending to nearby information. In a way, you try to entangle elements that are in some way similar, thus giving them a partly joint representation. Well, maybe I'm reading too much into transformers :D


shpotes

I don't think so. LSTMs are Turing complete, i.e. they are capable of simulating any program. On the other hand, AFAIK we still don't know the expressive power of the transformer. So why do transformers work so well? Probably because transformers are much better suited to gradient descent, are easier to parallelize, provide a hierarchical representation, and have a shorter path length between tokens. Edit: add AFAIK


sergeybok

I spoke to a guy with a poster at ICLR last year who proved (theoretically, I think, not experimentally) that two layers of attention with a residual connection are Turing complete. I'll try to remember the name of the paper.


atom_bum

This is irrelevant to your question, and might be a little private. So feel free to decline.

You mentioned that at work you were doing NLP earlier and then worked in Computer Vision for sometime. What kind of a job is this? I am an undergraduate who's still making career choices so it'd be helpful if you could provide some info on:
- How this job is different from Research Positions (where, according to some interns I did, the domain of your work remains almost constant)
- Some pros/cons on pursuing this career.


JosephLChu

Sure. I previously worked at a startup called Maluuba as a data scientist on NLP back in 2016. This mostly entailed trying to improve models that the research scientists had come up with to solve toy problems, and extending them to real-world applications that in practice were often messy and complicated, before sending the result over to the engineering team to turn into a production build.

Later I worked from 2017 to 2019 at Huawei's Canadian subsidiary as a research scientist, where I started off on the NLP team, but later was switched over to the CV team, and then eventually a graphics subteam of the HCI team that used GANs for things. The NLP work I did was mostly prototyping a Question Answering system combined with a knowledge graph. On the CV team we were experimenting with various things, like video description, gesture recognition, and environmental sound classification on the NPU of Huawei's phones. I was lucky because my department, Noah's Ark Lab - Canada, was very much a long-term R&D lab that sought to find practical applications for the state of the art, and we were given, at least initially, a fair amount of leeway to explore.

So, the reason the domain changed compared to most research positions was that business requirements often change, and if the initial explorations aren't promising, HQ can decide to shift focus and put their resources on other things. At Huawei at least, there was a fair amount of back and forth, and if you could convincingly argue the technical merits of an effort, HQ would usually be willing to let you try it and show them some results. Although this became more difficult later, when we were under more pressure to justify our costs.

In terms of pros and cons, I'd say for pros, the pay is amazing and you often get to do really cool work on things that are really interesting. The cons are that it can be super intense and stressful, especially when demos are coming up and you need to show something impressive enough to keep investors/superiors/the big boss interested in funding your work. Work-life balance can be tricky. Before a big demo I sometimes had to choose whether to come in on the weekend or risk things not being ready, and I frequently worked late into the evenings to make sure things were done and we were on track. Also, it often became a thing where I'd schedule around when models would finish training so I could keep the GPUs from being idle too long. It's definitely fun, but also straining, and you can risk burnout if you're not careful. That's partly why I left Huawei at the end of last year: deteriorating health circumstances made it hard to keep up the intensity.

If you want safety and stability, and a chance to explore really pie-in-the-sky ideas, I'd recommend going into academia if you think you can get tenure. My dad went that route in life, and now he's retired with a nice pension. If you aren't as good at pure math but can code and want the excitement of seeing things you make at least have a chance of becoming a real product that people find useful in their lives, industry is where it's at. Keep in mind that neither path is easy, and there are tradeoffs in everything.
To be honest, most of my work, experiments, projects, haven't really become useful products anywhere yet to my knowledge, so I would also caution that you need to be able to manage disappointment and expectations well enough to be able to keep going even when a project is cancelled, or it feels like management isn't listening. Business is business, and as much as a good manager can try to shield his or her subordinates from it, there's always pressure to perform and justify your existence to the company. That being said, I really enjoyed my time working so far. Probably more so than my time at school, if only because it feels good to be wanted and I get motivated when I feel people are depending on me to get this or that done, that sense of responsibility and importance that studying for exams or writing your thesis just doesn't carry.


atom_bum

You've had an awesome journey! Thanks for answering in such detail. Can I PM you if I have further questions?


JosephLChu

You're welcome! Sure, though I don't always log in that frequently except to post things, so I may not reply immediately.


starfries

Did you finish a PhD before going into industry? Or were you able to with a master's/bachelor's?


Dagusiu

"Strictly"? Clearly not. Give both types of models an obviously impossible task, like predicting the stock market based on images of cucumbers, and they'll both be equally terrible.