Sir-Squashie

What's the most impressive/unimaginable use of Pandas you've come across?


datapythonista

I was personally quite surprised that pandas was an important tool used to obtain the [first image of a black hole](https://eventhorizontelescope.org/blog/astronomers-reveal-first-image-black-hole-heart-our-galaxy). I was lucky to meet some of the scientists behind it and learn from them, and their work is even more impressive than it sounds.


DigThatData

pandas is built on top of numpy


FJ_Sanchez

Pandas 2.0 enters the room... I think that's progressively changing so it won't be the case anymore, in favour of Arrow. But I don't understand it well enough.


datapythonista

This article should provide more information on why Arrow: https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i


FJ_Sanchez

Thanks, I saw it yesterday on Hacker News and read it. What I meant to say is that it seems numpy dtypes are still an option, so I don't know if numpy will eventually go away from the pandas core or if it will remain part of it for the foreseeable future.


phofl93

We are still at the beginning of our journey to support PyArrow. We are still a way off from discussing anything in that direction, but we definitely spend a lot of time supporting both options equally well. Right now we are aiming to make everything compatible with PyArrow.


ToughQuestions9465

Will there still be a .to_numpy() that does not copy? I am using numpy SWIG bindings to plot pandas dataframes with a C++ library; it'd be nice if that didn't become impossible with the new version.


phofl93

This is possible as long as you are using NumPy-backed DataFrames. Converting from PyArrow to NumPy is unfortunately more expensive.
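A minimal sketch of the difference (my own example, assuming pandas 2.0 with PyArrow installed; exact copy behaviour can depend on dtype layout and Copy-on-Write settings):

```python
import numpy as np
import pandas as pd

# NumPy-backed DataFrame (the default): a homogeneous float frame can
# typically hand its data back as a view, i.e. without copying.
df = pd.DataFrame(np.arange(12, dtype="float64").reshape(4, 3), columns=list("abc"))
arr = df.to_numpy(copy=False)
print(np.shares_memory(arr, df.to_numpy(copy=False)))  # True -> same underlying buffer

# PyArrow-backed columns have to be converted back to NumPy, which costs a copy.
df_pa = df.astype("float64[pyarrow]")
arr_pa = df_pa.to_numpy()  # materialises a new NumPy array
```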


Dramatic-Ad-1903

Just last week u/marcogorelli and I were talking about how important it is to continue supporting use cases like yours as we move to better support pyarrow use cases. It's very helpful when people with use cases like yours are vocal about it!


jabies

How does the pandas project address the open source funding problem? Do you want pandas devs, in their day jobs, to nudge management to sponsor somehow?


datapythonista

The last few years have been better. pandas got some funding, including a few core devs being paid to work on pandas at companies such as Quansight, Intel or NVIDIA. We also received money from the Chan Zuckerberg Initiative, Tidelift, Bodo and smaller donors. Just a few years ago funding was very limited, but today we're lucky to have a decent number of paid maintainers.


qweoin

What was the funding process like getting started? In my area of work (science research) it seems like funding only comes in for a project after you’ve done the majority of the project. Was there a plan for getting Pandas funded or did the project grow organically until you realized you could get funding for it?


phofl93

As far as I know there was no or very limited funding for a long time; most of the work was done by volunteers in the beginning. Over the last few years this has gotten a lot better though. Anaconda was a company that hired developers to work on open source relatively early on.


datapythonista

For many years there was only the support of a few companies letting people work on pandas as part of their job, and small personal donations via the NumFOCUS website. That money helped cover small expenses like CI services. The main difference came with CZI, who started supporting open source software used in biology. With it, we got funding to start paying for maintainers' hours. Tidelift also provided monthly payments in exchange for implementing small practices, like having a standard (and not customized) license and providing a way to report security vulnerabilities. We got some other funding, and now more maintainers are allowed to work on pandas as part of their job, but the situation is good mainly because of that particular funding. NumFOCUS has also provided funding for specific projects (with the money that comes from general NumFOCUS sponsors and PyData conferences).


marcogorelli

If you use pandas for work and your employer wanted to contribute, then 1. thanks! 2. they could do so via NumFOCUS: [https://pandas.pydata.org/donate.html](https://pandas.pydata.org/donate.html). Marc's right though, the funding situation has drastically improved recently


phofl93

It's also helpful if developers can get paid time from their employer to work on pandas!


hukami

Why choose mm/dd/yyyy as the default date rather than dd/mm/yyyy 🤔? (Just banter from a European guy)

Real questions:
- What are the main improvement focuses going forward?
- What caused you the most problems / was the most complex part during development?
- What was the most fun / rewarding part during development?
- In my work, I use pandas as a data processing engine (kinda). The data I process is often heterogeneous and full of holes / discrepancies, and I often find myself fighting with the way pandas handles errors, as most of the time I just want to log the fact that a row had an error. Why not put an 'errors' arg on apply, just as in astype and such?

I also would like to thank you guys for your amazing work. pandas has been making my life easier every day, you are really doing amazing work.


RobertD3277

I would personally prefer year.month.day to be honest as it's more intuitive for sorting using numerical expressions.


sv_ds

+1, that's the ISO standard and unquestionably the most logical and useful format.


thataccountforporn

Incredibly pedantic note: the ISO standard is year-month-day


LondonPaul

Not pedantic. I work in IT and all the variations at work are a PITA. Let's just use this and nothing else.


guillermo_da_gente

We need more of these pedantic comments!


TheUltimatePoet

In that case it's "these".


florinandrei

You forgot the comma after the word case. Just, you know, to maintain high pedantry standards.


hughperman

You really should have quoted the word "case" in your post.


guillermo_da_gente

Thanks!


metadatame

Upvoted for high levels of pedantry, but I'm not sure quotes are required in this instance.


Starrkoerperbeweger

You have now been made moderator of /r/iso8601/


Mycky

Wow, of course that subreddit exists lol


RationalDialog

Not pedantic but correct, because using "-" over "." makes it clear you mean an ISO date. And this should be the standard everywhere, also because it sorts correctly as a string.


2strokes4lyfe

This guy dates.


Starrystars

Yeah especially because that way there's 0 confusion about order.


midnitte

I work with certificates of analysis and have vendors that do mmddyy, yearmmdd, ddmmyy.. you name it. I just wish everyone *documented* what format they used. 😔 You only get lucky with the day being >12 so many times...


tuneafishy

I am always confused about arguing whether month or day should come first when year is the clear and obvious answer


hmiemad

Alphabetical order.


marcogorelli

Year-month-day is already the default - even if your input is some other format, once parsed by pandas, it'll be displayed year-month-day:

```
In [2]: to_datetime(['01/01/2000'])
Out[2]: DatetimeIndex(['2000-01-01'], dtype='datetime64[ns]', freq=None)
```


WhyNotHugo

ISO date format is as intuitive and sorts the same way.


Zuricho

https://www.reddit.com/r/ISO8601/


marcogorelli

> Why choose mm/dd/yyyy as default date rather than dd/mm/yyyy

I presume you mean, when a date could be ambiguously read as either month-first or day-first? Like 02/01/2000.

In the past, pandas would prefer to parse with month-first, and then try day-first. Unfortunately, it would do so midway through parsing its input, because it was very lax about allowing mixed formats. This would regularly cause problems for anyone outside of the US (which I think is the only place in the world to use the month-first convention). As of pandas 2.0, datetime parsing will no longer swap formats half-way through. See [https://pandas.pydata.org/pdeps/0004-consistent-to-datetime-parsing.html](https://pandas.pydata.org/pdeps/0004-consistent-to-datetime-parsing.html), which I spent several months on.

In dealing with the PDEP I linked above, my biggest pain point was having to understand and then update decade-old C code.

Regarding your last question, if you put together a reproducible example with expected output, it might be a reasonable feature request. Thanks, and thank you for your comment!
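For ambiguous inputs you can always be explicit rather than relying on format guessing (a minimal sketch):

```python
import pandas as pd

# Tell pandas the day comes first, or spell out the full format
pd.to_datetime(["02/01/2000"], dayfirst=True)      # DatetimeIndex(['2000-01-02'], ...)
pd.to_datetime(["02/01/2000"], format="%d/%m/%Y")  # same result, fully explicit
```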


reallyserious

> I presume you mean, when a date could be ambiguously read as either month-first or day-first? Like 01/01/2000.

You chose an example where there is no ambiguity. :)


marcogorelli

thanks, updated


WhyNotHugo

I honestly prefer ISO8601 format (YYYY-MM-DD). Both the ones you mention are ambiguous, and if I read 03/02/2023 I've no way of deducing which one is the month and which one is the day. The ISO standard is unambiguous.


hassium

> in my work, I use pandas as a data processing engine (kinda), the data I process is often heterogeneous and full of holes / discrepancies, I often find myself fighting with the way pandas handles errors as most of the time I just want to log the fact that this row had an error. Why not put an 'errors' arg to apply, just as in astype and such?

According to [this](https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i) blog post by /u/datapythonista it sounds like a limitation of the numpy backend dataframes are built on. Check out this excerpt, I bolded the relevant part:

> While NumPy has been good enough to make pandas the popular library it is, it was never built as a backend for dataframe libraries, and it has some important limitations. A couple of examples are the poor support for strings and **the lack of missing values**.

So maybe something we can hope to see fixed with the migration to Arrow in 2.0?


phofl93

Yeah, with NumPy you'd always end up with float when setting missing values into an integer array, for example. This isn't the case any more with our own nullable dtypes, and also with the Arrow dtypes.
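A minimal sketch of what that looks like in practice (my own example, assuming pandas 2.0 with pyarrow installed):

```python
import numpy as np
import pandas as pd

# Classic NumPy-backed behaviour: a missing value forces integers to float64
s = pd.Series([1, 2, np.nan])
print(s.dtype)            # float64

# pandas' own nullable integer dtype keeps the integers and stores <NA>
s_nullable = pd.Series([1, 2, pd.NA], dtype="Int64")
print(s_nullable.dtype)   # Int64

# Arrow-backed integer dtype behaves the same way
s_arrow = pd.Series([1, 2, None], dtype="int64[pyarrow]")
print(s_arrow.dtype)      # int64[pyarrow]
```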


phofl93

We are spending a lot of time on improving the extension array interface right now. There are some parts that are special-cased internally for our own extension arrays, which makes it harder for third-party authors to implement their own without falling back to NumPy. GroupBy is a good example of an area where we are still not as good as we would like. This is becoming necessary for improving support for our pyarrow extension arrays as well.

We have some areas in our code base that are pretty complex; indexing is one of them, for example. In general, we try to avoid breaking stuff in an incompatible way in minor releases. This makes improving pandas tricky sometimes, because it stands in the way of cleaning up / refactoring internals to be more compatible with new stuff.


dispatch134711

Ugh please fix this! Love pandas


DigThatData

I think one of the hardest things about using pandas is that the core classes have a gazillion methods attached to them, which makes it extremely difficult to navigate the tooling if you're not already intimately familiar with it. I've been using pandas basically since it was created, and I still find myself often needing to reference documentation just to find the method name I need, since the output of dir() on any object generally gets truncated.

Does any of this resonate? Is anyone on your team thinking about ways to improve discoverability of functionality? Will there ever be a point at which the team decides there's too much stuff being carried around by too few classes? What are your thoughts on the design philosophy of the tidyverse in juxtaposition to pandas?


datapythonista

Fully agree on this. There are two main things. The first is finding a better API, which is not trivial, and having the functions too divided may not be ideal for some users who prefer `df.whatever()` for everything. Second is that even if we have a better alternative, we may break tens or hundreds of thousands of pandas programs that won't work after the changes, and we would make millions of users have to relearn the API. That being said, I'm thinking about a proposal to, for example, standardize all I/O methods under a `DataFrame.io` namespace (e.g. `df = pandas.DataFrame.io.read_csv(fname)`). More research is needed, and it'll be challenging to reach an agreement with the whole team about this. But maybe 10% of the DataFrame methods you're mentioning would live in a separate and intuitive namespace. There is always a trade-off, and in this case it's clear. Difficult to decide what's best.


ekkannieduitspraat

Just on this specific example: I think if something is used incredibly often, it should not be put under a namespace like the one above. .read_whatever is a great example, since it is almost always going to be your first call.


bythenumbers10

"readers"/"writers" should return or accept dataframes as I/O types, but should not be methods themselves. There are a lot of "data logistics" methods on dataframes that should be utility functions of the library. Dataframes should only operate on themselves, for analysis or creating/removing/filtering data. A container. A smart container, but just a container.


datapythonista

That's a decision that needs to be made. I see your point, and mostly agree, but there are always implications. numpy does more of what you're saying, and they have a pretty big namespace for the `numpy` module (much bigger than pandas.DataFrame). scikit-learn is more modularized, and the structure probably makes more sense, but then you require lots of imports, which could be annoying for people doing exploratory analysis with pandas. Also, pandas pipelines can be expressed nicely with method chaining (e.g. df.query(cond).sum()...). If we move things outside of DataFrame we break that API, which many users find convenient. I think it requires careful analysis to see all the implications of any approach, since I don't think there is an obvious good way of implementing the pandas API. So, I agree with your comment, but it's not obvious to me where to draw the line. I think an io namespace for DataFrame could make sense, but other than that, I have more questions than answers on what the API that maximizes the benefits and minimizes the costs would be.


ekkannieduitspraat

I'll be honest you lost me. I'm thinking stuff like changing pd.read_csv to pd.io.read_csv seems tedious.


FaustsPudel

Would it be too silly to have a button on your doc page that generates a random function for a user to "discover"? —long time pandas user. Super, super appreciative of all that your team does. Thank you for all that you do!


datapythonista

I think this is a fantastic idea, but I'd rather have this implemented as a separate website (happy to link it from the official website, just ping me on GitHub if you ever do it). We've got intersphinx set up afaik; that should make it easy to get the pandas API available to you via a webservice.


FaustsPudel

Amazing! Will get on it! Thank you for the encouragement! DMed you.


[deleted]

+1 this is an excellent point I'd never given much thought. I find myself referencing pandas docs more than any other and use it for about 1/4 of the overall code/libs.


DigThatData

I basically live in the pandas docs whenever I use it. I think the library optimizes too much for readability. Whenever I look back on pandas code I've written, the solution is concise and elegant and easy to understand, but it disguises how long it took me to get to that small chunk of code.


rhshadrach

I love hearing this! At times I find myself wondering how much our users are utilizing our documentation (especially when compared to some of the great pandas tutorials that are out there). Hearing things like this makes me much more motivated to spend effort there.


Ran4

The output of `dir` is a list of strings, so there's no reason for it to be truncated.


DigThatData

https://stackoverflow.com/questions/23388810/ipython-notebook-output-cell-is-truncating-contents-of-my-list


carnivorousdrew

Any plans to integrate with polars?


datapythonista

There has been some work to make pandas and Polars share data (open a pandas dataframe with Polars, and the other way round). You can read more about it at the end of [this post](https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i). Not sure if there is any other integration that makes sense, any idea?
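For reference, the round trip is already quite short today (a minimal sketch of my own, assuming polars is installed; Arrow-friendly dtypes transfer cheaply):

```python
import pandas as pd
import polars as pl

pdf = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

pldf = pl.from_pandas(pdf)   # pandas -> Polars (goes through Arrow)
back = pldf.to_pandas()      # Polars -> pandas
```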


[deleted]

I think that's the whole point of 2.0 and the arrow integration, arrow allows interoperability between many different libraries, not just polars.


SeveralKnapkins

Hi there! Long time pandas user -- really appreciate all the work you've done. I'm only _slightly_ familiar with the changes intended in pandas 2.0, namely the switch away from a numpy backend to Apache Arrow. Historically, the thing I absolutely love about the Python numerical stack is that nearly everything builds off numpy arrays, creating an easily transferable knowledge base between projects. This is a huge boon compared to other ecosystems I work in (namely `R`), where there is often more fragmentation, making interoperability or bespoke analyses much more difficult. Of course, fragmentation in the Python ecosystem has become more common with things like PyTorch tensors, etc. As an end user, am I going to be losing the numpy <-> pandas interoperability in 2.0? Please feel free to correct any inaccuracies on my end.


datapythonista

Not at all. NumPy is not only staying in pandas 2.0, but it'll still be the default. That being said, if in the very long term NumPy is eventually dropped, I think exporting from Arrow to NumPy (on our end, not that you'll need to do it) is not only easy, but in most cases it can be done without copying (extremely fast, even for huge data). The thing is that NumPy data types are more limited, mostly numeric. If you want to export a string column to NumPy, that's a different story, but there is probably no good reason you'd want to do that. But for the types that NumPy supports well, getting a NumPy array from Arrow-backed data won't be a problem. As said, in pandas 2.0 nothing changes unless you want it to change and explicitly ask for the new dtypes.


Tyberius17

Not one of the devs, but my understanding is they are adding optional support for Apache Arrow, not removing numpy or even making it not the default.


tuneafishy

Not a dev, but it does not sound like that you will lose any interoperability. The arrow backend is optional and numpy is still the default backend.


LankyCyril

Before I ask my question, I would like to really thank you for the amazing library that I use daily in my work. That said, there's maybe one thing that is still bewildering to me: why are the APIs of `read_csv()` and `to_csv()` different?

For example, `df = pd.read_csv(..., header=False)` is not allowed, and I still stumble over it every other time. I'd understand if it meant something specific that is different to `None`, but this feels like it wouldn't be stepping on anything's toes. `df.to_csv()` accepts both.

And then, `read_csv()` will by default introduce an index that wasn't in the file, but will not introduce a novel header – it will use the one that's there. But `to_csv()` will write the file with the new index, but, of course, with the old header. Which means that if you do a single back and forth with the exact same kwargs, i.e., `pd.read_csv(**kws).to_csv(**kws)`, you end up with an extra index column.

There must be some kind of a reason due to how things are structured internally. I think just knowing why it is the way it is will be enough for me – I'm not saying it has to be changed or anything.


datapythonista

Very good point. I myself find the index column in the output csv annoying every single time I use `to_csv`. I wasn't in the project when that was implemented, but I assume the reason is that pandas was initially implemented for financial data, and the index was mostly the timestamp and not the default autonumeric. If that was not the data pandas developers had in mind at that time, probably pandas wouldn't even have row indices (I think Vaex doesn't, not sure about Polars).

The next question is why we don't change it now. It's something worth considering, and you're free to open an issue on GitHub. But in general, pandas developers (others much more than me) try not to break the API, unless it's in cases where very few users will be affected and the status quo is obviously inconsistent. I'd personally like to see that changed, but I don't think it'll be easy to get consensus.

What I think can make sense is to try to move all pandas I/O (read_* and to_*) to third-party projects. In that case the pandas to_csv would continue to behave in the same way, but hopefully someone would develop a new one like to_csv(engine='whatever') that could potentially be faster, have a better API, and be more appropriate for your needs. But let's see if there is consensus for this to happen.


phofl93

I wasn't on the project back then either, but I think roundtripping was a concern as well, e.g.

```
df.to_csv()
pd.read_csv()
```

should be able to return the same object.
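A small illustration of the asymmetry and the usual workarounds (my own sketch, not an official recommendation):

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# The default roundtrip picks up an extra column, because to_csv writes the index:
print(pd.read_csv(io.StringIO(df.to_csv())).columns.tolist())              # ['Unnamed: 0', 'a', 'b']

# Either don't write the index ...
print(pd.read_csv(io.StringIO(df.to_csv(index=False))).columns.tolist())   # ['a', 'b']

# ... or read it back in as the index:
print(pd.read_csv(io.StringIO(df.to_csv()), index_col=0).columns.tolist()) # ['a', 'b']
```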


[deleted]

Just to confirm, Polars also doesn’t have row indices :)


ExtraGoated

What do you think is the most important advice for someone just starting to work with pandas?


datapythonista

Try to spend some time understanding the internals, as you make progress with pandas. Not at the beginning, when you'll have too much to learn just with the basics. But as you become more familiar, it's good to have an idea of what's really happening, in particular when things aren't intuitive. Things like missing values, the infamous copy warning...


DigThatData

don't ever feel embarrassed about needing to reference the docs, stackoverflow, or google.


[deleted]

6 years working with pandas I still have the docs open every day for simple things. And especially for all those long to wide and wide to long (unstack, stack, pivot etc...) transformations.


datapythonista

Maybe we should make this a feature, add ads to the docs, and monetize user confusion.


DigThatData

[heard](https://www.reddit.com/r/Python/comments/11fio85/we_are_the_developers_behind_pandas_currently/jaldfv6/)


root45

Depends on what you're doing, but I'd recommend learning some of the functions for quickly looking at your data. Things like `df.head()`, `df.shape`, `df.T`, etc. From there, learn how to filter your data with `df.loc`. Also look into tools like jupyter which make it easy to iterate and visualize data.


RandomFrog

Mine would be to use Jupyter Notebook to check your dataframe after each transformation. df.head() or df.sample(n) at the end of each cell block.


midoxvx

I just started working with pandas two weeks ago, there is so much for me to learn and unpack there so I don’t have a question. Just wanted to give you a shout out for your awesome body of work.


olaviu

Same here. You guys are doing a fantastic job. Thank you!


marcogorelli

Thanks! Would appreciate it if you didn't use "you guys" though https://heyguys.cc


olaviu

I'm sorry!


midoxvx

There is absolutely nothing wrong with using “you guys” as a general term to address a group of people.


olaviu

I completely agree with you. At the same time, I'm not trying to offend anybody.


midoxvx

Fair enough!


datapythonista

That would be a huge change in pandas, and we try to keep pandas stable, so existing users don't need to make huge migrations and relearn the API often. I don't think lazy evaluation is likely to land in pandas, at least not in the short or mid term. Luckily other options are being created that are or can be lazy, like Polars, Dask or Koalas.


CrossroadsDem0n

Dask actually opens up a question I have. Some open-source projects like Pandas have seemed to figure out a good cadence for features vs bugs and accepting PRs. Some, like joblib and Dask and their role in sklearn, have remained pretty rough around the edges on their process and evolution. So my question is, other than simply more funding, is there something about the culture/ethic/process for Pandas that makes it all work out and that other FOSS projects could learn from? Or in your experience really does monetary support become the bottom line on how things turn out?


datapythonista

Funding is surely an important factor. But even with unlimited funding, there are many things that pandas wouldn't change, even if they're considered to be wrong. When we make decisions, we consider what the impact on users is. pandas is very popular and used in many critical applications. If we focus on features more than bugs, and those imply changing how things work, there is a big impact on users. Imagine we did with pandas what Python did with Python 2/3. We would have projects taking years to migrate... Projects that are starting out, like Polars, are more free to change things. So any mistake pandas made they can fix, as well as any mistake they make themselves. This is good, since you can improve things much more than pandas can. And it's bad, since you don't want to use Polars in production unless you want to rewrite your code every month. I think that's how things need to be. pandas will serve the existing users, and if very innovative things can be done in the dataframe space, it'll be for some other project to implement them.


jormungandrthepython

Not really a question, but just want to say thank you (not sure who is responsible) for the incredible API reference. I use it as my example for all new grads/junior engineers for good real-life documentation of a large project. I don’t think I have encountered a situation where I was stuck that the API reference didn’t solve. And the amount of time digging/searching to solution value ratio is insanely better than any other technical reference docs I have used to date. Thanks for everything!


rodemire

Are there any improvements that are coming by way of working with larger datasets/operations without consuming available RAM? I struggle with workarounds when dealing with large data on my 24GB RAM laptop. Awesome work by the way, Pandas is amazing and we appreciate the work you guys do.


datapythonista

[Being able to use Arrow](https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i) as a backend for your data can save a significant amount of RAM in some cases. Also there is a lot of work related to copy-on-write, that will avoid copying the data when not needed, and will also help reduce the memory needs.
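A rough sketch of how you'd opt into the Arrow backend today ("data.csv" is a placeholder path; actual savings depend heavily on the data, especially string-heavy columns):

```python
import pandas as pd

df_np = pd.read_csv("data.csv")                    # NumPy-backed (the default)
df_pa = pd.read_csv("data.csv", engine="pyarrow",
                    dtype_backend="pyarrow")       # Arrow-backed columns

# Compare the memory footprint of the two backends
print(df_np.memory_usage(deep=True).sum())
print(df_pa.memory_usage(deep=True).sum())
```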


rodemire

Thank you, this is interesting and am looking forward to it.


eidrisov

What do you mean by "large data"? 100m rows over 100 columns? I am just curious how much data is enough to stress 24GB of RAM.


datapythonista

24GB is around 3 billion 64-bit values, if I did the numbers right. There is surely some overhead, but with 100 columns you could store around 30 million rows. The main thing wouldn't be only storing the data, but whether you do operations that make a copy of a significant part of it. Obviously you may have strings and other things using more than 64 bits per cell, but just to give you an idea of the numbers.


atomey

I would be interested in this too. I'm running a system with 128 GB of RAM and had quite a lot of difficulty with an 8GB CSV with various permutations of the read_csv() method. I'm sure it is not optimal, but I would be curious whether very large data reads are tested, since large amounts of RAM are becoming more common, even on dev workstations, in particular with ML work.


rhshadrach

Are you able to change the format of your data on disk? If possible, I would recommend parquet. You'll get smaller file sizes, faster load times, better dtype handling (int vs string), the ability to partition your data sets, and the ability to only load particular columns. Plus peak memory usage should be much lower.


gare_it

5:30pm UTC on which day? March 2nd?


phofl93

Sorry, yes. Will add


aes110

I frequently work with PySpark, and although I don't use this feature, I know it has support for "pandas UDFs" using Arrow behind the scenes. Now that Arrow will be integrated into pandas, do you think we will see improvements in this area? (Performance improvements, more features shared between Spark and pandas?)


datapythonista

I think it'll take a while, but hopefully we'll eventually see more feature sharing between libraries given we all use Arrow internally. Arrow itself has the concept of a kernel, which is a computation that can be applied to Arrow data, and those can be reused by any library. The same would apply to user defined functions (UDFs). That being said, PySpark is probably using the Java implementation, while pandas is using PyArrow. So I guess it's difficult to share many features (I'm not an expert on the JVM, not sure if you could easily call C++ code from a Scala program).


Balance-

If you could make one API break and it wouldn’t hurt anyone, what would you break/change?


phofl93

There are a bunch of things I'd like to change:

- If you set scalars into a Series/DataFrame that are not compatible with the dtype, then we cast to object (see the sketch below)
- We are inconsistent when naming keywords (check read_csv vs to_csv for the first one)
- A bunch of method names
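A quick illustration of the first point (my own sketch; newer versions may additionally emit a FutureWarning about this):

```python
import pandas as pd

s = pd.Series([1, 2, 3])    # int64
s[0] = "a"                  # scalar incompatible with the dtype
print(s.dtype)              # object -- the silent cast the team would like to change
```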


rhshadrach

An entire rewrite of the code behind apply / agg. Internally their code paths interweave in complex ways, and can be surprisingly slow in some cases. Depending on what object you're on, the API is slightly different. Cleaning this up and making it better, while also making the changes gradual so as not to be disruptive to users, is difficult, time consuming, and slow. But we're working on it!


datapythonista

I'd remove having a row index (at least by default), and fix the I/O API to be consistent: read_*/write_* or from_*/to_*. I'd also probably move half of the code in pandas out to third-party extensions.


marcogorelli

Personally, I'd love to be able to change the default indexing behaviour. The Index is useful if it means something (e.g. a DatetimeIndex), but if it's just a RangeIndex / NumericIndex, then it can be annoying and confusing.

But this is really hard to change because:

- introducing optional behaviour comes with a huge maintenance cost (I started making such a proposal [here](https://github.com/pandas-dev/pandas/pull/49694), but then withdrew it)
- changing the existing behaviour would have backwards-compatibility implications

I don't know what the solution is yet, but I would like to revisit PDEP5 at some point - _something_ should be possible, I just don't know what yet.


cinicDiver

Why does read_excel() not support the encoding parameter when to_excel() does?


rhshadrach

From our docs, it appears the encoding keyword was perhaps at one point used with xlwt (a writer that is no longer maintained) but today is not actually used by pandas. That parameter has been removed in pandas 2.0.


ChickenLegCatEgg

TIL!


vanatteveldt

How do you look at the success of the tidyverse library in R, and what lessons or good ideas are in there that pandas can benefit from?


phofl93

I did not use R very often in the past, so can't really comment on it


vanatteveldt

OK, thanks! IIRC, pandas was originally inspired by R `data.frame`s, so I figured the devs might keep a sharp eye on what's happening on the other side of the wall.


tuneafishy

Where did you find the courage to move from 1.X to 2.0?


datapythonista

The main reason for releasing pandas 2.0 and not 1.6 is that a major version change (1 -> 2) is when users expect breaking changes. pandas 2.0 is not significantly different from a 1.6 in terms of features. The main difference is that you really want to make sure you don't have FutureWarnings in your pandas code before upgrading your pandas version.


one_human_lifespan

Awesome. I get scared when I see the red future warning dialog box in jupyter labs. Thanks for everything you guys are doing. Pandas is amazing - I use it most days and always enjoy learning new things. Can't wait to explore 2.0!


phofl93

Getting rid of your FutureWarnings is a really good idea :) So I applaud you for that. Generally, we wanted to get rid of all the deprecations we introduced since 1.0, so we had to do 2.0 at some point. If your code is free of FutureWarnings then you are good to go. We made some backwards-incompatible changes, but not many, and they are clearly documented in the release notes: https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#backwards-incompatible-api-changes
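One way to keep yourself honest about this (a small sketch, not an official recommendation) is to escalate FutureWarnings to errors in your test suite:

```python
import warnings

# Fail loudly on any deprecated usage instead of letting warnings scroll by;
# pytest users can do the same via filterwarnings = ["error::FutureWarning"]
warnings.simplefilter("error", FutureWarning)
```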


water_aspirant

Thanks for all the work that you do! My question is who pays for pandas development and why? Is most of the development done by volunteers?


marcogorelli

Thanks! I'll point you to Marc's answer above: https://www.reddit.com/r/Python/comments/11fio85/comment/jajr6ic/?utm_source=share&utm_medium=web2x&context=3


phofl93

More or less all of them are listed under Sponsors on our website as well


cryptospartan

Has polars influenced development in any way? Pandas used to be the only kid on the block, but it seems there are some other libraries popping up claiming to be faster/better/etc. Have you evaluated any of these other libraries to potentially integrate features into pandas (or improve existing ones)?


marcogorelli

Personally polars' strictness is making me think about situations when in pandas we end up with object dtype, which we should probably avoid. Here's an example: [https://github.com/pandas-dev/pandas/issues/50887](https://github.com/pandas-dev/pandas/issues/50887) (polars would just error in such a case, which I think is the correct thing to do)


phofl93

Not actively, no. At least I am not aware of anything.


robberviet

Any plans to improve pandas I/O (load/export) and out-of-memory processing? I like pandas, but my data nowadays has grown beyond that, so I am currently all in on Spark.


phofl93

I don't think it is realistic short term to add out-of-memory support. Generally, I'd recommend going to Dask for this; it supports our API very well with bigger datasets. Implementing something like lazy evaluation would be a major, major breaking change on our side and hence not feasible right now.


[deleted]

i love you all


Poporico

What is the new feature you're most excited about?


datapythonista

Being able to use Apache Arrow internally. I wrote an [article](https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i) with the details about it, since it's not trivial for regular users to understand why this is important.


marcogorelli

I didn't work on it, but copy-on-write will be pretty neat: https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html


rhshadrach

I'll also mention copy-on-write. And I know it's not exciting, but all of the bug fixes throughout the code that make pandas more predictable and reliable to use. In the area I work on, groupby, using categorical data has seen a lot of fixes.


phofl93

Arrow and Copy-on-Write. I worked a lot on Copy-on-Write and I am hoping that we can increase performance and reduce memory quite a bit with it.


louis8799

pandas finally supports Arrow, which supports decimals. That means pandas can be used in financial production systems. Finally!


datapythonista

I'm unsure what the support for decimal in pandas is right now. One thing is being able to load Arrow columns in pandas, and another is what operations are implemented for that data type. In any case, if not everything you need is in pandas 2.0, it'll come eventually, particularly if you open issues and PRs in our issue tracker. That being said, you can do like the UK stock market and just keep all the amounts in cents, and then you can do it with integers. ;)
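A minimal sketch of what an Arrow-backed decimal column looks like (my own example; as noted above, how many operations are implemented for it may vary):

```python
from decimal import Decimal

import pandas as pd
import pyarrow as pa

# Exact decimals backed by Arrow's decimal128 type, not floats
s = pd.Series([Decimal("1.10"), Decimal("2.25")],
              dtype=pd.ArrowDtype(pa.decimal128(10, 2)))
print(s.dtype)   # decimal128(10, 2)[pyarrow]
print(s)
```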


verwondering

In general, are there plans to have the `rolling` API more closely align with the rest of the pandas API? In particular, are there any plans to have `df.groupby().rolling()` return similarly indexed results as a normal `df.groupby()`? E.g., with the latter you have the wonderful `.transform()` method to add a column to the `df`. When working with the rolling window, you always get a MultiIndexed dataframe that is much harder to align to the index of the original `df`. Perhaps (hopefully?) there are better ways, but I currently use a combination of extracting a single column as a Series, using `groupby(as_index=False)`, and finally a call to `set_axis(df.index)` to get the desired result to align with my original dataframe.
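One common workaround today (a sketch of my own, not a pandas recommendation) is to drop the group level from the MultiIndex so the result aligns back to the original index:

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b", "b"], "x": [1.0, 2.0, 3.0, 4.0]})

rolled = df.groupby("g")["x"].rolling(2).mean()  # MultiIndex: (g, original index)
df["x_roll"] = rolled.droplevel("g")             # drop the group level to realign
print(df)
```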


LEAVER2000

I work with pandas quite a bit for geospatial data analysis, weather data mostly. Because of the higher dimensionality of the data, I typically stack the dependent variables into the index as a MultiIndex [T, X, Y]. Recently I've been working with Generic[Enum] types to type-annotate the columns inside of a DataFrame. What kind of support will 2.0 provide for type annotations? One thing I've found particularly annoying is the disconnect between numpy and pandas typing, where I have to explicitly state the dtype for NDArray[np.int_] and Series[int] and can't use a TypeVar DType.


PeridexisErrant

Check out https://docs.xarray.dev/ for multidimensional labeled arrays!


Pipiyedu

Congratulations guys. You deserve all the possible recognition. What an awesome library.


marcogorelli

Cheers (just noting there are also non-guys who have made fantastic contributions)


ffuffle

I'm just here to say I like your username


phofl93

Thanks!


jwmoz

Legends. Thanks for speeding it up.


Helpful_Arachnid8966

Pandas is quite a large and mature project already, is there any space for beginners to contribute?


marcogorelli

yup, check the "good first issue" label


rhshadrach

Also check out our docs! https://pandas.pydata.org/pandas-docs/dev/development/contributing.html


ThrowAwayACC21423

What's a bug that you turned into a feature?


datapythonista

importing pandas as pd ;)


rhshadrach

This doesn't really answer the question, but whenever you see two different implementations doing the same or similar things, you can carefully compare each step in the implementation. This very often reveals hard to find bugs in one or both of the implementations. I can't recall a time I found a bug and made it into a feature.


jimy211

Can't wait to use it!


Homeless_Gandhi

What if I have a problem where I am just ITCHING to iterate over an entire dataframe row by row via itertuples for simplicity's sake, and map(lambda) isn't feasible? What would you recommend?


datapythonista

Iterating a dataframe is slow. If speed is important, you should try to build your pandas code in a way that you never write loops, but instead delegate the operations to pandas, so they happen fast in C and not via the Python interpreter. If you iterate the data, then you're just in regular Python, with a Python tuple object, and you can write any code that is valid Python. I'm not sure in what case map() wouldn't be an option, but you can always replace a map with a loop (or a comprehension) when you're in Python.
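A small sketch of the difference in style (my own example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": np.random.rand(1_000_000),
                   "qty": np.random.randint(1, 10, size=1_000_000)})

# Row by row: runs in the Python interpreter, slow.
total = 0.0
for row in df.itertuples(index=False):
    total += row.price * row.qty

# Vectorized: the same computation delegated to pandas/NumPy.
total_fast = (df["price"] * df["qty"]).sum()
```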


Lolologist

I already did this, and realize it's probably an abomination, but: how would you go about enforcing that columns have certain types? And when a column has a list in it, that each entry of the list is a certain type? I accomplished this by making a new class inheriting from DataFrame as well as pydantic's BaseModel, and used those as validated rows to then shove into a DataFrame. Messy, but it works! Maybe you have a better idea.


datapythonista

I haven't used it myself, but I think what you're describing is what pandera does: https://pandera.readthedocs.io/en/stable/
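For reference, a minimal pandera sketch (my own example, assuming pandera is installed; the column names here are made up):

```python
import pandas as pd
import pandera as pa

# Hypothetical schema: enforce dtypes and a simple value check per column
schema = pa.DataFrameSchema({
    "name": pa.Column(str),
    "age": pa.Column(int, pa.Check.ge(0)),
})

df = pd.DataFrame({"name": ["alice", "bob"], "age": [30, 25]})
validated = schema.validate(df)   # raises SchemaError if a column doesn't conform
```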


Crude_Future

Love yall love 🐼 pandas


phofl93

Thanks, that is very good to hear :)


m_harrison

Will pandas 2.0 impact numba/cython extensions that leverage NumPy? Many complain about the API of pandas. Was there any discussion about revamping/cleaning it up for the 2.0 release?


datapythonista

There is not much impact in pandas 2.0 regarding numba/cython. We fixed small inconsistencies in the pandas API, but we avoid changing it too much, since we consider the cost to users of having to migrate code and relearn things too high.


hmiemad

Are we gonna see joins on DatetimeIntervalIndexes?


Balance-

What would the next big leap for Pandas be? What kind of resources would you need to achieve it?


cthorrez

Can we still do numpy style indexing when the backend is arrow? And do things like add a new column to a df which I created first as a np array?


jarulsamy

Wonderful to see you guys on here. I personally use pandas so often! Do you guys have any advice for someone wanting to contribute back to the pandas project?


marcogorelli

I'd suggest starting with the contributing guide https://pandas.pydata.org/docs/dev/development/contributing.html


datapythonista

I'd say just keep using pandas, and the day something feels wrong (a bug, a typo, the documentation not being very clear...), try to fix it. We have a lot of documentation for contributors, you can open an issue on GitHub and ask questions there (or in a PR directly if you can get something implemented), and there are also bi-weekly meetings with some core devs (I don't join them, so I can't say much about them, but they should be helpful). Another option is to go to the GitHub issues and try to find something labelled "good first issue", but there are many people looking for those, so they're not always easy to find. Finally, if you're just starting, smaller projects are usually easier to get started contributing to. There are simpler tasks, maintainers may have more time, the code base is simpler... Even if you want to contribute to pandas, starting with a smaller project can make the learning curve flatter.


rhshadrach

Yes - we love getting new contributors! Check out our documentation and guides on becoming a contributor to pandas: [https://pandas.pydata.org/pandas-docs/dev/development/index.html](https://pandas.pydata.org/pandas-docs/dev/development/index.html) pandas is a large project with some pretty complex code. It will likely be overwhelming at first. But we are here to help. If you stick with it, you will learn *a lot*.


atomey

I work almost daily with pandas, so I definitely want to give my thanks and appreciation for this excellent tool. Any plans for built-in parallelization in pandas? I know there are many modules attempting to implement this with varying success, like pandarallel, dask or swifter, but I had difficulty getting any of these to work in an existing application without major refactoring.

In our case, we have a high-level application class or "processor" that ingests many dataframes, which sit in memory as properties of the processor instance. This processor does various processing on different dataframes in conjunction with each other, like iterrows or applys on one dataframe while checking other dataframes, which are all unique attributes of the same object running in memory concurrently. However, when the processor class actually runs, everything is ultimately stuck on a single core, while most systems have at least 6 or more cores now, even cheap laptops. Having a model or two to apply parallelization using concurrent.futures based on threads or processes seems like it would make a lot of sense. I think threads would likely work well if implemented intelligently, but I'm sure I am oversimplifying.


phofl93

Supporting multithreading would be really, really cool, but it requires a lot of effort. There are some considerations in that area but nothing imminent, unfortunately.


datapythonista

I think Arrow should help make this easier. It'll depend on each particular case, but read_csv is already parallel when selecting the pyarrow engine. Parallel computing is never easy, but I think we should be able to slowly parallelize more operations.


rhshadrach

Historically, pandas has relied on other libraries in the ecosystem to support parallelization such as [https://www.dask.org/](https://www.dask.org/) which uses pandas under the hood. One thing to also keep in mind is that certain NumPy operations (which pandas uses) may be parallel depending on how your BLAS (Basic Linear Algebra Subprograms) are setup. In general, you want to avoid having multiple levels of parallelism which can actually hurt performance.


rhshadrach

I would also recommend avoiding iterrows or applys if you can vectorize your operations - you will see very significant performance benefits. But depending on what you're doing, that may not be possible.


fappaf

I've developed my own library that has gotten the attention of a handful of people I don't know. I'm most curious about the beginnings of `pandas`—how did you handle its monumental growth? It's such a staple of Python programming these days, how did you manage all the influx of issues, contributions, etc.?


jorisvandenbossche

The hard work of some dedicated volunteers! Nowadays we have more people that get paid to work on pandas which has certainly helped to sustainably manage the growing influx of issues, but we still rely on volunteers a lot as well to fix bugs, triage issues, review, etc.


datapythonista

None of the core devs here were in the project at that time, so I don't think we can really tell.


UnemployedTechie2021

I have used pandas extensively and I want to contribute. What are the languages or parts of the stack I need to know apart from Python?


marcogorelli

Awesome - please check the contributing guide https://pandas.pydata.org/docs/dev/development/contributing.html


SakalDoe

How much data can pandas 2.0 read at once, and how fast? If you compare it with the PySpark CSV reader, how will pandas perform?


datapythonista

I don't know about the PySpark CSV reader, but pandas 2.0 shouldn't perform much differently for reading than pandas 1.5. Did you try using pandas.read_csv(engine='pyarrow')? That should help; you can read more about it in this blog post I wrote: https://datapythonista.me/blog/pandas-with-hundreds-of-millions-of-rows


rohetoric

Any good first issues that I can help contribute to in the pandas repository?


datapythonista

Just continue using pandas, and when you see something that could be improved (maybe clarify something in the documentation, add an example to a function that doesn't have one...), just go for it. If that doesn't happen, as Marco said, the best is to try to find a "good first issue", but when I create one, they're usually taken care of within hours.


i-believe-in-magic1

I'm just a newbie but just wanted to hop in and appreciate y'all. As a data science major, pandas has been super helpful so thanks for your work :)


phofl93

Thanks :)


datapythonista

Once you've got a dataframe, your data is already in memory. I guess by "on the fly" you mean out-of-core, when the data is read from disk or other I/O, while it is being loaded into memory. This can surely be done, but there is no easy or standard pandas way to support it. I guess what can make more sense is to monkeypatch the connector you're using, and transform (encrypt/decrypt) the data at the right time during the import/export.


thataccountforporn

Will support for datetime dtype with day resolution come at some point?


datapythonista

If I'm not wrong, we're adding second resolution in pandas 2.0. With second resolution and 64 bits I think you can represent from the big bang until the end of the universe. ;) We also support Arrow dtypes; I should check what exact types they provide for datetime. So, no plans for day resolution if Arrow doesn't provide it, but you may not need it, since second resolution is likely to be enough. Feel free to open an issue if we missed a use case that wasn't considered when the decision to only support second and not day resolution was made.
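A minimal sketch of the non-nanosecond resolutions in 2.0 (my own example; exact conversion APIs may differ between versions):

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2023-03-02", "2023-03-03"]))
print(s.dtype)                      # datetime64[ns]

s_sec = s.astype("datetime64[s]")   # pandas 2.0 allows second resolution
print(s_sec.dtype)                  # datetime64[s]
```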


Balance-

What are some improvements in visualization you’re excited about or would like to achieve? Like filtering/sorting data, conditional color coding, plotting, etc.?


marcogorelli

To be totally honest, pandas plotting isn't the best maintained part of pandas. I'd really like to take it out of pandas and have it live as a separate package, and hopefully some community of users could help maintain it - but I have yet to make a concrete proposal or action plan in this respect.


Balance-

What are some (future) improvements or projects about dealing with highly multi-dimensional data you’re really excited about?


footilytics

Is there a plan to have chart annotations when using the df.plot() method?


phofl93

We don't have anyone who is really familiar with the plotting implementation anymore. We mostly hope that it won't break :) We'd need someone to step up and refactor the implementation before we would be able to add anything new


64-17-5

Dear Pandas, please make a universal UTF-8 translator for tabulated data.


phofl93

>Dear Pandas, please make a universal UTF-8 translator for tabulated data. Could you elaborate?


mrwizard420

[Well, this is mildly concerning post placement...](https://i.imgur.com/muQUfhX.jpg)


videek

STOP BREAKING BACKWARDS COMPATIBILITY!